WHAT IS AN ARTIFICIAL NEURAL NETWORK?
- Genaro Pimienta
- Jan 1
Updated: Jan 17
ARTIFICIAL NEURAL NETWORKS
Artificial neural networks, also known as deep learning networks, are a type of machine learning algorithm that can extract exceedingly complex features from input data.
Examples of input data features are the words in a sentence or the patterns in an image.
By analogy with the human brain's cognitive processes, deep learning networks learn to identify feature patterns by modulating the activation state of their artificial neurons.
Artificial neurons are the nodes that hold a neural network together by assembling into three types of layers:
Input layer — receives numeric representations of input data
Hidden layer — extracts feature patterns from input data
Output layer — computes a prediction or classification task from the patterns extracted by the hidden layers
Artificial neurons embedded in the same layer are independent of (disconnected from) each other. Dense connections are instead formed between artificial neurons in adjacent layers.
Neural networks can be shallow (one or two hidden layers) (Figure 1) or deep (three or more hidden layers) (Figure 2).
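To make the layer structure concrete, here is a minimal sketch in Python with NumPy (my own illustration, not part of the figures below): each pair of adjacent layers is joined by a weight matrix that connects every neuron in one layer to every neuron in the next.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes for a small feedforward network:
# 4 input nodes, two hidden layers of 5 neurons each, and 3 output neurons.
layer_sizes = [4, 5, 5, 3]

# Dense connectivity: every neuron in layer i connects to every neuron in
# layer i + 1, so each connection is described by an (n_in, n_out) matrix.
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

for i, W in enumerate(weights):
    print(f"layer {i} -> layer {i + 1}: weight matrix of shape {W.shape}")
```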

Figure 1. Shallow neural network. Unlike the neurons in the hidden and output layers, input nodes (light blue) lack an activation function. In the hidden layers, ∑ refers to the summation of the input signals, which have been modulated by neuron-specific weights and biases. ReLU is the rectified linear unit activation function, often used in feedforward networks. In the output layer, the symbol in the pink circle refers to a sigmoid activation function, which restricts outputs to values between 0 and 1. The weights and biases are indicated on the periphery for simplicity.
Feedforward neural networks are the prototypical deep neural networks, commonly used in regression and classification tasks (Figure 2).

Figure 2. Deep neural network. With the exception of the circles labeled softmax in the output layer, the symbols in this figure are the same as in Figure 1. Softmax is an activation function used for probability classification tasks.
THE PERCEPTRON
The perceptron, an algorithm invented in the late 1950s, is a binary classifier based on a single artificial neuron. It is considered the smallest version of a neural network, also referred to as a single-layer neural network (Figure 3).

Figure 3. The perceptron is a standalone artificial neuron. The equation in the middle circle (the perceptron) specifies the weighted summation ∑ of n inputs (xiWi), modulated by a bias constant (B). Since perceptrons are binary classifiers, the activation function used is a Heaviside step, which provides a 0 or 1 (no or yes) output.
Although rarely used nowadays, the perceptron remains a useful concept for explaining how artificial neurons work (Figure 3).
An artificial neuron (perceptron) is a processing unit that performs three consecutive mathematical operations on a collection of incoming numerical inputs:
Weighting of each incoming input signal
Summation of the weighted inputs, plus a bias
Transformation of the resulting weighted summation with an activation function
The weight and bias constants, which control the neuron's activation strength, are fine-tuned during model training.
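As a minimal sketch of these operations (my own illustration, using hand-picked weights rather than trained ones), a perceptron reduces to a weighted sum plus a bias, passed through a Heaviside step:

```python
import numpy as np

def heaviside_step(z):
    """Binary activation: returns 1 if the pre-activated signal is >= 0, else 0."""
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    """Weighted summation of the inputs plus a bias, then the step activation."""
    z = np.dot(x, w) + b      # pre-activated signal Z
    return heaviside_step(z)

# Example: with these hand-picked weights the perceptron behaves like a logical AND.
w = np.array([1.0, 1.0])
b = -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", perceptron(np.array(x, dtype=float), w, b))
```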
Activation functions restrict the artificial neuron's outgoing signal (the weighted summation) to a defined range of values (Figure 4). These ranges are specific to the layer in which the neuron is embedded. For example, the ReLU function, which maps values to the range from 0 to infinity, is normally used in hidden layers. The softmax function, which computes classification probabilities, is commonly used in the output layer (Figure 2). Perceptrons, which perform a simple yes/no prediction, use a Heaviside step activation function (Figure 3).
The most common functions are:
Rectified linear unit (ReLU) — 0 to infinity
Sigmoid function — 0 to 1
Hyperbolic tangent (tanh) — -1 to 1
Softmax — 0 to 1

Figure 4. Commonly used activation functions.
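For reference, the four functions listed above can be written in a few lines of NumPy (a sketch of the standard formulas, not code taken from Figure 4):

```python
import numpy as np

def relu(z):
    """Rectified linear unit: maps values to the range 0 to infinity."""
    return np.maximum(0, z)

def sigmoid(z):
    """Squashes values into the range 0 to 1."""
    return 1 / (1 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent: squashes values into the range -1 to 1."""
    return np.tanh(z)

def softmax(z):
    """Turns a vector of scores into classification probabilities (0 to 1, summing to 1)."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), sigmoid(z), tanh(z), softmax(z), sep="\n")
```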
MODEL TRAINING
Model training consists of adjusting the weight and bias values that modulate each of the neurons embedded in a neural network. These adjustments are fine-tuned iteratively until the neural network (the model) learns to recognize a set of labeled features (e.g., words in a sentence or colors in an image). Each iteration is referred to as an epoch.
Model training comprises four steps:
Input forward propagation
Loss function calculation
Backpropagation
Gradient descent
Step 1. Input forward propagation — During forward propagation, numerical information from an input dataset is propagated through the hidden layers in a neural network, prompting the output layer to compute a task (e.g., classification or prediction).
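A rough sketch of forward propagation (my own illustration, with randomly initialized weights and ReLU activations throughout): each layer computes a pre-activated signal Z = aW + B and passes it through an activation function.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward_propagation(x, weights, biases):
    """Propagate an input vector layer by layer through the network."""
    a = x
    for W, b in zip(weights, biases):
        z = a @ W + b      # pre-activated signal Z
        a = relu(z)        # activated outputs of this layer
    return a               # output-layer activations (the prediction)

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3]
weights = [rng.normal(size=(i, o)) for i, o in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(o) for o in layer_sizes[1:]]

# In practice the output layer would use softmax or a sigmoid rather than ReLU.
print(forward_propagation(rng.normal(size=4), weights, biases))
```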
Step 2. Loss function — The loss function calculates how far off the output layer's prediction is from a labeled target (e.g., the picture of a cat). The type of loss function used to calculate the prediction error depends on the model's task:
Mean square error — used in regression tasks with continuous numerical values
Cross-entropy — used in multiclass classification tasks
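Both loss functions have short NumPy forms (a sketch of the standard definitions, not code from the post):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Regression loss: the average squared difference between target and prediction."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Multiclass classification loss: y_true is one-hot, y_pred holds softmax probabilities."""
    return -np.sum(y_true * np.log(y_pred + eps))

print(mean_squared_error(np.array([1.0, 2.0]), np.array([1.5, 1.0])))   # 0.625
print(cross_entropy(np.array([0, 0, 1]), np.array([0.1, 0.2, 0.7])))    # ~0.357
```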
Step 3. Backpropagation — Using partial derivatives and the chain rule, the backpropagation method computes changes to the artificial neuron weights and biases in the neural network, aimed at reducing the prediction error (loss). This process is named backpropagation because the calculations start in the output layer and proceed backward, until the first hidden layer is reached (Figure 5).
Step 4. Gradient descent — The gradient descent method leverages the update rule to adjust the weights and biases in the neural network, as per the values calculated in the backpropagation step (Figure 6).

Figure 5. Error backpropagation. Shown in this figure is a simplified neural network consisting of two hidden layers (L), each with one artificial neuron (a). The rightmost derivative (𝜹Loss/𝜹W(L)) calculates the gradient of the loss function (Loss) with respect to the weight W(L) of the artificial neuron in the output layer a(L). Using the chain rule, the two partial derivatives from the output layer, 𝜹Loss/𝜹a(L) and 𝜹a(L)/𝜹Z(L), are chained to the subsequent gradient calculations (𝜹Loss/𝜹W) for a(L-1) and a(L-2). Symbols: Z (pre-activated signal), B (bias), L (layer), and a (artificial neuron).

Figure 6. Gradient descent. The gradient descent method seeks to reduce the prediction error (loss) in the neural network by applying the changes to the weights (∆W) and biases (∆B) of each artificial neuron calculated during backpropagation. To change these values, the gradient descent method uses the update rule equation: W_new = W - 𝛈(𝜹Loss/𝜹W), so that ∆W = -𝛈(𝜹Loss/𝜹W). In this equation, the learning rate (𝜼) sets the step size with which a weight value (W) is adjusted.
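Putting the four steps together, here is a deliberately tiny training loop (a single neuron with a sigmoid output and a squared-error loss, my own sketch rather than the network shown in Figures 5 and 6), in which the gradients are derived with the chain rule and applied with the update rule W_new = W - 𝛈(𝜹Loss/𝜹W):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data: learn to map the input 1.0 to the target 0.8 with a single neuron.
x, y_target = 1.0, 0.8
W, B = 0.1, 0.0            # weight and bias, fine-tuned during training
eta = 0.5                  # learning rate

for epoch in range(50):
    # Step 1: forward propagation
    Z = W * x + B          # pre-activated signal
    y_hat = sigmoid(Z)     # predicted output

    # Step 2: loss (squared error)
    loss = (y_hat - y_target) ** 2

    # Step 3: backpropagation via the chain rule
    dloss_dyhat = 2 * (y_hat - y_target)
    dyhat_dZ = y_hat * (1 - y_hat)        # derivative of the sigmoid
    dloss_dW = dloss_dyhat * dyhat_dZ * x
    dloss_dB = dloss_dyhat * dyhat_dZ

    # Step 4: gradient descent update rule
    W -= eta * dloss_dW
    B -= eta * dloss_dB

print(f"final prediction: {sigmoid(W * x + B):.3f} (target {y_target})")
```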
If you are interested in expanding your understanding, check out the links below.
IBM
3Blue1Brown
DeepBean
EXPLODING AND VANISHING GRADIENTS
During model training, imperfections in the input dataset or in the neural network architecture can cause the gradients, and with them the weight values, to become exponentially high (exploding) or low (vanishing) (Figure 7). Exploding and vanishing gradients have a negative impact on neural network performance.
A careful optimization of model hyperparameters can limit the occurrence of exploding and vanishing gradients. Important hyperparameters are the following:
Activation function
Number of hidden layers
Number of nodes in a hidden layer
Number of epochs
Learning rate
Number of batches
Batch normalization
Weight regularization

Figure 7. Model stability. Batch normalization counteracts high weight values by rescaling the activations across each batch. Regularization, on the other hand, penalizes high weight values. The dropout approach controls model stability by removing (dropping) a random set of artificial neurons at each training iteration. L1 and L2 are regularization techniques: L1 can shrink weights all the way to 0, whereas L2 adds a penalty proportional to their magnitude.
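As a rough illustration of two of the techniques in Figure 7 (my own sketch, with made-up penalty and dropout rates), L2 regularization adds a penalty proportional to the squared weights to the loss, while dropout removes a random subset of neuron activations at each training iteration:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(weights, lam=0.01):
    """L2 regularization term added to the loss: penalizes large weight values."""
    return lam * sum(np.sum(W ** 2) for W in weights)

def dropout(activations, rate=0.5):
    """Randomly drop a fraction of neuron activations during a training iteration."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1 - rate)   # rescale so the expected signal stays the same

weights = [rng.normal(size=(4, 5)), rng.normal(size=(5, 3))]
print("L2 penalty:", l2_penalty(weights))
print("dropped activations:", dropout(rng.normal(size=6)))
```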
UNFIT MODELS
Underfit (undertrained) and overfit (overtrained) deep learning models result from poorly planned model training.
Underfit models, which lose prediction accuracy, result from overly simplistic training datasets and an insufficient number of epochs (training iterations).
Overfit models learn to predict the training data but are unable to generalize to new information. This happens when the training dataset is cluttered (poorly preprocessed) or when too many epochs are used (Figure 9).

Figure 9. Underfit and overfit models. Underfit models are inaccurate and tend to output unreliable predictions. Overfit models, on the other hand, are "overtrained" and unable to perform well when exposed to unlabeled (new) data.
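One practical way to catch an overfit model (a generic sketch, not something described in the original figure) is to track the loss on held-out validation data alongside the training loss; when the validation loss stops improving while the training loss keeps falling, the model has started to memorize the training set:

```python
def detect_overfitting(train_losses, val_losses, patience=3):
    """Flag the epoch at which validation loss stops improving for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, val in enumerate(val_losses):
        if val < best:
            best, best_epoch = val, epoch
        elif epoch - best_epoch >= patience:
            return epoch   # the model is likely overfitting from here on
    return None

# Hypothetical loss curves: training loss keeps shrinking, validation loss turns around.
train = [0.9, 0.6, 0.4, 0.3, 0.2, 0.15, 0.1]
val   = [1.0, 0.7, 0.5, 0.45, 0.5, 0.6, 0.7]
print(detect_overfitting(train, val))
```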
FURTHER LEARNING
The following playlists, which I have arranged in order of content depth, are a companion to this blogpost. Check out their hyperlinks!
Deep Learning Fundamentals — deeplizard
AI Fundamentals — IBM Technology
Neural Networks — 3Blue1Brown
Machine learning — DeepBean
For those who are unfamiliar with artificial neural network jargon, I have put together a list of the abbreviations used throughout the text.
Activated neuron (a)
Convolutional neural network (CNN)
Feedforward neural network (FNN)
Gated recurrent unit (GRU)
Target (labeled) prediction (y)
Layer (L)
Long short-term memory (LSTM)
Predicted output (ŷ)
Pre-activated signal (Z)
Rectified linear unit (ReLU)
Recurrent neural network (RNN)
Target-decoy competition (TDC)
Target-decoy strategy (TDS)
Weight (W)
Weighted summation (∑)
GPR Updated 01/01/2025