
RECURRENT NEURAL NETWORKS


In my previous blogpost entitled WHAT IS AN ARTIFICIAL NEURAL NETWORK?, I explained that feedforward neural networks (FNN) are the prototypical deep learning algorithm. I also mentioned that FNNs are used in regression and classification tasks, and that their performance is modulated by activation functions and numeric constants (weights and biases), which are fine-tuned during model training (Figure 1).



Figure 1. Deep neural network. With the exception of the circles labeled softmax in the output layer, the symbols in this figure are the same as in the figure from my previous blogpost. Softmax is an activation function used for probability classification tasks.


In this blogpost I discuss recurrent neural networks (RNN), which are a specialized form of FNN.


RECURRENT NEURAL NETWORKS

RNNs are a type of deep learning architecture used to interpret patterns in sequential data, such as the words in a sentence or paragraph. RNNs are often used for language translation and word prediction.


When exposed to sequential data, the FNN embedded in the RNN operates as a recurrent unit: it interprets a sentence (or paragraph) one word at a time (Figure 2).



Figure 2. RNNs are feedforward networks with a hidden state and a feedback loop. Two activation functions are used in RNNs: neuron activation in the feedforward network is transformed with a ReLU function, and the hidden state, updated at each step, is transformed with a tanh function. The output is represented as ŷ to distinguish it from the target value (y).
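To make Figure 2 more concrete, here is a minimal sketch of a single recurrent step in NumPy. The dimensions, the random weights, and names such as recurrent_step are toy placeholders of my own rather than values from a trained model; the hidden state is updated with tanh and the output ŷ with softmax, as in Figures 1 and 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (my own placeholders): 8-dimensional word vectors,
# 16 hidden units, 4 output classes.
input_dim, hidden_dim, output_dim = 8, 16, 4

# Weights and biases are normally learned during training; random here for illustration.
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden (feedback loop) weights
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # hidden-to-output weights
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def recurrent_step(x_t, h_prev):
    """One step of the recurrent unit: update the hidden state with tanh,
    then produce an output ŷ with softmax."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # hidden-state update (tanh)
    logits = W_hy @ h_t + b_y
    y_hat = np.exp(logits) / np.exp(logits).sum()     # softmax over output classes
    return h_t, y_hat

# One "word" (a toy embedding) and an initial hidden state of zeros.
x_t = rng.normal(size=input_dim)
h_t, y_hat = recurrent_step(x_t, np.zeros(hidden_dim))
print(y_hat)   # predicted class probabilities after a single step
```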


UNROLLED RECURRENT NEURAL NETWORK

The recurrent unit is a conceptual abstraction. To understand how it works, the RNN is visualized in its unrolled form (another conceptual abstraction).


As it iterates (recurs) across the sequential string of words, the unrolled recurrent unit uses a feedback loop to recycle information from the words it interpreted in previous steps (Figure 3).


Every time the unrolled recurrent unit advances along the sequential input, the context and meaning of the previous words accumulate in a hidden state. Using this hidden state as a form of memory of previous events, the recurrent unit captures term dependencies among non-contiguous words in a sentence. The flow of information recycled by the feedback loop is regulated by a tanh function (Figure 3).



Figure 3.  The unrolled RNN.
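The unrolling in Figure 3 amounts to running the same recurrent step in a loop, carrying the hidden state from one word to the next. The snippet below is a self-contained toy version of that loop (random stand-in weights, a five-word "sentence" of random vectors); only the tanh hidden-state update is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16

# Random stand-ins for the learned weights (same toy shapes as in the previous sketch).
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))

# A toy "sentence": five random word vectors.
sentence = [rng.normal(size=input_dim) for _ in range(5)]

h = np.zeros(hidden_dim)                   # hidden state starts empty
for t, x_t in enumerate(sentence):
    h = np.tanh(W_xh @ x_t + W_hh @ h)     # feedback loop: h carries context from earlier words
    print(f"after word {t}, hidden-state norm = {np.linalg.norm(h):.3f}")
```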



SHORT-TERM MEMORY

As explained in my previous blogpost WHAT IS AN ARTIFICIAL NEURAL NETWORK?, fine-tuning a neural network, if not done carefully, can result in vanishing or exploding gradients. When this happens, model training is compromised.


RNNs are prone to vanishing gradients because of their architecture. The performance of an RNN drops in proportion to the length of the input sequence, and vanishing gradients affect the first terms in a sequence the most. As the recurrent unit advances along a string of words, its hidden state loses information about the first terms it interpreted. This loss of performance is called short-term memory.
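A rough way to see why this happens: during training, the error signal travels backwards through every step of the unrolled unit, and at each step it is multiplied by the recurrent weights and the tanh derivative. With the toy, random weights below (my own placeholders, not a real model), that repeated multiplication quickly drives the gradient toward zero, which is the vanishing-gradient behaviour behind short-term memory.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_dim = 16

# Random stand-ins for the hidden-to-hidden weights and for a hidden state.
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
h = rng.uniform(-0.5, 0.5, size=hidden_dim)

# Propagating the error back through one tanh step multiplies the gradient by
# W_hh transposed and by the tanh derivative (1 - h**2). Repeating this over
# many steps shrinks the gradient, so the first words stop influencing training.
grad = np.ones(hidden_dim)
for steps_back in range(1, 21):
    grad = W_hh.T @ ((1 - h**2) * grad)    # one step of backpropagation through time
    if steps_back % 5 == 0:
        print(f"{steps_back} steps back: gradient norm = {np.linalg.norm(grad):.2e}")
```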


To overcome short-term memory in RNNs, three types of RNN architecture have been developed: bidirectional RNNs, long short-term memory (LSTM), and gated recurrent units (GRU). I explain these three types of RNN below.


BIDIRECTIONAL

Bidirectional RNNs overcome the short-term memory problem by implementing hidden states that flow in opposite directions (Figure 4). When training a bidirectional RNN, gradient loss is minimized because the hidden states receive information from both ends of the string of words.


Figure 4. In a bidirectional RNN the hidden state is regulated by term dependencies in both directions.
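As a sketch of the idea in Figure 4, the code below runs two independent tanh recurrent passes over the same toy sequence, one left-to-right and one right-to-left, and concatenates the two hidden states at each position. All names, dimensions, and weights are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
input_dim, hidden_dim = 8, 16

# Separate random weights for the forward and backward passes (placeholders).
W_fx = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_fh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_bx = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_bh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))

sentence = [rng.normal(size=input_dim) for _ in range(5)]   # toy word vectors

def run_pass(words, W_x, W_h):
    """Run a simple tanh recurrent pass and keep the hidden state at each position."""
    h, states = np.zeros(hidden_dim), []
    for x_t in words:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return states

forward = run_pass(sentence, W_fx, W_fh)                # left-to-right context
backward = run_pass(sentence[::-1], W_bx, W_bh)[::-1]   # right-to-left context
combined = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(combined[0].shape)   # (32,): each position now sees context from both directions
```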


LONG SHORT-TERM MEMORY (LSTM)

To ensure that the unrolling recurrent unit captures only relevant information, LSTM RNNs use three "information gates" (Figure 5):

  • Forget gate

  • Input gate

  • Output gate


These gates are regulated by the status of a "memory cell state", which reflects the information gathered by the hidden state as it moves along the sequential input.



Figure 5. In an LSTM, the unrolling recurrent unit operates as a memory cell, which triggers the three gates depending on what information is considered relevant.
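The snippet below sketches one step of a standard LSTM cell, with the forget, input, and output gates and the memory cell state described above. The weights are random placeholders and the dimensions are toy values; the sigmoid/tanh pairing follows the usual LSTM formulation rather than anything specific to this post.

```python
import numpy as np

rng = np.random.default_rng(3)
input_dim, hidden_dim = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One random weight matrix and bias per gate, plus one for the candidate cell update.
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim))
                      for _ in range(4))
b_f, b_i, b_o, b_c = (np.zeros(hidden_dim) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W_f @ z + b_f)         # forget gate: what to erase from the cell state
    i = sigmoid(W_i @ z + b_i)         # input gate: what new information to write
    o = sigmoid(W_o @ z + b_o)         # output gate: what to expose as the hidden state
    c_tilde = np.tanh(W_c @ z + b_c)   # candidate update for the memory cell
    c_t = f * c_prev + i * c_tilde     # memory cell state
    h_t = o * np.tanh(c_t)             # new hidden state
    return h_t, c_t

h, c = lstm_step(rng.normal(size=input_dim), np.zeros(hidden_dim), np.zeros(hidden_dim))
print(h.shape, c.shape)
```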


GATED RECURRENT UNITS (GRU)

GRUs are optimized LSTMs that feature two information gates (a reset gate and an update gate). The reset gate determines whether the information contained in the hidden state from a previous step is inherited by the next one. The update gate, on the other hand, controls which information from an input step is ingested by the network (Figure 6).




Figure 6. GRUs are an optimized type of LSTM RNN, with two gates instead of three.
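And here is the corresponding sketch for one GRU step, with only the reset and update gates. Again, the weights and dimensions are illustrative placeholders following the standard GRU formulation.

```python
import numpy as np

rng = np.random.default_rng(4)
input_dim, hidden_dim = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random placeholder weights for the reset gate, the update gate, and the candidate state.
W_r, W_z, W_h = (rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim))
                 for _ in range(3))

def gru_step(x_t, h_prev):
    zx = np.concatenate([x_t, h_prev])
    r = sigmoid(W_r @ zx)                                        # reset gate: how much past to keep
    z = sigmoid(W_z @ zx)                                        # update gate: how much to refresh
    h_tilde = np.tanh(W_h @ np.concatenate([x_t, r * h_prev]))   # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde                        # blended new hidden state

h = gru_step(rng.normal(size=input_dim), np.zeros(hidden_dim))
print(h.shape)
```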



How vanishing gradients arise is a subject that deserves a more detailed discussion, but covering it here would make this blogpost unmanageable.


To learn more about vanishing gradients visit the following links:

BlackBoard AI

DeepBean

Deeplizard


Stay tuned!

GPR


Disclosure: At BioTech Writing and Consulting we believe in the use of AI in Data Science, but we do not use AI to generate text or images.




