
ENCODER-DECODER NEURAL NETWORKS

In my previous blogpost, I explained what recurrent neural networks (RNNs) are. To read more about this topic, click on the following hyperlink: RECURRENT NEURAL NETWORKS


RNNs are a specialized form of the classic feedforward network, which I explained in WHAT IS AN ARTIFICIAL NEURAL NETWORK?. Used to translate words or summarize text, RNNs ingest sequential inputs, such as the words in a sentence or paragraph (Figure 1).


Figure 1.  RNNs are feedforward networks with a hidden state and a feedback loop. Two activation functions are used in RNNs. Neuron activation in the feedforward network is transformed with a ReLU function. The hidden state, updated at each term, is transformed with a tanh function. The input (xt) is a word in a sentence or a time point in a time series. ŷt is the output. The subscript (t) indicates that the input is a sequence of terms, which are ingested one at a time.
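
To make the hidden-state update concrete, here is a minimal sketch of a single recurrent step in Python (PyTorch). The toy dimensions, weight names, and random inputs are illustrative assumptions, not values from the figure; only the tanh update of the hidden state is shown.

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 4, 3                # toy sizes, chosen only for illustration

W_xh = torch.randn(hidden_size, input_size)   # input-to-hidden weights
W_hh = torch.randn(hidden_size, hidden_size)  # hidden-to-hidden (feedback) weights
b_h = torch.zeros(hidden_size)                # hidden bias

x_t = torch.randn(input_size)       # one term of the sequence (e.g., a word vector)
h_prev = torch.zeros(hidden_size)   # hidden state carried over from the previous term

# The hidden state is updated with a tanh activation, as in Figure 1.
h_t = torch.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
print(h_t)
```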


RNN algorithms lose predictive power as the length of a sentence or paragraph increases. This happens because, during RNN model training, backpropagation leads to vanishing or exploding gradients. I explained what vanishing and exploding gradients are in WHAT IS AN ARTIFICIAL NEURAL NETWORK?.


In RECURRENT NEURAL NETWORKS, I also discussed how the canonical RNN architecture can be stabilized in three ways. A stabilized RNN can be bidirectional (Figure 2), long short-term memory (LSTM) (Figure 3), or gated recurrent unit (GRU) (Figure 4). The bidirectional strategy is often coupled with the LSTM and GRU approaches, resulting in bidirectional-LSTM or bidirectional-GRU RNNs.


Figure 2. In a bidirectional RNN, the hidden state is regulated by hidden states (hf and hb) in both directions.


Figure 3. In an LSTM, the unrolling recurrent unit operates as a memory cell, which uses three different gates to regulate the information used by the hidden state (ht).


Figure 4. GRUs are a streamlined variant of the LSTM RNN that uses two gates instead of three.
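
As a rough sketch of how these stabilized variants are built in practice, the snippet below instantiates a bidirectional GRU and an LSTM with PyTorch. The sequence length and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, emb_dim, hidden_dim = 6, 1, 8, 16   # toy sizes for illustration

x = torch.randn(seq_len, batch, emb_dim)            # a sequence of 6 word embeddings

# Bidirectional GRU: two gates per unit, hidden states run in both directions
# (Figures 2 and 4).
bi_gru = nn.GRU(emb_dim, hidden_dim, bidirectional=True)
outputs, h_n = bi_gru(x)
print(outputs.shape)   # (6, 1, 32): forward and backward hidden states concatenated
print(h_n.shape)       # (2, 1, 16): final hidden state of each direction

# LSTM: three gates plus a cell state that acts as the memory cell (Figure 3).
lstm = nn.LSTM(emb_dim, hidden_dim)
outputs, (h_n, c_n) = lstm(x)
print(h_n.shape, c_n.shape)   # (1, 1, 16) each
```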


RNNs are the main component of sequence-to-sequence (seq2seq) deep learning models, which are used to translate long stretches of text in an automated manner. Let's learn about the seq2seq architecture in the following lines.



Sequence-to-sequence models

The seq2seq architecture was proposed in 2014 and represents the basis of neural machine translation, which is the current paradigm in automated language processing applications.


Later, in 2015, the attention mechanism was added to the seq2seq architecture.

The attention mechanism captures the semantic context and position of each word in an input sentence. Capturing this information is important when the number and order of the words in the input and target sentences are dissimilar. Without the semantic context of a word in a sentence, the accuracy of the translation task is compromised.


An example is the English sentence "Childhood shapes destiny, like the proteome dictates phenotype," which in German reads: "Die Kindheit prägt das Schicksal, so wie das Proteom den Phänotyp bestimmt." In the source (English) and target (German) sentences, the number of words and their order in the sentence are different.



Sequence-to-sequence architecture


A prototypical seq2seq algorithm has three components:

  • Encoder

  • Attention layer

  • Decoder


The encoder and decoder are typically bidirectional-GRU RNNs. The attention layer, on the other hand, is composed of a context vector and a collection of alignment weights, one for each word in the input text (Figure 5).


Figure 5. Schematic illustration of the sequence-to-sequence (seq2seq) architecture. Individual hidden states for each word in the input (source) sentence (hs), individual hidden state for each word in the target sentence (ht), unrolled RNN in the encoder when ingesting one word (term) at a time (x), unrolled RNN in the decoder when ingesting one word (term) at a time (y), alignment weight vector for each word in the input (source) sentence (as). The softmax activation function outputs probability values between 0 and 1 (0 to 100%).
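
Before walking through the steps, here is a minimal sketch of the three components as PyTorch modules. The vocabulary and layer sizes are illustrative assumptions, and the single linear scoring layer merely stands in for the alignment mechanism described below.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 32, 64   # toy sizes, chosen only for illustration

embedding = nn.Embedding(vocab_size, emb_dim)

# Encoder: a bidirectional GRU that produces one hidden state (hs) per source word.
encoder = nn.GRU(emb_dim, hidden_dim, bidirectional=True)

# Attention layer: alignment weights are computed from the decoder state (ht) and the
# encoder hidden states (hs); here a single linear layer scores each hs against ht.
attention_score = nn.Linear(2 * hidden_dim + hidden_dim, 1)

# Decoder: a GRU that ingests the previous target word plus the weighted context
# vector and updates the target hidden state (ht).
decoder = nn.GRU(emb_dim + 2 * hidden_dim, hidden_dim)

output_layer = nn.Linear(hidden_dim, vocab_size)  # scores the next target word
```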



Steps in seq2seq sentence translation

How a seq2seq model translates a sequence of words is explained in five steps depicted together in Figure 5. Let's go over these steps in the following lines.


Step 1. The encoder and context vector

The encoder, which is normally a bidirectional-GRU RNN, ingests an input sentence by processing one word at a time. As the bidirectional-GRU RNN unrolls, the hidden state (hs) is updated. (The "s" in hs denotes "source," referring to the input dataset.)


To remember how an RNN unrolls, visit my previous blogpost: RECURRENT NEURAL NETWORKS.


Once the entire input is processed, the RNN outputs a final hidden state (hfinal), which feeds into the context vector. The context vector summarizes the important aspects of each word present in the input sentence.
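
A minimal sketch of this step, assuming a toy source sentence of five word IDs and the illustrative layer sizes used above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, emb_dim, hidden_dim = 1000, 32, 64

embedding = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, bidirectional=True)

source_ids = torch.tensor([[4], [27], [311], [9], [2]])   # 5 source words, batch of 1
embedded = embedding(source_ids)                          # (5, 1, 32)

# hs holds one hidden state per source word; h_final summarizes the entire input.
hs, h_final = encoder(embedded)
print(hs.shape)        # (5, 1, 128): one state per word, both directions concatenated
print(h_final.shape)   # (2, 1, 64): final state of the forward and backward passes
```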




Step 2. The alignment mechanism

The decoder uses the final input hidden state (hfinal) from the encoder to infer the first translated word (y1) and to update its target (t) hidden state (ht). For the translation task to continue, the individual hs vectors in the context vector are aligned to ht in the decoder. By aligning the individual hs to ht, the decoder learns to "pay attention" to the relevant semantic information contained in the context vector. Word by word, the alignment step generates alignment score values for each of the individual hs in the context vector.
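
One common way to compute the alignment scores is a dot product between the decoder state and each encoder hidden state. The sketch below uses that choice with random toy tensors; the shapes follow the Step 1 sketch, and the decoder state is assumed to match the dimensionality of hs so the dot product applies.

```python
import torch

torch.manual_seed(0)
hs = torch.randn(5, 128)   # encoder hidden states, one per source word (from Step 1)
ht = torch.randn(128)      # current decoder (target) hidden state

# One alignment score per source word: how well does each hs align with ht?
scores = hs @ ht           # shape (5,)
print(scores)
```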




Step 3. The alignment weight vectors

The alignment score values are passed through a softmax activation function, which converts them into the alignment weight vectors (as). Because a softmax function outputs values that range from 0 to 1 and sum to 1, the weight vectors represent probability values (see WHAT IS AN ARTIFICIAL NEURAL NETWORK?).
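
In code, this step is just a softmax over the scores from Step 2; the score values below are made up for illustration.

```python
import torch

scores = torch.tensor([2.0, 0.5, -1.0, 0.1, 1.2])   # toy alignment scores, one per source word
a_s = torch.softmax(scores, dim=0)

print(a_s)         # each alignment weight lies between 0 and 1
print(a_s.sum())   # tensor(1.): the weights behave like probabilities
```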




Step 4. The weighted context vector

To update the context vector, each alignment weight vector is multiplied by its corresponding hidden state (hs) in the context vector, which was previously generated by the encoder. Summing these weighted hidden states results in a weighted context vector.
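
A minimal sketch of this step, with random toy tensors standing in for the encoder hidden states and the alignment weights from Step 3:

```python
import torch

torch.manual_seed(0)
hs = torch.randn(5, 128)                      # encoder hidden states (from Step 1)
a_s = torch.softmax(torch.randn(5), dim=0)    # alignment weights (from Step 3)

# Scale each hidden state by its alignment weight and sum the results.
context = (a_s.unsqueeze(1) * hs).sum(dim=0)  # weighted context vector, shape (128,)
print(context.shape)
```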




Step 5. The attention mechanism

Every time the decoder translates a word, it queries the weighted context vector, which is updated at each step (step 4). From the weighted context vector, the decoder infers which information from the encoder's hidden states it must pay attention to every time it translates a new word in the sentence.
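
To close the loop, here is a minimal sketch in which the decoder combines the weighted context vector with its current hidden state to score the next target word. The combination layer, its sizes, and the random inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden_dim = 1000, 64

ht = torch.randn(hidden_dim)             # current decoder hidden state
context = torch.randn(2 * hidden_dim)    # weighted context vector (from Step 4)

combine = nn.Linear(hidden_dim + 2 * hidden_dim, hidden_dim)   # merges ht and the context
output_layer = nn.Linear(hidden_dim, vocab_size)               # scores each vocabulary word

attentional_state = torch.tanh(combine(torch.cat([ht, context])))
word_probs = torch.softmax(output_layer(attentional_state), dim=0)
print(word_probs.argmax())   # index of the most probable next target word
```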





While efficient in automated translation tasks, RNN-based seq2seq algorithms are computationally demanding and, as mentioned above, prone to instability during model training. To overcome these limitations, in 2017 Google Research developed a new type of encoder and decoder, named the transformer, and implemented it in the seq2seq architecture. Transformers surpass RNNs in computational efficiency and stability, and they have a more powerful attention mechanism.


The transformer-based seq2seq architecture has revolutionized the way natural language data is processed. Popular tools like ChatGPT are based on transformers.


I will explain what transformers are in my next blogpost. In the meantime, check out the following videos:


Large language models explained briefly

Transformers, the tech behind LLMs

Attention in transformers, step-by-step



Stay tuned!

GPR


Disclosure: At BioTech Writing and Consulting we believe in the use of AI in Data Science, but we do not use AI to generate text or images.


