
TRANSFORMERS

Updated: Feb 13

The deep learning architecture named Transformer first appeared in the literature in 2017. You can check out the publication by clicking on this hyperlink: Attention Is All You Need — 2017


Conceived to overcome the limitations of the natural language processing models of the day, the Transformer architecture went from making text translation and next-word prediction more accurate, to inspiring the development of ChatGPT (OpenAI). (GPT stands for Generative Pre-trained Transformer.)


After ChatGPT, many more Transformer-based large language models (LLMs) were developed, such as Claude (Anthropic), Copilot (Microsoft), Gemini (Google), and Llama (Meta).


Transformer-based large language models (LLMs) gave birth to the AI era we live in. LLMs are trained using NVIDIA's graphics processing unit (GPU) technology and the ever-increasing amount of publicly available online text and images (newspapers, books, and social media).



But understanding how a Transformer model works is not as easy as having ChatGPT write love letters for you.


Below I describe, in a non-mathematical manner, the building blocks of the prototype Transformer model published in 2017.


I broke down the narrative into four sections:

  • The encoder-decoder architecture

  • The attention block

  • The Attention mechanism

  • The Vanilla Transformer


Reading my previous blogposts will help you understand the concepts I discuss below.



THE ENCODER-DECODER ARCHITECTURE


In neural machine translation models, the sequence-to-sequence (seq2seq) architecture is the preferred scaffold (Figure 1).


Figure 1. The seq2seq architecture.

Before the Transformers were introduced, the encoder and decoder in seq2seq models were recurrent neural networks (RNNs) connected by an attention layer.


RNNs ingest sequential inputs in a stepwise fashion (Figure 2). This makes RNNs an ideal component in language processing models. Examples of sequential data are:

  • Words in a text

  • Nucleotides in a DNA sequence

  • Data points in a weather forecast time series
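The stepwise ingestion described above can be sketched in a few lines of NumPy. This is a minimal, hypothetical recurrent unit with made-up sizes and random weights, not a trained model; the point is that the time loop cannot be parallelized, because step t needs the hidden state from step t-1.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_in, d_h = 5, 4, 3          # e.g. five words, 4-dim inputs (assumed sizes)
W_xh = rng.normal(size=(d_in, d_h))   # input-to-hidden weights
W_hh = rng.normal(size=(d_h, d_h))    # hidden-to-hidden weights

x = rng.normal(size=(seq_len, d_in))  # the sequential input
h = np.zeros(d_h)                     # initial hidden state
states = []
for t in range(seq_len):              # strictly sequential: no parallelism over t
    h = np.tanh(x[t] @ W_xh + h @ W_hh)
    states.append(h)
states = np.array(states)             # one hidden state per input step
```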


But the stepwise mechanism in RNNs comes at a cost: RNNs are unable to parallelize their computations, and they lose performance in proportion to the length of the sequential input, a phenomenon known as the long-term dependency problem. See RECURRENT NEURAL NETWORKS.


Figure 2. The unrolled recurrent unit concept to explain RNNs.

To mitigate the long-term dependency problem, two RNN variants are available:

  • Bidirectional Gated Recurrent Unit (GRU) RNN

  • Bidirectional Long Short-Term Memory (LSTM) RNN


But while good at recovering long-term information, the GRU and LSTM RNNs fail to parallelize data processing because they must still ingest one term at a time.



THE ATTENTION BLOCK


The Transformer architecture was designed to overcome parallelization issues in seq2seq models.


Instead of RNNs, Transformers use an attention block (also named attention layer), which includes multiple self-attention heads and a multilayer perceptron (Figure 3).


See WHAT IS AN ARTIFICIAL NEURAL NETWORK? for a better explanation of multilayer perceptrons, also called feedforward networks.


Figure 3. The attention block.

In Transformer models, language processing is driven by the self-attention mechanism, which resides in the attention head and is expressed by the following equation:

Attention(Q, K, V) = softmax(Q∙Kᵀ/√dk)∙V

The three N-dimensional matrices—query (Q), key (K), and value (V)—contain tunable weights that modify the embedded tokens of the input dataset. See Box 1 for an explanation of the Query:Key:Value concept.


The values in the Q, K, and V matrices are adjusted when the model is trained using the backpropagation approach. I explained what backpropagation is in WHAT IS AN ARTIFICIAL NEURAL NETWORK?.


In the self-attention equation the superscript T indicates that the K matrix is transposed, whereas the √dk divisor is the square root of the number of dimensions in K. Scaling the Q∙K values by √dk stabilizes the model during backpropagation.



In the following section I describe, step-by-step, how a Transformer processes an input text.



THE ATTENTION MECHANISM


Step 1. Embedding Representation.

The first step in the attention mechanism is to tokenize the input dataset. In the example shown below, the sentence "Three sad tigers ate wheat" is broken down into five tokens: one per word.
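Word-level tokenization of the example sentence can be sketched as follows. This toy version builds its vocabulary from the sentence itself; real Transformers use subword tokenizers (such as byte-pair encoding) trained on a large corpus.

```python
# Map each distinct word to an integer ID, then encode the sentence
sentence = "Three sad tigers ate wheat"
vocab = {word: i for i, word in enumerate(sorted(set(sentence.split())))}
tokens = [vocab[word] for word in sentence.split()]  # five tokens, one per word
```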


Each token is next converted to an n-dimensional numeric vector by passing it through an embedding layer: a matrix, built from the vocabulary used to train the Transformer model, that is populated with word-specific numeric values.


From the vectorized tokens an embedding representation is produced, to which positional encoding information is added so the model knows the order of the words.


Step 1. Embedding representation.
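Step 1 can be sketched in NumPy with assumed toy sizes and random embedding values: look up one n-dimensional vector per token ID, then add sinusoidal positional encodings (the scheme used in the 2017 paper) to the token vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 5, 8, 5               # assumed toy dimensions
embedding = rng.normal(size=(vocab_size, d_model))   # one row per vocabulary word

token_ids = [0, 1, 2, 3, 4]      # the five tokens of the example sentence
X = embedding[token_ids]         # (seq_len, d_model) vectorized tokens

# Sinusoidal positional encoding: sine on even dimensions, cosine on odd ones
pos = np.arange(seq_len)[:, None]
i = np.arange(d_model)[None, :]
angle = pos / 10000 ** (2 * (i // 2) / d_model)
pe = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

X = X + pe                       # embedding representation with position info
```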

Step 2. Self-Attention Scores.

In Step 2 the embedded tokens are modified by the weight values in the query (Wq) and key (Wk) matrices. As the embedded representation passes through the Wq and Wk matrices, the token vectors are enriched with information pertaining to their semantic interconnections.


Self-attention scores result from the matrix dot product of the Wq- and Wk-modified values (Q∙K) in the embedding representation. To preserve model stability, the Q∙K values are normalized by √dk. Recall the attention mechanism equation: Attention(Q, K, V) = softmax(Q∙Kᵀ/√dk)∙V.


Step 2. Self-attention scores.
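A minimal sketch of Step 2, with assumed toy dimensions and random weights standing in for the trained Wq and Wk matrices: project the embeddings into queries and keys, then take the scaled dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4             # assumed toy sizes
X = rng.normal(size=(seq_len, d_model))     # embedding representation
Wq = rng.normal(size=(d_model, d_k))        # query weights
Wk = rng.normal(size=(d_model, d_k))        # key weights

Q = X @ Wq                                  # queries
K = X @ Wk                                  # keys
scores = Q @ K.T / np.sqrt(d_k)             # (seq_len, seq_len) self-attention scores
```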

Step 3. Attention Pattern.

An attention pattern is obtained after normalizing the self-attention scores with a softmax function. The normalized scores are attention weights: probability values that range from 0 to 1 and sum to 1 across each row. The softmax function exponentiates each score and divides it by the sum of all the exponentials, and is expressed by the following equation:

softmax(xi) = e^xi / Σj e^xj


Step 3. Attention pattern.
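The softmax of Step 3 can be written directly from its equation. The subtraction of the row maximum is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(x):
    # Exponentiate each score (shifted by the row maximum for stability),
    # then divide by the row sum so each row is a probability distribution
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

weights = softmax(np.array([[2.0, 1.0, -1.0]]))  # toy row of attention scores
```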

Step 4. Self-Attention Output.

The last step in the attention mechanism applies the value matrix (Wv) to the embedded tokens. The self-attention output is, for each embedding vector, the weighted sum of the Wv-modified values, using the softmax-normalized attention weights.


Step 4. Self-attention output.
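Step 4 as a NumPy sketch, again with assumed toy sizes and random weights standing in for the trained Wv. A uniform attention pattern is used purely for illustration, so each output row is simply the average of the value vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_v = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))        # embedding representation
Wv = rng.normal(size=(d_model, d_v))           # value weights
V = X @ Wv                                     # value vectors

# A (seq_len, seq_len) attention pattern; uniform weights for illustration
weights = np.full((seq_len, seq_len), 1 / seq_len)
output = weights @ V                           # weighted sum of values per token
```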

Step 5. Updated Embedding.

Each self-attention output is added to its corresponding embedding vector. This procedure updates the embedding representation by adding semantic information to the words in the input sentence. The addition, known as a residual (skip) connection, also helps to stabilize the model; the running sum of embedding updates is sometimes called the residual stream.


Step 5. Updated embedding.
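Step 5 is a single element-wise addition. In this sketch the embedding and the self-attention output are random arrays of the same assumed shape; the update keeps the original signal and adds the new semantic information on top.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))          # embedding representation (5 tokens)
attn_out = rng.normal(size=(5, 8))   # self-attention output, same shape

X_updated = X + attn_out             # residual (skip) connection
```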

Step 6. Multilayer Perceptron.

To extract more complex semantic information, each token vector in the updated embedding is channeled through a multilayer perceptron (MLP). Also called feedforward networks, MLPs are prototypical deep learning models, which I spoke about in: WHAT IS AN ARTIFICIAL NEURAL NETWORK?.


The same MLP is applied to every token vector in the updated embedding. Although the token vectors are processed simultaneously, each one is modified independently of the others. Like in Step 5, the MLP output is used to update the embedding representation.


Step 6. Multilayer perceptron.
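Step 6 sketched with a two-layer ReLU MLP of assumed toy sizes and random weights. The matrix form applies the same weights to every token row at once, so tokens are processed in parallel but never mix with each other in this step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32                     # assumed toy sizes
W1 = rng.normal(size=(d_model, d_hidden))     # first MLP layer
W2 = rng.normal(size=(d_hidden, d_model))     # second MLP layer

X_in = rng.normal(size=(5, d_model))          # updated embedding (5 tokens)
mlp_out = np.maximum(0, X_in @ W1) @ W2       # ReLU MLP applied per token row
X_out = X_in + mlp_out                        # residual update, as in Step 5
```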

Step 7. Unembedding Matrix.

An unembedding matrix (Wu) converts the MLP's output vector into one score per vocabulary word. Like the embedding matrix used to vectorize the tokens, Wu contains a numeric entry for each word in the corpora used to train the Transformer model.


Word probability scores are obtained by normalizing the values in the Wu-modified matrix with the softmax function. The word with the highest probability score is the "next word" predicted or translated in the Transformer output.


Step 7. Unembedding matrix and next word prediction.
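Step 7 as a toy sketch, with an assumed five-word vocabulary and random weights standing in for the trained Wu: map the final token vector to one score per word, convert the scores to probabilities with softmax, and pick the most probable word.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 5                 # assumed toy sizes
words = ["Three", "sad", "tigers", "ate", "wheat"]
Wu = rng.normal(size=(d_model, vocab_size))  # unembedding matrix

h = rng.normal(size=d_model)               # final vector of the last token
logits = h @ Wu                            # one score per vocabulary word
e = np.exp(logits - logits.max())
probs = e / e.sum()                        # softmax probabilities
next_word = words[int(np.argmax(probs))]   # highest-probability word wins
```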

THE VANILLA TRANSFORMER


Having explained how the attention block processes an embedding representation, we can now look at the vanilla Transformer architecture in full form.


In the vanilla Transformer model the encoder is the attention block I described above; its output is passed on to the decoder via a cross-attention mechanism.


The decoder in the vanilla Transformer has two different attention blocks. The lowermost attention block has a masked multi-head attention mechanism and ingests the words the decoder has already produced. Its output flows to the topmost attention block, which combines it with the encoder output through a cross-attention mechanism before passing it to the token-specific MLPs.


While not explained here in detail, the masked multi-head attention mechanism masks the words in the sentence that lie ahead of the one being translated.
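The masking idea can be sketched with a causal mask over toy attention scores (all zeros here, purely for illustration): positions ahead of the current one are set to negative infinity before the softmax, so their attention weights become exactly zero and the decoder cannot "see" words it has not translated yet.

```python
import numpy as np

seq_len = 5
scores = np.zeros((seq_len, seq_len))             # toy attention scores
mask = np.triu(np.ones((seq_len, seq_len)), k=1)  # 1s strictly above the diagonal
masked = np.where(mask == 1, -np.inf, scores)     # hide future positions

# Softmax row by row: masked positions get weight exactly 0
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)       # rows still sum to 1
```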


The task in the hypothetical example shown in Figure 4 is to translate the sentence "Three sad tigers ate wheat" (English) to "Tres tristes tigres tragaban trigo" (Spanish).


In Figure 4 below, the word "tres" has been translated. The words "tigres tragaban trigo" lie ahead of "tristes," which is in the process of being translated.


Figure 4. Vanilla Transformer architecture.

I want to finish this blogpost by pointing out that chatbots (ChatGPT) and search engines (Gemini) are built from Transformer variants that are simpler than the full encoder-decoder vanilla architecture I show in Figure 4.


For example, BERT (Google's first Transformer) is an encoder-only model. ChatGPT, on the other hand, is a decoder-only model.


Each of these models is built from billions of parameters. That's what makes them large language models.



This is all for now. In my next blogpost I will talk about Transformer models used in MS-proteomics.


Stay tuned!

GPR


Disclosure: At BioTech Writing and Consulting we believe in the use of AI in Data Science, but we do not use AI to generate text or images.









