
Part-of-Speech Tagging & its Methods (5)

The Transformer

The Transformer (Vaswani et al. 2017) was introduced in 2017 as the first sequence transduction model based entirely on self-attention. Transformers replace the recurrent layers commonly used in encoder-decoder architectures and have become the dominant approach to sequence-to-sequence tasks, handling long-range dependencies well. Because they process all positions in parallel rather than sequentially, they also train faster than traditional RNNs. The architecture stacks self-attention and point-wise, fully connected layers for both the encoder and the decoder. The encoder consists of six identical stacked layers, each with two sub-layers: a multi-head self-attention mechanism and a simple, position-wise, fully connected feed-forward network. Each sub-layer is wrapped in a residual connection followed by layer normalization. All sub-layers and embedding layers produce outputs of dimension d_model = 512.
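To make the layer structure concrete, here is a minimal PyTorch sketch of a single encoder layer, assuming the hyperparameters from Vaswani et al. 2017 (d_model = 512, h = 8 heads, inner feed-forward dimension 2048); the class and variable names are my own illustration, not code from the paper.

# A minimal sketch of one Transformer encoder layer as described above:
# multi-head self-attention and a position-wise feed-forward network,
# each wrapped in a residual connection followed by layer normalization.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        # Position-wise FFN: two linear layers with a ReLU in between.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        # Sub-layer 1: multi-head self-attention, residual + layer norm.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: position-wise FFN, residual + layer norm.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

# The full encoder stacks six identical layers:
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])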

Like the encoder, the decoder comprises six identical layers. In addition to the two sub-layers found in each encoder layer, each decoder layer inserts a third multi-head attention sub-layer that attends over the output of the encoder stack. Furthermore, the decoder's self-attention sub-layer is masked so that the predictions for position i can depend only on the known outputs at positions before i. The diagram below shows the encoder-decoder structure of the Transformer architecture.

The encoder-decoder structure of the Transformer architecture, adapted from (Vaswani et al. 2017)
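The masking in the decoder's self-attention can be illustrated with a short sketch: a boolean upper-triangular matrix blocks attention from position i to any later position. This is my own illustrative snippet, not the original implementation; the mask convention shown (True = blocked) is the one used by PyTorch's nn.MultiheadAttention.

# Causal mask for the decoder's self-attention: position i may only attend
# to positions <= i, so the upper triangle of the score matrix is blocked.
import torch

seq_len = 5
# True marks positions that must NOT be attended to (the "future").
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         ...
# nn.MultiheadAttention accepts this via its attn_mask argument, e.g.
# self_attn(x, x, x, attn_mask=causal_mask); the third (cross-attention)
# sub-layer instead takes the encoder output as keys and values:
# cross_attn(x, memory, memory).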

Attention is the key to the Transformer's success with sequence-to-sequence tasks. Within this architecture, attention is a function that maps a query Q and a set of key-value pairs (K, V), all of which are vectors, to an output computed as a weighted sum of the values. The weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
The Transformer uses multi-head attention, in which the queries, keys, and values are linearly projected h times, and scaled dot-product attention is applied to each projection in parallel. The h outputs are then concatenated and projected once more.
Scaled dot-product attention takes as input queries and keys of dimension d_k and values of dimension d_v. In practice, the attention function is computed on matrices of queries Q, keys K, and values V; more formally:

(1)   \begin{equation*}\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} \right)V\end{equation*}


The softmax function is applied here to obtain the weights on the values. The scaling factor \frac{1}{\sqrt{d_k}} is added because, for large values of d_k, the dot products grow large in magnitude and push the softmax into regions where it has extremely small gradients; scaling by \frac{1}{\sqrt{d_k}} counteracts this effect.
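As a sanity check of Equation (1), the following NumPy sketch computes the attention output directly from the formula; the matrix shapes and the tiny random example are assumptions made purely for illustration.

# A direct NumPy transcription of Equation (1) above.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output: (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # compatibility of each query with each key
    weights = softmax(scores, axis=-1)         # each row sums to 1: weights over the values
    return weights @ V                         # weighted sum of the values

# Tiny example: 3 queries, 4 key-value pairs, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 8)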

The advent of transformer-based architectures served as a catalyst for the emergence of large language models (LLMs), revolutionizing the landscape of AI across multiple domains such as computer vision, robotics, and NLP. GPT-2 (Radford et al. 2019), BERT (Kenton et al. 2019), RoBERTa (Liu et al. 2019), XLM-RoBERTa (Conneau et al. 2020), UDify (Kondratyuk et al. 2019), and most recently Falcon LLM (Penedo et al. 2023) are all examples of recent LLMs with very powerful capabilities. It has become standard to train such LLMs on many human languages, making them multilingual or cross-lingual models. However, they do not always achieve state-of-the-art results on specific downstream tasks such as POS tagging compared to monolingual models, especially for low-resource languages (LRLs). Nevertheless, they are highly adaptable and can deliver better results when fine-tuned on monolingual data.

XLM-RoBERTa (Conneau et al. 2020) is a transformer-based LLM trained on textual data from 100 languages, including Northern Kurdish. The model aims to offer improved cross-lingual performance and generalization across a wide variety of languages, making it highly versatile for multilingual tasks. It comes in two variants, base and large; the base variant has a smaller architecture with fewer parameters and fewer transformer layers, and it is computationally less demanding than the large variant.
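For readers who want to experiment, the following sketch shows one common way to load XLM-RoBERTa for a token-level task such as POS tagging with the Hugging Face transformers library. The checkpoint names are the public ones ("xlm-roberta-base" / "xlm-roberta-large"); the label count and the example sentence are placeholder assumptions, and the actual fine-tuning loop is omitted.

# Loading XLM-RoBERTa with a token-classification head (e.g. for POS tagging).
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",   # swap in "xlm-roberta-large" for the larger variant
    num_labels=17,        # e.g. the 17 Universal Dependencies UPOS tags (assumption)
)

inputs = tokenizer("Ez diçim malê.", return_tensors="pt")   # example sentence
outputs = model(**inputs)
print(outputs.logits.shape)   # (1, number_of_subword_tokens, num_labels)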

