## Artificial Neural Networks

In addition to the rule-based methods, discriminative (CRF), generative (HMM), decision trees, and N-gram models for POS tagging, artificial neural networks (ANNs) are a common approach to tackle the task of POS tagging and other NLP tasks nowadays. ANNs can extract features automatically from the input text. Thus, there is no need for any predefined set of rules as in rule-based methods or any feature engineering as in CRFs and decision trees models. In addition, ANNs help capture long-term contextual dependencies in input data, leading to better results for POS tagging.

An ANN is a computational processing paradigm inspired by the structure and functionality of biological neural networks to learn from observational data and perform decision-making tasks. A biological neural network comprises a complex network of neurons that communicate through intricate patterns of electrical signals. The dendrites are responsible for receiving input signals, and based on these inputs, they initiate an output signal through the axon. The ANN operates on a similar principle, utilizing mathematical computations to process inputs.

### The Perceptron algorithm

The concept of neural networks can be traced back to 1943 when Warren McCulloch and Walter Pitts (McCulloch et al. 1943) laid the foundation for understanding how simple artificial neurons could mimic certain aspects of the brain’s functioning. They introduced a network that could solve a binary problem without active learning. Building on their work, in 1957, Frank Rosenblatt introduced the concept of the Perceptron, a single-layer neural network capable of learning linearly separable patterns by adapting its own weights (Rosenblatt 1958) figure 1.

A perceptron is the basic constituent unit of most ANNs and can be viewed as a simplified model of biological neurons. A perceptron has a fixed number of inputs and a single output. Each input has its corresponding weight indicating the influence that input has while computing the output. The output of a perceptron is produced by constructing a linear combination of the input values and their corresponding weights and then applying a nonlinear activation function (step function) such as ReLU, Sigmoid, Tahn, or Logistic.

### Recurrent Neural Networks

Recurrent Neural Networks (RNNs) (Rumelhart et al. 1986) are a class of feed-forward neural networks (NN) inspired by the cyclical connectivity of neurons in the human brain for handling ordered sequential data, something feed-forward neural networks like Convolutional neural networks (CNNs) fail to deal with.

An RNN takes as input a sequence of items like time series, speech, text, audio, video weather, or financial data and produces a fixed- or variable-size output.

RNNs have the concept of *memory* that helps them preserve the (hidden) states or information of previous inputs, which makes them very well suitable for the task of POS tagging since they always consider the data received previously for making any predictions. The memory concept is accomplished because the RNNs have loops; in other words, RNNs can be seen as multiple copies of the same network where each network passes information to the next network.

Formally, given a sequence , the RNN updates its recurrent hidden states at time by:

(1)

is any nonlinear function such as a logistic sigmoid. \

The recurrent hidden state from 1 can be updated according to the following:

(2)

The input sequence is parameterized by the weights matrix , while the hidden states are parameterized by the weights matrix . The are the weights matrix for the hidden units to the output units. and are the biases associated with the recurrent and feedforward layers.

The output at time can be calculated as follows:

(3)

Another important property of RNNs is that they share parameters across each network layer. While feed-forward networks have different weights across each node, RNNs share the same weight parameter within each network layer. Backpropagation and gradient descent are utilized to ensure the weights are adjusted accordingly. RNNs can use the *backpropagation through time* (BPTT) algorithm to determine the gradients. BPTT is different from traditional backpropagation because of its ability to work with sequential data. It sums the errors at each time step across each layer.

Despite the ability of RNNs to handle sequential data and variable-sized inputs, they can not take into account future input for decision-making as they are not bidirectional. In addition, they can not store information for a long time. They have difficulties with capturing long-term dependencies (Bengio et al. 1994). They suffer from the two common problems in deep networks, vanishing or exploding gradient problems where the gradients used to compute the weights updates may get very close to zero or grow exponentially, preventing the network from learning (Goodfellow et al. 2016). Thus, new variants of RNNs have been proposed to overcome those problems: bidirectional RNNs and Gated RNNs.

### Bidirectional RNNS

Bidirectional RNNS (BRNNs) (Schuster et al. 1997) are a variant of RNNs that use available input information from the past and the future for any specific time frame. BRNNs have two passes, a forward and a backward pass. This mechanism provides the output layer of the network with a complete past and future context for every point in the input sequence. The diagram below demonstrates the architecture of a bidirectional RNN across different time steps .

### Long Short-Term Memory

Long short-term memory (LSTM) (Hochreiter et al. 1997) is another efficient and effective variant of gated RNNs that extends the concept of memory to be able to remember new information, hold previous states or forget information in order to achieve effective processing of sequential data and long-term information access. The LSTM architecture consists of a set of recurrently connected memory blocks or memory cells. Each block contains one or multiple self-connected memory cells and three multiplicative units. The input, output, and forget units (gates) provide continuous analogous of write, read, and reset operations of each LSTM cell. Those gates control access to the memory cells. At each input state, a gate is responsible for deciding how much new input should be saved to the memory and how much current information should be forgotten.

The architecture of LSTM enables it to learn long-term dependencies, which is essential in problems like sequence labeling (POS tagging). An LSTM with a forget gate can be formally defined by the following:

(4)

, , and respectively refer to the forget, input, and output gates activation vectors at time step . and are the hidden state (output vector of LSTM) and cell state vectors at time step . The sigmoid and hyperbolic tangent functions are the . The represents the Hadamard product (element-wise product).

### Bidirectional LSTM

While LSTM models can capture long-term dependencies, they only process the data in one direction, and the data flows only forward. This means that potential combined contextual information from the past and the future will be missed. Therefore, bidirectional LSTM (BiLSTM) models (Graves et al. 2005) are introduced to mitigate this shortcoming. The BiLSTMs have two LSTM layers that process information in both directions, enabling them to capture contextual information surrounding a specific data item in a sequence, for example, a word in a given sentence. This property makes the BiLSTM very suitable for sequence labeling tasks such as POS tagging, where context from the previous and following words are crucial for determining the current context.

In contrast to approaches like decision trees where manual feature engineering is required, the BiLSTM models use word, sub-word, or character embeddings such as word2vec (Mikolov et al. 2013), Polyglot \ncite{al2013polyglot}, or fastText (Bojanowski et al. 2017). Word or sub-word embeddings provide a nuanced input representation capturing both semantic and syntactic elements.

### Gated recurrent unit

Gated recurrent unit (GRU) was proposed by (Cho et al. 2014) in 2014 as an alternative to LSTM as it proved to be computationally expensive (Goldberg 2017). However, the architecture of a GRU is not very different from LSTM; it is also based on gates but with substantially fewer gates and without a separate memory component. A GRU has only two gates: an update gate and a reset gate.

The update gate determines the information from the previous state that must be passed to the next state. In contrast, the reset gate determines how much information from the previous state must be forgotten. More formally:

(5)

The variables , , , , are the input vector, output vector, reset gate vector, candidate activation vector, and update gate vector, respectively. , are parameter matrices, while and are the sigmoid and hyperbolic tangent functions.