Hidden Markov Models
A Hidden Markov Model (HMM) is one of the best-known probabilistic generative models used to find the most likely sequence of hidden states in a finite-state system, given a sequence of observable events (referred to as observations), such as a sequence of input words. In POS tagging, the input words are treated as observed events, while the POS tags are treated as hidden events. The HMM is based on the Markov chain, a model that describes the probabilities of sequences of random variables (states) whose values are drawn from some finite set, such as words, tags, or symbols.
The HMM is a probabilistic finite-state formalism characterized by a tuple $(T, \pi, A, B)$. $T$ is a finite set of POS tags from a given tagset, $\pi$ is the initial probability distribution over tags, i.e., the probability of each tag for the first word in the sequence, $A$ is the transition matrix, and $B$ is the emission matrix.
Matrix $A$ contains the transition probabilities, and matrix $B$ contains the observation (emission) likelihoods. The values of $A$ represent tag transition probabilities $P(t_i \mid t_{i-1})$, i.e., the probability of a tag occurring given the previous tag. The maximum likelihood estimate of this transition probability can be computed as follows:
$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$ (1)
The values of matrix $B$ represent the emission probabilities $P(w_i \mid t_i)$, i.e., the probability that a POS tag $t_i$ emits a given word $w_i$:
$P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}$ (2)
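As an illustration of Eqs. (1) and (2), the following Python sketch estimates the transition and emission probabilities from counts over a hypothetical toy tagged corpus; it is a minimal example under these assumptions, not the implementation used in this study.

```python
from collections import defaultdict

# Hypothetical toy tagged corpus: a list of sentences, each a list of (word, tag) pairs.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

trans_counts = defaultdict(lambda: defaultdict(int))  # C(t_{i-1}, t_i)
emit_counts = defaultdict(lambda: defaultdict(int))   # C(t_i, w_i)
tag_counts = defaultdict(int)                         # C(t_i)

for sentence in corpus:
    prev_tag = "<s>"  # sentence-start pseudo-tag
    for word, tag in sentence:
        trans_counts[prev_tag][tag] += 1
        emit_counts[tag][word] += 1
        tag_counts[tag] += 1
        prev_tag = tag

# Eq. (1): P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
def transition_prob(prev_tag, tag):
    # For the start pseudo-tag, C(<s>) is the number of sentences,
    # i.e., the number of outgoing transitions from <s>.
    denom = tag_counts[prev_tag] if prev_tag in tag_counts else sum(trans_counts[prev_tag].values())
    return trans_counts[prev_tag][tag] / denom if denom else 0.0

# Eq. (2): P(w_i | t_i) = C(t_i, w_i) / C(t_i)
def emission_prob(tag, word):
    return emit_counts[tag][word] / tag_counts[tag] if tag_counts[tag] else 0.0

print(transition_prob("DET", "NOUN"))  # 1.0 on this toy corpus
print(emission_prob("NOUN", "dog"))    # 0.5 on this toy corpus
```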
A first-order HMM within the context of POS tagging instantiates two assumptions: (1) the Markov assumption and (2) the output independence assumption. Under the Markov assumption, the probability of any given POS tag depends only on the preceding tag. Formally put:
$P(t_i \mid t_1, \dots, t_{i-1}) \approx P(t_i \mid t_{i-1})$ (3)
While the Markov assumption tells us that only the previous tag matters when predicting the current tag, the second assumption ensures that the probability of a word depends only on the POS tag that produced the word; formally:
$P(w_i \mid t_1, \dots, t_n, w_1, \dots, w_n) \approx P(w_i \mid t_i)$ (4)
The training of an HMM for POS tagging can follow one of two regimes, depending on the available resources: supervised or unsupervised. In supervised training, the HMM is trained on an annotated dataset where the words and their corresponding POS tags are known. This allows a direct estimation of the transition probabilities between tags and the emission probabilities of words given tags. Conversely, in unsupervised training, the HMM is trained on an unannotated dataset, relying only on the words. Since no POS tags are available in the dataset, the model treats them as hidden variables whose probabilities are estimated using the expectation-maximization algorithm. However, since unsupervised HMM training is less accurate, we use supervised HMM training in this study.
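As a sketch of the supervised setting described above, one possible off-the-shelf route is NLTK's HMM tagger; the toolkit, corpus slice, and smoothing choice below are illustrative assumptions, not necessarily the setup used in this study.

```python
import nltk
from nltk.corpus import treebank
from nltk.tag import hmm

# Illustrative data: the small Penn Treebank sample shipped with NLTK.
nltk.download("treebank", quiet=True)
tagged_sents = list(treebank.tagged_sents())
train_data, test_data = tagged_sents[:3000], tagged_sents[3000:]

# Supervised training: transition and emission probabilities are estimated
# directly from the annotated (word, tag) pairs; Lidstone smoothing avoids
# zero probabilities for events unseen in the training data.
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(
    train_data,
    estimator=lambda fd, bins: nltk.LidstoneProbDist(fd, 0.1, bins),
)

example = [word for word, _ in test_data[0]]
print(tagger.tag(example))

# Unsupervised training would instead use trainer.train_unsupervised(...) on
# unlabelled word sequences, estimating the hidden tags with expectation-maximization.
```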
Within the HMM, our goal is to find the most likely POS tag $t_i$ for any given word $w_i$; however, Eq. (2) answers a question that seems counter-intuitive to this goal, namely: given the tag $t_i$, how likely is it that the input word would be $w_i$? More generally, we want to find the most likely sequence of hidden variables (POS tags) $T = t_1, t_2, \dots, t_n$ corresponding to the sequence of observations (input words) $W = w_1, w_2, \dots, w_n$.
Formally:
$\hat{T} = \operatorname*{argmax}_{t_1 \dots t_n} P(t_1 \dots t_n \mid w_1 \dots w_n)$ (5)
By applying Bayes' rule and the two simplifying assumptions (3) and (4) to (5), we end up with the following expression:
$\hat{T} = \operatorname*{argmax}_{t_1 \dots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$ (6)
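For completeness, the intermediate steps behind Eq. (6) follow the standard derivation: apply Bayes' rule, drop the denominator $P(w_1 \dots w_n)$ (which is constant over all candidate tag sequences), and then apply assumptions (3) and (4):

```latex
\begin{align*}
\hat{T} &= \operatorname*{argmax}_{t_1 \dots t_n} P(t_1 \dots t_n \mid w_1 \dots w_n)
         = \operatorname*{argmax}_{t_1 \dots t_n}
           \frac{P(w_1 \dots w_n \mid t_1 \dots t_n)\, P(t_1 \dots t_n)}{P(w_1 \dots w_n)} \\
        &= \operatorname*{argmax}_{t_1 \dots t_n} P(w_1 \dots w_n \mid t_1 \dots t_n)\, P(t_1 \dots t_n)
         \approx \operatorname*{argmax}_{t_1 \dots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
\end{align*}
```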
Solving (6) and finding the value of $\hat{T}$ directly would require trying all possible tag paths of length $n$ over the tagset $T$. Therefore, we would need $|T|^n$ computations per sentence, a tremendous amount of calculation even for a simple corpus with a handful of tags. For example, if we have a tagset $T$ of 50 tags and a dataset of 1000 sentences averaging 20 words each, the upper limit of the number of computations for tagging this corpus would be on the order of $1000 \times 50^{20}$, which is fatally inefficient. The best solution for overcoming this inefficiency is the Viterbi algorithm (Viterbi 1967). The algorithm is guaranteed to find the optimal solution quickly, with a complexity of at most $O(|T|^2 \cdot n)$. For the previous example, the Viterbi algorithm would need no more than $1000 \times 50^2 \times 20 = 5 \times 10^7$ computations.
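A minimal Viterbi decoder sketch in Python, with hypothetical toy transition and emission tables standing in for a trained model; this illustrates the dynamic program, not the exact implementation used in this study.

```python
def viterbi(words, tags, trans_prob, emit_prob, start_tag="<s>"):
    """Return the most likely tag sequence for `words`.

    trans_prob[prev_tag][tag] and emit_prob[tag][word] are the HMM
    probabilities from Eqs. (1) and (2); missing entries count as 0.
    """
    n = len(words)
    # viterbi_scores[i][t]: probability of the best path ending in tag t at position i
    viterbi_scores = [dict() for _ in range(n)]
    backpointers = [dict() for _ in range(n)]

    for t in tags:  # initialization step
        viterbi_scores[0][t] = (trans_prob.get(start_tag, {}).get(t, 0.0)
                                * emit_prob.get(t, {}).get(words[0], 0.0))
        backpointers[0][t] = None

    for i in range(1, n):  # recursion step: O(n * |T|^2)
        for t in tags:
            best_prev, best_score = None, 0.0
            for prev in tags:
                score = (viterbi_scores[i - 1][prev]
                         * trans_prob.get(prev, {}).get(t, 0.0))
                if score > best_score:
                    best_prev, best_score = prev, score
            viterbi_scores[i][t] = best_score * emit_prob.get(t, {}).get(words[i], 0.0)
            backpointers[i][t] = best_prev

    # termination and backtrace
    last_tag = max(tags, key=lambda t: viterbi_scores[n - 1][t])
    path = [last_tag]
    for i in range(n - 1, 0, -1):
        path.append(backpointers[i][path[-1]])
    return list(reversed(path))


# Toy example with hypothetical probabilities.
A = {"<s>": {"DET": 1.0}, "DET": {"NOUN": 1.0}, "NOUN": {"VERB": 1.0}}
B = {"DET": {"the": 1.0}, "NOUN": {"dog": 1.0}, "VERB": {"barks": 1.0}}
print(viterbi(["the", "dog", "barks"], ["DET", "NOUN", "VERB"], A, B))
# -> ['DET', 'NOUN', 'VERB']
```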
Conditional Random Field model
One of the prevalent challenges in NLP in general, and in POS tagging in particular, is that of unknown words, or, more technically, out-of-vocabulary words. New words are frequently added to languages in the open-class word categories, and in the case of LRLs, the training data may not contain all words. This means that already-built models like the HMM cannot recognize those newly added or unseen words. In addition, generative models like the HMM cannot capture global knowledge (the entire sequence of words and their associated probabilities), because the transitions and emissions are purely local knowledge (the immediate context in which a word appears). We could mitigate this by using higher-order HMMs; however, this raises the computational cost of fast POS tagging.
Therefore, the conditional random field (CRF) (Lafferty et al. 2001) model is introduced to overcome these problems.
CRF is a discriminative model that deals more realistically with the inherent lack of complete datasets required to build a robust and wide-coverage statistical model. Given an input sequence of words $W = w_1, w_2, \dots, w_n$, the goal of a discriminative POS tagger is to find the most likely sequence of POS tags $T = t_1, t_2, \dots, t_n$.
We want to find the sequence $\hat{T}$ among all possible tag sequences $\mathcal{T}$:
$\hat{T} = \operatorname*{argmax}_{T \in \mathcal{T}} P(T \mid W)$ (7)
In addition, a CRF can be viewed as a sequence-level generalization of what multinomial logistic regression does for a single word.
The CRF uses a set of features to represent each word in the sentence, such as the word itself, its previous and next words, its position in the sentence, and other linguistic features. It then applies a weight to each feature to determine the probability of assigning a particular tag to the word. In addition, the CRF utilizes the notion of a feature function $F_k$, which maps an entire input sequence of words $W$ and an entire output sequence of POS tags $T$ to a feature vector. Feature functions allow us to encode arbitrary dependencies in the data; for example, we can have a function that outputs one every time it encounters a noun following an adverb and zero otherwise. Eq. (7) describes the linear-chain CRF, the most common CRF variant used for NLP tasks. The linear-chain CRF restricts each local feature to use only the current POS tag $t_i$, the previous one $t_{i-1}$, and the entire input sequence $W$ at a specific position $i$. This restriction makes the CRF more efficient for the task of POS tagging.
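To make the notion of feature functions concrete, here is a small Python sketch of the kind of indicator features a linear-chain CRF might use at position $i$; the specific features and their names are illustrative assumptions, not the exact feature set used in this study.

```python
def local_features(prev_tag, tag, words, i):
    """Local feature function f_k(t_{i-1}, t_i, W, i) for a linear-chain CRF.

    Returns a dict of named binary/real-valued features; each named feature
    corresponds to one f_k and receives its own learned weight w_k.
    """
    word = words[i]
    features = {
        "bias": 1.0,
        f"word={word.lower()}": 1.0,
        f"suffix3={word[-3:].lower()}": 1.0,
        "is_capitalized": float(word[0].isupper()),
        "is_digit": float(word.isdigit()),
        f"prev_tag={prev_tag}": 1.0,
        f"prev_tag={prev_tag}|tag={tag}": 1.0,
        # Example from the text: fires when a noun follows an adverb.
        "noun_after_adverb": float(prev_tag == "ADV" and tag == "NOUN"),
    }
    if i > 0:
        features[f"prev_word={words[i - 1].lower()}"] = 1.0
    if i < len(words) - 1:
        features[f"next_word={words[i + 1].lower()}"] = 1.0
    return features

print(local_features("DET", "NOUN", ["The", "dog", "barks"], 1))
```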
Instead of computing a probability for each tag at each time step, the CRF computes log-linear functions over a set of relevant features. A global probability for the whole tag sequence is then obtained by aggregating and normalizing these local features.
Formally, given $K$ global features $F_k$, each equipped with a weight $w_k$, we can define the conditional probability as follows:
$P(T \mid W) = \frac{1}{Z(W)} \exp\!\left( \sum_{k=1}^{K} w_k F_k(W, T) \right)$ (8)
$Z(W) = \sum_{T' \in \mathcal{T}} \exp\!\left( \sum_{k=1}^{K} w_k F_k(W, T') \right)$ (9)
$F_k(W, T) = \sum_{i=1}^{n} f_k(t_{i-1}, t_i, W, i)$ (10)
Eq. (10) demonstrates that each local feature $f_k$ at position $i$ depends only on information from the current POS tag $t_i$, the previous tag $t_{i-1}$, the entire input sequence $W$, and the position $i$ itself: $f_k(t_{i-1}, t_i, W, i)$.
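As an illustration of Eqs. (8)-(10), the following sketch computes the CRF distribution by brute force over all tag sequences for a tiny tagset. This is feasible only for toy inputs (real implementations use dynamic programming), and the features and hand-set weights are hypothetical.

```python
import math
from itertools import product

def local_features(prev_tag, tag, words, i):
    """Toy local features f_k(t_{i-1}, t_i, W, i)."""
    return {
        f"word={words[i].lower()}|tag={tag}": 1.0,
        f"prev_tag={prev_tag}|tag={tag}": 1.0,
    }

def global_score(words, tags_seq, weights):
    """sum_k w_k F_k(W, T), with F_k(W, T) = sum_i f_k(t_{i-1}, t_i, W, i)  (Eq. 10)."""
    score = 0.0
    prev = "<s>"
    for i, tag in enumerate(tags_seq):
        for name, value in local_features(prev, tag, words, i).items():
            score += weights.get(name, 0.0) * value
        prev = tag
    return score

def crf_probability(words, tags_seq, tagset, weights):
    """P(T | W) from Eqs. (8) and (9): exp(score) normalized by Z(W)."""
    numerator = math.exp(global_score(words, tags_seq, weights))
    z = sum(math.exp(global_score(words, candidate, weights))
            for candidate in product(tagset, repeat=len(words)))
    return numerator / z

# Hypothetical hand-set weights; in practice they are learned from data.
weights = {"word=the|tag=DET": 2.0, "word=dog|tag=NOUN": 2.0,
           "word=barks|tag=VERB": 2.0, "prev_tag=NOUN|tag=VERB": 1.0}
tagset = ["DET", "NOUN", "VERB"]
words = ["the", "dog", "barks"]
print(crf_probability(words, ("DET", "NOUN", "VERB"), tagset, weights))
```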