Decision Tree model
The task of POS tagging can be seen as a multi-class classification task in which a model is trained on annotated data so that it can classify each token in any given sequence of tokens in the test data. In this context, the most straightforward supervised machine learning technique that can be used is decision trees. In addition to HMMs and CRFs, decision trees have proved to be an effective approach to POS tagging due to their straightforward and transparent structure for modeling real-world problems. The first decision-tree-based POS tagger was introduced by Schmid (2023).
A decision tree (DT) is a non-parametric supervised learning method that embodies a graphical representation of the process of decision-making or mapping observations to conclusions. Its structure resembles that of a tree, with internal nodes signifying decisions or tests on specific attributes, branches indicating outcomes of these tests, and leaf nodes representing class labels (for classification tasks) or predicted values (for regression tasks). A DT shines particularly in intricate decision-making scenarios that encompass numerous attributes or features.
For the task of POS tagging, a DT is used as a classification tree because the target variable takes values from a discrete set, namely the finite set of POS tags of the language (Schmid 2023).
Since DTs are supervised learning methods, a set of features must be defined on which the DT bases its decision-making process. In this respect, two concepts are important when working with decision trees: entropy and information gain. Entropy is a measure of the impurity or disorder in the training data; it quantifies the uncertainty in the distribution of class labels, i.e., the POS tags.
More formally, let $p_i$ be the proportion of instances in a subset $S$ of the training data that belong to class $i$, where the classes range over the set of all available POS tags. The entropy of $S$ is then:

\[
H(S) = -\sum_{i=1}^{C} p_i \log_2 p_i \tag{1}
\]

where $C$ is the number of classes (POS tags).
On the other hand, information gain (IG) is a metric that measures the reduction in entropy achieved by partitioning the data based on a specific feature $F$. It tells us how much information about the class label is gained by performing a split. Information gain is the difference between the entropy of the original subset and the weighted sum of the entropies of the resulting subsets. Formally, IG for feature $F$ is calculated as follows:

\[
IG(S, F) = H(S) - \sum_{v \in \mathrm{Values}(F)} \frac{|S_v|}{|S|} \, H(S_v) \tag{2}
\]

Here, $S$ is the original subset of data, $\mathrm{Values}(F)$ is the set of all possible values of feature $F$, and $S_v$ is the subset of data in which feature $F$ takes the value $v$. Finally, $H(S)$ and $H(S_v)$ are the entropies of the subsets $S$ and $S_v$.
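As a concrete illustration, the following minimal Python sketch computes the entropy of (1) and the information gain of (2) for a toy split on a single word feature. The data, the feature name ends_in_ing, and the helper names are illustrative assumptions, not taken from the source.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy H(S) of a list of class labels (POS tags), as in Eq. (1)."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(samples, labels, feature):
    """Information gain IG(S, F) obtained by splitting on `feature`, as in Eq. (2).

    `samples` is a list of dicts mapping feature names to values;
    `labels` is the list of corresponding POS tags.
    """
    total = len(samples)
    # Group the labels by the value the feature takes in each sample.
    by_value = {}
    for sample, label in zip(samples, labels):
        by_value.setdefault(sample[feature], []).append(label)
    weighted = sum(len(subset) / total * entropy(subset) for subset in by_value.values())
    return entropy(labels) - weighted

# Toy illustration: does a hypothetical "ends_in_ing" feature separate VERB from NOUN?
samples = [{"ends_in_ing": True}, {"ends_in_ing": True},
           {"ends_in_ing": False}, {"ends_in_ing": False}]
labels = ["VERB", "VERB", "NOUN", "NOUN"]
print(information_gain(samples, labels, "ends_in_ing"))  # 1.0 bit for this perfect toy split
```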
Despite being easy to use and very effective for classification and regression problems, DTs can suffer from well-known problems such as overfitting, bias, instability, and high variance. Random Forests and ExtraTrees (extremely randomized trees) were therefore introduced to overcome these problems.
ExtraTrees is a variation of Random Forests whose goal is to improve the latter by introducing more randomness during the construction of the DTs. ExtraTrees are characterized by random feature selection: at each split node, a random subset of features is considered for finding the best split rather than all features. In addition, random thresholds are drawn for each candidate split instead of searching for the best threshold. This increases diversity and the model’s robustness, typically leading to greater computational efficiency and less overfitting.
Random Forests are an ensemble method that combines multiple decision trees to improve the model’s accuracy and reduce overfitting. Within Random Forests, we create a collection, or forest, of DTs, each trained on a different subset of the data and making independent predictions. The final prediction of the model is then determined by aggregating the predictions of all individual trees, using majority voting in the case of classification.
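To make the classification view of POS tagging concrete, the sketch below trains a single decision tree, a Random Forest, and an ExtraTrees model on simple per-token features using scikit-learn. The feature set, the toy tagged sentences, and the helper token_features are illustrative assumptions, not the setup used in this work.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.pipeline import make_pipeline

def token_features(sentence, i):
    """A deliberately small, illustrative per-token feature set."""
    word = sentence[i]
    return {
        "word": word.lower(),
        "suffix3": word[-3:].lower(),
        "is_capitalized": word[0].isupper(),
        "prev_word": sentence[i - 1].lower() if i > 0 else "<s>",
        "next_word": sentence[i + 1].lower() if i < len(sentence) - 1 else "</s>",
    }

# Toy annotated data; in practice these come from a tagged corpus.
tagged_sentences = [
    [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("A", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

X, y = [], []
for sent in tagged_sentences:
    words = [w for w, _ in sent]
    for i, (_, tag) in enumerate(sent):
        X.append(token_features(words, i))
        y.append(tag)

for clf in (DecisionTreeClassifier(),
            RandomForestClassifier(n_estimators=100),
            ExtraTreesClassifier(n_estimators=100)):
    # DictVectorizer one-hot encodes the string-valued features.
    model = make_pipeline(DictVectorizer(), clf)
    model.fit(X, y)
    print(type(clf).__name__,
          model.predict([token_features(["The", "bird", "sings"], 1)]))
```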
N-gram model
Compared to the previously mentioned algorithms for POS tagging, N-gram models offer computational efficiency: they primarily rely on counting word and tag occurrences in the training data, which makes them faster in both training and prediction than other methods. Additionally, due to their probabilistic nature, they can handle previously unseen words by falling back on lower-order N-grams, which offers a form of built-in generalization.
An N-gram model is a probabilistic language model used to capture statistical relationships between N consecutive words in a text. The relationship is addressed by modeling the likelihood of a particular word or sequence of words occurring, given the preceding words in a sentence or a whole text. The N-gram model is based on the assumption that the probability of a word depends only on the few immediately preceding words in the sequence, making the probability distribution of the current word conditionally independent of words further back in the sequence. This is called the Markov assumption, i.e., $P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-N+1} \ldots w_{n-1})$ (Jurafsky et al. 2023).
The N in the N-gram model represents the number of consecutive words to be considered when training the model. Unigram (1-gram), bigram (2-gram), and trigram (3-gram) are the most common N-gram models in the field of NLP. The Unigram model is a straightforward approach that assigns tags to tokens solely based on their most frequent POS tags seen during the training phase, independent of their neighbors in the sequence. While this often results in reduced tagging accuracy, it compensates with computational efficiency and simplicity of implementation. These characteristics make it an ideal baseline method for POS tagging.
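As a baseline illustration of the unigram approach, the following sketch memorizes the most frequent tag for each word seen in training and falls back on the corpus-wide most frequent tag for unknown words. The toy data and function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """Most-frequent-tag baseline: remember the commonest tag for each word."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            per_word[word.lower()][tag] += 1
            all_tags[tag] += 1
    fallback = all_tags.most_common(1)[0][0]  # overall most frequent tag, for unknown words
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lexicon, fallback

def tag(sentence, lexicon, fallback):
    """Tag each token independently of its neighbours."""
    return [(w, lexicon.get(w.lower(), fallback)) for w in sentence]

lexicon, fallback = train_unigram_tagger([
    [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
])
# "sings" is unseen, so it receives the corpus-wide most frequent tag.
print(tag(["The", "dog", "sings"], lexicon, fallback))
```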
For example, in a trigram model, the assumption is that the probability of the POS tag of a given word depends only on the two preceding words and their corresponding POS tags.
The training of an N-gram model entails estimating the probabilities of words or sequences of words based on the frequencies of their occurrences in a corpus. It is worth noting that within N-gram models, we do not condition on the full history of each word; we approximate the history by using only the preceding $N-1$ words. The most straightforward way of estimating the probabilities is Maximum Likelihood Estimation (MLE). The MLE is obtained by taking the count of the N-gram $w_{n-N+1} \ldots w_{n-1} w_n$ and normalizing it by the count of its context $w_{n-N+1} \ldots w_{n-1}$, which is necessary to keep the value of the MLE between 0 and 1.
More formally:

\[
P(w_n \mid w_{n-N+1} \ldots w_{n-1}) = \frac{C(w_{n-N+1} \ldots w_{n-1} \, w_n)}{C(w_{n-N+1} \ldots w_{n-1})} \tag{3}
\]
The numerator and the denominator in (3) represent observed frequencies, and the ratio is called the relative frequency (Jurafsky et al. 2023).
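The sketch below shows relative-frequency (MLE) estimation as in (3) for the bigram case ($N = 2$); the toy corpus and the sentence markers <s> and </s> are illustrative assumptions.

```python
from collections import Counter

def mle_bigram_probs(corpus_sentences):
    """Relative-frequency (MLE) estimate of P(w_n | w_{n-1}), i.e. Eq. (3) with N = 2."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sent in corpus_sentences:
        tokens = ["<s>"] + [w.lower() for w in sent] + ["</s>"]
        unigram_counts.update(tokens[:-1])             # counts of the conditioning word w_{n-1}
        bigram_counts.update(zip(tokens, tokens[1:]))  # counts of the pair (w_{n-1}, w_n)

    def prob(prev, word):
        if unigram_counts[prev] == 0:
            return 0.0  # context never observed at all
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    return prob

prob = mle_bigram_probs([["the", "dog", "barks"], ["the", "cat", "sleeps"]])
print(prob("the", "dog"))   # 1 / 2 = 0.5
print(prob("the", "bird"))  # 0.0 -- an unseen bigram, which motivates smoothing
```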
However, when the training data is small, there is a chance that a certain N-gram will not be found in the training data, resulting in a zero value in the numerator of (3) and hence a zero probability. Laplace smoothing is therefore introduced to avoid such cases and to provide more robust and reasonable probability estimates for unseen N-grams. Laplace smoothing amounts to assuming that each N-gram in the corpus occurs exactly one more time than it actually does. It involves adding the constant 1 to the numerator of (3) and $V$, the total number of unique words in the corpus (the vocabulary size), to its denominator.
Formally:

\[
P_{\mathrm{Laplace}}(w_n \mid w_{n-N+1} \ldots w_{n-1}) = \frac{C(w_{n-N+1} \ldots w_{n-1} \, w_n) + 1}{C(w_{n-N+1} \ldots w_{n-1}) + V} \tag{4}
\]
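A minimal sketch of the add-one adjustment in (4) for the bigram case, reusing the toy counts from the MLE sketch above; the counts, V, and the function name are illustrative assumptions.

```python
from collections import Counter

def laplace_bigram_prob(bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) smoothed P(w_n | w_{n-1}), i.e. Eq. (4) with N = 2."""
    def prob(prev, word):
        return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)
    return prob

# Counts taken from the same toy corpus as in the MLE sketch above.
unigrams = Counter({"<s>": 2, "the": 2, "dog": 1, "cat": 1, "barks": 1, "sleeps": 1})
bigrams = Counter({("<s>", "the"): 2, ("the", "dog"): 1, ("the", "cat"): 1,
                   ("dog", "barks"): 1, ("cat", "sleeps"): 1,
                   ("barks", "</s>"): 1, ("sleeps", "</s>"): 1})
V = 7  # unique word types, counting the sentence markers

prob = laplace_bigram_prob(bigrams, unigrams, V)
print(prob("the", "dog"))   # (1 + 1) / (2 + 7) ~= 0.22
print(prob("the", "bird"))  # (0 + 1) / (2 + 7) ~= 0.11 -- no longer zero
```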
Despite being very effective at preventing zero probabilities, Laplace smoothing introduces a certain amount of artificial probability mass, which leads to an inaccurate representation of the underlying language distribution. Therefore, other methods are used to mitigate these downsides, for example Katz backoff, Good-Turing smoothing, or Good-Turing backoff (Jurafsky et al. 2023).