Regardless of the tokenization and POS tagging methods employed, their performance can be evaluated statistically using predefined evaluation metrics. Evaluation in the context of POS tagging means measuring the performance of a POS tagger by comparing its output to a ground truth, or gold standard. Both the output and the gold standard are lists of tokens paired with POS tags: the tags in the output are predicted by the tagger, whereas the tags in the gold standard are assigned either by a POS tagger or by a human annotator.
In the following two subsections, we explain the evaluation of tokenization and POS tagging and what evaluation metrics we use in this study.
Tokenization evaluation metrics
In NLP pipelines, errors in the tokenization stage have a great impact on POS tagging, which relies on accurate token boundaries to classify each token into its corresponding grammatical category (POS tag). The integrity of tokenization is therefore critical for optimal performance throughout our NLP pipeline. Evaluating the output of the tokenization methods helps us understand the mistakes the tokenizers make and enables us to reduce error propagation by detecting errors in the very first stages of the pipeline.
The success of any automated tokenization method is measured by the number of tokens it produces in comparison with the number of expected tokens in the ground truth. Thus, we distinguish three performance states of any tokenization method. Consider a tokenization method $M$ that takes an input text $T$; let $O$ be the list of output tokens obtained by applying $M$ to $T$, and let $E$ be the predefined number of expected tokens. The tokenization will be considered perfect if $|O| = E$, under-tokenized (token omission) if $|O| < E$, or over-tokenized (token addition) if $|O| > E$. More formally:

$$\text{state}(M, T) = \begin{cases} \text{perfect} & \text{if } |O| = E \\ \text{under-tokenized} & \text{if } |O| < E \\ \text{over-tokenized} & \text{if } |O| > E \end{cases} \tag{1}$$
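To make the distinction concrete, the following is a minimal sketch (our own illustration, not part of the evaluated systems) that classifies a tokenizer's output by comparing its token count against the expected count; the function name and the toy token lists are hypothetical:

```python
# Minimal sketch: classify a tokenizer's output for one input text as
# perfect, under-tokenized, or over-tokenized by comparing token counts.

def tokenization_state(output_tokens: list[str], expected_count: int) -> str:
    """Return 'perfect', 'under-tokenized', or 'over-tokenized'."""
    if len(output_tokens) == expected_count:
        return "perfect"
    if len(output_tokens) < expected_count:
        return "under-tokenized"   # token omission
    return "over-tokenized"        # token addition

# Gold standard has 5 tokens: ["the", "tagger", "'s", "output", "."]
gold_count = 5
print(tokenization_state(["the", "tagger's", "output", "."], gold_count))       # under-tokenized
print(tokenization_state(["the", "tagger", "'s", "output", "."], gold_count))   # perfect
```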
Moreover, for evaluation purposes, we distinguish between two types of tokenization evaluation:
- intrinsic evaluation and
- extrinsic evaluation.
Within the intrinsic evaluation, we want to evaluate the quality of the tokenization system in isolation from the later stages, POS tagging in our case. The intrinsic evaluation directly measures the tokenization system’s capabilities by comparing it to similar systems. However, there are no evaluation metrics designed specifically for tokenization. Nonetheless, treating tokenization as a machine translation or natural language generation task allows us to evaluate tokenization performance using the overlap of n-grams between the generated tokens and the gold-standard tokens. We follow the same approach as Ahmadi (2020) and perform tokenization evaluation using the Bilingual Evaluation Understudy Score (BLEU).
BLEU (Papineni et al. 2002) is an n-gram precision-based measure of the quality of generated texts. It is computed by multiplying the geometric mean of the test corpus’s n-gram precision scores by an exponential brevity penalty factor, $BP$. In the context of machine translation, the brevity penalty is used to penalize the system for generating short yet accurate translations. Given a gold-standard tokenized text (reference) of length $r$ and a tokenization system output (candidate) of length $c$, the brevity penalty and the BLEU score are calculated as follows:
$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,(1 - r/c)} & \text{if } c \leq r \end{cases} \tag{2}$$

$$\text{BLEU} = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right) \tag{3}$$
We use $N = 4$ because we want to calculate the metric for up to 4-grams. $w_n$ represents the weight for each n-gram order; since we are using $N = 4$, the weights are uniform, $w_n = 1/4$. $p_n$ represents the modified precision for each n-gram order. BLEU is an objective measure of system performance; it quantifies the overlap between the token n-grams generated by the tokenizer and those present in the gold standard. However, while fast and simple to use, BLEU does not consider sentence meaning or structure. In addition, it should not be the only evaluation metric used, as reported by Reiter (2018). Therefore, we verify our obtained BLEU score by performing an extrinsic evaluation.
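As an illustration of this intrinsic setup, the sketch below computes corpus-level 4-gram BLEU over tokenizer output with NLTK's corpus_bleu; the toy sentences and the choice of smoothing are our own assumptions rather than part of the reported experiments:

```python
# Minimal sketch: intrinsic tokenization evaluation with 4-gram BLEU (Eq. 3).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Gold-standard tokenization (one reference per sentence).
gold = [
    ["the", "tagger", "'s", "output", "was", "evaluated", "."],
    ["tokenization", "errors", "propagate", "downstream", "."],
]
# Tokens produced by the tokenizer under evaluation.
system = [
    ["the", "tagger's", "output", "was", "evaluated", "."],   # under-tokenized
    ["tokenization", "errors", "propagate", "downstream", "."],
]

# corpus_bleu expects a list of reference lists per hypothesis.
references = [[ref] for ref in gold]

# N = 4 with uniform weights w_n = 1/4; smoothing avoids zero scores when a
# higher-order n-gram has no overlap in short sentences.
score = corpus_bleu(
    references,
    system,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"4-gram BLEU: {score:.4f}")
```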
Within the extrinsic evaluation, we evaluate the tokenization system by measuring its impact on our whole NLP pipeline. In our case, the quality of the tokenization system greatly affects the POS tagger’s performance. Therefore, the correctness of the tokenization can also be assessed by examining the F1 and accuracy scores of the POS tagger.
POS tagging evaluation metrics
Several evaluation metrics exist for the POS tagging task, such as accuracy, precision, recall, and F1 score.
Accuracy is the percentage of correctly tagged words in the sample. Precision is the proportion of correctly tagged words out of all words that the tagger labeled with a specific POS tag. Recall is the proportion of correctly tagged words out of all words that actually carry that POS tag.
The F1 score is the harmonic mean of precision and recall. It reflects the system’s robustness by capturing the trade-off between high precision with low recall and low precision with high recall.
The precision, recall, accuracy, and F1 score can be computed at a document, sentence, or tag level.
A text’s true, contextual interpretation and understanding can be measured more effectively at a higher level than at a fine-grained level (Jatav et al. 2017; Chiche et al. 2022).
Formally, the metrics can be defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$
TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative samples, respectively.
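For illustration, the following sketch instantiates Eqs. (4)-(7) for a single POS tag from its confusion counts; the function name per_class_metrics and the counts are hypothetical:

```python
# Minimal sketch of Eqs. (4)-(7) for one POS tag (class), given its
# TP, TN, FP, and FN counts. The counts below are made-up illustrative numbers.

def per_class_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0          # Eq. (4)
    recall = tp / (tp + fn) if (tp + fn) else 0.0              # Eq. (5)
    f1 = (2 * precision * recall / (precision + recall)        # Eq. (6)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)                 # Eq. (7)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Example: evaluating the ADJ tag on a 100-token sample.
print(per_class_metrics(tp=12, tn=80, fp=5, fn=3))
```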
In multi-class classification scenarios where the training data exhibits class imbalance, it is highly recommended to use macro-averaged metrics rather than micro-averaged ones. In this way, we mitigate the effect of some classes being over-represented while others are under-represented. Macro-averaging entails calculating the metrics for every class independently and then averaging the results across all classes. For example, in any given textual dataset, the percentage of word tokens with the NOUN tag will be very high, while that of other tags such as ADJ might be very low. Therefore, we report the macro-averaged metrics because we want to make sure that each POS tag (class) is treated with equal importance.
In this work, we report only the macro-averaged F1 score and accuracy. Taking Eqs. (6) and (7), the macro-averaged metrics over $N$ classes are defined as follows:
$$\text{Macro-}F1 = \frac{1}{N} \sum_{i=1}^{N} F1_i \tag{8}$$

$$\text{Macro-Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \text{Accuracy}_i \tag{9}$$

where $F1_i$ and $\text{Accuracy}_i$ are the values of Eqs. (6) and (7) computed for class $i$.
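As a concrete example, macro-averaged F1 can be obtained with scikit-learn's f1_score using average="macro"; the tag sequences below are toy data, not results from our experiments, and accuracy_score here is the standard overall token-level accuracy rather than the per-class average of Eq. (9):

```python
# Minimal sketch: macro-averaged F1 and overall accuracy with scikit-learn.
# In practice y_true comes from the gold standard and y_pred from the
# tagger's output, token-aligned.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "PUNCT", "NOUN", "VERB"]
y_pred = ["NOUN", "VERB", "NOUN", "NOUN", "NOUN", "PUNCT", "ADJ", "VERB"]

# average="macro" computes F1 per tag and averages the results, giving every
# class equal weight regardless of its frequency (Eq. 8).
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Overall token-level accuracy; per-class accuracies (Eq. 9) can be derived
# from the per-class confusion counts if needed.
accuracy = accuracy_score(y_true, y_pred)
print(f"macro F1: {macro_f1:.3f}, accuracy: {accuracy:.3f}")
```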
In addition to the aforementioned metrics, the confusion matrix can offer a comprehensive view of a model’s performance across all classes. It helps reveal correct and incorrect classifications and where exactly they occur. The confusion matrix provides a deep insight into the tagger’s performance by revealing how well the tagger handles under-represented classes in the dataset.
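A minimal sketch of building such a per-tag confusion matrix with scikit-learn, again on the same kind of toy tag sequences as above:

```python
# Minimal sketch: per-tag confusion matrix with scikit-learn.
# Rows correspond to gold tags, columns to predicted tags.
from sklearn.metrics import confusion_matrix

labels = ["NOUN", "VERB", "ADJ", "PUNCT"]
y_true = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "PUNCT", "NOUN", "VERB"]
y_pred = ["NOUN", "VERB", "NOUN", "NOUN", "NOUN", "PUNCT", "ADJ", "VERB"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
for tag, row in zip(labels, cm):
    # Off-diagonal cells show which tags the tagger confuses, e.g. ADJ vs NOUN.
    print(f"{tag:>6}: {row}")
```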