Just as part-of-speech tagging serves as a precursor to tasks like syntactic parsing, tokenization is a crucial task in NLP and a prerequisite for POS tagging. Tokenization is the process of segmenting the input text into smaller, distinct units termed tokens. These tokens can be compound words, single words, subwords, symbols, or other significant elements. At its most basic, tokenization separates tokens using whitespace as a delimiter (Mitkov 2022).
In NLP, tokenization plays an instrumental role in determining the quality and quantity of data to be processed in subsequent tasks. The outcomes produced by various tokenization methods can exhibit significant disparities. Such variations arise not only due to intrinsic properties and design principles of the tokenization method in question but also because of the linguistic structures and idiosyncrasies inherent to different world languages. The implications of these disparities are particularly evident in tasks further down the NLP pipeline. For instance, differences in tokenization outputs can directly influence the data available for POS tagging and the performance of the task as well (Jurafsky et al. 2023).
Historically, tokenization followed a relatively straightforward approach, particularly in segmented languages like English and other European languages, where whitespace serves as a dependable indicator of word boundaries. Early computational models relied on basic rules, using spaces, punctuation marks, and other typographical symbols to delimit tokens. Complexities emerge, however, when dealing with languages that lack clear word boundary indicators, like Chinese, or languages with rich morphology, such as Kurdish, Turkish, or Finnish. The latter languages require more advanced techniques to tokenize text effectively while preserving the underlying syntactic relationships among words (Jurafsky et al. 2023; Mitkov 2022).
The example below shows the output of the widely used NLTK word_tokenize tokenizer. We see how the clitic ’m, the genitive marker ’s (along with their apostrophes), and the period (.) are treated as separate tokens even though no whitespace separates them from the words they attach to.
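A minimal reconstruction of such an example, with an illustrative English sentence (the Punkt models need to be downloaded once before word_tokenize can be used):

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # one-time download; newer NLTK versions may also need "punkt_tab"

# Illustrative sentence containing a clitic, a genitive marker, and a final period
sentence = "I'm reading John's book."
print(word_tokenize(sentence))
# ['I', "'m", 'reading', 'John', "'s", 'book', '.']
```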
The core purpose of tokenization extends beyond mere text segmentation using delimiters such as whitespace. Its fundamental role is to convert unstructured text into a format that machines can use and interpret. Its importance becomes apparent in subsequent NLP tasks like part-of-speech tagging, parsing, or named entity recognition, since tokenization accuracy significantly impacts the performance of those tasks (Jurafsky et al. 2023).
The recent surge of machine learning, and particularly the dominance of deep learning in NLP, has profoundly transformed the field of tokenization. Traditional rule-based tokenization approaches have faced competition from data-driven methods, harnessing extensive datasets to infer token boundaries. A notable shift in this paradigm has been the emergence of subword tokenization algorithms such as WordPiece (Schuster et al. 2012), Byte-Pair Encoding (BPE) (Sennrich et al. 2016), Unigram language model (Kudo 2018), and SentencePiece (Kudo et al. 2018).
Subword tokenization algorithms aim to tokenize text at a level between individual characters and complete words, resulting in a more consistent representation of morphemes or semantically meaningful units. They have also played a vital role in the success of state-of-the-art large language models (LLMs). For instance, Google’s BERT model (Devlin et al. 2019) uses WordPiece tokenization, XLM-RoBERTa (Conneau et al. 2020) uses SentencePiece, and OpenAI’s GPT-2 (Radford et al. 2019) uses a byte-level variant of BPE.
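As a quick illustration of these differences, the sketch below compares the two schemes, assuming the Hugging Face transformers library and the publicly hosted bert-base-cased and gpt2 checkpoints are available:

```python
from transformers import AutoTokenizer

text = "Tokenization of morphologically rich languages"

# BERT's WordPiece marks word-internal continuation pieces with the '##' prefix
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(bert_tokenizer.tokenize(text))

# GPT-2's byte-level BPE marks word-initial whitespace with the 'Ġ' symbol
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(gpt2_tokenizer.tokenize(text))
```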
Tokenization methods
In this section, I list only the tokenization methods relevant to my project on POS tagging for Northern Kurdish; I will therefore not cover all available tokenization methods and techniques.
Manual tokenization
Manual tokenization, as the name suggests, is the process of tokenizing a given text by hand. It is usually carried out together with the task of manually annotating tokens with their corresponding POS tags. Despite being very time-consuming, it is considered to produce the best results because it is done by humans with solid linguistic knowledge of the language. The manually tokenized text can therefore be treated as the ground truth, or gold tokens, against which automatic tokenization methods are evaluated.
Kurdish Language Processing Toolkit tokenizer
The Kurdish Language Processing Toolkit (KLPT) is a Python-based NLP solution tailored to the Kurdish language (Ahmadi 2020). It provides modules for several NLP tasks such as preprocessing, stemming, and transliteration. A pivotal component of KLPT is its tokenization module, referred to henceforth as the KLPT tokenizer. This module, rooted in the research of Ahmadi (2020), is distinguished by a design that covers nearly all the linguistic features of the Kurdish language that I will be explaining and showcasing in future posts.
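A minimal usage sketch, based on my reading of the KLPT documentation: the tokenizer class is instantiated with a dialect and a script, although the exact interface should be verified against the current KLPT release.

```python
from klpt.tokenize import Tokenize  # pip install klpt

# Northern Kurdish (Kurmanji) written in the Latin script
tokenizer = Tokenize("Kurmanji", "Latin")

# A simple illustrative sentence: "Ez diçim malê." ("I am going home.")
print(tokenizer.word_tokenize("Ez diçim malê."))
```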
Natural Language Toolkit tokenizer
The Natural Language Toolkit (NLTK) is a comprehensive Python-based NLP library. Initially released in 2001, NLTK offers a suite of tools for linguistic data manipulation, including functionalities for classification, tokenization, stemming, tagging, parsing, and semantic reasoning (Bird et al. 2009). Within NLTK, the word_tokenize function has become, over the years, the de facto tokenization method for high-resource languages (HRLs) like English. It first splits the input into sentences using the Punkt models (Kiss et al. 2006), which are trained with unsupervised machine learning and are available for many languages, and then applies a Treebank-style word tokenizer to each sentence.
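A small sketch of this two-step behaviour (the example sentence is illustrative; the pretrained English Punkt model is expected to treat "Dr." as an abbreviation rather than a sentence boundary):

```python
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

nltk.download("punkt", quiet=True)

text = "Dr. Smith went home. He was tired."

sentences = sent_tokenize(text)  # Punkt: unsupervised sentence boundary detection
print(sentences)

# A Treebank-style tokenizer then splits each sentence into word tokens
print([TreebankWordTokenizer().tokenize(s) for s in sentences])
```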
Unigram tokenizer
In contrast to the previous tokenization methods, the Unigram model is a data-driven subword segmentation technique introduced in the context of neural machine translation (Kudo 2018).
The Unigram tokenization model works by iteratively maximizing the likelihood of the training data under a segmentation into subwords. The model starts with a large vocabulary of candidate subwords and, during training, progressively removes the least probable ones. The result is a flexible and adaptable subword segmentation that can effectively handle morphologically rich languages and out-of-vocabulary terms.
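A small training sketch using the Hugging Face tokenizers library, one common implementation of the Unigram model; the toy corpus and vocabulary size are arbitrary illustrative choices:

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Whitespace

# Tiny illustrative corpus; a real setup would use a large raw-text corpus
corpus = [
    "tokenization converts raw text into tokens",
    "subword tokenization handles rare and unseen words",
    "unigram models prune unlikely subwords during training",
]

tokenizer = Tokenizer(Unigram())
tokenizer.pre_tokenizer = Whitespace()

trainer = UnigramTrainer(vocab_size=100, unk_token="<unk>", special_tokens=["<unk>"])
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("tokenizing unseen words").tokens)
```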
Byte-Pair Encoding tokenizer
Byte-Pair Encoding (BPE) is a subword tokenization method originally introduced for data compression. Sennrich et al. (2016) adapted it to neural machine translation to address the issue of out-of-vocabulary words. BPE iteratively merges the most frequent pairs of symbols in the training data, resulting in a fixed-size vocabulary of variable-length symbols that can represent any word in the dataset. This technique has been especially useful for languages with rich morphology and for tasks where a large word-level vocabulary is impractical.
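To make the merge procedure concrete, here is a minimal, self-contained sketch of the iterative pair-merging step on a toy vocabulary, closely following the algorithm described by Sennrich et al. (2016):

```python
import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: each word is a space-separated symbol sequence with an end-of-word marker
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(5):
    pair_counts = get_pair_counts(vocab)
    best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged", best, "->", "".join(best))
```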
WordPiece tokenizer
Similar to the two previous tokenizers, the WordPiece tokenization model is a subword segmentation method originally developed to address challenges in machine translation (Schuster et al. 2012). Starting from individual characters, the model iteratively merges pairs of units, prioritizing at each step the merge that most increases the likelihood of the training data under a language model (rather than the raw pair frequency used by BPE). The process continues until a predefined vocabulary size is reached. This method allows for efficient handling of out-of-vocabulary words, is particularly adept at managing morphologically complex languages, and provides the foundation for several subsequent state-of-the-art models in NLP.
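A comparable training sketch, again with the Hugging Face tokenizers library and an arbitrary toy corpus; continuation pieces in the learned vocabulary carry the '##' prefix familiar from BERT:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = [
    "tokenization converts raw text into tokens",
    "wordpiece builds subword units from characters upward",
    "rare words are split into smaller known pieces",
]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

# Exact splits depend on the training corpus; unseen words are broken into known pieces or mapped to [UNK]
print(tokenizer.encode("tokenized subwords").tokens)
```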