In an earlier post, I promised to start a series on POS tagging based on my master’s thesis. This is the first post in that series.
You may wonder why it took me so long to post, since it is just a matter of copy-pasting from LaTeX xD. Well, it is not that easy, especially if you want to do it correctly and preserve the citations along with a neat bibliography. To make that possible, I created a WordPress plugin that moves all the citations into WordPress with minimal effort.
I have submitted the plugin to the WordPress Plugins Directory. Once it is accepted, I will write a post about it as well, and everybody will be able to use it for free 🙂
A part-of-speech (POS) tag is a word class (also known as a lexical tag or morphological class) that provides significant information about a word and its neighboring words in a sentence. POS tags capture morpho-syntactic behavior, but they are only discrete approximations of how words behave in sentences. Still, they are useful cues for a sentence’s structure and meaning. POS tags fall into two main categories: closed class and open class. In English, prepositions and other function words such as “of”, “it”, “and”, and “or” belong to the closed class because new words of this kind are rarely added to a language. Open-class POS tags, on the other hand, cover nouns, verbs, adjectives, and adverbs. These classes are more dynamic: as a language evolves, new words can be added to them (Jurafsky et al. 2023).
The collection of all possible POS tags used in a corpus for a specific language is called a POS tagset. Tagsets can therefore differ depending on the language they belong to. Examples are the Penn Treebank tagset (Marcus et al. 1993), the CLAWS C7 tagset (Rayson et al. 1998), and the Universal Dependencies (UD) tagset (De Marneffe et al. 2021). Over the years, many attempts have been made to establish a consistent, universal, and cross-lingual treebank/tagset annotation for many languages. Petrov et al. (2012) introduced the first universal POS tagset (UPOS), consisting of twelve common POS tags for 22 languages. Building on that work, De Marneffe et al. (2021) introduced the UD for English, in which they combined multiple English treebanks with extended POS tags (XPOS). The UD for English consists of seventeen UPOS tags, shown in the table below.
Class | POS tag | Description
---|---|---
Open class | noun | Words to describe persons, places, things, etc.
Open class | propn | Proper noun: name of a person, organization, place, etc.
Open class | verb | Words to describe actions and processes
Open class | adj | Adjective: noun modifiers describing properties
Open class | adv | Adverb: verb modifiers of time, place, manner
Open class | intj | Interjection: exclamation, greeting, yes/no response, etc.
Closed class | det | Determiner: marks noun phrase properties
Closed class | pron | Pronoun: a shorthand for referring to an entity or event
Closed class | num | Numeral
Closed class | cconj | Coordinating conjunction: joins two phrases/clauses
Closed class | sconj | Subordinating conjunction: joins a main clause with a subordinate clause such as a sentential complement
Closed class | aux | Auxiliary: helping verb marking tense, aspect, mood, etc.
Closed class | adp | Adposition (preposition/postposition): marks a noun’s spatial, temporal, or other relation
Closed class | part | Particle: a function word that must be associated with another word
Other | punct | Punctuation
Other | sym | Symbols like $ or emoji
Other | x | Other
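To make the grouping concrete, here is a minimal Python sketch of the seventeen tags from the table above as a plain dictionary. The names `UPOS_CLASSES`, `ALL_TAGS`, and `tag_class` are my own illustrative choices and not part of any library.

```python
# The seventeen UPOS tags from the table, grouped by class.
# The names UPOS_CLASSES, ALL_TAGS and tag_class are illustrative only.
UPOS_CLASSES = {
    "open": ["noun", "propn", "verb", "adj", "adv", "intj"],
    "closed": ["det", "pron", "num", "cconj", "sconj", "aux", "adp", "part"],
    "other": ["punct", "sym", "x"],
}

# Flat list of every tag, in table order.
ALL_TAGS = [tag for tags in UPOS_CLASSES.values() for tag in tags]


def tag_class(tag: str) -> str:
    """Return 'open', 'closed', or 'other' for a UPOS tag (case-insensitive)."""
    needle = tag.lower()
    for cls, tags in UPOS_CLASSES.items():
        if needle in tags:
            return cls
    raise ValueError(f"unknown UPOS tag: {tag!r}")


if __name__ == "__main__":
    print(len(ALL_TAGS), "tags in total")
    print("NOUN is", tag_class("NOUN"), "class")
    print("adp is", tag_class("adp"), "class")
```

The lookup lowercases its input because UD treebanks usually write the tags in upper case, while the table here uses lower case.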
POS tagging is the process of assigning a POS tag to each word/token in a given text in any language. It is a disambiguation task because words are naturally ambiguous and can have more than one correct tag depending on the context and their position in the sentence. The task is linguistically very important because it helps identify the grammatical structure of a sentence and the relationships between its constituent parts. This information can be used to analyze the language and gain insights into its structure. It also supports studying the evolution of a language, because it helps identify changes over time, such as shifts in word-class distribution and word usage.
Automatic POS tagging (hereafter referred to as POS tagging) is a natural language processing (NLP) task performed by a computer system called a POS tagger, which is trained (with or without supervision) to automatically label every token in a given text with the corresponding POS tag. POS tagging serves many purposes in NLP applications, and it is traditionally considered a building block for other tasks such as named entity recognition (NER), information extraction, spelling correction, text classification, natural language generation (NLG), and machine translation (MT). Nowadays, however, end-to-end neural NLP schemes, unlike traditional pipeline architectures, often perform those tasks with a single model that does not depend on POS tagging (Gimpel et al. 2010; Schmid 2022; Kanakaraddi et al. 2022; Mitkov 2022; Jurafsky et al. 2023).
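As a rough illustration of why tagging is a disambiguation task, the sketch below looks up the candidate tags of each word in a tiny, made-up lexicon and counts how many tag sequences a tagger would have to choose from. The lexicon, the example sentence, and the function names are assumptions for illustration only.

```python
from math import prod

# Hypothetical toy lexicon: each word maps to the UPOS tags it can take.
TOY_LEXICON = {
    "i": {"pron"},
    "can": {"aux", "noun", "verb"},
    "book": {"noun", "verb"},
    "a": {"det"},
    "flight": {"noun"},
}


def candidate_tags(tokens):
    """Pair each token with its sorted candidate tags; unknown words get 'x'."""
    return [(tok, sorted(TOY_LEXICON.get(tok.lower(), {"x"}))) for tok in tokens]


def count_readings(tokens):
    """Number of distinct tag sequences possible for this sentence."""
    return prod(len(tags) for _, tags in candidate_tags(tokens))


if __name__ == "__main__":
    sentence = "I can book a flight".split()
    for token, tags in candidate_tags(sentence):
        print(f"{token:<8}{', '.join(tags)}")
    print("possible tag sequences:", count_readings(sentence))
```

Only one of those sequences is correct in context ("can" as an auxiliary, "book" as a verb), and picking it is exactly the job of the tagger.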
POS tagging methods
The task of POS tagging can be seen as a multi-class classification task where a model is trained on annotated data to enable it to classify each token in any given sequence of tokens. There are multiple approaches to tackle the task of POS tagging. Generally, those approaches can be grouped into three categories: rule-based, statistical, and neural-based (Jurafsky et al. 2023; Kanakaraddi et al. 2022).
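The classification view can be sketched with a simple most-frequent-tag baseline: for every token, predict the tag it carried most often in the annotated training data. This is not one of the three approaches discussed in this series, just a minimal illustration; the toy training data and all names below are assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: sentences of (token, UPOS tag) pairs.
TRAIN = [
    [("the", "det"), ("dog", "noun"), ("barks", "verb")],
    [("a", "det"), ("dog", "noun"), ("runs", "verb")],
    [("the", "det"), ("morning", "noun"), ("run", "noun"), ("ended", "verb")],
    [("dogs", "noun"), ("run", "verb")],
    [("cats", "noun"), ("run", "verb")],
]


def train_baseline(sentences):
    """Learn the most frequent tag per word and a default tag for unknown words."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in sentences:
        for token, tag in sentence:
            per_word[token.lower()][tag] += 1
            overall[tag] += 1
    lexicon = {tok: counts.most_common(1)[0][0] for tok, counts in per_word.items()}
    default_tag = overall.most_common(1)[0][0]
    return lexicon, default_tag


def tag(tokens, lexicon, default_tag):
    """Classify each token independently, ignoring its context."""
    return [(tok, lexicon.get(tok.lower(), default_tag)) for tok in tokens]


if __name__ == "__main__":
    lexicon, default_tag = train_baseline(TRAIN)
    print(tag("The cat run".split(), lexicon, default_tag))
    # "run" is a noun here, but the baseline always predicts its majority tag.
    print(tag("the run ended".split(), lexicon, default_tag))
```

Because each token is classified in isolation, the baseline always gives "run" the same tag regardless of the sentence. Real taggers condition on the surrounding tokens to resolve such cases.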
In the following subsections, we explain those approaches in more detail.
Rule-Based
Rule-based methods are historically the first approaches to POS tagging. They rely on a predefined set of handwritten rules built on linguistic features, such as lexical, syntactic, and morphological information, of the specific language the tagger is created for. These rules are mostly written by linguistic experts (Kanakaraddi et al. 2022; Mitkov 2022; Jurafsky et al. 2023; Chiche et al. 2022). Examples of rule-based methods are constraint grammar tagging (Karlsson 1990) and transformation-based tagging. Brill’s tagger (Brill 1992) is the most commonly used transformation-based tagger; it uses a hybrid of data-driven and rule-based approaches. However, rule-based methods are limited and cannot handle unknown or ambiguous words, since all rules are predefined. Statistical and neural-based POS tagging methods overcome this limitation by relying on contextual clues.
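To give a feel for the idea, here is a minimal sketch of a rule-based tagger in Python. It first assigns tags with a few handwritten lexical rules and then applies one contextual transformation in the spirit of transformation-based tagging. All rules below are made up for illustration; they are neither Brill’s actual rules nor a constraint grammar, and in Brill’s tagger the transformations are learned from annotated data rather than written by hand.

```python
import re

# Illustrative handwritten lexical rules, tried in order: (pattern, tag).
LEXICAL_RULES = [
    (re.compile(r"the|a|an"), "det"),
    (re.compile(r"of|in|on|at"), "adp"),
    (re.compile(r"\d+(?:[.,]\d+)?"), "num"),
    (re.compile(r"[.,!?;:]"), "punct"),
    (re.compile(r".+ly"), "adv"),
    (re.compile(r".+(?:ing|ed)"), "verb"),
]
DEFAULT_TAG = "noun"  # fallback when no rule matches

# Contextual transformations: (from_tag, to_tag, required_previous_tag).
CONTEXT_RULES = [
    ("verb", "noun", "det"),  # e.g. "the building": a verb guess after a determiner becomes a noun
]


def initial_tags(tokens):
    """Assign a first-guess tag to each token using the lexical rules."""
    tags = []
    for token in tokens:
        for pattern, tag in LEXICAL_RULES:
            if pattern.fullmatch(token.lower()):
                tags.append(tag)
                break
        else:
            tags.append(DEFAULT_TAG)
    return tags


def apply_context_rules(tags):
    """Rewrite tags wherever a transformation's context matches."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in CONTEXT_RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags


def rule_based_tag(tokens):
    """Tag tokens with the lexical rules, then apply the contextual fixes."""
    return list(zip(tokens, apply_context_rules(initial_tags(tokens))))


if __name__ == "__main__":
    print(rule_based_tag("The building collapsed quickly .".split()))
```

Any word that no rule covers silently falls back to `DEFAULT_TAG`, which shows in miniature why a fixed rule set struggles with unknown or ambiguous words.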