Cogito ergo sum

Part-of-Speech Tagging & its Methods (1)

In an earlier post, I promised to start a series of posts on POS tagging based on my master’s thesis. This is the first post of that series.
You may wonder why it took me so long to post, since it’s just a matter of copy-pasting from LaTeX. Well, it is not that easy, especially if you want to do it correctly, preserve the citations, and provide a neat bibliography. To be able to do that, I created a WordPress plugin that helps me move all the citations into WordPress with minimal effort.
I have submitted the plugin to the WordPress Plugins Directory. Once it is accepted, I will write a post about it as well, and everybody can use it for free 🙂

A part-of-speech (POS) tag is a word class (also known as a lexical tag or morphological class) that provides significant information about a word and its neighboring words in a sentence. POS tags capture morpho-syntactic behavior, but they are only discrete approximations of how words behave in sentences. They are nonetheless useful cues for a sentence’s structure and meaning. POS tags fall into two main categories: closed-class and open-class. In English, prepositions and function words such as of, it, and, and or belong to the closed class because new words of this kind are rarely added to a language. The open class, on the other hand, contains nouns, verbs, adjectives, and adverbs. It is more dynamic: as a language evolves, new words of these classes can be added (Jurafsky et al. 2023).

The collection of all POS tags used to annotate a corpus in a specific language is called a POS tagset. Tagsets can therefore differ depending on the language they belong to. Examples are the Penn Treebank tagset (Marcus et al. 1993), the CLAWS C7 tagset (Rayson et al. 1998), and the Universal Dependencies (UD) tagset (De Marneffe et al. 2021). Over the years, many attempts have been made to establish a consistent, universal, and cross-lingual treebank/tagset annotation for many languages. Petrov et al. (2012) introduced the first universal POS tagset (UPOS), consisting of twelve common POS tags for 22 languages. Building on that work, De Marneffe et al. (2021) introduced UD for English, which combines multiple English treebanks with extended POS tags (XPOS). UD for English consists of seventeen UPOS tags, shown in the table below.

UDT English POS tagset as defined by De Marneffe et al. (2021) and Petrov et al. (2012)

| Class | POS tag | Description |
| --- | --- | --- |
| Open class | noun | Words to describe persons, places, things, etc. |
| Open class | propn | Proper noun: name of a person, organization, place, etc. |
| Open class | verb | Words to describe actions and processes |
| Open class | adj | Adjective: noun modifiers describing properties |
| Open class | adv | Adverb: verb modifiers of time, place, manner |
| Open class | intj | Interjection: exclamation, greeting, yes/no response, etc. |
| Closed class | det | Determiner: marks noun phrase properties |
| Closed class | pron | Pronoun: a shorthand for referring to an entity or event |
| Closed class | num | Numeral |
| Closed class | cconj | Coordinating conjunction: joins two phrases/clauses |
| Closed class | sconj | Subordinating conjunction: joins a main clause with a subordinate clause such as a sentential complement |
| Closed class | aux | Auxiliary: helping verb marking tense, aspect, mood, etc. |
| Closed class | adp | Adposition (preposition/postposition): marks a noun’s spatial, temporal, or other relation |
| Closed class | part | Particle: a function word that must be associated with another word |
| Other | punct | Punctuation |
| Other | sym | Symbols like $ or emoji |
| Other | x | Other |
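To make the grouping concrete, here is a minimal sketch of the tagset above as a Python data structure. The names `UPOS_TAGSET`, `ALL_TAGS`, and `tag_class` are my own choices for illustration and do not come from any library.

```python
# The seventeen UPOS tags from the table above, grouped by class.
# The names used here are illustrative, not part of any library.
UPOS_TAGSET = {
    "open": ["noun", "propn", "verb", "adj", "adv", "intj"],
    "closed": ["det", "pron", "num", "cconj", "sconj", "aux", "adp", "part"],
    "other": ["punct", "sym", "x"],
}

# Flat list of all tags, in table order.
ALL_TAGS = [tag for tags in UPOS_TAGSET.values() for tag in tags]


def tag_class(tag: str) -> str:
    """Return the class ("open", "closed", or "other") of a UPOS tag."""
    for cls, tags in UPOS_TAGSET.items():
        if tag.lower() in tags:
            return cls
    raise ValueError(f"unknown UPOS tag: {tag!r}")


if __name__ == "__main__":
    print(len(ALL_TAGS))
    print(tag_class("NOUN"), tag_class("adp"), tag_class("sym"))
```

A lookup like this is handy for sanity-checking annotated data, for example to reject any tag that is not part of the tagset.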

POS tagging is the process of assigning a POS tag to each word/token in a given text in any language. It is a disambiguation task because words are naturally ambiguous and can have more than one possible tag; which one is correct depends on the context and the word’s position in the sentence. The task is linguistically very important because it helps identify the grammatical structure of a sentence and the relationships between its constituent parts. This information can be used to analyze the language and gain insights into its structure. In addition, it enables studying the evolution of a language because it helps identify changes over time, such as shifts in word class distribution and in word usage.
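The ambiguity mentioned above can be illustrated with a small sketch. The toy lexicon `CANDIDATE_TAGS` below is made up for this example and only lists the tags relevant to it; the English word "book" is a noun in "the book" but a verb in "I book a flight".

```python
# Hypothetical toy lexicon (an assumption for this example):
# each word maps to the set of UPOS tags it can take.
CANDIDATE_TAGS = {
    "i": {"pron"},
    "book": {"noun", "verb"},  # ambiguous: "the book" vs. "I book a flight"
    "a": {"det"},
    "the": {"det"},
    "flight": {"noun"},
}


def ambiguous_tokens(tokens):
    """Return the tokens that have more than one candidate tag."""
    return [t for t in tokens if len(CANDIDATE_TAGS.get(t.lower(), set())) > 1]


if __name__ == "__main__":
    print(ambiguous_tokens("I book a flight".split()))
```

A lexicon alone can only list the candidates. Picking the right one requires looking at the surrounding words, which is exactly what a POS tagger has to learn or encode.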

Automatic POS tagging (hereafter referred to as POS tagging) is a task in natural language processing (NLP) performed by a computer system called a POS tagger, which is trained (with or without supervision) to automatically label every token in a given text with its corresponding POS tag. POS tagging serves many purposes in NLP applications, and it is traditionally considered a building block for other tasks such as named entity recognition (NER), information extraction, spelling correction, text classification, natural language generation (NLG), and machine translation (MT). Nowadays, however, end-to-end neural NLP systems, unlike traditional pipeline architectures, often perform those tasks with a single model that does not depend on POS tagging (Gimpel et al. 2010; Schmid 2022; Kanakaraddi et al. 2022; Mitkov 2022; Jurafsky et al. 2023).

[Figure: The task of sequence labeling, demonstrated as a POS tagger mapping each word of the input sequence "Ez bajêr dibînim." (Northern Kurdish, ‘I see the city.’) to its corresponding POS tag.]

POS tagging methods

The task of POS tagging can be seen as a multi-class classification task where a model is trained on annotated data to enable it to classify each token in any given sequence of tokens. There are multiple approaches to tackle the task of POS tagging. Generally, those approaches can be grouped into three categories: rule-based, statistical, and neural-based (Jurafsky et al. 2023; Kanakaraddi et al. 2022).
In the following subsections, we explain those approaches in more detail.
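To illustrate the classification view, here is a minimal sketch of a most-frequent-tag baseline: it learns from annotated data which tag each word receives most often and falls back to the overall most frequent tag for unknown words. This is only a simple baseline for illustration, not one of the methods discussed in this series, and the tiny training set `TRAIN` is made up.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Learn the most frequent tag per word and an overall fallback tag."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word.lower()][tag] += 1
            all_tags[tag] += 1
    model = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    default_tag = all_tags.most_common(1)[0][0]
    return model, default_tag


def predict(model, default_tag, tokens):
    """Classify each token independently; unknown words get the fallback."""
    return [model.get(t.lower(), default_tag) for t in tokens]


# Made-up annotated data (an assumption for this example).
TRAIN = [
    [("the", "det"), ("dog", "noun"), ("sees", "verb"),
     ("the", "det"), ("city", "noun"), (".", "punct")],
    [("dogs", "noun"), ("see", "verb"), ("cats", "noun"), (".", "punct")],
]

model, default_tag = train(TRAIN)

if __name__ == "__main__":
    # "village" never occurs in TRAIN, so it receives the fallback tag.
    print(predict(model, default_tag, ["the", "village", "sees", "cats", "."]))
```

Note that this baseline classifies every token in isolation and ignores context, so it can never tag the same word differently in different sentences.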

Rule-Based

Rule-based methods are historically the first approaches to POS tagging. They rely on a predefined set of handwritten rules built on linguistic features, such as lexical, syntactic, and morphological information, of the specific language the tagger is created for. Those rules are mostly created by linguistic experts (Kanakaraddi et al. 2022; Mitkov 2022; Jurafsky et al. 2023; Chiche et al. 2022). Examples of rule-based methods are constraint grammar tagging (Karlsson 1990) and transformation-based tagging. Brill’s tagger (Brill 1992) is the most commonly used transformation-based tagger; it follows a hybrid approach that combines data-driven and rule-based techniques. However, rule-based methods are limited and cannot handle unknown or ambiguous words that the predefined rules do not cover. Statistical and neural-based POS tagging methods overcome this limitation by relying on contextual clues.
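The idea can be sketched with a toy tagger: a lexicon assigns an initial tag to each known word, and handwritten contextual rules then correct some of those tags. The lexicon `INITIAL_TAGS` and the single rule in `RULES` are made up for this example. The rule format is loosely inspired by transformation-based tagging, but unlike Brill’s tagger, nothing is learned from data here.

```python
# Hypothetical lexicon (an assumption): one initial tag per known word.
INITIAL_TAGS = {
    "i": "pron", "want": "verb", "to": "part", "book": "noun",
    "a": "det", "flight": "noun", ".": "punct",
}

# Handwritten contextual rules: (from_tag, to_tag, required_previous_tag).
# Here: retag a noun as a verb when it directly follows a particle ("to book").
RULES = [("noun", "verb", "part")]


def rule_based_tag(tokens):
    """Tag tokens by lexicon lookup, then apply the contextual rules in order."""
    # Words missing from the lexicon get "x", since no rule covers them.
    tags = [INITIAL_TAGS.get(t.lower(), "x") for t in tokens]
    for from_tag, to_tag, prev_tag in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags


if __name__ == "__main__":
    print(rule_based_tag("I want to book a flight .".split()))
```

The sketch also shows the limitation described above: any word that is missing from the lexicon ends up with the catch-all tag x, because there is no rule to handle it.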

Bibliography

About the author

Peshmerge Morad

a machine learning & software engineer based in Germany, whose interests span multiple fields.
