Lesson 1: Machine Speech Tagging

Definition:

Machine speech tagging is the automatic assignment of linguistic tags to spoken or written language in order to identify grammatical information. It forms a basis for the machine understanding of language by technical tools such as AI.

Key Concepts

Syntax, Parts of Speech (POS)
POS-Tagging: Tags, Tagsets
Corpus linguistics

UNIT 1: POS-TAGGING – DECODING LANGUAGE

We start the lesson with a short exercise: The teacher begins by naming a word. In a clockwise direction, each class member adds another word so that a sentence develops continuously. Do two or three rounds and memorise the sentences. Before the exercise, you can either give a topic for the circle story, e.g. the next school trip, or let your imagination run wild.

As we can see, it worked. Although we all contributed only a small part to the sentence and did not know how it would end, coherent sentences were created that we – probably with a lot of imagination – could also use in everyday life. Why is that? How did we even know which words would fit next?
When we learn a language, we store basic grammatical rules that we need for everyday communication. Among other things, this includes a basic understanding of sentence structure, i.e. which word may be placed where in the sentence. This in turn requires a division into word types. Opinions differ on how many word types there are, but the following eight are particularly relevant when considering the position of a word in a sentence:

Noun: "Thing-word" – tree, bread, ship
Verb: "Do-word" – swim, eat, can
Adjective: "Like-word", quality – beautiful, red, sharp
Article/Pronoun: "Companion" – the, a / substitute – I, it, you, your
Adverb: Circumstantial word – yesterday, happily, here
Conjunction: Linking word – and, but, or
Preposition: Relation word – in, on, after
Particle: Function word that cannot be asked for with a question – not, very, to (as in "to swim")

The word types can be used to clarify the rules of sentence construction and present them in a general form:

"The dog is aggressive." – Article, noun, verb, adjective.

In so-called POS-Tagging (Part-of-Speech Tagging), the words in a text are assigned the appropriate word types according to this principle. The words are thus provided with "tags" that say something about them. For this purpose, there are various collections of relevant tags, called "tagsets", which differ depending on the language or the focus of the tagging. For English, the "Penn Treebank Tagset" is commonly used. It specifies the individual word types even more precisely, for example by distinguishing singular from plural nouns and different verb forms.
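As a rough illustration of how a tagset refines the word types, the following Python sketch lists a handful of Penn Treebank tags. The selection is incomplete, and the grouping into the lesson's coarse word types as well as the names `PENN_TAGS`, `coarse_type` and `fine_tags` are assumptions made for this example.

```python
# A few tags from the Penn Treebank tagset, grouped by the coarse word
# types used in this lesson. The grouping is an illustrative assumption;
# the full tagset contains more tags than are listed here.
PENN_TAGS = {
    "NN": ("Noun", "noun, singular or mass"),
    "NNS": ("Noun", "noun, plural"),
    "NNP": ("Noun", "proper noun, singular"),
    "VB": ("Verb", "verb, base form"),
    "VBD": ("Verb", "verb, past tense"),
    "VBG": ("Verb", "verb, gerund or present participle"),
    "VBZ": ("Verb", "verb, 3rd person singular present"),
    "JJ": ("Adjective", "adjective"),
    "RB": ("Adverb", "adverb"),
    "DT": ("Article/Pronoun", "determiner"),
    "PRP": ("Article/Pronoun", "personal pronoun"),
    "IN": ("Preposition", "preposition or subordinating conjunction"),
    "CC": ("Conjunction", "coordinating conjunction"),
}


def coarse_type(tag):
    """Return the coarse word type for a tag, or 'Unknown' if it is not listed."""
    return PENN_TAGS.get(tag, ("Unknown", ""))[0]


def fine_tags(word_type):
    """Return all listed tags that belong to one coarse word type."""
    return sorted(tag for tag, (coarse, _) in PENN_TAGS.items() if coarse == word_type)


print(coarse_type("NNS"))
print(fine_tags("Verb"))
```

The lookup shows that one word type from the list above (for example "Verb") corresponds to several more specific tags in the tagset.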


This assignment to word types can be done manually, but there are also automated tagging procedures. In both cases it is important that words are always considered in context, because a word can be spelled the same but have different meanings and word types. For example: "Panicking, I bat the bat with my racket". If one were to look up each word in isolation and disregard the context, one would assign the same tag to both occurrences of "bat". In context, however, it is clear that the first "bat" is a verb (to hit) and the second "bat" is a noun (the animal).
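The role of context can be sketched with a toy tagger in Python. This is only a minimal illustration and not how real taggers work: the small word list, the single context rule and the name `tag_sentence` are all assumptions made for this example.

```python
# Toy context-sensitive tagger for the ambiguous word "bat".
# The word lists and the single rule are illustrative assumptions.
DETERMINERS = {"the", "a", "an", "my", "your"}
LEXICON = {
    "panicking": "Verb",
    "i": "Pronoun",
    "the": "Article",
    "with": "Preposition",
    "my": "Pronoun",
    "racket": "Noun",
}


def tag_sentence(words):
    """Tag each word, using the previous word to decide how to tag 'bat'."""
    tagged = []
    previous = None
    for word in words:
        lower = word.lower()
        if lower == "bat":
            # Context rule: after a determiner, "bat" names a thing (noun);
            # otherwise it is treated as the action (verb).
            tag = "Noun" if previous in DETERMINERS else "Verb"
        else:
            tag = LEXICON.get(lower, "Unknown")
        tagged.append((word, tag))
        previous = lower
    return tagged


sentence = "Panicking I bat the bat with my racket".split()
for word, tag in tag_sentence(sentence):
    print(f"{word}: {tag}")
```

A tagger that only looked up each word in a list would have to give both occurrences of "bat" the same tag; looking at the neighbouring word is what separates them here.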
For artificial intelligences, i.e. intelligent computer systems, to translate language independently or generate it themselves, clean POS-Tagging and an understanding of general sentence structures are an important foundation for forming grammatically correct sentences at all. This information can also be helpful for search engine optimisation.

Activity 1: Matching parts of speech
Practise your knowledge of word types in POS-Tagging

Word types and POS-Tagging take practice. Try it yourself with the sentence
"Yesterday, the brown cow casually hopped around the classroom and got stuck on the teacher's desk."
and match each word to the correct part of speech (see above).
Did that work? Then go to https://parts-of-speech.info/ and tag the same sentence on the website. Compare the differences with your assignment. Who was faster? Did you assign different parts of speech than the computer? Where do problems arise?

Did you finish the exercise? Yesterday (Adverb), the (Determiner) brown (Adjective) cow (Noun) casually (Adverb) hopped (Verb) around (Preposition) the (Determiner) classroom (Noun) and (Conjunction) got (Verb) stuck (Verb) on (Preposition) the (Determiner) teacher's (Noun) desk (Noun).
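The comparison between your own tags and a reference tagging can also be sketched in Python. The answer key below is the one given in this activity; the function name `compare_tags` and the sample student answer are hypothetical.

```python
# Answer key from this activity as (word, tag) pairs.
ANSWER_KEY = [
    ("Yesterday", "Adverb"), ("the", "Determiner"), ("brown", "Adjective"),
    ("cow", "Noun"), ("casually", "Adverb"), ("hopped", "Verb"),
    ("around", "Preposition"), ("the", "Determiner"), ("classroom", "Noun"),
    ("and", "Conjunction"), ("got", "Verb"), ("stuck", "Verb"),
    ("on", "Preposition"), ("the", "Determiner"), ("teacher's", "Noun"),
    ("desk", "Noun"),
]


def compare_tags(own_tags, key=ANSWER_KEY):
    """Return the share of matching tags and a list of disagreements."""
    if len(own_tags) != len(key):
        raise ValueError("Expected exactly one tag per word in the sentence.")
    differences = [
        (position, word, own, expected)
        for position, ((word, expected), own)
        in enumerate(zip(key, own_tags), start=1)
        if own != expected
    ]
    agreement = 1 - len(differences) / len(key)
    return agreement, differences


# Hypothetical student answer: "stuck" was tagged as an adjective.
my_tags = [tag for _, tag in ANSWER_KEY]
my_tags[11] = "Adjective"

agreement, differences = compare_tags(my_tags)
print(f"Agreement: {agreement:.0%}")
for position, word, own, expected in differences:
    print(f"Word {position} ({word}): you said {own}, the key says {expected}")
```

Disagreements such as the one on "stuck" are good starting points for the question of where problems arise.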

UNIT 2: CORPUS LINGUISTICS

Not only AIs benefit from POS-Tagging; linguistic research benefits greatly from it as well. These technical tools of computational linguistics open up many new possibilities for the sub-field of "corpus linguistics". As its name suggests, corpus linguistics deals with corpora, i.e. large collections of natural language data. Examples include a metaphor corpus, the International Corpus of Learner English or the Corpus of Middle English Prose and Verse. Since they all comprise huge amounts of data, they cannot be analysed by hand alone. Thanks to computer programs and, among other things, POS-Tagging, it is nevertheless possible to work with them.

Computer analysis can, for example, reveal words that occur together more frequently than average in a corpus. Depending on the corpus, a search for "sky" may turn up the results "blue", "grey" or "earth", because these words occur disproportionately often near each other in the texts of the corpus. The historical development of the language can also be researched.
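A very simplified version of such a co-occurrence search can be sketched in Python. The mini-corpus is invented for this example, the names `collocates` and `STOPWORDS` are assumptions, and the sketch only counts raw neighbours; real corpus tools additionally compare these counts with the frequency one would expect on average.

```python
from collections import Counter

# Invented mini-corpus, for illustration only.
CORPUS = [
    "the sky was blue and the sea was calm",
    "a grey sky hung over the town",
    "between sky and earth the birds were flying",
    "the blue sky made everyone happy",
]
# Very frequent function words that we do not want in the result.
STOPWORDS = {"the", "a", "and", "was", "were", "over", "between"}


def collocates(target, sentences, window=2):
    """Count content words that occur within `window` words of `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word != target:
                continue
            # Neighbours to the left and to the right of the target word.
            neighbours = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            counts.update(w for w in neighbours if w not in STOPWORDS)
    return counts


print(collocates("sky", CORPUS).most_common(3))
```

The `window` parameter controls how far from the search word a neighbour may be; a larger window finds looser associations but also more noise.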
For example, the verb "to google" has not always been in use; it only became common through the company Google. In the Corpus of Contemporary American English (COCA), which contains 450 million words from a wide variety of sources, one can examine when "to google" was first used in these sources.
In addition to all kinds of linguistic questions, literary studies can also make use of such techniques. In stylistic analyses, i.e. observations about the style of a text or text corpus, the distribution of word types can provide important clues to the text genre or even the author. It is also possible to examine which terms occur particularly often in which parts of a novel, whether the sentences become longer at points where the plot is exciting, or whether linguistic regularities are violated in the course of the book. In other words, all of these observations are quantitative.
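How the distribution of word types might be compared between texts can be sketched as follows. The two hand-tagged snippets are invented for this example and far too short for a real stylistic analysis; the function name `tag_distribution` is likewise an assumption.

```python
from collections import Counter


def tag_distribution(tagged_words):
    """Return the relative frequency of each tag in a tagged text."""
    counts = Counter(tag for _, tag in tagged_words)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


# Two invented, hand-tagged snippets standing in for different styles.
descriptive = [("the", "Article"), ("old", "Adjective"), ("grey", "Adjective"),
               ("house", "Noun"), ("stood", "Verb"), ("silent", "Adjective")]
narrative = [("she", "Pronoun"), ("ran", "Verb"), ("stopped", "Verb"),
             ("turned", "Verb"), ("and", "Conjunction"), ("fled", "Verb")]

for name, text in [("descriptive", descriptive), ("narrative", narrative)]:
    shares = tag_distribution(text)
    print(name, {tag: round(share, 2) for tag, share in shares.items()})
```

In this toy data the descriptive snippet is dominated by adjectives and the narrative one by verbs, which is the kind of quantitative clue a stylistic analysis looks for on a much larger scale.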

[Figure: screenshot of a text analysis of Oscar Wilde's "The Picture of Dorian Gray"]

This example shows, among other things, that "Dorian" is by far the most frequent word in Oscar Wilde's novel "The Picture of Dorian Gray". On the right, one can also see in which sections of the text the name "Dorian" is mentioned particularly often and how its frequency relates, for example, to that of the word "Lord". This in turn allows us to draw conclusions about the content.
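The two observations described here, the most frequent word and its distribution across sections of the text, can be sketched in Python. The sample text consists of invented sentences (not quotations from the novel), and the function names and stop word list are assumptions made for this example.

```python
import re
from collections import Counter

# Invented sample sentences, NOT quotations from the novel.
TEXT = (
    "Dorian looked at the picture. Dorian smiled. Lord Henry watched Dorian. "
    "The garden was quiet and the painter said nothing at all. "
    "Lord Henry spoke to Dorian."
)
STOPWORDS = frozenset({"the", "at", "and", "to", "was"})


def tokenize(text):
    """Split a text into lower-case words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def most_frequent(text, n=3, stopwords=frozenset()):
    """Return the n most frequent words that are not stop words."""
    words = [w for w in tokenize(text) if w not in stopwords]
    return Counter(words).most_common(n)


def distribution(text, term, segments=3):
    """Count `term` in consecutive slices of the text (at most `segments`)."""
    words = tokenize(text)
    size = max(1, -(-len(words) // segments))  # ceiling division
    return [words[i:i + size].count(term.lower())
            for i in range(0, len(words), size)]


print(most_frequent(TEXT, 1, STOPWORDS))
print(distribution(TEXT, "Dorian"))
```

Running `distribution` for two different terms and comparing the resulting lists is a simple way to see whether two words tend to appear in the same sections of a text.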

Such precise observations are, of course, not limited to nouns, proper names and titles; all types of words can be examined in the same way, just as in corpus linguistics.

So far we have seen many possibilities of machine speech tagging, POS-Tagging and digital tools of computational linguistics. In the remaining minutes, discuss in groups of three:
– In which parts of the work are humans superior to computers? What gaps might technology still have?
– Can all languages, groups of people etc. be equally studied and taken into account with such tools? If not, why not? What would have to be changed?

Activity 2: Discussion
Opportunities and failures

Did you finish the exercise?
– It is not yet possible to tag a (regional) dialect correctly by machine. The dialect therefore has to be normalised before it can be examined, so linguistic variation is lost or not represented.
– Because of the different tagsets and systems, universality and thus comparability are lacking.
– To work with a language, a tagset must exist for it. Regional dialects and the languages of linguistic minorities are therefore not analysed, or not in a way that reflects everyday usage.
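Why normalisation hides linguistic variation can be shown with a small Python sketch. The mapping table is a hypothetical example with a few non-standard English forms of "nothing"; real normalisation resources are far larger.

```python
# Hypothetical normalisation table: non-standard form -> standard form.
VARIANTS = {
    "nowt": "nothing",
    "nuffin": "nothing",
    "nothin'": "nothing",
}


def normalise(words):
    """Replace known non-standard forms with their standard form."""
    return [VARIANTS.get(word.lower(), word.lower()) for word in words]


# Invented list of forms as different speakers might use them.
utterances = ["nowt", "nothing", "nuffin", "nowt"]

before = set(utterances)
after = set(normalise(utterances))
print(f"{len(before)} distinct forms before, {len(after)} after normalisation")
```

After normalisation a tagger can process every form, but the information about which speaker used which variant is no longer visible in the data.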

Final thought for this lesson

POS-Tagging provides a basis for many forms of automatic language understanding by technical tools such as AI.

How could POS-Tagging continue to contribute to the advancement of speech processing technologies and what potential do you see for future applications in this area?


Sources used

TEXTS

Chiche, A., & Yitagesu, B. (2022). Part of speech tagging: a systematic review of deep learning and machine learning approaches. Journal of Big Data, 9. https://doi.org/10.1186/s40537-022-00561-y

Imo, W. (2016). Grammatik. Eine Einführung. J.B. Metzler.

Pittner, K., & Berman, J. (2021). Deutsche Syntax. Ein Arbeitsbuch. 7., überarbeitete und erweiterte Auflage. Narr.

Stückler, L. (2022). Empirische Methoden der Sprachwissenschaft, Vorlesung Universität Bern.

ILLUSTRATIONS

Art-generator (2023). https://hotpot.ai/art-generator

Digital Humanities (2022). ISDT. Italian Stanford Dependency Treebank. https://dh.fbk.eu/research/tint/

Rademaker, A., Chalub, F., Real, L., Freitas, C., Bick, E., & de Paiva, V.C. (2017). Universal Dependencies for Portuguese. International Conference on Dependency Linguistics. https://www.semanticscholar.org/paper/-Universal-Dependencies-for-Portuguese-Rademaker-Chalub/703a1e207c47436dd08b6524b68ccb5267aee7d3

Stückler, L. (2022). Empirische Methoden der Sprachwissenschaft, Vorlesung Universität Bern.

van der Aa, H. (2017). Comparing and Aligning Process Representations. https://www.researchgate.net/-figure/4-Overview-of-the-Penn-Treebank-tagset-from-135-p131_tbl3_320858849

Wolf, R. (2023). voyant-tools.org