Visual Analytics of Natural Language and Linguistic Data

◀ Back to Text Visualization Overview

As part of an interdisciplinary effort computer scientists and linguists at the University of Konstanz jointly develop novel ways of visually analyzing natural language for linguistic inquiry. The research questions under investigation mainly come from the linguistic fields of Typology and Historical Linguistics. The focus lies on an investigation of language features and phenomena from both cross-linguistic and diachronic perspectives. Currently, three main areas are being worked on that will be shortly described in the following.

Cross-Linguistic Comparison of Language Features

Comparing language features across different of today’s languages is key to learn more about causes and effects of language change. It can help to better understand natural language and reveal which kinds of phenomena are shared across all languages. Such phenomena can be considered rather stable over history and therefore universal. Other phenomena may, in contrast, vary heavily across languages which indicates that they are prone to change. Unfortunately, for many language phenomena the available information is sparse and imprecise, i.e. it corresponds to a rough manually constructed categorization that is only available for a subset of the world’s approximately 6,900 languages. However, there are cases where this gap in precision and amount can be filled with the support of computational means. Certain language phenomena may be approximated by deriving numerical features from texts. Those features may consist in rather complex sets of relations and therefore suitable visualizations are required to convey them to the analyst. One example for language features are vowel distributions within words, which indicate whether a language contains Vowel Harmony or uses Syllable Reduplication (see Figure 1).

High-resolution version containing 42 languages, click here.

Figure 1 - Matrices reveal how vowels are distributed within different languages. They show which vowels tend to succeed or avoid each other within words of different languages. Each cell indicates whether a certain vowel is more (blue) or less (red) likely to be observed after a given vowel than expected under independence assumption. The color saturation indicates the effect strength. Here, 7 language matrices are ordered from left to right according to the average effect strength (in parentheses). The patterns indicate different kinds of Vowel Harmony and the features involved.

Analysis of Geo-spatial and Genealogical Distributions of Language Features

When comparing features across languages it is of special interest to explore the geo-spatial and genealogical implications. It might be the case that only genetically related languages or languages in a specific world region share a certain peculiarity in a feature. Such findings help to hypothesize whether a certain feature is likely to be influenced by language contact or is rather likely to be inherited from proto-languages without changes. One preliminary example of plotting the geo-spatial distribution of a language feature using a distorted world map is shown in Figure 2.

For more details we would like to refer you to the corresponding paper.

Figure 2 - World map distorted using pixel placement in order to minimize overplotting. Each language is represented by one dot in world map. The color of the dot depends on the degree of conformance a language shows for the principle of similar place avoidance in places of articulation of subsequent consonants. Blue shows a strong conformance and red a strong noncompliance. The latter case is rare and seems to appear only in selected world regions, highlighted with red boxes.

Diachronic Analysis of Lexical Semantic Change

It has been shown that different meanings of a word can be disambiguated and even induced by their context. Within this project we aim to develop visual analytics approaches that track the development of word meanings over time. The aim is to gain a deeper understanding of how lexical semantic change happens. Figure 3 shows the quantitative development of different automatically learned senses for the words “to browse” and “to surf” based on daily newspaper editions over 20 years.

For more details we would like to refer you to the corresponding paper.

Figure 3 - The temporal development of different senses learned for the verbs “to browse” left and “to surf” right. The height of the plot corresponds to the relative frequency of the respective sense at a certain point in time. As can be seen, new senses appear for both words with the spread of the Internet.

More information about this and related work can be found in the following publications.