Project Description

The goal of the project was the linguistic analysis of a limited but important part of the lexical component of the German language: collocations. Collocations are sequences of words that co-occur with statistically significant frequency. Examples are phrases like pay attention and have in mind.

The lexicon contains thousands of collocations, and they occur with high frequency in both spoken and written language. Collocations constitute an essential component of the expressive power of a language.

Idioms constitute one particular kind of collocation. They form a lexical unit and are sometimes referred to as long words. English idioms like kick the bucket and let the cat out of the bag, and their German counterparts den Löffel abgeben and die Katze aus dem Sack lassen, are structurally complex but express a concept that may or may not be paraphrasable with a single word ("die", "tell a secret").

The quality of a dictionary can be measured in part by the degree to which it accounts for collocations and, in bilingual dictionaries, how well it matches them cross-lingually. Linguists assume the principle of compositionality, according to which the meaning of a string is the sum of the meanings of the individual words. Collocations, and idioms in particular, violate this principle systematically, though to different degrees. They are not easily integrated into traditional dictionaries, where individual words are listed in alphabetical order. Furthermore, the systematic search for attested uses of collocations and idioms, which is necessary for empirical research, is difficult. An additional challenge is that collocations change over time: old ones fall out of use and new ones enter the language via various mechanisms.

Fortunately, the methods of computational linguistics have opened up new possibilities, both for the analysis of collocations and for the search for attested data. The BBAW has built a very large, representative, and linguistically exploitable corpus of the German language of the 20th century (DWDS-Corpus).

A comprehensive analysis of all German collocations, even only those in the contemporary language, was unrealistic within the framework of the project. There are too many of them, and there was no comparable previous work that could have served as a model for this pioneering enterprise. Consequently, we limited the scope of the project to verb-noun idioms (like die Katze aus dem Sack lassen), but aimed to be comprehensive within that scope. We also treated a certain number of adjective-noun collocations like blauer Brief and schwedische Gardinen.