Svenja Adolphs, a Lecturer in Applied Linguistics at the University of Nottingham, provides a guide to the area of corpus linguistics in her book Introducing Electronic Text Analysis. Corpus linguistics extracts patterns from text to help us gain an understanding of the governing rules and interconnectedness of language. Her work also looks at how electronic texts and analysis software are “being utilized by researchers in a range of diverse areas in the arts and humanities and in the social sciences.” It’s hard to believe that this area of study was first conducted before computers were around. Corpus linguistics seeks to bring order to and make sense of the breadth of information and diverse use of language available to us.
One of the methods Adolphs presents is concordance data. The Key Word in Context (KWIC) concordance takes a body of text and displays each occurrence of a chosen word or phrase together with the text that surrounds it. “A concordance is a way of presenting language data to facilitate analysis” (Adolphs, 5). Language is complex because a single word can carry multiple meanings, and those meanings change over time. The output format of a KWIC concordance helps us analyze a word in context, which greatly affects how we interpret historical texts like the Bible, Shakespeare, and even the US Constitution.
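To make the idea of a KWIC display concrete, here is a minimal sketch in Python. It is not taken from Adolphs' book or from any particular concordance program; the function name `kwic`, the `width` parameter, and the sample sentence are all made up for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per occurrence of keyword, with `width` characters
    of context on each side (hypothetical helper, not from the book)."""
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines

# Invented sample text in which "bank" has three different senses.
sample = "The bank raised rates. We sat on the river bank. A bank of clouds rolled in."
for line in kwic(sample, "bank", width=20):
    print(line)
```

Lining the keyword up in a single column is what lets a reader scan down the page and see at a glance which senses of the word appear and what company each one keeps.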
Researchers greatly benefit from existing corpora such as the Bank of English corpus, “which exceeds 500 million words at the time of writing” (Adolphs, 18). An extensive corpus opens up more possibilities for researchers. In general, drawing on a larger body of data adds “to the robustness of the analytical results” (Adolphs, 19). Various qualitative and quantitative methods have been devised to “provide a way into the data that is informed by the data itself” (Adolphs, 19).
The type-token ratio examines lexical density and “can be useful when assessing the level of complexity of a particular text or text collection” (Adolphs, 39-40). While this method gives us a basic insight into a text and may be helpful in organizing it, a closer examination of the words and phrases is required to make any sort of concrete evaluation of the text's complexity.
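The calculation behind the type-token ratio is simple: the number of distinct words (types) divided by the total number of running words (tokens). The sketch below assumes a very crude tokenizer (lowercase letters and apostrophes only); Adolphs does not prescribe this tokenization, and the function name and sample sentence are invented for illustration.

```python
import re

def type_token_ratio(text):
    """Distinct words (types) divided by running words (tokens)."""
    # Assumed tokenizer: lowercase the text, keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Invented sample: 11 tokens, 8 of them distinct.
sample = "the cat sat on the mat and the dog sat too"
print(round(type_token_ratio(sample), 3))
```

Because longer texts repeat words more often, the raw ratio tends to fall as a text grows, which is one reason it works better as a first impression than as a final verdict on complexity.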
Examining the words used in a text or even in spoken conversation can yield valuable information for researchers as well as those in professional fields. Studying wordlists reveals the frequency of certain key words or phrases.
In political science, for example, this might mean comparing the linguistic devices used by different political parties in election campaign discourse.
The frequencies of words observed in wordlists are expressed as ratios within the body of a text. This is useful because texts vary in size, and ratios “provide a better basis for comparison of frequencies of individual items” (Adolphs, 43). The CANCODE corpus represents general spoken English. In Adolphs’ comparison of the Health Professional (HP) corpus with the CANCODE corpus, the frequency of positive keywords reveals that the HP corpus is more geared towards speaking in the present tense. This comparison with the CANCODE corpus can help us form hypotheses about the nature of the health profession based on its lexicon.
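Here is a rough sketch of why ratios matter when two texts differ in size. The two sample strings are invented stand-ins, not data from the HP or CANCODE corpora, and the function name, the tokenizer, and the choice of "per 1,000 words" are assumptions made for the example.

```python
import re
from collections import Counter

def relative_frequencies(text, per=1000):
    """Map each word to its frequency per `per` running words."""
    # Assumed tokenizer: lowercase the text, keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count * per / total for word, count in counts.items()}

# Invented samples of different lengths (6 and 10 tokens).
small = "you are well you are fine"
large = "we were there and you are here now and then"

small_freq = relative_frequencies(small)
large_freq = relative_frequencies(large)

# Raw counts of "are" are 2 and 1; the normalized rates are directly comparable.
print(round(small_freq["are"], 1), round(large_freq["are"], 1))
```

Once both wordlists are on the same scale, a word that is unusually frequent in one text relative to the other stands out, which is the basic intuition behind comparing a specialized corpus against a general one.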