Vocabulary Building: Corpora
Corpora are collections of language from authentic sources, such as newspapers, magazines, or academic texts. They are useful for language learning because they show how language is actually used and which options exist for words and phrases. Below are strategies and questions to consider, followed by sample queries in two corpus tools, Word and Phrase and the Corpus of Contemporary American English (COCA).
Strategies for using a language corpus
Determine which contexts are represented
- Does the corpus collect only spoken genres? Fiction? Academic?
- Which fields are represented? Natural sciences? Social sciences? Both?
Use sort and filter functions to organize data
- What are the 5 most common reporting verbs in your academic field?
- How does usage of a word differ from one genre to another?
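For readers who want to see what a genre comparison involves, here is a minimal sketch in Python. It is not how either corpus tool is implemented. The mini-corpus, the genre labels, and the function names are all invented for illustration. It counts a word in each genre and scales the count to a rate per million words, so that genres of different sizes can be compared fairly.

```python
from collections import Counter
import re

# Hypothetical mini-corpus: invented sentences grouped by genre.
# A real corpus would hold millions of words per genre.
SAMPLE_CORPUS = {
    "academic": [
        "The results indicate a significant effect.",
        "These findings indicate that further study is needed.",
    ],
    "spoken": [
        "I mean, the results were kind of surprising.",
        "So we looked at it again and, you know, nothing new.",
    ],
}


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_per_million(word, corpus):
    """Return the word's rate per million tokens in each genre."""
    rates = {}
    for genre, sentences in corpus.items():
        tokens = [t for s in sentences for t in tokenize(s)]
        count = Counter(tokens)[word.lower()]
        rates[genre] = count / len(tokens) * 1_000_000 if tokens else 0.0
    return rates


if __name__ == "__main__":
    for genre, rate in frequency_per_million("indicate", SAMPLE_CORPUS).items():
        print(f"{genre}: {rate:.0f} per million words")
```

With so few words the numbers themselves mean little. The point is only that a raw count becomes comparable across genres once it is divided by the size of each genre.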
Notice frequent patterns of collocations
- What are some common phrasal verb combinations (verb + preposition)?
- Which transitive verbs most frequently appear before abstract nouns?
Compare frequencies and contexts of alternative language
- What adjectives do different academic fields use to describe your topic?
- How might some synonyms affect your sentence structure?
Apply observations to your own writing
- Does the word you’re using appear more frequently in other contexts or have other connotations that may confuse your reader?
- Can the phrase you’re using be considered too informal for the genre you’re writing in?
Sample queries to try
Try searching with the Word and Phrase tool for the word “new” as any part of speech in all contexts.
Notice that “new” appears as an adjective and as an adverb. Notice the frequency of “new” in different genres represented numerically and in charts. Try clicking different synonyms (bottom-left) to reload sample sentences, sorted alphabetically by collocates (words it is frequently combined with) and color-coded by part of speech. Try filtering sample sentences by genre.
Try searching with the Word and Phrase tool for the most frequent adjectives that begin with “un.”
You should find a list of words like united, unique, unusual, and unable. Which appear most often in which contexts? Notice that the bottom-right displays definitions from WordNet and common collocations, with example usages from all genres.
Try using the Word and Phrase tool to compare synonyms for “finding” + forms and synonyms for “indicate” + other conjunctions.
This section lets users input text (top-left). The text is color-coded by frequency (top-right), and phrases can be searched (mid-right). The sample query searches for synonyms of “finding” + forms and synonyms of “indicate” + other conjunctions. Phrases and collocations matching the criteria appear in the bottom-left and can be viewed color-coded in context in the bottom-right to compare usages.
Try using the Corpus of Contemporary American English to compare nouns that immediately follow the verb “show” with nouns that immediately follow the verb “reveal” in academic contexts.
Try sorting collocates by frequency. Decimals and color refer to collocation strength; stronger collocations sound more natural. These combinations are also shown in sample sentences from the corpus, sorted by year and context.
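To make the idea of collocation strength more concrete, here is a minimal sketch of one common measure, pointwise mutual information (PMI). Whether the decimals shown in the tool are computed exactly this way is an assumption. The function name and all counts below are invented for illustration. PMI compares how often two words actually appear together with how often they would appear together by chance. A score near zero suggests no special attraction, and a higher score suggests a stronger collocation.

```python
import math


def pointwise_mutual_information(pair_count, word_count, collocate_count, total_tokens):
    """Return log2 of observed co-occurrence over chance co-occurrence.

    All four arguments are raw counts taken from the same corpus.
    """
    if min(pair_count, word_count, collocate_count, total_tokens) <= 0:
        raise ValueError("all counts must be positive")
    p_pair = pair_count / total_tokens
    p_word = word_count / total_tokens
    p_collocate = collocate_count / total_tokens
    return math.log2(p_pair / (p_word * p_collocate))


if __name__ == "__main__":
    # Invented counts: a verb seen 1,000 times, a noun seen 2,000 times,
    # and the two seen together 50 times in a 1,000,000-word corpus.
    score = pointwise_mutual_information(50, 1_000, 2_000, 1_000_000)
    print(f"PMI: {score:.2f}")
```

PMI tends to reward rare pairs, which is one reason to look at it alongside raw frequency rather than rely on it alone.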
Try using the Corpus of Contemporary American English tool to identify adjectives that appear up to two words before the word “results” in academic humanities and academic scientific writing.
Notice that two lists sort the adjective collocates. Collocates that are more exclusive to one list are shaded a deeper green, showing which word combinations are more frequent in humanities writing or in scientific writing. Sample text excerpts with the collocations appear in the bottom-right.
Try searching the Corpus of Contemporary American English tool for the word “limited” + noun in all contexts.
Rather than a bar graph of the word combination’s frequency in different contexts, you can generate a list of all contexts with results in descending order. The bottom-right displays different phrases from the academic scientific and technical writing sub-section.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 License.
You may reproduce it for non-commercial use if you use the entire handout and attribute the source: The Writing Center, University of North Carolina at Chapel Hill