Vocabulary Building: Corpora

Corpora are collections of language from authentic sources, such as newspapers, magazines, or academic texts. Corpora are useful for language learning because they show how language is actually used and what alternative words and phrases are available. Below are strategies and questions to consider, as well as sample queries in two corpora: Word and Phrase and the Corpus of Contemporary American English (COCA).


Determine which contexts are represented

Does the corpus collect only spoken genres? Fiction? Academic?
Which fields are represented? Natural sciences? Social sciences? Both?

Use sort and filter functions to organize data

What are the five most common reporting verbs in your academic field?
How does usage of a word differ from one genre to another?

Notice frequent patterns of collocations

What are some common phrasal verb combinations (verb + particle, such as a preposition or adverb)?
Which transitive verbs most frequently appear before abstract nouns?

Compare frequencies and contexts of alternative language

What adjectives do different academic fields use to describe your topic?
How might some synonyms affect your sentence structure?

Apply observations to your own writing

Does the word you’re using appear more frequently in other contexts or have other connotations that may confuse your reader?
Can the phrase you’re using be considered too informal for the genre you’re writing in?

Sample queries

Detailed instructions for searches can be found on each corpus’s help page. Location references such as “bottom-left” describe the layout of the results page for each query.

Word and Phrase

Query: This searches for the word “new” as any part of speech in all contexts.

Results: Entries for “new” as an adjective and adverb are displayed. Its frequency in different genres is represented numerically and in charts. Different synonyms (bottom-left) can be clicked to reload sample sentences, sorted alphabetically by collocates (words it is frequently combined with) and color-coded by part of speech. Sample sentences can be filtered by genre as well.

Query: This searches for the most frequent adjectives in the corpus that begin with “un-”.

Results: “Unique” is the second most frequent adjective in the list and appears most frequently in academic contexts. Clicking on it displays synonyms with their frequency in the bottom-left. The bottom-right displays definitions from WordNet and common collocations, with example usages from all genres.

Query: This section lets users input text (top-left). The text is color-coded by frequency (top-right) and phrases can be searched (mid-right). This searches synonyms for “finding” + forms and synonyms for “indicate” + other conjunctions.

Results: Phrases and collocations matching the criteria appear in the bottom-left and can be viewed color-coded in context in the bottom-right to compare usages.

Corpus of Contemporary American English

Query: This search compares nouns that immediately follow “show” and “reveal” in academic contexts.

Results: Two lists sort collocates by frequency. Decimals and color refer to collocation strength; stronger collocations sound more natural. These combinations are also shown in sample sentences from the corpus, sorted by year and context.
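The idea behind this kind of comparison can be tried on a small scale with your own text. The sketch below is only an illustration, not how COCA works internally: the sample sentences are invented, the function name `following_words` is hypothetical, and the code counts every word that directly follows the verb rather than only nouns, since it does no part-of-speech tagging. It also ignores sentence boundaries and reports raw counts, not a collocation strength score.

```python
import re
from collections import Counter


def following_words(text, target_forms):
    """Count the word that immediately follows any form of the target verb."""
    # Lowercase the text and keep only runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    # Walk through adjacent word pairs and record what follows the target.
    for current, nxt in zip(tokens, tokens[1:]):
        if current in target_forms:
            counts[nxt] += 1
    return counts


# Invented sample text standing in for a small academic corpus.
sample = (
    "The results show evidence of change. These data show evidence of growth. "
    "The surveys reveal patterns in usage. The tables show patterns over time. "
    "The interviews reveal differences between groups."
)

show_counts = following_words(sample, {"show", "shows", "showed", "shown"})
reveal_counts = following_words(sample, {"reveal", "reveals", "revealed"})

# Print each list sorted from most to least frequent.
print(show_counts.most_common())
print(reveal_counts.most_common())
```

Placing the two printed lists side by side gives a rough version of the two-column view described above: words that appear in only one list are candidates for collocates that prefer one verb over the other.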

Query: This search compares adjectives that appear up to two words before “results” in academic humanities and academic scientific writing.

Results: Two lists sort adjective collocates, with those more exclusive to one list shaded greener, showing which word combinations are more frequent in the humanities or in scientific writing. Sample text excerpts with the collocations appear in the bottom-right.

Query: This searches for phrases of the form “limited + noun” in all contexts.

Results: Rather than a bar graph of the word combination’s frequency in different contexts, this lists all sub-sections of contexts and sorts results in descending order. The bottom-right displays different phrases from the academic scientific and technical writing sub-section.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License.
You may reproduce it for non-commercial use if you use the entire handout and attribute the source: The Writing Center, University of North Carolina at Chapel Hill.
