In December, the Libraries acquired twelve full-text corpus datasets compiled by Mark Davies, a retired professor of linguistics at Brigham Young University. The corpora will help Columbia researchers across many disciplines understand how language is used, and has been used, around the world, and they mark another step in the Libraries’ commitment to supporting large-scale language research.
The varied sources of funding reflect the acquisition’s interdisciplinary nature. Kaoukab Chebaro and Pamela Graham (on behalf of Latin American, Caribbean, and Iberian Studies), Jeremiah Mercurio (Linguistics), Yasmin Saira (Business), John Tofanelli (English), and Will Vanti (Computer Science) all contributed resources to bring the datasets to Columbia. This broad group of librarians testifies to the value of the corpora to scholars in literature, area studies, the social sciences, business, and computer science.
Dennis Yi Tenen, Associate Professor of English and Comparative Literature, noted, “These datasets are necessary for any analysis of language and culture over time. In capturing the history of language in use, they give us a baseline against which we can compare our own research findings.”
The corpora include the Corpus of Contemporary American English (COCA). At one billion words, it is the only corpus of American English balanced by genre, and it features nearly half a million texts published since 1990. Researchers can also make use of the Corpus of Historical American English (COHA). At nearly half a billion words, it is the largest structured corpus of historical American English.
Scholars working with Spanish and Portuguese will benefit from corpora dedicated to those two languages. El corpus del español contains about two billion words collected from web pages in 21 hispanophone countries. Similarly, O corpus do português contains one billion words from websites in four lusophone countries.
The other corpora include the 20-billion-word News on the Web corpus, the most up-to-date corpus of English (last updated in November 2024), as well as corpora devoted to the COVID-19 pandemic, Wikipedia, TV transcripts and movie scripts, and global English. The corpora are described in detail in the Linguistics Research Guide.
Access to the datasets is restricted to Columbia scholars working on research projects; the specific restrictions are listed in the Linguistics Research Guide. Instructors who want to use the data in undergraduate courses should use the web-based interfaces instead: https://www.english-corpora.org, https://www.corpusdelespanol.org/web-dial/, and https://www.corpusdoportugues.org/web-dial/.
The corpora join ProQuest’s TDMStudio as the primary licensed resources the Libraries make available to researchers doing large-scale language research. TDMStudio allows scholars to build large datasets of millions of documents and then use natural language processing tools to analyze, for example, newspaper articles on a specific topic or articles mentioning a particular person or company. The Full-Text Corpus Data go further by featuring words tagged for parts of speech, which can also help computational linguists and natural language processing researchers advance their fields.
Researchers interested in the corpora or in TDMStudio should contact the Libraries’ Research Data Services at data@library.columbia.edu.