Department of Translation, The Chinese University of Hong Kong

Centre for Translation Technology

Online Corpora

1. Academia Sinica Balanced Corpus of Modern Chinese 現代漢語平衡語料庫

The Academia Sinica Balanced Corpus of Modern Chinese, known as the Sinica Corpus for short, is the first balanced corpus of Modern Chinese with part-of-speech tagging. The latest version, Sinica Corpus 4.0, has a target size of 10 million words. It became available for licensing in 2010, and its web search interface opened to the public in 2013.

Link: http://asbc.iis.sinica.edu.tw/

2. British National Corpus (BNC)

The British National Corpus (BNC) is a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century.

Link: http://www.natcorp.ox.ac.uk/

3. Corpus of Canadian English (Strathy)

The Strathy Corpus of Canadian English contains 50 million words from more than 1,100 spoken, fiction, magazine, newspaper, and academic texts.

Link: https://www.english-corpora.org/can/

4. Corpus of Contemporary American English (COCA)

The Corpus of Contemporary American English (COCA) is a large, genre-balanced corpus of American English. It contains more than 560 million words of text (20 million words for each year from 1990 to 2017), divided equally among spoken, fiction, popular magazines, newspapers, and academic texts.

Link: https://www.english-corpora.org/coca/
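
The size quoted for COCA can be checked with simple arithmetic. The short Python sketch below multiplies the yearly size by the number of years covered, using only the figures given in the description; the function and variable names are illustrative and are not part of any corpus tool.

```python
# Figures taken from the COCA description above; names are illustrative only.
FIRST_YEAR = 1990
LAST_YEAR = 2017
WORDS_PER_YEAR = 20_000_000


def total_words(first_year: int, last_year: int, words_per_year: int) -> int:
    """Total size for an inclusive range of years at a constant yearly size."""
    years = last_year - first_year + 1  # both end years are counted
    return years * words_per_year


coca_total = total_words(FIRST_YEAR, LAST_YEAR, WORDS_PER_YEAR)
print(f"{coca_total:,} words")
```

Counting both 1990 and 2017 gives 28 years, which matches the figure of 560 million words stated in the description.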

5. Corpus of Historical American English (COHA)

The Corpus of Historical American English (COHA) is the largest structured corpus of historical English. The corpus contains more than 400 million words of text from the 1810s-2000s (which makes it 50-100 times as large as other comparable historical corpora of English) and the corpus is balanced by genre decade by decade.

Link: https://www.english-corpora.org/coha/

6. Corpus of Modern Scottish Writing

The Corpus of Modern Scottish Writing (CMSW) is an electronic corpus of written and printed texts from the period 1700-1945, complementing the Helsinki Corpus of Older Scots (1450-1700) and the Scottish Corpus of Texts & Speech (1945-present day). CMSW comprises over 350 documents, totalling approximately 5.5 million words of text.

Link: https://www.scottishcorpus.ac.uk/cmsw/search/

7. Corpus of US Supreme Court Opinions

The Corpus of US Supreme Court Opinions contains approximately 130 million words in 32,000 Supreme Court decisions from the 1790s to the present.

Link: https://www.english-corpora.org/scotus/

8. Linguistic Variation in Chinese Speech Communities (LIVAC) Synchronous Corpus

LIVAC takes a rigorous, regular "Windows" approach to processing and filtering large volumes of media texts from representative communities in the Pan-Chinese region, including Hong Kong, Macau, Taipei, Singapore, Shanghai, Beijing, Guangzhou, and Shenzhen.

Link: http://www.livac.org/search.php?lang=tc

9. News on the Web (NOW) Corpus

The NOW corpus contains 8.7 billion words of data from web-based newspapers and magazines from 2010 to the present time. More importantly, the corpus grows by about 140-160 million words of data each month (from about 300,000 new articles), or about 1.8 billion words each year.

Link: https://www.english-corpora.org/now/
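
The yearly growth quoted for the NOW corpus follows from its monthly growth. The Python sketch below converts the monthly range given in the description into a yearly range; the names are illustrative, and the midpoint is only a rough summary of the range, not an official figure.

```python
# Monthly growth range taken from the NOW description above; names are illustrative.
MONTHLY_LOW = 140_000_000
MONTHLY_HIGH = 160_000_000
MONTHS_PER_YEAR = 12


def yearly_growth(monthly_words: int) -> int:
    """Words added in a year at a constant monthly rate."""
    return monthly_words * MONTHS_PER_YEAR


low = yearly_growth(MONTHLY_LOW)
high = yearly_growth(MONTHLY_HIGH)
midpoint = (low + high) // 2  # rough central estimate of the yearly range

print(f"{low:,} to {high:,} words per year (midpoint {midpoint:,})")
```

The range works out to between 1.68 and 1.92 billion words per year, and its midpoint agrees with the figure of about 1.8 billion stated in the description.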

10. Peking University CCL Online Corpus 北京大學語料庫

The CCL Corpus was developed by the Centre for Chinese Linguistics and the Institute of Computational Linguistics of Peking University. It contains 477 million words of modern and ancient Chinese.

Link: http://ccl.pku.edu.cn:8080/ccl_corpus/index.jsp

11. Scottish Corpus of Texts & Speech

The Scottish Corpora project has created large electronic corpora of written and spoken texts for the languages of Scotland. The corpus contains nearly 4.6 million words of text, with audio recordings to accompany many of the spoken texts.

Link: https://www.scottishcorpus.ac.uk/search/

12. The Modern Chinese Languages Corpus 國家語委現代漢語語料庫

The Modern Chinese Languages Corpus was developed by the National Languages Committee of China. It is a large-scale balanced corpus of Modern Chinese.

Link: http://corpus.zhonghuayuwen.org/index.aspx

13. The Movie Corpus

The Movie Corpus contains 200 million words of data in more than 25,000 movies from the 1930s to the present. The corpus also allows you to look at variation over time (1930s-1950s to 1990s-2010s) and variation between dialects (e.g. American and British English).

Link: https://www.english-corpora.org/movies/

14. The TV Corpus

The TV Corpus contains 325 million words of data in 75,000 TV episodes from the 1950s to the current time. The corpus also allows you to look at variation over time (1950s-1970s to 1990s-2010s) and variation between dialects (e.g. American and British English).

Link: https://www.english-corpora.org/tv/

15. Time Magazine Corpus

The TIME corpus is based on 100 million words of text in about 275,000 articles published in TIME magazine from 1923 to 2006, and it serves as a great resource for examining changes in American English over this period.

Link: https://www.english-corpora.org/time/