Westbury Lab Usenet Corpus: anonymized compilation of postings from 47,860 English-language newsgroups, 2005-2010 (40 GB)
Westbury Lab Wikipedia Corpus: snapshot of all articles in the English part of Wikipedia, taken in April 2010. It was processed, as described in detail below, to remove all links and irrelevant material (navigation text, etc.). The corpus is untagged, raw text. Used by Stanford NLP (1.8 GB).
: a corpus of manually constructed explanation graphs, explanatory role ratings, and an associated semistructured tablestore for a set of publicly available elementary science exam questions in the USA (8 MB)
Wikipedia Extraction (WEX): a processed dump of English-language Wikipedia (66 GB)