Will Roberts | sdewac-wcount-totals-thresh.tsv

sdewac-wcount-totals-thresh.tsv.xz (6.4 MB)
sdewac-wcount-pos-totals-thresh.tsv.xz (7.0 MB)

These files list all words occurring 5 or more times in the SdeWaC corpus (no stop word or punctuation filtering is performed). Words are not lemmatised or otherwise modified, except that every digit in a word is replaced with "2" (so that, e.g., 030-2093-2311 becomes 222-2222-2222). The -pos-totals- variant counts each word together with its POS tag; the -totals- variant discards the POS tags.
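To look up a word in these files, the same digit replacement has to be applied to the query first. The following is a minimal sketch of that step; the function name normalize_digits is hypothetical, and it assumes only the ASCII digits 0-9 are affected, which the description above does not state explicitly.

```python
import re

def normalize_digits(word):
    """Replace every ASCII digit in `word` with "2".

    Assumption: only 0-9 are replaced; how other Unicode digits were
    handled when the files were built is not documented.
    """
    return re.sub(r"[0-9]", "2", word)

print(normalize_digits("030-2093-2311"))
```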

The files are encoded in UTF-8.

sdewac-wcount-totals-thresh.tsv.xz contains 1,545,828 lines in tab-separated format; each line contains an integer frequency followed by a word. Lines are sorted by decreasing frequency, with ties broken by reverse alphabetical order on words. The total number of tokens represented in the file is 874,536,282. Before thresholding the total is 884,838,511, so the words removed by the threshold (the UNKNOWN token) account for 10,302,229 tokens in SdeWaC, or about 1.2% of the total.
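The format described above can be read directly from the compressed file with the Python standard library. This is only a sketch: the helper names iter_counts and total_tokens are hypothetical, and the code assumes exactly one tab between the frequency and the word on every line.

```python
import lzma

def iter_counts(path):
    """Yield (frequency, word) pairs from an xz-compressed count file."""
    # newline="\n" stops Python from translating other line endings,
    # so unusual characters inside a word are left untouched.
    with lzma.open(path, "rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            # Split only on the first tab: frequency first, then the word.
            freq, word = line.rstrip("\n").split("\t", 1)
            yield int(freq), word

def total_tokens(path):
    """Sum the frequency column of a count file."""
    return sum(freq for freq, _ in iter_counts(path))
```

Called as total_tokens("sdewac-wcount-totals-thresh.tsv.xz"), this should reproduce the token total given above if the assumptions about the format hold.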

sdewac-wcount-pos-totals-thresh.tsv.xz contains 1,618,720 lines; each line contains an integer frequency, a word, and a POS tag according to the STTS inventory. There are 873,804,517 tokens counted in the file.
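Rows from the POS-tagged file can be collapsed to per-word counts by summing over tags. The sketch below works on in-memory (frequency, word, tag) tuples; the function name collapse_pos and the sample rows are invented for illustration and are not taken from the files. Since the two files report different token totals, collapsed counts should not be expected to match the -totals- file exactly.

```python
from collections import Counter

def collapse_pos(rows):
    """Sum frequencies over POS tags, giving one count per word."""
    totals = Counter()
    for freq, word, _tag in rows:
        totals[word] += freq
    return totals

# Invented sample rows in the documented column order:
# frequency, word, STTS tag.
sample_rows = [
    (12, "sein", "VAINF"),
    (9, "sein", "PPOSAT"),
    (5, "Haus", "NN"),
]
word_totals = collapse_pos(sample_rows)
print(word_totals.most_common())
```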

sdewac-wcount-lemmas.tsv.xz (29 MB)

This file lists all non-punctuation words in the SdeWaC corpus, lemmatised using the most recent version of the mate-tools parser. No thresholding is performed. The file is encoded in UTF-8 and contains 7,857,501 lines representing 773,505,667 tokens (meaning that SdeWaC contains 111,332,844 tokens tagged as punctuation, about 12.6% of the total number of tokens).
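The derived figures on this page follow from three totals stated in the text: 884,838,511 tokens before thresholding, 874,536,282 after, and 773,505,667 in the lemma file. A short check of that arithmetic, with variable names chosen here for illustration:

```python
# Totals as stated in the file descriptions.
all_tokens = 884_838_511          # SdeWaC tokens before thresholding
thresholded_tokens = 874_536_282  # tokens in the -totals- file
lemma_tokens = 773_505_667        # non-punctuation tokens in the lemma file

# Tokens removed by the frequency threshold (the UNKNOWN token).
unknown_tokens = all_tokens - thresholded_tokens
# Tokens tagged as punctuation.
punct_tokens = all_tokens - lemma_tokens

unknown_share = 100 * unknown_tokens / all_tokens
punct_share = 100 * punct_tokens / all_tokens

print(unknown_tokens, round(unknown_share, 1))
print(punct_tokens, round(punct_share, 1))
```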
