These files list all words occurring 5 times or more in the
SdeWaC corpus (no stop word or punctuation filtering is performed).
Words are not lemmatised or otherwise modified, except that all
digits in words are replaced with "2" (so that, e.g., 030-2093-2311
becomes 222-2222-2222). The -pos-totals- variant counts each word
with its POS tag; the -totals- variant throws away the POS tags.
The files are encoded in UTF-8.
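
The digit replacement described above can be sketched in Python as follows. This is only an illustration, not the script used to build the files: the function name normalise_digits is hypothetical, and it assumes that "digits" means the ASCII characters 0-9.

```python
import re


def normalise_digits(word: str) -> str:
    """Replace every ASCII digit in `word` with "2" (assumed behaviour)."""
    return re.sub(r"[0-9]", "2", word)


print(normalise_digits("030-2093-2311"))  # 222-2222-2222
```
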
sdewac-wcount-totals-thresh.tsv.xz contains 1,545,828 lines in
tab-separated format; each line contains an integer frequency
followed by a word. Lines are sorted in order of decreasing
frequency, then by reverse alphabetical order on words. The total
number of tokens represented in the file is 874,536,282 (884,838,511
before thresholding, meaning that the UNKNOWN token occurs
10,302,229 times in SdeWaC, or about 1% of the time).
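
A minimal sketch of how this file might be read, assuming the layout described above (frequency, a tab, then the word) and a local copy of the file. The helper names parse_counts and total_tokens are hypothetical. The last lines reproduce the arithmetic behind the UNKNOWN figure.

```python
import lzma
from typing import Iterable, Iterator, Tuple


def parse_counts(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (frequency, word) pairs from 'freq<TAB>word' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines, if any
        freq, word = line.split("\t", 1)
        yield int(freq), word


def total_tokens(path: str) -> int:
    """Sum the frequency column of an xz-compressed count file."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        return sum(freq for freq, _ in parse_counts(handle))


# Arithmetic behind the UNKNOWN figure quoted above.
before, after = 884_838_511, 874_536_282
unknown = before - after
print(unknown, f"{unknown / before:.2%}")  # 10302229 1.16%
```

If the description is accurate, total_tokens("sdewac-wcount-totals-thresh.tsv.xz") should return 874536282.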
sdewac-wcount-pos-totals-thresh.tsv.xz contains 1,618,720 lines;
each line contains an integer frequency, a word, and a POS tag
according to the STTS inventory. There are 873,804,517 tokens
counted in the file.
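
The three-column layout can be handled in much the same way. The sketch below sums frequencies per POS tag. It assumes the columns are tab-separated and appear in the order given above (frequency, word, tag). The function name tokens_per_tag is hypothetical, and the sample lines are invented for illustration, not taken from the file.

```python
from collections import Counter
from typing import Iterable


def tokens_per_tag(lines: Iterable[str]) -> Counter:
    """Sum frequencies per POS tag from 'freq<TAB>word<TAB>tag' lines."""
    totals: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines, if any
        freq, _word, tag = line.split("\t")
        totals[tag] += int(freq)
    return totals


# Invented sample lines using STTS tag names.
sample = ["10\tder\tART\n", "8\tder\tPRELS\n", "6\tHaus\tNN\n", "5\tdie\tART\n"]
print(tokens_per_tag(sample))
```
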
This file lists all non-punctuation words in the SdeWaC corpus, lemmatised using the most recent version of the mate-tools parser. No thresholding is performed. The file is encoded in UTF-8 and contains 7,857,501 lines representing 773,505,667 tokens (meaning that SdeWaC contains 111,332,844 tokens tagged as punctuation, about 12.6% of the total number of tokens).
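
The punctuation figure follows from the pre-threshold token total given for the word-count files (884,838,511). A short check of that arithmetic, with hypothetical variable names:

```python
all_tokens = 884_838_511   # SdeWaC total before thresholding
non_punct = 773_505_667    # tokens represented in the lemma file
punct = all_tokens - non_punct
print(punct, f"{punct / all_tokens:.1%}")  # 111332844 12.6%
```
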