This file contains 27,000,000 lines, each representing a single verb instance extracted from the SdeWaC corpus. It is distributed here freely for research purposes; if you intend to use this data in other ways, please contact me using the email link at right. If you use this resource in your research, please cite our paper:
Roberts, W., Egg, M., & Kordoni, V. (2014). Subcategorisation acquisition from raw text for a free word-order language. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014) (pp. 298–307). Gothenburg, Sweden: Association for Computational Linguistics.
Note: this file contains data from the first quarter of the SdeWaC corpus; the full corpus contains 108,785,070 verb instances, but space restrictions prevent me from hosting the file here. If you would like the full database, please contact me by email.
The counts in this lexicon are built using our subcategorisation frame (SCF) tagger described in the paper above; the tagger operates here using dependency parses of the SdeWaC corpus, as delivered by the most recent version of the mate-tools parser (instead of the Berkeley Parser + edge labeller combination introduced in our paper). Verb instances are tagged with a single SCF code derived from:
Schulte im Walde, S. (2002). A subcategorisation lexicon for German verbs induced from a lexicalised PCFG. In Proceedings of the 3rd Conference on Language Resources and Evaluation (LREC) (pp. 1351–1357).
Other syntactic information in the SCF database follows the annotation scheme used in the TIGER corpus.
For each verb instance in SdeWaC, the SCF database contains a single-line entry. Each entry is tab-separated; the columns contain the following information:
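Fields 1 to 9 describe the verb itself: the sentence number in SdeWaC (0 to 45400445); the index of the verb within that sentence; the verb's part of speech; the verb's lemma; the verb's SCF code; the passive flag (True if the verb is in the passive voice and False otherwise); the clause type of the phrase containing the verb; and two fields for the verb's governor, which are empty when the verb has no governor.

Fields 10 and up list information pertaining to the verb's complements in the sentence; the total number of fields varies with the number of complements to the verb. There are eleven fields per complement: the index of the complement head within the sentence; the complement type (n, v, or p); the head's part of speech; the head's word form; the head's lemma; the complement's edge tag; and five fields for the complement's argument (its index, part of speech, word form, lemma, and morphological features), which are empty when the complement has no argument.

The following example entry shows the fields in order:

Field number | Field value | Complement index | Explanation |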
---|---|---|---|
1 | 1 | | Sentence 1 in SdeWaC. |
2 | 9 | | Word 9 in sentence 1. |
3 | VVFIN | | Verb's part of speech. |
4 | hermachen | | Verb's lemma. |
5 | npr:über.Acc | | Verb's subcategorisation code. |
6 | False | | Verb is not passive. |
7 | S-2 | | Clause type of the phrase containing the verb. |
8 | | | Verb has no governor. |
9 | | | Verb has no governor. |
10 | 1 | 1 | Word 1 in sentence 1 is the head of the verb's first complement. |
11 | n | | Complement is nominal. |
12 | NN | | Complement head's part of speech. |
13 | Henker | | Complement head. |
14 | Henker | | Lemmatised complement head. |
15 | SB | | Complement's edge tag (subject). |
16 | | | |
17 | | | |
18 | | | |
19 | | | |
20 | | | |
21 | 10 | 2 | Word 10 in sentence 1 is the head of the verb's second complement. |
22 | n | | Complement is nominal. |
23 | PRF | | Complement head's part of speech. |
24 | sich | | Complement head. |
25 | sich | | Lemmatised complement head. |
26 | OA | | Complement's edge tag (accusative direct object). |
27 | | | |
28 | | | |
29 | | | |
30 | | | |
31 | | | |
32 | 11 | 3 | Word 11 in sentence 1 is the head of the verb's third complement. |
33 | p | | Complement is prepositional. |
34 | APPR | | Complement head's part of speech. |
35 | über | | Complement head. |
36 | über | | Lemmatised complement head. |
37 | OP | | Complement's edge tag (prepositional object). |
38 | 13 | | Word 13 in sentence 1 is the argument of the verb's third complement. |
39 | NN | | Complement argument's part of speech. |
40 | Kinder | | Complement argument. |
41 | Kind | | Lemmatised complement argument. |
42 | acc\|pl\|neut | | Complement argument is accusative, plural, neuter. |
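
The layout in the table above (nine verb fields followed by eleven fields per complement) could be read with a short Python sketch such as the one below. This is only an illustration, not part of the distributed resource: the names `VerbEntry`, `Complement`, and `parse_entry` are hypothetical, the example line is rebuilt by hand from the table, and the code assumes that empty fields are stored as empty strings between tabs. The two governor fields and the five argument fields are kept as raw strings.

```python
from dataclasses import dataclass
from typing import List

VERB_FIELDS = 9         # fields 1-9 describe the verb
COMPLEMENT_FIELDS = 11  # eleven fields per complement


@dataclass
class Complement:
    head_index: int      # word index of the complement head
    kind: str            # complement type: n, v, or p
    head_pos: str
    head: str
    head_lemma: str
    edge_tag: str
    argument: List[str]  # five raw argument fields (may all be empty)


@dataclass
class VerbEntry:
    sentence: int
    word: int
    pos: str
    lemma: str
    scf: str
    passive: bool
    clause_type: str
    governor: List[str]  # fields 8 and 9, kept as raw strings
    complements: List[Complement]


def parse_entry(line: str) -> VerbEntry:
    """Parse one tab-separated line into a VerbEntry (hypothetical helper)."""
    # Strip only the newline so that trailing empty fields survive.
    fields = line.rstrip("\n").split("\t")
    if len(fields) < VERB_FIELDS:
        raise ValueError(f"expected at least {VERB_FIELDS} fields, got {len(fields)}")
    # Assumption: if trailing empty fields were trimmed, pad the last complement.
    remainder = (len(fields) - VERB_FIELDS) % COMPLEMENT_FIELDS
    if remainder:
        fields += [""] * (COMPLEMENT_FIELDS - remainder)

    complements = []
    for start in range(VERB_FIELDS, len(fields), COMPLEMENT_FIELDS):
        chunk = fields[start:start + COMPLEMENT_FIELDS]
        complements.append(Complement(
            head_index=int(chunk[0]),
            kind=chunk[1],
            head_pos=chunk[2],
            head=chunk[3],
            head_lemma=chunk[4],
            edge_tag=chunk[5],
            argument=chunk[6:],
        ))

    return VerbEntry(
        sentence=int(fields[0]),
        word=int(fields[1]),
        pos=fields[2],
        lemma=fields[3],
        scf=fields[4],
        passive=(fields[5] == "True"),
        clause_type=fields[6],
        governor=fields[7:9],
        complements=complements,
    )


# Example line rebuilt from the table above (42 fields).
EXAMPLE_LINE = "\t".join([
    "1", "9", "VVFIN", "hermachen", "npr:über.Acc", "False", "S-2", "", "",
    "1", "n", "NN", "Henker", "Henker", "SB", "", "", "", "", "",
    "10", "n", "PRF", "sich", "sich", "OA", "", "", "", "", "",
    "11", "p", "APPR", "über", "über", "OP", "13", "NN", "Kinder", "Kind", "acc|pl|neut",
])

if __name__ == "__main__":
    entry = parse_entry(EXAMPLE_LINE)
    print(entry.lemma, entry.scf, [c.edge_tag for c in entry.complements])
```

Because the file is large, it is probably best to stream it line by line (for example, `for line in open(path, encoding="utf-8")`, where the UTF-8 encoding is an assumption) and call `parse_entry` on each line rather than loading everything into memory.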