This file contains 27,000,000 lines, each representing a single verb instance extracted from the SdeWaC corpus. It is distributed here freely for research purposes; if you intend to use this data in other ways, please contact me using the email link at right. If you use this resource in your research, please cite our paper:

Roberts, W., Egg, M., & Kordoni, V. (2014). Subcategorisation acquisition from raw text for a free word-order language. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014) (pp. 298–307). Gothenburg, Sweden: Association for Computational Linguistics.

Note: this file contains data from the first quarter of the SdeWaC corpus; the full corpus contains 108,785,070 verb instances, but space restrictions prevent me from hosting the file here. If you would like the full database, please contact me by email.

Description

The counts in this lexicon were produced with the subcategorisation frame (SCF) tagger described in the paper above. Here, the tagger operates on dependency parses of the SdeWaC corpus delivered by the most recent version of the mate-tools parser, rather than on the output of the Berkeley Parser + edge labeller combination introduced in our paper. Each verb instance is tagged with a single SCF code derived from:

Schulte im Walde, S. (2002). A subcategorisation lexicon for German verbs induced from a lexicalised PCFG. In Proceedings of the 3rd Conference on Language Resources and Evaluation (LREC) (pp. 1351–1357).

Other syntactic information in the SCF database follows the annotation scheme used in the TIGER corpus.

For each verb instance in SdeWaC, the SCF database contains a single-line entry. This entry is in tab-separated format; the first nine fields contain the following information:

  1. The index of the sentence in SdeWaC (an integer from 0 to 45400445);
  2. The index of the verb in the current sentence;
  3. The STTS part of speech tag of the verb;
  4. The lemmatised form of the verb;
  5. The subcategorisation code of the verb;
  6. A Boolean field containing True if the verb is in the passive voice and False otherwise;
  7. A code indicating the clause type of the phrase containing the verb;
  8. The index of the verb's governor in the sentence, if found: a governor can be a modal verb, an auxiliary verb for tense constructions, or an aspectual verb for certain kinds of aspectual constructions;
  9. The lemmatised form of the governing verb, if found.
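
As an illustration of the layout above, the nine fixed fields of one line could be read as in the following minimal Python sketch. The names `VerbInstance` and `parse_fixed_fields` are hypothetical and are not part of the distributed resource; the sketch also assumes that a missing governor is stored as an empty field, as in the example further down this page.

```python
from typing import NamedTuple, Optional


class VerbInstance(NamedTuple):
    """The nine fixed fields of one line (hypothetical container)."""
    sentence_index: int
    verb_index: int
    pos: str
    lemma: str
    scf: str
    passive: bool
    clause_type: str
    governor_index: Optional[int]
    governor_lemma: Optional[str]


def parse_fixed_fields(line: str) -> VerbInstance:
    # Strip only the line terminator: trailing tabs delimit empty fields.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 9:
        raise ValueError("expected at least 9 tab-separated fields")
    return VerbInstance(
        sentence_index=int(fields[0]),
        verb_index=int(fields[1]),
        pos=fields[2],
        lemma=fields[3],
        scf=fields[4],
        passive=(fields[5] == "True"),
        clause_type=fields[6],
        # Assumption: an empty string means that no governor was found.
        governor_index=int(fields[7]) if fields[7] else None,
        governor_lemma=fields[8] or None,
    )


# First nine fields of the example entry documented on this page.
example = "1\t9\tVVFIN\thermachen\tnpr:über.Acc\tFalse\tS-2\t\t"
instance = parse_fixed_fields(example)
print(instance.lemma, instance.scf, instance.passive)
```

Splitting on the tab character alone, rather than on arbitrary whitespace, keeps empty fields in place so that the field positions stay aligned.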

Fields 10 and up list information pertaining to the verb's complements in the sentence; the total number of fields varies with the number of complements to the verb. There are eleven fields per complement:

  1. The index in the sentence of the complement's head;
  2. The simplified part of speech of the complement (one of n, v, or p);
  3. The full STTS part of speech code of the complement's head;
  4. The complement's head;
  5. The lemmatised form of the complement's head;
  6. The edge label of the complement;
  7. The index in the sentence of the complement's argument (only if the complement is prepositional);
  8. The part of speech of the complement's argument;
  9. The complement's argument;
  10. The lemmatised form of the complement's argument;
  11. Morphological analysis of the complement's argument.
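
Because the complement fields repeat in groups of eleven, they can be read by slicing everything after field 9 into fixed-width chunks. The sketch below is one possible way to do this in Python; `Complement` and `parse_complements` are hypothetical names, and the handling of empty fields (treated as "not present") and of stripped trailing tabs (padded back) are assumptions rather than documented behaviour of the file.

```python
from typing import List, NamedTuple, Optional

FIXED_FIELDS = 9        # fields 1 to 9 describe the verb itself
COMPLEMENT_WIDTH = 11   # eleven fields per complement


class Complement(NamedTuple):
    """One complement of a verb instance (hypothetical container)."""
    head_index: int
    simple_pos: str
    head_pos: str
    head: str
    head_lemma: str
    edge_label: str
    arg_index: Optional[int]
    arg_pos: Optional[str]
    arg: Optional[str]
    arg_lemma: Optional[str]
    arg_morph: Optional[str]


def parse_complements(line: str) -> List[Complement]:
    fields = line.rstrip("\r\n").split("\t")
    rest = fields[FIXED_FIELDS:]
    # Assumption: if trailing empty fields were stripped, pad them back
    # so that the last complement still has eleven fields.
    rest += [""] * (-len(rest) % COMPLEMENT_WIDTH)
    complements = []
    for start in range(0, len(rest), COMPLEMENT_WIDTH):
        g = rest[start:start + COMPLEMENT_WIDTH]
        complements.append(Complement(
            head_index=int(g[0]),
            simple_pos=g[1],
            head_pos=g[2],
            head=g[3],
            head_lemma=g[4],
            edge_label=g[5],
            # Argument fields are empty when there is no argument.
            arg_index=int(g[6]) if g[6] else None,
            arg_pos=g[7] or None,
            arg=g[8] or None,
            arg_lemma=g[9] or None,
            arg_morph=g[10] or None,
        ))
    return complements


# A line with nine verb fields and one prepositional complement,
# using values from the example entry documented on this page.
line = "\t".join([
    "1", "9", "VVFIN", "hermachen", "npr:über.Acc", "False", "S-2", "", "",
    "11", "p", "APPR", "über", "über", "OP",
    "13", "NN", "Kinder", "Kind", "acc|pl|neut",
])
for comp in parse_complements(line):
    print(comp.edge_label, comp.head_lemma, comp.arg_lemma)
```

A verb with no complements simply yields an empty list, since nothing follows field 9.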

Example

Field number  Field value   Complement index  Explanation
1             1                               Sentence 1 in SdeWaC.
2             9                               Word 9 in sentence 1.
3             VVFIN                           Verb's part of speech.
4             hermachen                       Verb's lemma.
5             npr:über.Acc                    Verb's subcategorisation code.
6             False                           Verb is not passive.
7             S-2                             Clause type of the phrase containing the verb.
8             (empty)                         Verb has no governor.
9             (empty)                         Verb has no governor.
10            1             1                 Word 1 in sentence 1 is the head of the verb's first complement.
11            n                               Complement is nominal.
12            NN                              Complement head's part of speech.
13            Henker                          Complement head.
14            Henker                          Lemmatised complement head.
15            SB                              Complement's edge tag (subject).
16            (empty)
17            (empty)
18            (empty)
19            (empty)
20            (empty)
21            10            2                 Word 10 in sentence 1 is the head of the verb's second complement.
22            n                               Complement is nominal.
23            PRF                             Complement head's part of speech.
24            sich                            Complement head.
25            sich                            Lemmatised complement head.
26            OA                              Complement's edge tag (accusative direct object).
27            (empty)
28            (empty)
29            (empty)
30            (empty)
31            (empty)
32            11            3                 Word 11 in sentence 1 is the head of the verb's third complement.
33            p                               Complement is prepositional.
34            APPR                            Complement head's part of speech.
35            über                            Complement head.
36            über                            Lemmatised complement head.
37            OP                              Complement's edge tag (prepositional object).
38            13                              Word 13 in sentence 1 is the argument of the verb's third complement.
39            NN                              Complement argument's part of speech.
40            Kinder                          Complement argument.
41            Kind                            Lemmatised complement argument.
42            acc|pl|neut                     Complement argument is accusative, plural, neuter.
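
Since every line is one verb instance, per-verb SCF counts can be tallied by streaming through the file and counting (lemma, SCF code) pairs from fields 4 and 5. The following Python sketch shows the idea on the example entry above; `count_frames` is a hypothetical helper, not part of the distributed resource.

```python
from collections import Counter
from typing import Iterable


def count_frames(lines: Iterable[str]) -> Counter:
    """Count (verb lemma, SCF code) pairs over tab-separated lines."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        counts[(fields[3], fields[4])] += 1
    return counts


# The 42 fields of the example entry, joined with tabs.
EXAMPLE_LINE = "\t".join([
    "1", "9", "VVFIN", "hermachen", "npr:über.Acc", "False", "S-2", "", "",
    "1", "n", "NN", "Henker", "Henker", "SB", "", "", "", "", "",
    "10", "n", "PRF", "sich", "sich", "OA", "", "", "", "", "",
    "11", "p", "APPR", "über", "über", "OP",
    "13", "NN", "Kinder", "Kind", "acc|pl|neut",
])

counts = count_frames([EXAMPLE_LINE])
print(counts[("hermachen", "npr:über.Acc")])  # prints 1
```

Because `count_frames` accepts any iterable of lines, an open file handle can be passed in directly, so the file never has to be loaded into memory as a whole. The text encoding to use when opening the file is not stated here and would need to be checked against the download.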