Will Roberts | German Subcategorisation Lexicon

sdewac-scf-db-mate.sec1.tsv.xz (431M, UTF-8 encoding)

This file contains 27,000,000 lines, each representing a single verb instance extracted from the SdeWaC corpus. It is distributed here freely for research purposes; if you intend to use this data in other ways, please contact me by email. If you use this resource in your research, please cite our paper:

Roberts, W., Egg, M., & Kordoni, V. (2014). Subcategorisation acquisition from raw text for a free word-order language. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014) (pp. 298–307). Gothenburg, Sweden: Association for Computational Linguistics.

Note: this file contains data from the first quarter of the SdeWaC corpus; the full corpus contains 108,785,070 verb instances, but space restrictions prevent me from hosting the file here. If you would like the full database, please contact me by email.
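Because the file is xz-compressed and large, it is usually easier to stream it than to decompress it to disk first. The following is a minimal sketch using Python's standard `lzma` module; the function name `iter_rows` is hypothetical, and the sketch assumes the file has no header line and uses plain newline terminators.

```python
import lzma
from typing import Iterator, List


def iter_rows(path: str) -> Iterator[List[str]]:
    """Yield each line of the xz-compressed TSV file as a list of fields.

    Assumption: there is no header line, so every line is a verb instance.
    """
    # Text mode ("rt") decompresses and decodes UTF-8 on the fly.
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            # Strip only the line terminator: trailing tabs are meaningful,
            # because they delimit empty fields.
            yield line.rstrip("\r\n").split("\t")
```

A call such as `for fields in iter_rows("sdewac-scf-db-mate.sec1.tsv.xz"): ...` then processes one verb instance at a time without holding the whole file in memory.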

Description

The counts in this lexicon are built using our subcategorisation frame (SCF) tagger described in the paper above; the tagger operates here using dependency parses of the SdeWaC corpus, as delivered by the most recent version of the mate-tools parser (instead of the Berkeley Parser + edge labeller combination introduced in our paper). Verb instances are tagged with a single SCF code derived from:

Schulte im Walde, S. (2002). A subcategorisation lexicon for German verbs induced from a lexicalised PCFG. In Proceedings of the 3rd Conference on Language Resources and Evaluation (LREC) (pp. 1351–1357).

Other syntactic information in the SCF database follows the annotation scheme used in the TIGER corpus.

For each verb instance in SdeWaC, the SCF database contains a single line. Each line is tab-separated; the first nine fields contain the following information:

1. The index of the sentence in SdeWaC (an integer from 0 to 45400445);
2. The index of the verb in the current sentence;
3. The STTS part of speech tag of the verb;
4. The lemmatised form of the verb;
5. The subcategorisation code of the verb;
6. A Boolean field containing True if the verb is in the passive voice and False otherwise;
7. A code indicating the clause type of the phrase containing the verb;
8. The index of the verb's governor in the sentence, if found: a governor can be a modal verb, an auxiliary verb for tense constructions, or an aspectual verb for certain kinds of aspectual constructions;
9. The lemmatised form of the governing verb, if found.
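The nine verb-level fields can be read into a small typed record. This is only a sketch: the names `VerbInstance` and `parse_verb_fields` are hypothetical, and it assumes that a missing governor is represented by two empty fields and that the passive flag is the literal string `True` or `False`.

```python
from typing import NamedTuple, Optional, Sequence


class VerbInstance(NamedTuple):
    sentence_index: int
    verb_index: int
    pos: str                        # STTS tag of the verb
    lemma: str
    scf: str                        # subcategorisation code
    passive: bool
    clause_type: str
    governor_index: Optional[int]   # None if no governor was found
    governor_lemma: Optional[str]   # None if no governor was found


def parse_verb_fields(fields: Sequence[str]) -> VerbInstance:
    """Convert the first nine tab-separated fields of a line into a record."""
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 fields, got {len(fields)}")
    governor_index, governor_lemma = fields[7], fields[8]
    return VerbInstance(
        sentence_index=int(fields[0]),
        verb_index=int(fields[1]),
        pos=fields[2],
        lemma=fields[3],
        scf=fields[4],
        passive=(fields[5] == "True"),
        clause_type=fields[6],
        # Assumption: empty strings mean that no governor was found.
        governor_index=int(governor_index) if governor_index else None,
        governor_lemma=governor_lemma or None,
    )
```

The input is a list of strings as obtained by splitting one line on tab characters; any fields beyond the ninth are ignored here.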

Fields 10 and up list information pertaining to the verb's complements in the sentence; the total number of fields varies with the number of complements to the verb. There are eleven fields per complement:

1. The index in the sentence of the complement's head;
2. The simplified part of speech of the complement (one of n, v, or p);
3. The full STTS part of speech code of the complement's head;
4. The complement's head;
5. The lemmatised form of the complement's head;
6. The edge label of the complement;
7. The index in the sentence of the complement's argument (only if the complement is prepositional);
8. The part of speech of the complement's argument;
9. The complement's argument;
10. The lemmatised form of the complement's argument;
11. Morphological analysis of the complement's argument.
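Since complements occupy consecutive groups of eleven fields after the nine verb-level fields, they can be recovered by slicing the remainder of the line into fixed-size chunks. The sketch below uses hypothetical key names and a hypothetical function `parse_complements`; it assumes that a final group may be shorter than eleven fields if trailing empty fields were dropped, and pads it with empty strings.

```python
from typing import Dict, List, Sequence

FIXED_FIELDS = 9  # verb-level fields that precede the complement groups

# Hypothetical key names, one per field of a complement group, in order.
COMPLEMENT_KEYS = (
    "head_index", "simple_pos", "head_pos", "head", "head_lemma",
    "edge_label", "arg_index", "arg_pos", "arg", "arg_lemma", "arg_morph",
)


def parse_complements(fields: Sequence[str]) -> List[Dict[str, str]]:
    """Split fields 10 and up into one dictionary per complement."""
    size = len(COMPLEMENT_KEYS)
    tail = list(fields[FIXED_FIELDS:])
    complements = []
    for start in range(0, len(tail), size):
        chunk = tail[start:start + size]
        if not any(chunk):
            continue  # skip a group that consists only of empty fields
        # Assumption: a short final group is padded with empty strings.
        chunk += [""] * (size - len(chunk))
        complements.append(dict(zip(COMPLEMENT_KEYS, chunk)))
    return complements
```

Values are kept as strings; for non-prepositional complements the five argument entries are simply empty.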

Example


Field number	Field value	Complement index	Explanation
1	`1`		Sentence 1 in SdeWaC.
2	`9`		Word 9 in sentence 1.
3	`VVFIN`		Verb's part of speech.
4	`hermachen`		Verb's lemma.
5	`npr:über.Acc`		Verb's subcategorisation code.
6	`False`		Verb is not passive.
7	`S-2`		Clause type of the phrase containing the verb.
8			Verb has no governor.
9			Verb has no governor.
10	`1`	1	Word 1 in sentence 1 is the head of the verb's first complement.
11	`n`		Complement is nominal.
12	`NN`		Complement head's part of speech.
13	`Henker`		Complement head.
14	`Henker`		Lemmatised complement head.
15	`SB`		Complement's edge tag (subject).
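16			Empty: complement is not prepositional.
17			Empty: complement is not prepositional.
18			Empty: complement is not prepositional.
19			Empty: complement is not prepositional.
20			Empty: complement is not prepositional.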
21	`10`	2	Word 10 in sentence 1 is the head of the verb's second complement.
22	`n`		Complement is nominal.
23	`PRF`		Complement head's part of speech.
24	`sich`		Complement head.
25	`sich`		Lemmatised complement head.
26	`OA`		Complement's edge tag (accusative direct object).
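27			Empty: complement is not prepositional.
28			Empty: complement is not prepositional.
29			Empty: complement is not prepositional.
30			Empty: complement is not prepositional.
31			Empty: complement is not prepositional.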
32	`11`	3	Word 11 in sentence 1 is the head of the verb's third complement.
33	`p`		Complement is prepositional.
34	`APPR`		Complement head's part of speech.
35	`über`		Complement head.
36	`über`		Lemmatised complement head.
37	`OP`		Complement's edge tag (prepositional object).
38	`13`		Word 13 in sentence 1 is the argument of the verb's third complement.
39	`NN`		Complement argument's part of speech.
40	`Kinder`		Complement argument.
41	`Kind`		Lemmatised complement argument.
42	`acc|pl|neut`		Complement argument is accusative, plural, neuter.
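With lines of this shape, lemma-level frame frequencies can be tallied from fields 4 and 5 (the verb's lemma and its subcategorisation code). The following is a rough sketch, not the procedure used in the paper; the function name `count_frames` is hypothetical, and each row is assumed to be a list of strings obtained by splitting a line on tabs.

```python
from collections import Counter
from typing import Iterable, Sequence


def count_frames(rows: Iterable[Sequence[str]]) -> Counter:
    """Count (lemma, SCF code) pairs over an iterable of split lines."""
    counts: Counter = Counter()
    for fields in rows:
        if len(fields) < 5:
            continue  # skip malformed or empty lines
        lemma, scf = fields[3], fields[4]  # fields 4 and 5, zero-based here
        counts[(lemma, scf)] += 1
    return counts
```

For the example line above, the key would be `("hermachen", "npr:über.Acc")`; `counts.most_common(10)` would then list the ten most frequent verb and frame pairs in whatever rows were supplied.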
