Talbanken05

This is the home page for Talbanken05, a modernized version of Talbanken76, a Swedish treebank of roughly 300,000 words, constructed at Lund University in the 1970s. The treebank comes with no guarantee but is freely available for research and educational purposes as long as proper credit is given for the work done to produce the material (both in Lund and in Växjö).

[Download Talbanken05 (Version 1.1)]
NB: The very first release of version 1.1 had a bug in the indexing of the CoNLL version. If you downloaded it before 4 June 2006 (and want to use the CoNLL version), you should download it again. We apologize for the inconvenience and express our gratitude to Markus Dickinson, who spotted the bug.

The archive available for download contains the entire treebank (divided into sections P, G, IB and SD) in four versions:

MAMBA: Original syntactic and lexical annotation (original text encoding in ISO-8859-1)
FPS: Flat phrase structure annotation (TIGER-XML encoding in ISO-8859-1)
DPS: Deepened phrase structure annotation (TIGER-XML encoding in ISO-8859-1)
Dep: Dependency structure annotation (Malt-XML encoding in ISO-8859-1 and CoNLL-X format in UTF-8)

Relation to earlier versions:

From Version 1.0 to 1.1: Version 1.1 includes all information from the lexical layer in MAMBA in the part-of-speech annotation in FPS, DPS and Dep, whereas version 1.0 only includes basic parts-of-speech. In addition, a new version of Dep has been added in the format used in the CoNLL-X shared task on multilingual dependency parsing.

Below we give a brief description of the original treebank (Talbanken76), the process of conversion, and the three different annotation standards (FPS, DPS, Dep). The conversion is described more fully in the following papers:

Overview: Joakim Nivre, Jens Nilsson and Johan Hall (2006) Talbanken05: A Swedish Treebank with Phrase Structure and Dependency Annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), May 24-26, 2006, Genoa, Italy.
More detailed overview: Jens Nilsson, Johan Hall and Joakim Nivre (2005) MAMBA Meets TIGER: Reconstructing a Swedish Treebank from Antiquity. In Proceedings of the NODALIDA Special Session on Treebanks.
In-depth description of the reconstruction: Jens Nilsson and Johan Hall (2005) Reconstruction of the Swedish Treebank Talbanken. MSI report 05067. Växjö University: School of Mathematics and Systems Engineering.

Talbanken76

Talbanken76 was originally published as:

Jan Einarsson: Talbankens skriftspråkskonkordans (1976)
Jan Einarsson: Talbankens talspråkskonkordans (1976)

The data were collected in several projects at Lund University in the 1970s and the material is described in several publications:

Ulf Teleman: Manual för grammatisk beskrivning av talad och skriven svenska (MAMBA) (1974)
Margareta Westman: Bruksprosa (1974)
Nils Jörgensen: Meningsbyggnaden i talad svenska (1976)
Tor G Hultman and Margareta Westman: Gymnasistsvenska (1977)
Jan Einarsson: Talad och skriven svenska (1978)

Teleman (1974) describes the analysis principles, while the other books apply these principles to different authentic materials.

Talbanken76 consists of a written language part and a spoken language part of roughly equal size. The written language part in turn consists of two sections, the so-called professional prose section (P), with data from textbooks, brochures, newspapers, etc., and a collection of high school students' essays (G). The spoken language part also has two sections, interviews (IB) and conversations and debates (SD). Altogether, the corpus contains close to 300,000 running tokens.

The MAMBA annotation scheme consists of two layers, the first being a lexical analysis, consisting of part-of-speech information including morphological features, and the second being a syntactic analysis, in terms of grammatical functions. Both layers are flat in the sense that they consist of tags assigned to individual word tokens, but the syntactic layer also gives information about constituent structure, as exemplified in the annotation of the sentence Genom skattereformen införs individuell beskattning av arbetsinkomster (Through the tax reform, individual taxation of work income is introduced):

*GENOM                  PR        AAPR        
SKATTEREFORMEN          NNDDSS    AA          
INFÖRS                  VVPSSMPA  FV          
INDIVIDUELL             AJ        SSAT        
BESKATTNING             VN        SS          
AV                      PR        SSETPR      
ARBETSINKOMSTER         NN  SS    SSET        
.                       IP        IP

The first column of annotation is the lexical analysis, while the second column is the syntactic analysis. The grammatical subject of the sentence is the phrase individuell beskattning av arbetsinkomster (individual taxation of work income). Its head word beskattning (taxation) is assigned the simple tag SS for subject, while the pre-modifying adjective individuell (individual) is tagged SSAT, that is, SS followed by AT for adjectival modifier. In the post-modifying prepositional phrase, the noun arbetsinkomster (work income) is tagged SSET, with ET for post-modifier, and the preposition av (of) is tagged SSETPR, with PR for preposition.
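As a rough illustration of how the two annotation columns can be read, the following Python sketch parses the example sentence. The column offsets (24 and 34) are inferred from the example as printed on this page and are an assumption, as is the premise that every syntactic tag is a concatenation of two-character function codes. The layout of the distributed MAMBA files may differ.

```python
# Example sentence in the layout printed on this page (not the raw MAMBA files).
SAMPLE = """\
*GENOM                  PR        AAPR
SKATTEREFORMEN          NNDDSS    AA
INFÖRS                  VVPSSMPA  FV
INDIVIDUELL             AJ        SSAT
BESKATTNING             VN        SS
AV                      PR        SSETPR
ARBETSINKOMSTER         NN  SS    SSET
.                       IP        IP
"""

# Assumed column boundaries, read off the example above.
WORD_END, LEX_END = 24, 34


def split_pairs(tag):
    """Split a tag string into two-character codes, skipping blank positions."""
    pairs = [tag[i:i + 2] for i in range(0, len(tag), 2)]
    return [p for p in pairs if p.strip()]


def parse_line(line):
    """Return (word, lexical tag, list of function codes) for one token line."""
    word = line[:WORD_END].strip()
    lexical = line[WORD_END:LEX_END].rstrip()  # keep internal blanks, e.g. "NN  SS"
    functions = split_pairs(line[LEX_END:].strip())
    return word, lexical, functions


rows = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]

for word, lexical, functions in rows:
    print(f"{word:<16} {lexical:<9} {' > '.join(functions)}")
```

Read this way, the tag of av becomes the path SS > ET > PR: a preposition inside a post-modifier inside the subject.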

Tables explaining the categories used can be found here:

Conversion

The syntactic analysis in Talbanken76 is described by its creators as an eclectic combination of dependency grammar, topological field analysis and immediate constituent analysis. This makes it very suitable for conversion to both phrase structure and dependency annotation. The conversion has proceeded in three steps:

The original flat but multi-layered annotation is converted to a bare phrase structure annotation, i.e. a phrase structure with unlabeled nonterminal nodes, and edges labeled with grammatical functions. This conversion is rather straightforward given the partially hierarchical annotation exemplified above.
The bare phrase structure annotation is extended to a full phrase structure representation by labeling nonterminal nodes with syntactic categories. These categories are not part of the original annotation and have to be inferred from other parts of the annotation.
The full phrase structure annotation is converted to a dependency annotation using the standard technique with head-finding rules and preserving grammatical functions as edge labels. Head-finding rules are not part of the original annotation scheme and have to be constructed manually.
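The first of these steps can be sketched in Python for the example sentence from the Talbanken76 section. This is a simplified illustration, not the conversion program itself. It assumes that each syntactic tag has already been split into a path of two-character function codes, groups consecutive tokens that share a path prefix into one unlabeled nonterminal, and marks the token whose path ends at that nonterminal with the placeholder label HEAD (a hypothetical name chosen for this sketch). Two adjacent constituents with the same function would be merged by mistake, so a real conversion needs more information than this.

```python
# (word, path of function codes) for the example sentence.
TOKENS = [
    ("GENOM", ["AA", "PR"]),
    ("SKATTEREFORMEN", ["AA"]),
    ("INFÖRS", ["FV"]),
    ("INDIVIDUELL", ["SS", "AT"]),
    ("BESKATTNING", ["SS"]),
    ("AV", ["SS", "ET", "PR"]),
    ("ARBETSINKOMSTER", ["SS", "ET"]),
    (".", ["IP"]),
]


def build(tokens, depth=0):
    """Return a list of (edge label, child) pairs; a child is a word or a nested list."""
    children = []
    i = 0
    while i < len(tokens):
        word, path = tokens[i]
        if len(path) == depth:
            # The path ends at this nonterminal: treat the word as its head.
            children.append(("HEAD", word))
            i += 1
            continue
        label = path[depth]
        # Collect the run of consecutive tokens that carry the same code at this depth.
        j = i
        while j < len(tokens) and len(tokens[j][1]) > depth and tokens[j][1][depth] == label:
            j += 1
        span = tokens[i:j]
        if len(span) == 1 and len(path) == depth + 1:
            children.append((label, word))                    # single word, no nesting
        else:
            children.append((label, build(span, depth + 1)))  # unlabeled nonterminal
        i = j
    return children


tree = build(TOKENS)
for label, child in tree:
    print(label, child)
```

The nonterminals carry no category at this point. Only the edges are labeled, which is what the second step then extends.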

Phrase Structure Annotation

The phrase structure annotation, which is the outcome of the second conversion step, uses a conventional set of phrase types (S, NP, VP, etc.) in combination with the grammatical functions of the original MAMBA annotation. The representation allows discontinuous phrases, although discontinuous constituents are relatively rare in the treebank.

The phrase structure annotation comes in two versions, one with the flattest possible trees that can be extracted from the original annotation, called Flat Phrase Structure (FPS), and one where trees have been deepened by inserting, e.g., NPs within PPs and VPs within (larger) VPs, called Deepened Phrase Structure (DPS). In both cases, the conversion has necessitated the introduction of a small number of new syntactic functions.

Dependency Structure Annotation

The dependency annotation (Dep), which is the outcome of the third conversion step, consists of terminal nodes connected by edges labeled with the same syntactic functions as DPS, extended with the label ROOT for words that are not governed by another word in the dependency structure. The representation allows non-projective dependency structures, which are needed to capture discontinuous constituents.
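For readers who want to check how often non-projective structures occur, a minimal Python sketch of a projectivity test is given here. It assumes that a sentence is represented as a list of head positions, with 0 standing for an artificial root node, as in the HEAD column of the CoNLL-X format. The two toy structures are invented for illustration and are not taken from the treebank.

```python
def is_projective(heads):
    """heads[i] is the head of token i + 1; 0 denotes the artificial root."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    # A structure is non-projective if any two arcs cross.
    return not any(a < c < b < d for a, b in arcs for c, d in arcs)


def root_tokens(heads):
    """Tokens not governed by another word, i.e. those that would be labeled ROOT."""
    return [d for d, h in enumerate(heads, start=1) if h == 0]


projective = [2, 0, 2]         # toy structure: tokens 1 and 3 depend on token 2
non_projective = [3, 4, 0, 3]  # toy structure: arcs 1-3 and 2-4 cross

print(is_projective(projective), root_tokens(projective))
print(is_projective(non_projective), root_tokens(non_projective))
```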

The conversion from phrase structure to dependency structure uses a priority list for finding the head of a phrase.
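The general technique can be sketched as follows in Python. The priority list used here (first a child labeled HEAD, then a child labeled FV, otherwise the leftmost child) is purely hypothetical and is not the list used for Talbanken05. The tree is a hand-written bracketing of the example sentence, with token positions as leaves and HEAD as a placeholder label, so the resulting arcs need not match the analysis in the distributed Dep version.

```python
# Hypothetical priority list over edge labels; not the one used for Talbanken05.
PRIORITY = ["HEAD", "FV"]

# Hand-written bracketing of the example sentence; leaves are token positions 1..8.
TREE = [
    ("AA", [("PR", 1), ("HEAD", 2)]),
    ("FV", 3),
    ("SS", [("AT", 4), ("HEAD", 5), ("ET", [("PR", 6), ("HEAD", 7)])]),
    ("IP", 8),
]


def head_child(children):
    """Index of the child chosen as head, following PRIORITY, else the leftmost child."""
    for wanted in PRIORITY:
        for i, (label, _) in enumerate(children):
            if label == wanted:
                return i
    return 0


def to_dependencies(children, arcs):
    """Append (dependent, label, head) arcs and return the lexical head of the phrase."""
    lexical = [
        node if isinstance(node, int) else to_dependencies(node, arcs)
        for _, node in children
    ]
    h = head_child(children)
    for i, (label, _) in enumerate(children):
        if i != h:
            # The edge label of the phrase is preserved on the dependency arc.
            arcs.append((lexical[i], label, lexical[h]))
    return lexical[h]


arcs = []
root = to_dependencies(TREE, arcs)
arcs.append((root, "ROOT", 0))  # the ungoverned word gets the label ROOT

for dependent, label, head in sorted(arcs):
    print(dependent, label, head)
```

Each phrase contributes one arc per non-head child, so a sentence of eight tokens yields eight arcs including the ROOT arc.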

CPOSTAG, POSTAG and FEATS in CoNLL Format

In the CoNLL-X format, information from the lexical categories in the MAMBA annotation is distributed over three fields:

Coarse-grained part-of-speech (CPOSTAG)
Part-of-speech (POSTAG)
Features (FEATS)

In mapping the part-of-speech tags of the Dep format to the CoNLL-X shared task format, we have used the following principles:

The POSTAG field contains the two-character basic part-of-speech tag from the MAMBA annotation.
The FEATS field contains the additional two-character feature tags from the MAMBA annotation (if any), separated by vertical bars (|).
The CPOSTAG field contains the two-character basic part-of-speech tag from the MAMBA annotation or a one-character tag obtained by merging part-of-speech categories from the original annotation, so that, e.g., "V" generalizes over all verb tags ("AV", "BV", "HV", etc.). The mapping from POSTAG to CPOSTAG roughly follows the grouping of the MAMBA scheme itself and is summarized in this table.
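These three principles can be sketched in Python as follows. The sketch assumes that a MAMBA lexical tag is a sequence of two-character codes in which blank positions carry no feature, and it uses the underscore that the CoNLL-X format prescribes for an empty field. The coarse-grained table is deliberately partial: it contains only the three verb tags named above, and every other tag falls back to its two-character form, which will not always agree with the full mapping table.

```python
# Partial CPOSTAG table: only the verb tags named in the text above are listed.
COARSE = {"AV": "V", "BV": "V", "HV": "V"}


def conll_fields(lexical_tag):
    """Return (CPOSTAG, POSTAG, FEATS) for one MAMBA lexical tag."""
    pairs = [lexical_tag[i:i + 2] for i in range(0, len(lexical_tag), 2)]
    pairs = [p for p in pairs if p.strip()]  # assumption: blank positions carry no feature
    postag, features = pairs[0], pairs[1:]
    cpostag = COARSE.get(postag, postag)     # fall back to the two-character tag
    feats = "|".join(features) if features else "_"
    return cpostag, postag, feats


for tag in ["NNDDSS", "NN  SS", "PR", "HV"]:
    print(repr(tag), conll_fields(tag))
```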

NB: The version of Talbanken05 distributed in CoNLL-X shared task format here is not identical to the data set used in the shared task itself. The latter data set used the two-character tags in both the CPOSTAG and POSTAG fields and had no information in the FEATS field.