Talbanken05
This is the home page for Talbanken05, a modernized version of Talbanken76,
a Swedish treebank of roughly 300,000 words, constructed at Lund University
in the 1970s. The treebank comes with no guarantee but is freely available
for research and educational purposes as long as proper credit is given for
the work done to produce the material (both in Lund and in Växjö).
- [Download Talbanken05 (Version 1.1)]
- NB: The very first release of version 1.1 had a bug in the indexing of
the CoNLL version. If you downloaded it before 4 June 2006 (and want to use the CoNLL version),
you should download it again. We apologize for the inconvenience and thank
Markus Dickinson, who spotted the bug.
The archive available for download contains the entire treebank (divided into
sections P, G, IB and SD) in four versions:
- MAMBA: Original syntactic and lexical annotation (original text encoding in ISO-8859-1)
- FPS: Flat phrase structure annotation (TIGER-XML encoding in ISO-8859-1)
- DPS: Deepened phrase structure annotation (TIGER-XML encoding in ISO-8859-1)
- Dep: Dependency structure annotation (Malt-XML encoding in ISO-8859-1 and CoNLL-X format in UTF-8)
Relation to earlier versions:
- From Version 1.0 to 1.1: Version 1.1 includes all
information from the lexical layer in MAMBA in the part-of-speech annotation in FPS, DPS
and Dep, whereas version 1.0 only includes basic parts-of-speech. In addition, a new
version of Dep has been added in the format used in the
CoNLL-X shared task on multilingual dependency parsing.
Below we give a brief description of the original treebank (Talbanken76), the
process of conversion, and the three different annotation standards (FPS, DPS, Dep).
The conversion is described more fully in the following papers:
Talbanken76
Talbanken76 was originally published as:
- Jan Einarsson: Talbankens skriftspråkskonkordans (1976)
- Jan Einarsson: Talbankens talspråkskonkordans (1976)
The data were collected in several projects at Lund University in the 1970s
and the material is described in several publications:
- Ulf Teleman: Manual för grammatisk beskrivning av talad och skriven svenska (MAMBA) (1974)
- Margareta Westman: Bruksprosa (1974)
- Nils Jörgensen: Meningsbyggnaden i talad svenska (1976)
- Tor G Hultman och Margareta Westman: Gymnasistsvenska (1977)
- Jan Einarsson: Talad och skriven svenska (1978)
Teleman (1974) describes the analysis principles, while the other books apply these
principles to different authentic materials.
Talbanken76 consists of a written language part and a spoken language part
of roughly equal size. The written language part in turn consists of two sections,
the so-called professional prose section (P), with
data from textbooks, brochures, newspapers, etc., and a collection
of high school students' essays (G). The spoken language part also
has two sections, interviews (IB) and conversations and
debates (SD). Altogether, the corpus contains close to 300,000
running tokens.
The MAMBA annotation scheme consists of two layers, the first
being a lexical analysis, consisting of part-of-speech
information including morphological features, and the second
being a syntactic analysis, in terms of grammatical functions.
Both layers are flat in the sense that they consist of tags
assigned to individual word tokens, but the syntactic layer also
gives information about constituent structure, as exemplified in
the annotation of the sentence Genom skattereformen införs
individuell beskattning av arbetsinkomster (Through the tax
reform, individual taxation of work income is introduced):
*GENOM            PR        AAPR
SKATTEREFORMEN    NNDDSS    AA
INFÖRS            VVPSSMPA  FV
INDIVIDUELL       AJ        SSAT
BESKATTNING       VN        SS
AV                PR        SSETPR
ARBETSINKOMSTER   NN SS     SSET
.                 IP        IP
The first column of annotation is the lexical analysis, while
the second column is the syntactic analysis. The grammatical
subject of the sentence is the phrase individuell beskattning
av arbetsinkomster (individual taxation of work income),
where the head word beskattning (taxation) is assigned
the simple tag SS for subject, while the pre-modifying
adjective individuell (individual) is tagged SS and AT
for adjectival modifier; in the post-modifying prepositional
phrase, the noun arbetsinkomster (work income) is tagged
SS and ET for post-modifier, while the preposition av
(of) is tagged SS, ET and PR for preposition.
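As a rough illustration of how the syntactic tags in the example can be read, the sketch below splits a tag into its two-character function codes. The name split_codes is ours and not part of any Talbanken tooling, and the assumption that every code is exactly two characters wide is based only on the examples on this page.

```python
def split_codes(tag):
    """Split a syntactic tag into two-character codes.

    Assumption: every function code is exactly two characters wide,
    as in the example sentence on this page.
    """
    return [tag[i:i + 2] for i in range(0, len(tag), 2)]


# Syntactic tags copied from the example sentence above.
example = {
    "INDIVIDUELL": "SSAT",
    "BESKATTNING": "SS",
    "AV": "SSETPR",
    "ARBETSINKOMSTER": "SSET",
}

for word, tag in example.items():
    print(word, split_codes(tag))
```

Read from left to right, the codes describe a path from the outermost constituent (here the subject, SS) down to the function of the word itself.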
Tables explaining the categories used can be found here:
Conversion
The syntactic analysis in Talbanken76 is described by its
creators as an eclectic combination of
dependency grammar, topological field analysis and
immediate constituent analysis. This makes it
very suitable for conversion to both
phrase structure and dependency annotation. The conversion
has proceeded in three steps:
- The original flat but multi-layered annotation is converted
to a bare phrase structure annotation, i.e. a phrase
structure with unlabeled nonterminal nodes and edges labeled
with grammatical functions. This conversion is rather
straightforward given the partially hierarchical annotation
exemplified above.
- The bare phrase structure annotation is extended to a full
phrase structure representation by labeling nonterminal nodes
with syntactic categories. These categories are not part of
the original annotation and have to be inferred from
other parts of the annotation.
- The full phrase structure annotation is converted to a
dependency annotation using the standard technique based on
head-finding rules, preserving grammatical functions as edge
labels. Head-finding rules are not part of the original
annotation scheme and have to be constructed manually.
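The first step can be pictured roughly as follows. This is only a minimal sketch under simplifying assumptions, not the conversion program actually used: it treats each syntactic tag as a path of two-character function codes and groups adjacent tokens that share a path prefix. The names build_bare_tree and bracket are ours, and the sketch would wrongly merge two adjacent constituents that happen to carry the same function label.

```python
def build_bare_tree(tokens):
    """Nest (word, syntactic_tag) pairs into a bare phrase structure.

    Assumption: a tag is a sequence of two-character function codes,
    and adjacent tokens sharing a prefix belong to the same constituent.
    """
    root = {"label": None, "children": []}
    stack = [root]   # stack[0] is the root, stack[i] the open node at depth i
    open_path = []   # function labels of stack[1:]
    for word, tag in tokens:
        path = [tag[i:i + 2] for i in range(0, len(tag), 2)]
        # Length of the prefix shared with the currently open constituents.
        keep = 0
        while (keep < min(len(path), len(open_path))
               and path[keep] == open_path[keep]):
            keep += 1
        # Close constituents that do not continue, open the new ones.
        del stack[keep + 1:]
        del open_path[keep:]
        for label in path[keep:]:
            node = {"label": label, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
            open_path.append(label)
        stack[-1]["children"].append(word)
    return root


def bracket(node):
    """Render a tree as a bracketed string with function labels."""
    if isinstance(node, str):
        return node
    inner = " ".join(bracket(child) for child in node["children"])
    return f"({node['label']} {inner})" if node["label"] else inner


# The example sentence from the Talbanken76 section (tags copied from there).
sentence = [
    ("GENOM", "AAPR"), ("SKATTEREFORMEN", "AA"), ("INFÖRS", "FV"),
    ("INDIVIDUELL", "SSAT"), ("BESKATTNING", "SS"), ("AV", "SSETPR"),
    ("ARBETSINKOMSTER", "SSET"), (".", "IP"),
]
print(bracket(build_bare_tree(sentence)))
```

In this rendering the labels are edge labels (grammatical functions); the nonterminal nodes themselves carry no category, which is what the second step adds.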
Phrase Structure Annotation
The phrase structure annotation, which is the outcome of the second
conversion step, uses a conventional set of phrase types (S, NP, VP,
etc.) in combination with the grammatical functions of the original
MAMBA annotation. The representation allows discontinuous phrases,
although discontinuous constituents are relatively rare in the treebank.
The phrase structure annotation comes in two versions, one with the
flattest possible trees that can be extracted from the original
annotation, called Flat Phrase Structure (FPS),
and one where trees have been deepened by inserting,
e.g., NPs within PPs and VPs within (larger) VPs,
called Deepened Phrase Structure (DPS).
In both cases, the conversion has necessitated the introduction
of a small number of new syntactic functions.
Dependency Structure Annotation
The dependency annotation (Dep), which is the outcome of the third conversion
step, consists of terminal nodes connected by edges labeled with
the same syntactic functions as DPS, extended with the label ROOT
for words that are not governed by another word in the dependency
structure. The representation allows non-projective
dependency structures, which are needed to capture discontinuous
constituents.
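To make the notion of non-projectivity concrete, here is a small sketch that lists the non-projective arcs of a dependency tree. It uses the common definition that an arc from head h to dependent d is projective if h dominates every token strictly between them. The input convention (a list of 1-based head indices, with 0 for words labeled ROOT) and the function names are assumptions made for this example, not something prescribed by the treebank.

```python
def dominates(heads, head, tok):
    """True if `head` is `tok` itself or one of its ancestors."""
    steps = 0
    while tok != 0 and steps <= len(heads):  # step limit guards against cycles
        if tok == head:
            return True
        tok = heads[tok - 1]
        steps += 1
    return False


def non_projective_arcs(heads):
    """Return (head, dependent) pairs that are non-projective.

    heads[i] is the head of token i + 1; 0 marks a root token.
    """
    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        if head == 0:
            continue  # root tokens have no governing word
        lo, hi = sorted((head, dep))
        for between in range(lo + 1, hi):
            if not dominates(heads, head, between):
                result.append((head, dep))
                break
    return result


print(non_projective_arcs([2, 0, 2]))     # a projective tree
print(non_projective_arcs([2, 0, 4, 1]))  # the arc from token 1 to token 4 spans the root
```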
The conversion from phrase structure to dependency structure
uses a priority list for finding the head of a phrase.
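The general idea of head finding with a priority list can be sketched as below. Everything specific in it is hypothetical: the Node class, the HEAD_PRIORITY table, the use of HD as a label for the head child, and the choice of the noun as head of a PP are illustrative assumptions, not the actual rules used for Talbanken05. Words stand in for token identifiers to keep the output readable; real data would need token indices.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    function: str          # edge label from the parent
    category: str = ""     # phrase type; empty for terminals
    word: str = ""         # word form; empty for nonterminals
    children: list = field(default_factory=list)


# Hypothetical priority lists: for each phrase type, functions to try in order.
HEAD_PRIORITY = {"S": ["FV"], "NP": ["HD"], "PP": ["HD"]}


def head_child(node):
    """Pick the head child using the priority list, else the leftmost child."""
    for label in HEAD_PRIORITY.get(node.category, []):
        for child in node.children:
            if child.function == label:
                return child
    return node.children[0]


def lexical_head(node):
    """Follow head children down to a terminal and return its word."""
    while node.children:
        node = head_child(node)
    return node.word


def dependencies(node, governor=None, label="ROOT", arcs=None):
    """Collect (dependent, label, governor) triples for the tree under node."""
    arcs = [] if arcs is None else arcs
    if not node.children:
        arcs.append((node.word, label, governor))
        return arcs
    head = head_child(node)
    head_word = lexical_head(node)
    for child in node.children:
        if child is head:
            # The head child inherits the phrase's governor and function.
            dependencies(child, governor, label, arcs)
        else:
            dependencies(child, head_word, child.function, arcs)
    return arcs


tree = Node("", "S", children=[
    Node("AA", "PP", children=[Node("PR", word="genom"),
                               Node("HD", word="skattereformen")]),
    Node("FV", word="införs"),
    Node("SS", "NP", children=[
        Node("AT", word="individuell"),
        Node("HD", word="beskattning"),
        Node("ET", "PP", children=[Node("PR", word="av"),
                                   Node("HD", word="arbetsinkomster")]),
    ]),
    Node("IP", word="."),
])
for arc in dependencies(tree):
    print(arc)
```

Note how the grammatical function of a phrase (for example SS) ends up as the label of the arc to its lexical head, while the word without a governor receives ROOT.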
CPOSTAG, POSTAG and FEATS in CoNLL Format
In the CoNLL-X format, information
from the lexical categories in the MAMBA annotation is distributed over three fields:
- Coarse-grained part-of-speech (CPOSTAG)
- Part-of-speech (POSTAG)
- Features (FEATS)
In mapping the part-of-speech tags of the Dep format to the CoNLL-X shared task format,
we have used the following principles:
- The POSTAG field contains the two-character basic part-of-speech tag from the
MAMBA annotation.
- The FEATS field contains the additional two-character feature tags from the
MAMBA annotation (if any), separated by vertical bars (|).
- The CPOSTAG field contains the two-character basic part-of-speech tag from
the MAMBA annotation or a one-character tag obtained by merging part-of-speech
categories from the original annotation, so that, e.g., "V" generalizes over all
verb tags ("AV", "BV", "HV", etc.). The mapping from POSTAG to CPOSTAG
roughly follows the grouping of the MAMBA scheme itself and is summarized in this
table.
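The three principles above can be sketched as a small function. This is an illustration only: the COARSE table contains just the three verb tags named above and is therefore incomplete, the function name is ours, and stripping blanks inside a lexical tag is an assumption about how tags such as "NN SS" should be handled. The underscore for an empty FEATS field follows the CoNLL-X convention for missing values.

```python
# Partial, illustrative POSTAG -> CPOSTAG table (only tags named on this page).
COARSE = {"AV": "V", "BV": "V", "HV": "V"}


def to_conll_fields(lexical_tag):
    """Return (CPOSTAG, POSTAG, FEATS) for a MAMBA lexical tag.

    Assumptions: the tag is a sequence of two-character codes, the first
    being the basic part of speech; blanks inside the tag are ignored.
    """
    compact = "".join(lexical_tag.split())
    codes = [compact[i:i + 2] for i in range(0, len(compact), 2)]
    postag, features = codes[0], codes[1:]
    cpostag = COARSE.get(postag, postag)  # fall back to the basic tag
    feats = "|".join(features) if features else "_"
    return cpostag, postag, feats


print(to_conll_fields("NNDDSS"))
print(to_conll_fields("HV"))
```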
NB: The version of Talbanken05 distributed in CoNLL-X shared task format here
is not identical to the data set used in the shared task itself. The latter data set
used the two-character tags in both the CPOSTAG and POSTAG fields and had no information
in the FEATS field.