The Evaluation Set of the Swedish Treebank

The evaluation set of the Swedish Treebank consists of two parts, the Talbanken part and the SUC part, each comprising about 20,000 tokens. This means that the evaluation set covers about 20% of the available data for Talbanken, but only about 2% for SUC. In total, the evaluation set contains 41,040 tokens. The following tables show how the documents are distributed over genres in the Talbanken and SUC parts of the evaluation set.

Talbanken part

File    # Tokens    Genre
P108        1441    Community information
P110        1446    Community information
P114         499    Community information
P122        1650    Community information
P204        1457    Press
P210        1127    Press
P213         393    Press
P214         638    Press
P218        1417    Press
P301        2331    Textbooks
P307        2078    Textbooks
P311         934    Textbooks
P412         936    Debate articles
P415        1989    Debate articles
P416        2046    Debate articles
Total      20382

SUC part

File    # Tokens    Genre
aa05        2056    Press
aa09        2073    Press
ba07        2100    Press
ea10        2194    Skills, trades and hobbies
ea12        2017    Skills, trades and hobbies
ja06        2123    Learned and scientific writing
kk14        2067    Imaginative prose
kk44        2016    Imaginative prose
kl07        2008    Imaginative prose, crime
kn08        2004    Imaginative prose, trivia
Total      20658
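
The totals above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, with the file names, token counts and genres transcribed by hand from the two tables. The names TALBANKEN, SUC, total_tokens and tokens_per_genre are illustrative assumptions and are not part of the treebank distribution.

```python
from collections import Counter

# (file, tokens, genre) rows transcribed from the Talbanken table.
TALBANKEN = [
    ("P108", 1441, "Community information"),
    ("P110", 1446, "Community information"),
    ("P114", 499, "Community information"),
    ("P122", 1650, "Community information"),
    ("P204", 1457, "Press"),
    ("P210", 1127, "Press"),
    ("P213", 393, "Press"),
    ("P214", 638, "Press"),
    ("P218", 1417, "Press"),
    ("P301", 2331, "Textbooks"),
    ("P307", 2078, "Textbooks"),
    ("P311", 934, "Textbooks"),
    ("P412", 936, "Debate articles"),
    ("P415", 1989, "Debate articles"),
    ("P416", 2046, "Debate articles"),
]

# (file, tokens, genre) rows transcribed from the SUC table.
SUC = [
    ("aa05", 2056, "Press"),
    ("aa09", 2073, "Press"),
    ("ba07", 2100, "Press"),
    ("ea10", 2194, "Skills, trades and hobbies"),
    ("ea12", 2017, "Skills, trades and hobbies"),
    ("ja06", 2123, "Learned and scientific writing"),
    ("kk14", 2067, "Imaginative prose"),
    ("kk44", 2016, "Imaginative prose"),
    ("kl07", 2008, "Imaginative prose, crime"),
    ("kn08", 2004, "Imaginative prose, trivia"),
]


def total_tokens(rows):
    """Sum the token counts over all files in one part."""
    return sum(tokens for _file, tokens, _genre in rows)


def tokens_per_genre(rows):
    """Return a Counter mapping each genre to its summed token count."""
    counts = Counter()
    for _file, tokens, genre in rows:
        counts[genre] += tokens
    return counts


talbanken_total = total_tokens(TALBANKEN)
suc_total = total_tokens(SUC)
grand_total = talbanken_total + suc_total

print("Talbanken:", talbanken_total)  # 20382
print("SUC:", suc_total)              # 20658
print("Total:", grand_total)          # 41040

# Per-genre breakdown for each part.
for name, rows in (("Talbanken", TALBANKEN), ("SUC", SUC)):
    for genre, tokens in tokens_per_genre(rows).items():
        print(f"{name}\t{genre}\t{tokens}")
```

The per-genre breakdown is not given in the tables themselves; it is derived here only by summing the listed file sizes.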