The evaluation set of the Swedish Treebank consists of two parts, the Talbanken part and the SUC part, each containing about 20,000 tokens. The evaluation set thus covers about 20% of the available data for Talbanken, but only about 2% for SUC. In total, the evaluation set contains 41,040 tokens. The following tables show how the documents in the Talbanken and SUC parts of the evaluation set are distributed over genres.
Talbanken part
File | # Tokens | Genre |
P108 | 1441 | Community information |
P110 | 1446 | Community information |
P114 | 499 | Community information |
P122 | 1650 | Community information |
P204 | 1457 | Press |
P210 | 1127 | Press |
P213 | 393 | Press |
P214 | 638 | Press |
P218 | 1417 | Press |
P301 | 2331 | Textbooks |
P307 | 2078 | Textbooks |
P311 | 934 | Textbooks |
P412 | 936 | Debate articles |
P415 | 1989 | Debate articles |
P416 | 2046 | Debate articles |
Total | 20382 | |
SUC part
File | # Tokens | Genre |
aa05 | 2056 | Press |
aa09 | 2073 | Press |
ba07 | 2100 | Press |
ea10 | 2194 | Skills, trades and hobbies |
ea12 | 2017 | Skills, trades and hobbies |
ja06 | 2123 | Learned and scientific writing |
kk14 | 2067 | Imaginative prose |
kk44 | 2016 | Imaginative prose |
kl07 | 2008 | Imaginative prose, crime |
kn08 | 2004 | Imaginative prose, trivia |
Total | 20658 | |
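
The totals in the two tables can be checked, and the token counts grouped by genre, with a short script. The following is a minimal sketch, not part of the treebank distribution: the row data is copied from the tables above, while the names `TALBANKEN`, `SUC` and `tokens_by_genre` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# (file, tokens, genre) rows copied from the Talbanken table above.
TALBANKEN = [
    ("P108", 1441, "Community information"),
    ("P110", 1446, "Community information"),
    ("P114", 499, "Community information"),
    ("P122", 1650, "Community information"),
    ("P204", 1457, "Press"),
    ("P210", 1127, "Press"),
    ("P213", 393, "Press"),
    ("P214", 638, "Press"),
    ("P218", 1417, "Press"),
    ("P301", 2331, "Textbooks"),
    ("P307", 2078, "Textbooks"),
    ("P311", 934, "Textbooks"),
    ("P412", 936, "Debate articles"),
    ("P415", 1989, "Debate articles"),
    ("P416", 2046, "Debate articles"),
]

# (file, tokens, genre) rows copied from the SUC table above.
SUC = [
    ("aa05", 2056, "Press"),
    ("aa09", 2073, "Press"),
    ("ba07", 2100, "Press"),
    ("ea10", 2194, "Skills, trades and hobbies"),
    ("ea12", 2017, "Skills, trades and hobbies"),
    ("ja06", 2123, "Learned and scientific writing"),
    ("kk14", 2067, "Imaginative prose"),
    ("kk44", 2016, "Imaginative prose"),
    ("kl07", 2008, "Imaginative prose, crime"),
    ("kn08", 2004, "Imaginative prose, trivia"),
]


def tokens_by_genre(rows):
    """Sum the token counts of all files that share a genre label."""
    totals = defaultdict(int)
    for _file, tokens, genre in rows:
        totals[genre] += tokens
    return dict(totals)


# Per-part totals, to compare against the "Total" rows of the tables.
talbanken_total = sum(tokens for _file, tokens, _genre in TALBANKEN)
suc_total = sum(tokens for _file, tokens, _genre in SUC)

if __name__ == "__main__":
    print("Talbanken:", talbanken_total, tokens_by_genre(TALBANKEN))
    print("SUC:", suc_total, tokens_by_genre(SUC))
    print("Evaluation set:", talbanken_total + suc_total)
```

Grouping is done on the genre label exactly as it appears in the tables, so "Imaginative prose, crime" and "Imaginative prose, trivia" are counted separately from "Imaginative prose".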