Malt-XML and Malt-TAB

Malt-XML

Malt-XML is an XML-based representation format for dependency treebanks. It is based on the following simple principles of representation:

A treebank is a sequence of sentences.
A sentence is a sequence of words.
A word is an element with up to five attributes:
1. id = Unique id within the sentence. (required)
2. form = Word form (string). (required)
3. postag = Part-of-speech tag. (optional)
4. head = Syntactic head (word id). (optional)
5. deprel = Dependency relation to head. (optional)

The representation is based on the assumption that each word has at most one head. By convention, word ids start at 1 and a root word has head="0" and deprel="ROOT". A dependency tree for the Swedish sentence "Genom skattereformen införs individuell beskattning (särbeskattning) av arbetsinkomster." can be represented as follows:

<sentence id="2" user="malt" date="">
  <word id="1" form="Genom" postag="pp" head="3" deprel="ADV"/>
  <word id="2" form="skattereformen" postag="nn.utr.sin.def.nom" head="1" deprel="PR"/>
  <word id="3" form="införs" postag="vb.prs.sfo" head="0" deprel="ROOT"/>
  <word id="4" form="individuell" postag="jj.pos.utr.sin.ind.nom" head="5" deprel="ATT"/>
  <word id="5" form="beskattning" postag="nn.utr.sin.ind.nom" head="3" deprel="SUB"/>
  <word id="6" form="(" postag="pad" head="5" deprel="IP"/>
  <word id="7" form="särbeskattning" postag="nn.utr.sin.ind.nom" head="5" deprel="APP"/>
  <word id="8" form=")" postag="pad" head="5" deprel="IP"/>
  <word id="9" form="av" postag="pp" head="5" deprel="ATT"/>
  <word id="10" form="arbetsinkomster" postag="nn.utr.plu.ind.nom" head="9" deprel="PR"/>
  <word id="11" form="." postag="mad" head="3" deprel="IP"/>
</sentence>
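
As a rough illustration of the conventions above, the following Python sketch reads one sentence element with the standard library's xml.etree.ElementTree and checks the head values. The function names read_sentence and check_heads are hypothetical, and the embedded sentence is shortened to its first three words for brevity, so this is an assumption-laden sketch and not a reference implementation.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the example sentence (first three words only).
SENTENCE_XML = """<sentence id="2" user="malt" date="">
  <word id="1" form="Genom" postag="pp" head="3" deprel="ADV"/>
  <word id="2" form="skattereformen" postag="nn.utr.sin.def.nom" head="1" deprel="PR"/>
  <word id="3" form="införs" postag="vb.prs.sfo" head="0" deprel="ROOT"/>
</sentence>"""


def read_sentence(xml_text):
    """Return the words of one sentence as dicts; optional attributes become None."""
    sentence = ET.fromstring(xml_text)
    words = []
    for w in sentence.iter("word"):
        head = w.get("head")
        words.append({
            "id": int(w.attrib["id"]),      # required
            "form": w.attrib["form"],       # required
            "postag": w.get("postag"),      # optional
            "head": int(head) if head is not None else None,  # optional
            "deprel": w.get("deprel"),      # optional
        })
    return words


def check_heads(words):
    """True if every given head is 0 (root) or the id of another word in the sentence."""
    ids = {w["id"] for w in words}
    for w in words:
        head = w["head"]
        if head is None:
            continue
        if head == w["id"] or (head != 0 and head not in ids):
            return False
    return True


words = read_sentence(SENTENCE_XML)
roots = [w["form"] for w in words if w["head"] == 0]
print(roots, check_heads(words))
```

Because each word element carries a single head attribute, the "at most one head" assumption holds by construction; the check only verifies that head values point at existing words.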

The tagsets used for parts-of-speech and dependency relations must be specified in the header of the XML document. An example document can be found here. An XML schema for Malt-XML treebanks can be found here.

Malt-TAB

Malt-TAB is a plain-text representation used mainly by MaltParser. It contains a subset of the features in Malt-XML, and attributes are defined implicitly by their position. Each word is represented on one line, with attribute values separated by tabs. The required order of attributes is as follows:

form (required) < postag (required) < head (optional) < deprel (optional)

Although head and deprel are optional, they must either both be included or both be omitted. (Normally, all four columns are present in the input when training the parser and in the output when parsing, while only form and postag are present in the input when parsing.) Note also that the id attribute is not represented explicitly; a word's id is implied by its position in the sentence, and the values in the head column refer to these positions. Words in a sentence are separated by one newline; sentences are separated by one additional newline (that is, by a blank line). A dependency tree for the Swedish sentence "Genom skattereformen införs individuell beskattning (särbeskattning) av arbetsinkomster." can be represented as follows:

Genom	pp	3	ADV
skattereformen	nn.utr.sin.def.nom	1	PR
införs	vb.prs.sfo	0	ROOT
individuell	jj.pos.utr.sin.ind.nom	5	ATT
beskattning	nn.utr.sin.ind.nom	3	SUB
(	pad	5	IP
särbeskattning	nn.utr.sin.ind.nom	5	APP
)	pad	5	IP
av	pp	5	ATT
arbetsinkomster	nn.utr.plu.ind.nom	9	PR
.	mad	3	IP
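
The positional rules described above can be sketched in Python as follows. The function name read_malt_tab and the SAMPLE text are hypothetical, and the sketch assumes exactly one tab between columns and a blank line between sentences; it is meant as an illustration, not as MaltParser's own reader.

```python
# Two sentences: the first with all four columns, the second with form and postag only.
SAMPLE = (
    "Genom\tpp\t3\tADV\n"
    "skattereformen\tnn.utr.sin.def.nom\t1\tPR\n"
    "införs\tvb.prs.sfo\t0\tROOT\n"
    "\n"
    "Genom\tpp\n"
    "skattereformen\tnn.utr.sin.def.nom\n"
)


def read_malt_tab(text):
    """Split Malt-TAB text into sentences, each a list of word dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        # head and deprel must both be present or both be omitted.
        if len(cols) not in (2, 4):
            raise ValueError(f"expected 2 or 4 tab-separated columns: {line!r}")
        word = {
            "id": len(current) + 1,  # implicit id: position within the sentence
            "form": cols[0],
            "postag": cols[1],
            "head": None,
            "deprel": None,
        }
        if len(cols) == 4:
            word["head"] = int(cols[2])
            word["deprel"] = cols[3]
        current.append(word)
    if current:
        sentences.append(current)
    return sentences


sentences = read_malt_tab(SAMPLE)
print(len(sentences), sentences[0][2])
```

Assigning ids by position is what makes the head column interpretable, since the format itself stores no id.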

An example document can be found here.
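
Since Malt-TAB holds a subset of what Malt-XML holds, a Malt-XML sentence can in principle be written out as Malt-TAB by dropping the id and emitting the remaining attributes in the required order. The Python sketch below shows one way this might look. The function name sentence_to_malt_tab is hypothetical, and the sketch assumes that word ids run 1, 2, 3, ... in document order (so that head values stay valid once ids become positional) and rejects words without a postag, because postag is required in Malt-TAB but optional in Malt-XML.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the example sentence (first three words only).
XML_EXCERPT = """<sentence id="2" user="malt" date="">
  <word id="1" form="Genom" postag="pp" head="3" deprel="ADV"/>
  <word id="2" form="skattereformen" postag="nn.utr.sin.def.nom" head="1" deprel="PR"/>
  <word id="3" form="införs" postag="vb.prs.sfo" head="0" deprel="ROOT"/>
</sentence>"""


def sentence_to_malt_tab(xml_text):
    """Render one Malt-XML sentence as Malt-TAB lines (one word per line)."""
    sentence = ET.fromstring(xml_text)
    lines = []
    for position, w in enumerate(sentence.iter("word"), start=1):
        # Assumption: ids equal positions, otherwise head references would break.
        if int(w.attrib["id"]) != position:
            raise ValueError("word ids must be consecutive and start at 1")
        postag = w.get("postag")
        if postag is None:
            raise ValueError("Malt-TAB requires a postag for every word")
        head, deprel = w.get("head"), w.get("deprel")
        if (head is None) != (deprel is None):
            raise ValueError("head and deprel must both be present or both be omitted")
        cols = [w.attrib["form"], postag]
        if head is not None:
            cols += [head, deprel]
        lines.append("\t".join(cols))
    return "\n".join(lines) + "\n"


tab_text = sentence_to_malt_tab(XML_EXCERPT)
print(tab_text, end="")
```

When writing several sentences to one file, the outputs would be joined with an additional newline between them.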