CoNLL-U format tests

This file contains an informal mixture of tests for various aspects of the CoNLL-U format.

Valid examples:

Multiword token ("haven't")

1 I I PRON PRN Num=Sing|Per=1 2 nsubj _ _
2-3 haven't _ _ _ _ _ _ _ _
2 have have VERB VB Tens=Pres 0 root _ _
3 not not ADV RB _ 2 neg _ _
4 a a DET DT _ 5 det _ _
5 clue clue NOUN NN Num=Sing 2 dobj _ _
6 . . PUNCT . _ 2 punct _ _
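
As a rough illustration of how a reader might handle the multiword line above, the following sketch splits one tab-separated token line into its ten fields and checks whether the ID is a range. The field names and both function names are assumptions made for this example, not taken from any particular parser.

```python
import re

# Assumed names for the ten CoNLL-U columns (illustrative only).
FIELDS = ["id", "form", "lemma", "upostag", "xpostag",
          "feats", "head", "deprel", "deps", "misc"]


def parse_token_line(line):
    """Split one tab-separated token line into a dict of the ten fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def is_multiword(token):
    """Return True when the ID is a range such as '2-3'."""
    return re.fullmatch(r"[0-9]+-[0-9]+", token["id"]) is not None
```

A line with too few or too many fields raises `ValueError`, which would cover cases such as extra-field.conll below.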

Additional dependencies (DEPS field)

1 They they PRON PRN Case=Nom|Num=Plur 2 nsubj 4:nsubj _
2 buy buy VERB VBP Num=Plur|Per=3|Tense=Pres 0 root _ _
3 and and CONJ CC _ 2 cc _ _
4 sell sell VERB VBP Num=Plur|Per=3|Tense=Pres 2 conj _ _
5 books book NOUN NNS Num=Plur 2 dobj 4:dobj _
6 . . PUNCT . _ 2 punct _ _
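
The DEPS values in this example (`4:nsubj`, `4:dobj`) are `HEAD:REL` pairs. A minimal sketch of reading such a value, assuming pairs are separated by `|` and that `_` means "no additional dependencies"; the function name `parse_deps` is hypothetical:

```python
import re


def parse_deps(deps):
    """Parse a DEPS value such as '2:nsubj|4:nsubj' into (head, relation) pairs."""
    if deps == "_":
        return []
    pairs = []
    for item in deps.split("|"):
        match = re.fullmatch(r"([0-9]+):(.+)", item)
        if match is None:
            raise ValueError(f"malformed DEPS item: {item!r}")
        pairs.append((int(match.group(1)), match.group(2)))
    return pairs
```

With this sketch, values like `2` or `xxx` (as in invalid-deps-syntax.conll and malformed_deps.conll below) raise `ValueError`.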

Multiple sentences

1 LONDRA Londra NOUN _ _ 0 root _ _
2 . . . _ _ 1 punct _ _

# This is a comment
1 Gas gas NOUN _ Gen=M|Num=N 0 root _ _
2-3 dalla _ _ _ _ _ _ _ _
2 da da ADP _ _ 1 adpmod _ _
3 la la DET _ Gen=F|Num=S 4 det _ _
4 statua statua NOUN _ Gen=F|Num=S 2 adpobj _ _
5 . . . _ _ 1 punct _ _

1 Evacuata evacuare VERB _ Gen=F|Mod=P|Num=S 3 partmod _ _
2 la il DET _ Gen=F|Num=S 3 det _ _
3 Tate Tate NOUN _ _ 0 root _ _
4 Gallery Gallery NOUN _ _ 3 mwe _ _
5 . . PUNCT _ _ 3 punct _ _
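
One possible way to separate input like this into sentences is to treat blank lines as boundaries and lines starting with `#` as comments. The sketch below assumes exactly that; `split_sentences` is a hypothetical helper and does no validation of the token lines themselves.

```python
def split_sentences(text):
    """Yield (comments, token_lines) for each blank-line-separated block."""
    comments, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current block, if there is one.
            if comments or tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line)
        else:
            tokens.append(line)
    # Emit the final block when the input lacks a trailing blank line.
    if comments or tokens:
        yield comments, tokens
```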

Multiple sentences and multiword token

# give the toys to the children
1     donner    donner   VERB   _   VerbForm=Inf               0   root   _   give
2     les       le       DET    _   Definite=Def|Number=Plur   3   det    _   the
3     jouets    jouet    NOUN   _   Gender=Masc|Number=Plur    1   dobj   _   toys
4-5   aux       _        _      _   _                          _   _      _   _
4     à         à        ADP    _   _                          6   case   _   to
5     les       le       DET    _   Definite=Def|Number=Plur   6   det    _   the
6     enfants   enfant   NOUN   _   Gender=Masc|Number=Plur    1   nmod   _   children

# now the parallel English tree
1     give       donner   VERB   _   VerbForm=Inf               0   root   _   give
2     the        le       DET    _   Definite=Def|Number=Plur   3   det    _   the
3     toys       jouet    NOUN   _   Gender=Masc|Number=Plur    1   dobj   _   toys
4     to         à        ADP    _   _                          6   case   _   to
5     the        le       DET    _   Definite=Def|Number=Plur   6   det    _   the
6     children   enfant   NOUN   _   Gender=Masc|Number=Plur    1   nmod   _   children

Sentence labels

# sentence-label 1
1 LONDRA Londra NOUN _ _ 0 root _ _
2 . . . _ _ 1 punct _ _

# sentence-label A
1 Gas gas NOUN _ Gen=M|Num=N 0 root _ _
2 . . . _ _ 1 punct _ _

# sentence-label B4
1 Tate Tate NOUN _ _ 0 root _ _
2 Gallery Gallery NOUN _ _ 1 mwe _ _
3 . . PUNCT _ _ 1 punct _ _
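
The labels here are carried in comment lines of the form `# sentence-label X`. A small sketch of pulling the label out of a sentence's comments, assuming that exact prefix; `sentence_label` is a name invented for this example:

```python
def sentence_label(comments):
    """Return the label from a '# sentence-label X' comment, or None."""
    prefix = "# sentence-label "
    for comment in comments:
        if comment.startswith(prefix):
            return comment[len(prefix):].strip()
    return None
```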

Custom styles

1 They they PRON PRN Case=Nom|Num=Plur 2 nsubj 4:nsubj _
2 buy buy VERB VBP Num=Plur|Per=3|Tense=Pres 0 root _ _
3 and and CONJ CC _ 2 cc _ _
4 sell sell VERB VBP Num=Plur|Per=3|Tense=Pres 2 conj _ _
5 books book NOUN NNS Num=Plur 2 dobj 4:dobj _
6 . . PUNCT . _ 2 punct _ _

Acceptable examples with loose parsing

Otherwise valid, but two spaces instead of single tab as field separator and no terminal newline:
1  I  I  PRON  PRN  Num=Sing|Per=1  2  nsubj  _  _
2-3  haven't  _  _  _  _  _  _  _  _
2  have  have  VERB  VB  Tens=Pres  0  root  _  _
3  not  not  ADV  RB  _  2  neg  _  _
4  a  a  DET  DT  _  5  det  _  _
5  clue  clue  NOUN  NN  Num=Sing  2  dobj  _  _
6  .  .  PUNCT  .  _  2  punct  _  _
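
A sketch of what "loose parsing" of the field separator could look like: in strict mode only a single tab separates fields, while in loose mode a run of two or more spaces is accepted as well. The function name and the `strict` flag are assumptions for illustration.

```python
import re


def split_fields(line, strict=True):
    """Split a token line into fields.

    Strict mode accepts only a tab as separator; loose mode also
    accepts a run of two or more spaces.
    """
    line = line.rstrip("\r\n")
    if strict:
        return line.split("\t")
    return re.split(r"\t| {2,}", line)
```

Because only the line terminator is stripped, a trailing tab still produces an extra empty field, so a case like trailing-tab.conll remains detectable by counting fields.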

Non-valid examples:

Non-valid examples from UD tools test cases.

ambiguous-feature.conll

# not valid: feature definition is malformed / ambiguous (two "=" characters)
1 non-valid non-valid NOUN SP A=B=C 0 ROOT _ _

duplicate-feature.conll

# not valid: feature name occurs twice
1 non-valid non-valid NOUN SP Gen=M|Gen=M 0 ROOT _ _

duplicate-id.conll

# not valid: IDs must be sequential integers (1, 2, ...)
1 valid valid NOUN SP _ 0 ROOT _ _
1 . . . FS _ 1 p _ _

duplicate-value.conll

# not valid: feature value occurs twice
1 non-valid non-valid NOUN SP Gen=M,M 0 ROOT _ _

empty-head.conll

# not valid: HEAD must not be empty
1 have have VERB VB Tens=Pres root _ _

empty-field.conll

# not valid: no field can be empty.
1 valid NOUN SP _ 0 ROOT _ _

empty-sentence.conll

# not valid: sentences must contain at least one word.

# valid one-word sentence.
1 valid valid NOUN SP _ 0 ROOT _ _

extra-empty-line.conll

# valid one-word sentence.
1 valid valid NOUN SP _ 0 ROOT _ _


# format error: sentences must be separated by exactly one empty line
# valid one-word sentence.
1 valid valid NOUN SP _ 0 ROOT _ _

extra-field.conll

# not valid: 11 TAB-separated fields
1 non-valid non-valid NOUN SP _ 0 ROOT _ _ extra

id-starting-from-2.conll

# valid one-word sentence.
1 valid valid NOUN SP _ 0 ROOT _ _

# not valid: ID must start at 1 for each new sentence
2 valid valid NOUN SP _ 0 ROOT _ _

invalid-deps-id.conll

# not valid: HEAD must reference a valid ID
1 have have VERB VB Tens=Pres 0 root 3:nsubj _
2 . . . FS _ 1 punct _ _

invalid-deps-order.conll

# not valid: DEPS must be sorted by HEAD index.
1 They they PRON PRN Case=Nom|Num=Plur 2 nsubj 4:nsubj|2:xsubj _
2 buy buy VERB VBP Num=Plur|Per=3|Tense=Pres 0 root _ _
3 and and CONJ CC _ 2 cc _ _
4 sell sell VERB VBP Num=Plur|Per=3|Tense=Pres 2 conj _ _
5 books book NOUN NNS Num=Plur 2 dobj 4:dobj _
6 . . PUNCT . _ 2 punct _ _
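
The ordering rule in this case could be checked along the following lines. This is a sketch that assumes the DEPS value is already syntactically well-formed (`HEAD:REL` pairs joined by `|`); `deps_sorted` is a hypothetical name.

```python
def deps_sorted(deps):
    """Return True when the HEAD indices in a DEPS value are in ascending order."""
    if deps == "_":
        return True
    heads = [int(item.split(":", 1)[0]) for item in deps.split("|")]
    return heads == sorted(heads)
```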

invalid-deps-syntax.conll

# not valid: DEPS must be 'HEAD:REL' pairs separated by bars ('|')
1 have have VERB VB Tens=Pres 0 root 2 _
2 . . . FS _ 1 punct _ _

invalid-head.conll

# not valid: HEAD must reference a valid ID
1 have have VERB VB Tens=Pres 0 root _ _
2 . . . FS _ 3 punct _ _
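
A minimal sketch of the HEAD check this case exercises, assuming the HEAD values of the ordinary words of one sentence have been collected in order (multiword token lines, whose HEAD is `_`, would be excluded beforehand). The name `head_errors` is invented for this example.

```python
def head_errors(heads):
    """Check HEAD strings of a sentence; heads[i] belongs to word i + 1.

    A HEAD is accepted when it is 0 (root) or the ID of a word in the sentence.
    """
    errors = []
    for word_id, head in enumerate(heads, start=1):
        if not (head.isascii() and head.isdigit()):
            errors.append(f"word {word_id}: HEAD {head!r} is not an integer")
        elif int(head) > len(heads):
            errors.append(f"word {word_id}: HEAD {head} is not a valid ID")
    return errors
```

An empty HEAD, as in empty-head.conll, is reported by the first branch.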

invalid-range.conll

# not valid: (first-last) multiword ranges must have first <= last
1 I I PRON PRN Num=Sing|Per=1 2 nsubj _ _
2-1 haven't _ _ _ _ _ _ _ _
2 have have VERB VB Tens=Pres 0 root _ _
3 not not ADV RB _ 2 neg _ _
4 a a DET DT _ 5 det _ _
5 clue clue NOUN NN Num=Sing 2 dobj _ _
6 . . PUNCT . _ 2 punct _ _

lowercase-feature.conll

# not valid: feature names must have format '[A-Z0-9][a-zA-Z0-9]*'
# (see http://universaldependencies.github.io/docs/features.html)
1 non-valid non-valid NOUN SP lower=Nonvalid 0 ROOT _ _

lowercase-value.conll

# not valid: feature values must have format '[A-Z0-9][a-zA-Z0-9]*'
# (see http://universaldependencies.github.io/docs/features.html)
1 non-valid non-valid NOUN SP Lower=nonvalid 0 ROOT _ _

malformed_deps.conll

# This is a comment
1 Gas gas NOUN S Gen=M|Num=N 0 ROOT xxx _

misordered-feature.conll

# not valid: features must be ordered alphabetically (ignoring case)
# (see http://universaldependencies.github.io/docs/features.html)
1 non-valid non-valid NOUN SP XB=True|Xa=True 0 ROOT _ _
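
The feature-related cases in this list (ambiguous, duplicate name, duplicate value, lowercase name or value, misordered) could be covered by one check over the FEATS field. The sketch below is one way to do it, using the `[A-Z0-9][a-zA-Z0-9]*` pattern quoted in the test comments; the function name `feats_errors` and the error wording are assumptions.

```python
import re

# Pattern for feature names and values, as quoted in the test-case comments.
NAME_OR_VALUE = re.compile(r"[A-Z0-9][a-zA-Z0-9]*")


def feats_errors(feats):
    """Return a list of problems found in a FEATS value; empty if none."""
    if feats == "_":
        return []
    errors, names = [], []
    for item in feats.split("|"):
        parts = item.split("=")
        if len(parts) != 2:
            errors.append(f"malformed feature: {item}")
            continue
        name, value_list = parts
        values = value_list.split(",")
        if not NAME_OR_VALUE.fullmatch(name):
            errors.append(f"bad feature name: {name}")
        for value in values:
            if not NAME_OR_VALUE.fullmatch(value):
                errors.append(f"bad feature value: {value}")
        if len(set(values)) != len(values):
            errors.append(f"repeated value in: {item}")
        names.append(name)
    if len(set(names)) != len(names):
        errors.append("repeated feature name")
    # Alphabetical order, ignoring case.
    if names != sorted(names, key=str.lower):
        errors.append("features not sorted alphabetically")
    return errors
```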

misordered-multiword.conll

# not valid: multiword tokens must appear before the first word in their
# range
1 I I PRON PRN Num=Sing|Per=1 2 nsubj _ _
2 have have VERB VB Tens=Pres 0 root _ _
2-3 haven't _ _ _ _ _ _ _ _
3 not not ADV RB _ 2 neg _ _
4 a a DET DT _ 5 det _ _
5 clue clue NOUN NN Num=Sing 2 dobj _ _
6 . . PUNCT . _ 2 punct _ _

misplaced-comment-mid.conll

# not valid: comment lines inside sentences are disallowed.
1 I I PRON PRN Num=Sing|Per=1 2 nsubj _ _
2-3 haven't _ _ _ _ _ _ _ _
# this comment should not be here
2 have have VERB VB Tens=Pres 0 root _ _
3 not not ADV RB _ 2 neg _ _
4 a a DET DT _ 5 det _ _
5 clue clue NOUN NN Num=Sing 2 dobj _ _
6 . . PUNCT . _ 2 punct _ _

misplaced-comment-end.conll

# not valid: comment lines should precede a sentence
1 I I PRON PRN Num=Sing|Per=1 2 nsubj _ _
2-3 haven't _ _ _ _ _ _ _ _
2 have have VERB VB Tens=Pres 0 root _ _
3 not not ADV RB _ 2 neg _ _
4 a a DET DT _ 5 det _ _
5 clue clue NOUN NN Num=Sing 2 dobj _ _
6 . . PUNCT . _ 2 punct _ _
# this comment should not be here as it does not precede a sentence.

missing final newline

1 Gas gas NOUN S Gen=M|Num=N 0 root _ _

multiword-with-pos.conll

# not valid: multiword tokens must have underscore ("_") for all fields
# except FORM.
1 I I PRON PRN Num=Sing|Per=1 2 nsubj _ _
2-3 haven't _ VERB _ _ _ _ _ _
2 have have VERB VB Tens=Pres 0 root _ _
3 not not ADV RB _ 2 neg _ _
4 a a DET DT _ 5 det _ _
5 clue clue NOUN NN Num=Sing 2 dobj _ _
6 . . PUNCT . _ 2 punct _ _

nonsequential-id.conll

# not valid: IDs must be sequential integers (1, 2, ...)
1 valid valid NOUN SP _ 0 ROOT _ _
3 . . . FS _ 1 p _ _

overlapping-multiword.conll

# not valid: multiword token ranges may not overlap
1 I I PRON PRN Num=Sing|Per=1 2 nsubj _ _
2-3 haven't _ _ _ _ _ _ _ _
2 have have VERB VB Tens=Pres 0 root _ _
3-4 nota _ _ _ _ _ _ _ _
3 not not ADV RB _ 2 neg _ _
4 a a DET DT _ 5 det _ _
5 clue clue NOUN NN Num=Sing 2 dobj _ _
6 . . PUNCT . _ 2 punct _ _
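
The ID-related cases in this list (duplicate, non-sequential, starting from 2, inverted range, misordered range, overlapping ranges) can be sketched as a single pass over the ID column of one sentence. The function name `id_errors` and the error wording are assumptions; the input is assumed to contain only plain integers and `first-last` ranges.

```python
def id_errors(ids):
    """Check a sentence's ID column, given as strings like '1' or '2-3'."""
    errors = []
    expected = 1        # next word ID we expect to see
    covered_until = 0   # last word ID covered by any multiword range so far
    for raw in ids:
        if "-" in raw:
            first, last = (int(part) for part in raw.split("-"))
            if first > last:
                errors.append(f"{raw}: range start exceeds end")
            if first <= covered_until:
                errors.append(f"{raw}: overlaps an earlier range")
            if first != expected:
                errors.append(f"{raw}: range must precede word {first}")
            covered_until = max(covered_until, last)
        else:
            if int(raw) != expected:
                errors.append(f"{raw}: expected word ID {expected}")
            # Resynchronise on the ID actually seen.
            expected = int(raw) + 1
    return errors
```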

space-in-field.conll

# not valid: no field can contain space.
1 not valid valid NOUN SP _ 0 ROOT _ _

token_with_cols_filled.conll

# (TODO: is this the same general case as multiword-with-pos.conll?)
# This is a comment
1 Gas gas NOUN S Gen=M|Num=N 0 ROOT _ _
2-3 dalla dalla _ _ _ 0 ROOT _ _
2 da da ADP EA _ 1 adpmod _ _
3 la la DET RD Gen=F|Num=S 4 det _ _
4 statua statua NOUN S Gen=F|Num=S 2 adpobj _ _
5 . . . FS _ 1 p _ _

trailing-tab.conll

# not valid: extra TAB before newline
1 non-valid non-valid NOUN SP _ 0 ROOT _ _