# CoNLL

TL;DR
```python
import edsnlp

# `path` points to a CoNLL file or directory, `nlp` is an edsnlp pipeline
stream = edsnlp.data.read_conll(path)
stream = stream.map_pipeline(nlp)
```
You can easily integrate CoNLL-formatted files into your project by using EDS-NLP's CoNLL reader.

There are many CoNLL formats corresponding to different shared tasks, but one of the most common is the CoNLL-U format, which is used for dependency parsing. In CoNLL files, each line corresponds to a token and contains various columns with information about that token, such as its index, form, lemma, POS tag, and dependency relation.
EDS-NLP lets you specify the column names if they differ from the default CoNLL-U format. If the `columns` parameter is unset, the reader looks for a `# global.columns` comment to infer the column names. Otherwise, the default CoNLL-U columns are used:

`ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC`
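For instance, a minimal sketch of overriding the column names for a file with a reduced column layout (the path and the four-column layout here are assumptions, not part of the library's defaults):

```python
import edsnlp

# Hypothetical CoNLL file whose rows only contain four columns;
# adjust the list to match your data
stream = edsnlp.data.read_conll(
    "path/to/custom.conll",
    columns=["ID", "FORM", "UPOS", "HEAD"],
)
```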
A typical CoNLL file looks like this:

```text
1 euh euh INTJ _ _ 5 discourse _ SpaceAfter=No
2 , , PUNCT _ _ 1 punct _ _
3 il lui PRON _ Gender=Masc|Number=Sing|Person=3|PronType=Prs 5 expl:subj _ _
...
```
## Reading CoNLL files
The `ConllReader` (or `edsnlp.data.read_conll`) reads a file or directory of CoNLL files and yields documents.

The raw output (i.e., with `converter=None`) will be in the following form for a single doc:
```python
{
    "words": [
        {"ID": "1", "FORM": ...},
        ...
    ],
}
```
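As an illustration, here is a minimal sketch of reading these raw dictionaries without converting them to `Doc` objects (the path is a placeholder):

```python
import edsnlp

# Read raw token dictionaries instead of Doc objects
stream = edsnlp.data.read_conll(
    "path/to/conll/file/or/directory",
    converter=None,
)

for record in stream:
    # Each record holds a "words" list with one dict per token
    for word in record["words"]:
        print(word["ID"], word["FORM"])
    break  # inspect only the first document
```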
### Example

```python
import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)  # add the components you need

doc_iterator = edsnlp.data.read_conll("path/to/conll/file/or/directory")
annotated_docs = nlp.pipe(doc_iterator)
```
Generator vs list

`edsnlp.data.read_conll` returns a Stream. To iterate over the documents multiple times efficiently or to access them by index, you must convert it to a list:

```python
docs = list(edsnlp.data.read_conll("path/to/conll/file/or/directory"))
```
### Parameters

| Parameter | Description |
|---|---|
| `path` | Path to the directory containing the CoNLL files (will recursively look for files in subdirectories). |
| `columns` | List of column names to use. If `None`, the reader looks for a `# global.columns` comment to infer the column names; otherwise the default CoNLL-U columns are used. |
| `shuffle` | Whether to shuffle the data. If `"dataset"`, the whole dataset is shuffled before iteration starts (at the start of every epoch if looping). |
| `seed` | The seed to use for shuffling. |
| `loop` | Whether to loop over the data indefinitely. |
| `nlp` | The pipeline object (optional and likely not needed; prefer the `tokenizer` argument). |
| `tokenizer` | The tokenizer instance used to tokenize the documents. Likely not needed, since by default the current context tokenizer is used: the tokenizer of the next pipeline run with `.map_pipeline` in a stream, or the `eds` tokenizer otherwise. |
| `converter` | Converter to use to convert the parsed documents to `Doc` objects. |
| `filesystem` | The filesystem to use to read the files. If `None`, the filesystem is inferred from the path (e.g. an `s3://` path will use S3). |
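As an illustration, a hedged sketch combining several of these parameters for training-style iteration (the path and the parameter values are assumptions):

```python
import edsnlp

stream = edsnlp.data.read_conll(
    "path/to/conll/file/or/directory",
    shuffle="dataset",  # reshuffle the whole dataset at the start of each epoch
    seed=42,            # make shuffling reproducible
    loop=True,          # iterate over the data indefinitely
)
```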