# CoNLL

TL;DR
```python
import edsnlp

# `path` points to a CoNLL file or directory, `nlp` is an edsnlp pipeline
stream = edsnlp.data.read_conll(path)
stream = stream.map_pipeline(nlp)
```
You can easily integrate CoNLL-formatted files into your project by using EDS-NLP's CoNLL reader.

There are many CoNLL formats corresponding to different shared tasks, but one of the most common is the CoNLL-U format, which is used for dependency parsing. In CoNLL files, each line corresponds to a token and contains various columns with information about that token, such as its index, form, lemma, POS tag, and dependency relation.
EDS-NLP lets you specify the column names if they differ from the default CoNLL-U format. If the `columns` parameter is unset, the reader looks for a `# global.columns` comment to infer the column names. Otherwise, the default CoNLL-U columns are used:

`ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC`
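For instance, a minimal sketch of overriding the column names for a file with a reduced column layout (the path and the four-column layout here are assumptions, not part of the library's defaults):

```python
import edsnlp

# Hypothetical CoNLL file whose rows only contain four columns;
# adjust the list to match your data
stream = edsnlp.data.read_conll(
    "path/to/custom.conll",
    columns=["ID", "FORM", "UPOS", "HEAD"],
)
```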
A typical CoNLL file looks like this:

```text
1 euh euh INTJ _ _ 5 discourse _ SpaceAfter=No
2 , , PUNCT _ _ 1 punct _ _
3 il lui PRON _ Gender=Masc|Number=Sing|Person=3|PronType=Prs 5 expl:subj _ _
...
```
## Reading CoNLL files
The `ConllReader` (or `edsnlp.data.read_conll`) reads a file or directory of CoNLL files and yields documents.

The raw output (i.e., with `converter=None`) will be in the following form for a single doc:
```python
{
    "words": [
        {"ID": "1", "FORM": ...},
        ...
    ],
}
```
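As an illustration, here is a minimal sketch of reading these raw dictionaries without converting them to `Doc` objects (the path is a placeholder):

```python
import edsnlp

# Read raw token dictionaries instead of Doc objects
stream = edsnlp.data.read_conll(
    "path/to/conll/file/or/directory",
    converter=None,
)

for record in stream:
    # Each record holds a "words" list with one dict per token
    for word in record["words"]:
        print(word["ID"], word["FORM"])
    break  # inspect only the first document
```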
### Example

```python
import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)  # add the components you need

doc_iterator = edsnlp.data.read_conll("path/to/conll/file/or/directory")
annotated_docs = nlp.pipe(doc_iterator)
```
Generator vs list

`edsnlp.data.read_conll` returns a Stream. To iterate over the documents multiple times efficiently or to access them by index, you must convert it to a list:

```python
docs = list(edsnlp.data.read_conll("path/to/conll/file/or/directory"))
```
### Parameters

| Parameter | Description |
|---|---|
| `path` | Path to the directory containing the CoNLL files (will recursively look for files in subdirectories). |
| `columns` | List of column names to use. If `None`, the reader looks for a `# global.columns` comment to infer the column names; otherwise the default CoNLL-U columns are used. |
| `shuffle` | Whether to shuffle the data. If `"dataset"`, the whole dataset is shuffled before iteration starts (at the start of every epoch if looping). |
| `seed` | The seed to use for shuffling. |
| `loop` | Whether to loop over the data indefinitely. |
| `nlp` | The pipeline object (optional and likely not needed; prefer the `tokenizer` argument). |
| `tokenizer` | The tokenizer instance used to tokenize the documents. Likely not needed, since by default the current context tokenizer is used: the tokenizer of the next pipeline run with `.map_pipeline` in a stream, or the `eds` tokenizer otherwise. |
| `converter` | Converter to use to convert the parsed documents to `Doc` objects. |
| `filesystem` | The filesystem to use to read the files. If `None`, the filesystem is inferred from the path (e.g. an `s3://` path will use S3). |
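As an illustration, a hedged sketch combining several of these parameters for training-style iteration (the path and the parameter values are assumptions):

```python
import edsnlp

stream = edsnlp.data.read_conll(
    "path/to/conll/file/or/directory",
    shuffle="dataset",  # reshuffle the whole dataset at the start of each epoch
    seed=42,            # make shuffling reproducible
    loop=True,          # iterate over the data indefinitely
)
```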