The BaseComponent adds a set_extensions method,
called at the creation of the object.
It helps decouple the initialisation of the pipeline from
the creation of extensions, and is particularly useful when
distributing EDSNLP on a cluster, since the serialisation mechanism
requires the extensions to be reset.
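The pattern described above can be illustrated with a minimal, spaCy-free sketch. Everything in it is hypothetical and exists only for illustration: `EXTENSION_REGISTRY` stands in for spaCy's process-wide extension registry, and `BaseComponentSketch` and `NegationSketch` are made-up names, not part of EDSNLP. The idea is that a subclass declares its extensions in `set_extensions`, and the registry can be repopulated whenever it has been reset, for instance in a fresh worker process.

```python
from typing import Any, Dict

# Hypothetical stand-in for spaCy's process-wide extension registry.
EXTENSION_REGISTRY: Dict[str, Any] = {}


class BaseComponentSketch:
    """Simplified mirror of the pattern: __init__ triggers set_extensions."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.set_extensions()

    @classmethod
    def set_extensions(cls) -> None:
        # The base class declares no extensions.
        pass


class NegationSketch(BaseComponentSketch):
    """Hypothetical component that needs a `negated` extension."""

    @classmethod
    def set_extensions(cls) -> None:
        # Register only if missing, so repeated calls are harmless.
        EXTENSION_REGISTRY.setdefault("negated", False)


# Creating the component registers its extensions as a side effect.
component = NegationSketch()
assert "negated" in EXTENSION_REGISTRY

# Simulate a worker whose registry was reset after serialisation.
EXTENSION_REGISTRY.clear()

# Because set_extensions is a classmethod, it can be called again
# without rebuilding the component.
NegationSketch.set_extensions()
assert "negated" in EXTENSION_REGISTRY
```

Making the registration idempotent is a design choice of this sketch rather than something the source states; it keeps repeated calls to `set_extensions` safe.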
from typing import List, Optional, Tuple

from spacy.tokens import Doc, Span


class BaseComponent(object):
    """
    The `BaseComponent` adds a `set_extensions` method,
    called at the creation of the object.

    It helps decouple the initialisation of the pipeline from
    the creation of extensions, and is particularly useful when
    distributing EDSNLP on a cluster, since the serialisation mechanism
    requires the extensions to be reset.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.set_extensions()

    @classmethod
    def set_extensions(cls) -> None:
        """
        Set `Doc`, `Span` and `Token` extensions.
        """
        pass

    def _boundaries(
        self, doc: Doc, terminations: Optional[List[Span]] = None
    ) -> List[Tuple[int, int]]:
        """
        Create sub sentences based on sentences and terminations found in text.

        Parameters
        ----------
        doc:
            spaCy Doc object
        terminations:
            List of spaCy Span objects marking terminations
            (only their `start` index is used)

        Returns
        -------
        boundaries:
            List of tuples with (start, end) of spans
        """

        if terminations is None:
            terminations = []

        sent_starts = [sent.start for sent in doc.sents]
        termination_starts = [t.start for t in terminations]

        starts = sent_starts + termination_starts + [len(doc)]

        # Remove duplicates
        starts = list(set(starts))

        # Sort starts
        starts.sort()

        boundaries = [(start, end) for start, end in zip(starts[:-1], starts[1:])]

        return boundaries
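To make the boundary computation in `_boundaries` concrete, here is a minimal sketch that reproduces the same steps on plain integers instead of spaCy objects. The function name `boundaries_from_starts` and its parameters are hypothetical and exist only for this illustration; in the real method the start indices come from `doc.sents` and from the `terminations` spans.

```python
from typing import List, Optional, Tuple


def boundaries_from_starts(
    sent_starts: List[int],
    termination_starts: Optional[List[int]],
    doc_length: int,
) -> List[Tuple[int, int]]:
    """Split [0, doc_length) at every sentence or termination start."""
    if termination_starts is None:
        termination_starts = []

    # Merge all cut points, add the end of the document,
    # then remove duplicates and sort.
    starts = sorted(set(sent_starts + termination_starts + [doc_length]))

    # Pair each cut point with the next one to form (start, end) tuples.
    return list(zip(starts[:-1], starts[1:]))


# Two sentences starting at tokens 0 and 8, terminations at tokens 5 and 8,
# in a 12-token document. The duplicate cut point at 8 is merged.
print(boundaries_from_starts([0, 8], [5, 8], 12))
# [(0, 5), (5, 8), (8, 12)]
```

Each termination therefore splits the sentence that contains it, and a termination that coincides with a sentence start adds no extra boundary.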