
edsnlp.pipes.core.normalizer.pollution.factory

create_component = registry.factory.register('eds.pollution', assigns=['doc.spans'], deprecated=['pollution'])(PollutionTagger) module-attribute

Tags pollution tokens.

Populates several spaCy extensions:

  • Token._.pollution : indicates whether the token is pollution
  • Doc._.clean : lists the non-pollution tokens
  • Doc._.clean_ : the original text with pollution removed
  • Doc._.char_clean_span : method that creates a Span from character indices computed on the cleaned text
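
The relationship between the cleaned text and the original text can be illustrated with a minimal, library-free sketch. It is not EDS-NLP's implementation: the `POLLUTION` pattern and the helpers `clean_with_offsets` and `char_clean_span` are hypothetical names chosen for illustration, and the sketch works on raw strings rather than spaCy tokens. It only shows the idea of removing pollution while remembering where each kept character came from, so that indices found in the cleaned text can be mapped back.

```python
import re

# Hypothetical pollution pattern, for illustration only (not an EDS-NLP default):
# a block of text wrapped in runs of three or more asterisks.
POLLUTION = re.compile(r"\*{3,}[^*]*\*{3,}\s*")


def clean_with_offsets(text):
    """Remove pollution and record, for every kept character,
    its index in the original text."""
    kept = []
    cursor = 0
    for match in POLLUTION.finditer(text):
        kept.extend(range(cursor, match.start()))
        cursor = match.end()
    kept.extend(range(cursor, len(text)))
    cleaned = "".join(text[i] for i in kept)
    return cleaned, kept


def char_clean_span(text, kept, start, end):
    """Map the half-open range [start, end) of the cleaned text
    back to a slice of the original text."""
    if not 0 <= start < end <= len(kept):
        raise ValueError("indices must describe a non-empty range of the cleaned text")
    return text[kept[start] : kept[end - 1] + 1]


text = "Patient admitted *** DO NOT DISCARD *** for a cough."
cleaned, kept = clean_with_offsets(text)

# Indices are computed on the cleaned text, then mapped back to the original.
start = cleaned.index("admitted for")
original_slice = char_clean_span(text, kept, start, start + len("admitted for"))
```

In this sketch, a match found across the removed block maps back to a slice of the original text that still contains the pollution, which is why the mapping has to be kept alongside the cleaned string.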

Parameters

PARAMETER DESCRIPTION
nlp

The pipeline object

TYPE: PipelineProtocol

name

The component name.

TYPE: Optional[str]

pollution

Dictionary containing regular expressions of pollution.

TYPE: Dict[str, Union[str, List[str]]] DEFAULT: {'information': True, 'bars': True, 'biology': ...
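
The default shown above maps names to `True`, which suggests that a value may be a flag as well as a pattern. The sketch below shows one way such a dictionary could be resolved into compiled regular expressions. It rests on assumptions that this page does not confirm: that `True` selects a built-in pattern, that `False` disables it, and that a string or list of strings supplies custom patterns. The `BUILTIN` table, its keys, and `resolve_patterns` are hypothetical and do not reproduce EDS-NLP's actual defaults.

```python
import re

# Hypothetical built-in patterns, for illustration only.
BUILTIN = {
    "dashes": r"-{5,}",
    "stars": r"\*{5,}",
}


def resolve_patterns(config):
    """Turn a {name: bool | str | list[str]} mapping into
    {name: [compiled regex, ...]}, skipping disabled entries."""
    resolved = {}
    for name, value in config.items():
        if value is False:
            continue  # assumed meaning: pattern disabled
        if value is True:
            value = BUILTIN[name]  # assumed meaning: use the built-in pattern
        patterns = [value] if isinstance(value, str) else list(value)
        resolved[name] = [re.compile(p) for p in patterns]
    return resolved


patterns = resolve_patterns(
    {
        "dashes": True,
        "stars": False,
        "footer": [r"Page \d+ of \d+", r"Printed on \S+"],
    }
)
```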