Changelog
Unreleased
Added
- Support for chained attributes in the processing pipelines
- Colour utility with the category20 colour palette
v0.5.1 (2022-04-11)
Fixed
- Updated NumPy requirements to be compatible with the EDSPhraseMatcher
v0.5.0 (2022-04-08)
Added
- New eds language to better fit French clinical documents and improve speed
- Testing for markdown codeblocks to make sure the documentation is actually executable
Changed
- Complete revamp of the date detection pipeline, with better parsing and more exhaustive matching
- Reimplementation of the EDSPhraseMatcher in Cython, leading to a 15x speed increase
v0.4.4
- Add measures pipeline
- Cap Jinja2 version to fix mkdocs
- Adding the possibility to add context in the processing module
- Improve the speed of char replacement pipelines (accents and quotes)
- Improve the speed of the regex matcher
v0.4.3
- Fix regex matching on spans.
- Add fast_parse in date pipeline.
- Add relative_date information parsing
v0.4.2
- Fix issue with dateparser library (see scrapinghub/dateparser#1045)
- Fix attr issue in the advanced-regex pipeline
- Add documentation for eds.covid
- Update the demo with an explanation for the regex
v0.4.1
- Added support for Koalas DataFrames in the edsnlp.processing pipe.
- Added eds.covid NER pipeline for detecting COVID19 mentions.
v0.4.0
- Profound re-write of the normalisation:
  - The custom attribute CUSTOM_NORM is completely abandoned in favour of a more spacyfic alternative
  - The normalizer pipeline modifies the NORM attribute in place
  - Other pipelines can modify the Token._.excluded custom attribute
- EDS regex and term matchers can ignore excluded tokens during matching, effectively adding a second dimension to normalisation (choice of the attribute and possibility to skip pollution tokens regardless of the attribute)
- Matching can be performed on custom attributes more easily
- Qualifiers are regrouped together within the edsnlp.qualifiers submodule, the inheritance from the GenericMatcher is dropped.
- edsnlp.utils.filter.filter_spans now accepts a label_to_remove parameter. If set, only corresponding spans are removed, along with overlapping spans. Primary use-case: removing pseudo cues for qualifiers.
- Generalise the naming convention for extensions, which keep the same name as the pipeline that created them (eg Span._.negation for the eds.negation pipeline). The previous convention is kept for now, but calling it issues a warning.
- The dates pipeline underwent some light formatting to increase robustness and fix a few issues
- A new consultation_dates pipeline was added, which looks for dates preceded by expressions specific to consultation dates
- In rule-based processing, the terms.py submodule is replaced by patterns.py to reflect the possible presence of regular expressions
- Refactoring of the architecture:
  - pipelines are now regrouped by type (core, ner, misc, qualifiers)
  - matchers submodule contains RegexMatcher and PhraseMatcher classes, which interact with the normalisation
  - multiprocessing submodule contains spark and local multiprocessing tools
  - connectors contains Brat, OMOP and LabelTool connectors
  - utils contains various utilities
- Add entry points to make pipeline usable directly, removing the need to import edsnlp.components.
- Add an eds namespace for components: for instance, negation becomes eds.negation. Using the former pipeline name still works, but issues a deprecation warning.
- Add 3 score pipelines related to emergency
- Add a helper function to use a spaCy pipeline as a Spark UDF.
- Fix alignment issues in RegexMatcher
- Change the alignment procedure, dropping clumsy numpy dependency in favour of bisect
- Change the name of eds.antecedents to eds.history. Calling eds.antecedents still works, but issues a deprecation warning and support will be removed in a future version.
- Add an eds.covid component, that identifies mentions of COVID
- Change the demo, to include NER components
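As an illustration of the bisect-based alignment mentioned in the v0.4.0 notes, here is a minimal sketch. It assumes that alignment means mapping a character offset in a cleaned text back to the original text; the function names `normalise` and `to_original` and the breakpoint layout are hypothetical, and this is not the library's actual implementation.

```python
from bisect import bisect_right
from typing import List, Set, Tuple


def normalise(text: str, drop: Set[str]) -> Tuple[str, List[int], List[int]]:
    """Remove the characters in `drop` and record how offsets shift.

    Returns the cleaned text plus two parallel lists:
    breakpoints[i] is an offset in the cleaned text, and shifts[i] is the
    number of characters removed from the original before that offset.
    """
    cleaned: List[str] = []
    breakpoints = [0]
    shifts = [0]
    removed = 0
    for char in text:
        if char in drop:
            removed += 1
            if breakpoints[-1] == len(cleaned):
                # Several removals at the same cleaned offset: update in place.
                shifts[-1] = removed
            else:
                breakpoints.append(len(cleaned))
                shifts.append(removed)
        else:
            cleaned.append(char)
    return "".join(cleaned), breakpoints, shifts


def to_original(offset: int, breakpoints: List[int], shifts: List[int]) -> int:
    """Map an offset in the cleaned text back to the original text."""
    # bisect finds the last breakpoint that is <= offset in O(log n).
    index = bisect_right(breakpoints, offset) - 1
    return offset + shifts[index]


original = "pa-tient fe-brile"
cleaned, breakpoints, shifts = normalise(original, {"-"})
start = cleaned.index("febrile")
print(cleaned, to_original(start, breakpoints, shifts))
```

Because the breakpoints are sorted by construction, a binary search from the standard library is enough to do the lookup, with no array library involved.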
v0.3.2
- Major revamp of the normalisation.
  - The normalizer pipeline now adds atomic components (lowercase, accents, quotes, pollution & endlines) to the processing pipeline, and compiles the results into a new Doc._.normalized extension. The latter is itself a spaCy Doc object, wherein tokens are normalised and pollution tokens are removed altogether. Components that match on the CUSTOM_NORM attribute process the normalized document, and matches are brought back to the original document using a token-wise mapping.
  - Update the RegexMatcher to use the CUSTOM_NORM attribute
  - Add an EDSPhraseMatcher, wrapping spaCy's PhraseMatcher to enable matching on CUSTOM_NORM.
  - Update the matcher and advanced pipelines to enable matching on the CUSTOM_NORM attribute.
- Add an OMOP connector, to help go back and forth between OMOP-formatted pandas dataframes and spaCy documents.
- Add a reason pipeline, that extracts the reason for visit.
- Add an endlines pipeline, that classifies newline characters between spaces and actual ends of line.
- Add possibility to annotate within entities for qualifiers (negation, hypothesis, etc), ie if the cue is within the entity. Disabled by default.
v0.3.1
- Update dates to remove miscellaneous bugs.
- Add isort pre-commit hook.
- Improve performance for negation, hypothesis, antecedents, family and rspeech by using spaCy's filter_spans and our consume_spans methods.
- Add proposition segmentation to hypothesis and family, enhancing results.
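The overlap filtering referred to in the v0.3.1 notes can be sketched in plain Python. spaCy's filter_spans prefers the (first) longest span when spans overlap; the sketch below applies that rule to (start, end) token offsets instead of spaCy Span objects. The name `filter_overlapping` is hypothetical, and this is an illustration rather than the code used by spaCy or by this library.

```python
from typing import Iterable, List, Tuple


def filter_overlapping(spans: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Keep the longest spans, dropping any span that overlaps a kept one.

    Each span is a (start, end) pair of token offsets, end exclusive.
    """
    # Longest spans first; among equal lengths, the earliest start wins.
    ordered = sorted(
        spans,
        key=lambda span: (span[1] - span[0], -span[0]),
        reverse=True,
    )
    kept: List[Tuple[int, int]] = []
    seen_tokens = set()
    for start, end in ordered:
        # Keep the span only if none of its tokens is already covered.
        if all(token not in seen_tokens for token in range(start, end)):
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept)


candidates = [(0, 2), (1, 5), (4, 6), (7, 8)]
print(filter_overlapping(candidates))
```

Sorting once and tracking covered tokens in a set avoids comparing every pair of spans, which is the kind of saving that matters when a qualifier pipeline handles many cues per document.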
v0.3.0
- Renamed generic to matcher. This is a non-breaking change for the average user, adding the pipeline is still:
nlp.add_pipe("matcher", config=dict(terms=dict(maladie="maladie")))
- Removed quickumls pipeline. It was untested, unmaintained. Will be added back in a future release.
- Add score pipeline, and charlson.
- Add advanced-regex pipeline
- Corrected bugs in the negation pipeline
v0.2.0
- Add negation pipeline
- Add family pipeline
- Add hypothesis pipeline
- Add antecedents pipeline
- Add rspeech pipeline
- Refactor the library:
  - Remove the rules folder
  - Add a pipelines folder, containing one subdirectory per component
  - Every component subdirectory contains a module defining the component, and a module defining a factory, plus any other utilities (eg terms.py)
v0.1.0
First working version. Available pipelines:
- section
- sentences
- normalization
- pollution