Metadata-Version: 2.3
Name: medkit-lib
Version: 0.15.0
Summary: A Python library for a learning health system
Project-URL: Changelog, https://medkit.readthedocs.io/en/stable/changelog.html
Project-URL: Documentation, https://medkit.readthedocs.io
Project-URL: Issues, https://github.com/medkit-lib/medkit/issues
Project-URL: Source, https://github.com/medkit-lib/medkit
Author: HeKA Research Team
Maintainer-email: medkit maintainers <medkit-maintainers@inria.fr>
License-Expression: MIT
License-File: LICENSE
Keywords: bert,digital health,ehr,nlp,umls
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Classifier: Topic :: Software Development
Requires-Python: >=3.8
Requires-Dist: anyascii
Requires-Dist: duptextfinder>=0.3.0
Requires-Dist: flashtext>=2.7
Requires-Dist: intervaltree
Requires-Dist: numpy
Requires-Dist: pyaml
Requires-Dist: pysimstring
Requires-Dist: requests
Requires-Dist: smart-open
Requires-Dist: soundfile
Requires-Dist: tqdm
Requires-Dist: typing-extensions>=4.6.0
Provides-Extra: all
Requires-Dist: edsnlp>=0.9; extra == 'all'
Requires-Dist: feather-format>=0.4; extra == 'all'
Requires-Dist: huggingface-hub; extra == 'all'
Requires-Dist: iamsystem>=0.6.0; extra == 'all'
Requires-Dist: nlstruct>=0.2; extra == 'all'
Requires-Dist: packaging; extra == 'all'
Requires-Dist: pandas>=1.4; extra == 'all'
Requires-Dist: pyannote-audio>=3.1; extra == 'all'
Requires-Dist: pyannote-core>=5.0; extra == 'all'
Requires-Dist: pyannote-metrics>=3.2.0; extra == 'all'
Requires-Dist: pyrush>=1.0; extra == 'all'
Requires-Dist: pysrt>=1.1.2; extra == 'all'
Requires-Dist: quickumls>=1.4; extra == 'all'
Requires-Dist: resampy>=0.4; extra == 'all'
Requires-Dist: sacremoses; extra == 'all'
Requires-Dist: scikit-learn>=1.3.2; extra == 'all'
Requires-Dist: sentencepiece; extra == 'all'
Requires-Dist: seqeval>=1.2.2; extra == 'all'
Requires-Dist: spacy>=3.4; extra == 'all'
Requires-Dist: speechbrain>=0.5; extra == 'all'
Requires-Dist: torch>=2.1.1; extra == 'all'
Requires-Dist: torchaudio>=2.1.1; extra == 'all'
Requires-Dist: transformers>=4.21; extra == 'all'
Requires-Dist: unqlite>=0.9.6; extra == 'all'
Requires-Dist: webrtcvad>=2.0; extra == 'all'
Provides-Extra: docs
Requires-Dist: myst-nb; extra == 'docs'
Requires-Dist: numpydoc; extra == 'docs'
Requires-Dist: pandas; extra == 'docs'
Requires-Dist: sphinx; extra == 'docs'
Requires-Dist: sphinx-autoapi; extra == 'docs'
Requires-Dist: sphinx-autobuild; extra == 'docs'
Requires-Dist: sphinx-book-theme; extra == 'docs'
Requires-Dist: sphinx-design; extra == 'docs'
Requires-Dist: sphinxcontrib-mermaid; extra == 'docs'
Provides-Extra: edsnlp
Requires-Dist: edsnlp>=0.9; extra == 'edsnlp'
Provides-Extra: hf-entity-matcher
Requires-Dist: torch>=2.1.1; extra == 'hf-entity-matcher'
Requires-Dist: transformers>=4.21; extra == 'hf-entity-matcher'
Provides-Extra: hf-transcriber
Requires-Dist: torchaudio>=2.1.1; extra == 'hf-transcriber'
Requires-Dist: transformers>=4.21; extra == 'hf-transcriber'
Provides-Extra: hf-translator
Requires-Dist: sacremoses; extra == 'hf-translator'
Requires-Dist: sentencepiece; extra == 'hf-translator'
Requires-Dist: torch>=2.1.1; extra == 'hf-translator'
Requires-Dist: transformers>=4.21; extra == 'hf-translator'
Provides-Extra: hf-utils
Requires-Dist: transformers>=4.21; extra == 'hf-utils'
Provides-Extra: iamsystem-matcher
Requires-Dist: iamsystem>=0.6.0; extra == 'iamsystem-matcher'
Provides-Extra: metrics-diarization
Requires-Dist: pyannote-core>=5.0; extra == 'metrics-diarization'
Requires-Dist: pyannote-metrics>=3.2.0; extra == 'metrics-diarization'
Provides-Extra: metrics-ner
Requires-Dist: seqeval>=1.2.2; extra == 'metrics-ner'
Requires-Dist: torch>=2.1.1; extra == 'metrics-ner'
Requires-Dist: transformers>=4.21; extra == 'metrics-ner'
Provides-Extra: metrics-text-classification
Requires-Dist: scikit-learn>=1.3.2; extra == 'metrics-text-classification'
Provides-Extra: metrics-transcription
Requires-Dist: speechbrain>=0.5; extra == 'metrics-transcription'
Provides-Extra: nlstruct
Requires-Dist: huggingface-hub; extra == 'nlstruct'
Requires-Dist: nlstruct>=0.2; extra == 'nlstruct'
Requires-Dist: torch>=2.1.1; extra == 'nlstruct'
Provides-Extra: pa-speaker-detector
Requires-Dist: pyannote-audio>=3.1; extra == 'pa-speaker-detector'
Requires-Dist: torch>=2.1.1; extra == 'pa-speaker-detector'
Provides-Extra: quick-umls
Requires-Dist: packaging; extra == 'quick-umls'
Requires-Dist: quickumls>=1.4; extra == 'quick-umls'
Requires-Dist: unqlite>=0.9.6; extra == 'quick-umls'
Provides-Extra: resampler
Requires-Dist: resampy>=0.4; extra == 'resampler'
Provides-Extra: rush-sentence-tokenizer
Requires-Dist: pyrush>=1.0; extra == 'rush-sentence-tokenizer'
Provides-Extra: sb-transcriber
Requires-Dist: speechbrain>=0.5; extra == 'sb-transcriber'
Requires-Dist: torch>=2.1.1; extra == 'sb-transcriber'
Requires-Dist: transformers>=4.21; extra == 'sb-transcriber'
Provides-Extra: spacy
Requires-Dist: spacy>=3.4; extra == 'spacy'
Provides-Extra: srt-io-converter
Requires-Dist: pysrt>=1.1.2; extra == 'srt-io-converter'
Provides-Extra: syntactic-relation-extractor
Requires-Dist: spacy>=3.4; extra == 'syntactic-relation-extractor'
Provides-Extra: training
Requires-Dist: torch>=2.1.1; extra == 'training'
Provides-Extra: umls-coder-normalizer
Requires-Dist: feather-format>=0.4; extra == 'umls-coder-normalizer'
Requires-Dist: pandas>=1.4; extra == 'umls-coder-normalizer'
Requires-Dist: torch>=2.1.1; extra == 'umls-coder-normalizer'
Requires-Dist: transformers>=4.21; extra == 'umls-coder-normalizer'
Provides-Extra: webrtc-voice-detector
Requires-Dist: webrtcvad>=2.0; extra == 'webrtc-voice-detector'
Description-Content-Type: text/markdown

# medkit

![medkit logo](https://github.com/medkit-lib/medkit/blob/main/docs/_static/medkit-logo.png?raw=true)

|         |                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| CI      | [![docs status](https://readthedocs.org/projects/medkit/badge/?version=latest)](https://medkit.readthedocs.io/en/latest/) [![pre-commit status](https://github.com/medkit-lib/medkit/actions/workflows/pre-commit.yaml/badge.svg)](https://github.com/medkit-lib/medkit/actions/workflows/pre-commit.yaml) [![test: status](https://github.com/medkit-lib/medkit/actions/workflows/test.yaml/badge.svg)](https://github.com/medkit-lib/medkit/actions/workflows/test.yaml) |
| Package | [![PyPI version](https://img.shields.io/pypi/v/medkit-lib.svg?logo=pypi&label=PyPI&logoColor=gold)](https://pypi.org/project/medkit-lib/) [![PyPI Python versions](https://img.shields.io/pypi/pyversions/medkit-lib.svg?logo=python&label=Python&logoColor=gold)](https://pypi.org/project/medkit-lib/)                                                                                                                                                                   |
| Project | [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://spdx.org/licenses/MIT.html) [![Formatter: Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) [![Project: Hatch](https://img.shields.io/badge/%F0%9F%A5%9A-Hatch-4051b5.svg)](https://hatch.pypa.io)                                                                                   |

----

`medkit` is a toolkit for a learning health system, developed by the [HeKA research team](https://team.inria.fr/heka).

This Python library aims to:

1. Facilitate the manipulation of healthcare data of various modalities (e.g., structured, text, and audio data)
for the extraction of relevant features.

2. Support the development of supervised models from these modalities for decision support in healthcare.

## Installation

To install `medkit` with basic functionalities:

```console
pip install medkit-lib
```

To install `medkit` with all its optional features:

```console
pip install 'medkit-lib[all]'
```
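Optional features can also be installed individually, using the extras declared in the package metadata. For example, to install only the support needed for the Hugging Face entity matcher:

```console
pip install 'medkit-lib[hf-entity-matcher]'
```

Extras can be combined in a single command, e.g. `pip install 'medkit-lib[spacy,quick-umls]'`. The quotes around the requirement specifier prevent some shells (such as zsh) from interpreting the square brackets.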

## Example

A basic named-entity recognition pipeline using `medkit`:

```python
# 1. Define individual operations.
from medkit.text.preprocessing import CharReplacer, LIGATURE_RULES, SIGN_RULES
from medkit.text.segmentation import SentenceTokenizer, SyntagmaTokenizer
from medkit.text.context.negation_detector import NegationDetector
from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

# Preprocessing
char_replacer = CharReplacer(rules=LIGATURE_RULES + SIGN_RULES)
# Segmentation
sent_tokenizer = SentenceTokenizer(output_label="sentence")
synt_tokenizer = SyntagmaTokenizer(output_label="syntagma")
# Negation detection
neg_detector = NegationDetector(output_label="is_negated")
# Entity recognition
entity_matcher = HFEntityMatcher(model="my-BERT-model", attrs_to_copy=["is_negated"])

# 2. Combine operations into a pipeline.
from medkit.core.pipeline import Pipeline, PipelineStep

ner_pipeline = Pipeline(
    input_keys=["full_text"],
    output_keys=["entities"],
    steps=[
        PipelineStep(char_replacer, input_keys=["full_text"], output_keys=["clean_text"]),
        PipelineStep(sent_tokenizer, input_keys=["clean_text"], output_keys=["sentences"]),
        PipelineStep(synt_tokenizer, input_keys=["sentences"], output_keys=["syntagmas"]),
        PipelineStep(neg_detector, input_keys=["syntagmas"], output_keys=[]),
        PipelineStep(entity_matcher, input_keys=["syntagmas"], output_keys=["entities"]),
    ],
)

# 3. Run the NER pipeline on a BRAT document.
from medkit.io import BratInputConverter

docs = BratInputConverter().load(path="/path/to/dataset/")
entities = ner_pipeline.run([doc.raw_segment for doc in docs])
```

## Getting started

To get started with `medkit`, please check out our [documentation](https://medkit.readthedocs.io/).

This documentation also contains tutorials and examples showcasing the use of `medkit` for different tasks.

## Contributing

Thank you for your interest in medkit!

We would be happy to receive your input!

If your problem has not been reported by another user, please open an
[issue](https://github.com/medkit-lib/medkit/issues), whether it's for:

* reporting a bug,
* discussing the current state of the code,
* submitting a fix,
* proposing new features,
* or contributing to the documentation.

Before proposing a pull request, please read [CONTRIBUTING.md](./CONTRIBUTING.md).

## Contact

Feel free to contact us by sending an email to [medkit-maintainers@inria.fr](mailto:medkit-maintainers@inria.fr).
