Metadata-Version: 2.1
Name: opusfilter
Version: 3.3.1
Summary: Toolbox for filtering parallel corpora
Home-page: https://github.com/Helsinki-NLP/OpusFilter
Author: Mikko Aulamo, Sami Virpioja
Author-email: mikko.aulamo@helsinki.fi
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: setuptools
Requires-Dist: opustools>=1.6.2
Requires-Dist: beautifulsoup4>=4.8.0
Requires-Dist: graphviz
Requires-Dist: py3langid>=0.2.2
Requires-Dist: matplotlib
Requires-Dist: morfessor
Requires-Dist: opus-fast-mosestokenizer>=0.0.8.11
Requires-Dist: pandas>=1.0.0
Requires-Dist: xxhash>=3.2.0
Requires-Dist: sentence-splitter
Requires-Dist: rapidfuzz
Requires-Dist: ruamel.yaml>=0.15.0
Requires-Dist: regex
Requires-Dist: requests
Requires-Dist: scikit-learn
Requires-Dist: subword-nmt
Requires-Dist: tqdm
Requires-Dist: iso639-lang
Requires-Dist: lingua-language-detector<2.1,>=1.3.0; python_version < "3.10"
Requires-Dist: lingua-language-detector>=2.1.1; python_version >= "3.10"
Provides-Extra: all
Requires-Dist: eflomal>=2.0.0; extra == "all"
Requires-Dist: jieba>=0.42; extra == "all"
Requires-Dist: mecab-python3>=1.0.8; extra == "all"
Requires-Dist: unidic-lite; extra == "all"
Requires-Dist: laserembeddings; extra == "all"
Requires-Dist: varikn; extra == "all"
Requires-Dist: pytest; extra == "all"
Requires-Dist: myst-parser; extra == "all"
Requires-Dist: sphinx; extra == "all"
Requires-Dist: sphinx-rtd-theme; extra == "all"
Requires-Dist: sphinxcontrib-bibtex; extra == "all"
Provides-Extra: docs
Requires-Dist: myst-parser; extra == "docs"
Requires-Dist: sphinx; extra == "docs"
Requires-Dist: sphinx-rtd-theme; extra == "docs"
Requires-Dist: sphinxcontrib-bibtex; extra == "docs"
Provides-Extra: eflomal
Requires-Dist: eflomal>=2.0.0; extra == "eflomal"
Provides-Extra: fasttext
Requires-Dist: py3langid<0.3.0; extra == "fasttext"
Requires-Dist: numpy<2.0.0; extra == "fasttext"
Requires-Dist: fasttext; extra == "fasttext"
Provides-Extra: heliport
Requires-Dist: heliport>=0.10.0; extra == "heliport"
Provides-Extra: jieba
Requires-Dist: jieba>=0.42; extra == "jieba"
Provides-Extra: laser
Requires-Dist: laserembeddings; extra == "laser"
Provides-Extra: mecab
Requires-Dist: mecab-python3>=1.0.8; extra == "mecab"
Requires-Dist: unidic-lite; extra == "mecab"
Provides-Extra: pycld2
Requires-Dist: pycld2; extra == "pycld2"
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Provides-Extra: varikn
Requires-Dist: varikn; extra == "varikn"

# OpusFilter

OpusFilter is a tool for filtering and combining parallel corpora.

Features:

* Corpus preprocessing pipelines configured with [YAML](https://yaml.org/)
* Simple downloading of parallel corpora from [OPUS](http://opus.nlpl.eu/) with [OpusTools](https://github.com/Helsinki-NLP/OpusTools)
* Implementations for many common text file operations on parallel files
* Memory-efficient processing of large files
* Filters based on, e.g., language identification, word alignment, n-gram language models, and multilingual sentence embeddings
* Extendable with your own filters written in Python
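
To give a feel for what a custom filter might look like, here is a minimal stand-alone sketch of a score-then-accept pattern applied to sentence pairs. It deliberately does not import OpusFilter: the class name `LengthRatioSketch`, its methods, and the `threshold` parameter are hypothetical and chosen only for illustration. The actual base class and the interface your filters need to implement are described in the full documentation.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


class LengthRatioSketch:
    """Hypothetical stand-alone filter; this is NOT the OpusFilter API."""

    def __init__(self, threshold: float = 3.0):
        # Assumed parameter: maximum allowed ratio of word counts
        self.threshold = threshold

    def score(self, pairs: Iterable[Pair]) -> Iterator[float]:
        """Yield the longer-to-shorter word count ratio for each pair."""
        for src, tgt in pairs:
            shorter, longer = sorted((len(src.split()), len(tgt.split())))
            # An empty segment gives an infinite ratio, so it is never accepted
            yield longer / shorter if shorter else float("inf")

    def accept(self, score: float) -> bool:
        """Accept a pair when its ratio is below the threshold."""
        return score < self.threshold

    def filter(self, pairs: Iterable[Pair]) -> Iterator[Pair]:
        """Stream through the pairs, yielding only the accepted ones."""
        for pair in pairs:
            if self.accept(next(self.score([pair]))):
                yield pair


pairs = [("a b c", "x y z"), ("a", "x y z w"), ("", "x")]
kept = list(LengthRatioSketch(threshold=3.0).filter(pairs))
print(kept)
```

Because `filter` is a generator that handles one pair at a time, the sketch never holds the whole corpus in memory, which is in the spirit of the memory-efficient processing listed above.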

OpusFilter was presented in the [ACL 2020 system demonstrations](https://www.aclweb.org/anthology/2020.acl-demos.20).

## Installing

Install the latest release from PyPI:

* `pip install opusfilter` or `pip install opusfilter[all]` (include optional Python libraries)

Install from source:

* `pip install .` or `python setup.py install`

### Troubleshooting

OpusFilter should generally work on Python 3.8 to 3.13. If you run into problems, try installing the exact versions listed in `requirements.txt`:

* `pip install -r requirements.txt`
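
Before digging into dependency versions, it can help to confirm that your interpreter falls within the Python range stated above. The following is a small sketch using only the standard library; the helper name `is_supported` and the two constants are made up for this example and are not part of OpusFilter.

```python
import sys

# Range taken from the troubleshooting note above (inclusive on both ends)
SUPPORTED_MIN = (3, 8)
SUPPORTED_MAX = (3, 13)


def is_supported(version=None):
    """Return True if the (major, minor) version is within the stated range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


current = ".".join(str(part) for part in sys.version_info[:3])
if is_supported():
    print(f"Python {current} is within the stated range")
else:
    print(f"Python {current} is outside the stated range 3.8 to 3.13")
```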

## Documentation

The complete OpusFilter documentation is available at [helsinki-nlp.github.io/OpusFilter](https://helsinki-nlp.github.io/OpusFilter/).

You can also build the documentation from source:

* `pip install -r docs/requirements.txt` or `pip install .[docs]`
* `sphinx-build docs docs-html`

## Changelog

A changelog is available in [docs/CHANGELOG.md](docs/CHANGELOG.md).

## Citing

If you use OpusFilter in your research, please cite our [ACL 2020 paper](https://www.aclweb.org/anthology/2020.acl-demos.20):

```bibtex
@inproceedings{aulamo-etal-2020-opusfilter,
    title = "{O}pus{F}ilter: A Configurable Parallel Corpus Filtering Toolbox",
    author = {Aulamo, Mikko and Virpioja, Sami and Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.20",
    doi = "10.18653/v1/2020.acl-demos.20",
    pages = "150--156"
}
```

A full bibliography of the papers cited in the documentation and code can be found in [docs/references.bib](docs/references.bib).

## Contributing

See [docs/CONTRIBUTING.md](docs/CONTRIBUTING.md).
