Metadata-Version: 2.1
Name: textcl
Version: 0.1.0
Summary: Package for text preprocessing to use in nlp tasks
Home-page: https://github.com/alinapetukhova/textcl
Author: Alina Petukhova
Author-email: petukhova.alina@gmail.com
License: MIT
Download-URL: https://github.com/alinapetukhova/textcl/archive/refs/tags/v.0.1.0.tar.gz
Keywords: NLP,Text preprocessing,Outlier detection
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: flair (>=0.7)
Requires-Dist: langdetect (>=1.0.8)
Requires-Dist: numpy (<1.20.0,>=1.16.5)
Requires-Dist: pandas (>=1.0.3)
Requires-Dist: lxml (>=4.6.2)
Requires-Dist: protobuf (>=3.14.0)
Requires-Dist: nltk (>=3.4.5)

# TextCL

[![Build Status](https://travis-ci.com/alinapetukhova/textcl.svg?branch=master)](https://travis-ci.com/github/alinapetukhova/textcl)
[![codecov](https://codecov.io/gh/alinapetukhova/textcl/branch/master/graph/badge.svg?token=jgYuXyGGjS)](https://codecov.io/gh/alinapetukhova/textcl)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Introduction

The **TextCL** package aims to clean text data for later use in Natural Language Processing tasks. It can be used as an initial step in text analysis as well as in predictive, classification or text generation models.

The quality of the models strongly depends on the quality of the input data. Common problems in the data sets include:

- If data come from an optical character recognition (OCR) platform, text in tables and columns is usually not processed correctly and adds noise to the models.
- Parts of large texts may contain sentences in languages other than the model's target language; these have to be filtered out.
- Real-world texts often have duplicated sentences due to the use of templates. In text generation tasks, this can cause model overfitting and duplications in generated texts or summaries.
- Data sets may contain text that is different from the main topic, such as a weather forecast in an accounting report.

## Features

The **TextCL** package allows the user to perform the following text pre-processing tasks:

- Split texts into sentences.
- Language filtering, for removing sentences that are not in the target language.
- Perplexity filtering, for removing linguistically unconnected sentences that can be produced by OCR modules. For example: `Sustainability Report 2019 36 3%?!353? 1. 5°C 1} 33%.`
- Duplicate sentence filtering using Jaccard similarity, for removing duplicate sentences from the text.
- Unsupervised outlier detection for revealing texts that are outside of the main data set topic distribution. The following methods are included with the package for this purpose:
  - TONMF: Block Coordinate Descent Framework
    ([source article](https://arxiv.org/pdf/1701.01325.pdf),
    [matlab implementation](https://github.com/ramkikannan/outliernmf))
  - RPCA: Robust Principal Component Analysis
    ([source article](https://arxiv.org/pdf/0912.3599.pdf),
    [python implementation](https://github.com/dganguli/robust-pca))
  - SVD: Singular Value Decomposition
    (based on the [NumPy SVD implementation](https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html))
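
To illustrate the idea behind duplicate sentence filtering with Jaccard similarity, here is a minimal, self-contained sketch. It is not the **TextCL** implementation and does not use the package's API: the helper names `jaccard_similarity` and `drop_near_duplicates`, the word-level tokenization, and the `0.8` threshold are all assumptions chosen for illustration.

```python
import re


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase word sets of two sentences.

    Hypothetical helper for illustration; not part of the textcl API.
    """
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        # Two sentences without any word tokens are treated as identical.
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def drop_near_duplicates(sentences, threshold=0.8):
    """Keep the first occurrence of each sentence and drop later sentences
    whose similarity to an already kept sentence reaches the threshold.

    The default threshold is an assumption, not a value taken from textcl.
    """
    kept = []
    for sentence in sentences:
        if all(jaccard_similarity(sentence, k) < threshold for k in kept):
            kept.append(sentence)
    return kept


sentences = [
    "The company published its annual report.",
    "The company published its annual report!",
    "Revenue increased compared to last year.",
]
unique_sentences = drop_near_duplicates(sentences)
print(unique_sentences)
```

In this sketch the second sentence differs from the first only in punctuation, so both map to the same word set and the later one is dropped. Lowering the threshold would make the filter more aggressive.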

**TextCL**'s API documentation can be found [here](https://alinapetukhova.github.io/textcl/docs/).

**TextCL**'s usage examples can be found [here](https://github.com/alinapetukhova/textcl/blob/master/examples/text_preprocessing_example.ipynb) and [here](https://github.com/alinapetukhova/textcl/blob/master/examples/outlier_detection_functions_plots_example.ipynb).

## Requirements

- Python >= 3.6
- flair >= 0.7
- langdetect >= 1.0.8
- numpy >= 1.16.5, < 1.20.0
- pandas >= 1.0.3
- lxml >= 4.6.2
- protobuf >= 3.14.0
- nltk >= 3.4.5

## How to install

### From PyPI

```bash
pip install textcl
```

### From source

```bash
git clone https://github.com/alinapetukhova/textcl.git
cd textcl
pip install src/
```

The `src/` folder is where the file `setup.py` is located.


## Developers guide

To generate the documentation, run the following command (the output is placed in the `docs` folder):

```bash
pdoc3 --html --output-dir docs src/textcl/
```

where `src/textcl/` is the folder containing the `__init__.py` file.

To run the tests, run `pytest` in the root folder:

```bash
pytest
```

To check test coverage, run:

```bash
pytest --cov=textcl --cov-report=html
```

## License

[MIT License](LICENSE)

