Metadata-Version: 2.1
Name: tmtoolkit
Version: 0.7.0
Summary: Text Mining and Topic Modeling Toolkit
Home-page: https://github.com/WZBSocialScienceCenter/tmtoolkit
Author: Markus Konrad
Author-email: markus.konrad@wzb.eu
License: Apache License 2.0
Project-URL: Bug Reports, https://github.com/WZBSocialScienceCenter/tmtoolkit/issues
Project-URL: Source, https://github.com/WZBSocialScienceCenter/tmtoolkit
Keywords: textmining textanalysis text mining analysis preprocessing topicmodeling topic modeling evaluation
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Requires-Python: >=2.7
Description-Content-Type: text/markdown
Provides-Extra: excel_export
Provides-Extra: topic_modeling_eval_extra
Provides-Extra: topic_modeling_sklearn
Provides-Extra: topic_modeling_gensim
Provides-Extra: topic_modeling_lda
Provides-Extra: wordclouds
Provides-Extra: topic_modeling_coherence_extra
Provides-Extra: improved_german_lemmatization
Provides-Extra: plotting
Requires-Dist: six (>=1.10.0)
Requires-Dist: numpy (>=1.13.0)
Requires-Dist: scipy (>=1.0.0)
Requires-Dist: pandas (>=0.20.0)
Requires-Dist: nltk (>=3.0.0)
Requires-Dist: pyphen (>=0.9.0)
Requires-Dist: globre (>=0.1.5)
Provides-Extra: excel_export
Requires-Dist: openpyxl; extra == 'excel_export'
Provides-Extra: improved_german_lemmatization
Requires-Dist: pattern; extra == 'improved_german_lemmatization'
Provides-Extra: plotting
Requires-Dist: matplotlib (>=2.2.2); extra == 'plotting'
Provides-Extra: topic_modeling_coherence_extra
Requires-Dist: gensim (>=3.4.0); extra == 'topic_modeling_coherence_extra'
Provides-Extra: topic_modeling_eval_extra
Requires-Dist: gmpy2; extra == 'topic_modeling_eval_extra'
Provides-Extra: topic_modeling_gensim
Requires-Dist: gensim; extra == 'topic_modeling_gensim'
Provides-Extra: topic_modeling_lda
Requires-Dist: lda; extra == 'topic_modeling_lda'
Provides-Extra: topic_modeling_sklearn
Requires-Dist: scikit-learn (>=0.18.0); extra == 'topic_modeling_sklearn'
Provides-Extra: wordclouds
Requires-Dist: wordcloud; extra == 'wordclouds'
Requires-Dist: Pillow; extra == 'wordclouds'

# tmtoolkit – Text Mining and Topic Modeling Toolkit for Python

Markus Konrad <markus.konrad@wzb.eu>

`tmtoolkit` is a set of tools for text mining and topic modeling with Python. It contains functions for text preprocessing like lemmatization, stemming or [POS tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging) especially for English and German texts. Preprocessing is done in parallel by using all available processors on your machine. The topic modeling features include topic model evaluation metrics, allowing to calculate models with different parameters in parallel and comparing them (e.g. in order to find the optimal number of topics and other parameters). Topic models can be generated in parallel for different copora and/or parameter sets using the LDA implementations either from [lda](http://pythonhosted.org/lda/), [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) or [gensim](https://radimrehurek.com/gensim/).

## Features

### Text preprocessing

Text preprocessing is built on top of [NLTK](http://www.nltk.org/) and (when run using Python 2.7) [CLiPS pattern](https://www.clips.uantwerpen.be/pattern). Common features include:

* tokenization
* POS tagging (optimized for German and English)
* lemmatization (optimized for German and English)
* stemming
* cleaning tokens
* filtering tokens
* generating n-grams
* generating document-term-matrices

Preprocessing is done in parallel by using all available processors on your machine, greatly improving processing speed as compared to sequential processing on a single processor.

### Topic modeling

Topic models can be generated in parallel for different copora and/or parameter sets using the LDA implementations either from [lda](http://pythonhosted.org/lda/), [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) or [gensim](https://radimrehurek.com/gensim/). They can be evaluated and compared (for example in order to find the optimal number of topics) using several implemented metrics:

* Pair-wise cosine distance method ([Cao Juan et al. 2009](http://doi.org/10.1016/j.neucom.2008.06.011))
* KL divergence method ([Arun et al. 2010](http://doi.org/10.1007/978-3-642-13657-3_43))
* Harmonic mean method ([Griffiths, Steyvers 2004](http://doi.org/10.1073/pnas.0307752101))
* Probability of held-out documents ([Wallach et al. 2009](https://doi.org/10.1145/1553374.1553515))
* Model coherence ([Mimno et al. 2011](https://dl.acm.org/citation.cfm?id=2145462) or with [metrics implemented in Gensim](https://radimrehurek.com/gensim/models/coherencemodel.html))
* the loglikelihood or perplexity methods natively implemented in lda, sklearn or gensim 

Further features include:

* plot evaluation results
* export estimated document-topic and topic-word distributions to Excel
* visualize topic-word distributions and document-topic distributions as word clouds (see [lda_visualization Jupyter Notebook](https://github.com/WZBSocialScienceCenter/tmtoolkit/blob/master/examples/lda_visualization.ipynb))
* visualize topic-word distributions and document-topic distributions as heatmaps (see [lda_visualization Jupyter Notebook](https://github.com/WZBSocialScienceCenter/tmtoolkit/blob/master/examples/lda_visualization.ipynb))
* integrate [PyLDAVis](https://pyldavis.readthedocs.io/en/latest/) to visualize results
* coherence for individual topcis (see [model_coherence Jupyter Notebook](https://github.com/WZBSocialScienceCenter/tmtoolkit/blob/master/examples/model_coherence.ipynb))

## Installation

The package is available on [PyPI](https://pypi.org/project/tmtoolkit/) and can be installed via Python package manager *pip*:

```
# recommended installation
pip install -U tmtoolkit[excel_export,plotting,wordclouds]

# minimal installation:
pip install -U tmtoolkit
```

The package is about 15MB big, because it contains some additional German language model data mode POS tagging and lemmatization.

**Upgrade notice:** If upgrading from an older version to 0.6.0 or above, you will need to uninstall tmtoolkit first (run `pip uninstall tmtoolkit`), before re-installing (using one of the commands described above).


### Optional packages

PyPI packages which can be installed via pip are written *italic*.

* for improved lemmatization of German texts: *Pattern* (please note that *Pattern* is only available on Python 2.7)
* for plotting/visualizations: *matplotlib*
* for the word cloud functions: *wordcloud* and *Pillow*
* for Excel export: *openpyxl*
* for topic modeling, one of the LDA implementations: *lda*, *scikit-learn* or *gensim*
* for additional topic model coherence metrics: *gensim*

For LDA evaluation metrics `griffiths_2004` and `held_out_documents_wallach09` it is necessary to install [gmpy2](https://github.com/aleaxit/gmpy) for multiple-precision arithmetic. This in turn requires installing some C header libraries for GMP, MPFR and MPC. On Debian/Ubuntu systems this is done with:  

```
sudo apt install libgmp-dev libmpfr-dev libmpc-dev
```

After that, gmpy2 can be installed via *pip*.

So for the full set of features, you should run the following (optionally adding gmpy2 if you have installed the above requirements):

```
pip install -U Pattern matplotlib wordcloud Pillow openpyxl lda scikit-learn gensim
```

## Requirements

`tmtoolkit` works with Python 2.7 and Python 3.5 or above. When using lemmatization for German texts, *Pyphen* and *pattern* should be installed, the latter being only available for Python 2.7.

Requirements are automatically installed via *pip*. Additional packages can also be installed via *pip* for certain use cases (see optional packages).

**A special note for Windows users**: tmtoolkit has been tested on Windows and works well (I recommend using the [Anaconda distribution for Python](https://anaconda.org/) there). However, you will need to wrap all code that uses multi-processing (i.e. all calls to `TMPreproc` and the parallel topic modeling functions) in a `if __name__ == '__main__'` block like this:

```python
def main():
    # code with multi-processing comes here
    # ...

if __name__ == '__main__':
    main()
```

See the `examples` directory.

### Required packages

* six
* NumPy
* SciPy
* NLTK
* Pandas
* Pyphen

**Please note:** You will need to install several corpora and language models from NLTK if you didn't do so yet. Run `python -c 'import nltk; nltk.download()'` which will open a graphical downloader interface. You will need at least the following data packages:

* `averaged_perceptron_tagger`
* `punkt`
* `stopwords`
* `wordnet`

## Documentation

Documentation for many methods is still missing at the moment and will be added in the future. For the moment, you should have a look at the examples below and in the `examples` directory.

## Examples

Some examples that can be run directly in an IPython session:

### Preprocessing

We will process as small, self-defined toy corpus with German text. It will be tokenized, cleaned and transformed into a document-term-matrix. **You will need to wrap this in a `if __name__ == '__main__'` block if you're using Windows. See "Special note for Windows users" above.**

```python
from tmtoolkit.preprocess import TMPreproc

# a small toy corpus in German, here directly defined as a dict
# to load "real" (text) files use the methods in tmtoolkit.corpus
corpus = {
    u'doc1': u'Ein einfaches Beispiel in einfachem Deutsch.',
    u'doc2': u'Es enthält nur drei sehr einfache Dokumente.',
    u'doc3': u'Die Dokumente sind sehr kurz.',
}

# initialize
preproc = TMPreproc(corpus, language='german')

# run the preprocessing pipeline: tokenize, POS tag, lemmatize, transform to
# lowercase and then clean the tokens (i.e. remove stopwords)
preproc.tokenize().pos_tag().lemmatize().tokens_to_lowercase().clean_tokens()

print(preproc.tokens)
# this will output: 
#  {u'doc1': (u'einfach', u'beispiel', u'einfach', u'deutsch'),
#   u'doc2': (u'enthalten', u'drei', u'einfach', u'dokument'),
#   u'doc3': (u'dokument', u'kurz')}

print(preproc.tokens_with_pos_tags)
# this will output: 
# {u'doc1': [(u'einfach', u'ADJA'),
#   (u'beispiel', u'NN'),
#   (u'einfach', u'ADJA'),
#   (u'deutsch', u'NN')],
# u'doc2': [(u'enthalten', u'VVFIN'),
#   (u'drei', u'CARD'),
#   (u'einfach', u'ADJA'),
#   (u'dokument', u'NN')],
#  u'doc3': [(u'dokument', u'NN'), (u'kurz', u'ADJD')]}

# generate sparse DTM and print it as a data table
doc_labels, vocab, dtm = preproc.get_dtm()

import pandas as pd
print(pd.DataFrame(dtm.todense(), columns=vocab, index=doc_labels))
```

### Topic modeling

We will use the [lda package](http://pythonhosted.org/lda/) for topic modeling. Several models for different numbers of topics and alpha values are generated and compared. The best is chosen and the results are printed.

```python
from tmtoolkit.topicmod import tm_lda
import lda  # for the Reuters dataset

import matplotlib.pyplot as plt
plt.style.use('ggplot')

doc_labels = lda.datasets.load_reuters_titles()
vocab = lda.datasets.load_reuters_vocab()
dtm = lda.datasets.load_reuters()

# evaluate topic models with different parameters
const_params = dict(n_iter=100, random_state=1)  # low number of iter. just for showing how it works here
varying_params = [dict(n_topics=k, alpha=1.0/k) for k in range(10, 251, 10)]

# this will evaluate 25 models (with n_topics = 10, 20, .. 250) in parallel
models = tm_lda.evaluate_topic_models(dtm, varying_params, const_params,
                                      return_models=True)

# plot the results
# note that since we used a low number of iterations, the plot looks quite "unstable"
# for the given metrics.
from tmtoolkit.topicmod.visualize import plot_eval_results
from tmtoolkit.topicmod.evaluate import results_by_parameter

results_by_n_topics = results_by_parameter(models, 'n_topics')
plot_eval_results(results_by_n_topics)
plt.show()

# the peak seems to be around n_topics == 140
from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words, print_ldamodel_doc_topics

best_model = dict(results_by_n_topics)[140]['model']
print_ldamodel_topic_words(best_model.topic_word_, vocab)
print_ldamodel_doc_topics(best_model.doc_topic_, doc_labels)
```

More examples can be found in the `examples` directory.

## Example data

The provided samples in `examples/data` come from: 

* `bt18_sample_1000.pickle`: [offenesparlament.de](https://github.com/Datenschule/offenesparlament-data)
* `gutenberg`: [Project Gutenberg](https://www.gutenberg.org/)

## License

Code licensed under [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). See `LICENSE` file. 


