Metadata-Version: 2.1
Name: hashedindex
Version: 0.10.0
Summary: InvertedIndex implementation using hash lists (dictionaries)
Home-page: https://github.com/MichaelAquilina/hashedindex
Author: Michael Aquilina
Author-email: michaelaquilina@gmail.com
License: BSD
Keywords: hashedindex
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8

===============================
hashedindex
===============================

|TravisCI| |AppVeyor| |CodeCov| |PyPi|


Fast and simple InvertedIndex implementation using hash lists (python dictionaries).

Supports Python 3.5+

Free software: BSD license

* Installing_
* Features_
* `Text Parsing`_
* `Stemming`_
* `Integration with Numpy and Pandas`_
* `Reporting Bugs`_


Installing
----------

The easiest way to install ``hashedindex`` is through PyPI

::

    pip install hashedindex


Features
--------

``hashedindex`` provides an easy-to-use inverted index structure that is flexible enough to support a wide range of use cases.

Basic Usage:

.. code-block:: python

    import hashedindex
    index = hashedindex.HashedIndex()

    index.add_term_occurrence('hello', 'document1.txt')
    index.add_term_occurrence('world', 'document1.txt')

    index.get_documents('hello')
    Counter({'document1.txt': 1})

    index.items()
    {'hello': Counter({'document1.txt': 1}),
     'world': Counter({'document1.txt': 1})}

    example = 'The Quick Brown Fox Jumps Over The Lazy Dog'

    for term in example.split():
        index.add_term_occurrence(term, 'document2.txt')

``hashedindex`` is not limited to strings; any hashable object can be indexed.

.. code-block:: python

   index.add_term_occurrence('foo', 10)
   index.add_term_occurrence(('fire', 'fox'), 90.2)

   index.items()
   {'foo': Counter({10: 1}), ('fire', 'fox'): Counter({90.2: 1})}
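
Since version 0.10.0 (see the changelog below), ``add_term_occurrence`` also accepts an optional ``count`` parameter, so multiple occurrences can be recorded in a single call rather than a loop. A small sketch:

.. code-block:: python

   import hashedindex
   from collections import Counter

   index = hashedindex.HashedIndex()

   # Record three occurrences of 'hello' in one call instead of looping
   index.add_term_occurrence('hello', 'document1.txt', count=3)
   index.add_term_occurrence('hello', 'document2.txt')

   index.get_documents('hello')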

Text Parsing
------------

The ``hashedindex`` package includes a ``textparser`` module with methods for splitting
text into tokens.

.. code-block:: python

   from hashedindex import textparser
   list(textparser.word_tokenize("hello cruel world"))
   [('hello',), ('cruel',), ('world',)]

Tokens are wrapped in tuples because the tokenizer can emit n-grams of any length:

.. code-block:: python

   list(textparser.word_tokenize("Life is about making an impact, not making an income.", ngrams=2))
   [('life', 'is'), ('is', 'about'), ('about', 'making'), ('making', 'an'), ('an', 'impact'),
    ('impact', 'not'), ('not', 'making'), ('making', 'an'), ('an', 'income')]

Take a look at the function's docstring for information on how to use ``stopwords``, specify a ``min_length`` for tokens, and configure token output using the ``ignore_numeric``, ``retain_casing`` and ``retain_punctuation`` parameters.
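
As an illustrative sketch of two of those parameters: ``stopwords`` takes a collection of tokens to drop, and ``min_length`` filters out short tokens (the exact boundary behaviour is described in the docstring):

.. code-block:: python

   from hashedindex import textparser

   stops = {'the', 'a'}
   tokens = list(textparser.word_tokenize(
       "The quick brown fox jumped over a log",
       stopwords=stops,
       min_length=4,
   ))
   # Stopwords and short tokens such as 'fox' are excluded from the stream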

By default, ``word_tokenize`` omits whitespace from the output token stream, since whitespace tokens are rarely useful in a document term index.

If you need to tokenize text and reassemble output whose spacing matches the input, enable the ``tokenize_whitespace`` flag:

.. code-block:: python

    list(textparser.word_tokenize('Conventions.  May. Differ.', tokenize_whitespace=True))
    [('conventions',), ('  ',), ('may',), (' ',), ('differ',)]

Stemming
--------

When building an inverted index, it can be useful to resolve related strings to a common root.

For example, in a corpus relating to animals it might be useful to derive a singular noun for each animal; as a result, documents containing either the word ``dog`` or ``dogs`` could be found under the index entry ``dog``.

The ``hashedindex`` module's text parser provides optional support for stemming: the caller can supply any stemmer object that exposes a ``stem`` method:

.. code-block:: python

   class NaivePluralStemmer:
       def stem(self, x):
           return x.rstrip('s')

   list(textparser.word_tokenize('It was raining cats and dogs', stemmer=NaivePluralStemmer()))
   [('it',), ('wa',), ('raining',), ('cat',), ('and',), ('dog',)]


Integration with Numpy and Pandas
---------------------------------

The idea behind ``hashedindex`` is to provide a quick and easy way to generate feature
matrices for machine learning, in combination with numpy, pandas and scikit-learn.
For example:

.. code-block:: python

   from hashedindex import textparser
   import hashedindex
   import numpy as np

   index = hashedindex.HashedIndex()

   documents = ['spam1.txt', 'ham1.txt', 'spam2.txt']
   for doc in documents:
       with open(doc, 'r') as fp:
            for term in textparser.word_tokenize(fp.read()):
                index.add_term_occurrence(term, doc)

   # You *probably* want to use scipy.sparse.csr_matrix for better performance
   X = np.asarray(index.generate_feature_matrix(mode='tfidf'))

   y = []
   for doc in index.documents():
       y.append(1 if 'spam' in doc else 0)
   y = np.asarray(y)

   from sklearn.svm import SVC
   classifier = SVC(kernel='linear')
   classifier.fit(X, y)

You can also extend your feature matrix to a more verbose pandas DataFrame:

.. code-block:: python

   import pandas as pd
   X  = index.generate_feature_matrix(mode='tfidf')
   df = pd.DataFrame(X, columns=index.terms(), index=index.documents())
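
As the comment in the scikit-learn example above suggests, the dense feature matrix can also be wrapped in a ``scipy.sparse`` matrix for larger corpora. A minimal sketch, assuming ``scipy`` is installed:

.. code-block:: python

   import hashedindex
   from scipy.sparse import csr_matrix

   index = hashedindex.HashedIndex()
   for term in 'the quick brown fox'.split():
       index.add_term_occurrence(term, 'document1.txt')

   # With many documents and terms, most entries are zero, so a sparse
   # representation can save substantial memory
   X = csr_matrix(index.generate_feature_matrix(mode='tfidf'))
   X.shape  # one row per document, one column per term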

The methods in the library have high test coverage, so you can be reasonably confident that everything works as expected.

Reporting Bugs
--------------

Found a bug? Great: a bug found is a bug fixed. Open an issue, or better still, open a pull request.

.. |TravisCI| image:: https://travis-ci.org/MichaelAquilina/hashedindex.svg?branch=master
   :target: https://travis-ci.org/MichaelAquilina/hashedindex

.. |AppVeyor| image:: https://ci.appveyor.com/api/projects/status/qkhn4bub2pye7skm?svg=true
   :target: https://ci.appveyor.com/project/MichaelAquilina/hashedindex

.. |PyPi| image:: https://badge.fury.io/py/hashedindex.svg
   :target: https://badge.fury.io/py/hashedindex

.. |CodeCov| image:: https://codecov.io/gh/MichaelAquilina/hashedindex/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/MichaelAquilina/hashedindex




History
-------

0.10.0 (2020-10-19)
-------------------
* Add ``count`` optional parameter to the ``add_term_occurrence`` method (@jayaddison)

0.9.0 (2020-07-14)
------------------
* Support non-ASCII characters during tokenization (@jayaddison)

0.8.0 (2019-05-08)
------------------
* Add option to retain punctuation in ``word_tokenize`` (@jayaddison)
* Add option to include whitespace tokens in ``word_tokenize`` results (@jayaddison)

0.7.1 (2019-04-30)
------------------
* Fix minor issue in history changelog

0.7.0 (2019-04-30)
------------------
* Add support for retaining token casing in ``word_tokenize`` (Thanks @jayaddison)

0.6.0 (2019-12-11)
------------------

* Add support for running stemming operations with ``word_tokenize`` (Thanks @jayaddison)
* Add official support for python 3.8

0.5.0 (2019-07-21)
------------------
* Drop support for python 2.7 and 3.4

0.1.0 (2015-01-11)
------------------

* First release on PyPI.


