Metadata-Version: 2.1
Name: doc-curation
Version: 0.1.2
Summary: A package for curating doc file collections, with ability to sync with youtube and archive.org doc items.
Home-page: https://github.com/sanskrit-coders/doc_curation
Author: Sanskrit programmers
Author-email: sanskrit-programmers@googlegroups.com
License: MIT
Keywords: documents books internet-archive
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Education
Classifier: Topic :: Text Processing :: Linguistic
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Requires-Dist: pikepdf
Requires-Dist: pdf2image
Requires-Dist: curation-utils
Requires-Dist: indic-transliteration
Requires-Dist: selenium
Requires-Dist: urllib3
Requires-Dist: regex
Requires-Dist: yamldown
Requires-Dist: pandas
Requires-Dist: requests
Requires-Dist: lxml
Requires-Dist: pytest
Requires-Dist: pypandoc
Requires-Dist: setuptools
Requires-Dist: bs4
Requires-Dist: beautifulsoup4
Requires-Dist: toml
Requires-Dist: more-itertools
Provides-Extra: test
Requires-Dist: pytest ; extra == 'test'

|Build status| |Documentation Status| |PyPI version|

doc curation
------------

A package for curating doc file collections. Prominent features:

-  Scrape texts from various sites, such as Wikisource. See the example
   `here <https://github.com/sanskrit-coders/doc_curation/blob/master/curation_projects/misc/wikisource.py>`__.
   (PS: Consider contributing to the `raw_etexts
   repo <https://github.com/sanskrit/raw_etexts>`__.)
-  OCR a PDF with Google Drive. The PDF is automatically split into
   25-page chunks, and each chunk is OCRed individually. See the usage
   example
   `here <https://github.com/sanskrit-coders/doc_curation/blob/master/curation_projects/pdf_tasks.py>`__
   and the function
   `here <https://github.com/sanskrit-coders/doc_curation/blob/master/doc_curation/pdf.py#L13>`__.
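
The 25-page splitting mentioned above can be pictured with a small
sketch. This is only an illustration, not the package's actual
implementation: the helper name ``chunk_page_ranges`` and its parameters
are hypothetical, and it only computes page ranges without opening any
PDF.

.. code:: python

   def chunk_page_ranges(total_pages, chunk_size=25):
       """Return 1-based, inclusive (first, last) page ranges.

       Each range covers at most ``chunk_size`` pages.
       """
       if total_pages < 0 or chunk_size < 1:
           raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
       return [
           (start, min(start + chunk_size - 1, total_pages))
           for start in range(1, total_pages + 1, chunk_size)
       ]

   # A 60-page PDF would be handled as three chunks.
   print(chunk_page_ranges(60))  # [(1, 25), (26, 50), (51, 60)]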

For users
---------

-  `Autogenerated Docs on readthedocs (might be
   broken) <http://doc_curation.readthedocs.io/en/latest/>`__.
-  Manually and periodically generated docs
   `here <https://sanskrit-coders.github.io/doc_curation/build/html/>`__
-  For detailed examples and help, please see individual module files in
   this package.

Installation or upgrade:
------------------------

-  For the stable version: ``pip install doc_curation -U``
-  For the latest code:
   ``pip install git+https://github.com/sanskrit-coders/doc_curation/@master -U``
-  `PyPI project page <https://pypi.python.org/pypi/doc_curation>`__.
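
To confirm which version was installed, the standard library can query
the package metadata. A minimal sketch, assuming Python 3.8 or newer
(for ``importlib.metadata``); the helper name ``installed_version`` is
illustrative and not part of this package.

.. code:: python

   from importlib import metadata

   def installed_version(dist_name="doc-curation"):
       """Return the installed version string, or None if not installed."""
       try:
           return metadata.version(dist_name)
       except metadata.PackageNotFoundError:
           return None

   # Prints the version string, or None when the package is absent.
   print(installed_version())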

Usage:
------

-  Enable the Google Drive API and download a service account key file
   that has Google Drive API access.

.. code:: python

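   from doc_curation import pdf
   # Path to the PDF to OCR.
   pdf_file = '/home/file.pdf'
   # Path to the service account key file with Google Drive API access.
   key_file = '/home/key.json'
   # Split the PDF into chunks and OCR each one via Google Drive.
   pdf.split_and_ocr_on_drive(pdf_file, key_file)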

.. _usage-for-the-google_vision_pdfpy-to-ocr-pdf-to-txt-files:

Using ``google_vision_pdf.py`` to OCR PDF files to text
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  Follow the instructions here:
   https://cloud.google.com/vision/docs/before-you-begin.
-  Set the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable to
   the path of the JSON file containing your service account key.
-  Example:

::

   export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/service-account-file.json"

-  Invoke the script, passing in the input file. For example:

::

   python3 google_vision_pdf.py --input-file <input.pdf>
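
Before invoking the script, it can help to check that the credentials
variable is set and points to an existing file. The following is a
sketch only: the helper name ``credentials_file`` is hypothetical and is
not provided by this package or by the Google client libraries.

.. code:: python

   import os

   def credentials_file(environ=os.environ):
       """Return the credentials path if it names an existing file, else None."""
       path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
       if path and os.path.isfile(path):
           return path
       return None

   if credentials_file() is None:
       print("GOOGLE_APPLICATION_CREDENTIALS is unset or does not point to a file")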

For contributors
----------------

Contact
-------

Have a problem or question? Please head to the
`GitHub repository <https://github.com/sanskrit-coders/doc_curation>`__.

Packaging
---------

-  ``~/.pypirc`` should contain your PyPI login credentials.

::

   python setup.py bdist_wheel
   twine upload dist/* --skip-existing

Build documentation
-------------------

-  Sphinx HTML docs can be generated with ``cd docs; make html``.

Testing
-------

Run ``pytest`` in the root directory.

Auxiliary tools
---------------

-  |image1|
-  |Documentation Status|
-  `pyup <https://pyup.io/account/repos/github/sanskrit-coders/doc_curation/>`__

.. |Build status| image:: https://github.com/sanskrit-coders/doc_curation/workflows/Python%20package/badge.svg
   :target: https://github.com/sanskrit-coders/doc_curation/actions
.. |Documentation Status| image:: https://readthedocs.org/projects/doc_curation/badge/?version=latest
   :target: http://doc_curation.readthedocs.io/en/latest/?badge=latest
.. |PyPI version| image:: https://badge.fury.io/py/doc_curation.svg
   :target: https://badge.fury.io/py/doc_curation
.. |image1| image:: https://github.com/sanskrit-coders/doc_curation/workflows/Python%20package/badge.svg


