Metadata-Version: 2.1
Name: leaf-focus
Version: 0.5.2
Summary: Extract structured text from pdf files.
Project-URL: Homepage, https://github.com/anotherbyte-net/leaf-focus
Project-URL: Changelog, https://github.com/anotherbyte-net/leaf-focus/blob/main/CHANGELOG.md
Project-URL: Source, https://github.com/anotherbyte-net/leaf-focus
Project-URL: Tracker, https://github.com/anotherbyte-net/leaf-focus/issues
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows
Classifier: Environment :: Console
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Utilities
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: importlib-resources (~=5.9)
Requires-Dist: defusedxml (~=0.7)
Requires-Dist: keras-ocr (~=0.9) ; python_version < "3.10"
Requires-Dist: tensorflow (~=2.10) ; python_version < "3.10"
Requires-Dist: importlib-metadata (~=4.2) ; python_version < "3.8"
Requires-Dist: typing-inspect (~=0.8) ; python_version < "3.8"
Requires-Dist: numpy (~=1.21) ; python_version < "3.8"
Requires-Dist: matplotlib (~=3.5) ; python_version < "3.8"
Requires-Dist: importlib-metadata (~=5.0) ; python_version >= "3.8"
Requires-Dist: numpy (~=1.23) ; python_version >= "3.8"
Requires-Dist: matplotlib (~=3.6) ; python_version >= "3.8"
Provides-Extra: dev
Requires-Dist: pip (~=22.2) ; extra == 'dev'
Requires-Dist: setuptools (~=65.4) ; extra == 'dev'
Requires-Dist: wheel (~=0.37) ; extra == 'dev'
Requires-Dist: build (~=0.8) ; extra == 'dev'
Requires-Dist: twine (~=4.0) ; extra == 'dev'
Requires-Dist: pytest (~=7.1) ; extra == 'dev'
Requires-Dist: pytest-mock (~=3.9) ; extra == 'dev'
Requires-Dist: pytest-cov (~=4.0) ; extra == 'dev'
Requires-Dist: tblib (~=1.7) ; extra == 'dev'
Requires-Dist: tox (~=3.26) ; extra == 'dev'
Requires-Dist: coverage (~=6.5) ; extra == 'dev'
Requires-Dist: hypothesis (~=6.55) ; extra == 'dev'
Requires-Dist: black (~=22.8) ; extra == 'dev'
Requires-Dist: flake8 (~=5.0) ; extra == 'dev'
Requires-Dist: flake8-annotations-coverage (~=0.0) ; extra == 'dev'
Requires-Dist: flake8-black (~=0.3) ; extra == 'dev'
Requires-Dist: flake8-bugbear (~=22.9) ; extra == 'dev'
Requires-Dist: flake8-comprehensions (~=3.10) ; extra == 'dev'
Requires-Dist: flake8-unused-arguments (~=0.0) ; extra == 'dev'
Requires-Dist: flake8-requirements (~=1.7) ; extra == 'dev'
Requires-Dist: mypy (~=0.981) ; extra == 'dev'
Requires-Dist: pylint (~=2.15) ; extra == 'dev'
Requires-Dist: pydocstyle[toml] (~=6.1) ; extra == 'dev'
Requires-Dist: pyright (~=1.1) ; extra == 'dev'
Requires-Dist: types-dateparser (~=1.1) ; extra == 'dev'
Requires-Dist: types-PyYAML (~=6.0) ; extra == 'dev'
Requires-Dist: types-requests (~=2.28) ; extra == 'dev'
Requires-Dist: types-backports (~=0.1) ; extra == 'dev'
Requires-Dist: types-urllib3 (~=1.26) ; extra == 'dev'
Requires-Dist: pdoc (~=12.0) ; extra == 'dev'
Requires-Dist: pyre-check (~=0.9) ; (platform_system != "Windows") and extra == 'dev'
Requires-Dist: pytype (~=2022.8) ; (python_version <= "3.10" and platform_system != "Windows") and extra == 'dev'

# leaf-focus

Extract structured text from PDF files.

## Install

Install from PyPI using pip:

```bash
pip install leaf-focus
```

[![PyPI](https://img.shields.io/pypi/v/leaf-focus)](https://pypi.org/project/leaf-focus/)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/leaf-focus)
[![GitHub Workflow Status (branch)](https://img.shields.io/github/workflow/status/anotherbyte-net/leaf-focus/Test%20Package/main)](https://github.com/anotherbyte-net/leaf-focus/actions)

leaf-focus also needs the [Xpdf command line tools](https://www.xpdfreader.com/download.html).
Download them and extract the executable files, then pass the directory that contains the executables to leaf-focus using the `--exe-dir` option.
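
Before running leaf-focus, it can help to confirm that the extracted directory really contains the Xpdf programs. The following is a minimal sketch, not part of leaf-focus: the function name `find_xpdf_tools` is hypothetical, and the list of tool names is an assumption (these programs ship with the Xpdf command line tools, but which of them leaf-focus calls is not stated in this README).

```python
import shutil
from pathlib import Path

# Assumption: these are the Xpdf programs worth checking for. They are part
# of the Xpdf command line tools, but this README does not say which ones
# leaf-focus actually uses.
XPDF_TOOLS = ("pdfinfo", "pdftotext", "pdftopng")


def find_xpdf_tools(exe_dir, names=XPDF_TOOLS):
    """Map each tool name to its full path in exe_dir, or None if missing."""
    exe_dir = Path(exe_dir)
    if not exe_dir.is_dir():
        raise NotADirectoryError(f"not a directory: {exe_dir}")
    # Passing `path` makes shutil.which look in exe_dir rather than in PATH;
    # on Windows it also tries executable suffixes such as `.exe`.
    return {name: shutil.which(name, path=str(exe_dir)) for name in names}
```

Any tool reported as `None` is missing from the directory or is not marked as executable, so the value passed to `--exe-dir` probably points at the wrong folder.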


## Usage

```text
usage: leaf-focus [-h] [--version] --exe-dir EXE_DIR [--page-images] [--ocr]
                  [--first FIRST] [--last LAST]
                  [--log-level {debug,info,warning,error,critical}]
                  input_pdf output_dir

Extract structured text from a pdf file.

positional arguments:
  input_pdf             path to the pdf file to read
  output_dir            path to the directory to save the extracted text files

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --exe-dir EXE_DIR     path to the directory containing xpdf executable files
  --page-images         save each page of the pdf as a separate image
  --ocr                 run optical character recognition on each page of the
                        pdf
  --first FIRST         the first pdf page to process
  --last LAST           the last pdf page to process
  --log-level {debug,info,warning,error,critical}
                        the log level: debug, info, warning, error, critical
```

### Examples

```bash
# Extract the pdf information and embedded text.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages

# Extract the pdf information, embedded text, an image of each page,
# and optical character recognition (OCR) results for each page.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages --ocr
```
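
The same commands can be driven from a Python script. The sketch below only assembles the command line from the options listed in the usage text above and runs it with `subprocess`; it does not use any leaf-focus Python API. The helper names `build_command` and `run_leaf_focus` are hypothetical, and `xpdf/bin` in the usage note is a placeholder path.

```python
import subprocess


def build_command(exe_dir, input_pdf, output_dir, *, page_images=False,
                  ocr=False, first=None, last=None, log_level=None):
    """Build the leaf-focus argument list from the options in the usage text."""
    cmd = ["leaf-focus", "--exe-dir", str(exe_dir)]
    if page_images:
        cmd.append("--page-images")
    if ocr:
        cmd.append("--ocr")
    if first is not None:
        cmd += ["--first", str(first)]
    if last is not None:
        cmd += ["--last", str(last)]
    if log_level is not None:
        cmd += ["--log-level", log_level]
    # Positional arguments come last: the input pdf, then the output directory.
    cmd += [str(input_pdf), str(output_dir)]
    return cmd


def run_leaf_focus(*args, **kwargs):
    """Run leaf-focus and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(*args, **kwargs), check=True)
```

For example, `run_leaf_focus("xpdf/bin", "file.pdf", "file-pages", ocr=True)` corresponds to the second command above. Keeping the list-building step separate from the `subprocess.run` call makes it possible to inspect or log the command without starting the program.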

## Dependencies

- [Xpdf](https://www.xpdfreader.com/download.html)
- [keras-ocr](https://github.com/faustomorales/keras-ocr)
- [TensorFlow](https://www.tensorflow.org) (can optionally run more efficiently [using one or more GPUs](https://www.tensorflow.org/install/pip#hardware_requirements))

The package metadata lists keras-ocr and TensorFlow as requirements only for Python versions below 3.10, so pip does not install them automatically on Python 3.10 and later.
