Metadata-Version: 2.4
Name: easyocr-unstructured
Version: 1.3.2
Summary: Parse unstructured text from PDFs
Home-page: https://github.com/shorecodeorg/easyocr-unstructured
Author: Kevin Fink
Author-email: kevin@shorecode.org
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: easyocr
Requires-Dist: pdf2image
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary


# EasyOCR Unstructured

EasyOCR Unstructured is an Optical Character Recognition (OCR) library that extracts text from PDFs and then groups the extracted text by proximity.

It is intended for PDF files whose text does not follow the standard left-to-right, top-to-bottom reading order.


## Getting Started

pip install easyocr-unstructured

```python
from pprint import pprint as pp

# Assumes the class is exported at the top level of the package
from easyocr_unstructured import EasyocrUnstructured

# Initialize the EasyOCR Unstructured object
easyocr = EasyocrUnstructured()

# Invoke the OCR process on your PDF file
result = easyocr.invoke('/path/to/your_pdf_file.pdf')

# result is a list of lists of strings, one inner list per group of nearby text
pp(result)
```
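Because `result` is a list of lists of strings, each inner list can be joined into a single text block. The snippet below is a minimal sketch that uses a made-up sample in place of real OCR output; `groups_to_blocks` is a hypothetical helper, not part of the library.

```python
# Hypothetical sample shaped like the documented return value
# (a list of lists of strings); real output depends on your PDF.
result = [
    ["Invoice #1001"],
    ["Bill to:", "Acme Corp"],
]


def groups_to_blocks(groups, sep=" "):
    """Join the strings of each group into one text block."""
    return [sep.join(group) for group in groups]


for block in groups_to_blocks(result):
    print(block)
```

Pass `sep="\n"` instead if each recognized string should stay on its own line within a block.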

## Example Output

The output will look something like this:

```python
[
    ["This is the piece of text. Nothing near it"],
    ["This is the second piece of text.", "This is the third piece of text that was close to the second"],
    ["This is the fourth piece of text. Nothing near it"],
    ...
]
```
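To make the idea of grouping by proximity concrete, here is a minimal sketch of one way it could work. It is not the library's actual implementation: it assumes each OCR entry comes with an axis-aligned bounding box `(x_min, y_min, x_max, y_max)` in pixels, and it merges entries whose boxes lie within a threshold of each other. The names `box_gap` and `group_by_proximity` are hypothetical.

```python
from math import hypot


def box_gap(a, b):
    """Shortest distance in pixels between two boxes (0 if they overlap)."""
    dx = max(0, b[0] - a[2], a[0] - b[2])
    dy = max(0, b[1] - a[3], a[1] - b[3])
    return hypot(dx, dy)


def group_by_proximity(entries, proximity_in_pixels=20):
    """Group (box, text) entries whose boxes are within the threshold.

    Uses a small union-find so that chains of nearby entries
    end up in the same group.
    """
    parent = list(range(len(entries)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(entries)):
        for j in range(i + 1, len(entries)):
            if box_gap(entries[i][0], entries[j][0]) <= proximity_in_pixels:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as root to preserve input order
                    parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i, (_, text) in enumerate(entries):
        groups.setdefault(find(i), []).append(text)
    return list(groups.values())


# Made-up boxes: "second" and "third" are 10 px apart vertically.
entries = [
    ((50, 50, 300, 70), "first"),
    ((400, 400, 650, 420), "second"),
    ((400, 430, 700, 450), "third"),
    ((50, 800, 300, 820), "fourth"),
]
print(group_by_proximity(entries))
```

With the default threshold of 20 pixels, only "second" and "third" are close enough to share a group, which mirrors the shape of the example output above.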

### Prerequisites

- Python 3.8+ according to the package metadata (developed with Python 3.12)

### Installing

pip install easyocr-unstructured

## Usage

```python
# Assumes the class is exported at the top level of the package
from easyocr_unstructured import EasyocrUnstructured

easyocr = EasyocrUnstructured()
result = easyocr.invoke('/path/to/your_pdf_file.pdf')
```

Keyword arguments for more control:

```python
# Assumes the class is exported at the top level of the package
from easyocr_unstructured import EasyocrUnstructured

easyocr = EasyocrUnstructured(init_reader=False, gpu=True)
result = easyocr.invoke(
    '/path/to/your_pdf_file.pdf',
    proximity_in_pixels=20,
    gpu=True,
    dpi=120,
    batch_size=3,
)
```

- init_reader (bool): Whether to load the EasyOCR reader when the class is initialized.
    If False, the reader is loaded every time invoke is called.
- proximity_in_pixels (int, optional): The proximity threshold, in pixels,
    for grouping text entries. Defaults to 20.
- gpu (bool): Whether to compute on the GPU. If True and no GPU
    is available, the CPU is used instead.
- dpi (int): DPI used when parsing the PDF. Higher values
    are more accurate but slower and use more memory.
- batch_size (int): Batch size used for both
    parsing PDFs and scanning them.
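To get a feel for the `dpi` and `batch_size` trade-offs, the sketch below estimates the size of a rendered page and the number of batches for a document. It assumes US Letter pages (8.5 x 11 inches), uncompressed 8-bit RGB images, and pages processed in consecutive chunks of `batch_size`; how the library batches internally is not documented here, so treat these numbers as rough estimates. All function names are hypothetical.

```python
import math


def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def rgb_megabytes(width_px, height_px):
    """Approximate size of an uncompressed 8-bit RGB image, in MB."""
    return width_px * height_px * 3 / 1_000_000


def batch_count(num_pages, batch_size):
    """Number of batches needed, assuming consecutive chunks of pages."""
    return math.ceil(num_pages / batch_size)


for dpi in (120, 300):
    w, h = page_pixels(8.5, 11, dpi)
    print(f"{dpi} dpi: {w} x {h} px, about {rgb_megabytes(w, h):.1f} MB per page")

print(batch_count(10, 3), "batches for 10 pages at batch_size=3")
```

Pixel count grows with the square of the DPI, so going from 120 to 300 DPI multiplies per-page memory by 6.25. If distances are measured on the rendered image, as the name `proximity_in_pixels` suggests, the same physical gap also spans more pixels at a higher DPI, so the threshold may need to be scaled along with it.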

## Running the tests

No tests yet

## Built With

- Wing Pro
- Python 3.12
- numpy
- easyocr
- pdf2image
- hashlib

## Contributing

Please do! Any sensible and safe change will be added.

## Authors

Kevin Fink

## License

MIT

## Acknowledgments

