Metadata-Version: 2.4
Name: polytext
Version: 0.1.3b1
Summary: Python utilities to simplify document files management
Home-page: https://github.com/docsity/polytext
Author: Matteo Senardi
Author-email: matteo.s@docsity.com
License: MIT
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Application Frameworks
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pypdf==5.5.0
Requires-Dist: PyMuPDF>=1.25.5
Requires-Dist: pycryptodome==3.23.0
Requires-Dist: weasyprint==65.1
Requires-Dist: markdown==3.8
Requires-Dist: python-docx==1.1.2
Requires-Dist: google-api-core>=2.24.2
Requires-Dist: google-cloud-storage>=3.1.0
Requires-Dist: google-genai>=1.16.1
Requires-Dist: boto3>=1.38.19
Requires-Dist: botocore>=1.18.19
Requires-Dist: ffmpeg-python==0.2.0
Requires-Dist: pydub==0.25.1
Requires-Dist: youtube-transcript-api==1.0.3
Requires-Dist: yt-dlp==2025.5.22
Requires-Dist: charset-normalizer==3.4.2
Requires-Dist: requests==2.32.3
Requires-Dist: markitdown==0.1.1
Requires-Dist: pymupdf4llm==0.0.24
Requires-Dist: pathlib==1.0.1
Requires-Dist: retry==0.9.2
Provides-Extra: sentry
Requires-Dist: sentry-sdk==2.29.1; extra == "sentry"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# polytext

A Python package for document conversion and text extraction.

## Features

- Convert various document formats (DOCX, ODT, PPT, etc.) to PDF
- Extract text from PDF documents
- Support for both local files and S3 storage
- Multiple PDF parsing backends (PyPDF, PyMuPDF)

## Installation

```bash
# Library only – assumes system requirements are already present
pip install polytext
```

> **Heads-up:** Polytext’s PDF generator relies on [WeasyPrint](https://weasyprint.org/) under the hood.  
> The PyPI wheel contains *only* Python code; you still need WeasyPrint’s **native libraries** (Pango, Cairo, GDK-PixBuf, HarfBuzz, Fontconfig) installed at the OS level.

### System requirements

| Requirement | Notes | macOS (Homebrew) | Ubuntu / Debian |
|-------------|-------|------------------|-----------------|
| **Python** | ✔️ Requires **3.12** or newer (tested on 3.12) | `brew install python@3.12` | `sudo apt install python3.12` |
| **WeasyPrint (native stack)** | Installs Pango, Cairo, etc. | `brew install weasyprint` | `sudo apt install weasyprint` |
| **LibreOffice** | Used for Office → PDF conversion | `brew install --cask libreoffice` | `sudo apt install libreoffice` |
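
To confirm that the system requirements are in place before running a conversion, a small check along these lines may help. It is a minimal sketch and not part of polytext: `check_prerequisites` is a hypothetical helper, and it assumes that LibreOffice exposes a `soffice` or `libreoffice` launcher on `PATH` and that a missing native stack surfaces as an `OSError` when WeasyPrint is imported.

```python
import importlib.util
import shutil


def check_prerequisites() -> dict[str, bool]:
    """Report which external prerequisites appear to be available.

    Hypothetical helper for illustration; polytext does not ship it.
    """
    status: dict[str, bool] = {}

    # Assumption: LibreOffice is reachable as `soffice` or `libreoffice` on PATH.
    status["libreoffice"] = any(
        shutil.which(name) is not None for name in ("soffice", "libreoffice")
    )

    # Importing WeasyPrint loads its native libraries, so an OSError here
    # typically means the OS-level stack (Pango and friends) is missing.
    if importlib.util.find_spec("weasyprint") is None:
        status["weasyprint"] = False
    else:
        try:
            import weasyprint  # noqa: F401
            status["weasyprint"] = True
        except (ImportError, OSError):
            status["weasyprint"] = False

    return status


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

A `missing` entry points at the corresponding row of the table above.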


## Usage

### Converting documents to PDF

```python
from polytext import convert_to_pdf, ConversionError

try:
    # Convert a document to PDF
    pdf_path = convert_to_pdf('input.docx', 'output.pdf')
    print(f"PDF saved to: {pdf_path}")
except ConversionError as e:
    print(f"Conversion failed: {e}")
```

### Text extraction

```python
from polytext import extract_text_from_file

# Extract text from any supported file
text = extract_text_from_file('document.docx')
print(f"Extracted text: {text}")
```

## License

MIT License
