Metadata-Version: 2.4
Name: docsloader
Version: 0.0.13
Summary: This is a documents loader.
Author-email: axiner <atpuxiner@163.com>
Project-URL: Homepage, https://github.com/atpuxiner/docsloader
Classifier: Programming Language :: Python :: 3.11
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: toollib>=1.9.7
Requires-Dist: pydantic>=2.12.5
Requires-Dist: aiofiles>=25.1.0
Requires-Dist: aiohttp>=3.13.2
Requires-Dist: pywin32>=311; platform_system == "Windows"
Provides-Extra: txt
Provides-Extra: csv
Provides-Extra: md
Provides-Extra: html
Requires-Dist: lxml>=6.0.2; extra == "html"
Provides-Extra: xlsx
Requires-Dist: openpyxl>=3.1.5; extra == "xlsx"
Requires-Dist: xlrd>=2.0.2; extra == "xlsx"
Provides-Extra: pptx
Requires-Dist: python-pptx>=1.0.2; extra == "pptx"
Provides-Extra: docx
Requires-Dist: python-docx>=1.2.0; extra == "docx"
Provides-Extra: pdf
Requires-Dist: pymupdf>=1.26.7; extra == "pdf"
Requires-Dist: numpy; extra == "pdf"
Provides-Extra: img
Requires-Dist: rapidocr-onnxruntime>=1.4.4; extra == "img"
Provides-Extra: auto
Requires-Dist: docsloader[txt]; extra == "auto"
Requires-Dist: docsloader[csv]; extra == "auto"
Requires-Dist: docsloader[md]; extra == "auto"
Requires-Dist: docsloader[html]; extra == "auto"
Requires-Dist: docsloader[xlsx]; extra == "auto"
Requires-Dist: docsloader[pptx]; extra == "auto"
Requires-Dist: docsloader[docx]; extra == "auto"
Requires-Dist: docsloader[pdf]; extra == "auto"
Requires-Dist: docsloader[img]; extra == "auto"
Provides-Extra: all
Requires-Dist: docsloader[auto]; extra == "all"
Dynamic: license-file

# docsloader

## What is this?

- by: axiner
- docsloader
- A document loader for text, CSV, Markdown, HTML, Excel, PowerPoint, Word, PDF, and image files.

## Installation

This package can be installed using pip (Python>=3.11):
> pip install docsloader

- To install all optional dependencies: `pip install docsloader[all]`
- To install the dependencies for specific file types only:
    - txt: `pip install docsloader[txt]`
    - csv: `pip install docsloader[csv]`
    - md: `pip install docsloader[md]`
    - html: `pip install docsloader[html]`
    - xlsx: `pip install docsloader[xlsx]`
    - pptx: `pip install docsloader[pptx]`
    - docx: `pip install docsloader[docx]`
    - pdf: `pip install docsloader[pdf]`
    - img: `pip install docsloader[img]`
    - auto: `pip install docsloader[auto]`

## Usage

The `docsloader` package provides asynchronous document loaders for a range of file types. It includes a dedicated
loader for each type and an `AutoLoader` that selects the appropriate loader from the file suffix.

### Supported File Suffixes

The package supports loading documents from the following file suffixes:

- **Text Files**: `.txt`
- **CSV Files**: `.csv`
- **Markdown Files**: `.md`
- **HTML Files**: `.html`, `.htm`
- **Excel Files**: `.xlsx`, `.xls`
- **PowerPoint Files**: `.pptx`, `.ppt`
- **Word Files**: `.docx`, `.doc`
- **PDF Files**: `.pdf`
- **Image Files**: `.jpg`, `.jpeg`, `.png`

### Available Loaders

The package provides the following loader classes:

- `TxtLoader`: For Text files
- `CsvLoader`: For CSV files
- `MdLoader`: For Markdown files
- `HtmlLoader`: For HTML files
- `XlsxLoader`: For Excel files
- `PptxLoader`: For PowerPoint files
- `DocxLoader`: For Word files
- `PdfLoader`: For PDF files
- `ImgLoader`: For image files
- `AutoLoader`: Automatically selects the appropriate loader based on file suffix

All loader classes implement asynchronous `load` methods for efficient document processing.
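
To make the suffix-based selection concrete, here is a minimal sketch of a lookup from suffix to loader class name, built only from the two lists above. The table `SUFFIX_TO_LOADER` and the helper `pick_loader_name` are hypothetical names used for illustration. They are not part of `docsloader`, and the real `AutoLoader` may resolve suffixes differently.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Illustrative table only: suffixes and class names come from the lists above,
# but this is NOT docsloader's internal registry.
SUFFIX_TO_LOADER = {
    ".txt": "TxtLoader",
    ".csv": "CsvLoader",
    ".md": "MdLoader",
    ".html": "HtmlLoader",
    ".htm": "HtmlLoader",
    ".xlsx": "XlsxLoader",
    ".xls": "XlsxLoader",
    ".pptx": "PptxLoader",
    ".ppt": "PptxLoader",
    ".docx": "DocxLoader",
    ".doc": "DocxLoader",
    ".pdf": "PdfLoader",
    ".jpg": "ImgLoader",
    ".jpeg": "ImgLoader",
    ".png": "ImgLoader",
}


def pick_loader_name(path_or_url: str) -> Optional[str]:
    """Return the loader class name for a path or URL, or None if unsupported."""
    # For URLs, keep only the path so the query string and fragment are ignored.
    path = urlparse(path_or_url).path if "://" in path_or_url else path_or_url
    # Normalise Windows separators, then compare the suffix case-insensitively.
    suffix = PurePosixPath(path.replace("\\", "/")).suffix.lower()
    return SUFFIX_TO_LOADER.get(suffix)
```

A helper like this can be used to check whether a file is supported before constructing a loader. For example, `pick_loader_name("report.PDF")` returns `"PdfLoader"`, and a path without a suffix returns `None`.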

### Example

```python
import asyncio

from docsloader import AutoLoader
from toollib.log import init_logger

logger = init_logger(__name__)


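async def main(path_or_url: str):
    # AutoLoader selects the concrete loader from the file suffix.
    loader = AutoLoader(
        path_or_url=path_or_url,
        rm_tmpfile=False,
    )
    # load() is asynchronous, so iterate over its documents with `async for`.
    async for doc in loader.load():
        logger.info(doc)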


if __name__ == "__main__":
    asyncio.run(main(path_or_url=r"E:/NewFolder/测试.docx"))
```
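
When several files need to be loaded, the same `async for` pattern can be run concurrently. The sketch below is one possible approach, not something `docsloader` provides. The helpers `collect` and `load_many` are hypothetical, and `fake_stream` is a stand-in for `AutoLoader(path_or_url=source).load()` so that the block runs without the package installed.

```python
import asyncio
from typing import AsyncIterator, Callable, Dict, Iterable, List


async def collect(stream: AsyncIterator) -> List:
    """Drain an async iterator (such as the result of `loader.load()`) into a list."""
    return [item async for item in stream]


async def load_many(
    sources: Iterable[str],
    make_stream: Callable[[str], AsyncIterator],
    max_concurrency: int = 4,
) -> Dict[str, List]:
    """Load several sources concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def load_one(source: str):
        # The semaphore caps how many sources are being read at once.
        async with semaphore:
            return source, await collect(make_stream(source))

    # gather() preserves input order, so results line up with `sources`.
    pairs = await asyncio.gather(*(load_one(s) for s in sources))
    return dict(pairs)


async def fake_stream(source: str):
    """Stand-in for `AutoLoader(path_or_url=source).load()`; yields two items."""
    for idx in range(2):
        await asyncio.sleep(0)  # give other tasks a chance to run
        yield f"{source}#{idx}"
```

To try it without the package, run `asyncio.run(load_many(["a.txt", "b.md"], fake_stream))`. With `docsloader` installed, passing `lambda source: AutoLoader(path_or_url=source).load()` as `make_stream` should work in the same way, assuming `load()` behaves as in the example above.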

## License

This project is released under the MIT License (MIT). See [LICENSE](LICENSE).
