Metadata-Version: 2.4
Name: gitsource
Version: 0.0.3
Summary: GitHub repository reader with document chunking for RAG/LLM applications
Project-URL: Homepage, https://github.com/alexeygrigorev/gitsource
Project-URL: Repository, https://github.com/alexeygrigorev/gitsource
Project-URL: Issues, https://github.com/alexeygrigorev/gitsource/issues
Author-email: Alexey Grigorev <alexey@datatalks.club>
Maintainer-email: Alexey Grigorev <alexey@datatalks.club>
License: WTFPL
Keywords: chunking,documents,github,llm,rag
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: python-frontmatter
Requires-Dist: requests
Description-Content-Type: text/markdown

# gitsource

GitHub repository reader with document chunking for RAG/LLM applications.

## Features

- Download repositories directly from GitHub using `codeload.github.com` (no git required)
- Filter files by extension and path patterns
- Parse frontmatter from Markdown files
- Chunk documents with a sliding window; each chunk keeps the metadata of its source document
- Lightweight Jupyter notebook parser

## Installation

```bash
pip install gitsource
# or
uv add gitsource
```

## Usage

### Read GitHub Repository

```python
from gitsource import GithubRepositoryDataReader

reader = GithubRepositoryDataReader(
    repo_owner="evidentlyai",
    repo_name="docs",
    allowed_extensions={"md", "mdx"},
)

files = reader.read()
```
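
For intuition, the download-and-filter step can be sketched with the standard library alone. This is a minimal sketch, not gitsource's actual implementation: `codeload_zip_url` and `read_zip_files` are hypothetical helpers, the URL layout is the commonly used `codeload.github.com` archive form, and the default branch name `main` is an assumption.

```python
import io
import zipfile
from pathlib import PurePosixPath


def codeload_zip_url(owner: str, repo: str, branch: str = "main") -> str:
    """Build the zip archive URL for a branch (hypothetical helper, assumed URL layout)."""
    return f"https://codeload.github.com/{owner}/{repo}/zip/refs/heads/{branch}"


def read_zip_files(zip_bytes: bytes, allowed_extensions: set[str]) -> dict[str, str]:
    """Return {relative_path: text} for archive entries with an allowed extension."""
    files = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # GitHub archives wrap all entries in one top-level folder; drop it.
            _, _, relative_path = info.filename.partition("/")
            # Extensions are compared without the leading dot, as in the example above.
            extension = PurePosixPath(relative_path).suffix.lstrip(".").lower()
            if extension not in allowed_extensions:
                continue
            files[relative_path] = archive.read(info).decode("utf-8", errors="replace")
    return files


# Fetching the archive would need one HTTP call, for example:
#   zip_bytes = requests.get(codeload_zip_url("evidentlyai", "docs"), timeout=30).content
#   files = read_zip_files(zip_bytes, {"md", "mdx"})
```

Working on the in-memory archive is what makes a local `git` installation unnecessary: nothing is cloned or written to disk.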

### Process Jupyter Notebooks

```python
from gitsource import GithubRepositoryDataReader, notebook_processor

reader = GithubRepositoryDataReader(
    repo_owner="alexeygrigorev",
    repo_name="gitsource",
    branch="master",
    allowed_extensions={"md", "ipynb"},
    filename_filter=lambda fp: fp.startswith("fixtures/"),
    processors={"ipynb": notebook_processor},  # Convert .ipynb to text
)

files = reader.read()
for file in files:
    print(f"{file.filename}: {file.content[:50]}...")
```
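
To illustrate what converting a notebook to text can involve, here is a minimal sketch based on the nbformat 4 JSON layout. `notebook_to_text` is a hypothetical stand-in, not the code behind `notebook_processor`, and dropping cell outputs and raw cells is an assumption of this sketch.

```python
import json


def notebook_to_text(raw_notebook: str) -> str:
    """Flatten the markdown and code cells of a notebook into plain text."""
    notebook = json.loads(raw_notebook)
    parts = []
    for cell in notebook.get("cells", []):
        # Keep only markdown and code cells; raw cells and outputs are skipped.
        if cell.get("cell_type") not in ("markdown", "code"):
            continue
        source = cell.get("source", "")
        # nbformat allows "source" to be a single string or a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        if text.strip():
            parts.append(text)
    return "\n\n".join(parts)
```

A function of this shape takes the raw file content and returns text, which is the kind of callable the `processors` mapping above appears to expect; the exact signature gitsource requires is not confirmed here.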

### Chunk Documents

```python
from gitsource import chunk_documents

documents = [
    {"content": "Long text here...", "filename": "doc.txt"}
]

chunks = chunk_documents(
    documents,
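    size=2000,  # length of each window
    step=1000,  # offset between window starts; neighbouring chunks overlap by size - step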
)
```
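
The effect of `size` and `step` is easiest to see in a small standalone sketch. The code below is an illustration of sliding-window chunking in general, not gitsource's implementation: `sliding_window` and `chunk_with_metadata` are hypothetical names, the window is assumed to be measured in characters, and the `start` field is an assumption about the output shape.

```python
def sliding_window(text: str, size: int, step: int) -> list[dict]:
    """Cut text into windows of `size` characters, starting every `step` characters."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start": start, "content": text[start:start + size]})
        # Stop once a window reaches the end, so no redundant tail chunks are emitted.
        if start + size >= len(text):
            break
    return chunks


def chunk_with_metadata(documents: list[dict], size: int, step: int) -> list[dict]:
    """Chunk each document and copy its non-content fields onto every chunk."""
    result = []
    for doc in documents:
        metadata = {key: value for key, value in doc.items() if key != "content"}
        for chunk in sliding_window(doc["content"], size, step):
            result.append({**metadata, **chunk})
    return result


documents = [{"content": "abcdefghij", "filename": "doc.txt"}]
for chunk in chunk_with_metadata(documents, size=4, step=2):
    print(chunk)
```

With `size=4` and `step=2`, the ten-character text yields four chunks (`abcd`, `cdef`, `efgh`, `ghij`), each overlapping its neighbour by two characters and each carrying `filename`. Choosing `step` smaller than `size` is what produces the overlap, which helps keep sentences that straddle a boundary retrievable from at least one chunk.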

## License

WTFPL
