Metadata-Version: 2.4
Name: codetex-mcp
Version: 0.3.0
Summary: Commit-aware code context manager for LLMs - MCP server and CLI
Project-URL: Homepage, https://github.com/mrosata/codetex-mcp
Project-URL: Repository, https://github.com/mrosata/codetex-mcp
Project-URL: Issues, https://github.com/mrosata/codetex-mcp/issues
Author-email: Michael Rosata <michael.rosata@gmail.com>
License: MIT
License-File: LICENSE
Keywords: code-context,llm,mcp,sqlite,tree-sitter
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.12
Requires-Dist: aiosqlite>=0.20
Requires-Dist: anthropic>=0.40
Requires-Dist: mcp>=1.0
Requires-Dist: pathspec>=0.12
Requires-Dist: rich>=13.0
Requires-Dist: sentence-transformers>=3.0
Requires-Dist: sqlite-vec>=0.1
Requires-Dist: tiktoken>=0.7
Requires-Dist: tree-sitter-cpp>=0.23
Requires-Dist: tree-sitter-go>=0.23
Requires-Dist: tree-sitter-java>=0.23
Requires-Dist: tree-sitter-javascript>=0.23
Requires-Dist: tree-sitter-python>=0.23
Requires-Dist: tree-sitter-ruby>=0.23
Requires-Dist: tree-sitter-rust>=0.23
Requires-Dist: tree-sitter-typescript>=0.23
Requires-Dist: tree-sitter>=0.23
Requires-Dist: typer>=0.9
Description-Content-Type: text/markdown

# codetex-mcp

A commit-aware code context manager for LLMs. Indexes Git repositories into a multi-tier knowledge hierarchy — repo overviews, file summaries, and symbol details — stored in SQLite with vector search. Serves context to LLM clients via the [Model Context Protocol](https://modelcontextprotocol.io/) (MCP) or a local CLI.

## What It Does

codetex builds a structured, searchable index of your codebase that LLMs can query on demand:

- **Tier 1 — Repo Overview:** Purpose, architecture, directory structure, key technologies, entry points
- **Tier 2 — File Summaries:** Per-file purpose, public interfaces, dependencies, roles
- **Tier 3 — Symbol Details:** Function/class signatures, parameters, return types, call relationships

Summaries are generated by an LLM (Anthropic Claude). Embeddings are computed locally with [sentence-transformers](https://www.sbert.net/) for semantic search. Everything is stored in a single SQLite database with [sqlite-vec](https://github.com/asg017/sqlite-vec) for vector queries.

Incremental sync means only changed files are re-analyzed when you update your code.

## Requirements

- Python 3.12+
- Git
- An [Anthropic API key](https://console.anthropic.com/) (for indexing)

## Installation

```bash
# With pip
pip install codetex-mcp

# With uv (recommended)
uv tool install codetex-mcp
```

## Quick Start

### 1. Set your Anthropic API key

```bash
# Via environment variable
export ANTHROPIC_API_KEY=sk-ant-...

# Or via config
codetex config set llm.api_key sk-ant-...
```

### 2. Add a repository

```bash
# Local repo
codetex add /path/to/your/project

# Remote repo (clones to ~/.codetex/repos/)
codetex add https://github.com/user/repo.git
```

### 3. Index it

```bash
# Preview what indexing will cost (no API calls)
codetex index my-project --dry-run

# Build the full index
codetex index my-project
```

### 4. Query your codebase

```bash
# Repo overview (Tier 1)
codetex context my-project

# File summary (Tier 2)
codetex context my-project --file src/auth/login.py

# Symbol detail (Tier 3)
codetex context my-project --symbol authenticate_user

# Semantic search
codetex context my-project --query "how is authentication implemented?"
```

### 5. Keep it up to date

```bash
# Incremental sync — only re-analyzes changed files
codetex sync my-project
```

## MCP Server Setup

The MCP server lets LLM clients such as Claude Code, Cursor, and Windsurf query your indexed codebases directly.

### Claude Code

Add to your Claude Code MCP settings (`.mcp.json` in your project root, or your user-level Claude settings; Claude Desktop uses `claude_desktop_config.json` instead):

```json
{
  "mcpServers": {
    "codetex": {
      "command": "codetex",
      "args": ["serve"],
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-..."
      }
    }
  }
}
```

If you installed with `uv tool`, use the full path:

```json
{
  "mcpServers": {
    "codetex": {
      "command": "/path/to/codetex",
      "args": ["serve"],
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-..."
      }
    }
  }
}
```

Find the path with `which codetex`; `uv tool` installs executables into the directory shown by `uv tool dir --bin` (typically `~/.local/bin`).

### Other MCP Clients

Any client that supports MCP stdio transport can use codetex. The server command is:

```bash
codetex serve
```

### Available MCP Tools

Once connected, the LLM has access to 7 tools:

| Tool | Description |
|------|-------------|
| `get_repo_overview` | Tier 1 repo overview (architecture, technologies, entry points) |
| `get_file_context` | Tier 2 file summary with symbol list |
| `get_symbol_detail` | Tier 3 full symbol detail (signature, params, relationships) |
| `search_context` | Semantic search across all indexed context |
| `get_repo_status` | Index status (staleness, file/symbol counts, last indexed) |
| `sync_repo` | Trigger incremental sync from within the LLM session |
| `list_repos` | List all registered repositories |

## CLI Reference

### `codetex add <target>`

Register a Git repository. Accepts a local path or remote URL.

```bash
codetex add .                                    # Current directory
codetex add /path/to/repo                        # Local path
codetex add https://github.com/user/repo.git     # Remote (clones locally)
codetex add git@github.com:user/repo.git         # SSH remote
```

### `codetex index <repo-name>`

Build a full index for a registered repository.

```bash
codetex index my-project                # Full index
codetex index my-project --dry-run      # Preview (files, symbols, estimated LLM calls/tokens)
codetex index my-project --path src/    # Index only files under src/
```

### `codetex sync <repo-name>`

Incremental sync to the current HEAD. Only files changed since the last indexed commit are re-analyzed.

```bash
codetex sync my-project                 # Sync changes
codetex sync my-project --dry-run       # Preview what would change
codetex sync my-project --path src/     # Sync only changes under src/
```

### `codetex context <repo-name>`

Query indexed context at any tier.

```bash
codetex context my-project                              # Tier 1: repo overview
codetex context my-project --file src/main.py           # Tier 2: file summary
codetex context my-project --symbol MyClass             # Tier 3: symbol detail
codetex context my-project --query "error handling"     # Semantic search
```

### `codetex status <repo-name>`

Show index status: indexed commit, current HEAD, staleness, file/symbol counts, token usage.

### `codetex list`

List all registered repositories with their index status.

### `codetex config show`

Display the current configuration.

### `codetex config set <key> <value>`

Update a configuration value.

```bash
codetex config set llm.api_key sk-ant-...
codetex config set llm.model claude-sonnet-4-5-20250929
codetex config set indexing.max_file_size_kb 1024
codetex config set indexing.max_concurrent_llm_calls 10
```

## Configuration

Configuration is loaded in layers (last wins):

1. **Defaults** — sensible out-of-the-box values
2. **TOML file** — `~/.codetex/config.toml`
3. **Environment variables** — override everything

### Config file

```toml
# ~/.codetex/config.toml

[storage]
data_dir = "~/.codetex"                  # Base directory for DB and cloned repos

[llm]
provider = "anthropic"                   # LLM provider (currently: anthropic)
model = "claude-sonnet-4-5-20250929"     # Model used for summarization
api_key = "sk-ant-..."                   # Anthropic API key

[indexing]
max_file_size_kb = 512                   # Skip files larger than this
max_concurrent_llm_calls = 5             # Parallel LLM requests during indexing
tier1_rebuild_threshold = 0.10           # Rebuild repo overview if >=10% of files changed on sync

[embedding]
model = "all-MiniLM-L6-v2"              # Sentence-transformers model for embeddings
```
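The `tier1_rebuild_threshold` semantics amount to a simple fraction check, roughly like this (hypothetical helper, not codetex-mcp's actual code):

```python
# Illustrative: how tier1_rebuild_threshold gates a Tier 1 rebuild on sync.
def should_rebuild_overview(changed_files: int, total_files: int,
                            threshold: float = 0.10) -> bool:
    """Rebuild the repo overview when the changed fraction meets the threshold."""
    if total_files == 0:
        return False
    return changed_files / total_files >= threshold
```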

### Environment variables

| Variable | Maps to | Example |
|----------|---------|---------|
| `ANTHROPIC_API_KEY` | `llm.api_key` | `sk-ant-...` |
| `CODETEX_DATA_DIR` | `storage.data_dir` | `/custom/path` |
| `CODETEX_LLM_PROVIDER` | `llm.provider` | `anthropic` |
| `CODETEX_LLM_MODEL` | `llm.model` | `claude-sonnet-4-5-20250929` |
| `CODETEX_MAX_FILE_SIZE_KB` | `indexing.max_file_size_kb` | `1024` |
| `CODETEX_MAX_CONCURRENT_LLM` | `indexing.max_concurrent_llm_calls` | `10` |
| `CODETEX_TIER1_THRESHOLD` | `indexing.tier1_rebuild_threshold` | `0.15` |
| `CODETEX_EMBEDDING_MODEL` | `embedding.model` | `all-MiniLM-L6-v2` |

## File Exclusion

Files are filtered through multiple stages:

1. **Default excludes** — `node_modules/`, `__pycache__/`, `.git/`, `dist/`, `build/`, `.venv/`, `*.lock`, `*.min.js`, `*.pyc`, `*.so`, etc.
2. **`.gitignore`** — standard gitignore rules from your repo
3. **`.codetexignore`** — same syntax as `.gitignore`, placed in your repo root. Use `!pattern` to un-ignore files
4. **File size** — files exceeding `max_file_size_kb` are skipped
5. **Binary detection** — files with null bytes in the first 8 KB are skipped
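A `.codetexignore` might look like this (patterns are examples only):

```gitignore
# Skip generated and vendored code
docs/generated/
vendor/

# Re-include one vendored file that matters
!vendor/patched_client.py
```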

## Language Support

| Language | Tree-sitter (full AST) | Fallback (regex) |
|----------|:----------------------:|:-----------------:|
| Python | Yes | Yes |
| JavaScript | Yes | Yes |
| TypeScript | Yes | Yes |
| Go | Yes | Yes |
| Rust | Yes | Yes |
| Java | Yes | Yes |
| Ruby | Yes | Yes |
| C/C++ | Yes | Yes |
| All others | — | Yes |

Tree-sitter grammars for all 8 languages are installed automatically. For other languages, the fallback parser uses regex patterns to extract functions, classes, and imports.

## Architecture

```
CLI (Typer) ──┐
              ├──▶ Core Services (Indexer, Syncer, ContextStore, SearchEngine)
MCP (FastMCP)─┘         │              │              │
                    Analysis        LLM Provider    Embeddings
                 (tree-sitter +    (Anthropic)    (sentence-transformers)
                  regex fallback)       │              │
                         └──────────────┴──────────────┘
                                        │
                                   SQLite + sqlite-vec
```

- **Two entry points** (CLI and MCP server) share the same core service layer
- **No DI framework** — services are wired via a `create_app()` factory
- **All core services are async** — CLI bridges with `asyncio.run()`
- **Embeddings are local** — no external API calls for vector search (model auto-downloads on first run, ~90 MB)
- **Single SQLite database** — 6 main tables + 2 vector tables (384-dimensional embeddings)
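The factory wiring can be pictured as a plain function that constructs services and hands shared instances to their dependents (a hypothetical sketch; the real service classes and constructor signatures differ):

```python
# Hypothetical sketch of factory wiring without a DI framework.
# Class names mirror the architecture diagram; internals are placeholders.
from dataclasses import dataclass


class ContextStore:
    """Stands in for the SQLite-backed context store."""


class SearchEngine:
    """Stands in for the semantic search service."""

    def __init__(self, store: ContextStore) -> None:
        self.store = store


@dataclass
class App:
    store: ContextStore
    search: SearchEngine


def create_app() -> App:
    store = ContextStore()
    return App(store=store, search=SearchEngine(store))
```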

## Development

```bash
git clone https://github.com/mrosata/codetex-mcp.git
cd codetex-mcp

# Install dependencies (including dev)
uv sync

# Run tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=codetex_mcp

# Lint and format
uv run ruff check src/ tests/
uv run ruff format src/ tests/

# Type check
uv run mypy src/
```

## Releasing

Releases are automated via GitHub Actions and [python-semantic-release](https://python-semantic-release.readthedocs.io/). Version bumps are driven by **conventional commit messages** on `main`.

### Commit message format

| Prefix | Effect | Example |
|--------|--------|---------|
| `fix: ...` | Patch bump (0.1.0 → 0.1.1) | `fix: handle missing gitignore` |
| `feat: ...` | Minor bump (0.1.0 → 0.2.0) | `feat: add Ruby tree-sitter support` |
| `feat!: ...` | Major bump (0.1.0 → 1.0.0) | `feat!: redesign context API` |
| `docs:`, `chore:`, `ci:`, `test:`, `refactor:` | No release | `docs: update README` |

A `BREAKING CHANGE:` line in the commit body also triggers a major bump.
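For example, a full commit message that forces a major bump (the change described is illustrative):

```
feat!: redesign context API

BREAKING CHANGE: `codetex context` now emits structured JSON
instead of plain text.
```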

### How it works

1. Push or merge a PR to `main`
2. CI runs lint, type check, and tests
3. The release workflow analyzes commits since the last tag
4. If a version bump is needed, it:
   - Updates the version in `pyproject.toml`
   - Creates a git tag (e.g., `v0.2.0`)
   - Publishes a GitHub Release with a changelog
   - Builds and publishes the package to PyPI

### Manual release (not recommended)

If you need to release without the automation:

```bash
uv build
uv publish
```

## License

MIT
