Metadata-Version: 2.3
Name: okb
Version: 2.3.3
Summary: Personal knowledge base with semantic search for LLMs
Requires-Python: >=3.11
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Provides-Extra: all
Provides-Extra: docx
Provides-Extra: llm
Provides-Extra: llm-bedrock
Provides-Extra: pdf
Provides-Extra: slack
Provides-Extra: todoist
Provides-Extra: web
Requires-Dist: PyGithub (>=2.0.0)
Requires-Dist: anthropic (>=0.40.0) ; extra == "all"
Requires-Dist: anthropic (>=0.40.0) ; extra == "llm"
Requires-Dist: anthropic (>=0.40.0) ; extra == "llm-bedrock"
Requires-Dist: boto3 (>=1.28.0) ; extra == "llm-bedrock"
Requires-Dist: botocore (>=1.31.0) ; extra == "llm-bedrock"
Requires-Dist: click (>=8.0.0)
Requires-Dist: dateparser (>=1.1.0)
Requires-Dist: dropbox (>=12.0.0)
Requires-Dist: einops (>=0.7.0)
Requires-Dist: mcp (>=1.0.0)
Requires-Dist: modal (>=1.0.0)
Requires-Dist: pgvector (>=0.2.0)
Requires-Dist: psycopg[binary] (>=3.1.0)
Requires-Dist: pymupdf (>=1.23.0) ; extra == "all"
Requires-Dist: pymupdf (>=1.23.0) ; extra == "pdf"
Requires-Dist: python-docx (>=1.1.0) ; extra == "all"
Requires-Dist: python-docx (>=1.1.0) ; extra == "docx"
Requires-Dist: pyyaml (>=6.0)
Requires-Dist: requests (>=2.28.0) ; extra == "all"
Requires-Dist: requests (>=2.28.0) ; extra == "slack"
Requires-Dist: sentence-transformers (>=2.2.0)
Requires-Dist: todoist-api-python (>=3.0.0) ; extra == "all"
Requires-Dist: todoist-api-python (>=3.0.0) ; extra == "todoist"
Requires-Dist: trafilatura (>=1.6.0) ; extra == "all"
Requires-Dist: trafilatura (>=1.6.0) ; extra == "web"
Requires-Dist: watchdog (>=3.0.0)
Requires-Dist: yoyo-migrations (>=8.0.0)
Project-URL: Homepage, https://github.com/username/okb
Project-URL: Issues, https://github.com/username/okb/issues
Project-URL: Repository, https://github.com/username/okb
Description-Content-Type: text/markdown

# Owned Knowledge Base (OKB)

A local-first semantic search system for personal documents with Claude Code integration via MCP.

## Installation

With pipx (preferred):
```bash
pipx install okb
```

Or with pip:
```bash
pip install okb
```

## Quick Start

```bash
# 1. Start the database
okb db start

# 2. (Optional) Deploy Modal embedder for faster batch ingestion
okb modal deploy

# 3. Ingest your documents
okb ingest ~/notes ~/docs

# 4. Configure Claude Code MCP (see below)
```

## CLI Commands

| Command | Description |
|---------|-------------|
| `okb db start` | Start pgvector database container |
| `okb db stop` | Stop database container |
| `okb db status` | Show database status |
| `okb db migrate [name]` | Apply pending migrations (optionally for a specific database) |
| `okb db list` | List configured databases |
| `okb db destroy` | Remove container and volume (destructive) |
| `okb db snapshot save [name]` | Create database snapshot (default: timestamp) |
| `okb db snapshot list` | List available snapshots |
| `okb db snapshot restore <name>` | Restore from snapshot (creates pre-restore backup) |
| `okb db snapshot restore <name> --no-backup` | Restore without pre-restore backup |
| `okb db snapshot delete <name>` | Delete a snapshot |
| `okb ingest <paths>` | Ingest documents into knowledge base |
| `okb ingest <paths> --local` | Ingest using local GPU/CPU embedding (no Modal) |
| `okb serve` | Start MCP server (stdio, for Claude Code) |
| `okb serve --http` | Start HTTP MCP server with token auth |
| `okb watch <paths>` | Watch directories for changes |
| `okb config init` | Create default config file |
| `okb config show` | Show current configuration |
| `okb config path` | Print config file path |
| `okb modal deploy` | Deploy GPU embedder to Modal |
| `okb token create` | Create API token for HTTP server |
| `okb token list` | List tokens for a database |
| `okb token revoke [TOKEN] --id <n>` | Revoke token by full value or ID |
| `okb sync list` | List available API sources (plugins) |
| `okb sync list-projects <source>` | List projects from source (for config) |
| `okb sync run <sources>` | Sync data from external APIs |
| `okb sync auth <source>` | Interactive OAuth setup (e.g., dropbox-paper) |
| `okb sync status` | Show last sync times |
| `okb rescan` | Check indexed files for changes and re-ingest stale ones |
| `okb rescan --dry-run` | Show what would change without executing |
| `okb rescan --delete` | Also remove documents for missing files |
| `okb llm status` | Show LLM config and connectivity |
| `okb llm deploy` | Deploy Modal LLM for open model inference |
| `okb llm clear-cache` | Clear LLM response cache |
| `okb synthesize run` | Generate knowledge synthesis proposals |
| `okb synthesize run --dry-run` | Preview what would be sampled |
| `okb synthesize pending` | List pending synthesis proposals |
| `okb synthesize approve <id>` | Approve a synthesis proposal |
| `okb synthesize reject <id>` | Reject a synthesis proposal |
| `okb synthesize review` | Interactive review loop (A/E/R/S/Q) |
| `okb synthesize analyze` | Analyze database and update description/topics |
| `okb synthesize analyze --stats-only` | Show stats without LLM call |
| `okb schedule add <source> <interval>` | Schedule periodic sync via systemd timer |
| `okb schedule remove <source>` | Remove a scheduled sync timer |
| `okb schedule list` | List all active sync timers |
| `okb service install` | Install systemd user services for background operation |
| `okb service uninstall` | Remove systemd user services |
| `okb service status` | Show service status |
| `okb service start` | Start okb services |
| `okb service stop` | Stop okb services |
| `okb service restart` | Restart services (use after upgrading okb) |
| `okb service logs [-f]` | Show service logs (optionally follow) |


## Configuration

Configuration is loaded from `~/.config/okb/config.yaml` (or `$XDG_CONFIG_HOME/okb/config.yaml`).

Create default config:
```bash
okb config init
```

Example config:
```yaml
databases:
  personal:
    url: postgresql://knowledge:localdev@localhost:5433/personal_kb
    default: true    # Used when --db not specified (only one can be default)
    managed: true    # okb manages via Docker
  work:
    url: postgresql://knowledge:localdev@localhost:5433/work_kb
    managed: true

docker:
  port: 5433
  container_name: okb-pgvector

chunking:
  chunk_size: 512
  chunk_overlap: 64
```

Use `--db <name>` to target a specific database with any command.

Environment variables override config file settings:
- `OKB_DATABASE_URL` - Database connection string
- `OKB_DOCKER_PORT` - Docker port mapping
- `OKB_CONTAINER_NAME` - Docker container name
- `OKB_SERVER_URL` - Remote server URL (overrides default server)
- `OKB_TOKEN` - Remote server token (overrides default server)

**Config file permissions**: Config files must be mode 0600 (not accessible by group/other)
since they may contain secrets. OKB checks the mode on load and raises an error if the file is too permissive.
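
A minimal sketch of what such a check could look like, assuming it is a plain test of the mode bits. `is_too_open` is a hypothetical helper for illustration, not OKB's actual code:

```python
import os
import stat
import tempfile


def is_too_open(path: str) -> bool:
    """Return True if group or other have any permission bits set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


# Demonstrate on a throwaway file rather than a real config file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name

os.chmod(demo_path, 0o644)
open_before = is_too_open(demo_path)  # group/other can read
os.chmod(demo_path, 0o600)
open_after = is_too_open(demo_path)   # owner-only
os.remove(demo_path)
```

To fix an existing file, run `chmod 600 ~/.config/okb/config.yaml`.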

### Project-Local Config

Override global config per-project with `.okbconf.yaml` (searched from CWD upward):

```yaml
# .okbconf.yaml
default_database: work  # Use 'work' db in this project

extensions:
  skip_directories:     # Extends global list
    - test_fixtures
```

Merge rules: scalars replace the global value, lists extend the global list, and dicts are deep-merged.

### Remote Servers (Client Mode)

Connect to remote OKB HTTP servers:

```yaml
servers:
  personal:
    url: http://localhost:8080/mcp
    token: ${OKB_PERSONAL_TOKEN}
    default: true
  work:
    url: http://work-host:8080/mcp
    token: ${OKB_WORK_TOKEN}
```

Only one server can be `default: true`. If none is marked, the first is used.
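
As a rough sketch of that selection rule (the function name and error messages are hypothetical, and OKB may report the error differently):

```python
def pick_default_server(servers: dict[str, dict]) -> str:
    """Return the name of the default server.

    Exactly one entry may set `default: true`; with none set, the first
    configured entry wins (dicts preserve insertion order).
    """
    if not servers:
        raise ValueError("no servers configured")
    flagged = [name for name, cfg in servers.items() if cfg.get("default")]
    if len(flagged) > 1:
        raise ValueError(f"multiple default servers: {', '.join(flagged)}")
    return flagged[0] if flagged else next(iter(servers))


explicit = pick_default_server({
    "personal": {"url": "http://localhost:8080/mcp"},
    "work": {"url": "http://work-host:8080/mcp", "default": True},
})
fallback = pick_default_server({
    "personal": {"url": "http://localhost:8080/mcp"},
    "work": {"url": "http://work-host:8080/mcp"},
})
```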

Local config can override the default server per-project:
```yaml
# .okbconf.yaml
default_server: work
```

### Per-Database Source Overrides

A database can override global plugin source configs. Each override replaces that source's config entirely; nothing is merged with the global entry:

```yaml
databases:
  work:
    url: postgresql://...
    managed: true
    sources:
      github:
        enabled: true
        token: ${WORK_GITHUB_TOKEN}
      todoist:
        enabled: false
```
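
"Full replacement per source" behaves like a shallow dict update, in contrast to the deep merge used for project-local config. A small sketch with a hypothetical helper and made-up values:

```python
def effective_sources(global_sources: dict, db_sources: dict) -> dict:
    """Per-source replacement: a database entry replaces the whole global entry."""
    return {**global_sources, **db_sources}


global_sources = {
    "github": {"enabled": True, "token": "global-token", "repos": ["owner/repo1"]},
    "todoist": {"enabled": True, "token": "todoist-token"},
}
work_sources = {
    "github": {"enabled": True, "token": "work-token"},
    "todoist": {"enabled": False},
}
effective = effective_sources(global_sources, work_sources)
```

Note that the `repos` key from the global `github` entry is not carried over, so an override has to restate every setting it needs.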

### LLM Integration (Optional)

Enable LLM-based document classification, filtering, and synthesis:

```yaml
llm:
  provider: claude          # "claude", "modal", or null (disabled)
  model: claude-haiku-4-5-20251001
  timeout: 30
  cache_responses: true
```

**Providers:**
| Provider | Setup | Cost |
|----------|-------|------|
| `claude` | `export ANTHROPIC_API_KEY=...` | ~$0.25/1M tokens |
| `modal` | `okb llm deploy` | ~$0.02/min GPU |

**Modal LLM Setup** (no API key needed, runs on Modal's GPUs):

```yaml
llm:
  provider: modal
  model: microsoft/Phi-3-mini-4k-instruct  # Recommended: no gating
```

Non-gated models (work immediately):
- `microsoft/Phi-3-mini-4k-instruct` - Good quality, 4K context
- `Qwen/Qwen2-1.5B-Instruct` - Smaller/faster

Gated models (require HuggingFace approval + token):
- `meta-llama/Llama-3.2-3B-Instruct` - Requires accepting license at HuggingFace
- Setup: `modal secret create huggingface HF_TOKEN=hf_...`

Deploy after configuring:
```bash
okb llm deploy
```

**Pre-ingest filtering** - skip low-value content during sync:
```yaml
plugins:
  sources:
    dropbox-paper:
      llm_filter:
        enabled: true
        prompt: "Skip meeting notes and drafts"
        action_on_skip: discard  # or "archive"
```

### Knowledge Synthesis

LLM-based synthesis generates topic summaries, entity profiles, and cross-cutting insights
from your knowledge base:

```bash
okb synthesize run                        # Generate synthesis proposals
okb synthesize run --project myproject    # Scope to specific project
okb synthesize run --max-proposals 5      # Limit proposals
okb synthesize run --dry-run              # Preview what would be sampled
okb synthesize pending                    # List pending proposals
okb synthesize approve <id>              # Approve → creates searchable document
okb synthesize reject <id>               # Reject proposal
okb synthesize review                    # Interactive review (A/E/R/S/Q)
```

Analyze the knowledge base to generate/update description and topics:
```bash
okb synthesize analyze                   # Analyze and update metadata
okb synthesize analyze --stats-only      # Show stats without LLM call
okb synthesize analyze --project myproj  # Analyze specific project
```

LLM management commands:
```bash
okb llm status              # Show config and connectivity
okb llm deploy              # Deploy Modal LLM (for provider: modal)
okb llm clear-cache         # Clear response cache
```

## Claude Code MCP Config

### stdio mode (default)

Add to your Claude Code MCP configuration:

```json
{
  "mcpServers": {
    "knowledge-base": {
      "command": "okb",
      "args": ["serve"]
    }
  }
}
```

### HTTP mode (for remote/shared servers)

First, start the HTTP server and create a token:

```bash
# Create a token
okb token create --db default -d "Claude Code"
# Output: okb_default_rw_a1b2c3d4e5f6g7h8

# Start HTTP server
okb serve --http --host 0.0.0.0 --port 8080
```

The server supports two transports:

**Streamable HTTP** (primary, RFC 9728 compliant):
- `POST /mcp` - Send JSON-RPC messages, receive SSE response
- `GET /mcp` - Establish SSE connection for server notifications
- `DELETE /mcp` - Terminate session
- `/sse` is an alias for `/mcp`

**Legacy SSE** (for older MCP clients):
- `GET /legacy/sse` - Establish SSE stream
- `POST /legacy/messages` - Send JSON-RPC messages

Configure your MCP client to connect:

```json
{
  "mcpServers": {
    "knowledge-base": {
      "type": "sse",
      "url": "http://localhost:8080/mcp",
      "headers": {
        "Authorization": "Bearer okb_default_rw_a1b2c3d4e5f6g7h8"
      }
    }
  }
}
```
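
For a client written by hand, a request to the Streamable HTTP endpoint could be built like this. It is a sketch using only the standard library: the `initialize` payload follows the general MCP JSON-RPC shape, and the protocol version string and client name are assumptions that the server may not accept.

```python
import json
import urllib.request


def build_initialize_request(url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC `initialize` POST for /mcp."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",  # assumed; the server negotiates this
            "capabilities": {},
            "clientInfo": {"name": "example-client", "version": "0.1"},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            # The response may be plain JSON or an SSE stream.
            "Accept": "application/json, text/event-stream",
        },
    )


request = build_initialize_request(
    "http://localhost:8080/mcp", "okb_default_rw_a1b2c3d4e5f6g7h8"
)
# To send it (requires a running server):
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode())
```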

## MCP Tools Available to the LLM

| Tool | Purpose |
|------|---------|
| `search_knowledge` | Semantic search with natural language queries |
| `keyword_search` | Exact keyword/symbol matching |
| `hybrid_search` | Combined semantic + keyword (RRF fusion) |
| `get_document` | Retrieve full document by path |
| `list_sources` | Show indexed document stats |
| `list_projects` | List known projects |
| `list_documents_by_project` | List all documents for a specific project |
| `get_project_stats` | List projects with document counts |
| `rename_project` | Rename a project across all documents |
| `set_document_project` | Set/clear project for a single document |
| `recent_documents` | Show recently indexed files |
| `save_knowledge` | Save knowledge from Claude (`source_type`: `claude-note` or `synthesis`) |
| `update_knowledge` | Update an existing document in-place |
| `delete_knowledge` | Delete any document by source path |
| `get_actionable_items` | Query tasks/events with structured filters |
| `get_database_info` | Get database description, topics, and stats |
| `set_database_description` | Update database description/topics (LLM can self-document) |
| `add_todo` | Create a TODO item in the knowledge base |
| `trigger_sync` | Sync API sources (runs in background, use `list_sync_sources` to check) |
| `trigger_rescan` | Check indexed files for changes and re-ingest (background) |
| `list_sync_sources` | List API sync sources with status (idle/running/error) |
| `ingest_documents` | Ingest pre-parsed documents (chunking, embedding, storage) |
| `analyze_knowledge_base` | Analyze content and generate description/topics |
| `get_synthesis_samples` | Get document samples and stats for LLM-driven synthesis |
| `synthesize_knowledge` | Analyze DB and propose synthetic knowledge documents |
| `list_pending_synthesis` | List pending synthesis proposals |
| `approve_synthesis` | Approve a proposal, creating a searchable document |
| `reject_synthesis` | Reject a pending synthesis proposal |
| `edit_pending_synthesis` | Edit a proposal before approve/reject |
| `save_snapshot` | Create database snapshot for backup |
| `list_snapshots` | List available database snapshots |
| `restore_snapshot` | Restore database from snapshot |
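
`hybrid_search` combines its two result lists with reciprocal rank fusion (RRF). The sketch below shows the general technique only; the constant `k = 60` is the conventional default, and neither it nor the function name is taken from OKB's code.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists from the semantic and keyword searches.
semantic = ["a.md", "b.md", "c.md"]
keyword = ["b.md", "d.md", "a.md"]
fused = reciprocal_rank_fusion([semantic, keyword])
```

Documents that rank well in both lists (`b.md` and `a.md` here) end up ahead of documents that appear in only one.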

## Claude.ai Integration (OAuth Shim)

Claude.ai requires OAuth 2.1 for MCP server connections. The `oauth/` directory contains a
Cloudflare Worker that bridges OAuth 2.1 to OKB's bearer token auth:

```
Claude.ai ──OAuth 2.1──▶ Cloudflare Worker ──Bearer token──▶ OKB HTTP server
                              │
                         GitHub login → maps to pre-existing OKB token
```

See [`oauth/README.md`](oauth/README.md) for setup instructions.

## Contextual Chunking

Documents are chunked with context for better retrieval:

```
Document: Django Performance Notes
Project: student-app          ← inferred from path or frontmatter
Section: Query Optimization   ← extracted from markdown headers
Topics: django, performance   ← from frontmatter tags
Content: Use `select_related()` to avoid N+1 queries...
```

### Frontmatter Example

```markdown
---
tags: [django, postgresql, performance]
project: student-app
category: backend
---

# Your Document Title

Content here...
```

## Plugin System

OKB supports plugins for custom file parsers and API data sources (GitHub, Todoist, etc).

### Creating a Plugin

```python
# File parser plugin
from okb.plugins import FileParser, Document

class EpubParser:
    extensions = ['.epub']
    source_type = 'epub'

    def can_parse(self, path):
        # `path` must expose `.suffix`, e.g. a pathlib.Path
        return path.suffix.lower() == '.epub'

    def parse(self, path, extra_metadata=None) -> Document:
        ...

# API source plugin
from okb.plugins import APISource, SyncState, Document

class GitHubSource:
    name = 'github'
    source_type = 'github-issue'

    def configure(self, config):
        ...

    def fetch(self, state: SyncState | None) -> tuple[list[Document], SyncState]:
        # Takes the previous sync state (or None) and returns
        # the fetched documents together with the new state.
        ...
```

### Registering Plugins

In your plugin's `pyproject.toml`:
```toml
[project.entry-points."okb.parsers"]
epub = "okb_epub:EpubParser"

[project.entry-points."okb.sources"]
github = "okb_github:GitHubSource"
```

### Configuring API Sources

```yaml
# ~/.config/okb/config.yaml
plugins:
  sources:
    github:
      enabled: true
      token: ${GITHUB_TOKEN}  # Resolved from environment
      repos: [owner/repo1, owner/repo2]
    todoist:
      enabled: true
      token: ${TODOIST_TOKEN}
      include_completed: false     # Sync completed tasks
      completed_days: 30           # Days of completed history
      include_comments: false      # Include task comments (1 API call per task)
      project_filter: []           # List of project IDs (find them with: okb sync list-projects todoist)
    dropbox-paper:
      enabled: true
      # Option 1: Refresh token (recommended, auto-refreshes)
      app_key: ${DROPBOX_APP_KEY}
      app_secret: ${DROPBOX_APP_SECRET}
      refresh_token: ${DROPBOX_REFRESH_TOKEN}
      # Option 2: Access token (short-lived, expires after ~4 hours)
      # token: ${DROPBOX_TOKEN}
      folders: [/]            # Optional: filter to specific folders
```
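
The `${VAR}` placeholders are resolved from the environment. A minimal sketch of that substitution, assuming unset variables are left untouched (OKB may instead raise an error); `resolve_env` is a hypothetical name:

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(value, env=None):
    """Recursively replace ${NAME} in strings with values from `env`."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        # Unknown variables are kept as-is (an assumption of this sketch).
        return _VAR.sub(lambda m: env.get(m.group(1), m.group(0)), value)
    if isinstance(value, dict):
        return {key: resolve_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env(item, env) for item in value]
    return value


raw = {"github": {"enabled": True, "token": "${GITHUB_TOKEN}", "repos": ["owner/repo1"]}}
resolved = resolve_env(raw, env={"GITHUB_TOKEN": "ghp_example"})
```

Keeping secrets in environment variables means the config file itself never has to contain them.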

**Dropbox Paper OAuth Setup:**
```bash
okb sync auth dropbox-paper
```
This interactive command walks you through obtaining a refresh token from Dropbox.

## License

MIT

