🚀 Ragdex Architecture

Complete Technical Documentation - Transform Your Documents & Emails into an AI-Powered Knowledge Base

✅ FULLY OPERATIONAL - Package structure, 17 tools, 5 prompts, 4 resources, 768-dim embeddings, Email support
v0.2.0 | MCP Compatible | Python 3.10-3.12 | ARM64 Ready | 100% Local

  • MCP tools: 17
  • Prompt templates: 5
  • Dynamic resources: 4
  • Embedding dimensions: 768
  • File formats: 10+
  • Local processing: 100%

System Overview

Ragdex (formerly Personal Document Library MCP Server) is a Model Context Protocol (MCP) server that lets Claude Desktop search and analyze a personal collection of documents and emails through retrieval-augmented generation (RAG). It implements the full MCP protocol (tools, prompts, and resources), supports multiple document formats, and filters marketing, promotional, and spam messages out of indexed email.

Data flows through five components, in order:

  • 📚 Documents & Emails: PDFs, Word, EPUB, MOBI, EMLX, OLM
  • ⚙️ Ragdex Indexer: background processing and OCR
  • 🗄️ ChromaDB: 768-dim vector storage
  • 🔌 MCP Server: 17 tools + 5 prompts + 4 resources
  • 🤖 Claude Desktop: AI assistant interface

Package Architecture

📦 Python Package Structure NEW

Properly structured as personal_doc_library package with pyproject.toml configuration

  • CLI commands: ragdex, ragdex-mcp, ragdex-index, ragdex-web
  • Package installation: pip install ragdex
  • Development: pip install -e .
  • With extras: pip install -e ".[document-processing,services]"

🔌 MCP Complete Server

Full Model Context Protocol implementation with lazy initialization for < 1s startup

  • Location: src/personal_doc_library/servers/mcp_complete_server.py
  • 17 specialized tools for document interaction
  • 5 prompt templates for workflows
  • 4 dynamic resources for real-time info
  • Async request handling with error recovery
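
The lazy initialization mentioned above can be pictured with a minimal sketch. It is not the server's actual code: `LazyResource` and `load_rag_engine` are hypothetical names, and the loader is a stand-in for whatever the real server does when it loads the embedding model and connects to the vector store.

```python
import threading
import time


class LazyResource:
    """Defer an expensive constructor until the first time it is needed."""

    def __init__(self, factory):
        self._factory = factory      # callable that builds the heavy object
        self._value = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self):
        # Double-checked locking: only the first caller pays the load cost.
        if not self._loaded:
            with self._lock:
                if not self._loaded:
                    self._value = self._factory()
                    self._loaded = True
        return self._value

    @property
    def loaded(self):
        return self._loaded


def load_rag_engine():
    # Hypothetical stand-in for loading the model and opening the database.
    time.sleep(0.01)
    return {"model": "sentence-transformers/all-mpnet-base-v2", "dim": 768}


rag = LazyResource(load_rag_engine)   # cheap: nothing is loaded yet
```

Constructing `rag` costs almost nothing, so the server can answer the MCP handshake immediately; the first tool call that invokes `rag.get()` triggers the slow load, and later calls reuse the same object.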

🧠 SharedRAG Engine

Core retrieval-augmented generation system with vector embeddings

  • Model: sentence-transformers/all-mpnet-base-v2
  • 768-dimensional embeddings
  • Semantic search with synthesis
  • Cross-document analysis
  • MD5 hash-based deduplication
  • 30-minute lock timeout protection
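
As an illustration of MD5 hash-based change detection, the sketch below hashes a file in blocks and compares the digest with the one stored at the last indexing run. The function names and the `index` dictionary (file path to digest) are assumptions made for this example, not the package's real data structures.

```python
import hashlib
from pathlib import Path


def file_md5(path, block_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_indexing(path, index):
    """True if the file is new or its content changed since it was indexed.

    `index` is a hypothetical dict mapping file path -> last indexed MD5.
    """
    return index.get(str(Path(path))) != file_md5(path)
```

Hashing content rather than comparing modification times means a file that was merely touched or copied is not re-embedded, while any real edit changes the digest and triggers reindexing.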

📊 ChromaDB Vector Store

High-performance vector database for document embeddings

  • Persistent storage with WAL mode
  • Metadata filtering support
  • Cosine similarity search
  • Storage: ~1MB per 100 pages
  • Telemetry disabled: CHROMA_TELEMETRY=false
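
To make the cosine similarity search concrete, here is a small pure-Python sketch of the scoring that a vector store performs conceptually. It does not use the ChromaDB API; `cosine_similarity`, `top_k`, and the toy two-dimensional vectors are illustrative assumptions (real embeddings here have 768 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=3):
    """Return the k (score, doc_id) pairs most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in vectors.items()]
    return sorted(scored, reverse=True)[:k]
```

A real vector database avoids this linear scan by using an approximate nearest-neighbor index, but the ranking criterion is the same.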

📁 Document Loaders

Comprehensive file format support with intelligent processing

  • PDF with OCR (ocrmypdf integration)
  • Word/PowerPoint/Excel (LibreOffice)
  • EPUB (ebooklib + pandoc)
  • MOBI/AZW/AZW3 (Calibre ebook-convert)
  • Apple Mail EMLX (native parser)
  • Outlook OLM (archive extraction)

Background Services

Automatic monitoring and indexing services

  • Index Monitor with file watching
  • LaunchAgent service support (macOS)
  • Web Dashboard at localhost:8888
  • Progress tracking and status updates
  • Failed document management
  • Automatic retry with backoff
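
Automatic retry with backoff can be sketched as follows. The attempt count, the base delay, and the doubling schedule are assumptions chosen for illustration; the document does not state the values the indexer uses.

```python
import time


def retry_with_backoff(task, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call task(); after each failure wait base_delay * 2**attempt seconds."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the error
            sleep(base_delay * (2 ** attempt))
```

Passing `sleep` as a parameter keeps the helper testable: a test can record the requested delays instead of actually waiting.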

MCP Protocol Implementation

🔧 Tools (17 Available)

Search & Discovery

search
Semantic search with optional book/email filtering and AI synthesis. Returns chunks with relevance scores.
list_books
List documents by pattern (fnmatch), author, or directory. Includes metadata like chunks and page counts.
recent_books
Find recently indexed documents by hours or days. Groups by time periods (today, yesterday, this week).
find_practices
Discover specific techniques and practices. Uses category filtering for practice-focused content.
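
The pattern matching that `list_books` is described as using can be illustrated with Python's standard `fnmatch` module. The record fields (`name`, `author`) and the `filter_books` helper are hypothetical; only the use of shell-style wildcards comes from the description above.

```python
from fnmatch import fnmatch


def filter_books(books, pattern="*", author=None):
    """Keep books whose name matches a shell-style pattern and, optionally,
    whose author field contains the given text (both case-insensitive)."""
    matches = []
    for book in books:
        if not fnmatch(book["name"].lower(), pattern.lower()):
            continue
        if author and author.lower() not in book.get("author", "").lower():
            continue
        matches.append(book)
    return matches
```

For example, the pattern `*yoga*` would match any document with "yoga" anywhere in its name, and `*.epub` would restrict the listing to EPUB files.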

Content Extraction

extract_pages
Extract specific pages from PDFs. Supports ranges (1-5), lists [1,3,5], or single pages.
extract_quotes
Find notable quotes on topics. Uses regex patterns for quote detection with attribution.
summarize_book
Generate AI summaries of documents. Options for brief (10-15 chunks) or detailed (all chunks).
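
The three page formats accepted by `extract_pages` (a range, a list, or a single page) could be normalized along these lines. This is a sketch under assumptions: `parse_pages` is a hypothetical helper, and accepting mixed strings such as "2,4-6" is an extension not stated in the tool description.

```python
def parse_pages(spec):
    """Normalize a page spec into a sorted list of unique page numbers.

    Accepts an int (3), a list ([1, 3, 5]), or a string like "1-5" or "2,4-6".
    """
    if isinstance(spec, int):
        return [spec]
    if isinstance(spec, (list, tuple)):
        return sorted({int(page) for page in spec})
    pages = set()
    for part in str(spec).split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(value) for value in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid page range: {part}")
            pages.update(range(start, end + 1))   # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)
```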

Analysis & Synthesis

compare_perspectives
Compare viewpoints across multiple sources. Creates comparison matrices with similarities and differences.
question_answer
Direct Q&A from library. Analyzes question type (factual/conceptual/practical) for optimization.
daily_reading
Suggested daily passages. Maintains reading history to avoid repetition.

System Management

library_stats
Comprehensive statistics: document count, chunks, storage size, indexing status.
index_status
Detailed indexing progress with PID checking and stale lock detection.
refresh_cache
Refresh search cache and reload book index. Forces reconnection to ChromaDB.
warmup
Initialize RAG system to prevent timeouts. Pre-loads the embedding model (~4GB of RAM).
find_unindexed
Find unindexed documents. Categorizes as New, Modified, or Failed.
reindex_book
Force reindex specific document. Full cleanup before reprocessing.
clear_failed
Clear failed document list. Creates backup before clearing.
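
The PID checking and stale lock detection that `index_status` reports can be sketched like this for POSIX systems such as macOS. The helper names are hypothetical and the real lock file format is not shown in this document; only the 30-minute timeout is taken from it.

```python
import os
import time

LOCK_TIMEOUT_SECONDS = 30 * 60  # 30-minute timeout stated in this document


def pid_is_running(pid):
    """Best-effort liveness check on POSIX: signal 0 only tests existence."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


def lock_is_stale(lock_pid, lock_created_at, now=None, is_running=pid_is_running):
    """A lock is stale if its owner is gone or it is older than the timeout."""
    now = time.time() if now is None else now
    if not is_running(lock_pid):
        return True
    return (now - lock_created_at) > LOCK_TIMEOUT_SECONDS
```

Checking the PID first lets a crashed indexer's lock be reclaimed immediately, while the age check covers the case where the process is alive but hung.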

📝 Prompts (5 Templates) NEW

analyze_theme
Analyze a theme across your entire library with cross-referencing
compare_authors
Compare writing styles, perspectives, and approaches of different authors
extract_practices
Extract practical techniques, exercises, and actionable insights
research_topic
Deep research on a specific topic with comprehensive synthesis
daily_wisdom
Get daily wisdom passages with reflection questions

📊 Resources (4 Dynamic) NEW

library://stats
Real-time library statistics and metrics (document count, storage size, etc.)
library://recent
Recently indexed documents with timestamps and metadata
library://search-tips
Search tips, query examples, and usage patterns
library://config
Current configuration settings and environment variables

File Structure & Organization

Package Structure v0.2.0

  • src/personal_doc_library/ - Main package directory
    • __init__.py - Package initialization
    • cli.py - Command-line interface (ragdex command)
    • core/ - Core functionality
      • config.py - Configuration management with env vars
      • shared_rag.py - RAG implementation with singleton pattern
      • logging_config.py - Centralized logging setup
      • timeout_handler.py - 15-minute timeout protection
    • servers/ - MCP server implementation
      • mcp_complete_server.py - Main MCP server with all protocols
    • indexing/ - Document indexing
      • index_monitor.py - Background monitoring service
      • execute_indexing.py - Core indexing logic
      • complete_indexing.py - Full indexing workflow
      • handle_large_pdf.py - Large PDF processing
      • manage_failed_pdfs.py - Failed document management
    • loaders/ - Email loaders (new in v0.2.0)
      • email_loaders.py - Base email loader with filtering
      • emlx_loader.py - Apple Mail EMLX support
      • outlook_loader.py - Outlook OLM support
    • monitoring/ - Web interface
      • monitor_web_enhanced.py - Flask dashboard (localhost:8888)
    • utils/ - Utility modules
      • check_indexing_status.py - Status checking
      • find_unindexed.py - Find new documents
      • ocr_manager.py - OCR processing manager
      • show_config.py - Display configuration

Configuration Files

  • pyproject.toml - Package configuration with dependencies and entry points
  • requirements.txt - Direct dependencies list
  • MANIFEST.in - Package distribution configuration
  • .gitignore - Git ignore patterns

Data Directories

  • books/ or $PERSONAL_LIBRARY_DOC_PATH - Document library location
  • chroma_db/ or $PERSONAL_LIBRARY_DB_PATH - Vector database (55MB for 68 books)
  • logs/ or $PERSONAL_LIBRARY_LOGS_PATH - Application and service logs
  • venv_mcp/ - Python 3.12 virtual environment (ARM64 compatible)

Service Scripts

  • scripts/run.sh - Main runner for MCP server
  • scripts/install_service.sh - Install as LaunchAgent service
  • scripts/index_monitor.sh - Start background monitor
  • scripts/start_web_monitor.sh - Start web dashboard
  • install_ragdex_services.sh - One-click service installer

Installation & Configuration

🚀 Quick Start (PyPI)

# Using uv (recommended, faster)
uv venv ~/ragdex_env
source ~/ragdex_env/bin/activate
uv pip install ragdex

# Or standard pip
python3 -m venv ~/ragdex_env
source ~/ragdex_env/bin/activate
pip install ragdex

# With optional extras (quoted so zsh does not expand the brackets)
pip install "ragdex[document-processing,services]"

📦 Development Installation

# Clone repository
git clone https://github.com/hpoliset/ragdex
cd ragdex

# Install in editable mode
pip install -e .

# Or with extras
pip install -e ".[document-processing,services]"

# Use CLI commands
ragdex --help
ragdex-mcp          # Start MCP server
ragdex-index        # Start indexer
ragdex-web          # Start web dashboard

⚙️ Claude Desktop Configuration

Location: ~/Library/Application Support/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "ragdex": {
      "command": "/Users/YOUR_USERNAME/ragdex_env/bin/ragdex-mcp",
      "env": {
        "PYTHONUNBUFFERED": "1",
        "CHROMA_TELEMETRY": "false",
        "PERSONAL_LIBRARY_DOC_PATH": "/path/to/documents",
        "PERSONAL_LIBRARY_DB_PATH": "/path/to/database",
        "PERSONAL_LIBRARY_LOGS_PATH": "/path/to/logs"
      }
    }
  }
}

🔧 Environment Variables

# Core paths
export PERSONAL_LIBRARY_DOC_PATH="/path/to/documents"
export PERSONAL_LIBRARY_DB_PATH="/path/to/database"
export PERSONAL_LIBRARY_LOGS_PATH="/path/to/logs"

# Email settings (v0.2.0+)
export PERSONAL_LIBRARY_INDEX_EMAILS=true
export PERSONAL_LIBRARY_EMAIL_SOURCES=apple_mail,outlook_local
export PERSONAL_LIBRARY_EMAIL_MAX_AGE_DAYS=365
export PERSONAL_LIBRARY_EMAIL_EXCLUDED_FOLDERS="Spam,Junk,Trash,Deleted Items,Drafts"

# Performance tuning
export INDEXING_MEMORY_LIMIT_GB=8
export CHROMA_TELEMETRY=false
export TOKENIZERS_PARALLELISM=false
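
The variables above might be read on the Python side roughly as follows. This is a sketch, not the contents of `config.py`: the `Settings` class and `load_settings` function are hypothetical, the path defaults mirror the data directories listed earlier, and the defaults for `PERSONAL_LIBRARY_INDEX_EMAILS` (off) and `PERSONAL_LIBRARY_EMAIL_SOURCES` (empty) are assumptions.

```python
import os
from dataclasses import dataclass


def _as_bool(value):
    return value.strip().lower() in {"1", "true", "yes", "on"}


def _as_list(value):
    # Comma-separated; entries may contain spaces ("Deleted Items").
    return [item.strip() for item in value.split(",") if item.strip()]


@dataclass
class Settings:
    doc_path: str
    db_path: str
    logs_path: str
    index_emails: bool
    email_sources: list
    email_max_age_days: int
    excluded_folders: list


def load_settings(env=None):
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    return Settings(
        doc_path=env.get("PERSONAL_LIBRARY_DOC_PATH", "books"),
        db_path=env.get("PERSONAL_LIBRARY_DB_PATH", "chroma_db"),
        logs_path=env.get("PERSONAL_LIBRARY_LOGS_PATH", "logs"),
        index_emails=_as_bool(env.get("PERSONAL_LIBRARY_INDEX_EMAILS", "false")),
        email_sources=_as_list(env.get("PERSONAL_LIBRARY_EMAIL_SOURCES", "")),
        email_max_age_days=int(env.get("PERSONAL_LIBRARY_EMAIL_MAX_AGE_DAYS", "365")),
        excluded_folders=_as_list(env.get(
            "PERSONAL_LIBRARY_EMAIL_EXCLUDED_FOLDERS",
            "Spam,Junk,Trash,Deleted Items,Drafts")),
    )
```

Because folder names such as "Deleted Items" contain a space, the value must be quoted when exported from a shell; the parser itself splits only on commas.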

Technical Implementation Details

Document Processing Pipeline

  • 📄 Discovery: scan for supported formats
  • 🔐 MD5 Hash: check for changes
  • ✂️ Chunking: 1200 chars, 150 overlap
  • 🏷️ Categorize: practice, energy_work, philosophy, general
  • 🧮 Embeddings: 768-dim vectors
  • 💾 Storage: ChromaDB + Index
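
The chunking stage (1200 characters with a 150-character overlap) can be illustrated with a plain sliding window. This is a simplified sketch: the real pipeline may use a smarter splitter that respects sentence or paragraph boundaries, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=1200, overlap=150):
    """Split text into fixed-size chunks where each chunk repeats the last
    `overlap` characters of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break   # this chunk already reaches the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which helps retrieval at the cost of storing some text twice.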

Key Architectural Patterns

Performance Metrics

  • MCP startup time: < 1s
  • First search query: ~1.75s
  • Subsequent searches: 100-500ms
  • Document indexing: 10-20/min
  • OCR processing: 1-2 pages/min
  • Embedding model RAM: ~4GB

Email Filtering Intelligence v0.2.0

Smart Filtering: Automatically excludes marketing, promotional, and spam emails

  • Pattern Matching: Detects unsubscribe links, marketing language, order confirmations
  • Domain Filtering: Excludes known marketing domains (amazon.com, ebay.com, etc.)
  • Folder Exclusion: Skips Spam, Junk, Trash, Deleted Items, Drafts folders
  • Date Filtering: Configurable max age (default 365 days)
  • Whitelist Support: Important senders can bypass filters
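
The filtering rules listed above could combine along the following lines. Everything here is a sketch under assumptions: the email record fields, the marker strings, and the order of checks are hypothetical, and the whitelist is assumed to bypass only the domain and pattern filters, not the folder and age limits.

```python
from datetime import datetime, timedelta

EXCLUDED_FOLDERS = {"spam", "junk", "trash", "deleted items", "drafts"}
MARKETING_DOMAINS = {"amazon.com", "ebay.com"}               # examples named above
MARKETING_MARKERS = ("unsubscribe", "order confirmation")    # hypothetical patterns


def should_index(email, now, max_age_days=365, whitelist=frozenset()):
    """Return True if the email passes folder, age, domain, and pattern filters."""
    sender = email["sender"].lower()
    if email["folder"].lower() in EXCLUDED_FOLDERS:
        return False
    if now - email["date"] > timedelta(days=max_age_days):
        return False
    if sender in whitelist:
        return True   # assumed: whitelisted senders skip the content filters
    domain = sender.rsplit("@", 1)[-1]
    if domain in MARKETING_DOMAINS:
        return False
    body = email["body"].lower()
    return not any(marker in body for marker in MARKETING_MARKERS)


now = datetime(2025, 1, 15)
personal = {"sender": "friend@example.org", "folder": "Inbox",
            "date": datetime(2024, 12, 1), "body": "Lunch on Friday?"}
promo = {"sender": "deals@amazon.com", "folder": "Inbox",
         "date": datetime(2025, 1, 10), "body": "New deals this week"}
```

With these sample records, `should_index(personal, now)` accepts the message and `should_index(promo, now)` rejects it, unless `deals@amazon.com` is passed in `whitelist`.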

Technology Stack

Python 3.10-3.12, LangChain 0.1.0, ChromaDB 0.4.22, Sentence Transformers, PyPDF2, Flask 3.0.0, MCP Protocol, EMLX 1.0.0, Pandoc, Calibre, OCRmyPDF, LibreOffice

Known Issues & Solutions

LaunchAgent Permissions (macOS)

Issue: LaunchAgent services restricted by sandboxing

Solution: Use shell script wrapper (scripts/index_monitor_service.sh)

Details: See docs/LAUNCHAGENT_PERMISSIONS_SOLUTION.md

Common Troubleshooting

  • If indexing finds 0 documents, check CloudDocs (iCloud Drive) permissions
  • For "Empty content" errors, documents may need OCR processing
  • Stale locks automatically cleaned after 30 minutes
  • MOBI files require Calibre's ebook-convert tool
  • Large PDFs (>100MB) may need 2-5 minutes to process

Best Practices

  • Always use venv_mcp/bin/python for consistency
  • Run ragdex warmup before heavy usage to pre-load models
  • Monitor indexing progress via web dashboard at localhost:8888
  • Use ragdex manage-failed to handle problematic documents
  • Set INDEXING_MEMORY_LIMIT_GB for large libraries