# RAG Memory - AI Agent Usage Guide

## System Overview

**IMPORTANT:** Ingestion operations incur cost because they process your content. Query operations are free because they only search existing knowledge.

---

## 1. COLLECTION DISCIPLINE (CRITICAL)

**MUST review collections before ingesting ANY content:**

1. `list_collections()` - See available collections
2. `get_collection_info(name)` - Review purpose, domain, metadata schema
3. Choose collection matching content's domain/topic
4. If no good match: `create_collection()` with clear domain/purpose

**Why:** Collections partition BOTH vector search AND knowledge graph. Poor collection choices degrade knowledge quality and search relevance.

**Never:**
- Dump unrelated content into same collection
- Ignore collection descriptions when choosing where to ingest
- Create collections without clear, focused domain/purpose
- Ingest before reviewing what collections already exist

**Pattern:**
```
list_collections()
  → review purposes/domains
  → choose best fit OR create new
  → ingest_*(collection_name=chosen)
```
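
The selection step in the pattern above can be sketched as a small helper. This is only an illustration: the response shape (a list of dicts with `name` and `description` keys) and the `choose_collection` helper are assumptions, not part of the tool API, and the keyword overlap is a crude stand-in for actually reading each collection's purpose.

```python
from typing import Optional


def choose_collection(collections: list[dict], topic_keywords: set[str]) -> Optional[str]:
    """Return the best-matching collection name, or None if nothing fits."""
    best_name, best_score = None, 0
    for col in collections:
        # Assumed fields: "name" and "description" (hypothetical response shape).
        words = set(col.get("description", "").lower().split())
        score = len(words & topic_keywords)
        if score > best_score:
            best_name, best_score = col["name"], score
    return best_name


# Example data standing in for the output of list_collections().
collections = [
    {"name": "python-docs", "description": "Python language and standard library documentation"},
    {"name": "company-policies", "description": "Internal HR and security policies"},
]

print(choose_collection(collections, {"python", "library"}))
print(choose_collection(collections, {"kubernetes"}))
```

A `None` result corresponds to step 4 of the review list: no existing collection fits, so create a new one with a clear domain and purpose before ingesting.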

---

## 2. SEARCH: USE FULL QUESTIONS (NOT KEYWORDS)

**Semantic search matches MEANING, not exact words.**

- ✅ Good: "How do I configure authentication in the system?"
- ❌ Bad: "authentication configuration"

Applies to: `search_documents`, `query_relationships`, `query_temporal`

---

## 3. INGESTION WORKFLOWS

### Duplicate Detection & Reingest

**AUTOMATIC DUPLICATE DETECTION:**
All ingest tools automatically detect duplicates when using mode='ingest' (default).
If content already exists, you'll receive a clear error with the duplicate's ID and a suggestion to use mode='reingest'.

**How It Works:**
- **ingest_text**: Checks for existing document with same title in collection
- **ingest_file**: Checks for existing file_path metadata in collection
- **ingest_directory**: Checks all files' file_path metadata in collection
- **ingest_url**: Checks for existing crawl_root_url metadata in collection

**Workflow:**
```
# STEP 1: Try ingesting (mode='ingest' is default)
ingest_*(content, collection_name)

# STEP 2: If duplicate error occurs, decide:
  → Content unchanged? SKIP (no action needed)
  → Content updated? Use mode='reingest' (deletes old, ingests new)
  → Minor edit only? Use update_document() (in-place; only content changes trigger re-chunking)
```

**Reingest vs Update:**
- **mode='reingest'**: Deletes old document completely, re-ingests fresh content
  - **When:** Content changed significantly, need fresh chunking/embeddings/graph
  - **Result:** Complete replacement with new document ID
  - **Tools:** ingest_text, ingest_file, ingest_directory, ingest_url

- **update_document()**: Updates existing document in-place
  - **When:** Minor metadata changes, small content tweaks
  - **Result:** Same document ID, only specified fields updated
  - **Note:** Content updates trigger re-chunking (same cost as reingest)
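
The ingest, skip, or reingest decision described above might look like the following sketch. It assumes the duplicate error arrives as a dict with an `error` key whose message mentions "duplicate"; the real error shape may differ (it could be raised as an exception, for instance). `ingest_with_duplicate_handling` and `fake_ingest_text` are hypothetical names used only for illustration.

```python
from typing import Callable


def ingest_with_duplicate_handling(
    ingest: Callable[..., dict],
    content_changed: bool,
    **kwargs,
) -> dict:
    """Try a normal ingest; on a duplicate error, skip or reingest."""
    result = ingest(mode="ingest", **kwargs)
    # Assumption: duplicates are reported via an "error" message containing "duplicate".
    error = str(result.get("error", "")).lower()
    if "duplicate" not in error:
        return result
    if not content_changed:
        # Unchanged content: nothing to do, and no ingestion cost incurred.
        return {"status": "skipped", "reason": "content unchanged"}
    return ingest(mode="reingest", **kwargs)


# Stand-in for a real ingest tool, used only to exercise the sketch.
_store: set[str] = set()


def fake_ingest_text(mode: str, title: str, content: str) -> dict:
    if mode == "ingest" and title in _store:
        return {"error": f"Duplicate document: '{title}' already exists. Use mode='reingest'."}
    _store.add(title)
    return {"status": "ok", "mode": mode, "title": title}


print(ingest_with_duplicate_handling(fake_ingest_text, False, title="Notes", content="v1"))
print(ingest_with_duplicate_handling(fake_ingest_text, False, title="Notes", content="v1"))
print(ingest_with_duplicate_handling(fake_ingest_text, True, title="Notes", content="v2"))
```

Passing the ingest tool in as a callable keeps the decision logic separate from any particular client library.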

### Analyze Before Multi-Page Ingests (REQUIRED)
```
# STEP 1: Analyze website structure
analysis = analyze_website(url, include_url_lists=True)
  → Returns: total_urls, pattern_stats, elapsed_seconds
  → Note: May take up to 50 seconds for large sites

# STEP 2: Review scope and plan strategy
  → If total_urls <= 20: Single targeted ingest
  → If total_urls > 20: Multiple targeted ingests (max_pages=20 per ingest)
  → Review pattern_stats to identify sections (/api, /guides, /reference)

# STEP 3: Execute ingest(s)
ingest_url(
    url=target_url,
    follow_links=True,
    max_pages=20  # default=10, max=20
)
```

**Why:**
- Helps plan targeted ingests vs full-site ingestion
- max_pages=20 hard limit requires multiple ingests for large sites
- Pattern stats show actual site structure for informed decisions
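
As a rough sketch of the planning step, the helper below turns an analysis summary into a list of planned `ingest_url` arguments. It assumes `pattern_stats` maps a path pattern to a page count (for example `{"/api": 45}`); the actual structure returned by `analyze_website()` may differ, and `plan_ingests` is a hypothetical helper.

```python
MAX_PAGES = 20  # hard limit per ingest_url call, as stated in this guide


def plan_ingests(base_url: str, total_urls: int, pattern_stats: dict[str, int]) -> list[dict]:
    """Turn an analysis summary into a list of planned ingest_url arguments."""
    if total_urls <= MAX_PAGES:
        # Small site: a single ingest from the starting URL covers it.
        return [{"url": base_url, "follow_links": True, "max_pages": MAX_PAGES}]
    plan = []
    for pattern, count in sorted(pattern_stats.items()):
        plan.append({
            "url": base_url.rstrip("/") + pattern,
            "follow_links": True,
            "max_pages": min(count, MAX_PAGES),
            # Sections above the limit need narrower start URLs to be fully covered.
            "needs_split": count > MAX_PAGES,
        })
    return plan


plan = plan_ingests(
    "https://docs.example.com",
    100,
    {"/api": 45, "/guides": 30, "/reference": 25},
)
for step in plan:
    print(step)
```

Entries flagged with `needs_split` are candidates for a second round of analysis on that subsection before ingesting.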

**Website Analysis:**
analyze_website() discovers publicly accessible URL patterns:
- Works with sites that have sitemaps
- Also discovers URLs from sites without sitemaps
- Returns up to 150 URLs grouped by path pattern
- 50-second timeout (very large sites may time out; try analyzing subsections instead)

Example: "https://docs.example.com/api" → discovers /api section structure

### Use Reingest for Website Updates
```
Instead of: delete_document() + ingest_url()
Use: ingest_url(url, mode="reingest", ...)
```

**Why:** Reingest is safer than a manual delete followed by a fresh ingest, keeps metadata tracking intact, and leaves a cleaner knowledge base.

---

## 4. QUERY STRATEGIES

**Use `search_documents` for:**
- Finding content by meaning/topic
- "What does the knowledge base say about X?"
- Returns relevant documents and sections

**Use `query_relationships` for:**
- Discovering how concepts connect
- "What is related to X?" or "How are A and B connected?"
- Returns connections and relationships

**Use `query_temporal` for:**
- Tracking how information changes over time
- "How has X changed since 2023?"
- Returns evolution and timeline of knowledge

**Pro tip:** Combine multiple query types for comprehensive research.

---

## 5. EFFICIENCY & COST AWARENESS

**Ingestion operations have cost:**
- Every `ingest_*` call processes your content
- Cost varies by document size and complexity

**CRITICAL - INGESTION TIMING & TIMEOUT HANDLING:**

Processing time is non-deterministic and varies significantly. Examples observed:
- Single document: ~30 seconds to several minutes
- Directory: several minutes to extended processing time
- Website crawl: several minutes to extended processing time

Assess the scope of your specific request to estimate duration:
- Content size and complexity
- Number of files/documents/pages
- Crawl parameters (follow_links, max_depth, recursion)

**If the client times out:** The operation CONTINUES on the server. Use
list_documents(collection_name, include_details=True) to verify completion
after waiting an appropriate time based on your scope assessment.

Progress notifications track long-running operations for supporting clients.

**DUPLICATE REQUEST PROTECTION:**

If you submit the same ingestion request while one is already processing, you will receive:
```json
{"error": "This exact request is already processing (started Xs ago).
           Please wait for the current operation to complete.",
 "status": "duplicate_request"}
```

This prevents data corruption from concurrent identical operations. **If you see this error:**
1. **WAIT** - The original request is still processing on the server
2. **DO NOT retry immediately** - You'll get the same duplicate error
3. **Verify completion** using `list_documents(collection_name, include_details=True)`
4. **Only retry** after confirming the original request completed or failed

**Why this matters:** After a timeout, some MCP bridges (like OpenAI's) automatically retry with a new session. The duplicate protection catches this and prevents double-ingestion, which would corrupt your knowledge base with redundant data.
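
A client-side guard for this case can be as small as the sketch below. The `status` value comes from the error shown above; the `next_action` helper and its return labels are hypothetical.

```python
def next_action(response: dict) -> str:
    """Map an ingest response to the next step suggested by this guide."""
    if response.get("status") == "duplicate_request":
        # The original request is still running: do not retry, verify later.
        return "wait_then_verify"
    if "error" in response:
        return "inspect_error"
    return "done"


duplicate = {
    "error": "This exact request is already processing (started 42s ago).",
    "status": "duplicate_request",
}
print(next_action(duplicate))
print(next_action({"status": "ok"}))
```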

**Query operations are FREE:**
- `search_documents`
- `query_relationships`
- `query_temporal`
- All list/view operations

**Best practices for efficiency:**
- Let automatic duplicate detection catch duplicates (see #3)
- Use mode='reingest' for updated content (complete replacement)
- Use `update_document()` only for minor edits (in-place updates)
- Analyze large ingests before proceeding (see #3)
- Use reingest mode for website updates (see #3)

---

## 6. COMMON PATTERNS

**Documentation Ingestion (Single Section):**
```
1. analyze_website(url) - understand scope and structure
2. Review total_urls and pattern_stats
3. create_collection(name, domain, description) - organize by source
4. ingest_url(url, follow_links=True, max_pages=20)
5. get_collection_info(name) - verify completion
```

**Documentation Ingestion (Multiple Sections):**
```
1. analyze_website("https://docs.example.com") - understand site structure
   → Shows: /api (45 pages), /guides (30 pages), /reference (25 pages)
2. create_collection(name, domain, description)

# Execute multiple targeted ingests based on pattern analysis
3. ingest_url("https://docs.example.com/api", follow_links=True, max_pages=20)
4. ingest_url("https://docs.example.com/guides", follow_links=True, max_pages=20)
5. ingest_url("https://docs.example.com/reference", follow_links=True, max_pages=20)

# Each ingest is independent - plan based on pattern_stats from analyze_website()
```

**Research Query:**
```
1. search_documents(query, collection) - find relevant content
2. query_relationships(query, collection) - find connections
3. Synthesize findings from both sources
```
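
The research pattern above could be wrapped as follows. The two query tools are passed in as callables so the sketch stays independent of any client library; the result shapes and the `research` helper itself are assumptions made for illustration.

```python
from typing import Callable


def research(
    query: str,
    collection: str,
    search_documents: Callable[[str, str], list[dict]],
    query_relationships: Callable[[str, str], list[dict]],
) -> dict:
    """Run both query types and return them side by side for synthesis."""
    return {
        "query": query,
        "passages": search_documents(query, collection),
        "relationships": query_relationships(query, collection),
    }


# Stand-ins for the real tools; the returned fields are hypothetical.
def fake_search(query: str, collection: str) -> list[dict]:
    return [{"content": "Tokens expire after one hour."}]


def fake_relationships(query: str, collection: str) -> list[dict]:
    return [{"source": "AuthService", "relation": "issues", "target": "Token"}]


findings = research(
    "How does authentication work in the system?",
    "product-docs",
    fake_search,
    fake_relationships,
)
print(findings)
```

Note that the query is phrased as a full question, in line with section 2.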

**Maintenance:**
```
1. list_documents(collection) - identify stale docs
2. Refresh changed content:
   - For websites: ingest_url(url, mode="reingest")
   - For files: ingest_file(path, mode="reingest")
   - For text: ingest_text(content, mode="reingest")
   - For minor edits: update_document(id, content/metadata)
```
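
Choosing the right refresh call for a stale document can follow the metadata that duplicate detection already relies on (`crawl_root_url` for websites, `file_path` for files). The sketch below assumes each document is a dict with `title` and `metadata` keys, and the argument names `url` and `path` are guesses; check the tool docstrings for the real signatures.

```python
def refresh_call(doc: dict) -> tuple[str, dict]:
    """Pick the reingest call for a stale document from its metadata."""
    meta = doc.get("metadata", {})
    if "crawl_root_url" in meta:
        return "ingest_url", {"url": meta["crawl_root_url"], "mode": "reingest"}
    if "file_path" in meta:
        return "ingest_file", {"path": meta["file_path"], "mode": "reingest"}
    # Assumption: text documents are matched by title, per the duplicate rules.
    return "ingest_text", {"title": doc.get("title"), "mode": "reingest"}


stale_docs = [
    {"title": "API docs", "metadata": {"crawl_root_url": "https://docs.example.com/api"}},
    {"title": "Handbook", "metadata": {"file_path": "/data/handbook.md"}},
    {"title": "Meeting notes", "metadata": {}},
]
for doc in stale_docs:
    print(refresh_call(doc))
```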

---

## 7. KEY IMPERATIVES

- **MUST** review collections before ingesting (see #1)
- **MUST** use full questions for search, not keywords (see #2)
- **MUST** use mode='reingest' when replacing content that already exists; the automatic duplicate error tells you when this applies (see #3)
- **MUST** run analyze_website() before multi-page ingests (see #3)
- **MUST** keep max_pages at or below 20 per ingest (hard limit) and split large sites into multiple targeted ingests (see #3)
- **SHOULD** present scope to user for large operations and get confirmation
- **SHOULD** use mode='reingest' for content updates instead of delete+ingest (see #3)
- **SHOULD** combine semantic + relationship queries for comprehensive research (see #4)
- **SHOULD** use pattern_stats from analyze_website() to plan targeted ingests

**For tool-specific details:** See individual tool docstrings.
