Metadata-Version: 2.4
Name: sourceweave-web-search
Version: 0.2.3
Summary: MCP server and CLI for search-first web research with batched page reading, focused extraction, and direct URL support
Keywords: crawl4ai,mcp,model-context-protocol,openwebui,search,web-search
Author: Mohammad ElNaqa
Author-email: Mohammad ElNaqa <55245971+MRNAQA@users.noreply.github.com>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Dist: aiohttp
Requires-Dist: loguru
Requires-Dist: markitdown[docx,pdf,pptx,xlsx]
Requires-Dist: mcp>=1.8.0
Requires-Dist: pydantic
Requires-Dist: redis>=5.0
Requires-Python: >=3.12
Project-URL: Homepage, https://github.com/MRNAQA/sourceweave-web-search
Project-URL: Repository, https://github.com/MRNAQA/sourceweave-web-search
Project-URL: Issues, https://github.com/MRNAQA/sourceweave-web-search/issues
Description-Content-Type: text/markdown

# SourceWeave Web Search

<!-- mcp-name: io.github.MRNAQA/sourceweave-web-search -->

SourceWeave Web Search is an MCP server and CLI for search-first web research plus follow-up page reading.

It uses SearXNG for discovery, Crawl4AI for cleaned page extraction, and Redis or Valkey as the canonical persisted page cache.

For most users, the setup is simple:

1. run the supporting services locally in containers, or point at existing external endpoints
2. start the MCP server with `uvx`
3. connect your MCP client to the running server over `stdio` or local HTTP

## Key Features

- MCP server with `stdio`, `sse`, and `streamable-http` transports
- search-first source discovery plus batched page reading for MCP clients
- explicit per-URL document conversion for PDFs and other supported documents
- focused reads, direct URL reads, related-link limits, image metadata, and page-quality hints
- publishable Python package, container image, and generated OpenWebUI artifact
- compatible with OpenCode, VS Code Copilot, and other MCP clients

## Requirements

- Python `3.12+`
- a reachable SearXNG endpoint
- a reachable Crawl4AI endpoint
- a reachable Redis or Valkey instance

Optional:

- Docker and Docker Compose for the repo-local stack

## Recommended Local Deployment

Start the supporting services locally:

```bash
git clone https://github.com/MRNAQA/sourceweave-web-search.git
cd sourceweave-web-search
cp .env.example .env
docker compose up -d redis crawl4ai searxng
```

Then start the MCP server from the published package with `uvx` and point it at those local endpoints:

```bash
SOURCEWEAVE_SEARCH_SEARXNG_BASE_URL="http://127.0.0.1:19080/search?format=json&q=<query>" \
SOURCEWEAVE_SEARCH_CRAWL4AI_BASE_URL="http://127.0.0.1:19235" \
SOURCEWEAVE_SEARCH_CACHE_REDIS_URL="redis://127.0.0.1:16379/2" \
uvx --from sourceweave-web-search sourceweave-search-mcp
```

For a local HTTP MCP endpoint instead of `stdio`:

```bash
SOURCEWEAVE_SEARCH_SEARXNG_BASE_URL="http://127.0.0.1:19080/search?format=json&q=<query>" \
SOURCEWEAVE_SEARCH_CRAWL4AI_BASE_URL="http://127.0.0.1:19235" \
SOURCEWEAVE_SEARCH_CACHE_REDIS_URL="redis://127.0.0.1:16379/2" \
uvx --from sourceweave-web-search sourceweave-search-mcp \
  --transport streamable-http \
  --host 127.0.0.1 \
  --port 8000
```

You can also point the same `uvx` command at externally hosted SearXNG, Crawl4AI, and Redis or Valkey endpoints by changing the environment variables.
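
Before starting the server against local or external endpoints, it can help to confirm that each one accepts connections. The following is a minimal sketch, not part of the package: `host_and_port` and `is_reachable` are hypothetical helper names, and the check only opens a TCP connection, so it does not prove that the service behind the port is healthy.

```python
import socket
from urllib.parse import urlsplit

# Fallback ports per URL scheme, used only when the URL does not name a port.
DEFAULT_PORTS = {"http": 80, "https": 443, "redis": 6379, "rediss": 6379}


def host_and_port(url: str) -> tuple[str, int]:
    """Extract the TCP host and port from an endpoint URL."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError(f"no host in URL: {url!r}")
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if port is None:
        raise ValueError(f"no port and unknown scheme in URL: {url!r}")
    return parts.hostname, port


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint succeeds."""
    host, port = host_and_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Calling `is_reachable("redis://127.0.0.1:16379/2")` after `docker compose up` should return `True` once the container is listening; a `False` result usually means the service is still starting or the port mapping differs.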

## Installation Options

### Python package

Published releases can be installed from PyPI:

```bash
pip install sourceweave-web-search
```

Or run directly without a global install:

```bash
uvx --from sourceweave-web-search sourceweave-search-mcp
uvx --from sourceweave-web-search sourceweave-search --query "python programming"
```

### Repo checkout

For local development or source-based runs:

```bash
git clone https://github.com/MRNAQA/sourceweave-web-search.git
cd sourceweave-web-search
uv sync --locked --group dev
uv run sourceweave-search-mcp
```

### Container image

The release workflow can publish a container image to:

- `ghcr.io/mrnaqa/sourceweave-web-search-mcp`

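Example runtime (on Linux, add `--add-host=host.docker.internal:host-gateway` so that `host.docker.internal` resolves):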

```bash
docker run --rm -p 8000:8000 \
  -e SOURCEWEAVE_SEARCH_SEARXNG_BASE_URL="http://host.docker.internal:19080/search?format=json&q=<query>" \
  -e SOURCEWEAVE_SEARCH_CRAWL4AI_BASE_URL="http://host.docker.internal:19235" \
  -e SOURCEWEAVE_SEARCH_CACHE_REDIS_URL="redis://host.docker.internal:16379/2" \
  ghcr.io/mrnaqa/sourceweave-web-search-mcp:latest
```

Example `docker compose` recipe:

```yaml
services:
  redis:
    image: valkey/valkey:9-alpine
    command: ["redis-server", "--appendonly", "no"]

  crawl4ai:
    image: unclecode/crawl4ai:0.8.6

  searxng:
    image: searxng/searxng:2026.4.11-9e08a6771

  sourceweave-mcp:
    image: ghcr.io/mrnaqa/sourceweave-web-search-mcp:latest
    depends_on:
      - redis
      - crawl4ai
      - searxng
    environment:
      SOURCEWEAVE_SEARCH_SEARXNG_BASE_URL: http://searxng:8080/search?format=json&q=<query>
      SOURCEWEAVE_SEARCH_CRAWL4AI_BASE_URL: http://crawl4ai:11235
      SOURCEWEAVE_SEARCH_CACHE_REDIS_URL: redis://redis:6379/2
      FASTMCP_HOST: 0.0.0.0
      FASTMCP_PORT: 8000
    ports:
      - "8000:8000"
```

This recipe exposes a local HTTP MCP endpoint at `http://127.0.0.1:8000/mcp`. The SourceWeave container reaches the supporting services by their compose service names (`searxng`, `crawl4ai`, `redis`).

The repo's own `docker compose up -d --build mcp` path also builds and runs this same publishable image locally.

## Runtime Configuration

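Configure the runtime with these environment variables. The three endpoint variables fall back to the values listed under Defaults when unset: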

| Variable | Purpose |
| --- | --- |
| `SOURCEWEAVE_SEARCH_SEARXNG_BASE_URL` | SearXNG URL template. Must contain `<query>`. |
| `SOURCEWEAVE_SEARCH_CRAWL4AI_BASE_URL` | Crawl4AI base URL. |
| `SOURCEWEAVE_SEARCH_CACHE_REDIS_URL` | Redis or Valkey URL used for caching. |
| `FASTMCP_HOST` | Host for `sse` or `streamable-http` transport. |
| `FASTMCP_PORT` | Port for `sse` or `streamable-http` transport. |

Example:

```bash
SOURCEWEAVE_SEARCH_SEARXNG_BASE_URL="http://127.0.0.1:19080/search?format=json&q=<query>" \
SOURCEWEAVE_SEARCH_CRAWL4AI_BASE_URL="http://127.0.0.1:19235" \
SOURCEWEAVE_SEARCH_CACHE_REDIS_URL="redis://127.0.0.1:16379/2" \
sourceweave-search --query "python programming" --read-first-pages 2
```
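
To illustrate why the SearXNG variable must contain `<query>`, here is a rough sketch of how such a template could be expanded into a request URL. It is an assumption about the mechanism, not the package's actual code: `build_search_url` is a hypothetical name, and the choice of `quote_plus` for encoding is a guess.

```python
from urllib.parse import quote_plus

PLACEHOLDER = "<query>"


def build_search_url(template: str, query: str) -> str:
    """Fill the <query> placeholder with a URL-encoded query string."""
    if PLACEHOLDER not in template:
        # Mirrors the documented rule that the template must contain <query>.
        raise ValueError(f"template must contain {PLACEHOLDER}")
    return template.replace(PLACEHOLDER, quote_plus(query))
```

With the default template, `build_search_url(template, "python programming")` yields a URL ending in `q=python+programming`. A template without the placeholder fails early instead of silently sending the same request for every query.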

## Quick Start

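The CLI is useful for smoke-testing the runtime outside an MCP client. The examples below assume `sourceweave-search` is on your `PATH` (for example after `pip install sourceweave-web-search`) and that the supporting services are reachable.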

Search and immediately read the first results:

```bash
sourceweave-search --query "python programming" --read-first-pages 2
```

Read a discovered page and include stored related links:

```bash
sourceweave-search \
  --query "react useEffect cleanup example" \
  --read-first-page \
  --related-links-limit 3
```

Read a direct URL without running `search_web` first:

```bash
sourceweave-search \
  --read-url "https://packaging.python.org/en/latest/" \
  --max-chars 2000
```

Force document conversion for an explicit URL:

```bash
sourceweave-search \
  --query "guide pdf" \
  --url '{"url": "https://example.com/guide.pdf", "convert_document": true}'
```
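
The `--url` value above is a JSON object, which is easy to misquote by hand. When scripting the CLI, one option is to build the argument with `json.dumps` and let `shlex` handle the shell quoting. This is only a sketch: `conversion_command` is a hypothetical helper, and it uses just the flags and JSON keys shown in the example above.

```python
import json
import shlex


def conversion_command(query: str, url: str) -> list[str]:
    """Build the argv for a search that forces document conversion on one URL."""
    # Same JSON shape as the hand-written --url value in the example above.
    url_spec = json.dumps({"url": url, "convert_document": True})
    return ["sourceweave-search", "--query", query, "--url", url_spec]


argv = conversion_command("guide pdf", "https://example.com/guide.pdf")
# shlex.join produces a copy-pasteable, correctly quoted shell command line.
print(shlex.join(argv))
```

The list form can be passed directly to `subprocess.run(argv)`, which avoids shell quoting altogether.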

## MCP Server

Run over stdio:

```bash
sourceweave-search-mcp
```

Run as a local HTTP endpoint:

```bash
sourceweave-search-mcp --transport streamable-http --host 127.0.0.1 --port 8000
```

## What MCP Clients Get

MCP clients receive a simple two-step flow:

- `search_web`: discover relevant sources with compact summaries, key points, metadata, and stable `page_id` handles for follow-up work
- `read_pages`: read by `page_id` after `search_web` or use it as a standalone direct-URL reader, batch related pages in one call, optionally focus the extraction, and retrieve stored related-link and page-quality context when useful


Human operators usually only need to know how to run the server and where to point the runtime endpoints. MCP clients handle the exact tool parameters.

## MCP Client Setup

### OpenCode

Example `opencode.json` / `opencode.jsonc` / `~/.config/opencode/opencode.json`:

```jsonc
{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "sourceweave": {
      "type": "local",
      "command": [
        "uvx",
        "--from",
        "sourceweave-web-search",
        "sourceweave-search-mcp"
      ],
      "environment": {
        "SOURCEWEAVE_SEARCH_SEARXNG_BASE_URL": "http://127.0.0.1:19080/search?format=json&q=<query>",
        "SOURCEWEAVE_SEARCH_CRAWL4AI_BASE_URL": "http://127.0.0.1:19235",
        "SOURCEWEAVE_SEARCH_CACHE_REDIS_URL": "redis://127.0.0.1:16379/2"
      },
      "enabled": true,
      "timeout": 30000
    }
  }
}
```

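For a shared HTTP endpoint instead (port `18000` matches the repo's `mcp` compose service; use `8000` if you started the server directly with `uvx`):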

```json
{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "sourceweave": {
      "type": "remote",
      "url": "http://127.0.0.1:18000/mcp",
      "enabled": true,
      "timeout": 30000
    }
  }
}
```

### VS Code Copilot

Example `.vscode/mcp.json`:

```json
{
  "servers": {
    "sourceweave": {
      "type": "stdio",
      "command": "uvx",
      "args": [
        "--from",
        "sourceweave-web-search",
        "sourceweave-search-mcp"
      ],
      "env": {
        "SOURCEWEAVE_SEARCH_SEARXNG_BASE_URL": "http://127.0.0.1:19080/search?format=json&q=<query>",
        "SOURCEWEAVE_SEARCH_CRAWL4AI_BASE_URL": "http://127.0.0.1:19235",
        "SOURCEWEAVE_SEARCH_CACHE_REDIS_URL": "redis://127.0.0.1:16379/2"
      }
    }
  }
}
```

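For a shared HTTP endpoint instead (port `18000` matches the repo's `mcp` compose service; use `8000` if you started the server directly with `uvx`):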

```json
{
  "servers": {
    "sourceweave": {
      "type": "http",
      "url": "http://127.0.0.1:18000/mcp"
    }
  }
}
```

## OpenWebUI

This repo also ships a generated standalone OpenWebUI tool file at `artifacts/sourceweave_web_search.py`.

From a repo checkout, verify it is in sync with the canonical implementation:

```bash
uv run sourceweave-build-openwebui --check
```

Paste that artifact into OpenWebUI when you want the standalone tool-file deployment path.

## Defaults

Default host-side endpoints used by the package:

- SearXNG: `http://127.0.0.1:19080/search?format=json&q=<query>`
- Crawl4AI: `http://127.0.0.1:19235`
- Redis: `redis://127.0.0.1:16379/2`

Default repo-local ports:

- SearXNG: `19080`
- Crawl4AI: `19235`
- Redis: `16379`
- MCP: `8000` when run directly with `uvx`; `18000` at `/mcp` when using the repo's `mcp` compose service
