Metadata-Version: 2.4
Name: wscrape-cli
Version: 1.5.1
Summary: Fast async web scraper — CLI tool, Python library, and AI agent skill.
Author: wscrape contributors
License: MIT
Project-URL: Homepage, https://github.com/wscrape/wscrape
Project-URL: Repository, https://github.com/wscrape/wscrape
Project-URL: Issues, https://github.com/wscrape/wscrape/issues
Keywords: scraper,web,crawler,async,ai-agent,cli
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: httpx>=0.25
Requires-Dist: lxml>=4.9
Requires-Dist: cssselect>=1.2
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21; extra == "dev"
Requires-Dist: respx>=0.20; extra == "dev"

# scrape

**Fast async web scraper — CLI tool, Python library, and AI agent skill.**

`scrape` is a production-ready Python package that scrapes websites, optionally crawls links recursively, and returns structured data suitable for AI consumption.

## Installation

```bash
pip install wscrape-cli
```

This installs the `scrape` command-line tool and the importable `scrape` package. For development, include the `dev` extra:

```bash
pip install "wscrape-cli[dev]"
```

## CLI Usage

```bash
# Scrape a single page
scrape https://example.com

# Deep crawl with depth limit
scrape https://example.com --deep --max-depth 2

# Save results to a file
scrape https://example.com --deep --save --output data.json

# CSV output
scrape https://example.com --format csv --save --output data.csv

# JSON Lines output (streaming-friendly)
scrape https://example.com --format jsonl --save --output data.jsonl

# Custom rate limiting and concurrency
scrape https://example.com --deep --rate 500 --concurrency 3

# Limit max pages
scrape https://example.com --deep --max-pages 50

# Allow external domain crawling
scrape https://example.com --deep --allow-external

# Use a proxy
scrape https://example.com --proxy http://proxy:8080

# Quiet mode (errors only)
scrape https://example.com --quiet
```

### CLI Options

| Option | Description | Default |
|---|---|---|
| `<url>` | Target URL to scrape | *(required)* |
| `--deep` | Recursively crawl same-domain links | `False` |
| `--save` | Save output to a file | `False` |
| `--output <file>` | Output file path | `output.json` |
| `--format <json\|csv\|jsonl>` | Output format | `json` |
| `--max-depth <n>` | Maximum recursion depth | `1` |
| `--max-pages <n>` | Maximum pages to scrape | `100` |
| `--concurrency <n>` | Concurrent requests | `1` |
| `--rate <ms>` | Delay between requests in ms | `200` |
| `--allow-external` | Allow crawling external domains | `False` |
| `--proxy <url>` | Proxy URL | `None` |
| `--verbose` / `-v` | Enable debug logging | `False` |
| `--quiet` / `-q` | Suppress output | `False` |

## Python Library Usage

```python
from scrape import run_scraper

# Synchronous — works everywhere
results = run_scraper("https://example.com", deep=True, max_depth=2)
for page in results:
    print(page["url"], page["title"])
```

### Async Usage

```python
import asyncio
from scrape import async_run_scraper

results = asyncio.run(
    async_run_scraper("https://example.com", deep=True, max_depth=2)
)
```
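
In code that is already async, `async_run_scraper` can be awaited directly inside your own coroutines; a minimal sketch using only the keyword arguments shown above:

```python
import asyncio
from scrape import async_run_scraper

async def main() -> None:
    # Await directly; asyncio.run() cannot be called from inside a running loop.
    results = await async_run_scraper("https://example.com", deep=True, max_depth=2)
    for page in results:
        print(page["url"], page["title"])

asyncio.run(main())
```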

### Batch Scraping

```python
from scrape import scrape_urls

# Scrape multiple URLs concurrently
results = scrape_urls(["https://example.com", "https://example.org"])
```

## AI Agent Skill

`scrape` is designed to be called programmatically by AI agents (Claude, OpenAI, etc.):

```python
from scrape import run_scraper

# Returns list[dict] with keys: url, title, text, links, images, meta_description, meta_author, depth
data = run_scraper("https://example.com")
```
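
A common integration pattern is to expose the scraper as a tool the agent can call. The wrapper below is an illustrative sketch, not part of the package: its name, signature, and the trimmed-down JSON payload are assumptions; only `run_scraper` and the documented result keys come from the library.

```python
import json
from scrape import run_scraper

def scrape_tool(url: str, deep: bool = False, max_depth: int = 1) -> str:
    """Hypothetical agent-tool wrapper: scrape a URL and return results as a JSON string."""
    pages = run_scraper(url, deep=deep, max_depth=max_depth)
    # Keep the payload compact for the agent: drop full page text, keep structure.
    summary = [
        {"url": p["url"], "title": p["title"], "links": p["links"][:20]}
        for p in pages
    ]
    return json.dumps(summary, ensure_ascii=False)
```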

## Output Format

Each scraped page produces:

```json
{
  "url": "https://example.com",
  "title": "Example Domain",
  "text": "Example Domain This domain is for use in illustrative examples ...",
  "links": ["https://www.iana.org/domains/example"],
  "images": [],
  "meta_description": "",
  "meta_author": "",
  "depth": 0
}
```
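
With `--format jsonl` the CLI writes one such object per line, which makes saved results easy to reload. A small sketch for reading them back (assumes the `data.jsonl` file produced by the CLI example above):

```python
import json

# Load pages saved with: scrape https://example.com --format jsonl --save --output data.jsonl
with open("data.jsonl", encoding="utf-8") as fh:
    pages = [json.loads(line) for line in fh if line.strip()]

for page in pages:
    print(page["depth"], page["url"], page["title"])
```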

## Architecture

```
scrape/
  __init__.py      # Public API & AI skill interface (run_scraper, async_run_scraper, scrape_urls)
  __main__.py      # python -m scrape support
  cli.py           # CLI entry point (argparse)
  config.py        # ScrapeConfig dataclass
  exceptions.py    # Custom exception hierarchy
  core/
    fetcher.py     # Async HTTP client with retries, backoff, jitter, proxy support
    parser.py      # HTML parsing (lxml) — title, text, links, images, metadata, CSS, XPath
    crawler.py     # BFS crawler with concurrency, rate limiting & domain policy
    output.py      # JSON / CSV / JSONL serialisation & file output
tests/
  test_fetcher.py  # Fetcher unit tests (mocked HTTP)
  test_parser.py   # Parser unit tests
  test_crawler.py  # Crawler unit tests (mocked HTTP)
  test_cli.py      # CLI integration tests
```
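
`config.py` defines the `ScrapeConfig` dataclass that holds the crawl settings. The sketch below is illustrative only, with field names assumed from the CLI defaults listed above rather than taken from the actual source:

```python
from dataclasses import dataclass

@dataclass
class ScrapeConfig:
    """Illustrative sketch only; the real field names may differ."""
    url: str
    deep: bool = False
    max_depth: int = 1            # --max-depth
    max_pages: int = 100          # --max-pages
    concurrency: int = 1          # --concurrency
    rate_ms: int = 200            # --rate (delay between requests, ms)
    allow_external: bool = False  # --allow-external
    proxy: str | None = None      # --proxy
```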

## Development

```bash
git clone https://github.com/wscrape/wscrape.git
cd wscrape
pip install -e ".[dev]"
pytest
```

## License

MIT
