Metadata-Version: 2.4
Name: contentapi
Version: 0.2.0
Summary: Official Python SDK for ContentAPI — extract content from any URL
Project-URL: Homepage, https://getcontentapi.com
Project-URL: Documentation, https://docs.getcontentapi.com
Project-URL: Repository, https://github.com/contentapi/contentapi-python
Project-URL: Issues, https://github.com/contentapi/contentapi-python/issues
Author-email: ContentAPI <support@getcontentapi.com>
License: MIT
License-File: LICENSE
Keywords: api-client,content-extraction,contentapi,web-scraping,youtube-transcript
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.9
Requires-Dist: httpx<1.0.0,>=0.25.0
Requires-Dist: pydantic<3.0.0,>=2.0.0
Provides-Extra: dev
Requires-Dist: mypy>=1.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: respx>=0.21; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Description-Content-Type: text/markdown

# ContentAPI Python SDK

Official Python SDK for [ContentAPI](https://getcontentapi.com) — extract structured content from any URL.

[![PyPI version](https://img.shields.io/pypi/v/contentapi.svg)](https://pypi.org/project/contentapi/)
[![Python](https://img.shields.io/pypi/pyversions/contentapi.svg)](https://pypi.org/project/contentapi/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

## Features

- 🌐 **Web extraction** — Clean markdown/text from any webpage, with optional JS rendering
- 🎬 **YouTube** — Transcripts, metadata, comments, chapters, summaries, channels, playlists
- 🐦 **Twitter/X** — Tweet and thread extraction
- 🤖 **Reddit** — Post extraction
- 🔍 **Web search** — Search the web programmatically
- 🧠 **AI extraction** — Extract structured data with JSON schema or natural language
- 📝 **AI summarization** — Summarize any content with AI
- 🔗 **Site crawling** — Crawl entire websites (async with polling)
- 🔄 **URL monitoring** — Detect changes on web pages
- 📦 **Batch** — Extract multiple URLs in a single request
- ⚡ **Async support** — Full async/await with `httpx`
- 🔄 **Auto-retry** — Exponential backoff on rate limits and server errors
- 📐 **Type-safe** — Pydantic v2 models with full type hints

## Installation

```bash
pip install contentapi
```

## Quick Start

```python
from contentapi import ContentAPI

client = ContentAPI(api_key="sk_live_...")

# Extract web content
result = client.web.extract("https://example.com")
print(result.title)       # "Example Domain"
print(result.content)     # Extracted content as markdown
print(result.word_count)  # 17
```

## Usage

### Web Extraction

```python
# Default extraction
result = client.web.extract("https://example.com")

# JavaScript rendering (for SPAs)
result = client.web.extract("https://spa-app.com", render_js=True)

# Bypass robots.txt
result = client.web.extract("https://example.com", ignore_robots=True)

# RAG chunking
result = client.web.extract("https://example.com", chunk_size=500, chunk_overlap=50)

# Access structured data
print(result.title)
print(result.content)
print(result.word_count)
```
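
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a rough word-window sketch of overlapping chunking. This is an approximation for illustration only — the API's actual chunker (and whether it counts tokens, characters, or words) is not specified here:

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into overlapping word windows.

    Counts words for simplicity; the real service may count tokens
    or characters instead.
    """
    assert chunk_overlap < chunk_size, "overlap must be smaller than chunk size"
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

print(chunk_words("one two three four five six", chunk_size=4, chunk_overlap=2))
# ['one two three four', 'three four five six']
```

Each chunk repeats the last `chunk_overlap` words of the previous one, which keeps sentence context intact across chunk boundaries when embedding for RAG.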

### YouTube

```python
# Get transcript (with Whisper AI fallback for videos without captions)
transcript = client.youtube.transcript("https://youtube.com/watch?v=dQw4w9WgXcQ")
print(transcript.title)
print(transcript.full_text)

for segment in transcript.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")

# Get video metadata
metadata = client.youtube.metadata("https://youtube.com/watch?v=dQw4w9WgXcQ")
print(metadata.view_count, metadata.published_at)

# Get top comments
comments = client.youtube.comments("https://youtube.com/watch?v=dQw4w9WgXcQ", limit=20)
for c in comments.comments:
    print(f"@{c.author}: {c.text} ({c.likes} likes)")

# Get chapters from description
chapters = client.youtube.chapters("https://youtube.com/watch?v=dQw4w9WgXcQ")
for ch in chapters.chapters:
    print(f"{ch.formatted_time} - {ch.title}")

# AI-generated summary
summary = client.youtube.summary("https://youtube.com/watch?v=dQw4w9WgXcQ")
print(summary.summary)
print(summary.key_points)
print(summary.topics)

# Channel metadata + recent videos
channel = client.youtube.channel("@mkbhd")
print(f"{channel.name} - {channel.subscribers} subscribers")
for video in channel.recent_videos:
    print(f"  {video.title} ({video.views} views)")

# Playlist extraction
playlist = client.youtube.playlist("https://youtube.com/playlist?list=PLrAXt...")
for video in playlist.videos:
    print(f"#{video.position} {video.title}")
```
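
Transcript segments expose raw seconds; if you prefer `m:ss` (or `h:mm:ss`) labels for display, a small formatting helper does the trick. This helper is our own, not part of the SDK:

```python
def fmt_time(seconds: float) -> str:
    """Format a seconds offset as m:ss, or h:mm:ss for hour-long videos."""
    total = int(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}" if h else f"{m}:{s:02d}"

print(fmt_time(75.3))    # 1:15
print(fmt_time(3671.0))  # 1:01:11
```

Then the segment loop above becomes `print(f"[{fmt_time(segment.start)}] {segment.text}")`.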

### AI Schema Extraction

```python
# Extract structured data with a JSON schema
data = client.ai.extract(
    url="https://news.ycombinator.com",
    schema={
        "top_stories": [{"title": "string", "points": "number", "url": "string"}]
    }
)
print(data.extracted)  # Structured data matching your schema

# Or use natural language
data = client.ai.extract(
    url="https://amazon.com/product/...",
    prompt="Extract the product name, price, and rating"
)
print(data.extracted)
```

### AI Summarization

```python
result = client.ai.summarize(
    content="Long article text here...",
    title="Optional title"
)
print(result.summary)     # Concise 2-3 sentence summary
print(result.key_points)  # List of key takeaways
print(result.topics)      # Auto-detected topics
```

### Site Crawling

```python
# Start an async crawl
crawl = client.crawl.start(
    url="https://docs.example.com",
    max_pages=50,
    include_patterns=["/docs/*"],
    webhook_url="https://myapp.com/hook"  # Optional: get notified when done
)
print(f"Crawl started: {crawl.crawl_id}")

# Poll for results
import time
while True:
    status = client.crawl.get(crawl.crawl_id)
    if status.status in ("completed", "failed"):
        break
    print(f"Progress: {status.pages_completed}/{status.pages_found}")
    time.sleep(5)

# Access results
for page in status.results:
    print(f"{page.url} — {page.word_count} words")
    print(page.content[:200])
```
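
The polling loop above can be wrapped in a reusable helper with a deadline, so a stuck crawl doesn't spin forever. The helper below is a sketch of our own (`poll_until_done` and its parameters are not SDK API); it works with any fetcher that returns an object carrying a `.status` attribute:

```python
import time

def poll_until_done(fetch, interval: float = 5.0, timeout: float = 600.0):
    """Call fetch() until the result is terminal or the deadline passes.

    fetch: zero-arg callable returning an object with a .status attribute,
    e.g. lambda: client.crawl.get(crawl.crawl_id).
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch()
        if status.status in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("crawl did not finish before the deadline")
        time.sleep(interval)

# Usage with the SDK:
# status = poll_until_done(lambda: client.crawl.get(crawl.crawl_id))
```

Using `time.monotonic()` rather than `time.time()` keeps the deadline correct even if the system clock is adjusted mid-crawl.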

### URL Monitoring (Change Detection)

```python
# Create a monitor
monitor = client.monitor.create(
    url="https://competitor.com/pricing",
    interval_hours=24,
    webhook_url="https://myapp.com/changes"
)
print(f"Monitor active: {monitor.monitor_id}")

# List all monitors
monitors = client.monitor.list()
for m in monitors.monitors:
    print(f"{m.url} — next check: {m.next_check}")

# Get change history
details = client.monitor.get(monitor.monitor_id)
for check in details.checks:
    if check.changed:
        print(f"Changed at {check.checked_at}: {check.diff_summary}")

# Delete a monitor
client.monitor.delete(monitor.monitor_id)
```

### Twitter / X

```python
tweet = client.twitter.tweet("https://x.com/user/status/123456789")
print(tweet.content)

thread = client.twitter.thread("https://x.com/user/status/123456789")
for t in thread.tweets:
    print(t.text, t.likes)
```

### Reddit

```python
post = client.reddit.post("https://reddit.com/r/Python/comments/abc123/my_post/")
print(post.title, post.score)
print(post.content)
```

### Web Search

```python
results = client.search("python RAG tutorial", count=5)
for item in results.results:
    print(f"{item.title}: {item.url}")
```

### Batch Extraction

```python
batch = client.batch([
    "https://example.com",
    "https://youtube.com/watch?v=dQw4w9WgXcQ",
])
print(f"{batch.summary.succeeded}/{batch.summary.total} succeeded")
```

### Async Usage

```python
import asyncio
from contentapi import ContentAPI

async def main():
    async with ContentAPI(api_key="sk_live_...", async_mode=True) as client:
        # Parallel requests
        web, yt = await asyncio.gather(
            client.web.extract("https://example.com"),
            client.youtube.transcript("https://youtube.com/watch?v=dQw4w9WgXcQ"),
        )
        print(web.title, yt.full_text[:100])

asyncio.run(main())
```

## Error Handling

```python
from contentapi import (
    ContentAPI,
    ContentAPIError,
    AuthenticationError,
    RateLimitError,
    QuotaExceededError,
    ExtractionError,
)

try:
    result = client.web.extract("https://example.com")
except AuthenticationError:
    print("Invalid API key!")
except RateLimitError as e:
    print(f"Rate limited! Retry after {e.retry_after}s")
except QuotaExceededError:
    print("Out of credits!")
except ExtractionError as e:
    print(f"Extraction failed: {e.message}")
```

The SDK automatically retries requests that fail with HTTP 429 (rate limited) or 503 (service unavailable) responses, using exponential backoff, up to `max_retries` attempts.

## Configuration

```python
client = ContentAPI(
    api_key="sk_live_...",               # Required
    base_url="https://api.example.com",  # Custom base URL
    timeout=60.0,                        # Request timeout (seconds)
    max_retries=3,                       # Max retry attempts
)
```

## Also Available

- **TypeScript SDK** — `npm install contentapi`
- **MCP Server** — `npx @contentapi/mcp-server` (for Claude, Cursor, Windsurf)
- **LangChain** — `pip install langchain-contentapi`
- **LlamaIndex** — `pip install llamaindex-contentapi`

## Requirements

- Python ≥ 3.9
- `httpx` ≥ 0.25
- `pydantic` ≥ 2.0

## License

MIT — see [LICENSE](LICENSE).
