Metadata-Version: 2.1
Name: pulse-engine
Version: 0.2.0.dev20260502150119
Summary: Pulse Engine — Hybrid framework for building Pulse products
Author: Pulse Team
Requires-Python: >=3.11,<3.13
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: alembic (>=1.14.0,<2.0.0)
Requires-Dist: asyncpg (>=0.30.0,<0.31.0)
Requires-Dist: beautifulsoup4 (>=4.12.0,<5.0.0)
Requires-Dist: boto3 (>=1.35.0,<2.0.0)
Requires-Dist: celery[redis] (>=5.4.0,<6.0.0)
Requires-Dist: cookiecutter (>=2.6.0,<3.0.0)
Requires-Dist: curl-cffi (>=0.15.0,<0.16.0)
Requires-Dist: email-validator (>=2.3.0,<3.0.0)
Requires-Dist: fastapi (>=0.115.0,<0.116.0)
Requires-Dist: httpx (>=0.28.0,<0.29.0)
Requires-Dist: langchain (>=1.2.13,<2.0.0)
Requires-Dist: langchain-anthropic (>=1.4.0,<2.0.0)
Requires-Dist: langchain-openai (>=1.1.11,<2.0.0)
Requires-Dist: langdetect (>=1.0.9,<2.0.0)
Requires-Dist: mcp[cli] (>=1.0.0,<2.0.0)
Requires-Dist: opensearch-py[async] (>=3.1.0,<4.0.0)
Requires-Dist: pydantic-settings (>=2.7.0,<3.0.0)
Requires-Dist: pymupdf (>=1.27.2.3,<2.0.0.0)
Requires-Dist: python-jose[cryptography] (>=3.3.0,<4.0.0)
Requires-Dist: redis (>=5.0.0,<6.0.0)
Requires-Dist: sqlalchemy[asyncio] (>=2.0,<3.0)
Requires-Dist: structlog (>=24.4.0,<25.0.0)
Requires-Dist: tiktoken (>=0.8.0,<0.9.0)
Requires-Dist: twikit (>=2.3.3,<3.0.0)
Requires-Dist: typer (>=0.15.0,<0.16.0)
Requires-Dist: uvicorn[standard] (>=0.34.0,<0.35.0)
Requires-Dist: yt-dlp (>=2026.3.17,<2027.0.0)
Description-Content-Type: text/markdown

# Pulse Engine

Hybrid Python framework for building multi-tenant data products. Products `pip install pulse-engine`, declare a manifest, and get a full FastAPI app with OpenSearch, Athena, Celery, Prefect, and MCP — out of the box.

## How It Works

```
┌──────────────────────────────────────────────────┐
│  Product (pip install pulse-engine)              │
│  manifest = ProductManifest(                     │
│    extractors=[...], preprocessor=..., ...       │
│  )                                               │
└──────────────┬───────────────────────────────────┘
               │
┌──────────────▼───────────────────────────────────┐
│  pulse-engine                                    │
│  Base ABCs · Default implementations · App       │
│  factory · Storage connectors · Job lifecycle    │
│  · CLI · Testing utilities                       │
└──────────────┬───────────────────────────────────┘
               │
┌──────────────▼───────────────────────────────────┐
│  Shared Infrastructure                           │
│  Prefect · OpenSearch · Redis · PostgreSQL       │
└──────────────────────────────────────────────────┘
```

Products customize behaviour by:
- Implementing `BaseExtractor` subclasses for data extraction
- Overriding pipeline stages (preprocessor, core, postprocessor) or using defaults
- Adding product-specific API routes, MCP tools, and Celery tasks
- Declaring everything in a `ProductManifest`
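The extractor pattern above can be sketched in a few lines. The `BaseExtractor` stand-in below is illustrative only, not the engine's actual signature; see `src/pulse_engine/extractor/base.py` for the real ABC:

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any

class BaseExtractor(ABC):
    """Stand-in for pulse_engine.extractor.base.BaseExtractor (illustrative)."""

    @abstractmethod
    async def extract(self, config: dict[str, Any]) -> list[dict[str, Any]]:
        """Fetch raw records for a job, driven by the trigger config."""

class RssExtractor(BaseExtractor):
    """Hypothetical product-side extractor."""

    async def extract(self, config: dict[str, Any]) -> list[dict[str, Any]]:
        # A real implementation would fetch and parse the feed here.
        feed_url = config.get("feed_url", "")
        return [{"source": feed_url, "title": "example item"}]

records = asyncio.run(RssExtractor().extract({"feed_url": "https://example.com/feed"}))
print(records)
```

A product ships one such subclass per data source and lists them in its manifest.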

## Prerequisites

- Python 3.11-3.12
- [Poetry](https://python-poetry.org/docs/#installation)
- Docker & Docker Compose (for shared infrastructure)

## Quick Start

### For Engine Development

```bash
git clone <repo-url> && cd pulse-engine

# Install deps and pre-commit hooks
make install

# Copy env and configure
cp .env.example .env

# Run database migrations
make migrate

# Start the dev server
make run
```

### For Building a New Product

```bash
# Install the engine
pip install pulse-engine

# Scaffold a new product
pulse init my-product
cd pulse-my-product

# Set up the product
make install
cp .env.example .env

# Validate and test
make validate
make test

# Run
make run
```

See **[docs/building-a-product.md](docs/building-a-product.md)** for the full guide, including how to add time-based filters (e.g. "last 2 days", "last 5 years") via the `config` dict in the trigger payload.

## CLI

The `pulse` command is the primary interface:

| Command | Description |
|---------|-------------|
| `pulse init <name>` | Scaffold a new product from template |
| `pulse validate [module]` | Validate a product manifest |
| `pulse run` | Discover manifest and start FastAPI server |
| `pulse run-worker` | Discover manifest and start Celery worker |
| `pulse run-mcp` | Discover manifest and start MCP server |

## Product Manifest

Products declare their components via a `ProductManifest`:

```python
from pulse_engine.registry import ProductManifest

manifest = ProductManifest(
    name="my-product",
    version="0.1.0"
)
```

Products register via a `pyproject.toml` entry point:

```toml
[tool.poetry.plugins."pulse_engine.products"]
my_product = "pulse_my_product:manifest"
```

## Project Structure

```
src/pulse_engine/
├── main.py                  # App factory (create_app)
├── config.py                # Settings (pydantic-settings)
├── registry.py              # ProductManifest, validation, discovery
├── worker.py                # Celery app factory
├── database.py              # SQLAlchemy async setup
├── dependencies.py          # FastAPI dependency injection
├── client.py                # PulseEngineClient (container → engine HTTP)
├── s3.py                    # S3Stage (NDJSON inter-stage data exchange)
├── chain_recovery.py        # Background task for stalled pipeline recovery
├── cli/
│   ├── main.py              # pulse CLI (typer)
│   └── templates/           # Cookiecutter product template
├── api/v1/
│   ├── router.py            # v1 router aggregation
│   └── health.py            # Health check endpoint
├── core/
│   ├── security.py          # JWT verification (Cognito + Job-scoped)
│   ├── job_token.py         # Job-scoped JWT issuance & verification
│   ├── scope.py             # require_scope() FastAPI dependency
│   ├── exceptions.py        # Exception hierarchy
│   ├── error_handlers.py    # Global error handlers
│   └── logging.py           # Structured logging (structlog)
├── middleware/
│   ├── request_id.py        # X-Request-ID middleware
│   ├── security_headers.py  # Defensive HTTP security headers (CSP, HSTS, etc.)
│   ├── rate_limit.py        # Sliding-window per-IP rate limiter (100 req/60 s)
│   └── tenant.py            # Dual-token middleware (Cognito + Job JWT)
├── deployment/
│   ├── models.py            # DeploymentModel ORM
│   ├── repository.py        # DeploymentRepository
│   ├── service.py           # DeploymentService
│   ├── router.py            # POST /api/v1/deployments
│   └── schemas.py           # Registration request/response
├── extractor/
│   ├── base.py              # BaseExtractor ABC
│   ├── models.py            # SQLAlchemy ORM: job_records
│   ├── stage_models.py      # SQLAlchemy ORM: job_stages
│   ├── repository.py        # JobRepository
│   ├── stage_repository.py  # StageRepository
│   ├── service.py           # JobService (stage-aware)
│   ├── router.py            # /api/v1/jobs/ endpoints
│   ├── schemas.py           # Pydantic models
│   └── orchestrator/
│       ├── base.py          # BaseOrchestratorAdapter ABC
│       ├── prefect.py       # PrefectAdapter (deployments + flow runs)
│       └── noop.py          # NoopAdapter
├── processor/
│   ├── base.py              # BasePreprocessor, BaseCoreProcessor,
│   │                        # BasePostprocessor ABCs
│   ├── pipeline.py          # Pluggable ProcessingPipeline
│   ├── router.py            # /api/v1/process/ endpoints
│   ├── schemas.py           # ProcessingContext, options
│   ├── defaults/            # Default stage implementations
│   ├── preprocessor/        # clean_html, normalize, detect_language
│   ├── core/                # chunking, NER, sentiment, topics
│   └── postprocessor/       # embeddings, dedup, quality scoring
├── storage/
│   ├── knowledge_base.py    # KnowledgeBaseService
│   ├── router.py            # /api/v1/kb/ endpoints
│   ├── schemas.py           # Document, SearchQuery, etc.
│   └── connectors/
│       ├── base.py          # BaseStorageConnector ABC
│       ├── opensearch.py    # OpenSearch connector
│       └── athena.py        # Athena connector
├── mcp/
│   ├── server.py            # FastMCP instance
│   ├── tools_kb.py          # KB tools (6)
│   ├── tools_jobs.py        # Jobs tools (6)
│   ├── tools_processor.py   # Processor tools (5)
│   ├── tools_pipelines.py   # Pipeline tools (5)
│   └── tools_modules.py     # Module registry tools (3)
├── services/
│   ├── bootstrap.py         # ServiceContainer, bootstrap_services()
│   └── opensearch.py        # OpenSearch client wrapper
└── testing/
    ├── fixtures.py          # Reusable pytest fixtures
    └── mocks.py             # MockStorageConnector, MockExtractor, etc.

infra/
├── docker-compose.yml       # Prefect, Redis, OpenSearch, PostgreSQL
└── terraform/               # AWS modules (networking, ECS, ECR, ALB)

tests/
├── unit/                    # Unit tests
│   ├── framework/           # Manifest, pipeline, base class tests
│   ├── processor/
│   ├── storage/
│   ├── deployment/          # Deployment registration tests
│   └── extractor/
└── integration/             # Integration tests
    ├── api/
    ├── mcp/
    └── pipelines/
```

## Code Quality

Pre-commit hooks run on every commit:

| Hook | What it checks |
|------|----------------|
| `trailing-whitespace` | No trailing whitespace |
| `end-of-file-fixer` | Files end with a newline |
| `check-yaml` | Valid YAML syntax |
| `ruff` | Linting (auto-fix enabled) |
| `ruff-format` | Code formatting |
| `mypy` | Strict static type-checking |

```bash
make lint    # run all hooks manually
```

## CI/CD

### PR Checks (`pr-checks.yml`)

Runs on every pull request to `dev`, `uat`, `prod`:
- **lint** — ruff check + format
- **typecheck** — mypy strict
- **test** — unit tests with coverage
- **trivy** — vulnerability scan

### Deploy (`deploy.yml`)

Runs on push to `dev`, `uat`, `prod`:

| Branch | PyPI Target | Infrastructure |
|--------|-------------|----------------|
| `dev` | PyPI (`.dev` suffix) | VM via docker-compose |
| `uat` | PyPI (`.dev` suffix) | VM via docker-compose |
| `prod` | PyPI (stable) | ECS cluster |

The pipeline: test → publish to PyPI → build Docker → push ECR → deploy → health check.

Terraform runs separately on `infra/terraform/**` changes: plan → apply → sync outputs to GitHub Secrets (pipeline infra vars auto-flow to `.env` on next deploy).

### Required Secrets (per GitHub environment)

| Secret | Description |
|--------|-------------|
| `AWS_ROLE_ARN` | OIDC role for GitHub → AWS auth |
| `ECR_REPOSITORY_URL` | ECR repository URL |
| `PYPI_TOKEN` | PyPI API token |

## MCP Server

Exposes 25 tools for AI agents via the Model Context Protocol:

| Category | Tools |
|---|---|
| **Jobs** (6) | `jobs_register`, `jobs_get`, `jobs_list`, `jobs_push_status`, `jobs_cancel`, `jobs_delete` |
| **Knowledge Base** (6) | `kb_store_documents`, `kb_retrieve_document`, `kb_search`, `kb_delete_document`, `kb_get_stats`, `kb_run_query` |
| **Processor** (5) | `process_pipeline`, `process_preprocess`, `process_analyze`, `process_postprocess`, `process_chunk` |
| **Pipelines** (5) | `pipelines_trigger`, `pipelines_status`, `pipelines_list`, `pipelines_cancel`, `pipelines_steps` |
| **Modules** (3) | `modules_register`, `modules_list`, `modules_delete` |

```bash
pulse run-mcp
```

Products register additional tools via `mcp_tool_modules` in the manifest.

## Environment Variables

### Core

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `APP_ENV` | No | `development` | Environment name (`development`, `production`, etc.) |
| `APP_VERSION` | No | `0.1.0` | Application version |
| `LOG_LEVEL` | No | `INFO` | Logging level |
| `AWS_REGION` | Yes | `ap-south-1` | AWS region |
| `AWS_ACCESS_KEY_ID` | No | — | AWS credentials (use IAM role in production) |
| `AWS_SECRET_ACCESS_KEY` | No | — | AWS credentials (use IAM role in production) |

### Authentication (Cognito)

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `COGNITO_USER_POOL_ID` | Yes | — | Cognito User Pool ID |
| `COGNITO_APP_CLIENT_ID` | Yes | — | Cognito App Client ID |
| `COGNITO_APP_CLIENT_SECRET` | No | — | Client secret (required if app client has one) |

### OpenSearch

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `OPENSEARCH_URL` | Yes | — | OpenSearch endpoint |
| `OPENSEARCH_USERNAME` | No | — | Basic-auth username (AWS managed domains) |
| `OPENSEARCH_PASSWORD` | No | — | Basic-auth password |
| `OPENSEARCH_USE_SSL` | No | `true` | Enable TLS |
| `OPENSEARCH_VERIFY_CERTS` | No | `true` | Verify TLS certificates |
| `OPENSEARCH_INDEX_PREFIX` | No | `pulse_kb` | Index name prefix per tenant |
| `EMBEDDING_DIMENSION` | No | `1536` | Vector dimension for kNN indexes |

### Database & Cache

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `DATABASE_URL` | Yes | — | Async PostgreSQL DSN (`postgresql+asyncpg://...`) |
| `REDIS_URL` | No | `redis://localhost:6379/0` | Redis URL (enables Celery) |
| `CELERY_BROKER_URL` | No | — | Celery broker (defaults to `REDIS_URL`) |
| `CELERY_RESULT_BACKEND` | No | — | Celery result backend (defaults to `REDIS_URL`) |

### Athena

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `ATHENA_AWS_ACCESS_KEY_ID` | No | — | Athena-specific AWS credentials |
| `ATHENA_AWS_SECRET_ACCESS_KEY` | No | — | Athena-specific AWS credentials |
| `ATHENA_OUTPUT_LOCATION` | Yes* | — | S3 URI for Athena query results |
| `ATHENA_WORKGROUP` | No | `primary` | Athena workgroup |
| `ATHENA_QUERY_TIMEOUT_SECONDS` | No | `60` | Athena query timeout |

\* Required only when the Athena connector is used.

### Orchestrator (Prefect)

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `PULSE_ORCHESTRATOR_BACKEND` | No | `none` | `prefect` or `none` |
| `PREFECT_API_URL` | No | — | Prefect API endpoint |
| `PREFECT_API_KEY` | No | — | Prefect Cloud API key |
| `PREFECT_ECS_WORK_POOL_NAME` | No | `products-worker-pool` | ECS work pool name |
| `PREFECT_LAMBDA_WORK_POOL_NAME` | No | `lambda-worker-pool` | Lambda work pool name |
| `PREFECT_LAMBDA_FUNCTION_NAME_TEMPLATE` | No | `{product}-{stage}` | Lambda function name pattern |
| `PREFECT_K8S_WORK_POOL_NAME` | No | `k8s-worker-pool` | Kubernetes work pool name |
| `PREFECT_K8S_NAMESPACE` | No | `pulse-jobs` | Kubernetes namespace |
| `PREFECT_K8S_DEFAULT_CPU` | No | `500m` | Default CPU request |
| `PREFECT_K8S_DEFAULT_MEMORY` | No | `1Gi` | Default memory request |

### LLM & Embeddings

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `PULSE_LLM_PROVIDER` | No | `openai` | LLM provider |
| `PULSE_LLM_MODEL` | No | `gpt-4o-mini` | LLM model ID |
| `PULSE_LLM_API_KEY` | No | — | LLM API key (also used as embedding fallback) |
| `PULSE_LLM_TEMPERATURE` | No | `0.0` | LLM sampling temperature |
| `PULSE_EMBEDDING_PROVIDER` | No | `openai` | Embedding provider |
| `PULSE_OPENAI_EMBEDDING_MODEL` | No | `text-embedding-3-small` | OpenAI embedding model |
| `PULSE_OPENAI_API_KEY` | No | — | OpenAI API key (overrides `PULSE_LLM_API_KEY`) |

### Pipeline & Jobs

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `PULSE_ENGINE_URL` | No | — | Public URL containers use for callbacks |
| `PULSE_JOB_TOKEN_SECRET` | No | — | HMAC secret for job-scoped JWTs |
| `PULSE_S3_BUCKET` | No | — | S3 bucket for inter-stage NDJSON data |
| `PULSE_CHAIN_GRACE_PERIOD_SECONDS` | No | `300` | Seconds before chain recovery auto-triggers |
| `PULSE_MAX_CONCURRENT_JOBS_PER_TENANT` | No | `10` | Max concurrent jobs per tenant |
| `PULSE_DEFAULT_CHUNK_SIZE` | No | `512` | Default chunk token size |
| `PULSE_DEFAULT_CHUNK_STRATEGY` | No | `token_count` | Default chunking strategy |
| `PULSE_DEDUP_SIMILARITY_THRESHOLD` | No | `0.95` | Cosine similarity dedup threshold |
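For intuition on `PULSE_DEDUP_SIMILARITY_THRESHOLD`, here is a self-contained sketch of cosine-similarity dedup. It is illustrative only; the engine's postprocessor operates on real embedding vectors:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dedup(embeddings: list[list[float]], threshold: float = 0.95) -> list[int]:
    """Return indices of kept documents; later near-duplicates are dropped."""
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

vecs = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
print(dedup(vecs))
```

Lowering the threshold drops more borderline duplicates; raising it keeps more near-identical documents.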

### Pipeline Infrastructure (from Terraform)

These are auto-synced from Terraform outputs to GitHub Secrets, then written to `.env` at deploy time:

| Variable | Description |
|----------|-------------|
| `PIPELINE_TASK_DEFINITION` | ECS task definition family for pipeline steps |
| `PIPELINE_CLUSTER_NAME` | ECS cluster for pipeline step tasks |
| `PIPELINE_EXECUTION_ROLE_ARN` | ECS task execution role (ECR pull, logs, secrets) |
| `PIPELINE_TASK_ROLE_ARN` | ECS task role (S3, Lambda invoke, ECS dispatch) |
| `PIPELINE_LOG_GROUP` | CloudWatch log group for ECS pipeline steps |
| `PIPELINE_SUBNETS` | Comma-separated private subnet IDs |
| `PIPELINE_SECURITY_GROUPS` | Comma-separated security group IDs |
| `LAMBDA_EXECUTION_ROLE_ARN` | Lambda execution role for pipeline functions |
| `LAMBDA_SUBNETS` | Comma-separated subnet IDs for Lambda VPC config |
| `LAMBDA_SECURITY_GROUPS` | Comma-separated security group IDs for Lambda |
| `LAMBDA_LOG_GROUP` | CloudWatch log group for Lambda pipeline steps |

### Data Source Adapters

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `YT_DLP_COOKIES_SECRET_ID` | No | — | Secrets Manager secret ID for YouTube cookies (Netscape format). Required for age-restricted or member-only videos. |
| `YT_DLP_PLAYER_CLIENTS` | No | `tv_embedded,web` | Comma-separated yt-dlp player client override. |
| `TWITTER_COOKIES_SECRET_ID` | No | — | Secrets Manager secret ID for Twitter cookies (JSON format). Takes precedence over `TWITTER_COOKIES_PATH`. |
| `TWITTER_COOKIES_PATH` | No | — | Local filesystem path to Twitter cookies JSON file. Used when Secrets Manager is not configured. |
| `OPENAI_API_KEY` | No | — | OpenAI API key for Whisper transcription in `YouTubeAudioAdapter`. |

### MCP Server

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `MCP_TRANSPORT` | No | `sse` | Transport mode: `sse` or `stdio` |
| `MCP_SSE_HOST` | No | `127.0.0.1` | MCP SSE server host |
| `MCP_SSE_PORT` | No | `8001` | MCP SSE server port |

## API Authentication

All endpoints (except `/api/v1/health` and `/api/v1/auth/login`) require a JWT token.

> **Rate limits** — enforced per IP address:
> - **Login** (`POST /api/v1/auth/login`): 5 attempts per 60 seconds. Returns `429` with `Retry-After: 60` on breach.
> - **Global**: 100 requests per 60 seconds across all endpoints. Responses include `X-RateLimit-Limit` and `X-RateLimit-Remaining` headers.
>
> **Note** — Swagger UI (`/docs`) and ReDoc (`/redoc`) are disabled in production (`APP_ENV=production`).

### Get a Token

```bash
curl -X POST https://api.dev.pulse.mananalabs.ai/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d "{\"email\": \"$PULSE_AUTH_EMAIL\", \"password\": \"$PULSE_AUTH_PASSWORD\"}"
```

Response:

```json
{
  "id_token": "eyJ...",
  "access_token": "eyJ...",
  "refresh_token": "eyJ...",
  "expires_in": 3600,
  "token_type": "Bearer",
  "tenant_id": "tenant-dev-001",
  "email": "dev@pulse-engine.com"
}
```

### Use the Token

Pass the `id_token` as a Bearer token in the `Authorization` header:

```bash
curl -H "Authorization: Bearer <id_token>" \
  https://api.dev.pulse.mananalabs.ai/api/v1/kb/stats
```

Tokens expire after **1 hour**. Call the login endpoint again to get a new one.

## API Documentation

- Swagger UI: https://api.dev.pulse.mananalabs.ai/docs
- ReDoc: https://api.dev.pulse.mananalabs.ai/redoc

## Library Usage

Pulse Engine can also be used as a standalone library for content processing:

```python
from pulse_engine.processor.core.topic_splitter import TopicSplitter

splitter = TopicSplitter(provider="openai", api_key="sk-...")
result = splitter.split([
    (1, "Hi there"),
    (2, "Let's discuss Q1 metrics"),
    (3, "Revenue grew 20%"),
])
print(result.topic_shifts)
```

See **[docs/pulse_engine_library.md](docs/pulse_engine_library.md)** for full documentation on the Topic Splitter, LLM configuration, and configurable embeddings.

## Data Source Adapters

`pulse_engine.adapters` provides standalone data-fetching utilities. Each adapter is an independently importable class with no dependency on the Pulse pipeline — use them in any Python context.

All adapter data models are plain `@dataclass` instances defined in `pulse_engine.adapters.models`. Cookie-based credentials are read from environment variables rather than passed to constructors; API keys are the exception and are passed explicitly, as shown in the examples below.

### YouTube Metadata

Fetches video metadata from a YouTube channel via the YouTube Data API v3.

```python
import asyncio
from pulse_engine.adapters.youtube_metadata import YouTubeMetadataAdapter

adapter = YouTubeMetadataAdapter(channel_name="@BBCNews", api_key="YOUR_YT_API_KEY")
videos = asyncio.run(adapter.fetch(max_results=50))
for v in videos:
    print(v.title, v.published_at, v.url)
```

Returns `list[VideoMetadata]` with fields: `video_id`, `title`, `description`, `published_at`, `channel_id`, `channel_name`, `thumbnail_url`, `duration`, `view_count`, `url`.
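Because adapter models are plain dataclasses, `VideoMetadata` has roughly this shape. Field names come from the list above; the types (and the sample values) are assumptions:

```python
from dataclasses import dataclass

@dataclass
class VideoMetadata:
    """Illustrative shape; see pulse_engine.adapters.models for the real class."""
    video_id: str
    title: str
    description: str
    published_at: str   # assumed ISO-8601 string; may be a datetime in the real model
    channel_id: str
    channel_name: str
    thumbnail_url: str
    duration: int       # assumed seconds
    view_count: int
    url: str

v = VideoMetadata(
    video_id="abc123",
    title="Example",
    description="",
    published_at="2026-01-01T00:00:00Z",
    channel_id="UC123",
    channel_name="BBC News",
    thumbnail_url="",
    duration=60,
    view_count=1000,
    url="https://youtube.com/watch?v=abc123",
)
```

Plain dataclasses keep the adapters importable anywhere without pulling in Pydantic or the pipeline.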

### YouTube Audio (Download + Transcription)

Downloads audio from a YouTube video and transcribes it with OpenAI Whisper.

```python
import asyncio
from pulse_engine.adapters.youtube_metadata import YouTubeMetadataAdapter
from pulse_engine.adapters.youtube_audio import YouTubeAudioAdapter

meta_adapter = YouTubeMetadataAdapter(channel_name="@BBCNews", api_key="YOUR_YT_API_KEY")
audio_adapter = YouTubeAudioAdapter(openai_api_key="sk-...")

videos = asyncio.run(meta_adapter.fetch(max_results=5))
audio = asyncio.run(audio_adapter.fetch(videos[0]))
print(audio.transcript)
```

Returns `VideoAudio` with fields: `video_id`, `title`, `transcript`, `segments` (`list[TranscriptSegment]`), `language`, `duration_seconds`.

Set `YT_DLP_COOKIES_SECRET_ID` (Secrets Manager) to enable age-restricted video download. See the **Secrets Manager cookie format** note below.

### Speeches (inc.in)

Scrapes speech metadata and extracts full text from PDFs published on inc.in.

```python
import asyncio
from pulse_engine.adapters.speech_metadata import SpeechMetadataAdapter
from pulse_engine.adapters.speech_content import SpeechContentAdapter

meta_adapter = SpeechMetadataAdapter(base_url="https://www.inc.in/en/media/speeches")
content_adapter = SpeechContentAdapter()

speeches = asyncio.run(meta_adapter.fetch(pages=3))
results = asyncio.run(content_adapter.fetch(speeches))
for r in results:
    print(r.metadata.title, r.content.page_count, "pages")
    print(r.content.text[:500])
```

`SpeechMetadata` fields: `title`, `speaker`, `date`, `pdf_url`, `source_url`.
`SpeechResult` fields: `metadata` (`SpeechMetadata`), `content` (`SpeechContent` with `text` and `page_count`).

### Twitter / X

Searches tweets or looks up user profiles via `twikit` (unofficial Twitter API client).

```python
import asyncio
from pulse_engine.adapters.twitter import TwitterAdapter

# Reads TWITTER_COOKIES_SECRET_ID or TWITTER_COOKIES_PATH from env
adapter = TwitterAdapter()

tweets = asyncio.run(adapter.fetch_tweets(query="Budget 2025", count=50))
for t in tweets:
    print(t.author, t.created_at, t.text[:80])

user_results = asyncio.run(adapter.fetch_user(username="FinanceMinIndia"))
```

`TweetMetadata` fields: `tweet_id`, `text`, `author`, `created_at`, `like_count`, `retweet_count`, `reply_count`, `url`.
`TwitterUserResult` fields: `user_id`, `username`, `display_name`, `description`, `followers_count`, `following_count`, `tweet_count`, `verified`, `url`.

### Secrets Manager Cookie Format

Cookie secrets stored in AWS Secrets Manager must use the **wrapper JSON** format:

```json
{"YT_DLP_COOKIES": "<full Netscape cookie file content>"}
```

```json
{"TWITTER_COOKIES": "{\"auth_token\": \"...\", \"ct0\": \"...\"}"}
```

The adapter fetches the secret, JSON-parses the wrapper, and extracts the inner value automatically. The key name must match the convention used when the secret was stored.
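The unwrap step amounts to a single JSON parse. A sketch (the real adapters also fetch the secret string via boto3, which is omitted here):

```python
import json

def extract_cookie_secret(secret_string: str, key: str) -> str:
    """Parse the wrapper JSON and return the inner cookie payload."""
    return json.loads(secret_string)[key]

# Twitter cookies: the inner value is itself a JSON string.
raw = '{"TWITTER_COOKIES": "{\\"auth_token\\": \\"abc\\", \\"ct0\\": \\"xyz\\"}"}'
cookies = json.loads(extract_cookie_secret(raw, "TWITTER_COOKIES"))
print(cookies["auth_token"])
```

For YouTube, the inner value is the raw Netscape cookie file content and needs no second parse.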

### Example Runner Scripts

Ready-to-run CLI scripts are provided under `examples/`:

| Script | What it does |
|--------|-------------|
| `examples/fetch_youtube.py` | Fetch channel videos + transcribe first result |
| `examples/fetch_speeches.py` | Scrape speech listing + extract PDF text |
| `examples/fetch_tweets.py` | Search tweets or look up a Twitter user |

```bash
# YouTube
YT_API_KEY=... OPENAI_API_KEY=sk-... python examples/fetch_youtube.py --channel @BBCNews

# Speeches
python examples/fetch_speeches.py --pages 2

# Tweets
TWITTER_COOKIES_PATH=~/.twitter_cookies.json python examples/fetch_tweets.py --query "Budget 2025"
```

## Further Reading

- [Building a Product](docs/building-a-product.md) — step-by-step guide to creating a new product
- [Design Decisions](docs/design-decisions.md) — architectural decisions and rationale
- [Infrastructure](docs/infrastructure.md) — AWS deployment architecture
- [Library Usage](docs/pulse_engine_library.md) — topic splitting, LLM config, and embeddings

