Metadata-Version: 2.4
Name: speechflow
Version: 0.4.0
Summary: Async-first TTS (Text-to-Speech) wrapper library for Python
Author-email: minamik <mia@sync.dev>
License-Expression: MIT
License-File: LICENSE
Keywords: fishaudio,gemini,kokoro,openai,speech-synthesis,style-bert-vits2,text-to-speech,tts
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: numpy>=1.26.4
Provides-Extra: all
Requires-Dist: elevenlabs>=2.0.0; extra == 'all'
Requires-Dist: en-core-web-sm<4.0.0,>=3.8.0; extra == 'all'
Requires-Dist: fish-audio-sdk>=1.2.0; extra == 'all'
Requires-Dist: google-genai>=1.18.0; extra == 'all'
Requires-Dist: kokoro>=0.9.4; extra == 'all'
Requires-Dist: misaki[ja]>=0.9.4; extra == 'all'
Requires-Dist: openai>=1.84.0; extra == 'all'
Requires-Dist: qwen-tts>=0.1.1; extra == 'all'
Requires-Dist: setuptools<82; extra == 'all'
Requires-Dist: sounddevice>=0.5.0; extra == 'all'
Requires-Dist: style-bert-vits2>=2.5.0; extra == 'all'
Requires-Dist: torch; extra == 'all'
Provides-Extra: elevenlabs
Requires-Dist: elevenlabs>=2.0.0; extra == 'elevenlabs'
Provides-Extra: fishaudio
Requires-Dist: fish-audio-sdk>=1.2.0; extra == 'fishaudio'
Provides-Extra: gemini
Requires-Dist: google-genai>=1.18.0; extra == 'gemini'
Provides-Extra: kokoro
Requires-Dist: en-core-web-sm<4.0.0,>=3.8.0; extra == 'kokoro'
Requires-Dist: kokoro>=0.9.4; extra == 'kokoro'
Requires-Dist: misaki[ja]>=0.9.4; extra == 'kokoro'
Requires-Dist: torch; extra == 'kokoro'
Provides-Extra: openai
Requires-Dist: openai>=1.84.0; extra == 'openai'
Provides-Extra: player
Requires-Dist: sounddevice>=0.5.0; extra == 'player'
Provides-Extra: qwen3tts
Requires-Dist: qwen-tts>=0.1.1; extra == 'qwen3tts'
Provides-Extra: stylebert
Requires-Dist: setuptools<82; extra == 'stylebert'
Requires-Dist: style-bert-vits2>=2.5.0; extra == 'stylebert'
Requires-Dist: torch; extra == 'stylebert'
Description-Content-Type: text/markdown

# SpeechFlow

A unified async-first Python TTS (Text-to-Speech) library with multiple engine support.

## Features

- **Multiple TTS Engines**: OpenAI, Google Gemini, FishAudio, ElevenLabs, Kokoro (local), Qwen3-TTS (local), Style-Bert-VITS2 (local)
- **Async-First Design**: Native async/await API with sync wrappers for convenience
- **Streaming Support**: Real-time audio streaming for supported engines
- **Decoupled Architecture**: Engines, player, and writer are independent components
- **Optional Dependencies**: Core requires only numpy; each engine is installable as an extra

## Installation

```bash
# Core only (no engines)
uv add speechflow

# Install with specific engine
uv add "speechflow[openai]"

# Install with audio playback
uv add "speechflow[openai,player]"

# Install everything
uv add "speechflow[all]"
```

### Available Extras

| Extra | Engine | Type |
|-------|--------|------|
| `openai` | OpenAI TTS | Cloud |
| `gemini` | Google Gemini TTS | Cloud |
| `fishaudio` | FishAudio TTS | Cloud |
| `elevenlabs` | ElevenLabs TTS | Cloud |
| `kokoro` | Kokoro TTS (includes PyTorch) | Local |
| `qwen3tts` | Qwen3-TTS (includes PyTorch) | Local |
| `stylebert` | Style-Bert-VITS2 (includes PyTorch) | Local |
| `player` | Audio playback via sounddevice | Utility |
| `all` | All of the above | - |
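Because every engine is an optional extra, a program can probe whether a backend's dependency is importable before constructing an engine. A minimal sketch using only the standard library (the module names probed below are assumptions based on the extras table, not something speechflow exposes):

```python
import importlib.util

def backend_available(module_name: str) -> bool:
    """Return True if the given optional dependency is importable."""
    return importlib.util.find_spec(module_name) is not None

# Example probes; module names assumed from the extras above
has_player = backend_available("sounddevice")
has_openai = backend_available("openai")
```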

<details>
<summary>Using pip instead of uv</summary>

```bash
pip install "speechflow[openai]"
pip install "speechflow[openai,player]"
pip install "speechflow[all]"
```
</details>

### GPU Support (Kokoro / Qwen3-TTS / Style-Bert-VITS2)

The local engines pull in PyTorch as a dependency; by default, the CPU-only build is installed. For GPU acceleration, install PyTorch with CUDA support **before** installing speechflow:

```bash
# uv
uv add torch torchvision torchaudio --index https://download.pytorch.org/whl/cu121
uv add "speechflow[kokoro]"

# pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install "speechflow[kokoro]"
```

Replace `cu121` with your CUDA version (e.g., `cu118`, `cu124`).


## Quick Start

### Async (Primary API)

```python
import asyncio
from speechflow import OpenAITTSEngine, AudioPlayer, AudioWriter

async def main():
    engine = OpenAITTSEngine(api_key="your-api-key")
    player = AudioPlayer()
    writer = AudioWriter()

    # Generate audio
    audio = await engine.get("Hello, world!")

    # Play audio
    await player.play(audio)

    # Save to file
    await writer.save(audio, "output.wav")

asyncio.run(main())
```

### Sync Wrappers

```python
from speechflow import OpenAITTSEngine, AudioPlayer, AudioWriter

engine = OpenAITTSEngine(api_key="your-api-key")
player = AudioPlayer()
writer = AudioWriter()

audio = engine.get_sync("Hello, world!")
player.play_sync(audio)
writer.save_sync(audio, "output.wav")
```

### Streaming

```python
import asyncio
from speechflow import OpenAITTSEngine, AudioPlayer

async def main():
    engine = OpenAITTSEngine(api_key="your-api-key")
    player = AudioPlayer()

    # Stream and play (returns combined AudioData)
    combined = await player.play_stream(engine.stream("This is a long text that will be streamed..."))

asyncio.run(main())
```

Streaming notes:
- **OpenAI**: True streaming with multiple chunks.
- **Gemini**: Returns complete audio in a single chunk (API limitation).
- **FishAudio**: True streaming.
- **ElevenLabs**: True streaming.
- **Kokoro / Style-Bert-VITS2 / Qwen3-TTS**: Sentence-by-sentence streaming.
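For the local engines, "sentence-by-sentence" means the input text is segmented before synthesis and each sentence is generated as its own chunk. The sketch below illustrates the idea with a naive regex splitter; it is not speechflow's actual segmentation logic:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive segmentation on ASCII and Japanese sentence terminators."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p for p in parts if p]

# Each piece would be synthesized and yielded as one audio chunk
chunks = split_sentences("Hello world. How are you? こんにちは。")
```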

## Engine-Specific Features

### OpenAI TTS

```python
engine = OpenAITTSEngine(api_key="your-api-key")
audio = await engine.get(
    "Hello",
    voice="alloy",           # ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer
    model="gpt-4o-mini-tts", # tts-1, tts-1-hd
    speed=1.0,
    instructions="Speak in a cheerful tone",
)

# Streaming
async for chunk in engine.stream("Long text..."):
    pass
```

### Google Gemini TTS

```python
engine = GeminiTTSEngine(api_key="your-api-key")
audio = await engine.get(
    "Hello",
    model="gemini-2.5-flash-preview-tts",  # gemini-2.5-pro-preview-tts
    voice="Leda",                           # Puck, Charon, Kore, Fenrir, Aoede, ...
)
```

### FishAudio TTS

```python
engine = FishAudioTTSEngine(api_key="your-api-key")
audio = await engine.get(
    "Hello world",
    model="s1",                  # s1-mini, speech-1.6, speech-1.5, agent-x0
    voice="your-voice-id",
    speed=1.0,                   # Speech speed
    volume=1.0,                  # Volume
)

# Streaming
async for chunk in engine.stream("Streaming text..."):
    pass
```

### ElevenLabs TTS

```python
engine = ElevenLabsTTSEngine(api_key="your-api-key")
audio = await engine.get(
    "Hello",
    voice="21m00Tcm4TlvDq8ikWAM",  # Voice ID from ElevenLabs dashboard
    model="eleven_multilingual_v2", # eleven_turbo_v2_5, eleven_turbo_v2, eleven_monolingual_v1
    output_format="pcm_24000",      # pcm_16000, pcm_22050, pcm_44100
    stability=0.5,
    similarity_boost=0.75,
    speed=1.0,
)

# Streaming
async for chunk in engine.stream("Streaming text..."):
    pass
```
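The `pcm_*` output formats name a raw sample rate. Assuming 16-bit mono samples (two bytes per sample), you can estimate stream size up front:

```python
# Bytes per second for raw PCM at a given sample rate
def pcm_bytes_per_second(sample_rate: int, bytes_per_sample: int = 2, channels: int = 1) -> int:
    return sample_rate * bytes_per_sample * channels

rate_24k = pcm_bytes_per_second(24_000)   # pcm_24000 -> 48,000 bytes/s
ten_seconds = 10 * rate_24k               # rough buffer size for ten seconds of speech
```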

### Qwen3-TTS

```python
# CustomVoice model (default) — choose from built-in speakers
engine = Qwen3TTSEngine()
audio = await engine.get(
    "Hello, world!",
    speaker="Chelsie",  # Ethan, Chelsie, etc.
    language="en",
)

# Base model — voice cloning with reference audio
engine = Qwen3TTSEngine(model_id="Qwen/Qwen3-TTS-0.6B-Base")
engine.set_voice_profile(ref_audio=audio_bytes, ref_text="transcript")
audio = await engine.get("Clone this voice", language="en")

# Sentence-by-sentence streaming
async for chunk in engine.stream("Long text for streaming...", language="ja"):
    pass
```

Supported languages: Chinese (`zh`), English (`en`), Japanese (`ja`), Korean (`ko`), and more.

### Kokoro TTS

```python
# Default: American English
engine = KokoroTTSEngine()
audio = await engine.get(
    "Hello world",
    voice="af_heart",
    speed=1.0,
)

# Japanese (dictionary auto-downloads on first use)
engine = KokoroTTSEngine(lang_code="j")
audio = await engine.get("こんにちは、世界", voice="af_heart")
```

If the Japanese dictionary download fails, run it manually: `python -m unidic download`

Supported languages: American English (`a`), British English (`b`), Spanish (`e`), French (`f`), Hindi (`h`), Italian (`i`), Japanese (`j`), Brazilian Portuguese (`p`), Mandarin Chinese (`z`)
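If your application works with standard locale tags, a small lookup can map them onto the single-letter `lang_code` values listed above (the mapping below is an illustrative assumption; extend it as needed):

```python
# Map common locale tags to the Kokoro lang_code letters above
KOKORO_LANG_CODES = {
    "en-US": "a",  # American English
    "en-GB": "b",  # British English
    "es": "e",     # Spanish
    "fr": "f",     # French
    "hi": "h",     # Hindi
    "it": "i",     # Italian
    "ja": "j",     # Japanese
    "pt-BR": "p",  # Brazilian Portuguese
    "zh": "z",     # Mandarin Chinese
}

def kokoro_lang_code(tag: str, default: str = "a") -> str:
    """Resolve a locale tag to a Kokoro lang_code, defaulting to American English."""
    return KOKORO_LANG_CODES.get(tag, default)

# engine = KokoroTTSEngine(lang_code=kokoro_lang_code("ja"))
```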

### Style-Bert-VITS2

```python
# Pre-trained model (auto-downloads on first use)
engine = StyleBertTTSEngine(model_name="jvnv-F1-jp")
audio = await engine.get(
    "こんにちは、世界",
    style="Happy",       # Neutral, Happy, Sad, Angry, Fear, Surprise, Disgust
    style_weight=5.0,    # Emotion strength (0.0-10.0)
    speed=1.0,
    pitch=0.0,           # Pitch shift in semitones
    speaker_id=0,
)

# Custom model
engine = StyleBertTTSEngine(model_path="/path/to/your/model")

# Sentence-by-sentence streaming
async for chunk in engine.stream("長い文章を文ごとに生成します。"):
    pass
```

Pre-trained models: `jvnv-F1-jp`, `jvnv-F2-jp` (female), `jvnv-M1-jp`, `jvnv-M2-jp` (male)

Optimized for Japanese. GPU recommended for best performance.

## License

MIT
