Metadata-Version: 2.4
Name: whisper-input
Version: 0.8.0
Summary: Cross-platform voice input tool: hold a hotkey and speak, release to auto-type (local Qwen3-ASR ONNX inference; Chinese, English, Japanese, Korean, Cantonese + technical terms)
Project-URL: Homepage, https://github.com/pkulijing/whisper-input
Project-URL: Repository, https://github.com/pkulijing/whisper-input
Project-URL: Issues, https://github.com/pkulijing/whisper-input/issues
Author-email: pkuyplijing@gmail.com
License: MIT
Keywords: asr,modelscope,qwen3-asr,speech-recognition,stt,voice-input
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Environment :: MacOS X
Classifier: Environment :: X11 Applications
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: Chinese (Simplified)
Classifier: Natural Language :: English
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Requires-Python: <3.13,>=3.12.13
Requires-Dist: evdev>=1.7.0; sys_platform == 'linux'
Requires-Dist: modelscope>=1.35.4
Requires-Dist: numpy>=1.24.0
Requires-Dist: onnxruntime>=1.24.4
Requires-Dist: packaging>=23.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: pygobject>=3.56.2; sys_platform == 'linux'
Requires-Dist: pynput>=1.7.6; sys_platform == 'darwin'
Requires-Dist: pyobjc-framework-cocoa>=10.0; sys_platform == 'darwin'
Requires-Dist: pystray>=0.19.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: sounddevice>=0.4.6
Requires-Dist: soundfile>=0.13.1
Requires-Dist: structlog>=24.0
Requires-Dist: tokenizers>=0.20
Description-Content-Type: text/markdown

**English** | [中文](README.zh-CN.md)

# Whisper Input

[![Build](https://github.com/pkulijing/whisper-input/actions/workflows/build.yml/badge.svg)](https://github.com/pkulijing/whisper-input/actions/workflows/build.yml)
[![codecov](https://codecov.io/gh/pkulijing/whisper-input/branch/master/graph/badge.svg)](https://codecov.io/gh/pkulijing/whisper-input)
[![PyPI](https://img.shields.io/pypi/v/whisper-input.svg)](https://pypi.org/project/whisper-input/)

Cross-platform voice input tool — hold a hotkey, speak, release to have speech transcribed and typed into the focused window.

Uses Alibaba Qwen team's [Qwen3-ASR](https://www.modelscope.cn/models/zengshuishui/Qwen3-ASR-onnx) as the STT engine — an encoder-decoder LLM-style ASR with strong multilingual coverage (Chinese, English, Japanese, Korean, Cantonese, and more), built-in punctuation, inverse text normalization, and casing. Direct inference via Microsoft `onnxruntime`, fully offline after first download. Two variants are available via the settings page: **0.6B** (default, ~990 MB, ~1.5s for a 10s utterance on Apple Silicon) and **1.7B** (~2.4 GB, highest accuracy).

Supports **Linux (X11)** and **macOS**.

## Features

- Local speech recognition, works offline
- Multi-language mixed input (Chinese, English, etc.)
- Configurable hotkey (distinguishes left/right modifier keys)
- Browser-based settings UI + system tray
- Auto-start on login
- Automatic platform detection with matching backend

## System Requirements

### Linux
- **Ubuntu 24.04+ / Debian 13+** (X11 desktop environment)
- Any x86_64 CPU (`onnxruntime` CPU inference, RTF ~ 0.1, latency < 1s for short utterances)

### macOS
- macOS 12+ (Monterey or later)
- Apple Silicon (recommended) or Intel Mac, both use CPU ONNX inference

## Installation

### One-liner (recommended)

On macOS or Linux:

```bash
curl -LsSf https://raw.githubusercontent.com/pkulijing/whisper-input/master/install.sh | sh
```

The script interactively picks a language (中文 / English), then installs `uv`, Python 3.12, required system libraries, and `whisper-input` itself. It runs `whisper-input --init` (pre-downloads the ~990 MB Qwen3-ASR 0.6B ONNX model; on macOS also installs `~/Applications/Whisper Input.app`) and finally asks whether to launch the app immediately. It's safe to re-run — already-installed pieces are skipped, and `uv tool install --upgrade` upgrades `whisper-input` to the latest version.

On Linux the script will offer to add the current user to the `input` group (requires `sudo`; takes effect after a logout/login cycle).

> **Note**: `curl | sh` trusts this repo. If you want to review the script first, download it with `curl -LsSf <URL> -o install.sh` and inspect it before running.

### Manual installation

#### macOS

```bash
# Install system dependency
brew install portaudio

# Install the tool (--compile-bytecode precompiles .pyc at install time, so the first launch is faster)
uv tool install --compile-bytecode whisper-input

# One-time setup: install .app bundle + download STT model (~990 MB for Qwen3-ASR 0.6B)
whisper-input --init

# Run
whisper-input
```

**First-run permissions required in System Settings > Privacy & Security:**

1. **Accessibility** (for global hotkey listening and text input)
2. **Microphone** (for voice recording; the system will prompt on first recording)

> **Note**: On first run (or via `whisper-input --init`), the tool installs a minimal `.app` bundle at `~/Applications/Whisper Input.app`. macOS permission dialogs and System Settings entries will show "Whisper Input" — grant Accessibility to that entry. To fully uninstall, run `whisper-input --uninstall` before `uv tool uninstall whisper-input`.

#### Linux

```bash
# Install system dependencies (see table below for details)
sudo apt install xdotool xclip pulseaudio-utils libportaudio2 \
                 libgirepository-2.0-dev libcairo2-dev gir1.2-gtk-3.0 \
                 gir1.2-ayatanaappindicator3-0.1

# Add yourself to the input group (evdev needs /dev/input/* access).
# newgrp only affects this shell; log out and back in for other sessions.
sudo usermod -aG input $USER && newgrp input

# Install the tool (--compile-bytecode precompiles .pyc at install time, so the first launch is faster)
uv tool install --compile-bytecode whisper-input

# One-time setup: download STT model (~990 MB for Qwen3-ASR 0.6B)
whisper-input --init

# Run
whisper-input
```

**System dependency reference:**

| Package | Purpose | Notes |
|---------|---------|-------|
| `xdotool`, `xclip` | Text input | xclip for X11 clipboard, xdotool to simulate Shift+Insert paste |
| `libportaudio2` | Audio recording | PortAudio library, runtime dependency of Python `sounddevice` |
| `pulseaudio-utils` | Sound notifications | Provides `paplay` for start/stop recording sounds |
| `libgirepository-2.0-dev`, `libcairo2-dev` | Build dependencies | Headers for compiling `pygobject` and `pycairo` C extensions |
| `gir1.2-gtk-3.0` | Recording overlay | GTK 3 typelib for the recording status overlay |
| `gir1.2-ayatanaappindicator3-0.1` | System tray icon | AppIndicator typelib, runtime dependency of `pystray` on Linux |

On first run, `whisper-input` downloads the Qwen3-ASR ONNX model (~990 MB for the 0.6B default) via `modelscope.snapshot_download` to `~/.cache/modelscope/hub/`. After one successful download, the app is fully offline. You can switch to the 1.7B variant later from the in-app settings page (pulls an additional ~2.4 GB).

#### From Source (Contributors)

```bash
git clone https://github.com/pkulijing/whisper-input
cd whisper-input
bash scripts/setup.sh
uv run whisper-input
```

## Usage

```bash
# Specify hotkey
whisper-input -k KEY_FN          # macOS: Fn/Globe key
whisper-input -k KEY_RIGHTALT    # Linux: Right Alt key

# More options
whisper-input --help
```

A browser settings page opens automatically on startup; you can also access it via the system tray icon.

### How to use

1. Start the app, then hold the hotkey to begin recording
   - macOS default: Right Command key
   - Linux default: Right Ctrl key
2. Speak into the microphone
3. Release the hotkey, wait for recognition
4. The recognized text is automatically typed at the cursor position

## Release Flow (Maintainers)

PyPI distribution via GitHub Actions tag trigger + Trusted Publishing (OIDC):

1. Bump `version` in `pyproject.toml`
2. `git commit -am "release: v0.5.1"` and push to master
3. `git tag v0.5.1 && git push --tags`
4. [`.github/workflows/release.yml`](.github/workflows/release.yml) triggers automatically: verify tag matches version -> `uv build` -> publish to PyPI via `pypa/gh-action-pypi-publish` -> create GitHub Release

## Configuration

Settings are stored in `config.yaml` and can also be edited via the browser settings UI:

| Setting | Description | macOS Default | Linux Default |
|---------|-------------|--------------|--------------|
| `hotkey` | Trigger hotkey | `KEY_RIGHTMETA` | `KEY_RIGHTCTRL` |
| `qwen3.variant` | STT model size (`0.6B` / `1.7B`) | `0.6B` | `0.6B` |
| `sound.enabled` | Recording sound notification | `true` | `true` |
| `ui.language` | Interface language (zh/en/fr) | `zh` | `zh` |
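
Putting the defaults above together, a macOS `config.yaml` might look like this (key names are taken from the table; the exact file location and any additional keys depend on the installed version):

```yaml
hotkey: KEY_RIGHTMETA   # macOS default; Linux default is KEY_RIGHTCTRL
qwen3:
  variant: "0.6B"       # or "1.7B" for higher accuracy (~2.4 GB extra download)
sound:
  enabled: true         # play start/stop recording sounds
ui:
  language: zh          # zh / en / fr
```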

## Known Limitations

- Linux supports X11 only; Wayland is not yet supported
- Super/Win key is intercepted by GNOME desktop, not recommended as hotkey
- macOS requires Accessibility permission for global hotkey monitoring
- First run downloads the Qwen3-ASR 0.6B ONNX model (~990 MB from ModelScope); switching to 1.7B later pulls another ~2.4 GB
- Current flow is press-to-talk / release-to-transcribe (batch mode) — real-time streaming is planned for a future release

## Technical Architecture

The project uses a `src/` layout, with all Python code under `src/whisper_input/`, installable as a standard package. The entry point is the `whisper-input` console script (equivalent to `python -m whisper_input`).

```
Hold hotkey -> HotkeyListener (whisper_input.backends) -> AudioRecorder (sounddevice)
Release     -> stt.Qwen3ASRSTT (onnxruntime) -> InputMethod -> Text typed into focused window
```

Platform backends (`whisper_input.backends`) auto-select at runtime via `sys.platform`:
- **Linux**: evdev for keyboard events + xclip/xdotool clipboard paste
- **macOS**: pynput global keyboard listener + pbcopy/pbpaste + Cmd+V paste
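
The runtime auto-selection described above can be sketched roughly as follows (function and backend names here are illustrative, not the project's actual API):

```python
import sys

def select_backend(platform: str = sys.platform) -> str:
    """Pick the hotkey/paste backend for the current platform.

    Illustrative sketch only; the real module layout differs.
    """
    if platform.startswith("linux"):
        return "evdev"    # /dev/input/* keyboard events + xclip/xdotool paste
    if platform == "darwin":
        return "pynput"   # global keyboard listener + pbcopy/pbpaste + Cmd+V
    raise RuntimeError(f"unsupported platform: {platform}")
```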

STT inference (`whisper_input.stt.qwen3`):
- Model: Qwen3-ASR ONNX int8 from `zengshuishui/Qwen3-ASR-onnx` on ModelScope, downloaded via `modelscope.snapshot_download` to `~/.cache/modelscope/hub/`. Two variants side-by-side (0.6B / 1.7B), switchable via the settings page
- Runtime: Microsoft official `onnxruntime`, no torch / transformers dependency
- 3-stage pipeline: `conv_frontend.onnx` → `encoder.int8.onnx` → `decoder.int8.onnx` (28-layer KV-cache autoregressive decoder)
- Log-mel feature extraction: ~100 lines of pure numpy, numerically aligned with Whisper's reference extractor (rtol=1e-4)
- Tokenization: HuggingFace `tokenizers` (Rust byte-level BPE, ~10 MB) loading Qwen3-ASR's `vocab.json` + `merges.txt` directly — no `transformers` dependency
- Dependency tree: `onnxruntime + tokenizers + modelscope + numpy` (modelscope base is only 36 MB, no torch/transformers)
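
The pure-numpy log-mel extraction mentioned above follows the usual Whisper-style recipe. A rough sketch of the idea (parameters such as `n_mels=128` and the HTK mel scale are assumptions here; the project's actual extractor handles padding and exact constants differently):

```python
import numpy as np

def log_mel(audio: np.ndarray, sr: int = 16000, n_fft: int = 400,
            hop: int = 160, n_mels: int = 128) -> np.ndarray:
    """Whisper-style log-mel spectrogram in plain numpy (illustrative)."""
    window = np.hanning(n_fft + 1)[:-1]                        # periodic Hann
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * window, axis=-1)) ** 2  # power spectrum

    # Triangular mel filterbank (HTK mel scale)
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        if c > l: fb[m - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c: fb[m - 1, c:r] = (r - np.arange(c, r)) / (r - c)

    mel = spec @ fb.T
    log_spec = np.log10(np.maximum(mel, 1e-10))
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)      # dynamic-range clamp
    return ((log_spec + 4.0) / 4.0).T                          # (n_mels, n_frames)
```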

Common features:
- 300ms delay on modifier key press to distinguish combos (e.g., Ctrl+C) from single triggers
- Clipboard paste instead of key simulation, avoiding CJK encoding issues
- Unified CPU inference path, zero code difference between macOS/Linux
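
The 300ms modifier-key delay can be sketched with a timer that fires only if the hotkey is still held alone when it expires (class and callback names are hypothetical, not the project's API):

```python
import threading
import time

HOLD_DELAY = 0.3  # seconds, matching the 300ms delay described above

class HoldDetector:
    """Start recording only if the hotkey is held alone past HOLD_DELAY.

    Pressing any other key within the window marks the press as a combo
    (e.g. Ctrl+C) and cancels the trigger. Illustrative sketch only.
    """
    def __init__(self, on_start):
        self.on_start = on_start
        self._timer = None
        self._combo = False

    def key_down(self, is_hotkey: bool):
        if is_hotkey:
            self._combo = False
            self._timer = threading.Timer(HOLD_DELAY, self._fire)
            self._timer.start()
        else:
            self._combo = True          # another key pressed: it's a combo
            if self._timer:
                self._timer.cancel()

    def key_up(self, is_hotkey: bool):
        if is_hotkey and self._timer:   # released before the delay elapsed
            self._timer.cancel()

    def _fire(self):
        if not self._combo:
            self.on_start()
```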

## License

MIT
