Metadata-Version: 2.4
Name: cortex-llm
Version: 1.0.4
Summary: GPU-Accelerated LLM Terminal for Apple Silicon
Home-page: https://github.com/faisalmumtaz/Cortex
Author: Cortex Development Team
License: MIT
Project-URL: Homepage, https://github.com/faisalmumtaz/Cortex
Project-URL: Bug Tracker, https://github.com/faisalmumtaz/Cortex/issues
Project-URL: Documentation, https://github.com/faisalmumtaz/Cortex/wiki
Keywords: llm,gpu,metal,mps,apple-silicon,ai,machine-learning,terminal,mlx,pytorch
Platform: darwin
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: MacOS
Classifier: Environment :: Console
Classifier: Environment :: GPU
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.1.0
Requires-Dist: mlx>=0.10.0
Requires-Dist: mlx-lm>=0.10.0
Requires-Dist: transformers>=4.36.0
Requires-Dist: safetensors>=0.4.0
Requires-Dist: huggingface-hub>=0.19.0
Requires-Dist: accelerate>=0.25.0
Requires-Dist: llama-cpp-python>=0.2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: rich>=13.0.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: packaging>=23.0
Requires-Dist: requests>=2.31.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.8.0; extra == "dev"
Provides-Extra: optional
Requires-Dist: sentencepiece>=0.1.99; extra == "optional"
Requires-Dist: auto-gptq>=0.7.0; extra == "optional"
Requires-Dist: autoawq>=0.2.0; extra == "optional"
Requires-Dist: bitsandbytes>=0.41.0; extra == "optional"
Requires-Dist: optimum>=1.16.0; extra == "optional"
Requires-Dist: torchvision>=0.16.0; extra == "optional"
Requires-Dist: torchaudio>=2.1.0; extra == "optional"
Dynamic: home-page
Dynamic: license-file
Dynamic: platform
Dynamic: requires-python

# Cortex - LLM Terminal Client for Apple Silicon

Cortex is an LLM terminal interface designed for Apple Silicon, using the MLX and PyTorch MPS frameworks for GPU-accelerated inference.

## What It Does

- **GPU-accelerated inference** via MLX (primary) and PyTorch MPS backends
- **Apple Silicon required** - leverages unified memory architecture
- **Multiple model formats** - MLX, GGUF, SafeTensors, PyTorch, GPTQ, AWQ
- **Built-in fine-tuning** - LoRA-based model customization via interactive wizard
- **Chat template auto-detection** - automatic format detection with confidence scoring
- **Conversation persistence** - SQLite-backed chat history with branching

## Features

- **GPU-Accelerated Inference** - Delegates to MLX and PyTorch MPS for Metal-based execution
- **Apple Silicon Only** - Requires Metal GPU; exits if GPU acceleration is unavailable
- **Model Format Support**:
  - MLX (Apple's format, loaded via `mlx_lm`)
  - GGUF (via `llama-cpp-python` with Metal backend)
  - SafeTensors (via HuggingFace `transformers`)
  - PyTorch models (via HuggingFace `transformers` with MPS device)
  - GPTQ quantized (via `auto-gptq`)
  - AWQ quantized (via `autoawq`)
- **Quantization** - 4-bit, 5-bit, 8-bit, and mixed-precision quantization via MLX conversion pipeline
- **Model Conversion** - Convert HuggingFace models to MLX format with configurable quantization recipes
- **Template Registry** - Automatic detection of chat templates (ChatML, Llama, Alpaca, Gemma, Reasoning) with confidence scoring and real-time token filtering for reasoning models
- **Rotating KV Cache** - MLX-based KV cache for long context handling (default 4096 tokens)
- **Fine-Tuning** - LoRA-based model customization with interactive 6-step wizard
- **Terminal UI** - ANSI terminal interface with streaming output

## Installation

```bash
# Clone and install
git clone https://github.com/faisalmumtaz/Cortex.git
cd Cortex
./install.sh
```

The installer:
- Checks for Apple Silicon (arm64) compatibility
- Creates a Python virtual environment
- Installs dependencies via `pip install -e .` (from `pyproject.toml`)
- Sets up the `cortex` command in your PATH

### Quick Install (pipx)

If you just want the CLI without cloning the repo, use pipx:

```bash
pipx install cortex-llm
```

## Quick Start

```bash
# After installation, just run:
cortex
```

### Downloading Models

```bash
# Inside Cortex, use the download command:
cortex
# Then type: /download
```

The download feature:
- **HuggingFace integration** - download any model by repository ID
- **Automatic loading** - option to load model immediately after download

## Documentation

### User Documentation
- **[Installation Guide](docs/installation.md)** - Complete setup instructions
- **[CLI Reference](docs/cli.md)** - Commands and user interface
- **[Configuration](docs/configuration.md)** - System settings and optimization
- **[Model Management](docs/model-management.md)** - Loading and managing models
- **[Template Registry](docs/template-registry.md)** - Automatic chat template detection and management
- **[Fine-Tuning Guide](docs/fine-tuning.md)** - Customize models with LoRA
- **[Troubleshooting](docs/troubleshooting.md)** - Common issues and solutions

### Technical Documentation
- **[MLX Acceleration](docs/mlx-acceleration.md)** - MLX framework integration and optimization
- **[GPU Validation](docs/gpu-validation.md)** - Hardware requirements and detection
- **[Inference Engine](docs/inference-engine.md)** - Text generation architecture
- **[Conversation Management](docs/conversation-management.md)** - Chat history and persistence
- **[Development Guide](docs/development.md)** - Contributing and architecture

## System Requirements

- Apple Silicon Mac (M1/M2/M3/M4 - all variants supported)
- macOS 13.3+ (required by MLX framework)
- Python 3.11+
- 16GB+ unified memory (24GB+ recommended for larger models)
- Xcode Command Line Tools

## Performance

Performance depends on your Apple Silicon chip, model size, and quantization level. The inference engine measures tokens/second, first-token latency, and memory usage at runtime.

To check that GPU acceleration is working:

```bash
source venv/bin/activate
python tests/test_apple_silicon.py
```

You should see:
- All validation checks passing
- Measured GFLOPS from matrix operations
- Confirmation of Metal and MLX availability
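If you only need a quick sanity check rather than the full test script, the two backends can be probed from any Python shell. This is a minimal sketch using standard PyTorch and stdlib APIs; it is not part of Cortex itself.

```python
import importlib.util

def gpu_backends_available() -> dict[str, bool]:
    """Best-effort check of the two GPU backends Cortex relies on."""
    status = {}
    try:
        import torch
        # True only on Apple Silicon with a working Metal device
        status["mps"] = bool(torch.backends.mps.is_available())
    except ImportError:
        status["mps"] = False
    # MLX only installs on Apple Silicon macOS, so importability is a signal
    status["mlx"] = importlib.util.find_spec("mlx") is not None
    return status

print(gpu_backends_available())
```

Both values should be `True` on a correctly set-up Apple Silicon machine.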

## GPU Acceleration Architecture

Cortex uses a multi-layer approach, delegating all GPU computation to established frameworks:

1. **MLX Framework (Primary Backend)**
   - Apple's ML framework with native Metal support
   - Quantization support (4-bit, 5-bit, 8-bit, mixed-precision)
   - Rotating KV cache for long contexts
   - JIT compilation via `mx.compile`
   - Operation fusion for reduced kernel launches

2. **PyTorch MPS Backend**
   - Metal Performance Shaders for PyTorch models
   - FP16 optimization and channels-last tensor format

3. **llama.cpp (GGUF Backend)**
   - Metal-accelerated inference for GGUF models

4. **Memory Management**
   - Pre-allocated memory pools with best-fit/first-fit allocation strategies
   - Automatic pool sizing (60% of available memory, capped at 75% of total)
   - Defragmentation support
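The automatic pool-sizing policy above reduces to a one-line rule. The helper below is hypothetical (not Cortex's API); it only mirrors the stated numbers, using integer arithmetic to avoid float rounding.

```python
def pool_size_bytes(available_bytes: int, total_bytes: int) -> int:
    # 60% of currently available memory, capped at 75% of total memory
    return min(available_bytes * 60 // 100, total_bytes * 75 // 100)

GIB = 1024**3
# 16 GiB machine with 10 GiB currently available -> 6 GiB pool
print(pool_size_bytes(10 * GIB, 16 * GIB) // GIB)  # 6
```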

### Understanding "Skipping Kernel" Messages

When loading GGUF models, you may see messages like:
```
ggml_metal_init: skipping kernel_xxx_bf16 (not supported)
```

**These messages are normal.** They mean:
- BF16 kernels are skipped because the GPU uses FP16 equivalents
- GPU acceleration remains fully active
- llama.cpp automatically falls back to supported kernel variants

## Troubleshooting

If you suspect GPU isn't being used:

1. **Run validation**: `python tests/test_apple_silicon.py`
2. **Check output**: Should see passing checks and measured GFLOPS
3. **Monitor tokens/sec**: Displayed during inference
4. **Verify Metal**: Ensure Xcode Command Line Tools installed

Common issues:
- **Low performance**: Run `python tests/test_apple_silicon.py` to diagnose
- **Memory errors**: Reduce `gpu_memory_fraction` in config.yaml
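For reference, an illustrative `config.yaml` fragment (the key name comes from the note above; the value shown is an example, not a documented default):

```yaml
# Lower this fraction if you hit memory errors
gpu_memory_fraction: 0.6
```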

## MLX Model Conversion

Cortex includes an MLX model converter:

```python
from cortex.metal.mlx_converter import MLXConverter, ConversionConfig, QuantizationRecipe

converter = MLXConverter()
config = ConversionConfig(
    quantization=QuantizationRecipe.SPEED_4BIT,  # 4-bit quantization
    compile_model=True  # JIT compilation
)

success, message, output_path = converter.convert_model(
    "microsoft/DialoGPT-medium",
    config=config
)
```

### Quantization Options

- **4-bit**: Maximum speed, ~75% size reduction vs. FP16 weights
- **5-bit**: Balanced speed and quality
- **8-bit**: Higher quality, ~50% size reduction vs. FP16 weights
- **Mixed Precision**: Custom per-layer quantization
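The size reductions above follow directly from bits per parameter. A back-of-envelope estimate (ignoring embeddings, activations, and per-group quantization metadata; not a Cortex API):

```python
def approx_weight_size_gb(params_billions: float, bits: int) -> float:
    """Rough model weight size in GB: parameters x bits per parameter / 8."""
    return params_billions * bits / 8

fp16 = approx_weight_size_gb(7, 16)  # ~14 GB baseline for a 7B model
q4 = approx_weight_size_gb(7, 4)     # ~3.5 GB at 4-bit
print(fp16, q4, 1 - q4 / fp16)       # 14.0 3.5 0.75 -> 75% reduction
```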

## MLX as Primary Backend

Cortex uses MLX (Apple's machine learning framework) as the primary acceleration backend:
- **Metal Support**: GPU execution via MLX's built-in Metal operations
- **Quantization**: Support for 4-bit, 5-bit, 8-bit, and mixed-precision quantization
- **Model Conversion**: Convert HuggingFace models to MLX format

## Built With

- [MLX](https://github.com/ml-explore/mlx) - Apple's machine learning framework
- [mlx-lm](https://github.com/ml-explore/mlx-examples) - LLM utilities and LoRA fine-tuning for MLX
- [PyTorch](https://pytorch.org/) - With Metal Performance Shaders backend
- [llama.cpp](https://github.com/ggerganov/llama.cpp) - Metal-accelerated GGUF support
- [Rich](https://github.com/Textualize/rich) - Terminal formatting
- [HuggingFace](https://huggingface.co/) - Model hub and transformers

## Contributing

We welcome contributions! Please see the [Development Guide](docs/development.md) for contributing guidelines and setup instructions.

## License

MIT License - See [LICENSE](LICENSE) for details.

---

**Note**: Cortex requires Apple Silicon. Intel Macs are not supported.
