Metadata-Version: 2.4
Name: justembed
Version: 0.1.1a3
Summary: LENS - Language Embedder With No Synthesizer. Offline-first semantic search for everyday laptops.
Author-email: Krishnamoorthy Sankaran <krishnamoorthy.sankaran@sekrad.org>
License: MIT
Project-URL: Homepage, https://github.com/sekarkrishna/justembed
Project-URL: Repository, https://github.com/sekarkrishna/justembed
Project-URL: Issues, https://github.com/sekarkrishna/justembed/issues
Keywords: semantic-search,embeddings,offline,onnx,nlp,justembed,lens
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastapi>=0.104.0
Requires-Dist: uvicorn[standard]>=0.24.0
Requires-Dist: jinja2>=3.1.0
Requires-Dist: python-multipart>=0.0.6
Requires-Dist: onnxruntime>=1.15.0
Requires-Dist: tokenizers>=0.13.0
Requires-Dist: numpy<2.0.0,>=1.20.0
Requires-Dist: duckdb>=0.9.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: skl2onnx>=1.15.0
Requires-Dist: onnx<1.19.0,>=1.14.0
Requires-Dist: ml-dtypes<0.5.0,>=0.4.0
Requires-Dist: tqdm>=4.65.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: httpx>=0.25.0; extra == "dev"
Dynamic: license-file

# JustEmbed - LENS

**Language Embedder with No Synthesizer**

Offline-first semantic search for everyday laptops. Train custom domain-specific models in seconds, no GPU required.

[![PyPI version](https://badge.fury.io/py/justembed.svg)](https://badge.fury.io/py/justembed)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Philosophy

JustEmbed is built on three core principles:

1. **Offline-First**: Everything runs locally. No API keys, no cloud dependencies, no internet required.
2. **Laptop-Friendly**: Designed for everyday hardware. CPU-only, fast training (<5 seconds), small models (~8 MB).
3. **Domain-Specific**: Train custom models on your text to learn domain-specific synonyms (pyrexia ↔ fever, renal ↔ kidney).

### Why JustEmbed?

Most embedding solutions require:
- GPU hardware
- Cloud API keys and costs
- Hours of training time
- Large model files (GB)
- Internet connectivity

JustEmbed requires:
- ✅ Any laptop with Python
- ✅ No API keys or costs
- ✅ Seconds of training time
- ✅ Small models (8 MB)
- ✅ Works completely offline

## What's Working

### ✅ Core Features (v0.1.1a3)

- **E5-Small Embeddings**: General-purpose 384-dim embeddings via ONNX
- **Custom Model Training**: Train domain-specific models from your text
- **Knowledge Bases**: Create multiple KBs with different models
- **Semantic Search**: Query with natural language, get relevant results
- **Web UI**: Browser-based interface for all operations
- **CLI**: Command-line interface for automation
- **Offline Operation**: No internet required after installation

### ✅ Custom Model Training

Train models that learn your domain's vocabulary:

```python
# Medical domain example: the training text contains both "pyrexia"
# and "fever"; after training, the model learns pyrexia ↔ fever
# (similarity: 0.83). A legal corpus containing "plaintiff" and
# "claimant" similarly yields plaintiff ↔ claimant (similarity: 0.85).

from justembed.embedder import CustomEmbedder

embedder = CustomEmbedder("medical_v1")  # a previously trained model
vecs = embedder.embed(["pyrexia", "fever"])
similarity = float(vecs[0] @ vecs[1])  # embeddings are L2-normalized
```

**Training Performance**:
- Time: <5 seconds for 1000-word corpus
- Hardware: CPU-only (no GPU needed)
- Model size: ~8 MB
- Embedding dim: 64-256 (configurable)

### ✅ Search Quality

- **Precision**: High-quality results with scores 0.6-0.9
- **Recall**: Finds synonyms and related concepts
- **Speed**: <100ms query latency

Example query results:
```
Query: "fever"
Results:
  1. Score: 0.862 - "...fever in the context of infection..."
  2. Score: 0.862 - "...pyrexia, commonly referred to as fever..."
  3. Score: 0.836 - "Body temperature regulation..."
```
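Relevance scores like the ones above are cosine similarities between the query embedding and each chunk embedding. A minimal numpy sketch of the ranking step (function name and toy data are hypothetical, not JustEmbed's internals):

```python
import numpy as np

def cosine_rank(query_emb, chunk_embs, top_k=3):
    """Rank chunks by cosine similarity to the query.

    Assumes all vectors are already L2-normalized, so a dot
    product equals cosine similarity.
    """
    scores = chunk_embs @ query_emb
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy example with 3-dim "embeddings" (all unit-length)
chunks = np.array([[1.0, 0.0, 0.0],
                   [0.8, 0.6, 0.0],
                   [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.0, 0.0])
ranked = cosine_rank(query, chunks)  # chunk 0 first, then chunk 1
```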

## Quick Start

### Installation

```bash
pip install justembed
```

### Start the Server

```bash
justembed begin --workspace ~/my_docs --port 5424
```

Then open http://localhost:5424 in your browser.

### Train a Custom Model

1. Click "🚀 Train Custom Model"
2. Upload your domain-specific text file (.txt or .md)
3. Enter model name (e.g., "medical_v1")
4. Click "Train Model" (takes ~5 seconds)

### Create a Knowledge Base

1. Enter KB name (e.g., "medical_kb")
2. Select model type: "Custom Model"
3. Select your trained model
4. Click "Create KB"

### Upload Documents

1. Choose your document file
2. Select the KB
3. Click "Upload & Preview Chunks"
4. Review chunks and click "Apply Chunking"
5. Wait for embedding to complete
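The chunking step splits each document into sentence-sized pieces before embedding. A hypothetical sketch of a sentence-based chunker (JustEmbed's actual chunker may differ; the function name is illustrative):

```python
import re

def sentence_chunks(text, max_chars=200):
    """Greedily pack sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```

With a small `max_chars`, each sentence lands in its own chunk; with a large one, sentences are merged until the budget is hit.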

### Query

1. Enter search query (e.g., "fever", "pyrexia")
2. Select KB or "All KBs"
3. Click "Search"
4. View results with relevance scores

## Use Cases

### Medical Documentation
Train on medical texts to learn:
- pyrexia ↔ fever
- renal ↔ kidney
- UTI ↔ urinary tract infection
- hypertension ↔ high blood pressure

### Legal Documents
Train on legal texts to learn:
- plaintiff ↔ claimant
- defendant ↔ respondent
- tort ↔ civil wrong
- litigation ↔ lawsuit

### Technical Documentation
Train on technical texts to learn:
- API ↔ application programming interface
- REST ↔ representational state transfer
- CRUD ↔ create read update delete
- microservices ↔ service-oriented architecture

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        Web UI / CLI                          │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                      FastAPI Server                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Training   │  │   Embedding  │  │    Query     │      │
│  │   Pipeline   │  │   Pipeline   │  │   Pipeline   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                    Embedder Layer                            │
│  ┌──────────────┐              ┌──────────────┐            │
│  │  E5-Small    │              │   Custom     │            │
│  │  (ONNX)      │              │   Models     │            │
│  │  384-dim     │              │   (ONNX)     │            │
│  └──────────────┘              └──────────────┘            │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                    Storage Layer                             │
│  ┌──────────────┐              ┌──────────────┐            │
│  │   DuckDB     │              │   File       │            │
│  │   (KBs)      │              │   System     │            │
│  └──────────────┘              └──────────────┘            │
└─────────────────────────────────────────────────────────────┘
```

## Custom Model Training

### How It Works

1. **TF-IDF Vectorization**: Extract features from your text
2. **MLP Training**: Neural network learns to compress features
3. **ONNX Export**: Portable model format for fast inference
4. **L2 Normalization**: Consistent similarity scores
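The four steps above can be sketched with scikit-learn directly. This is a hedged illustration, not JustEmbed's trainer: dimensions are shrunk for speed, and SVD of the TF-IDF matrix is one plausible choice of compression target for the MLP.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neural_network import MLPRegressor

# Toy corpus; real training uses your domain text split into chunks.
corpus = [
    "pyrexia is the medical term for fever",
    "fever often accompanies infection",
    "renal disease affects the kidney",
    "the kidney filters blood",
    "hypertension means high blood pressure",
    "blood pressure should be monitored",
]

# 1. TF-IDF vectorization (the trainer defaults to 5000 features)
vectorizer = TfidfVectorizer(max_features=100)
X = vectorizer.fit_transform(corpus).toarray()

# 2. MLP learns to compress TF-IDF into a dense embedding;
#    here the targets come from an SVD of the same matrix.
svd = TruncatedSVD(n_components=4, random_state=0)
targets = svd.fit_transform(X)
mlp = MLPRegressor(hidden_layer_sizes=(32, 16), activation="relu",
                   solver="adam", max_iter=3000, random_state=0)
mlp.fit(X, targets)

# 3. (ONNX export via skl2onnx happens here in the real pipeline.)

# 4. L2-normalize so dot products are cosine similarities.
emb = mlp.predict(X)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
```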

### Training Pipeline

```
Text Corpus → Chunking → TF-IDF (5000 features)
                              ↓
                    MLP (512 → 256 → 128)
                              ↓
                    ONNX Export (~8 MB)
                              ↓
                    Custom Embedder
```

### Model Configuration

- **Embedding Dimension**: 64-256 (default: 128)
- **Max Features**: 1000-10000 (default: 5000)
- **Hidden Layers**: 512 → 256
- **Activation**: ReLU
- **Optimizer**: Adam
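In scikit-learn terms, that configuration corresponds to something like the following (a sketch only; the output size, i.e. the embedding dimension, is determined by the training targets rather than a constructor argument):

```python
from sklearn.neural_network import MLPRegressor

# Mirrors the documented defaults: hidden layers 512 -> 256,
# ReLU activation, Adam optimizer.
mlp = MLPRegressor(hidden_layer_sizes=(512, 256),
                   activation="relu",
                   solver="adam")
```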

## Performance

### Training
- **Time**: <5 seconds (1000-word corpus)
- **Hardware**: CPU-only
- **Memory**: <500 MB
- **Model Size**: ~8 MB

### Inference
- **Query Latency**: <100ms
- **Embedding Speed**: ~1000 docs/second
- **Memory**: <200 MB per model

### Quality
- **Precision**: 0.6-0.9 similarity scores
- **Synonym Learning**: 0.8+ for domain terms
- **Semantic Understanding**: Related concepts found

## Requirements

- Python 3.8+
- 500 MB disk space
- 1 GB RAM
- CPU (no GPU required)

## Dependencies

Core:
- FastAPI (web server)
- ONNX Runtime (model inference)
- DuckDB (storage)
- scikit-learn (training)

The full list is in `pyproject.toml`.

## CLI Commands

```bash
# Start server
justembed begin --workspace ~/docs --port 5424

# Start with custom host
justembed begin --workspace ~/docs --host 0.0.0.0 --port 8000

# Show version
justembed --version

# Show help
justembed --help
```

## Python API

```python
from justembed.embedder import E5Embedder, CustomEmbedder
from justembed.training.trainer import CustomModelTrainer

# Train custom model
trainer = CustomModelTrainer()
model_dir = trainer.train(
    corpus=["text1", "text2", "text3"],
    model_name="my_model",
    embedding_dim=128,
    max_features=5000,
)

# Use custom embedder
embedder = CustomEmbedder("my_model")
embeddings = embedder.embed(["query text"])
query_emb = embedder.embed_query("search query")

# Use E5 embedder
e5 = E5Embedder()
embeddings = e5.embed(["text1", "text2"])
```

## Configuration

Models stored in: `~/.cache/justembed/`
- `custom_models/` - Custom trained models
- `tokenizer.json` - E5 tokenizer

Workspace structure:
```
workspace/
├── kb/
│   ├── kb1.duckdb
│   ├── kb2.duckdb
│   └── _history.duckdb
```
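Given that layout, enumerating the knowledge bases in a workspace is just a glob over the `kb/` directory. A hypothetical helper (`_history.duckdb` is internal and skipped):

```python
from pathlib import Path

def list_kbs(workspace):
    """Return KB names: one per *.duckdb file, minus internal files."""
    kb_dir = Path(workspace) / "kb"
    return sorted(p.stem for p in kb_dir.glob("*.duckdb")
                  if not p.name.startswith("_"))
```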

## Roadmap

### v0.1.x (Current)
- ✅ E5-Small embeddings
- ✅ Custom model training
- ✅ Web UI
- ✅ CLI
- ✅ Knowledge bases
- ✅ Semantic search


## License

MIT License - see LICENSE file for details

## Author

**Krishnamoorthy Sankaran**
- Email: krishnamoorthy.sankaran@sekrad.org
- GitHub: https://github.com/sekarkrishna/justembed

## Citation

If you use JustEmbed in your research, please cite:

```bibtex
@software{justembed2024,
  title = {JustEmbed: Offline-First Semantic Search for Everyday Laptops},
  author = {Sankaran, Krishnamoorthy},
  year = {2024},
  url = {https://github.com/sekarkrishna/justembed}
}
```

## Acknowledgments

- E5-Small model by Microsoft
- ONNX Runtime by Microsoft
- FastAPI by Sebastián Ramírez
- DuckDB by DuckDB Labs

## Support

- Issues: https://github.com/sekarkrishna/justembed/issues
- Discussions: https://github.com/sekarkrishna/justembed/discussions
- Email: krishnamoorthy.sankaran@sekrad.org

## Changelog

### v0.1.1a1 (2026-02-14)

**New Features**:
- Custom model training from text files
- Domain-specific synonym learning
- Model selection in KB creation
- Improved text chunking (sentence-based fallback)
- Web UI for model training

**Improvements**:
- Reduced minimum training corpus to 500 words
- Better error messages
- Model metadata display in UI
- Query results show model used

**Bug Fixes**:
- Fixed text chunking for continuous text
- Fixed ONNX shape handling for custom models
- Fixed model caching

### v0.1.0 (2026-01-15)

- Initial release
- E5-Small embeddings
- Basic web UI
- Knowledge base management
- Semantic search

---

**JustEmbed** - Semantic search that just works. Offline. On your laptop. In seconds.
