Metadata-Version: 2.4
Name: code-similarity-engine
Version: 0.1.0
Author: Jonathan Louis
License: GPL-3.0-or-later
Project-URL: Repository, https://github.com/jonathanlouis/code-similarity-engine
Keywords: code,analysis,refactoring,cli
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.0
Requires-Dist: numpy>=1.21
Requires-Dist: scikit-learn>=1.0
Requires-Dist: tree-sitter>=0.21
Requires-Dist: tree-sitter-python>=0.21
Requires-Dist: llama-cpp-python>=0.2.0
Requires-Dist: huggingface-hub>=0.20.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Dynamic: license-file

# Code Similarity Engine
## Code Abstraction Opportunities Emerge Naturally

Code that performs similar functions can be abstracted, but traditional tools are not robust against syntactic differences and therefore miss many abstraction opportunities.
How is our system more robust? We present a simple CLI tool that embeds code semantically, compares the embeddings, and clusters them to find regions of code with high semantic similarity. Because this method matches code by *meaning* rather than syntax, and because reranking and minimum-similarity thresholds filter out weak matches, it can find abstractable regions of code reliably and systematically.
That tells you *where* you can make abstractions, but not how. Using Qwen3 0.6B locally, we go a step further and generate a step-by-step guide to abstracting each similarity cluster, reducing code duplication and giving your project a clearer design throughout!
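The core idea can be sketched in a few lines: embed each chunk, compare embeddings pairwise, and group chunks whose similarity clears the threshold. The sketch below uses toy vectors and a greedy single-link grouping as a pure-Python stand-in for the real embedder and clusterer; the function names are illustrative, not CSE's actual API.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_by_similarity(embeddings, threshold=0.80):
    """Greedy single-link clustering: chunks whose pairwise
    similarity meets the threshold end up in the same cluster."""
    n = len(embeddings)
    parent = list(range(n))

    def find(i):  # union-find root with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i in range(n):
        clusters[find(i)].append(i)
    # Keep only clusters with at least two chunks (cf. --min-cluster)
    return [c for c in clusters.values() if len(c) >= 2]

# Toy embeddings: chunks 0 and 1 are near-duplicates, 2 is unrelated
vecs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 0.1, 1.0]]
print(cluster_by_similarity(vecs, threshold=0.80))  # → [[0, 1]]
```

The real tool adds a reranking pass on top of this, re-scoring each candidate cluster and discarding members below `--rerank-threshold`.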

## ALWAYS Private, NO TELEMETRY!
We explicitly chose offline models for everything: after the one-time download of the LLM, embedder, and reranker models, this program never phones home and never downloads anything else.

## Installation

```bash
pip install code-similarity-engine
```

After installation, download the required models (~1.4 GB):

```bash
cse --download-models
```

Or use the dedicated command:

```bash
cse-download-models
```

## Usage

```bash
# Basic analysis
cse ./src

# With verbose output
cse ./src -v

# Focus on specific files
cse ./src --focus "*.swift" --focus "*.py"

# Higher precision threshold
cse ./src --threshold 0.90

# With LLM analysis (explains similarities)
cse ./src --analyze

# With reranking (improves cluster quality)
cse ./src --rerank

# Output as JSON
cse ./src -o json > report.json

# Output as Markdown
cse ./src -o markdown > report.md

# Check model status
cse --model-status
```
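The JSON output can be post-processed with ordinary tooling, e.g. to keep only the strongest abstraction candidates. The field names below are hypothetical; inspect a real `cse ./src -o json` report for the actual schema before relying on them.

```python
import json

# Hypothetical report shape -- the real schema may differ.
report = json.loads("""
{
  "clusters": [
    {"similarity": 0.93, "chunks": ["a.py:10-42", "b.py:5-37"]},
    {"similarity": 0.82, "chunks": ["c.py:1-20", "d.py:8-29"]}
  ]
}
""")

# Keep only clusters above a stricter similarity cutoff
strong = [c for c in report["clusters"] if c["similarity"] >= 0.90]
for cluster in strong:
    print(f"{cluster['similarity']:.2f}: {', '.join(cluster['chunks'])}")
```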

## Options

```
cse <path> [options]

Core Options:
  -t, --threshold FLOAT      Similarity threshold 0.0-1.0 (default: 0.80)
  -m, --min-cluster INT      Minimum chunks per cluster (default: 2)
  -o, --output FORMAT        text | markdown | json (default: text)
  -v, --verbose              Show progress for all stages

Filtering:
  -f, --focus PATTERN        Only analyze matching paths (repeatable)
  -e, --exclude PATTERN      Glob patterns to exclude (repeatable)
  -l, --lang LANG            Force language detection

LLM Analysis:
  --analyze / --no-analyze   Use LLM to explain clusters
  --llm-model PATH           Path to LLM GGUF model (auto-detected)
  --max-analyze INT          Max clusters to analyze (default: 20)

Reranking:
  --rerank / --no-rerank     Use reranker to improve cluster quality
  --rerank-model PATH        Path to reranker GGUF model (auto-detected)
  --rerank-threshold FLOAT   Minimum score to keep (default: 0.5)

Model Management:
  --download-models          Download all required models and exit
  --model-status             Show status of all models and exit
  --clear-cache              Clear cached embeddings
```

## Language Support

| Language | Parser |
|----------|--------|
| Python | tree-sitter |
| Swift | tree-sitter |
| Rust | tree-sitter |
| JavaScript/TypeScript | tree-sitter |
| Go | tree-sitter |
| Other | sliding window |
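For languages without a tree-sitter grammar, a sliding-window chunker splits files into overlapping line windows before embedding. A minimal sketch of the idea (the window size and stride here are illustrative, not CSE's actual parameters):

```python
def sliding_window_chunks(source: str, window: int = 30, stride: int = 15):
    """Split source into overlapping chunks of `window` lines,
    advancing `stride` lines between chunks."""
    lines = source.splitlines()
    chunks = []
    for start in range(0, max(len(lines) - window + 1, 1), stride):
        chunk = "\n".join(lines[start:start + window])
        chunks.append((start + 1, chunk))  # 1-based start line
    return chunks

text = "\n".join(f"line {i}" for i in range(1, 61))  # a 60-line file
chunks = sliding_window_chunks(text, window=30, stride=15)
print([start for start, _ in chunks])  # → [1, 16, 31]
```

Overlapping windows trade some redundancy for robustness: a duplicated region that straddles one window boundary still lands fully inside a neighboring window.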

## Models

CSE uses three Qwen3 GGUF models (~1.4 GB total):

| Model | Size | Purpose |
|-------|------|---------|
| Qwen3-Embedding-0.6B | 610 MB | Code embeddings |
| Qwen3-0.6B | 378 MB | LLM analysis |
| Qwen3-Reranker-0.6B | 378 MB | Cluster quality |

Models are downloaded to `~/.cache/cse/models/` on first use or via `--download-models`.
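A check like `--model-status` amounts to looking for the expected GGUF files in that cache directory. A sketch, assuming illustrative file names (the exact artifact names CSE downloads may differ):

```python
from pathlib import Path

# Illustrative file names -- CSE's actual artifacts may be named differently.
EXPECTED_MODELS = [
    "Qwen3-Embedding-0.6B.gguf",
    "Qwen3-0.6B.gguf",
    "Qwen3-Reranker-0.6B.gguf",
]

def model_status(cache_dir: Path = Path.home() / ".cache" / "cse" / "models"):
    """Report which expected model files are present in the cache."""
    return {name: (cache_dir / name).is_file() for name in EXPECTED_MODELS}

# Against an empty/nonexistent cache, every model reports missing
status = model_status(Path("/nonexistent"))
print(status)
```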

## License

GPL-3.0-or-later


