Metadata-Version: 2.4
Name: code-eval
Version: 0.1.1
Summary: Automated evaluation pipeline for AI-generated code
Author: Patrick Lee
License-Expression: MIT
License-File: LICENSE
Requires-Python: >=3.12
Requires-Dist: click<9.0,>=8.0
Requires-Dist: docker<8.0,>=7.0
Requires-Dist: pydantic<3.0,>=2.0
Requires-Dist: python-dotenv<2.0,>=1.0
Description-Content-Type: text/markdown

# Code Eval

Automated evaluation pipeline for AI-generated code. Supports **two evaluation modes** — full-project `eval` and lightweight `snippet` — covering Python and Java (Maven).

## Two Modes

| | `code-eval eval` | `code-eval snippet` |
|---|---|---|
| **Purpose** | Full-project evaluation with tests, lint, security, and complexity | Quick static analysis of a single code snippet |
| **Input** | Directory / file paths / git diff | Inline code (`-c`) or single file (`--file`) |
| **Scanners** | All 9 scanners (incl. test runners & dependency auditors) | Static-analysis only (no pytest / maven-test / pip-audit) |
| **Scoring** | 4 dimensions: correctness, quality, security, maintainability | 3 dimensions: quality, security, maintainability (no correctness) |
| **Output** | `evaluation.json` — full report with metrics, issues, scores | Compact `SnippetResult` JSON with score (0-100) and issues |
| **Use Case** | CI/CD pipelines, batch project evaluation | Code review, quick checks, editor integration |

## Features

- **Two evaluation modes**: `eval` (project) and `snippet` (single file / inline code)
- **Three input modes** (eval): directory, file path, git-diff
- **Two language adapters**: Python + Java (Maven)
- **Nine scanners**:
  - Python: pytest, ruff, bandit, radon, pip-audit
  - Java: maven-test, java-lint, java-security, java-complexity
- **Multi-dimensional scoring**: correctness (0.40), quality (0.25), security (0.20), maintainability (0.15)
- **Two-layer diff awareness**: file-level + line-level tracking (`in_diff` tagging; see the sketch after this list)
- **Configurable Docker sandbox**: optional container isolation with resource limits
- **Batch evaluation**: concurrent target processing with progress reporting
- **Structured output**: `evaluation.json` with metrics, issues, scores, and summary
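
The diff awareness above boils down to tagging each reported issue with whether its file and line fall inside the evaluated diff. A simplified sketch of that idea (the function and field names are illustrative, not the exact internal API):

```python
# Illustrative two-layer diff tagging: file-level membership first, then line-level.
def tag_in_diff(issue_file: str, issue_line: int,
                changed_lines: dict[str, set[int]]) -> bool:
    """Return True when the issue's file is in the diff and its line was changed."""
    lines = changed_lines.get(issue_file)
    return lines is not None and issue_line in lines
```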

## Installation

```bash
pip install code-eval
```

Or install from source:

```bash
pip install -e .
```

---

## Mode 1: `code-eval eval`

Full-project evaluation — runs all scanners (tests, lint, security, complexity) and produces a comprehensive structured report.

### Directory mode

Evaluate a project directory (language auto-detected by markers such as `pyproject.toml` or `pom.xml`):

```bash
code-eval eval --targets ./my_project
```
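
A minimal sketch of the marker-based detection idea (illustrative only; the actual resolver in `code_eval/resolvers/` may check additional markers):

```python
from pathlib import Path

# Illustrative marker-based language detection, not the exact resolver code.
MARKERS = {
    "python": ("pyproject.toml",),
    "java": ("pom.xml",),
}

def detect_language(project_dir: str) -> str | None:
    root = Path(project_dir)
    for language, markers in MARKERS.items():
        if any((root / marker).exists() for marker in markers):
            return language
    return None  # unknown language: caller decides how to handle it
```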

### File mode

Evaluate specific files:

```bash
code-eval eval --targets ./src/auth.py ./src/api.py
```

For Java, file mode also works (project root resolved via `pom.xml`):

```bash
code-eval eval --targets ./my-java-project/src/main/java/com/example/App.java
```

### Git diff mode

Evaluate only files changed since `main`:

```bash
code-eval eval --git-diff --base main
```
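
Under the hood, diff mode only needs the set of files changed relative to the base branch. A rough sketch of how that set can be collected (an assumed approach, not necessarily the exact implementation):

```python
import subprocess

def changed_files(base: str = "main") -> list[str]:
    # List files changed between the merge base with `base` and HEAD.
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]
```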

### Multiple targets

```bash
code-eval eval --targets ./project_a ./project_b
```

### Save output to file

```bash
code-eval eval --targets ./my_project --output evaluation.json
```

### Generate markdown summary

```bash
code-eval eval --targets ./my_project --output evaluation.json --summary summary.md
```

### Custom configuration

```bash
code-eval eval --targets ./my_project --config .env.production
```

### Eval Output Format

The `evaluation.json` output contains:

```json
{
  "meta": {
    "timestamp": "2025-01-01T00:00:00Z",
    "pipeline_version": "0.1.0",
    "total_targets": 1,
    "total_duration_seconds": 5.2
  },
  "results": [
    {
      "target": "/path/to/project",
      "language": "python",
      "duration_seconds": 5.2,
      "scores": {
        "correctness": { "value": 0.85, "weight": 0.40, "detail": "17/20 tests passed" },
        "quality": { "value": 0.96, "weight": 0.25, "detail": "2 lint issues in diff" },
        "security": { "value": 1.0, "weight": 0.20, "detail": "No security issues" },
        "maintainability": { "value": 0.9, "weight": 0.15, "detail": "Average complexity: 6.2" },
        "overall": 0.91
      },
      "metrics": {
        "tests_total": 20,
        "tests_passed": 17,
        "tests_failed": 3,
        "lint_issues": 2,
        "security_issues": 0,
        "avg_complexity": 6.2,
        "files_evaluated": 8
      },
      "issues": [ "..." ]
    }
  ],
  "summary": {
    "avg_overall_score": 0.91,
    "total_issues": 5,
    "critical_issues": 0,
    "targets_passed": 1,
    "targets_failed": 0
  }
}
```
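
The report is plain JSON, so it is easy to post-process. For example, a small gate script (field names taken from the example above; the threshold is arbitrary) that fails a CI job when any target scores below 0.8:

```python
import json
import sys

THRESHOLD = 0.8  # illustrative gate; pick whatever fits your pipeline

with open("evaluation.json") as f:
    report = json.load(f)

failed = [
    result["target"]
    for result in report["results"]
    if result["scores"]["overall"] < THRESHOLD
]

print(f"Average overall score: {report['summary']['avg_overall_score']}")
if failed:
    print("Targets below threshold:", ", ".join(failed))
    sys.exit(1)
```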

### Eval Scoring Dimensions

| Dimension | Weight | Source | Scoring Logic |
|-----------|--------|--------|---------------|
| Correctness | 0.40 | pytest / maven-test | `tests_passed / tests_total`; no tests → 0.5; compilation failed → 0.0 |
| Quality | 0.25 | ruff / java-lint | `-0.02` per in-diff lint issue; `-0.002` per out-of-diff |
| Security | 0.20 | bandit / java-security | Deductions: critical `-0.30`, high `-0.15`, medium `-0.05`, low `-0.02` |
| Maintainability | 0.15 | radon / java-complexity | CC≤5 → 1.0; CC 5-15 → 1.0-0.5; CC 15-25 → 0.5-0.0 |
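
As a worked example, the sample report above combines to `0.85*0.40 + 0.96*0.25 + 1.0*0.20 + 0.9*0.15 = 0.34 + 0.24 + 0.20 + 0.135 = 0.915`, which is reported as the `overall` value of 0.91.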

---

## Mode 2: `code-eval snippet`

Lightweight snippet evaluation — runs **static-analysis scanners only** (no test runners or dependency auditors) and produces a compact result with a 0-100 score.

### Inline code

Evaluate a code string directly:

```bash
code-eval snippet -c "import os; os.system('rm -rf /')" --lang python
```

### File input

Evaluate a single code file:

```bash
code-eval snippet --file ./utils.py
```

Language is auto-detected from the file extension. You can override it:

```bash
code-eval snippet --file ./script.txt --lang python
```

### Save snippet result

```bash
code-eval snippet -c "print('hello')" --lang python --output result.json
```

### Snippet Output Format

The snippet result JSON is a compact schema:

```json
{
  "language": "python",
  "file": "snippet.py",
  "duration_seconds": 0.45,
  "score": 85.0,
  "issues_count": 3,
  "issues": [
    {
      "id": "SNIPPET-001",
      "severity": "high",
      "type": "security",
      "message": "Possible shell injection via os.system()",
      "file": "snippet.py",
      "line": 1
    }
  ],
  "severity_summary": {
    "critical": 0,
    "high": 1,
    "medium": 1,
    "low": 1,
    "info": 0
  }
}
```
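
Like the eval report, the snippet result is straightforward to consume programmatically (field names from the schema above; the file name is illustrative):

```python
import json

with open("result.json") as f:
    result = json.load(f)

print(f"Score: {result['score']}/100 with {result['issues_count']} issue(s)")
for issue in result["issues"]:
    print(f"  [{issue['severity']}] {issue['file']}:{issue['line']} {issue['message']}")
```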

### Snippet Scoring Dimensions

Snippet mode uses **3 dimensions** (no correctness, since there are no tests):

| Dimension | Weight | Source | Scoring Logic |
|-----------|--------|--------|---------------|
| Quality | 0.40 | ruff / java-lint | `-0.02` per lint issue |
| Security | 0.35 | bandit / java-security | Deductions: critical `-0.30`, high `-0.15`, medium `-0.05`, low `-0.02` |
| Maintainability | 0.25 | radon / java-complexity | CC≤5 → 1.0; CC 5-15 → 1.0-0.5; CC 15-25 → 0.5-0.0 |
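
As an illustrative calculation (assuming the 0-100 score is the weighted sum scaled by 100): a snippet with two lint issues, one high-severity security finding, and low complexity would score quality 0.96, security 0.85, and maintainability 1.0, giving `0.96*0.40 + 0.85*0.35 + 1.0*0.25 ≈ 0.93`, i.e. roughly 93 out of 100.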

### Snippet Scanners by Language

| Language | Scanners |
|----------|----------|
| Python | ruff, bandit, radon |
| Java | java-lint, java-security, java-complexity |

> **Note:** Test runners (pytest, maven-test) and dependency auditors (pip-audit) are excluded from snippet mode since snippets have no project structure.

### Exit Codes (snippet)

| Code | Meaning |
|------|---------|
| `0` | No critical or high severity issues |
| `1` | At least one critical or high severity issue found |
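
This makes snippet mode easy to wire into pre-commit hooks or CI gates. A minimal sketch that shells out to the CLI and propagates its pass/fail exit code (the file path is illustrative):

```python
import subprocess
import sys

# Evaluate a single file; a non-zero exit code means blocking issues were found.
proc = subprocess.run(
    ["code-eval", "snippet", "--file", "utils.py", "--output", "result.json"]
)
if proc.returncode != 0:
    print("Critical or high severity issues found; blocking.")
sys.exit(proc.returncode)
```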

---

## Configuration

Create a `.env` file (see `.env.example`) to customize behavior:

```env
# Sandbox
SANDBOX_ENABLED=false              # Global toggle (default: false)
SANDBOX_PYTHON_ENABLED=true        # Per-language override
SANDBOX_JAVA_ENABLED=              # Per-language override for Java
SANDBOX_MEMORY_LIMIT=512m          # Docker memory limit
SANDBOX_CPU_LIMIT=1                # Docker CPU limit
SANDBOX_TIMEOUT=300                # Total timeout in seconds
SANDBOX_NETWORK=none               # Docker network mode

# Concurrency
MAX_CONCURRENT=4                   # Max parallel evaluations

# Issue limits
MAX_ISSUES_PER_TARGET=50           # Max issues per target in report

# Scoring weights (auto-normalized if they don't sum to 1.0)
SCORE_WEIGHT_CORRECTNESS=0.40
SCORE_WEIGHT_QUALITY=0.25
SCORE_WEIGHT_SECURITY=0.20
SCORE_WEIGHT_MAINTAINABILITY=0.15

# Java / Maven
JAVA_MVN_PATH=                     # Optional mvn path (fallback: PATH lookup)
JAVA_MVN_SETTINGS=                 # Optional settings.xml
JAVA_MVN_TIMEOUT=300               # Maven timeout in seconds
JAVA_MVN_SKIP_TESTS=false          # If true, run compile instead of test
JAVA_MVN_THREADS=                  # Optional -T value (e.g. 2C)
```
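
Weight auto-normalization simply rescales whatever you provide so the four weights sum to 1.0. A minimal sketch of that idea (not the exact `config.py` code):

```python
def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    # e.g. {"correctness": 0.4, "quality": 0.4} -> {"correctness": 0.5, "quality": 0.5}
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("Scoring weights must sum to a positive value")
    return {name: value / total for name, value in weights.items()}
```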

### Sandbox resolution order

For each language: **per-language override** → **global toggle** → **default (false)**

Example: `SANDBOX_ENABLED=false` + `SANDBOX_PYTHON_ENABLED=true` → Python runs in sandbox, others run directly.
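
A minimal sketch of that lookup (illustrative; the real `config.py` may differ):

```python
import os

def sandbox_enabled(language: str) -> bool:
    # Per-language override wins; an unset/empty value falls through to the
    # global toggle; the default is False.
    override = os.getenv(f"SANDBOX_{language.upper()}_ENABLED", "").strip().lower()
    if override in ("true", "false"):
        return override == "true"
    return os.getenv("SANDBOX_ENABLED", "false").strip().lower() == "true"
```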

## Docker Sandbox

To build the evaluation Docker image:

```bash
docker build -f docker/Dockerfile.python -t code-eval-python .
```

Enable sandbox in `.env`:

```env
SANDBOX_ENABLED=true
```

## Project Structure

```
code_eval/
├── __init__.py
├── cli.py              # Click CLI entry point (eval + snippet sub-commands)
├── config.py           # Configuration from .env
├── adapters/           # Language adapter interface + Python/Java implementations
├── core/               # Runner, scheduler, sandbox, models
├── extractors/         # Issue extractors (Python + Java)
├── reporting/          # JSON & markdown report generation
├── resolvers/          # Target resolution & language detection
├── scanners/           # Scanner interface + Python/Java scanner implementations
├── schemas/            # Pydantic data models (Issue, Metrics, EvaluationReport, SnippetResult)
├── scoring/            # Score computation
└── snippet/            # Snippet-mode runner & scanner selection
```

## Development

```bash
# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
python -m pytest tests/ -v
```

## License

[MIT](LICENSE)
