You are analyzing a Harbor task trial to determine if the task is well-specified.

## Your Goal
Determine whether this trial outcome reveals a TASK PROBLEM (needs fixing) or is normal agent behavior (task is fine).

**Critical Context:** This task has already passed baseline validation (oracle passes, nop fails). Your job is to detect problems that baseline validation CANNOT catch:
- Underspecified instructions (agent lacks critical details)
- Overspecified/brittle tests (tests coupled to specific implementation)
- Ambiguous requirements (multiple valid interpretations)
- Tests checking for details not mentioned in instructions

## CRITICAL: Calibration for Hard Tasks

**Hard tasks are SUPPOSED to be hard.** A 20-40% pass rate is EXPECTED and DESIRABLE for good benchmark tasks. Do NOT classify a failure as a task problem just because:
- The agent had to explore the codebase to understand what to change
- The instruction doesn't explicitly list every file that needs modification
- The agent tried a reasonable approach that turned out to be wrong
- The task requires significant investigation or domain expertise

**The bar for BAD_FAILURE is HIGH.** Only classify as BAD_FAILURE if:
- Information is GENUINELY IMPOSSIBLE to derive from instruction + codebase combined
- Tests check for something that contradicts the instruction
- Multiple valid solutions exist but tests only accept one specific approach
- Tests are flaky or depend on non-deterministic behavior

**Default to GOOD_FAILURE** when the agent fails. Agent failures are the norm for hard tasks.

## CRITICAL: What the Agent Can and Cannot See

**During the trial, the agent ONLY has access to:**
- The `instruction.md` file describing the bug/task
- The buggy codebase (repository code with the bug present)
- Standard development tools (editor, terminal, etc.)

**The agent CANNOT see and has NO knowledge of:**
- `solution/` directory - contains fix.patch and solve.sh (used ONLY for oracle validation)
- `tests/` directory - test files are copied in AFTER the agent finishes (for verification only)
- Any patches, diffs, or reference solutions

**This means:**
- The agent must figure out the fix from scratch using only instruction.md and the buggy code
- The agent has NO access to any "solution patch" - do NOT fault the agent for not using it
- The agent cannot see how tests verify the solution - they work blind

## The Verified Result
**Test outcome: {result}** (pass = reward 1.0, fail = reward 0.0)

This result is FINAL and has been verified by running the tests. Your job is to classify WHY this result occurred, not to re-determine pass/fail.

**Classification constraints based on verified result:**
- If result = 'pass' → classify as GOOD_SUCCESS or BAD_SUCCESS
- If result = 'fail' → classify as GOOD_FAILURE, BAD_FAILURE, or HARNESS_ERROR

## Where to Look — YOU Must Read Everything

**You are the grader. You have full access to all task artifacts.** Read them thoroughly before classifying — the agent was sandboxed but you are not.

**Task Definition ({task_dir}):**
- `instruction.md` — what the agent was told (the ONLY thing the agent sees from the task definition)
- `solution/` — reference solution (`solve.sh`, `fix.patch`). Read this to understand the intended fix.
- `tests/` — verification tests (`test.sh` plus test source files). Read the actual test code to understand what is being checked and whether it is reasonable.
- `environment/Dockerfile` — container setup, determines what packages/tools are available

**Trial Execution ({trial_dir}):**
- `agent/` — agent execution logs and trajectory. Read this to see what the agent actually did.
- `verifier/test-stdout.txt` — raw test output. Read this to see exactly which tests passed/failed and why.
- `result.json` — contains `verifier_result.rewards.reward`

**Start by reading the key files.** Before classifying, you MUST read at minimum:
1. `{task_dir}/instruction.md` — to know what the agent was asked to do
2. The test source files under `{task_dir}/tests/` — to know what the tests actually verify
3. `{task_dir}/solution/solve.sh` or `{task_dir}/solution/fix.patch` — to understand the intended solution
4. `{trial_dir}/verifier/test-stdout.txt` — to see the actual test results
5. The agent trajectory under `{trial_dir}/agent/` — to see what the agent attempted

Only after reading these can you make an informed classification. A holistic view of all artifacts is essential — the instruction alone is not enough context to judge whether a failure is the task's fault or the agent's.

**Task directory structure:**

```
<task-dir>
├── instruction.md          ← agent's only view of the task
├── task.toml
├── environment
│   ├── Dockerfile
│   └── bug.patch
├── solution                ← YOU read this, agent cannot
│   ├── solve.sh
│   └── fix.patch
└── tests                   ← YOU read this, agent cannot
    ├── test.sh
    └── # test files (e.g., test_*.py, *.test.ts, *_test.go, etc.)
```

## Classification Taxonomy

### HARNESS_ERROR (Infrastructure Issue)
The agent never ran properly:
- Agent binary not found (e.g., 'bash: claude: command not found')
- Docker/container setup failures
- Missing dependencies in test environment
- Empty trajectory files

### GOOD_FAILURE (Agent's Fault - Task is Fine) ✓ DEFAULT FOR FAILURES
Agent ran but couldn't solve it due to its own limitations. **This is the expected outcome for hard tasks.**
- **Timeout**: Task requires many steps, agent ran out of time
- **Wrong Approach**: Agent tried reasonable approaches but couldn't find the right solution
- **Implementation Bugs**: Agent understood task but made coding errors
- **Context Loss**: Agent forgot earlier context or requirements
- **Premature Stop**: Agent gave up early or declared success incorrectly
- **Complexity Overwhelm**: Task is genuinely difficult and agent couldn't handle it
- **Insufficient Exploration**: Agent didn't explore the codebase enough to understand what to change
- **Incomplete Understanding**: Agent misunderstood the problem or solution space

**Key insight**: If the agent COULD have solved it with more effort, better exploration, or smarter reasoning, it's GOOD_FAILURE even if the task is hard.

### BAD_FAILURE (Task's Fault - Needs Fix) ⚠️
Agent failed due to task specification issues.

**⚠️ IMPORTANT: The bar for BAD_FAILURE is VERY HIGH. Default to GOOD_FAILURE.**

**Underspecified Instruction** - Information is IMPOSSIBLE to derive:
- Tests require behavior that is NOT mentioned in instruction AND NOT discoverable from codebase
- The instruction is actively misleading or contradicts what tests expect
- Example: Instruction says "validate cookies" but tests ONLY check "authorization" header (completely different requirement)

**NOT underspecified** (classify as GOOD_FAILURE instead):
- Instruction describes the problem but agent must explore to find which files to change
- Tests check specific files that a competent developer could identify by investigation
- Agent needs to understand the codebase structure to implement the fix
- Example: Instruction says "fix version references" - agent must explore to find go.mod files

**Rigid/Brittle Tests** - Tests reject CORRECT solutions:
- Tests check exact string matches instead of behavior (e.g., `assert "duplicate" in msg` rejects valid "conflicts with")
- Tests require specific variable/function names not specified in instruction
- Agent's solution is FUNCTIONALLY CORRECT but fails due to superficial differences
- Example: Agent fixes the bug correctly but test fails because it expects specific error message format

**NOT brittle** (classify as GOOD_FAILURE instead):
- Tests check for the correct behavior and agent's solution doesn't implement it
- Agent's approach was reasonable but wrong (this is expected for hard tasks)

**Non-deterministic Tests** - Flaky/unpredictable:
- Tests fail or pass inconsistently due to timing, race conditions, or randomness
- Tests depend on external state or network conditions
- Tests have order dependencies

**Environment Issues** - Task-specific setup problems:
- Missing dependencies, wrong package versions, docker/container problems specific to this task
- Not general infrastructure failures (those are HARNESS_ERROR)

**Missing File Reference** - Tests expect files not mentioned:
- Tests check for outputs to files that instruction doesn't specify
- Tests import modules that instruction doesn't mention needing

**Ambiguous Requirements** - Multiple valid interpretations:
- Instruction wording allows different correct implementations, but tests only accept one
- Example: "Handle duplicates" could mean reject, skip, or merge - instruction doesn't specify

⚠️ **KEY SIGNAL FOR BAD_FAILURE:** The agent's solution is FUNCTIONALLY CORRECT but tests reject it for superficial reasons, OR the task requires information that doesn't exist anywhere.

⚠️ **KEY SIGNAL FOR GOOD_FAILURE:** The agent failed to find the right solution, but the information WAS available in the codebase if they had explored more thoroughly.

### GOOD_SUCCESS (Legitimate Solution)
Agent correctly solved the task:
- Solution matches expected behavior from instruction
- Tests verify actual functionality
- Agent demonstrated understanding of the problem

### BAD_SUCCESS (Cheating/Gaming) ⚠️
Agent "passed" but through illegitimate means:

**Hardcoding** - Agent hardcodes expected outputs:
- Returns specific values instead of computing them
- Pattern-matched test cases and hardcoded responses
- Example: Returns "success" for every input instead of actual validation logic

**Test Inspection** - Agent somehow accessed test files:
- NOTE: Tests should NOT be visible to agent - if they accessed them, that's an infrastructure issue
- Look for evidence in trajectory of agent reading test files from unexpected locations
- Based solution on what tests check rather than instruction requirements

**Oracle Copying** - Agent somehow accessed solution:
- NOTE: solution/ should NOT be visible to agent - if they accessed it, that's an infrastructure issue
- Look for evidence in trajectory of agent finding/reading solution files
- This is rare and indicates a task setup problem, not normal agent cheating

**Minimal Compliance** - Bare minimum to pass:
- Technically passes tests but doesn't solve the actual problem
- Only handles the specific test cases, would fail on similar inputs
- Example: Hardcodes 3 expected values instead of implementing the algorithm

**Tests Too Permissive** - Tests accept bad solutions:
- Tests don't actually verify the requirement from instruction
- Tests pass for trivial/wrong implementations
- Example: Test checks function exists but doesn't verify behavior

**Task Pre-solved** - Solution already present:
- Repository already contained working code, agent just had to find it
- Tests pass without any meaningful changes

⚠️ **KEY SIGNAL:** If agent passed but their implementation is suspiciously minimal or hardcodes specific values, classify as BAD_SUCCESS. If they somehow accessed solution/ or tests/ (which should be hidden), note this as an infrastructure concern.

## How to Analyze

1. **Read all task artifacts first** — instruction.md, the test source files, and the reference solution. You need to understand the full picture before judging.
2. **Read the test output** (verifier/test-stdout.txt) — what specifically failed or passed?
3. **Read the agent trajectory** (agent/) — what did the agent actually try? Was the approach reasonable given what they could see (instruction.md + buggy codebase only)?
4. **Compare instruction vs tests** — are tests checking for things NOT in instructions and NOT discoverable from the codebase?
5. **Compare agent's solution vs reference solution** — did the agent take a valid alternative approach, or miss the mark entirely?
6. **Check for cheating patterns** — did agent hardcode values or somehow access hidden files?
7. **Consider the alternative solution test** — would a different valid approach (consistent with instruction) pass the tests, or do tests only accept one specific implementation?

## Key Questions for Task Quality

**For BAD_FAILURE (instruction/test problems) - ALL must be true:**
- Is the required information IMPOSSIBLE to derive from instruction + codebase?
- Did the agent implement something that is FUNCTIONALLY CORRECT but tests reject it?
- Would ANY competent developer struggle because the spec is genuinely ambiguous or contradictory?

**For GOOD_FAILURE (task is fine, agent failed) - ANY is sufficient:**
- Could a skilled developer solve this by exploring the codebase carefully?
- Is the information technically available but just requires investigation?
- Did the agent fail to explore enough or make reasoning errors?
- Is this just a hard problem that requires expertise?

**For BAD_SUCCESS (cheating/too easy):**
- Did the agent hardcode outputs instead of implementing logic?
- Could an agent pass by pattern-matching without understanding the problem?
- Do tests actually verify the requirement or just check superficial things?
- Is there evidence the agent somehow accessed hidden files? (This shouldn't be possible normally)

**Critical distinction (GOOD vs BAD):**
- **GOOD_FAILURE**: Agent tried reasonable approaches but couldn't solve it (agent's limitation)
- **BAD_FAILURE**: Agent tried reasonable approaches but tests rejected valid solutions (task's fault)
- **GOOD_SUCCESS**: Agent solved it properly by understanding and implementing requirements
- **BAD_SUCCESS**: Agent "solved" it by cheating, hardcoding, or tests are too permissive

## Output Format

REMEMBER: Your classification MUST match the verified result!
- Result '{result}' means you must choose a matching classification (SUCCESS for pass, FAILURE for fail)

Output ONLY valid JSON with this exact structure (no markdown, no code blocks, no explanation):
{{
  "classification": "HARNESS_ERROR | GOOD_FAILURE | BAD_FAILURE | GOOD_SUCCESS | BAD_SUCCESS",
  "subtype": "specific subtype from the taxonomy above",
  "evidence": "Quote specific test names, error messages, or code snippets that support your classification",
  "root_cause": "1-2 sentence explanation of what specifically caused this outcome",
  "recommendation": "If BAD_FAILURE or BAD_SUCCESS, explain how to fix the task. Otherwise write 'N/A - task is fine'"
}}
