Metadata-Version: 2.4
Name: screenbench
Version: 0.0.1
Summary: ScreenSuite
Author: Aymeric Roucher
Author-email: Amir Mahla <amir.mahla@huggingface.co>
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: datasets>=3.5.0
Requires-Dist: e2b-desktop==1.7.1
Requires-Dist: joblib>=1.4.0
Requires-Dist: litellm
Requires-Dist: pillow>=11.0.0
Requires-Dist: protobuf<6.0.0.dev0,>=4.21.5
Requires-Dist: pydantic>=2.11.0
Requires-Dist: rouge>=1.0.1
Requires-Dist: smolagents[litellm,openai]>=1.15.0
Requires-Dist: torch>=2.6.0
Requires-Dist: torchvision>=0.21.0
Requires-Dist: tqdm>=4.67.0
Requires-Dist: transformers
Provides-Extra: android-world
Requires-Dist: android-world; extra == 'android-world'
Provides-Extra: osworld
Requires-Dist: osworld; extra == 'osworld'
Provides-Extra: submodules
Requires-Dist: android-world; extra == 'submodules'
Requires-Dist: osworld; extra == 'submodules'
Description-Content-Type: text/markdown

# ScreenSuite

A comprehensive benchmarking suite for evaluating Graphical User Interface (GUI) agents (i.e. agents that act on your screen, like our [Computer Agent](https://huggingface.co/spaces/smolagents/computer-agent)) across three areas of ability: perception, single-step, and multi-step agentic behaviour.

ScreenSuite does not aim to compare agent implementations, only the MLLMs that power them: we therefore provide only simple agent implementations based on [smolagents](https://github.com/huggingface/smolagents).

# GUI Agent Benchmarks Overview

## Grounding/Perception Benchmarks

| Data Source      | Evaluation Type                                  | Platform | Link                                                                                 |
| ---------------- | ------------------------------------------------ | -------- | ------------------------------------------------------------------------------------ |
| ScreenSpot       | BBox + click accuracy                            | Web      | [HuggingFace](https://huggingface.co/datasets/rootsautomation/ScreenSpot)            |
| ScreenSpot v2    | BBox + click accuracy                            | Web      | [HuggingFace](https://huggingface.co/datasets/HongxinLi/ScreenSpot_v2)               |
| ScreenSpot-Pro   | BBox + click accuracy                            | Web      | [HuggingFace](https://huggingface.co/datasets/HongxinLi/ScreenSpot-Pro)              |
| Visual-WebBench  | Multi-task (Caption, OCR, QA, Grounding, Action) | Web      | [HuggingFace](https://huggingface.co/datasets/visualwebbench/VisualWebBench)         |
| WebSRC           | Web QA                                           | Web      | [HuggingFace](https://huggingface.co/datasets/X-LANCE/WebSRC_v1.0)                   |
| ScreenQA-short   | Mobile QA                                        | Mobile   | [HuggingFace](https://huggingface.co/datasets/rootsautomation/RICO-ScreenQA-Short)   |
| ScreenQA-complex | Mobile QA                                        | Mobile   | [HuggingFace](https://huggingface.co/datasets/rootsautomation/RICO-ScreenQA-Complex) |
| Showdown-Clicks  | Click prediction                                 | Web      | [HuggingFace](https://huggingface.co/datasets/generalagents/showdown-clicks)         |

## Single Step - Offline Agent Benchmarks

| Data Source         | Evaluation Type | Platform | Link                                                                                     |
| ------------------- | --------------- | -------- | ---------------------------------------------------------------------------------------- |
| Multimodal-Mind2Web | Web navigation  | Web      | [HuggingFace](https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web)                |
| AndroidControl      | Mobile control  | Mobile   | [GitHub](https://github.com/google-research/google-research/tree/master/android_control) |

## Multi-step - Online Agent Benchmarks

| Data Source   | Evaluation Type | Platform | Link                                                                                     |
| ------------- | --------------- | -------- | ---------------------------------------------------------------------------------------- |
| Mind2Web-Live | URL matching    | Web      | [HuggingFace](https://huggingface.co/datasets/iMeanAI/Mind2Web-Live)                     |
| GAIA          | Exact match     | Web      | [HuggingFace](https://huggingface.co/datasets/gaia-benchmark/GAIA)                       |
| BrowseComp    | LLM judge       | Web      | [Link](https://openai.com/index/browsecomp/)                                             |
| AndroidWorld  | Task-specific   | Mobile   | [GitHub](https://github.com/google-research/android_world)                               |
| MobileMiniWob | Task-specific   | Mobile   | Included in AndroidWorld [GitHub](https://github.com/Farama-Foundation/miniwob-plusplus) |
| OSWorld       | Task-specific   | Desktop  | [GitHub](https://github.com/xlang-ai/OSWorld)                                            |
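
Each of these benchmarks is registered under one or more tags in the `screensuite` registry. As a minimal sketch (reusing the same calls and tag names as the full example further below), you can list or filter them like this:

```python
from screensuite import registry

# Print the name of every registered benchmark
for bench in registry.list_all():
    print(bench.name)

# Or select a subset by tag, e.g. only the ScreenSpot v2 click task
grounding_benchmarks = registry.get_by_tags(tags=["screenspot-v2-click-prompt"])
```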


## Cloning the Repository

Make sure to clone the repository with the required submodules:

```bash
git clone --recurse-submodules git@github.com:huggingface/geekagents.git
```

or

```bash
git submodule update --init --recursive  # if you already cloned the repository; also run this after pulling branches to update the submodules
```

## Requirements
- Docker
- Python >= 3.11
- [uv](https://docs.astral.sh/uv/getting-started/installation/)

> For multi-step agent benchmarks, we need to spawn containerized environments. To do so, you need KVM virtualization enabled.
> To check if your hosting platform supports KVM, run
> ```bash
> egrep -c '(vmx|svm)' /proc/cpuinfo
> ```
> on Linux. If the return value is greater than zero, the processor should be able to support KVM.
> Note: macOS hosts generally do not support KVM.
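> You can also verify that the KVM device is actually exposed on the host:
> ```bash
> ls -l /dev/kvm
> ```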

## Installation

```bash
# Using uv (faster)
uv sync --extra submodules --python 3.11
```
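
If you prefer pip, an equivalent editable install should look roughly like this (a sketch, relying on the `submodules` extra declared in the package metadata):

```bash
pip install -e ".[submodules]"
```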

> If you encounter issues with the `evdev` Python package, try installing the build-essential package:
> ```bash
> sudo apt-get install build-essential
> ```

## Development

```bash
# Install development dependencies
uv sync --all-extras

# Run tests
uv run pytest

# Code quality
uv run pre-commit run --all-files --show-diff-on-failure
```

## Running the benchmarks

```python
#!/usr/bin/env python
import os
import json
from datetime import datetime
from dotenv import load_dotenv
from smolagents.models import InferenceClientModel, OpenAIServerModel, LiteLLMModel
from screensuite import registry
from screensuite.basebenchmark import EvaluationConfig
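# NOTE: OSWorldEnvironmentConfig (used below for the OSWorld benchmark) must also be
# imported; its exact module path depends on your screensuite checkout.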

load_dotenv()

# Setup results directory
RESULTS_DIR = os.path.join(os.path.dirname(__file__), "results")
os.makedirs(RESULTS_DIR, exist_ok=True)

def run_benchmarks():
    # Get benchmarks to run
    # benchmarks = registry.list_all()
    benchmarks = registry.get_by_tags(
        tags=[
            "screenqa_short",
            "screenqa_complex",
            "screenspot-v1-click-prompt",
            "screenspot-v1-bounding-box-prompt",
            "screenspot-v2-click-prompt",
            "screenspot-v2-bounding-box-prompt",
            "screenspot-pro-click-prompt",
            "screenspot-pro-bounding-box-prompt",
            "websrc_dev",
            "visualwebbench",
            "android_control",
            "showdown_clicks",
            "mmind2web",
            "android_world",
            "osworld",
            "gaia_web",
        ]
    )

    for bench in benchmarks:
        print(bench.name)

    # Configure your model (choose one)
    model = InferenceClientModel(
        model_id="Qwen/Qwen2.5-VL-32B-Instruct",
        provider="fireworks-ai",
        max_tokens=4096,
    )

    # Alternative models:
    # model = OpenAIServerModel(model_id="gpt-4o", max_tokens=4096)
    # model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514", max_tokens=4096)
    # see smolagents documentation for more models -> https://github.com/huggingface/smolagents/blob/main/examples/agent_from_any_llm.py

    # Run benchmarks
    run_name = f"test_{datetime.now().strftime('%Y-%m-%d')}"
    max_samples_to_test = 200
    parallel_workers = 4
    osworld_env_config = OSWorldEnvironmentConfig(provider_name="docker")

    for benchmark in benchmarks:
        print(f"Running: {benchmark.name}")

        # Configure based on benchmark type
        config = EvaluationConfig(
            parallel_workers=parallel_workers,
            run_name=run_name,
            max_samples_to_test=max_samples_to_test
        )

        try:
            results = benchmark.evaluate(
                model,
                evaluation_config=config,
                env_config=osworld_env_config if "osworld" in benchmark.tags else None,
            )
            print(f"Results: {results._metrics}")

            # Save results
            with open(f"{RESULTS_DIR}/results_{run_name}.jsonl", "a") as f:
                entry = {"benchmark_name": benchmark.name, "metrics": results._metrics}
                f.write(json.dumps(entry) + "\n")

        except Exception as e:
            print(f"Error in {benchmark.name}: {e}")
            continue

if __name__ == "__main__":
    run_benchmarks()
```
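
The script calls `load_dotenv()`, so model credentials can live in a local `.env` file. A minimal sketch (which variables you actually need depends on the model backend you pick; the values below are placeholders):

```bash
# Example .env (placeholders only)
HF_TOKEN=hf_...               # InferenceClientModel (Hugging Face inference providers)
OPENAI_API_KEY=sk-...         # OpenAIServerModel
ANTHROPIC_API_KEY=sk-ant-...  # LiteLLMModel with Anthropic models
```

Assuming you save the script above as `run_benchmarks.py` (a name chosen for illustration), run it with:

```bash
uv run python run_benchmarks.py
```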

## OSWorld Google Tasks

To run OSWorld Google tasks, you need to create a Google account and a Google Cloud project.
See [OSWorld documentation](https://github.com/xlang-ai/OSWorld/blob/main/ACCOUNT_GUIDELINE.md) for more details.

## License

This project is licensed under the terms of the Apache License 2.0.
