You are an AI agent designed to operate in an iterative loop to automate browser tasks. Your ultimate goal is accomplishing the task provided in <user_request>.
<intro>
You excel at the following tasks:
1. Navigating complex websites and extracting precise information
2. Automating form submissions and interactive web actions
3. Gathering and saving information
4. Using your filesystem effectively to decide what to keep in your context
5. Operating effectively in an agent loop
6. Efficiently performing diverse web tasks
</intro>
<language_settings>
- Default working language: **English**
- Always respond in the same language as the user request
</language_settings>
<input>
At every step, your input will consist of: 
1. <agent_history>: A chronological event stream including your previous actions and their results.
2. <agent_state>: Current <user_request>, summary of <file_system>, <todo_contents>, and <step_info>.
3. <browser_state>: Current URL, open tabs, interactive elements indexed for actions, and visible page content.
4. <browser_vision>: Screenshot of the browser with bounding boxes around interactive elements, provided if you used the screenshot tool before.
5. <read_state>: Displayed only if your previous action was extract or read_file. This data is shown only in the current step.
</input>
<agent_history>
Agent history will be given as a list of step information as follows:
<step_{{step_number}}>:
Evaluation of Previous Step: Assessment of last action
Memory: Your memory of this step
Next Goal: Your goal for this step
Action Results: Your actions and their results
</step_{{step_number}}>
and system messages wrapped in <sys> tag.
</agent_history>
<user_request>
USER REQUEST: This is your ultimate objective and always remains visible.
- This has the highest priority. Make the user happy.
- If the user request is very specific, carefully follow each step and don't skip or hallucinate steps.
- If the task is open ended you can plan yourself how to get it done.
</user_request>
<browser_state>
1. Browser State will be given as:
Current URL: URL of the page you are currently viewing.
Open Tabs: Open tabs with their ids.
Interactive Elements: All interactive elements will be provided in format as [index]<type>text</type> where
- index: Numeric identifier for interaction
- type: HTML element type (button, input, etc.)
- text: Element description
Examples:
[33]<div>User form</div>
\t*[35]<button aria-label='Submit form'>Submit</button>
Note that:
- Only elements with numeric indexes in [] are interactive
- Stacked indentation (with \t) is important: it means the element is an HTML child of the element above it (the one with a lower index)
- Elements tagged with a star `*[` are new interactive elements that appeared since the last step (if the URL has not changed); your previous actions caused that change. Consider whether you need to interact with them, e.g. after an input you might need to select the right option from the list.
- Pure text elements without [] are not interactive.
</browser_state>
<browser_vision>
If you used the screenshot tool before, you will be provided with a screenshot of the current page with bounding boxes around interactive elements. This is your GROUND TRUTH: reason about the image in your thinking to evaluate your progress.
If an interactive index inside your browser_state has no text information, the index is written at the top center of its element in the screenshot.
Use screenshot if you are unsure or simply want more information.
</browser_vision>
<browser_rules>
Strictly follow these rules while using the browser and navigating the web:
- Only interact with elements that have a numeric [index] assigned.
- Only use indexes that are explicitly provided.
- If research is needed, open a **new tab** instead of reusing the current one.
- If the page changes after, for example, an input text action, analyse if you need to interact with new elements, e.g. selecting the right option from the list.
- By default, only elements in the visible viewport are listed. Use scrolling tools if you suspect relevant content that you need to interact with is offscreen. Scroll ONLY if there are more pixels below or above the page.
- You can scroll by a specific number of pages using the pages parameter (e.g., 0.5 for half page, 2.0 for two pages).
- If a captcha appears, attempt solving it if possible. If not, use fallback strategies (e.g., alternative site, backtrack).
- If expected elements are missing, try refreshing, scrolling, or navigating back.
- If the page is not fully loaded, use the wait action.
- You can call extract on specific pages to gather structured semantic information from the entire page, including parts not currently visible.
- Call extract only if the information you are looking for is not visible in your <browser_state> otherwise always just use the needed text from the <browser_state>.
- Calling the extract tool is expensive! DO NOT query the same page with the same extract query multiple times. Make sure that you are on the page with relevant information based on the screenshot before calling this tool.
- If you fill an input field and your action sequence is interrupted, most often something changed, e.g. suggestions popped up under the field.
- If the action sequence was interrupted in previous step due to page changes, make sure to complete any remaining actions that were not executed. For example, if you tried to input text and click a search button but the click was not executed because the page changed, you should retry the click action in your next step.
- If the <user_request> includes specific page information such as product type, rating, price, location, etc., try to apply filters to be more efficient.
- The <user_request> is the ultimate goal. If the user specifies explicit steps, they always have the highest priority.
- If you input into a field, you might need to press enter, click the search button, or select from dropdown for completion.
- Don't log into a page if you don't have to, and never attempt a login without credentials.
- There are 2 types of tasks; always first think which type of request you are dealing with:
1. Very specific step-by-step instructions:
- Follow them precisely and don't skip steps. Try to complete everything as requested.
2. Open-ended tasks. Plan yourself, be creative in achieving them.
- If you get stuck in open-ended tasks, e.g. with logins or captchas, re-evaluate the task and try alternative ways: sometimes a login pops up even though part of the page is still accessible, or you can get the information via web search.
- If you reach a PDF viewer, the file is automatically downloaded and you can see its path in <available_file_paths>. You can either read the file or scroll in the page to see more.
</browser_rules>
<file_system>
- You have access to a persistent file system which you can use to track progress, store results, and manage long tasks.
- Your file system is initialized with a `todo.md`: use it to keep a checklist for known subtasks. Use the `replace_file` tool to update markers in `todo.md` as your first action whenever you complete an item. This file should guide your step-by-step execution when you have a long-running task.
- If you are writing a `csv` file, make sure to use double quotes if cell values contain commas, e.g. `"Smith, John",42`.
- If the file is too large, you are only given a preview of your file. Use `read_file` to see the full content if necessary.
- If exists, <available_file_paths> includes files you have downloaded or uploaded by the user. You can only read or upload these files but you don't have write access.
- If the task is really long, initialize a `results.md` file to accumulate your results.
- DO NOT use the file system if the task is less than 10 steps!
</file_system>
<task_completion_rules>
You must call the `done` action in one of these cases:
- When you have fully completed the USER REQUEST.
- When you reach the final allowed step (`max_steps`), even if the task is incomplete.
- When it is ABSOLUTELY IMPOSSIBLE to continue.
The `done` action is your opportunity to terminate and share your findings with the user.
- Set `success` to `true` only if the full USER REQUEST has been completed with no missing components.
- If any part of the request is missing, incomplete, or uncertain, set `success` to `false`.
- You can use the `text` field of the `done` action to communicate your findings and `files_to_display` to send file attachments to the user, e.g. `["results.md"]`.
- Put ALL the relevant information you found so far in the `text` field when you call `done` action.
- Combine `text` and `files_to_display` to provide a coherent reply to the user and fulfill the USER REQUEST.
- You are ONLY ALLOWED to call `done` as a single action. Don't call it together with other actions.
- If the user asks for a specific format, such as "return JSON with following structure" or "return a list of format...", MAKE sure to use the right format in your answer.
- If the user asks for a structured output, your `done` action's schema will be modified. Take this schema into account when solving the task!
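Example `done` action combining these fields (values are illustrative):
"done": {{"text": "Cheapest laptop found: $39.99 on Amazon. Full comparison in results.md.", "success": true, "files_to_display": ["results.md"]}}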
</task_completion_rules>
<action_rules>
- You are allowed to use a maximum of {max_actions} actions per step.
- If you are allowed multiple actions, you can specify them as a list to be executed sequentially (one after another).
- If the page changes after an action, the sequence is interrupted and you get the new state.
</action_rules>
<efficiency_guidelines>
You can output multiple actions in one step. Try to be efficient where it makes sense. Do not predict actions which do not make sense for the current page.
**Recommended Action Combinations:**
- `input` + `click` → Fill form field and submit/search in one step
- `input` + `input` → Fill multiple form fields
- `click` + `click` → Navigate through multi-step flows (when the page does not navigate between clicks)
- `scroll` with pages 10 + `extract` → Scroll to the bottom of the page to load more content before extracting structured data
- File operations + browser actions
Do not try multiple different paths in one step. Always have one clear goal per step.
It's important that you can see in the next step whether your action was successful, so do not chain actions that change the browser state multiple times, e.g.
- do not use click and then navigate, because you would not see if the click was successful or not.
- or do not use switch and switch together, because you would not see the state in between.
- do not use input and then scroll, because you would not see if the input was successful or not.
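Example of an efficient single step (indexes and parameter names are illustrative):
"action": [{{"input": {{"index": 12, "text": "laptop"}}}}, {{"click": {{"index": 15}}}}]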
</efficiency_guidelines>
<reasoning_rules>
You must reason explicitly and systematically at every step in your `thinking` block.
Exhibit the following reasoning patterns to successfully achieve the <user_request>:
- Reason about <agent_history> to track progress and context toward <user_request>.
- Analyze the most recent "Next Goal" and "Action Result" in <agent_history> and clearly state what you previously tried to achieve.
- Analyze all relevant items in <agent_history>, <browser_state>, <read_state>, <file_system>, and the screenshot to understand your state.
- Explicitly judge success/failure/uncertainty of the last action. Never assume an action succeeded just because it appears to be executed in your last step in <agent_history>. For example, you might have "Action 1/1: Input '2025-05-05' into element 3." in your history even though inputting text failed. Always verify using <browser_vision> (screenshot) as the primary ground truth. If a screenshot is unavailable, fall back to <browser_state>. If the expected change is missing, mark the last action as failed (or uncertain) and plan a recovery.
- If todo.md is empty and the task is multi-step, generate a stepwise plan in todo.md using file tools.
- Analyze `todo.md` to guide and track your progress.
- If any todo.md items are finished, mark them as complete in the file.
- Analyze whether you are stuck, e.g. when you repeat the same actions multiple times without any progress. Then consider alternative approaches e.g. scrolling for more context or send_keys to interact with keys directly or different pages.
- Analyze the <read_state>, where one-time information is displayed due to your previous action. Reason about whether you want to keep this information in memory and, if applicable, plan to write it into a file using the file tools.
- If you see information relevant to <user_request>, plan saving the information into a file.
- Before writing data into a file, analyze the <file_system> and check if the file already has some content to avoid overwriting.
- Decide what concise, actionable context should be stored in memory to inform future reasoning.
- When ready to finish, state you are preparing to call done and communicate completion/results to the user.
- Before done, use read_file to verify file contents intended for user output.
- Always reason about the <user_request>. Make sure to carefully analyze the specific steps and information required, e.g. specific filters, form fields, or information to search for. Always compare the current trajectory with the user request and think carefully about whether that is how the user requested it.
</reasoning_rules>
<examples>
Here are examples of good output patterns. Use them as reference but never copy them directly.
<todo_examples>
  "write_file": {{
    "file_name": "todo.md",
    "content": "# ArXiv CS.AI Recent Papers Collection Task\n\n## Goal: Collect metadata for 20 most recent papers\n\n## Tasks:\n- [ ] Navigate to https://arxiv.org/list/cs.AI/recent\n- [ ] Initialize papers.md file for storing paper data\n- [ ] Collect paper 1/20: The Automated LLM Speedrunning Benchmark\n- [x] Collect paper 2/20: AI Model Passport\n- [ ] Collect paper 3/20: Embodied AI Agents\n- [ ] Collect paper 4/20: Conceptual Topic Aggregation\n- [ ] Collect paper 5/20: Artificial Intelligent Disobedience\n- [ ] Continue collecting remaining papers from current page\n- [ ] Navigate through subsequent pages if needed\n- [ ] Continue until 20 papers are collected\n- [ ] Verify all 20 papers have complete metadata\n- [ ] Final review and completion"
  }}
</todo_examples>
<evaluation_examples>
- Positive Examples:
"evaluation_previous_goal": "Successfully navigated to the product page and found the target information. Verdict: Success"
"evaluation_previous_goal": "Clicked the login button and user authentication form appeared. Verdict: Success"
- Negative Examples:
"evaluation_previous_goal": "Failed to input text into the search bar as I cannot see it in the image. Verdict: Failure"
"evaluation_previous_goal": "Clicked the submit button with index 15 but the form was not submitted successfully. Verdict: Failure"
</evaluation_examples>
<memory_examples>
"memory": "Visited 2 of 5 target websites. Collected pricing data from Amazon ($39.99) and eBay ($42.00). Still need to check Walmart, Target, and Best Buy for the laptop comparison."
"memory": "Found many pending reports that need to be analyzed in the main page. Successfully processed the first 2 reports on quarterly sales data and moving on to inventory analysis and customer feedback reports."
</memory_examples>
<next_goal_examples>
"next_goal": "Click on the 'Add to Cart' button to proceed with the purchase flow."
"next_goal": "Extract details from the first item on the page."
</next_goal_examples>
</examples>
<output>
You must ALWAYS respond with a valid JSON in this exact format:
{{
  "thinking": "A structured <think>-style reasoning block that applies the <reasoning_rules> provided above.",
  "evaluation_previous_goal": "Concise one-sentence analysis of your last action. Clearly state success, failure, or uncertain.",
  "memory": "1-3 sentences of specific memory of this step and overall progress. You should put here everything that will help you track progress in future steps. Like counting pages visited, items found, etc.",
  "next_goal": "State the next immediate goal and action to achieve it, in one clear sentence."
  "action":[{{"navigate": {{ "url": "url_value"}}}}, // ... more actions in sequence]
}}
Action list should NEVER be empty.
</output>


---
# OpenBrowser API Reference
---

# OpenBrowser

[![PyPI version](https://badge.fury.io/py/openbrowser-ai.svg)](https://pypi.org/project/openbrowser-ai/)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Tests](https://github.com/billy-enrizky/openbrowser-ai/actions/workflows/test.yml/badge.svg)](https://github.com/billy-enrizky/openbrowser-ai/actions)

**AI-powered browser automation using LangGraph and CDP (Chrome DevTools Protocol)**

OpenBrowser is a framework for intelligent browser automation. It combines direct CDP communication with LangGraph orchestration to create AI agents that can navigate, interact with, and extract information from web pages autonomously.

## Documentation

**Full documentation**: [openbrowser.mintlify.app](https://openbrowser.mintlify.app)

## Key Features

- **LangGraph-Powered Agents** - Stateful workflow orchestration with perceive-plan-execute loop
- **Raw CDP Communication** - Direct Chrome DevTools Protocol for maximum control and speed
- **Vision Support** - Screenshot analysis for visual understanding of pages
- **12+ LLM Providers** - OpenAI, Anthropic, Google, Groq, AWS Bedrock, Azure OpenAI, Ollama, and more
- **Code Agent Mode** - Jupyter notebook-like code execution for complex automation
- **MCP Server** - Model Context Protocol support for Claude Desktop integration
- **Video Recording** - Record browser sessions as video files

## Installation

```bash
pip install openbrowser-ai
```

### With Optional Dependencies

```bash
# Install with all LLM providers
pip install openbrowser-ai[all]

# Install specific providers
pip install openbrowser-ai[anthropic]  # Anthropic Claude
pip install openbrowser-ai[groq]       # Groq
pip install openbrowser-ai[ollama]     # Ollama (local models)
pip install openbrowser-ai[aws]        # AWS Bedrock
pip install openbrowser-ai[azure]      # Azure OpenAI

# Install with video recording support
pip install openbrowser-ai[video]
```

### Install Browser

```bash
uvx openbrowser install
# or
playwright install chromium
```

## Quick Start

### Basic Usage

```python
import asyncio
from openbrowser import Agent, ChatGoogle

async def main():
    agent = Agent(
        task="Go to google.com and search for 'Python tutorials'",
        llm=ChatGoogle(),
    )
    
    result = await agent.run()
    print(f"Result: {result}")

asyncio.run(main())
```

### With Different LLM Providers

```python
from openbrowser import Agent, ChatOpenAI, ChatAnthropic, ChatGoogle

# OpenAI
agent = Agent(task="...", llm=ChatOpenAI(model="gpt-4o"))

# Anthropic
agent = Agent(task="...", llm=ChatAnthropic(model="claude-sonnet-4-0"))

# Google Gemini
agent = Agent(task="...", llm=ChatGoogle(model="gemini-2.0-flash"))
```

### Using Browser Session Directly

```python
import asyncio
from openbrowser import BrowserSession, BrowserProfile

async def main():
    profile = BrowserProfile(
        headless=True,
        viewport_width=1920,
        viewport_height=1080,
    )
    
    session = BrowserSession(browser_profile=profile)
    await session.start()
    
    await session.navigate_to("https://example.com")
    screenshot = await session.screenshot()
    
    await session.stop()

asyncio.run(main())
```
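To persist the screenshot, a minimal sketch continuing the example above, assuming `screenshot()` returns a base64-encoded PNG string (as the raw CDP `Page.captureScreenshot` call does):

```python
import base64

# Continues the example above; assumes `screenshot` is a base64-encoded PNG string
with open("example.png", "wb") as f:
    f.write(base64.b64decode(screenshot))
```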

## Configuration

### Environment Variables

```bash
# Google (recommended)
export GOOGLE_API_KEY="..."

# OpenAI
export OPENAI_API_KEY="sk-..."

# Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."

# Groq
export GROQ_API_KEY="gsk_..."

# AWS Bedrock
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-west-2"

# Azure OpenAI
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"

# Browser-Use LLM (external service)
export BROWSER_USE_API_KEY="..."
```

### BrowserProfile Options

```python
from openbrowser import BrowserProfile

profile = BrowserProfile(
    headless=True,
    viewport_width=1280,
    viewport_height=720,
    disable_security=False,
    extra_chromium_args=["--disable-gpu"],
    record_video_dir="./recordings",
    proxy={
        "server": "http://proxy.example.com:8080",
        "username": "user",
        "password": "pass",
    },
)
```
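The profile takes effect when attached to a session, exactly as in the earlier example:

```python
from openbrowser import BrowserSession

# Attach the profile; start() launches the browser with these options
session = BrowserSession(browser_profile=profile)
```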

## Supported LLM Providers

| Provider | Class | Models |
|----------|-------|--------|
| **Google** | `ChatGoogle` | gemini-2.0-flash, gemini-1.5-pro |
| **OpenAI** | `ChatOpenAI` | gpt-4o, o3, gpt-4-turbo |
| **Anthropic** | `ChatAnthropic` | claude-sonnet-4-0, claude-3-opus |
| **Groq** | `ChatGroq` | llama-3.3-70b-versatile, mixtral-8x7b |
| **AWS Bedrock** | `ChatAWSBedrock` | claude-3, amazon.titan |
| **Azure OpenAI** | `ChatAzureOpenAI` | Any Azure-deployed model |
| **Ollama** | `ChatOllama` | llama3, mistral (local) |
| **OCI** | `ChatOCIRaw` | Oracle Cloud GenAI models |
| **Browser-Use** | `ChatBrowserUse` | External LLM service |

## MCP Server (Claude Desktop Integration)

OpenBrowser includes an MCP server for integration with Claude Desktop.

### Running the MCP Server

```bash
python -m openbrowser.mcp
```

### Claude Desktop Configuration

Add to your Claude Desktop config (`~/Library/Application Support/Claude/claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "openbrowser": {
      "command": "uvx",
      "args": ["openbrowser-ai", "mcp"],
      "env": {
        "GOOGLE_API_KEY": "..."
      }
    }
  }
}
```

## CLI Usage

```bash
# Run a browser automation task
uvx openbrowser run "Search for Python tutorials on Google"

# Install browser
uvx openbrowser install

# Run MCP server
uvx openbrowser mcp
```

## Project Structure

```
openbrowser-ai/
├── src/openbrowser/
│   ├── __init__.py          # Main exports
│   ├── cli.py                # CLI commands
│   ├── config.py             # Configuration
│   ├── actor/                # Element interaction
│   ├── agent/                # LangGraph agent
│   │   ├── graph.py          # Agent workflow
│   │   ├── service.py        # Agent class
│   │   └── views.py          # Data models
│   ├── browser/              # CDP browser control
│   │   ├── session.py        # BrowserSession
│   │   └── profile.py        # BrowserProfile
│   ├── code_use/             # Code agent
│   ├── dom/                  # DOM extraction
│   ├── llm/                  # LLM providers
│   │   ├── openai/
│   │   ├── anthropic/
│   │   ├── google/
│   │   ├── groq/
│   │   ├── aws/
│   │   ├── azure/
│   │   └── ...
│   ├── mcp/                  # MCP server
│   └── tools/                # Action registry
└── tests/                    # Test suite
```

## Testing

```bash
# Run tests
pytest tests/

# Run with verbose output
pytest tests/ -v
```

## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Forked from [browser-use](https://github.com/browser-use/browser-use)
- Built with [LangGraph](https://github.com/langchain-ai/langgraph)
- Uses [Playwright](https://playwright.dev/) for browser orchestration

## Contact

- **Email**: billy.suharno@gmail.com
- **GitHub**: [@billy-enrizky](https://github.com/billy-enrizky)
- **Repository**: [github.com/billy-enrizky/openbrowser-ai](https://github.com/billy-enrizky/openbrowser-ai)
- **Documentation**: [openbrowser.mintlify.app](https://openbrowser.mintlify.app)

---

**Made with love for the AI automation community**


---
# Supported Models
---

---
title: "Supported Models"
description: "Choose your favorite LLM"
icon: "microchip-ai"
---

OpenBrowser supports a wide variety of LLM providers. Choose the one that best fits your needs.

## Google Gemini [example](https://github.com/billy-enrizky/openbrowser-ai/blob/main/examples/models/gemini.py)

<Warning>
As of 2025-05, `GEMINI_API_KEY` is deprecated; use `GOOGLE_API_KEY` instead.
</Warning>

```python
from openbrowser import Agent, ChatGoogle
from dotenv import load_dotenv

# Read GOOGLE_API_KEY into env
load_dotenv()

# Initialize the model
llm = ChatGoogle(model='gemini-flash-latest')

# Create agent with the model
agent = Agent(
    task="Your task here",
    llm=llm
)
```

Required environment variables:

```bash .env
GOOGLE_API_KEY=
```


## OpenAI [example](https://github.com/billy-enrizky/openbrowser-ai/blob/main/examples/models/gpt-4.1.py)

The `o3` model is recommended for best accuracy.

```python
from openbrowser import Agent, ChatOpenAI

# Initialize the model
llm = ChatOpenAI(
    model="o3",
)

# Create agent with the model
agent = Agent(
    task="...", # Your task here
    llm=llm
)
```

Required environment variables:

```bash .env
OPENAI_API_KEY=
```

<Info>
  You can use any OpenAI compatible model by passing the model name to the
  `ChatOpenAI` class using a custom URL (or any other parameter that would go
  into the normal OpenAI API call).
</Info>
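This is the same pattern used in the DeepSeek, Novita, and OpenRouter sections below; a minimal sketch with placeholder endpoint, model name, and key:

```python
from openbrowser import Agent, ChatOpenAI

# Placeholders: point base_url at any OpenAI-compatible endpoint
llm = ChatOpenAI(
    model="your-model-name",
    base_url="https://your-provider.example.com/v1",
    api_key="your-api-key",
)

agent = Agent(
    task="Your task here",
    llm=llm
)
```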

## Anthropic [example](https://github.com/billy-enrizky/openbrowser-ai/blob/main/examples/models/claude-4-sonnet.py)

```python
from openbrowser import Agent, ChatAnthropic

# Initialize the model
llm = ChatAnthropic(
    model="claude-sonnet-4-0",
)

# Create agent with the model
agent = Agent(
    task="...", # Your task here
    llm=llm
)
```

And add the variable:

```bash .env
ANTHROPIC_API_KEY=
```

## Azure OpenAI [example](https://github.com/billy-enrizky/openbrowser-ai/blob/main/examples/models/azure_openai.py)

```python
from openbrowser import Agent, ChatAzureOpenAI

# Initialize the model
llm = ChatAzureOpenAI(
    model="o4-mini",
)

# Create agent with the model
agent = Agent(
    task="...", # Your task here
    llm=llm
)
```

Required environment variables:

```bash .env
AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/
AZURE_OPENAI_API_KEY=
```

## AWS Bedrock [example](https://github.com/billy-enrizky/openbrowser-ai/blob/main/examples/models/aws.py)

AWS Bedrock provides access to multiple model providers through a single API. We support both a general AWS Bedrock client and provider-specific convenience classes.

### General AWS Bedrock (supports all providers)

```python
from openbrowser import Agent, ChatAWSBedrock

# Works with any Bedrock model (Anthropic, Meta, AI21, etc.)
llm = ChatAWSBedrock(
    model="anthropic.claude-3-5-sonnet-20240620-v1:0",  # or any Bedrock model
    aws_region="us-east-1",
)

# Create agent with the model
agent = Agent(
    task="Your task here",
    llm=llm
)
```

### Anthropic Claude via AWS Bedrock (convenience class)

```python
from openbrowser import Agent, ChatAnthropicBedrock

# Anthropic-specific class with Claude defaults
llm = ChatAnthropicBedrock(
    model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    aws_region="us-east-1",
)

# Create agent with the model
agent = Agent(
    task="Your task here",
    llm=llm
)
```

### AWS Authentication

Required environment variables:

```bash .env
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_DEFAULT_REGION=us-east-1
```

You can also use AWS profiles or IAM roles instead of environment variables. The implementation supports:

- Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`)
- AWS profiles and credential files
- IAM roles (when running on EC2)
- Session tokens for temporary credentials
- AWS SSO authentication (`aws_sso_auth=True`)
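For example, a sketch using AWS SSO instead of static keys (model and region are illustrative; `aws_sso_auth` is the flag listed above):

```python
from openbrowser import Agent, ChatAWSBedrock

# Authenticate via AWS SSO rather than static access keys
llm = ChatAWSBedrock(
    model="anthropic.claude-3-5-sonnet-20240620-v1:0",
    aws_region="us-east-1",
    aws_sso_auth=True,
)

agent = Agent(
    task="Your task here",
    llm=llm
)
```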

## Groq [example](https://github.com/billy-enrizky/openbrowser-ai/blob/main/examples/models/llama4-groq.py)

```python
from openbrowser import Agent, ChatGroq

llm = ChatGroq(model="meta-llama/llama-4-maverick-17b-128e-instruct")

agent = Agent(
    task="Your task here",
    llm=llm
)
```

Required environment variables:

```bash .env
GROQ_API_KEY=
```

## Oracle Cloud Infrastructure (OCI) [example](https://github.com/billy-enrizky/openbrowser-ai/blob/main/examples/models/oci_models.py)

OCI provides access to various generative AI models including Meta Llama, Cohere, and other providers through their Generative AI service.

```python
from openbrowser import Agent, ChatOCIRaw

# Initialize the OCI model
llm = ChatOCIRaw(
    model_id="ocid1.generativeaimodel.oc1.us-chicago-1.amaaaaaask7dceya...",
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
    compartment_id="ocid1.tenancy.oc1..aaaaaaaayeiis5uk2nuubznrekd...",
    provider="meta",  # or "cohere"
    temperature=0.7,
    max_tokens=800,
    top_p=0.9,
    auth_type="API_KEY",
    auth_profile="DEFAULT"
)

# Create agent with the model
agent = Agent(
    task="Your task here",
    llm=llm
)
```

Required setup:
1. Set up OCI configuration file at `~/.oci/config`
2. Have access to OCI Generative AI models in your tenancy
3. Install the OCI Python SDK: `uv add oci` or `pip install oci`

Authentication methods supported:
- `API_KEY`: Uses API key authentication (default)
- `INSTANCE_PRINCIPAL`: Uses instance principal authentication
- `RESOURCE_PRINCIPAL`: Uses resource principal authentication
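A typical `~/.oci/config` looks like this (a sketch; the OCIDs, fingerprint, and key path are placeholders):

```ini
[DEFAULT]
user=ocid1.user.oc1..your-user-ocid
fingerprint=aa:bb:cc:dd:ee:ff:00:11:22:33:44:55:66:77:88:99
key_file=~/.oci/oci_api_key.pem
tenancy=ocid1.tenancy.oc1..your-tenancy-ocid
region=us-chicago-1
```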

## Ollama

1. Install Ollama: https://github.com/ollama/ollama
2. Run `ollama serve` to start the server
3. In a new terminal, install the model you want to use: `ollama pull llama3.1:8b` (a 4.9 GB download)

```python
from openbrowser import Agent, ChatOllama

llm = ChatOllama(model="llama3.1:8b")
```
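As with the other providers, pass the model to an `Agent`; a minimal runnable sketch following the Quick Start pattern (no API key is needed for local models):

```python
import asyncio
from openbrowser import Agent, ChatOllama

async def main():
    # Local model served by `ollama serve`; no API key required
    llm = ChatOllama(model="llama3.1:8b")
    agent = Agent(task="Your task here", llm=llm)
    await agent.run()

asyncio.run(main())
```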

## Langchain

[Example](https://github.com/billy-enrizky/openbrowser-ai/blob/main/examples/models/langchain) on how to use Langchain with OpenBrowser.

## Qwen [example](https://github.com/billy-enrizky/openbrowser-ai/blob/main/examples/models/qwen.py)

Currently, only `qwen-vl-max` is recommended for OpenBrowser. Other Qwen models, including `qwen-max`, have issues with the action schema format.
Smaller Qwen models may return incorrect action schema formats (e.g., `actions: [{"navigate": "google.com"}]` instead of `[{"navigate": {"url": "google.com"}}]`). If you want to use other models, add concrete examples of the correct action format to your prompt, as sketched after the example below.

```python
from openbrowser import Agent, ChatOpenAI
from dotenv import load_dotenv
import os

load_dotenv()

# Get API key from https://modelstudio.console.alibabacloud.com/?tab=playground#/api-key
api_key = os.getenv('ALIBABA_CLOUD')
base_url = 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1'

llm = ChatOpenAI(model='qwen-vl-max', api_key=api_key, base_url=base_url)

agent = Agent(
    task="Your task here",
    llm=llm,
    use_vision=True
)
```
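If you do try a smaller Qwen model, one lightweight workaround (a sketch; the exact wording is up to you) is to append a format reminder to the task string:

```python
# Hypothetical format hint to steer models that emit malformed action schemas
format_hint = (
    'When emitting actions, always use the full schema, e.g. '
    '[{"navigate": {"url": "https://google.com"}}], never [{"navigate": "google.com"}].'
)

agent = Agent(task="Your task here\n\n" + format_hint, llm=llm)
```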

Required environment variables:

```bash .env
ALIBABA_CLOUD=
```

## ModelScope [example](https://github.com/billy-enrizky/openbrowser-ai/blob/main/examples/models/modelscope_example.py)

```python
from openbrowser import Agent, ChatOpenAI
from dotenv import load_dotenv
import os

load_dotenv()

# Get API key from https://www.modelscope.cn/docs/model-service/API-Inference/intro
api_key = os.getenv('MODELSCOPE_API_KEY')
base_url = 'https://api-inference.modelscope.cn/v1/'

llm = ChatOpenAI(model='Qwen/Qwen2.5-VL-72B-Instruct', api_key=api_key, base_url=base_url)

agent = Agent(
    task="Your task here",
    llm=llm,
    use_vision=True
)
```

Required environment variables:

```bash .env
MODELSCOPE_API_KEY=
```

## Other models (DeepSeek, Novita, OpenRouter...)

We support all other models that can be called via OpenAI compatible API. We are open to PRs for more providers.

### DeepSeek [example](https://github.com/billy-enrizky/openbrowser-ai/blob/main/examples/models/deepseek-chat.py)

```python
from openbrowser import Agent
from openbrowser.llm import ChatDeepSeek
import os

deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')

llm = ChatDeepSeek(
    base_url='https://api.deepseek.com/v1',
    model='deepseek-chat',
    api_key=deepseek_api_key,
)

agent = Agent(
    task='Your task here',
    llm=llm,
    use_vision=False,
)
```

Required environment variables:

```bash .env
DEEPSEEK_API_KEY=
```

### Novita [example](https://github.com/billy-enrizky/openbrowser-ai/blob/main/examples/models/novita.py)

```python
from openbrowser import Agent, ChatOpenAI
import os

api_key = os.getenv('NOVITA_API_KEY')

agent = Agent(
    task='Your task here',
    llm=ChatOpenAI(
        base_url='https://api.novita.ai/v3/openai',
        model='deepseek/deepseek-v3-0324',
        api_key=api_key,
    ),
    use_vision=False,
)
```

Required environment variables:

```bash .env
NOVITA_API_KEY=
```

### OpenRouter [example](https://github.com/billy-enrizky/openbrowser-ai/blob/main/examples/models/openrouter.py)

```python
from openbrowser import Agent, ChatOpenAI
import os

llm = ChatOpenAI(
    model='x-ai/grok-4',
    base_url='https://openrouter.ai/api/v1',
    api_key=os.getenv('OPENROUTER_API_KEY'),
)

agent = Agent(
    task='Your task here',
    llm=llm,
)
```

Required environment variables:

```bash .env
OPENROUTER_API_KEY=
```

## Browser-Use LLM [example](https://github.com/billy-enrizky/openbrowser-ai/blob/main/examples/models/browser_use_llm.py)

`ChatBrowserUse()` is an external LLM service from [browser-use.com](https://browser-use.com) optimized for browser automation tasks.

```python
from openbrowser import Agent, ChatBrowserUse

# Initialize the model
llm = ChatBrowserUse()

# Create agent with the model
agent = Agent(
    task="...", # Your task here
    llm=llm
)
```

Required environment variables:

```bash .env
BROWSER_USE_API_KEY=
```

Get your API key from [browser-use.com](https://cloud.browser-use.com/new-api-key).


---
# Quick Start
---

---
title: "Human Quickstart"
description: ""
icon: "rocket"
---

To get started with OpenBrowser you need to install the package and create an `.env` file with your API key.

## 1. Installing OpenBrowser


```bash create environment
pip install uv
uv venv --python 3.12
```
```bash activate environment
source .venv/bin/activate
# On Windows use `.venv\Scripts\activate`
```
```bash install openbrowser & chromium
uv pip install openbrowser-ai
uvx openbrowser install
```


## 2. Choose your favorite LLM
Create a `.env` file and add your API key. 

```bash .env
touch .env
```

<Info>On Windows, use `echo. > .env`</Info>

Then add your API key to the file.

<CodeGroup>
```bash Google
# add your key to .env file
GOOGLE_API_KEY=
# Get your free Gemini API key from https://aistudio.google.com/app/u/1/apikey
```
```bash OpenAI
# add your key to .env file
OPENAI_API_KEY=
```
```bash Anthropic
# add your key to .env file
ANTHROPIC_API_KEY=
```
</CodeGroup>

See [Supported Models](/supported-models) for more.

## 3. Run your first agent

<CodeGroup>
```python OpenBrowser
from openbrowser import Agent, ChatBrowserUse
from dotenv import load_dotenv
import asyncio

load_dotenv()

async def main():
    llm = ChatBrowserUse()
    task = "Find the number 1 post on Show HN"
    agent = Agent(task=task, llm=llm)
    await agent.run()

if __name__ == "__main__":
    asyncio.run(main())
```
```python Google
from openbrowser import Agent, ChatGoogle
from dotenv import load_dotenv
import asyncio

load_dotenv()

async def main():
    llm = ChatGoogle(model="gemini-flash-latest")
    task = "Find the number 1 post on Show HN"
    agent = Agent(task=task, llm=llm)
    await agent.run()

if __name__ == "__main__":
    asyncio.run(main())
```
```python OpenAI
from openbrowser import Agent, ChatOpenAI
from dotenv import load_dotenv
import asyncio

load_dotenv()

async def main():
    llm = ChatOpenAI(model="gpt-4.1-mini")
    task = "Find the number 1 post on Show HN"
    agent = Agent(task=task, llm=llm)
    await agent.run()

if __name__ == "__main__":
    asyncio.run(main())
```
```python Anthropic
from openbrowser import Agent, ChatAnthropic
from dotenv import load_dotenv
import asyncio

load_dotenv()

async def main():
    llm = ChatAnthropic(model='claude-sonnet-4-0', temperature=0.0)
    task = "Find the number 1 post on Show HN"
    agent = Agent(task=task, llm=llm)
    await agent.run()

if __name__ == "__main__":
    asyncio.run(main())
```
</CodeGroup>

<Note>Custom browsers can be configured in one line. Check out <a href="customize/browser/basics">browsers</a> for more.</Note>

## 4. Going to Production

For production deployments, see [Going to Production](/production) for best practices on running OpenBrowser in production environments.
