Metadata-Version: 2.2
Name: rambot
Version: 0.1.2
Summary: Configurable web scraping framework designed to automate data extraction from web pages
Home-page: https://github.com/AlexVachon/rambot
Author: Alexandre Vachon
Author-email: Alexandre Vachon <alex.vachon@outlook.com>
License: MIT License
        
        Copyright (c) 2025 Alexandre Vachon
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Source, https://github.com/AlexVachon/rambot
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: botasaurus
Requires-Dist: sqlalchemy
Requires-Dist: loguru
Requires-Dist: pydantic-settings
Requires-Dist: pydantic
Requires-Dist: wheel
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# **Rambot: Versatile Web Scraping Framework**  



## **Description**    
Rambot is a versatile and configurable web scraping framework designed to automate data extraction from web pages. It provides an intuitive structure for:  
- Managing different scraping modes.  
- Automating browser navigation.  
- Handling logs and errors.  
- Performing advanced HTTP requests to interact with APIs.  



## **Installation**    
```bash
pip install rambot
```

### **ChromeDriver Dependency**  
Rambot uses `ChromeDriver` for automated browsing. Install it based on your operating system:  
- **Windows**: [Download ChromeDriver here](https://sites.google.com/chromium.org/driver/downloads) and add it to your `PATH`.
- **macOS**: Install via Homebrew:  
  ```bash
  brew install chromedriver
  ```
- **Linux** (APT-based distributions such as Ubuntu): Install via APT:  
  ```bash
  sudo apt install chromium-chromedriver
  ```



## **Key Features**    
### **1. Mode-Based Execution**  
- Supports multiple scraping modes via `ScraperModeManager`.
- Use the `@bind` decorator or `self.mode_manager.register()` to associate a function with a specific mode.
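
The internals of `ScraperModeManager` are not shown here. As a rough illustration of how a decorator can associate functions with mode names, the sketch below uses a plain dictionary as the registry. The names `bind_mode`, `run_mode`, and `_MODES` are hypothetical and are not part of rambot.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping mode names to functions.
# This is an illustration only, not rambot's ScraperModeManager.
_MODES: Dict[str, Callable[[], List[dict]]] = {}


def bind_mode(mode: str):
    """Return a decorator that registers a function under `mode`."""
    def decorator(func: Callable[[], List[dict]]):
        if mode in _MODES:
            raise ValueError(f"mode {mode!r} is already registered")
        _MODES[mode] = func
        return func  # the function itself is left unchanged
    return decorator


def run_mode(mode: str) -> List[dict]:
    """Look up the function registered for `mode` and call it."""
    if mode not in _MODES:
        raise ValueError(f"unknown mode {mode!r}")
    return _MODES[mode]()


@bind_mode(mode="cities")
def available_cities() -> List[dict]:
    # Placeholder data standing in for scraped results.
    return [{"link": "https://example.com/cities/calgary"}]
```

Calling `run_mode("cities")` dispatches to `available_cities`, while an unregistered name raises `ValueError` instead of failing silently.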

### **2. Headless Browser Control**  
- Integrates with `botasaurus` for automation.
- Supports proxy management, image blocking, and extension loading.
- Uses `ChromeDriver` to navigate and extract content.

### **3. Optimized Data Handling**  
- Saves extracted data in JSON format.
- Reads and processes existing data files as input.
- Models structured data using `Document`.

### **4. Error Management & Logging**  
- Centralized error handling with `ErrorConfig`.
- Uses `loguru` for detailed and structured logging.

### **5. Scraping Throttling & Delays**  
- Adds randomized delays between actions (`wait()`) to mimic human behavior.
- Helps keep request rates within a website's limits.
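
The signature of `wait()` is not documented in this section. As a minimal sketch of the underlying idea, a randomized delay can be drawn from a uniform range with the standard library. The helper name `wait_randomly` and its default bounds are assumptions made for illustration.

```python
import random
import time


def wait_randomly(min_seconds: float = 1.0, max_seconds: float = 3.0) -> float:
    """Sleep for a random duration in [min_seconds, max_seconds].

    Hypothetical helper; rambot's own wait() may behave differently.
    Returns the delay that was used, which is handy for logging.
    """
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

Varying the delay on each call avoids the fixed request rhythm that a constant `time.sleep` would produce.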

### **6. Useful Decorators**  
- `@errors`: Structured error handling.
- `@no_print`: Suppresses unwanted output.
- `@scrape`: Enforces function structure in scraping processes.
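
How rambot implements these decorators is not shown here. To illustrate the idea behind suppressing unwanted output, the sketch below redirects `stdout` for the duration of a call using the standard library. The decorator name `silence_stdout` is hypothetical and is not rambot's `@no_print`.

```python
import contextlib
import functools
import io


def silence_stdout(func):
    """Discard anything the wrapped function prints to stdout.

    Illustrative stand-in only; not rambot's @no_print.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Printed text goes into a throwaway buffer instead of the console.
        with contextlib.redirect_stdout(io.StringIO()):
            return func(*args, **kwargs)
    return wrapper


@silence_stdout
def noisy_add(a: int, b: int) -> int:
    print("adding numbers")  # suppressed by the decorator
    return a + b
```

The return value passes through unchanged, so only the printed text is dropped.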



## **Basic Usage**    

### **1. Create a Scraper**  
```python
from rambot.scraper import Scraper, bind
from rambot.scraper.models import Document
import typing

class App(Scraper):
    BASE_URL: str = "https://www.skipthedishes.com"

    @bind(mode="cities")
    def available_cities(self) -> typing.List[Document]:
        self.get("https://www.skipthedishes.com/canada-food-delivery")
        elements = self.find_all("h4 div a")
        return [
            Document(link=self.BASE_URL + href)
            for element in elements
            if (href := element.get_attribute("href"))
        ]
```

### **2. Run the Scraper**  
```python
if __name__ == "__main__":
    app = App()
    app.run()  # Runs the mode passed via --mode (for example, from the launch.json configuration below)
```

### **3. Configure `launch.json` in VSCode**  
```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "cities",
      "type": "python",
      "request": "launch",
      "program": "main.py",
      "justMyCode": false,
      "args": ["--mode", "cities"]
    }
  ]
}
```
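
The `args` entry above passes `--mode cities` to `main.py`, which is the same as running `python main.py --mode cities` from a terminal. How rambot reads this flag internally is not shown here. A minimal sketch of parsing such a flag with `argparse` could look like the following, where `parse_mode` is a hypothetical helper.

```python
import argparse
from typing import List, Optional


def parse_mode(argv: Optional[List[str]] = None) -> str:
    """Return the value of the --mode flag.

    Hypothetical helper for illustration; rambot handles this itself.
    When argv is None, argparse reads the flags from sys.argv.
    """
    parser = argparse.ArgumentParser(description="Select a scraping mode")
    parser.add_argument("--mode", required=True, help="name of the mode to run")
    args = parser.parse_args(argv)
    return args.mode
```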

### **4. Retrieve Results**  
Extracted data is saved in `{mode}.json` (here, `cities.json`):  
```json
{
  "data": [
    {"link": "https://www.skipthedishes.com/cities/calgary"},
    {"link": "https://www.skipthedishes.com/cities/brandon"},
    {"link": "https://www.skipthedishes.com/cities/welland"}
  ],
  "run_stats": {"status": "success", "message": null}
}
```
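
To consume this output in another script, the file can be read with the standard `json` module. The sketch below assumes the layout shown above (a `data` list and a `run_stats` object). The helpers `extract_links` and `load_links` are hypothetical and are not part of rambot.

```python
import json
from pathlib import Path
from typing import List


def extract_links(payload: dict) -> List[str]:
    """Return the `link` values from a parsed result payload.

    Assumes the structure shown above; raises if the run did not succeed.
    """
    status = payload.get("run_stats", {}).get("status")
    if status != "success":
        raise RuntimeError(f"run did not succeed: status={status!r}")
    return [item["link"] for item in payload.get("data", []) if "link" in item]


def load_links(path: Path) -> List[str]:
    """Read a `{mode}.json` file and return its links."""
    return extract_links(json.loads(path.read_text(encoding="utf-8")))
```

For example, `load_links(Path("cities.json"))` would return the city URLs as a list of strings.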



## **HTTP Request Module**    
### **Description**  
This module sends HTTP requests with automatic error handling, logging, and retries.

### **Example Usage**  
```python
# `module_name` is a placeholder; the scraper example below imports from `rambot.requests`.
from module_name import request

response = request(
    method="GET",
    url="http://example.com",
    options={"headers": {"User-Agent": "CustomAgent"}, "timeout": 10},
    max_retry=3,
    retry_wait=2
)
```
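
The `max_retry` and `retry_wait` parameters suggest a retry loop around each request. The module's internals are not shown here, so the following is only a sketch of that pattern. It assumes that `max_retry` counts total attempts and that `retry_wait` is a delay in seconds. The helper `call_with_retry` is hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(func: Callable[[], T], max_retry: int = 3, retry_wait: float = 2.0) -> T:
    """Call `func` until it succeeds or `max_retry` attempts have failed.

    Illustration only: assumes max_retry is the total number of attempts
    and retry_wait is the pause in seconds between attempts.
    """
    if max_retry < 1:
        raise ValueError("max_retry must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_retry + 1):
        try:
            return func()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < max_retry:
                time.sleep(retry_wait)  # pause before the next attempt
    assert last_error is not None
    raise last_error  # every attempt failed; surface the last error
```

Re-raising the last exception keeps the original failure visible to the caller once the attempts are exhausted.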

### **Using Proxies and Custom Headers**  
```python
response = request(
    method="POST",
    url="http://example.com/api",
    options={
        # Replace {port} with your proxy's port number.
        "proxies": {"http": "http://my-proxy.com:{port}", "https": "http://my-proxy.com:{port}"},
        "json": {"key": "value"},
        "headers": {"Authorization": "Bearer TOKEN"}
    },
    max_retry=5,
    retry_wait=3
)
```

### **Usage in a Scraper**  
```python
from rambot.requests import requests
from rambot.scraper import Scraper, bind
from rambot.scraper.models import Document
import typing

class App(Scraper):
    BASE_URL: str = "https://www.skipthedishes.com"

    def open(self, wait=True):
        if self.mode in ["cities"]:
            return  # Prevents browser from opening for this mode
        return super().open(wait)

    @bind(mode="cities")
    def cities(self) -> typing.List[Document]:
        response = requests.send(
            method="GET",
            url="https://www.skipthedishes.com/canada-food-delivery",
            options={"timeout": 15},
            max_retry=5,
            retry_wait=1.25
        )
        elements = response.select("h4 div a")
        return [
            Document(link=self.BASE_URL + href)
            for element in elements
            if (href := element.get("href"))
        ]
```
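
The examples build each link with `self.BASE_URL + href`, which assumes every `href` is a site-relative path such as `/cities/calgary`. If some links may already be absolute, `urllib.parse.urljoin` from the standard library handles both cases. The helper `absolute_link` below is a hypothetical addition, not part of rambot.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.skipthedishes.com"


def absolute_link(href: str, base_url: str = BASE_URL) -> str:
    """Resolve `href` against `base_url`.

    Site-relative paths are joined to the base, while URLs that are
    already absolute are returned unchanged.
    """
    return urljoin(base_url, href)
```

In the list comprehension above, `Document(link=absolute_link(href))` could then replace the string concatenation.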

### **Advantages**  
- **Scraping without a browser**: Reduces resource consumption.
- **Retry mechanism**: Retries failed requests (`max_retry`, `retry_wait`) instead of giving up on the first error.
- **Fast data extraction**: Parses HTML directly with `requests`.

With Rambot, automate and optimize your data extractions efficiently! 🚀

