Metadata-Version: 2.4
Name: descraper
Version: 0.2.1
Summary: A robust web scraping pipeline with smart static/dynamic fallback and semantic text classification.
Home-page: https://github.com/unan/descraper
Author: Ugurhan Colak
Author-email: ugurhancolak5544@gmail.com
License: MIT
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: beautifulsoup4
Requires-Dist: selenium
Requires-Dist: webdriver-manager
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# DeScraper 🕷️

A Python library that turns web pages into clean, structured, AI-ready Markdown content.

```python
from descraper import run_scrape

# Scrape an article, a list, or a table-heavy page
data = run_scrape("https://en.wikipedia.org/wiki/List_of_Byzantine_emperors")

# Get clean, LLM-ready markdown content
print(data['content'])
```

---

<details>
<summary><strong>🇬🇧 English Documentation</strong> (Click to expand)</summary>

### Key Features

*   🧠 **AI-Ready Content:** Converts messy HTML into clean Markdown, including `<table>` elements rendered as Markdown tables. Well suited to RAG pipelines.
*   🚀 **Smart Strategy:** Starts with a fast static scraper and automatically switches to a full browser engine (`Selenium`) when the page appears to be JavaScript-rendered or the static result is too thin.
*   🛡️ **Noise Reduction:** Removes ads, navigation menus, footers, and other boilerplate to isolate the main content of a page.
*   📦 **Production-Ready:** Built-in retries, timeouts, and user-agent management for reliable scraping.
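
The static-to-dynamic switch described under **Smart Strategy** could be decided by a heuristic along the following lines. This is a minimal sketch using only the standard library, not DeScraper's actual implementation: the function name `looks_js_rendered` and the 200-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts visible text characters and <script> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader would actually see.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_js_rendered(html, min_text_chars=200):
    """Hypothetical heuristic: scripts are present but visible text is thin."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.text_chars < min_text_chars


# An empty application shell suggests falling back to a browser engine.
spa_shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
print(looks_js_rendered(spa_shell))
```

A page that already carries its text in the static HTML would return `False` here, so the slower browser engine is only started when the cheap request yields little visible content.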

### Installation

```bash
pip install descraper
```
*Note: Dynamic mode requires Firefox to be installed. The necessary driver is downloaded automatically.*
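
Before relying on dynamic mode, it can help to check that Firefox is reachable. The sketch below is an assumption-laden convenience, not part of DeScraper: `firefox_on_path` is a hypothetical helper, and it only inspects `PATH`, so it can report a false negative on systems where Firefox is installed elsewhere (for example as an application bundle on macOS).

```python
import shutil


def firefox_on_path():
    """Return True if a `firefox` executable is found on PATH (hypothetical helper)."""
    return shutil.which("firefox") is not None


if not firefox_on_path():
    print("Firefox not found on PATH; dynamic mode may be unavailable.")
```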

### Output Structure

`run_scrape` returns a dictionary. The most important key is `content`, a clean, LLM-ready Markdown string holding the page's main content.

```json
{
  "url": "https://...",
  "title": "Page Title",
  "content": "# Title\n\nThe main article text, cleaned and formatted in markdown, including tables...",
  "structured_text": "[...]",
  "links": "{...}",
  "images": "[...]"
}
```
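
Because `content` is Markdown, one common next step in a RAG pipeline is to split it into heading-based chunks. The following is a minimal sketch under stated assumptions: `split_by_headings` is a hypothetical helper, not part of DeScraper, and `sample` is a hand-written stand-in for a real `run_scrape` result.

```python
import re


def split_by_headings(markdown):
    """Split Markdown into (heading, body) chunks at ATX headings (#, ##, ...)."""
    chunks = []
    heading, body = "", []

    def flush():
        # Skip the empty chunk that precedes the first heading.
        if heading or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            flush()
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return chunks


# Hand-written stand-in for data = run_scrape(...)
sample = {"content": "# Title\n\nIntro text.\n\n## Section\n\nMore text."}

for heading, body in split_by_headings(sample["content"]):
    print(f"{heading!r}: {body!r}")
```

The sketch treats every line starting with `#` as a heading, so a `#` comment inside a fenced code block would also start a new chunk. A Markdown parser is the safer choice if the scraped pages contain code.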

</details>

<br>

<details>
<summary><strong>🇹🇷 Turkish Documentation</strong> (Click to expand)</summary>

### Key Features

*   🧠 **AI-Ready Content:** Converts messy HTML into clean Markdown text, including turning `<table>` tags into Markdown tables. Ideal for RAG systems.
*   🚀 **Smart Strategy:** Automatically switches from the fast static scraper to a full browser engine (`Selenium`) when it detects JavaScript-rendered sites or thin content.
*   🛡️ **Noise Reduction:** Isolates the page's main content by cleaning out ads, menus, footers, and other irrelevant boilerplate.
*   📦 **Production-Grade:** Includes built-in retries, timeouts, and user-agent management for resilient, reliable scraping.

### Installation

```bash
pip install descraper
```
*Note: Dynamic mode requires the Firefox browser. The necessary driver is downloaded automatically.*

### Output Structure

DeScraper returns a dictionary whose most important key is `content`. This key holds a clean, LLM-ready text version of the page's main information.

```json
{
  "url": "https://...",
  "title": "Page Title",
  "content": "# Title\n\nThe main text, cleaned and formatted in markdown, including tables...",
  "structured_text": "[...]",
  "links": "{...}",
  "images": "[...]"
}
```

</details>

---

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
