Metadata-Version: 2.4
Name: pydefuddle
Version: 0.1.0
Summary: Python implementation of Defuddle - extract and clean web content as Markdown
Project-URL: Homepage, https://github.com/phalt/pydefuddle
Project-URL: Issues, https://github.com/phalt/pydefuddle/issues
Author-email: Paul Hallett <paulandrewhallett@googlemail.com>
License: MIT
License-File: LICENSE
Requires-Python: >=3.10
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: click>=8.1.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: lxml>=5.0.0
Requires-Dist: markdownify>=0.11.6
Requires-Dist: pyperclip>=1.8.0
Requires-Dist: rich>=13.0.0
Provides-Extra: dev
Requires-Dist: pytest-cov>=5.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Description-Content-Type: text/markdown

# pydefuddle

Python implementation of [Defuddle](https://github.com/kepano/defuddle) — extract and clean web content as Markdown.

Pass any HTML string (or a URL via the CLI) and get back clean, readable Markdown with rich metadata extracted from the page.

## Features

- **Content extraction** — finds the main article body, removes ads, navbars, sidebars, comments, cookie notices, paywalls, and other clutter
- **Metadata extraction** — title, author, published date, description, image, favicon, domain, language, site name via OpenGraph, Twitter Cards, Schema.org, and DOM fallbacks
- **Markdown conversion** — clean ATX-style Markdown with properly fenced code blocks (with language tags), tables, figures, and footnotes
- **Code block handling** — detects syntax highlighter markup from Prism, Highlight.js, Shiki, and others; normalises indentation and strips UI chrome (copy buttons, toolbars)
- **Image processing** — promotes lazy-loaded images, picks the highest-res srcset source, removes tracking pixels
- **CLI** — fetch any URL and copy the Markdown to your clipboard in one command
- **Few dependencies** — built on the standard library plus BeautifulSoup4, lxml, markdownify, click, httpx, rich, and pyperclip
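
To make the image-processing feature concrete, here is a minimal sketch of choosing the highest-resolution candidate from a `srcset` attribute. It is an illustration only, not pydefuddle's actual code; the function name `pick_largest_srcset_source` is hypothetical, and it assumes candidates are separated by a comma followed by whitespace.

```python
from __future__ import annotations

import re


def pick_largest_srcset_source(srcset: str) -> str | None:
    """Return the URL with the largest width (w) or density (x) descriptor."""
    best_url: str | None = None
    best_score = -1.0
    # Each candidate looks like "<url> <number>w", "<url> <number>x", or a bare URL.
    # Assumption: candidates are separated by a comma plus whitespace.
    for candidate in re.split(r",\s+", srcset.strip()):
        parts = candidate.split()
        if not parts:
            continue
        score = 1.0  # a bare URL is treated as 1x
        if len(parts) > 1:
            match = re.fullmatch(r"(\d+(?:\.\d+)?)([wx])", parts[1])
            if match:
                score = float(match.group(1))
        if score > best_score:
            best_url, best_score = parts[0], score
    return best_url


print(pick_largest_srcset_source("a.jpg 400w, b.jpg 1200w, c.jpg 800w"))
```

A valid `srcset` uses either width or density descriptors, not both, so comparing the raw numbers is enough for this sketch.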

## Installation

```bash
pip install pydefuddle
```

Or with [uv](https://docs.astral.sh/uv/):

```bash
uv add pydefuddle
```

## Python API

```python
from pydefuddle import defuddle

with open("page.html", encoding="utf-8") as f:
    html = f.read()

result = defuddle(html, url="https://example.com/article")

print(result.title)      # "How Python Works"
print(result.author)     # "Jane Smith"
print(result.published)  # "2024-03-15"
print(result.markdown)   # Clean Markdown string
```

### `defuddle(html, url="", **options)` — convenience function

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `markdown` | bool | `True` | Convert to Markdown (set `False` for clean HTML only) |
| `remove_low_scoring` | bool | `True` | Remove low-signal blocks via content scoring |
| `remove_small_images` | bool | `True` | Remove tracking pixels and tiny images |
| `remove_hidden_elements` | bool | `True` | Remove elements hidden with CSS |
| `content_selector` | str or None | `None` | Override content discovery with a CSS selector |
| `debug` | bool | `False` | Include removal debug info in result |
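
Because the options are plain keyword arguments, they can be assembled before the call. The following is a minimal sketch under that assumption; `build_options` and `extract` are hypothetical helper names, not part of pydefuddle, and only the option names are taken from the table above.

```python
from __future__ import annotations

from typing import Any


def build_options(
    *, html_only: bool = False, selector: str | None = None, debug: bool = False
) -> dict[str, Any]:
    """Map caller intent onto the keyword options documented in the table."""
    options: dict[str, Any] = {"markdown": not html_only, "debug": debug}
    if selector is not None:
        # Only override content discovery when a selector was actually given.
        options["content_selector"] = selector
    return options


def extract(html: str, url: str = "", **intent: Any):
    """Hypothetical wrapper: call defuddle() with options from build_options()."""
    # Imported lazily so build_options() stays usable without pydefuddle installed.
    from pydefuddle import defuddle

    return defuddle(html, url=url, **build_options(**intent))


print(build_options(html_only=True, selector="article"))
```

With this in place, `extract(html, url="https://example.com/article", html_only=True)` would request clean HTML only, assuming `defuddle` accepts the options as described.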

### `Defuddle` class

```python
from pydefuddle import Defuddle, DefuddleOptions

opts = DefuddleOptions(markdown=True, debug=True, content_selector="article")
result = Defuddle(html, url="https://example.com", options=opts).parse()

for removal in result.debug or []:  # debug may be None
    print(removal.name, removal.count, removal.selector)
```
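
The debug entries can also be condensed into a short report. This sketch assumes each entry exposes the `name`, `count`, and `selector` attributes used above and that `count` is an integer; `summarize_removals` is a hypothetical helper, and the sample data is a stand-in rather than real pydefuddle output.

```python
from types import SimpleNamespace


def summarize_removals(removals):
    """Return one formatted line per removal, largest counts first."""
    # Accept None so the helper is safe when debug info was not collected.
    ordered = sorted(removals or [], key=lambda r: r.count, reverse=True)
    return [f"{r.count:>4}  {r.name}  ({r.selector})" for r in ordered]


# Stand-in objects that mimic the attributes of debug entries (illustrative values).
sample = [
    SimpleNamespace(name="hidden", count=3, selector="[hidden]"),
    SimpleNamespace(name="ads", count=12, selector=".ad"),
]

for line in summarize_removals(sample):
    print(line)
```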

### `DefuddleResult` fields

```python
result.content       # str  — clean HTML
result.markdown      # str  — Markdown (empty if markdown=False)
result.title         # str
result.author        # str
result.published     # str  — ISO date / datetime string
result.description   # str
result.image         # str  — URL
result.favicon       # str  — URL
result.domain        # str
result.language      # str  — BCP 47 (e.g. "en", "fr")
result.site_title    # str
result.word_count    # int
result.parse_time    # float — milliseconds
result.debug         # list[DebugRemoval] | None
```
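
One way to use these fields is to save an article as a note with a metadata header. The sketch below is an assumption about a possible workflow, not a pydefuddle feature: `to_note` and `FRONT_MATTER_FIELDS` are hypothetical names, and the sample object stands in for a real `DefuddleResult`.

```python
import json
from types import SimpleNamespace

# Fields copied into the header; names match the result fields listed above.
FRONT_MATTER_FIELDS = ("title", "author", "published", "description", "domain", "language")


def to_note(result) -> str:
    """Render a YAML-style front matter header followed by the Markdown body."""
    lines = ["---"]
    for field in FRONT_MATTER_FIELDS:
        value = getattr(result, field, "")
        if value:  # skip metadata the page did not provide
            # json.dumps produces a double-quoted string, which YAML also accepts.
            lines.append(f"{field}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + result.markdown


# Stand-in for a DefuddleResult (illustrative values only).
sample_result = SimpleNamespace(
    title="How Python Works",
    author="",
    published="2024-03-15",
    description="",
    domain="example.com",
    language="en",
    markdown="# How Python Works",
)

print(to_note(sample_result))
```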

## CLI

### Fetch a URL → clipboard

```bash
pydefuddle fetch https://example.com/some-article
```

The Markdown is copied to your clipboard automatically.

### Options

```bash
pydefuddle fetch <url> --no-clipboard   # print to stdout instead
pydefuddle fetch <url> --output out.md  # write to file
pydefuddle fetch <url> --preview        # render in terminal with rich
pydefuddle fetch <url> --debug          # show removal steps
pydefuddle fetch <url> --no-markdown    # output clean HTML instead of Markdown
```
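
The same flags can be driven from a script. This is a minimal sketch that assumes the `pydefuddle` executable is on `PATH` and that the flags shown above can be combined; `build_fetch_command` and `fetch` are hypothetical helper names.

```python
import subprocess


def build_fetch_command(url, *, output=None, html=False):
    """Build the argument list for `pydefuddle fetch` from the flags above."""
    command = ["pydefuddle", "fetch", url]
    if output is not None:
        command += ["--output", output]  # write to a file
    else:
        command.append("--no-clipboard")  # print to stdout instead
    if html:
        command.append("--no-markdown")  # clean HTML instead of Markdown
    return command


def fetch(url, **kwargs):
    """Run the CLI and return whatever it printed to stdout."""
    completed = subprocess.run(
        build_fetch_command(url, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


print(build_fetch_command("https://example.com/some-article"))
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.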

### Parse a local file

```bash
pydefuddle parse page.html --no-clipboard
pydefuddle parse page.html --output article.md
```

## Development

```bash
git clone https://github.com/phalt/pydefuddle
cd pydefuddle
make install   # install deps with uv
make test      # run tests with coverage
make format    # ruff format + lint
```

## Credits

Based on [Defuddle](https://github.com/kepano/defuddle) by Steph Ango ([@kepano](https://github.com/kepano)), the original JavaScript library that powers [Obsidian Web Clipper](https://obsidian.md/clipper).

## License

MIT
