Metadata-Version: 2.4
Name: webtable2json
Version: 1.1.0
Summary: Extract HTML tables from webpages and convert them to JSON format
Home-page: https://github.com/yourusername/webtable2json
Author: Raja CSP Raman
Author-email: Raja CSP Raman <raja.r.csp@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/yourusername/webtable2json
Project-URL: Bug Reports, https://github.com/yourusername/webtable2json/issues
Project-URL: Source Code, https://github.com/yourusername/webtable2json
Keywords: html,table,json,web,scraping,beautifulsoup
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Internet :: WWW/HTTP :: Dynamic Content
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup :: HTML
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.25.0
Requires-Dist: beautifulsoup4>=4.9.0
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# webtable2json

A Python library to extract HTML tables from webpages and convert them to JSON format. Perfect for web scraping, data extraction, and converting tabular web data into structured JSON.

## Features

- Extract tables from URLs or HTML content
- Clean and normalize table data
- Handle complex table structures (thead, tbody, colspan, etc.)
- Preserve links and images with automatic URL normalization
- Specialized helpers for ranking sites (e.g., NIRF)
- Session support for better performance
- Built-in logging and error handling
- Save results directly to JSON files
- Filter tables by size requirements
- Type hints for better development experience
- Comprehensive error handling

## Installation

```bash
pip install webtable2json
```

## Quick Start

```python
from webtable2json import convert_url_to_json, WebTableToJSON

# Extract all tables from a URL
tables = convert_url_to_json("https://web.archive.org/web/20251115143556/https://www.nirfindia.org/Rankings/2025/ManagementRanking.html")

# Extract a specific table (0-based index)
table = convert_url_to_json("https://web.archive.org/web/20251115143556/https://www.nirfindia.org/Rankings/2025/ManagementRanking.html", table_index=0)

# Use the class for more control
converter = WebTableToJSON()
result = converter.url_to_json("https://web.archive.org/web/20251115143556/https://www.nirfindia.org/Rankings/2025/ManagementRanking.html")
```

## Usage Examples

### Basic Table Extraction

```python
from webtable2json import convert_url_to_json, save_tables_to_file

# Get all tables from a webpage
tables = convert_url_to_json("https://www.w3schools.com/html/html_tables.asp")

# Save to file
save_tables_to_file(tables, "extracted_tables.json")

for i, table in enumerate(tables):
    print(f"Table {i}: {table['row_count']} rows, {table['column_count']} columns")
    if table['data']:  # guard against empty tables
        print(f"First row: {table['data'][0]}")
```

### Working with Custom Headers

```python
from webtable2json import convert_url_to_json

# Custom headers for authentication or specific requirements
headers = {
    'Authorization': 'Bearer your-token',
    'User-Agent': 'My Custom Bot 1.0'
}

tables = convert_url_to_json("https://example.com", headers=headers)
```

### Working with HTML Content

```python
from webtable2json import convert_html_to_json

html = """
<table>
    <tr><th>Name</th><th>Website</th><th>Logo</th></tr>
    <tr>
        <td>Example Corp</td>
        <td><a href="https://example.com">Visit Site</a></td>
        <td><img src="logo.png" alt="Company Logo"></td>
    </tr>
</table>
"""

tables = convert_html_to_json(html, base_url="https://example.com")
print(tables[0]['data'])
# Output includes normalized URLs and image data
```

### Filtering and Utility Functions

```python
from webtable2json import convert_url_to_json, filter_tables_by_size, save_tables_to_file

# Get all tables
all_tables = convert_url_to_json("https://example.com")

# Filter tables with at least 5 rows and 3 columns
large_tables = filter_tables_by_size(all_tables, min_rows=5, min_cols=3)

# Save filtered results
save_tables_to_file(large_tables, "large_tables.json")
```

### Advanced Usage with Sessions and Timeouts

```python
from webtable2json import WebTableToJSON
import requests

# Reuse a session and set a longer timeout for better performance
session = requests.Session()
headers = {
    'User-Agent': 'My Custom Bot 1.0',
    'Accept': 'text/html,application/xhtml+xml'
}

converter = WebTableToJSON(headers=headers, session=session, timeout=60)
result = converter.url_to_json("https://example.com")
```

### Specialized Functions

```python
from webtable2json import get_main_table, get_clean_ranking_data

# Get the largest table (usually the main data table)
main_table = get_main_table("https://example.com/data-page")

# Specialized function for ranking websites
ranking_data = get_clean_ranking_data("https://www.nirfindia.org/Rankings/2025/ManagementRanking.html")
```

## API Reference

### Classes

#### `WebTableToJSON`

Main class for table extraction and conversion.

**Methods:**
- `__init__(headers=None, timeout=30, session=None)`: Initialize with optional custom headers, timeout, and session
- `fetch_webpage(url)`: Fetch HTML content from URL
- `normalize_url(url, base_url)`: Convert relative URLs to absolute URLs
- `extract_table_data(table, base_url=None)`: Extract data from BeautifulSoup table element
- `extract_tables_from_html(html_content, base_url=None)`: Extract all tables from HTML
- `url_to_json(url, table_index=None)`: Convert tables from URL to JSON
- `html_to_json(html_content, table_index=None, base_url=None)`: Convert tables from HTML to JSON
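The `normalize_url` step is presumably equivalent to resolving relative URLs with `urllib.parse.urljoin`; the exact internals are an assumption, but a minimal sketch of that behavior looks like this:

```python
from urllib.parse import urljoin

def normalize_url(url: str, base_url: str) -> str:
    # Resolve a possibly relative URL against the page it came from.
    # Absolute URLs pass through unchanged; relative paths are joined
    # onto the base.
    return urljoin(base_url, url)

print(normalize_url("logo.png", "https://example.com/page/"))
# -> https://example.com/page/logo.png
print(normalize_url("https://cdn.example.com/a.js", "https://example.com/"))
# -> https://cdn.example.com/a.js
```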

### Functions

#### `convert_url_to_json(url, table_index=None, headers=None)`
Convert tables from a URL to JSON format.

#### `convert_html_to_json(html_content, table_index=None, base_url=None)`
Convert tables from HTML content to JSON format.

#### `save_tables_to_file(tables, filename, indent=2)`
Save table data to a JSON file.

#### `filter_tables_by_size(tables, min_rows=1, min_cols=1)`
Filter tables by minimum size requirements.
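Internally this is most likely a simple filter over the `row_count` and `column_count` fields of each table dict (see the Output Format section); a hedged sketch of equivalent behavior, not the library's actual implementation:

```python
def filter_tables_by_size(tables, min_rows=1, min_cols=1):
    # Keep only tables meeting both thresholds.
    return [
        t for t in tables
        if t["row_count"] >= min_rows and t["column_count"] >= min_cols
    ]

tables = [
    {"table_index": 0, "row_count": 2, "column_count": 2},
    {"table_index": 1, "row_count": 10, "column_count": 4},
]
print(filter_tables_by_size(tables, min_rows=5, min_cols=3))
# -> [{'table_index': 1, 'row_count': 10, 'column_count': 4}]
```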

#### `get_main_table(url)`
Get the main data table from a URL (usually the largest table).

#### `get_clean_ranking_data(url)`
Specialized function for ranking websites like NIRF.

## Output Format

Each table is returned as a dictionary with the following structure:

```json
{
    "table_index": 0,
    "row_count": 10,
    "column_count": 3,
    "caption": "Optional table caption",
    "id": "table-id",
    "class": "table-class",
    "source_url": "https://example.com",
    "data": [
        {
            "Column 1": "Simple text value",
            "Column 2": {
                "text": "Link Text",
                "link": "https://example.com/page"
            },
            "Column 3": {
                "text": "Image description",
                "image": "https://example.com/image.jpg",
                "image_alt": "Alt text"
            }
        }
    ]
}
```
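Note that a cell value can be either a plain string or a dict (for links and images), so consumers should check the type before accessing `link` or `image` fields. A small traversal sketch using sample data shaped like the structure above:

```python
# Sample data mirroring the documented output format (not real scrape output).
table = {
    "row_count": 1,
    "column_count": 3,
    "data": [
        {
            "Column 1": "Simple text value",
            "Column 2": {"text": "Link Text", "link": "https://example.com/page"},
            "Column 3": {
                "text": "Image description",
                "image": "https://example.com/image.jpg",
                "image_alt": "Alt text",
            },
        }
    ],
}

for row in table["data"]:
    for header, cell in row.items():
        if isinstance(cell, dict):
            # Rich cell: text plus an optional link or image URL.
            print(header, cell["text"], cell.get("link") or cell.get("image"))
        else:
            # Plain string cell.
            print(header, cell)
```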

## Requirements

- Python 3.7+
- requests >= 2.25.0
- beautifulsoup4 >= 4.9.0

## License

MIT License - see LICENSE file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
