Metadata-Version: 2.4
Name: QuerySUTRA
Version: 0.4.0
Summary: SUTRA: Structured-Unstructured-Text-Retrieval-Architecture - Complete data extraction
Author: Aditya Batta
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.3.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: openai>=1.0.0
Requires-Dist: plotly>=5.0.0
Requires-Dist: matplotlib>=3.3.0
Requires-Dist: PyPDF2>=3.0.0
Requires-Dist: python-docx>=0.8.11
Requires-Dist: openpyxl>=3.0.0
Provides-Extra: mysql
Requires-Dist: sqlalchemy>=1.4.0; extra == "mysql"
Requires-Dist: mysql-connector-python>=8.0.0; extra == "mysql"
Provides-Extra: postgres
Requires-Dist: sqlalchemy>=1.4.0; extra == "postgres"
Requires-Dist: psycopg2-binary>=2.9.0; extra == "postgres"
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=2.0.0; extra == "embeddings"
Provides-Extra: all
Requires-Dist: sqlalchemy>=1.4.0; extra == "all"
Requires-Dist: mysql-connector-python>=8.0.0; extra == "all"
Requires-Dist: psycopg2-binary>=2.9.0; extra == "all"
Requires-Dist: sentence-transformers>=2.0.0; extra == "all"
Dynamic: license-file
Dynamic: requires-python

# QuerySUTRA

**SUTRA: Structured-Unstructured-Text-Retrieval-Architecture**

Professional Python library for AI-powered data analysis with automatic entity extraction, natural language querying, and intelligent caching.

## Installation

```bash
pip install QuerySUTRA

# With optional features (quote the extras so your shell doesn't expand the brackets)
pip install "QuerySUTRA[embeddings]"  # Smart caching
pip install "QuerySUTRA[mysql]"       # MySQL support
pip install "QuerySUTRA[postgres]"    # PostgreSQL support
pip install "QuerySUTRA[all]"         # All features
```

## Key Features

### 1. Complete Data Extraction from Large Documents
Processes entire documents in chunks, so no data is lost on large PDFs.

```python
from sutra import SUTRA

sutra = SUTRA(api_key="your-openai-key")
sutra.upload("large_document.pdf")  # Extracts ALL data, not just first page

# Automatically creates multiple tables:
# - document_people (40 rows, 8 columns)
# - document_skills (50 rows, 5 columns)
# - document_technologies (30 rows, 4 columns)
# - document_projects (25 rows, 6 columns)
# etc.
```

### 2. Natural Language Querying

```python
result = sutra.ask("Show me all people from New York")
print(result.data)

# With visualization
result = sutra.ask("Show sales by region", viz="pie")
```

### 3. Load Existing Databases

```python
# Load SQLite database
sutra = SUTRA.load_from_db("sutra.db", api_key="your-key")

# Connect to MySQL
sutra = SUTRA.connect_mysql("localhost", "root", "password", "database")

# Connect to PostgreSQL
sutra = SUTRA.connect_postgres("localhost", "postgres", "password", "database")
```

### 4. Import SQLite to MySQL

**Step 1: Export the Database (in Colab or Python)**
```python
# After uploading your data
sutra.upload("data.pdf")
sutra.tables()

# Export to SQLite
sutra.export_db("my_data.db", format="sqlite")

# In Colab, download the file
from google.colab import files
files.download("my_data.db")
```

**Step 2: Import to MySQL (on Windows)**

Method A: Using QuerySUTRA
```python
from sutra import SUTRA

# Load the SQLite database
sutra = SUTRA(api_key="your-key", db="my_data.db")

# Verify tables
sutra.tables()

# Export to MySQL
sutra.save_to_mysql("localhost", "root", "password", "my_database")
```

Method B: Using a `simple_import.py` Script

Download the conversion script from the repository or create `simple_import.py` yourself:

```python
import sqlite3
import mysql.connector
import pandas as pd

# Configuration
SQLITE_DB = "my_data.db"
MYSQL_HOST = "localhost"
MYSQL_USER = "root"
MYSQL_PASSWORD = "your_password"
MYSQL_DATABASE = "my_database"

# Connect to SQLite
sqlite_conn = sqlite3.connect(SQLITE_DB)
cursor = sqlite_conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = [row[0] for row in cursor.fetchall()]

# Connect to MySQL and create database
mysql_conn_temp = mysql.connector.connect(
    host=MYSQL_HOST, user=MYSQL_USER, password=MYSQL_PASSWORD
)
temp_cursor = mysql_conn_temp.cursor()
temp_cursor.execute(f"CREATE DATABASE IF NOT EXISTS {MYSQL_DATABASE}")
temp_cursor.close()
mysql_conn_temp.close()

# Connect to database and import
mysql_conn = mysql.connector.connect(
    host=MYSQL_HOST, user=MYSQL_USER, 
    password=MYSQL_PASSWORD, database=MYSQL_DATABASE
)
mysql_cursor = mysql_conn.cursor()

for table in tables:
    df = pd.read_sql_query(f'SELECT * FROM "{table}"', sqlite_conn)

    # Recreate the table with a basic SQLite-to-MySQL dtype mapping
    mysql_cursor.execute(f"DROP TABLE IF EXISTS `{table}`")
    cols = []
    for col in df.columns:
        dtype = 'INT' if df[col].dtype == 'int64' else 'FLOAT' if df[col].dtype == 'float64' else 'TEXT'
        cols.append(f"`{col}` {dtype}")
    mysql_cursor.execute(f"CREATE TABLE `{table}` ({', '.join(cols)})")

    # Insert data (NaN becomes NULL; numpy scalars become plain Python values)
    if len(df) > 0:
        placeholders = ', '.join(['%s'] * len(df.columns))
        rows = [
            [None if pd.isna(v) else (v.item() if hasattr(v, 'item') else v) for v in row]
            for row in df.itertuples(index=False)
        ]
        mysql_cursor.executemany(f"INSERT INTO `{table}` VALUES ({placeholders})", rows)

    mysql_conn.commit()
    print(f"Imported {table}: {len(df)} rows")

cursor.close()
sqlite_conn.close()
mysql_cursor.close()
mysql_conn.close()
print("Complete!")
```

Run: `python simple_import.py`

**Step 3: Verify in MySQL**
```sql
USE my_database;
SHOW TABLES;
SELECT * FROM employee_data_people;
```

### 5. Custom Visualizations

```python
result = sutra.ask("Sales by region", viz="pie")       # Pie chart
result = sutra.ask("Trends", viz="line")               # Line chart
result = sutra.ask("Compare", viz="bar")               # Bar chart
result = sutra.ask("Correlation", viz="scatter")       # Scatter plot
result = sutra.ask("Data", viz="table")                # Table view
result = sutra.ask("Analysis", viz="heatmap")          # Heatmap
result = sutra.ask("Auto", viz=True)                   # Auto-detect
```

### 6. Smart Fuzzy Matching

```python
sutra = SUTRA(api_key="your-key", fuzzy_match=True)

# "New York City" automatically matches "New York"
result = sutra.ask("Who are from New York City?")
```
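The idea behind fuzzy matching can be sketched with the standard library's `difflib`. This is an illustrative sketch, not QuerySUTRA's internal implementation; `fuzzy_lookup` is a hypothetical helper:

```python
import difflib

def fuzzy_lookup(value, known_values, cutoff=0.6):
    """Map a query term to the closest stored value,
    e.g. 'New York City' -> 'New York'. Illustrative only."""
    matches = difflib.get_close_matches(value, known_values, n=1, cutoff=cutoff)
    return matches[0] if matches else value

cities = ["New York", "Dallas", "San Francisco"]
print(fuzzy_lookup("New York City", cities))  # -> New York
print(fuzzy_lookup("Dalas", cities))          # -> Dallas (typo tolerated)
```

The `cutoff` plays the same role as a similarity threshold: too low and unrelated terms match, too high and useful near-matches are missed.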

### 7. Intelligent Caching with Embeddings

```python
sutra = SUTRA(api_key="your-key", use_embeddings=True)

result = sutra.ask("Show sales")           # Calls API
result = sutra.ask("Display sales data")   # Uses cache (no API call)
```
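The mechanism can be illustrated with a toy semantic cache. QuerySUTRA uses real sentence embeddings (via `sentence-transformers`); this self-contained sketch substitutes a bag-of-words cosine similarity purely to show how a "close enough" query can reuse a cached answer:

```python
import math
from collections import Counter

class SemanticCache:
    """Toy semantic cache: reuses a stored answer when a new query
    is sufficiently similar to a cached one. Illustrative only."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.entries = []  # list of (query_vector, answer) pairs

    @staticmethod
    def _vec(text):
        # Stand-in for a real embedding model
        return Counter(text.lower().split())

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def get(self, query):
        v = self._vec(query)
        for cached_v, answer in self.entries:
            if self._cosine(v, cached_v) >= self.threshold:
                return answer
        return None  # cache miss: caller falls through to the API

    def put(self, query, answer):
        self.entries.append((self._vec(query), answer))

cache = SemanticCache()
cache.put("show sales data", "SELECT * FROM sales")
print(cache.get("show sales"))  # similar wording hits the cache
```

A real embedding model makes this robust to synonyms ("display" vs "show"), which simple word overlap cannot capture.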

### 8. Irrelevant Query Detection

```python
sutra = SUTRA(api_key="your-key", check_relevance=True)

result = sutra.ask("What is the weather?")
# Warns: "Query may be irrelevant to your database"
```

### 9. Direct SQL Access

```python
result = sutra.sql("SELECT * FROM people WHERE city='New York'")
print(result.data)
```

## Complete Configuration

```python
sutra = SUTRA(
    api_key="your-openai-key",
    db="database.db",              # SQLite path
    use_embeddings=True,           # Smart caching (saves API calls)
    check_relevance=True,          # Detect irrelevant queries
    fuzzy_match=True,              # Better NLP
    cache_queries=True             # Simple caching
)
```

## Supported Formats

CSV, Excel, JSON, SQL, PDF, Word, Text, Pandas DataFrame

## How It Works

### Multi-Table Entity Extraction

From a single PDF, QuerySUTRA automatically creates multiple related tables:

**Input:** Employee PDF with 40 employees

**Output Tables:**
- employee_data_people (40 rows)
- employee_data_skills (50 rows)
- employee_data_technologies (30 rows)
- employee_data_projects (25 rows)
- employee_data_certifications (20 rows)
- employee_data_education (40 rows)
- employee_data_work_experience (35 rows)

### Proper Relational Structure

Tables have unique primary keys and proper foreign key relationships:

```
people table:
  id=1, name="John Doe", city="Dallas"
  id=2, name="Jane Smith", city="New York"

skills table:
  id=1, person_id=1, skill_name="Python"
  id=2, person_id=1, skill_name="SQL"
  id=3, person_id=2, skill_name="Java"
```
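The structure above can be reproduced as a self-contained sketch with the stdlib `sqlite3` module (QuerySUTRA builds equivalent tables automatically during extraction), showing how the `person_id` foreign key joins entities back together:

```python
import sqlite3

# Build the people/skills schema shown above in an in-memory database
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (
        id INTEGER PRIMARY KEY,
        name TEXT,
        city TEXT
    );
    CREATE TABLE skills (
        id INTEGER PRIMARY KEY,
        person_id INTEGER REFERENCES people(id),
        skill_name TEXT
    );
    INSERT INTO people VALUES (1, 'John Doe', 'Dallas'), (2, 'Jane Smith', 'New York');
    INSERT INTO skills VALUES (1, 1, 'Python'), (2, 1, 'SQL'), (3, 2, 'Java');
""")

# The foreign key lets you recombine the split entities
rows = conn.execute("""
    SELECT p.name, s.skill_name
    FROM people p JOIN skills s ON s.person_id = p.id
    ORDER BY p.id, s.id
""").fetchall()
print(rows)  # [('John Doe', 'Python'), ('John Doe', 'SQL'), ('Jane Smith', 'Java')]
```

The same join can be run through `sutra.sql()` once the tables exist.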

### Chunk Processing for Large Documents

v0.4.0 processes documents in 10,000-character chunks, ensuring all data is extracted:

- PDF with 50 employees: Extracts all 50 (not just first 10)
- Large documents: Processes entire content
- Merges results with unique IDs across chunks
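The chunk-and-merge idea can be sketched in a few lines. This is not the library's internal code, and the per-chunk extraction output below is hypothetical; it only illustrates how rows from separate chunks get re-numbered so IDs stay unique:

```python
def chunk_text(text, size=10_000):
    """Split a document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def merge_chunk_results(chunk_rows):
    """Re-number rows extracted from each chunk so IDs are unique
    across the whole document."""
    merged, next_id = [], 1
    for rows in chunk_rows:
        for row in rows:
            merged.append({**row, "id": next_id})
            next_id += 1
    return merged

# Hypothetical extraction output: IDs restart in each chunk
chunk_rows = [
    [{"id": 1, "name": "John Doe"}, {"id": 2, "name": "Jane Smith"}],
    [{"id": 1, "name": "Bob Lee"}],
]
print(merge_chunk_results(chunk_rows))
```

Because every chunk is processed, a record on page 40 is extracted just like one on page 1.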

## API Reference

### Class Methods

`SUTRA.load_from_db(db_path, api_key, **kwargs)` - Load existing SQLite

`SUTRA.connect_mysql(host, user, password, database, ...)` - Connect to MySQL

`SUTRA.connect_postgres(host, user, password, database, ...)` - Connect to PostgreSQL

### Instance Methods

`upload(data, name, extract_entities)` - Upload data

`ask(question, viz, table)` - Natural language query

`sql(query, viz)` - Raw SQL query

`tables()` - List all tables

`schema(table)` - Show schema

`peek(table, n)` - Preview data

`export_db(path, format)` - Export database (sqlite/sql/json/excel)

`save_to_mysql(...)` - Export to MySQL

`save_to_postgres(...)` - Export to PostgreSQL

`backup(path)` - Create backup

`close()` - Close connection

## Common Workflows

### Workflow 1: Analyze PDF in Colab, Export to Local MySQL

```python
# In Colab
from sutra import SUTRA

sutra = SUTRA(api_key="your-key")
sutra.upload("document.pdf")
sutra.tables()

# Export and download
sutra.export_db("data.db", format="sqlite")
from google.colab import files
files.download("data.db")

# On Windows
sutra = SUTRA(api_key="your-key", db="data.db")
sutra.save_to_mysql("localhost", "root", "password", "my_database")
```

### Workflow 2: Load Existing Database and Query

```python
# No need to re-upload data
sutra = SUTRA.load_from_db("data.db", api_key="your-key")
result = sutra.ask("Your question", viz="pie")
```

### Workflow 3: Query MySQL Directly

```python
# Connect and query MySQL database
sutra = SUTRA.connect_mysql("localhost", "root", "password", "production_db")
result = sutra.ask("Show me latest transactions")
```

## Performance Tips

1. Use `load_from_db()` to avoid re-uploading
2. Use `sql()` for complex queries (no API cost)
3. Enable `use_embeddings=True` for caching similar queries
4. Enable `cache_queries=True` for exact query matches
5. For large PDFs (50+ pages), allow extra processing time
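The exact-match caching behind tip 4 can be sketched as a normalized dictionary lookup. This is an illustrative sketch, not QuerySUTRA's internal code; `ExactQueryCache` is a hypothetical name:

```python
class ExactQueryCache:
    """Sketch of exact-match query caching: identical questions
    (after whitespace/case normalization) skip the API call."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(question):
        # Normalize so trivial formatting differences still hit
        return " ".join(question.lower().split())

    def get(self, question):
        return self._store.get(self._key(question))

    def put(self, question, result):
        self._store[self._key(question)] = result

cache = ExactQueryCache()
cache.put("Show sales", "cached result")
print(cache.get("  show   SALES "))  # normalization makes this an exact hit
```

Combined with embedding-based caching (tip 3), this covers both literal repeats and paraphrased repeats of the same question.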

## Troubleshooting

**Only extracting 10 records instead of 50:**
- Fixed in v0.4.0 with chunk processing
- Upgrade: `pip install --upgrade QuerySUTRA`

**MySQL import fails:**
- Ensure MySQL database exists: `CREATE DATABASE my_database;`
- Install dependencies: `pip install QuerySUTRA[mysql]`
- Check MySQL is running: `mysql -u root -p`

**Colab disk I/O error:**
- Fixed in v0.4.0 with better connection handling
- Restart Colab runtime and try again

**connect_mysql() not found:**
- Update QuerySUTRA: `pip install --upgrade QuerySUTRA`
- Requires v0.3.0 or higher

## System Requirements

- Python 3.8+
- OpenAI API key
- ~100 MB of disk space (for embedding models)
- MySQL/PostgreSQL (optional, for database export)

## License

MIT License

## Changelog

### v0.4.0 (Latest)
- FIXED: Complete data extraction from large documents
- Chunk processing for PDFs with 50+ pages
- All employees extracted (not just first 10)
- Improved MySQL/PostgreSQL import
- Better error handling for Colab

### v0.3.x
- Added MySQL/PostgreSQL connectivity
- Smart caching with embeddings
- Fuzzy matching for better NLP
- Custom visualizations
- Irrelevant query detection

### v0.2.x
- Multi-table entity extraction
- Proper primary and foreign keys

### v0.1.x
- Initial release

---

**Made by Aditya Batta**
