Metadata-Version: 2.4
Name: data-glance
Version: 0.1.1
Summary: Quick data profiling CLI for parquet and CSV files
Requires-Python: >=3.12
Requires-Dist: pandas>=2.0.0
Requires-Dist: polars>=1.0.0
Requires-Dist: pyarrow>=15.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: setuptools
Requires-Dist: typer>=0.12.0
Requires-Dist: ydata-profiling>=4.6.0
Provides-Extra: dev
Requires-Dist: mypy>=1.13.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.8.0; extra == 'dev'
Description-Content-Type: text/markdown

# data-glance

Fast data profiling CLI for parquet and CSV files. Powered by [ydata-profiling](https://github.com/ydataai/ydata-profiling) and [Polars](https://pola.rs/).

## Installation

Run or install from PyPI:

```bash
# Run with uvx (cached)
uvx data-glance profile data.parquet

# Install globally
uv tool install data-glance

# Install with pip
pip install data-glance
```

Or run directly from GitHub:

```bash
# Run from GitHub (always latest)
uvx --from git+https://github.com/bswrundquist/data-glance data-glance profile data.parquet
```

## Quick Start

```bash
# Profile a file
data-glance profile data.parquet

# Quick preview
data-glance head data.csv

# Check data quality
data-glance diagnose data.parquet

# View schema
data-glance schema data.parquet
```

## Commands

| Command    | Description                  |
| ---------- | ---------------------------- |
| `profile`  | Generate HTML profile report |
| `diagnose` | Check data quality issues    |
| `head`     | Preview first N rows         |
| `tail`     | Preview last N rows          |
| `schema`   | Display column types         |
| `stats`    | Quick statistics             |
| `count`    | Count rows (fast)            |
| `columns`  | List column names            |
| `unique`   | Show unique values           |
| `filter`   | Filter data by expression    |
| `sample`   | Extract random sample        |
| `convert`  | Convert between formats      |
| `compare`  | Compare two files            |
| `validate` | Validate data rules          |
| `info`     | File metadata                |
| `generate` | Create test data             |

## Profile Command

### Basic Usage

```bash
data-glance profile data.parquet
data-glance profile data.csv --preset quick
data-glance profile data.parquet --preset full
data-glance profile huge.parquet --sample 10000
```

### Column Filtering

```bash
data-glance profile data.csv --include "user_*,order_*"
data-glance profile data.csv --exclude "*_id,*_hash"
```

### Null Handling

```bash
data-glance profile data.csv --nulls drop-cols
data-glance profile data.csv --drop-null-threshold 0.5
data-glance profile data.csv --drop-constant
```

### Output Options

```bash
data-glance profile data.csv -o report.html
data-glance profile data.csv --json report.json
data-glance profile data.csv --no-browser
data-glance profile data.csv --dry-run
```

### CSV Options

```bash
data-glance profile data.tsv --delimiter tab
data-glance profile data.csv --encoding latin-1
data-glance profile messy.csv --ignore-errors
```

## Data Inspection

### head / tail - Preview Data

```bash
data-glance head data.parquet --rows 20
data-glance tail data.csv --rows 10
```

### schema - View Structure

```bash
data-glance schema data.parquet
```

### stats - Quick Statistics

```bash
data-glance stats data.parquet
```

### count - Row Count

```bash
data-glance count data.parquet          # Single file
data-glance count *.csv                 # Multiple files
data-glance count *.parquet --total     # Just the number
```

### columns - List Columns

```bash
data-glance columns data.parquet
data-glance columns data.csv --one       # One per line (for piping)
data-glance columns data.csv --types     # With data types
data-glance columns data.csv --one | grep user  # Filter columns
```

### unique - Value Distribution

```bash
data-glance unique data.csv status
data-glance unique data.parquet category --counts --sort
data-glance unique data.csv user_id --limit 50
```

### info - File Metadata

```bash
data-glance info data.parquet
```

Shows file size and modification time; for parquet files, also row count, column count, and row-group layout.

## Data Operations

### filter - Query Data

```bash
# Filter by condition
data-glance filter data.csv "col('status') == 'active'"
data-glance filter data.parquet "col('age') > 30" -o filtered.parquet
data-glance filter data.csv "col('name').str.contains('test')" --limit 100
```

Expression syntax follows Polars:

```python
col('column') == 'value'
col('column') > 100
col('column').is_in(['a', 'b'])
col('column').is_null()
col('column').str.contains('pattern')
```

### sample - Extract Sample

```bash
data-glance sample data.parquet sample.parquet -n 1000
data-glance sample big.csv small.csv --fraction 0.1
data-glance sample data.parquet sample.csv  # Convert while sampling
```

### convert - Format Conversion

```bash
data-glance convert data.csv data.parquet
data-glance convert data.parquet data.csv
data-glance convert data.csv data.parquet --compression zstd
```

## Data Quality

### diagnose - Quality Check

```bash
data-glance diagnose data.csv
```

Shows: schema, null percentages, quality issues, suggested fixes.

### compare - Diff Files

```bash
data-glance compare data_v1.parquet data_v2.parquet
```

Shows: row/column differences, schema changes, null changes.

### validate - Check Rules

```bash
# Check for nulls
data-glance validate data.csv --no-nulls "id,email"

# Check uniqueness
data-glance validate data.parquet --unique "id"

# Check row count
data-glance validate data.csv --min-rows 1000

# Check null percentage
data-glance validate data.csv --max-null-pct 0.1

# Check required columns
data-glance validate data.csv --required-cols "id,name,email"

# Combine rules
data-glance validate data.parquet \
    --unique "id" \
    --no-nulls "id,email" \
    --min-rows 1000
```

Returns exit code 1 if validation fails (useful in CI/CD).

## Global Options

```bash
data-glance -q profile data.csv    # Quiet mode
data-glance -v profile data.csv    # Verbose mode
```

## Test Data

```bash
data-glance generate test.parquet --rows 5000
data-glance generate test.csv --edge-cases --nulls 0.1
```

## Presets

| Preset    | Speed  | Detail   | Use Case                  |
| --------- | ------ | -------- | ------------------------- |
| `quick`   | Fast   | Minimal  | Large files, quick checks |
| `default` | Medium | Standard | Most use cases            |
| `full`    | Slow   | Detailed | Deep analysis             |

## Tips

- Use `--preset quick` or `--sample` for large files
- Use `diagnose` before `profile` to understand data quality
- Use `--dry-run` to preview what will be profiled
- Use `validate` in CI/CD pipelines
- Use `count --total` for scripting
- Use `columns --one` to pipe to other tools
- Use `filter` to extract subsets before profiling

## Development

```bash
# Clone and install
git clone https://github.com/bswrundquist/data-glance
cd data-glance
make install-dev

# Run tests
make test

# Lint and format
make lint
make format

# Build
make build

# Release
make release-patch  # 0.1.0 -> 0.1.1
make release-minor  # 0.1.0 -> 0.2.0
make release-major  # 0.1.0 -> 1.0.0
```
