Metadata-Version: 2.4
Name: datashelf-py
Version: 0.1.2
Summary: A local dataset tracking tool.
Author-email: Rohan Krishnan <krishnan.rohan@outlook.com>
License: MIT License
        
        Copyright (c) 2025 Rohan
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/r0hankrishnan/datashelf
Project-URL: Repository, https://github.com/r0hankrishnan/datashelf
Project-URL: Issues, https://github.com/r0hankrishnan/datashelf/issues
Keywords: data,datasets,tracking,versioning,cli
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas<3.0,>=1.5
Requires-Dist: PyYAML<7.0,>=6.0
Requires-Dist: fastparquet>=2023.1.0
Requires-Dist: openpyxl<4.0,>=3.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# Datashelf

**Lightweight local dataset tracking for data science projects.**

![datashelf logo](https://raw.githubusercontent.com/r0hankrishnan/datashelf/main/assets/ds.svg)

Stop naming files `data_final_really_final.csv`. Datashelf stores tabular datasets as immutable artifacts and lets you retrieve them by name or hash — so your experiments stay reproducible without any heavy infrastructure.

```bash
$ datashelf save data/people.csv people_raw --message "initial load" --tag raw
Successfully saved 'people_raw' (hash: c8a2f8e1)

$ datashelf load people_raw --df
# Returns a pandas DataFrame, ready to use
```

---

## Why Datashelf?

Data science projects often accumulate files like this:

```
data.csv
data_clean.csv
data_final_v2.csv
data_final_really_final.csv
```

Datashelf replaces that chaos with **content-addressed storage**: each dataset is hashed (SHA-256), stored once as Parquet, and registered with metadata. If you try to save a duplicate, Datashelf tells you, and you can always retrieve your data by name or hash prefix.

---

## Installation

```bash
pip install git+https://github.com/r0hankrishnan/datashelf.git
```

> PyPI release coming soon.

---

## Quick Start

```bash
# Initialize in your project directory
datashelf init

# Save a dataset
datashelf save data/people.csv people_raw --message "initial load" --tag raw

# List what's stored
datashelf list

# Load it back into pandas
datashelf load people_raw --df
```

Or use the Python API:

```python
import datashelf as ds

ds.init()
ds.save("data/people.csv", name="people_raw", message="initial load", tag="raw")
df = ds.load("people_raw", to_df=True)
```

---

## Commands

| Command | Description |
|---|---|
| `datashelf init` | Initialize a `.datashelf/` repo in the current directory (best run from your project's root) |
| `datashelf save <path> <name>` | Store a dataset artifact |
| `datashelf list` | List all stored datasets |
| `datashelf show <name>` | Inspect metadata for a dataset |
| `datashelf load <name>` | Print the artifact path (use `--df` to load into pandas) |
| `datashelf checkout <name> <dest>` | Export an artifact to another location |

---

## How It Works

When you save a dataset, Datashelf:

1. Computes a SHA256 hash of the file contents
2. Normalizes it to Parquet and stores it at `.datashelf/artifacts/<hash>.parquet`
3. Registers metadata (name, tag, message, timestamp) in `.datashelf/metadata.json`

If you try to save the same data again under a different name, Datashelf detects the duplicate and asks if you want to update the metadata instead of storing a redundant copy.

```
.datashelf/
├── config.yaml
├── metadata.json
└── artifacts/
    └── c8a2f8e1...parquet
```
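The save-and-register flow above can be sketched with the standard library alone. This is an illustrative sketch, not Datashelf's actual implementation: the function names (`sha256_of_file`, `register`) are hypothetical, and step 2 (writing the Parquet artifact) is omitted for brevity.

```python
import hashlib
import json
import time
from pathlib import Path


def sha256_of_file(path: str) -> str:
    """Hash the file contents in chunks so large files never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def register(shelf: str, path: str, name: str, message: str = "", tag: str = ""):
    """Record an artifact in metadata.json, detecting duplicate content by hash.

    Returns (digest, registered_name, is_duplicate).
    """
    digest = sha256_of_file(path)
    meta_path = Path(shelf) / "metadata.json"
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    if digest in meta:
        # Same bytes already stored, possibly under a different name
        return digest, meta[digest]["name"], True
    meta[digest] = {"name": name, "tag": tag, "message": message,
                    "timestamp": time.time()}
    meta_path.write_text(json.dumps(meta, indent=2))
    return digest, name, False
```

Keying the metadata index by content hash rather than by name is what makes the duplicate check trivial: a second save of identical bytes hits the same key.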

---

## Design Philosophy

Datashelf deliberately tracks only tabular data. The core of the tool is duplicate detection and easy data organization: before storing anything, Datashelf checks whether you've already saved that data under a different name. For that check to work reliably, every dataset needs to be in a canonical format — you can't meaningfully compare a CSV and a Parquet of the same table without normalizing them first. I chose Parquet as the canonical format for its size benefits.
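To see why normalization matters, here is a stdlib-only illustration (using CSV and JSON rather than Parquet, so it runs without dependencies): the same table serialized in two formats hashes differently byte-for-byte, but identically once reduced to one canonical form. The `canonical_hash` helper is hypothetical, not part of Datashelf's API.

```python
import csv
import hashlib
import io
import json

rows = [{"id": "1", "name": "Ada"}, {"id": "2", "name": "Grace"}]

# Serialize the same table in two formats
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(rows)
csv_bytes = buf.getvalue().encode()
json_bytes = json.dumps(rows).encode()

# Byte-level hashes differ even though the data is identical
assert hashlib.sha256(csv_bytes).hexdigest() != hashlib.sha256(json_bytes).hexdigest()


def canonical_hash(table):
    """Hash a canonical serialization: sorted keys, one fixed encoding."""
    canon = json.dumps([dict(sorted(r.items())) for r in table], sort_keys=True)
    return hashlib.sha256(canon.encode()).hexdigest()


# Parse both back and compare in canonical form: now they match
parsed_csv = list(csv.DictReader(io.StringIO(csv_bytes.decode())))
parsed_json = json.loads(json_bytes.decode())
assert canonical_hash(parsed_csv) == canonical_hash(parsed_json)
```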

Accepting only tabular data is the direct consequence of that decision. It also makes future features like dataset diffing coherent — diffing only makes sense when you can compare rows and columns. Trying to extend Datashelf to handle images, audio, or arbitrary binary files would undermine both duplicate detection and diffing without adding much value over a general-purpose tool like DVC.

The scope is intentionally narrow: Datashelf does one thing well for one kind of data.

---

## Comparison

| Tool | Best for |
|---|---|
| **Datashelf** | Lightweight local dataset tracking on a single project |
| DVC | Full data version control with remote storage and pipeline orchestration |
| Git LFS | Large file versioning inside a Git repository |

Datashelf intentionally has no Git integration, no remote storage, and no pipeline orchestration. It's small and it stays out of your way.

---

## Supported File Types

Datashelf accepts `.csv`, `.parquet`, `.xlsx`, and `.json` files and normalizes everything to Parquet internally.
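One way such extension-based dispatch can look with pandas is sketched below. This is an illustrative sketch, not Datashelf's actual loader; `load_table` and `normalize` are hypothetical names.

```python
from pathlib import Path

import pandas as pd

# Map each supported extension to its pandas reader
READERS = {
    ".csv": pd.read_csv,
    ".parquet": pd.read_parquet,
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
}


def load_table(path: str) -> pd.DataFrame:
    """Load any supported file into a DataFrame, rejecting unknown extensions."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return READERS[suffix](path)


def normalize(path: str, dest: str) -> str:
    """Load any supported format and re-save it as Parquet, the canonical form."""
    load_table(path).to_parquet(dest)
    return dest
```

Failing fast on unsupported extensions keeps the "tabular only" boundary explicit rather than silently storing bytes the tool can't later diff or deduplicate.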

---

## Running Tests

```bash
pytest
```

---

## Roadmap

- [ ] PyPI release
- [ ] Dataset diffing
- [ ] Experiment tracking
- [ ] Dataset lineage
- [ ] Remote artifact storage

---

## License

MIT — see [LICENSE](./LICENSE).

---

## Contributing

See [CONTRIBUTING.md](./CONTRIBUTING.md).
