Metadata-Version: 2.3
Name: scidats
Version: 0.0.17
Summary: SciDatS is a python package for storing and retrieving scientific data stored in JSON-LD (semantically annotated JSON - Linked Data).
Author: mark doerr
Author-email: mark doerr <mark.doerr@uni-greifswald.de>
License: MIT
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries
Requires-Dist: fastparquet>=2026.3
Requires-Dist: pandas>=3
Requires-Dist: pyarrow>=23
Requires-Dist: pydantic>=2.13
Requires-Dist: pyld>=3
Requires-Dist: rdflib>=7.6
Requires-Python: ==3.13.*
Project-URL: Bug Tracker, https://github.com/markdoerr/scidats/issues
Project-URL: Changelog, https://github.com/markdoerr/scidats/blob/main/CHANGELOG.md
Project-URL: documentation, https://scidats.readthedocs.io
Project-URL: repository, https://github.com/markdoerr/scidats
Description-Content-Type: text/markdown

# SciDatS

<p align="center">
  <a href="https://github.com/markdoerr/scidats/actions/workflows/ci.yml?query=branch%3Amain">
    <img src="https://img.shields.io/github/actions/workflow/status/markdoerr/scidats/ci.yml?branch=main&label=CI&logo=github&style=flat-square" alt="CI Status" >
  </a>
  <a href="https://scidats.readthedocs.io">
    <img src="https://img.shields.io/readthedocs/scidats.svg?logo=read-the-docs&logoColor=fff&style=flat-square" alt="Documentation Status">
  </a>
  <a href="https://codecov.io/gh/markdoerr/scidats">
    <img src="https://img.shields.io/codecov/c/github/markdoerr/scidats.svg?logo=codecov&logoColor=fff&style=flat-square" alt="Test coverage percentage">
  </a>
</p>
<p align="center">
  <a href="https://github.com/astral-sh/uv">
    <img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json" alt="uv">
  </a>
  <a href="https://github.com/astral-sh/ruff">
    <img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json" alt="Ruff">
  </a>
  <a href="https://github.com/pre-commit/pre-commit">
    <img src="https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white&style=flat-square" alt="pre-commit">
  </a>
</p>
<p align="center">
  <a href="https://pypi.org/project/scidats/">
    <img src="https://img.shields.io/pypi/v/scidats.svg?logo=python&logoColor=fff&style=flat-square" alt="PyPI Version">
  </a>
  <img src="https://img.shields.io/pypi/pyversions/scidats.svg?style=flat-square&logo=python&amp;logoColor=fff" alt="Supported Python versions">
  <img src="https://img.shields.io/pypi/l/scidats.svg?style=flat-square" alt="License">
</p>

---

SciDatS is a Python package for storing and retrieving scientific data as parquet files with JSON-LD metadata (semantically annotated JSON - Linked Data).

This _Scientific Data Standard_ is designed as a data exchange format that enables the exchange and synchronisation of scientific data between different laboratories while preserving all metadata.

## Features

- efficient storage and retrieval of scientific data and metadata in a single file (parquet with JSON-LD metadata)
- convenient functions for retrieving data and metadata
- improved tooling based on pydantic and rdflib
- **reading and writing** for _SciDatS_ files
- coupling to the [LabDataReader framework](https://gitlab/opensourcelab/ScientificData/LabDataReader) - for transforming proprietary lab data into the semantically annotated SciDatS format
- the recommended metadata formats are DCAT application profiles, e.g. [DCAT-AP-PLUS](https://github.com/nfdi/dcat-ap-plus) or [Chem-DCAT-AP](https://github.com/nfdi/chem-dcat-ap)

## Design criteria

Here are some of the criteria the data / metadata standard has to fulfil (with the selected technology in brackets):

- data and metadata storage for scientific / machine learning needs (semantic annotation, based on ontologies, derivatives of owlready2)

  - proper nullable data / missing data handling (pyarrow / parquet)

  - data modalities, like range / limits, type / continuous / categorical / variable treatment in case of range violation (parquet metadata)

  - cardinality (parquet metadata)

- efficient storage (parquet)

- metadata and data stored at one place (parquet)

- metadata conservation when saving / loading / processing (parquet -> arrow)

- fast data exchange (arrow flight, MinIO active replication)

- fast loading (fastparquet, pyarrow)

- fast data processing without in-memory re-writing after loading (pandas with pyarrow backend, arrow flight, polars)

- "modalities" for the machine learning models

- semantic annotations / metadata in RDF compliant format - for creating instances of ontology classes and SPARQL reasoning (JSON-LD, rdflib, owlready2)

- fast data processing (direct loading into a pyarrow-driven dataframe)

- programming language agnostic / independent (parquet)

- easy to use (SciDatS / LabDataReader framework, currently being implemented)

- commonly used in ETL pipelines (Apache Spark, prefect, ...)

- suitable for S3 file storage systems (MinIO)

## Installation

You can install SciDatS via pip (or your favourite package manager):

```bash
# using pip
pip install scidats
# using uv
uv add scidats
# for development in a clone of the repository: install the dev and test dependency groups
uv sync --group dev --group test
```

## Tutorials

A tutorial on how to use SciDatS can be found in the [scidat_demo_tutorial.ipynb](https://gitlab.com/opensourcelab/scientificdata/scidats/-/blob/main/jupyter/scidat_demo_tutorial.ipynb) Jupyter notebook in the `jupyter` folder.

## Documentation

The documentation can be found here: [https://opensourcelab/scientificdata.gitlab.io/scidats](https://opensourcelab/scientificdata.gitlab.io/scidats)

**ReadTheDocs**: <a href="https://scidats.readthedocs.io" target="_blank">https://scidats.readthedocs.io</a>

**Source Code**: <a href="https://gitlab.com/opensourcelab/scientificdata/scidats" target="_blank">https://gitlab.com/opensourcelab/scientificdata/scidats</a>

---

## Contributors ✨

Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):

<!-- prettier-ignore-start -->
<!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section -->
<!-- markdownlint-disable -->
<!-- markdownlint-enable -->
<!-- ALL-CONTRIBUTORS-LIST:END -->
<!-- prettier-ignore-end -->

This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!

## Credits

[![Copier](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/copier-org/copier/master/img/badge/badge-grayscale-inverted-border-orange.json)](https://github.com/copier-org/copier)

This package was created with
[Copier](https://copier.readthedocs.io/) and the
[pypackage-template](https://gitlab.com/opensourcelab/software-dev/copier-pypackage)
project template.
