This document specifies a fully general tensor dataset and decomposition interface for an external library that will operate independently of Lunascope. The library must not assume any particular scientific domain, modality, event type, or application-specific metadata schema. It should accept numeric tensors with named axes plus arbitrary metadata, support lazy subsetting and concatenation across files, and return decomposition outputs that can be joined back onto source instances for plotting, browsing, filtering, annotation, or downstream analysis.
The core object is an axis-annotated tensor: a numeric array with named axes, optional explicit axis index arrays, per-axis metadata tables, and global attributes.
```python
from dataclasses import dataclass

import numpy as np
import pandas as pd
from numpy.typing import ArrayLike

@dataclass
class AxisAnnotatedTensor:
    data: ArrayLike
    axes: list[str]
    axis_index: dict[str, np.ndarray | None]
    axis_meta: dict[str, pd.DataFrame]
    attrs: dict
```

- `data`: numeric tensor, such as a NumPy array, Zarr-backed array, Dask array, CuPy array, or another array interface.
- `axes`: ordered axis names, for example `["instance", "sample", "channel"]`.
- `axis_index`: optional explicit axis coordinate arrays or ids, such as sample offsets, times, channel ids, or instance ids.
- `axis_meta`: per-axis metadata tables, one row per element of the corresponding axis, with arbitrary user-defined columns.
- `attrs`: global attributes.

Common axis names are `instance`, `sample`, `channel`, `feature`, and `component`.
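For concreteness, here is a minimal construction sketch; the example values and ids are hypothetical, not part of the spec:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 4 instances x 100 samples x 3 channels.
data = np.random.default_rng(0).normal(size=(4, 100, 3)).astype("float32")

tensor = AxisAnnotatedTensor(
    data=data,
    axes=["instance", "sample", "channel"],
    axis_index={
        "instance": np.array(["i0", "i1", "i2", "i3"]),
        "sample": np.arange(100),
        "channel": np.array(["ch_a", "ch_b", "ch_c"]),
    },
    axis_meta={
        "instance": pd.DataFrame({"instance_id": ["i0", "i1", "i2", "i3"],
                                  "label": ["target", "other", "target", "other"]}),
        "channel": pd.DataFrame({"channel_id": ["ch_a", "ch_b", "ch_c"]}),
    },
    attrs={"creator": "example-package"},
)
```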
Metadata are arbitrary typed annotations attached to named axes. The decomposition package should not encode domain-specific concepts such as subject, stage, modality, class label, condition, or source, but it must allow all of them to be carried through as ordinary metadata fields.
| Axis | Requirement | Examples |
|---|---|---|
| `instance` | One metadata row per instance. Must include a stable unique id. | `instance_id`, `class`, `score`, `provenance`, `group`, `tags`, arbitrary covariates |
| `sample` | One row per sample coordinate, if present. | `sample_index`, `sample_time`, masks, regions, windows |
| `channel` | One row per channel or feature dimension, if present. | `channel_id`, `label`, `unit`, `transform`, `grouping` |
A common case is a tensor with shape instance x sample x channel. Even then, the decomposition library should treat those names as conventions rather than hardcoded semantics.

```python
tensor.shape == (E, T, C)
axes == ["instance", "sample", "channel"]
```
For this case, a typical set of coordinates might be:
```python
axis_index = {
    "instance": instance_ids,
    "sample": sample_offsets_or_times,
    "channel": channel_ids,
}
```
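For illustration, concrete coordinate arrays matching the manifest example later in this document (the specific values are hypothetical):

```python
import numpy as np

# Hypothetical coordinates for a tensor of shape (42337, 1793, 3).
instance_ids = np.array([f"i{k:05d}" for k in range(42337)])
sample_offsets_or_times = np.arange(1793)  # sample offsets; timestamps also work
channel_ids = np.array(["ch_0", "ch_1", "ch_2"])
```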
The package should distinguish between a persistent dataset and lightweight lazy views onto that dataset. This allows users to open a compiled tensor once, then make many metadata-driven selections without repeatedly passing the full tensor or all metadata tables.
```python
from dataclasses import dataclass
from typing import Any

import pandas as pd

@dataclass
class TensorDataset:
    manifest: dict                      # parsed manifest.json
    store: Any                          # handle to the backing array store
    axis_meta: dict[str, pd.DataFrame]  # per-axis metadata tables
    attrs: dict

@dataclass
class TensorView:
    dataset: TensorDataset
    selectors: dict                     # accumulated per-axis selections
    attrs: dict
```
Views expose a small set of composable, lazy operations:

- `query(axis, expr)`
- `filter(axis, ids_or_mask)`
- `slice(axis, start, stop)`
- `take(axis, indices)`
- `materialize()`

```python
ds = open_dataset("/path/to/dataset")
view = ds.query(axis="instance", expr="label == 'target' and qc_pass")
view = view.filter(axis="channel", ids_or_mask=["ch_a", "ch_b", "ch_c"])
result = decompose(view, method="svd", config={...})
```
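A minimal sketch of how `query` might compose lazily, assuming selectors are resolved against per-axis metadata tables and nothing is read from the array store until `materialize()`; this is illustrative, not a mandated implementation:

```python
import numpy as np

def query(view: TensorView, axis: str, expr: str) -> TensorView:
    """Return a new view; no array data is read yet."""
    meta = view.dataset.axis_meta[axis]
    # pandas' DataFrame.query evaluates the expression against metadata columns.
    selected = meta.query(expr).index.to_numpy()
    prior = view.selectors.get(axis)
    if prior is not None:
        selected = np.intersect1d(prior, selected)  # compose with earlier selections
    return TensorView(dataset=view.dataset,
                      selectors={**view.selectors, axis: selected},
                      attrs=view.attrs)
```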
The preferred storage format is a directory-based bundle using:

- `zarr` for numeric arrays
- `parquet` for metadata tables
- `json` for a top-level manifest

```
dataset/
  manifest.json
  arrays/
    data.zarr
    masks.zarr        # optional
  meta/
    instance.parquet
    sample.parquet
    channel.parquet
  partitions/
    files.parquet     # optional but recommended
    groups.parquet    # optional but recommended
```
The manifest must explicitly declare the tensor shape, dtype, axis names, coordinate id columns, and paths to metadata and array stores.
```json
{
  "schema_version": "0.1.0",
  "tensor_name": "data",
  "shape": [42337, 1793, 3],
  "dtype": "float32",
  "axes": ["instance", "sample", "channel"],
  "axis_index_columns": {
    "instance": "instance_id",
    "sample": "sample_index",
    "channel": "channel_id"
  },
  "meta_files": {
    "instance": "meta/instance.parquet",
    "sample": "meta/sample.parquet",
    "channel": "meta/channel.parquet"
  },
  "array_store": "arrays/data.zarr",
  "mask_store": "arrays/masks.zarr",
  "attrs": {
    "creator": "example-package",
    "version": "0.1.0"
  }
}
```
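As a sketch, such a bundle could be written with `zarr`, `pandas`, and the standard library; the function name `write_bundle` is illustrative and error handling is omitted:

```python
import json
from pathlib import Path

import numpy as np
import zarr

def write_bundle(root: str, tensor: AxisAnnotatedTensor, manifest: dict) -> None:
    base = Path(root)
    (base / "arrays").mkdir(parents=True, exist_ok=True)
    (base / "meta").mkdir(exist_ok=True)
    # Numeric payload goes to zarr; metadata tables go to parquet.
    zarr.save(str(base / "arrays" / "data.zarr"), np.asarray(tensor.data))
    for axis, table in tensor.axis_meta.items():
        table.to_parquet(base / "meta" / f"{axis}.parquet")
    (base / "manifest.json").write_text(json.dumps(manifest, indent=2))
```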
Concatenation across many files or individuals should be a first-class use case. The specification should support compiling many source tensors into a single logical dataset while preserving lineage.
To preserve lineage, compiled datasets should carry instance-level columns such as:

- `instance_id`: globally unique within the compiled dataset
- `source_dataset_id`
- `source_file_id`
- `group_id`: arbitrary grouping key
- `source_instance_id`: original id in the source system, if one exists

and partition tables:

- `files.parquet`: one row per source file
- `groups.parquet`: one row per source group, such as session, participant, acquisition, block, or any arbitrary grouping

```python
compiled = concat_datasets(
    datasets,
    along="instance",
    align={"channel": "channel_id", "sample": "sample_index"},
    join="inner",
)
```
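A minimal sketch of the inner-join alignment step for the channel axis; the helper name `aligned_channel_ids` is illustrative:

```python
from functools import reduce

import numpy as np

def aligned_channel_ids(datasets, id_column: str = "channel_id") -> np.ndarray:
    """Channel ids shared by every source dataset, for join='inner'."""
    id_sets = [ds.axis_meta["channel"][id_column].to_numpy() for ds in datasets]
    return reduce(np.intersect1d, id_sets)
```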
Concatenation along instance should be the standard case. The spec should allow an optional validity mask, but for a first version it is reasonable to require that tensors are already regularized before decomposition; ragged inputs should be normalized upstream.

Recommended v1 policy: require dense, regular tensors at decomposition time and treat the validity mask as optional.
The decomposition package should accept a dataset view, a method name, and a fully serializable config.
```python
result = decompose(
    tensor_view,
    method="svd",
    config={
        "center": "none",
        "scale": "none",
        "rank": 8,
    },
)
```
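One way `decompose` could dispatch on `method` while keeping configs serializable; the registry pattern here is an assumption, not part of the spec:

```python
from typing import Callable

# Hypothetical registry mapping method names to implementations.
_METHODS: dict[str, Callable] = {}

def register_method(name: str):
    def wrap(fn: Callable) -> Callable:
        _METHODS[name] = fn
        return fn
    return wrap

def decompose(view, method: str, config: dict) -> "DecompositionResult":
    if method not in _METHODS:
        raise ValueError(f"unknown method: {method!r}")
    # config must stay a plain, JSON-serializable dict for provenance.
    return _METHODS[method](view.materialize(), config)
```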
Config fields should include:

- `center`: none, global, per-axis, or other explicit modes
- `scale`: none, z-score-like modes, robust modes, or explicit user-defined modes
- `missing`: error, mask, or another explicit policy
- `rank`: fixed rank or a selection strategy
- `weights`: optional axis or channel weighting
- `sample_mask`: optional subset of one axis before decomposition
- `flatten`: required for matrix methods if multiple axes are collapsed

Results should also be generic and axis-aware. The output must preserve stable ids and explicit axis names so plots, subsets, and annotations can be reconstructed without ambiguity.
```python
from dataclasses import dataclass

import pandas as pd

@dataclass
class DecompositionResult:
    method: str                                        # e.g. "svd"
    config: dict                                       # the exact config used
    input_ref: dict                                    # reference to the input dataset or view
    factor_tables: dict[str, pd.DataFrame]             # per-axis factor tables
    component_tensors: dict[str, AxisAnnotatedTensor]  # tensor-valued outputs
    diagnostics: dict
    attrs: dict
```
| Field | Purpose |
|---|---|
| `factor_tables["instance"]` | Scores, cluster labels, reconstruction error, outlier scores, and any instance-level derived values |
| `factor_tables["sample"]` | Sample-axis weights, if the method exposes them directly |
| `factor_tables["channel"]` | Channel or feature-axis weights, if the method exposes them directly |
| `component_tensors["components"]` | Waveform-like or tensor-like component objects, for example component x sample x channel |
| `diagnostics` | Method-specific diagnostics such as singular values, explained variance, convergence, residuals |
For any decomposition performed on a dataset with an instance axis, the result should always contain a table keyed by the same `instance_id` values used in the input dataset or view. A minimal column layout:

```
instance_id
score_1
score_2
...
score_k
recon_error
cluster        # optional
outlier_score  # optional
```
This is the critical contract for downstream tools that want to attach factor scores or cluster labels back onto the original dataset and use them for filtering, plotting, browsing, or annotation.
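A sketch of that join, assuming pandas tables on both sides and the column layout above:

```python
import pandas as pd

def attach_scores(instance_meta: pd.DataFrame,
                  instance_factors: pd.DataFrame) -> pd.DataFrame:
    """Join factor scores back onto instance metadata via the stable id."""
    return instance_meta.merge(instance_factors, on="instance_id", how="left")

# e.g. filter instances by a factor score after the join:
# enriched = attach_scores(ds.axis_meta["instance"], result.factor_tables["instance"])
# selected = enriched.query("score_1 > 0.5")
```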
Example contract for a matrix method (SVD on a flattened tensor):

```
input tensor:  instance x sample x channel
flattened:     instance x (sample * channel)
outputs:
  factor_tables["instance"]       -> left scores
  component_tensors["components"] -> reshaped right vectors as component x sample x channel
  diagnostics["singular_values"]
```
Example contract for a tensor method with per-axis factors (e.g. CP-style):

```
input tensor: instance x sample x channel
outputs:
  factor_tables["instance"]       -> instance factors
  factor_tables["sample"]         -> sample factors
  factor_tables["channel"]        -> channel factors
  component_tensors["components"] -> rank-1 terms or reconstructed component tensors
```
Querying should be generic and metadata-driven. The library should not interpret metadata field names, only expose a uniform mechanism for filtering and selecting by metadata.
view = ds.query("instance", "label == 'target' and qc_pass and score > 0.8")
view = ds.query("channel", "group in ['A', 'B']")
view = ds.slice("sample", 100, 900)
This avoids repeatedly passing data tables around and makes compiled datasets practical.
- `instance_id` should be the canonical join key for the instance axis.
- `channel_id` and `sample_index` should be used consistently when those axes exist.

A narrow first version of the external tensor library can still be useful if it supports concatenation along instance and the two on-disk layouts below.

```
compiled_dataset/
  manifest.json
  arrays/
    data.zarr
  meta/
    instance.parquet
    sample.parquet
    channel.parquet
  partitions/
    files.parquet
    groups.parquet
```
```
decomposition_result/
  manifest.json
  factor_tables/
    instance.parquet
    sample.parquet
    channel.parquet
  components/
    components.zarr
  diagnostics.json
```
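A sketch of persisting a `DecompositionResult` to that layout; `save_result` is an illustrative name and error handling is omitted:

```python
import json
from pathlib import Path

import numpy as np
import zarr

def save_result(root: str, result: DecompositionResult) -> None:
    base = Path(root)
    (base / "factor_tables").mkdir(parents=True, exist_ok=True)
    (base / "components").mkdir(exist_ok=True)
    for axis, table in result.factor_tables.items():
        table.to_parquet(base / "factor_tables" / f"{axis}.parquet")
    comps = result.component_tensors["components"]
    zarr.save(str(base / "components" / "components.zarr"), np.asarray(comps.data))
    (base / "diagnostics.json").write_text(json.dumps(result.diagnostics, default=str))
    (base / "manifest.json").write_text(json.dumps(
        {"method": result.method, "config": result.config, "input_ref": result.input_ref}))
```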
This specification intentionally avoids domain assumptions. Any application-specific metadata should be represented as arbitrary axis metadata columns or global attributes, not hardcoded schema requirements.