Generic Tensor Decomposition Specification

This document specifies a fully general tensor dataset and decomposition interface for an external library that will operate independently of Lunascope. The library must not assume any particular scientific domain, modality, event type, or application-specific metadata schema. It should accept numeric tensors with named axes plus arbitrary metadata, support lazy subsetting and concatenation across files, and return decomposition outputs that can be joined back onto source instances for plotting, browsing, filtering, annotation, or downstream analysis.

1. Design Goals

2. Core Abstraction

The core object is an axis-annotated tensor: a numeric array with named axes, optional explicit axis index arrays, per-axis metadata tables, and global attributes.

from dataclasses import dataclass

import numpy as np
import pandas as pd
from numpy.typing import ArrayLike

@dataclass
class AxisAnnotatedTensor:
    data: ArrayLike                           # numeric tensor (NumPy, Zarr, Dask, CuPy, ...)
    axes: list[str]                           # ordered axis names, len(axes) == data.ndim
    axis_index: dict[str, np.ndarray | None]  # optional coordinate arrays per axis
    axis_meta: dict[str, pd.DataFrame]        # one metadata row per element of each axis
    attrs: dict                               # free-form global attributes

data

A numeric tensor, such as a NumPy array, a Zarr-backed array, a Dask array, a CuPy array, or any other object exposing the array interface.

axes

Ordered axis names, for example ["instance", "sample", "channel"].

axis_index

Optional explicit axis coordinate arrays or ids, such as sample offsets, times, channel ids, or instance ids.

axis_meta

Per-axis metadata tables, one row per element of the corresponding axis, with arbitrary user-defined columns.

The library should not hardcode axis names, but should work smoothly with common conventions such as instance, sample, channel, feature, and component.
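To make the abstraction concrete, here is a minimal sketch of constructing such a tensor with NumPy and pandas. The class is redeclared in simplified form so the snippet is self-contained, and all ids, column names, and sizes are illustrative.

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd

@dataclass
class AxisAnnotatedTensor:      # simplified redeclaration for this example
    data: np.ndarray
    axes: list
    axis_index: dict
    axis_meta: dict
    attrs: dict

rng = np.random.default_rng(0)
ids = ["i0", "i1", "i2", "i3"]

tensor = AxisAnnotatedTensor(
    data=rng.normal(size=(4, 6, 2)),          # instance x sample x channel
    axes=["instance", "sample", "channel"],
    axis_index={"instance": np.array(ids),
                "sample": np.arange(6),
                "channel": np.array(["ch_a", "ch_b"])},
    axis_meta={"instance": pd.DataFrame({"instance_id": ids,
                                         "group": ["g1", "g1", "g2", "g2"]}),
               "channel": pd.DataFrame({"channel_id": ["ch_a", "ch_b"],
                                        "unit": ["a.u.", "a.u."]})},
    attrs={"creator": "example"},
)
```

Note that metadata tables are optional per axis: here the sample axis carries only its coordinate array and no table.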

3. Generic Metadata Model

Metadata are arbitrary typed annotations attached to named axes. The decomposition package should not encode domain-specific concepts such as subject, stage, modality, class label, condition, or source, but it must allow all of them to be carried through as ordinary metadata fields.

Per-Axis Requirements

instance
  One metadata row per instance. Must include a stable unique id.
  Examples: instance_id, class, score, provenance, group, tags, arbitrary covariates

sample
  One row per sample coordinate, if present.
  Examples: sample_index, sample_time, masks, regions, windows

channel
  One row per channel or feature dimension, if present.
  Examples: channel_id, label, unit, transform, grouping

The package should never require domain-specific columns. It should require only stable ids and agreement between metadata row count and axis length.
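Because row alignment and id uniqueness are the only hard requirements, the check fits in a few lines. The function name validate_axis_meta and its signature are hypothetical, not part of the spec.

```python
import pandas as pd

def validate_axis_meta(shape, axes, axis_meta, id_columns):
    """Check row-count alignment and stable-id uniqueness per axis.

    id_columns maps an axis name to the column holding its stable id.
    """
    for length, axis in zip(shape, axes):
        if axis not in axis_meta:
            continue                          # per-axis tables are optional
        table = axis_meta[axis]
        if len(table) != length:
            raise ValueError(f"{axis}: {len(table)} rows != axis length {length}")
        id_col = id_columns.get(axis)
        if id_col is not None and table[id_col].duplicated().any():
            raise ValueError(f"{axis}: values in {id_col!r} are not unique")

meta = {"instance": pd.DataFrame({"instance_id": ["i0", "i1", "i2"]})}
validate_axis_meta((3, 10, 2), ["instance", "sample", "channel"],
                   meta, {"instance": "instance_id"})
```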

4. Canonical Use Case

A common case is a tensor with shape instance x sample x channel. Even in that case, the decomposition library should treat those names as conventions rather than hardcoded semantics.

tensor.shape == (E, T, C)    # E instances, T samples, C channels
axes == ["instance", "sample", "channel"]

For this case, a typical set of coordinates might be:

axis_index = {
    "instance": instance_ids,
    "sample": sample_offsets_or_times,
    "channel": channel_ids
}
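As an illustration, these coordinate arrays might be built as follows; the 250 Hz sampling rate and the id formats are placeholders, not part of the spec.

```python
import numpy as np

E, T, C = 4, 1000, 3
instance_ids = np.array([f"i{k}" for k in range(E)])
sample_offsets_or_times = np.arange(T) / 250.0   # seconds at an assumed 250 Hz
channel_ids = np.array(["ch_a", "ch_b", "ch_c"])

axis_index = {
    "instance": instance_ids,
    "sample": sample_offsets_or_times,
    "channel": channel_ids,
}
```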

5. Dataset and View Model

The package should distinguish between a persistent dataset and lightweight lazy views onto that dataset. This allows users to open a compiled tensor once, then make many metadata-driven selections without repeatedly passing the full tensor or all metadata tables.

from dataclasses import dataclass
from typing import Any

import pandas as pd

@dataclass
class TensorDataset:
    manifest: dict                        # parsed manifest.json
    store: Any                            # handle to the backing array store
    axis_meta: dict[str, pd.DataFrame]    # per-axis metadata tables
    attrs: dict                           # free-form global attributes

@dataclass
class TensorView:
    dataset: TensorDataset
    selectors: dict                       # axis -> selection (ids, mask, or slice)
    attrs: dict

Expected View Operations

ds = open_dataset("/path/to/dataset")
view = ds.query(axis="instance", expr="label == 'target' and qc_pass")
view = view.filter(axis="channel", ids_or_mask=["ch_a", "ch_b", "ch_c"])
result = decompose(view, method="svd", config={...})
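One way to realize lazy views is to accumulate integer selectors per axis and index the array only on materialization. The sketch below assumes metadata tables with a default RangeIndex, so row labels double as positions; the class and method names are illustrative, not the library's API.

```python
import numpy as np
import pandas as pd

class TensorView:
    """Sketch: per-axis integer selectors, resolved only on materialization."""

    def __init__(self, data, axes, axis_meta, selectors=None):
        self.data, self.axes, self.axis_meta = data, axes, axis_meta
        self.selectors = selectors or {}        # axis name -> positions array

    def query(self, axis, expr):
        # Assumes a default RangeIndex, so row labels equal positions.
        positions = self.axis_meta[axis].query(expr).index.to_numpy()
        prev = self.selectors.get(axis)
        if prev is not None:                    # intersect with prior selection
            positions = np.intersect1d(prev, positions)
        return TensorView(self.data, self.axes, self.axis_meta,
                          {**self.selectors, axis: positions})

    def materialize(self):
        out = self.data
        for i, axis in enumerate(self.axes):
            if axis in self.selectors:
                out = np.take(out, self.selectors[axis], axis=i)
        return out

meta = {"instance": pd.DataFrame({"instance_id": ["i0", "i1", "i2"],
                                  "qc_pass": [True, False, True]})}
data = np.zeros((3, 5, 2))
view = TensorView(data, ["instance", "sample", "channel"], meta).query(
    "instance", "qc_pass")
```

Because each query returns a new view holding only selectors, chained selections stay cheap until materialize touches the array.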

6. On-Disk Storage Specification

The preferred storage format is a directory-based bundle using:

dataset/
  manifest.json
  arrays/
    data.zarr
    masks.zarr              # optional
  meta/
    instance.parquet
    sample.parquet
    channel.parquet
  partitions/
    files.parquet           # optional but recommended
    groups.parquet          # optional but recommended

Why This Layout

7. Manifest Specification

The manifest must explicitly declare the tensor shape, dtype, axis names, coordinate id columns, and paths to metadata and array stores.

{
  "schema_version": "0.1.0",
  "tensor_name": "data",
  "shape": [42337, 1793, 3],
  "dtype": "float32",
  "axes": ["instance", "sample", "channel"],
  "axis_index_columns": {
    "instance": "instance_id",
    "sample": "sample_index",
    "channel": "channel_id"
  },
  "meta_files": {
    "instance": "meta/instance.parquet",
    "sample": "meta/sample.parquet",
    "channel": "meta/channel.parquet"
  },
  "array_store": "arrays/data.zarr",
  "mask_store": "arrays/masks.zarr",
  "attrs": {
    "creator": "example-package",
    "version": "0.1.0"
  }
}
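A loader might enforce the manifest's structural invariants before touching any array data. This sketch checks only the obvious consistency rules; the required-key set and the function name are assumptions.

```python
import json

REQUIRED_KEYS = {"schema_version", "shape", "dtype", "axes",
                 "axis_index_columns", "meta_files", "array_store"}

def load_manifest(text):
    """Parse a manifest and check its structural invariants."""
    manifest = json.loads(text)
    missing = REQUIRED_KEYS - manifest.keys()
    if missing:
        raise ValueError(f"manifest missing keys: {sorted(missing)}")
    if len(manifest["shape"]) != len(manifest["axes"]):
        raise ValueError("shape rank and axis count disagree")
    unknown = set(manifest["axis_index_columns"]) - set(manifest["axes"])
    if unknown:
        raise ValueError(f"axis_index_columns names unknown axes: {sorted(unknown)}")
    return manifest

manifest = load_manifest(json.dumps({
    "schema_version": "0.1.0",
    "shape": [42337, 1793, 3],
    "dtype": "float32",
    "axes": ["instance", "sample", "channel"],
    "axis_index_columns": {"instance": "instance_id"},
    "meta_files": {"instance": "meta/instance.parquet"},
    "array_store": "arrays/data.zarr",
}))
```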

8. Provenance and Compiled Multi-File Datasets

Concatenation across many files or individuals should be a first-class use case. The specification should support compiling many source tensors into a single logical dataset while preserving lineage.

Recommended Instance Metadata Columns

Recommended Optional Catalog Tables

compiled = concat_datasets(
    datasets,
    along="instance",
    align={"channel": "channel_id", "sample": "sample_index"},
    join="inner"
)
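The inner-join concatenation above can be sketched with plain NumPy and pandas over (data, instance_meta, channel_meta) triples; the helper name concat_along_instance and the source_file provenance column are illustrative.

```python
import numpy as np
import pandas as pd

def concat_along_instance(datasets, channel_col="channel_id"):
    """Concatenate (data, instance_meta, channel_meta) triples along instance.

    Channels are inner-joined on their stable id and every tensor is
    reindexed to the shared channel order before stacking.
    """
    shared = None
    for _, _, channel_meta in datasets:
        ids = channel_meta[channel_col]
        shared = ids if shared is None else shared[shared.isin(ids)]
    shared = shared.reset_index(drop=True)

    blocks, instance_tables = [], []
    for source, (data, instance_meta, channel_meta) in enumerate(datasets):
        pos = channel_meta.set_index(channel_col).index.get_indexer(shared)
        blocks.append(np.take(data, pos, axis=2))        # align channel axis
        instance_tables.append(instance_meta.assign(source_file=source))
    return (np.concatenate(blocks, axis=0),
            pd.concat(instance_tables, ignore_index=True))

d1 = (np.zeros((2, 4, 3)),
      pd.DataFrame({"instance_id": ["a", "b"]}),
      pd.DataFrame({"channel_id": ["x", "y", "z"]}))
d2 = (np.ones((1, 4, 2)),
      pd.DataFrame({"instance_id": ["c"]}),
      pd.DataFrame({"channel_id": ["y", "x"]}))
big, instance_meta = concat_along_instance([d1, d2])
```

Here the inner join keeps only channels x and y, so the compiled tensor has shape (3, 4, 2), and every instance row records which source it came from.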

Concatenation Rules

9. Missing Data and Masks

The spec should allow an optional validity mask, but for a first version it is reasonable to require that tensors are already regularized before decomposition; ragged inputs should be normalized upstream.

Recommended v1 policy: defer mask support and accept only dense, fully regularized tensors.

10. Decomposition API

The decomposition package should accept a dataset view, a method name, and a fully serializable config.

result = decompose(
    tensor_view,
    method="svd",
    config={
        "center": "none",
        "scale": "none",
        "rank": 8
    }
)

Expected Config Fields

All config values should be serializable back into the output so downstream tools always know exactly how the result was created.
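One simple way to guarantee this is to round-trip the config through JSON at call time and embed the frozen copy in the result; freeze_config is a hypothetical helper, not a required API.

```python
import json

def freeze_config(method, config):
    """Round-trip the config through JSON so results record exactly how they ran.

    Anything that fails to serialize is rejected up front rather than being
    silently stored as an unreproducible object.
    """
    try:
        frozen = json.loads(json.dumps(config, sort_keys=True))
    except TypeError as err:
        raise ValueError(f"config for {method!r} is not serializable: {err}")
    return {"method": method, "config": frozen}

record = freeze_config("svd", {"center": "none", "scale": "none", "rank": 8})
```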

11. Output Specification

Results should also be generic and axis-aware. The output must preserve stable ids and explicit axis names so plots, subsets, and annotations can be reconstructed without ambiguity.

@dataclass
class DecompositionResult:
    method: str
    config: dict
    input_ref: dict
    factor_tables: dict[str, pd.DataFrame]
    component_tensors: dict[str, AxisAnnotatedTensor]
    diagnostics: dict
    attrs: dict

Recommended Output Parts

factor_tables["instance"]
  Scores, cluster labels, reconstruction error, outlier scores, and any instance-level derived values.

factor_tables["sample"]
  Sample-axis weights, if the method exposes them directly.

factor_tables["channel"]
  Channel- or feature-axis weights, if the method exposes them directly.

component_tensors["components"]
  Waveform-like or tensor-like component objects, for example component x sample x channel.

diagnostics
  Method-specific diagnostics such as singular values, explained variance, convergence, residuals.

12. Minimum Instance-Level Return Contract

For any decomposition performed on a dataset with an instance axis, the result should always contain a table keyed by the same instance_id values used in the input dataset or input view.

instance_id
score_1
score_2
...
score_k
recon_error
cluster               # optional
outlier_score         # optional

This is the critical contract for downstream tools that want to attach factor scores or cluster labels back onto the original dataset and use them for filtering, plotting, browsing, or annotation.
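Because the factor table and the source metadata share instance_id, attaching results back is a one-line pandas merge; all columns other than instance_id are illustrative.

```python
import pandas as pd

instance_meta = pd.DataFrame({
    "instance_id": ["i0", "i1", "i2"],
    "label": ["target", "other", "target"],
})
factor_table = pd.DataFrame({
    "instance_id": ["i2", "i0", "i1"],   # row order need not match the source
    "score_1": [0.9, 0.1, 0.5],
    "recon_error": [0.02, 0.30, 0.11],
})

# Stable ids make the join unambiguous regardless of row order or subsetting.
annotated = instance_meta.merge(factor_table, on="instance_id",
                                how="left", validate="one_to_one")
```

The validate="one_to_one" argument makes pandas raise if either side violates the stable-unique-id contract.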

13. Matrix and Tensor Examples

Example: Matrix SVD on Flattened Instances

input tensor:  instance x sample x channel
flattened:     instance x (sample * channel)

outputs:
  factor_tables["instance"]   -> left scores
  component_tensors["components"] -> reshaped right vectors as component x sample x channel
  diagnostics["singular_values"]
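The flattened-SVD recipe above can be sketched directly with NumPy; the shapes and rank are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)
E, T, C, rank = 12, 50, 3, 4                 # example sizes only
X = rng.normal(size=(E, T, C))

flat = X.reshape(E, T * C)                   # instance x (sample * channel)
U, s, Vt = np.linalg.svd(flat, full_matrices=False)

scores = U[:, :rank] * s[:rank]              # -> factor_tables["instance"]
components = Vt[:rank].reshape(rank, T, C)   # -> component x sample x channel
singular_values = s                          # -> diagnostics["singular_values"]
```

Reshaping the right singular vectors back to (component, sample, channel) is what lets the output carry explicit axis names instead of a flat feature index.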

Example: Tensor Decomposition

input tensor: instance x sample x channel

outputs:
  factor_tables["instance"] -> instance factors
  factor_tables["sample"]   -> sample factors
  factor_tables["channel"]  -> channel factors
  component_tensors["components"] -> rank-1 terms or reconstructed component tensors

14. Query Semantics

Querying should be generic and metadata-driven. The library should not interpret metadata field names, only expose a uniform mechanism for filtering and selecting by metadata.

view = ds.query("instance", "label == 'target' and qc_pass and score > 0.8")
view = ds.query("channel", "group in ['A', 'B']")
view = ds.slice("sample", 100, 900)

This avoids repeatedly passing data tables around and makes compiled datasets practical.
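Coordinate-range slicing reduces to integer positions on one axis. This sketch assumes the slice bounds are expressed in the axis's coordinate units (here seconds at a hypothetical 250 Hz) with a half-open interval; the helper name is illustrative.

```python
import numpy as np

sample_times = np.arange(1000) / 250.0    # coordinate array, seconds

def slice_axis(index, start, stop):
    """Resolve a half-open coordinate range to integer positions on one axis."""
    return np.nonzero((index >= start) & (index < stop))[0]

pos = slice_axis(sample_times, 0.4, 3.6)
```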

15. Strong Naming Recommendations

16. What Not To Do

17. Minimal v1 Scope

A narrow first version of the external tensor library can still be useful if it supports the following:

  1. Open a dataset lazily from a manifest-backed on-disk bundle.
  2. Expose axis metadata tables and stable ids.
  3. Query and filter lightweight views by metadata.
  4. Concatenate multiple datasets along instance.
  5. Run a decomposition on a selected view.
  6. Return factor tables keyed by the original stable ids.
  7. Return component tensors with explicit axis names and coordinates.

18. Recommended v1 Bundle Example

compiled_dataset/
  manifest.json
  arrays/
    data.zarr
  meta/
    instance.parquet
    sample.parquet
    channel.parquet
  partitions/
    files.parquet
    groups.parquet

decomposition_result/
  manifest.json
  factor_tables/
    instance.parquet
    sample.parquet
    channel.parquet
  components/
    components.zarr
  diagnostics.json

This specification intentionally avoids domain assumptions. Any application-specific metadata should be represented as arbitrary axis metadata columns or global attributes, not hardcoded schema requirements.