Metadata-Version: 2.4
Name: engel-semantic-layer
Version: 0.3.0
Summary: Engel-style semantic layer compiler for BigQuery
Author: Rasmus Engelbrecht
License-Expression: MIT
Project-URL: Homepage, https://github.com/rasmusengelbrecht/engel-semantic-layer-public
Project-URL: Repository, https://github.com/rasmusengelbrecht/engel-semantic-layer-public
Project-URL: Issues, https://github.com/rasmusengelbrecht/engel-semantic-layer-public/issues
Keywords: semantic-layer,metrics,sql,bigquery,yaml
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: PyYAML>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Dynamic: license-file

# engel-semantic-layer

Engel-style semantic layer as a Python package.

This package lets users define metrics/modules in YAML (same shape as Engel's code reference) and compile executable BigQuery SQL for a metric query request.

## Scope

- YAML loading from:
  - one-module-per-file (`module:` root), or
  - single-file semantic layer (`semantic_layer.modules` + `semantic_layer.cross_module_metrics`)
- Metric types:
  - `sum`
  - `count`
  - `ratio`
  - `count-distinct`
  - `custom-value`
  - `custom-ratio`
  - `derived-ratio` (cross-module, metric-on-metric, `join_on: time`)
- Metric filters + query-time filters
- Metric slices (`slice_name`) for base metrics
- Time grain support: none, daily, weekly, monthly, quarterly, yearly
- Breakdowns with dimension access checks
- Relationship graph traversal across modules (`relationships`, with legacy `joinPaths` support)
- BigQuery SQL output

## Changelog

See [CHANGELOG.md](CHANGELOG.md) for release notes.

## Installation

Install from PyPI:

```bash
pip install engel-semantic-layer
```

Install from TestPyPI (pre-release validation):

```bash
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple engel-semantic-layer
```

Install from source for local development:

```bash
pip install -e .
```

## Example

```python
from engel_semantic_layer import QueryFilter, QueryRequest, SemanticLayer

layer = SemanticLayer.from_path("examples/model")

request = QueryRequest(
    from_date="2026-01-01T00:00:00Z",
    to_date="2026-01-31T23:59:59Z",
    time_grain="daily",
    breakdown_dimension_ids=["user_email"],
    filters=[
        QueryFilter(
            dimension_id="user_email",
            filter_values=["demo-user@example.com"],
        )
    ],
)

sql = layer.compile_sql(metric_id="metric_event_count", request=request)
print(sql)
```

## YAML shape

You can use either:
- one-module-per-file (`module:` root), or
- a single `semantic_layer.yml` with `modules` and `cross_module_metrics`.

```yaml
module:
  schema: "analytics"
  table: "records"
  identifier: "records"
  dimensions:
    - column: "country"
      type: "country"
  metrics:
    - identifier: "record_volume"
      name: "Record volume"
      ai_context:
        synonyms:
          - "booked value"
          - "record amount"
      calculation: "sum"
      time: "records.recorded_at"
      value: "records.amount"
      dimensions:
        - "this.*"
```

Relationship example (use dimensions from another module safely):

```yaml
module:
  identifier: "orders"
  schema: "analytics"
  table: "orders"
  dimensions:
    - column: "customer_id"
      type: "categorical"
  relationships:
    - to_module: "customers"
      from_column: "customer_id"
      to_column: "id"
      relationship_type: "many-to-one" # one-to-one | one-to-many | many-to-one | many-to-many
      join_type: "left"                # left | inner | full | right
  metrics:
    - identifier: "orders_count"
      name: "Orders"
      calculation: "count"
      time: "orders.created_at"
      dimensions:
        - "customers.country"
```

Cross-module derived ratio in single-file mode:

```yaml
semantic_layer:
  modules:
    - identifier: "wide_metric_count_a"
      project: "demo-project-123456"
      schema: "analytics_wide"
      table: "wide_metric_count_a"
      metrics:
        - identifier: "metric_actual_primary"
          name: "Primary Metric, Actual"
          calculation: "sum"
          time: "wide_metric_count_a.fulfilled_at"
          value: "wide_metric_count_a.value_primary_amount"
          filters:
            - column: "settled_at"
              operator: "is-not"
              expression: "NULL"

    - identifier: "entity_snapshots"
      project: "demo-project-123456"
      schema: "analytics_marts"
      table: "entity_snapshots"
      metrics:
        - identifier: "metric_entities_active"
          name: "# Active Entities"
          calculation: "count-distinct"
          time: "entity_snapshots.date"
          distinct_on: "entity_snapshots.dim_entity_id"
          filters:
            - column: "is_live"
              operator: "is"
              expression: "true"

  cross_module_metrics:
    - identifier: "ratio_primary_per_entity"
      name: "Primary Metric per Active Entity"
      calculation: "derived-ratio"
      numerator_metric: "metric_actual_primary"
      denominator_metric: "metric_entities_active"
      join_on: "time"
      join_type: "inner" # optional: inner | left | full
```

API/OpenAPI field mapping (camelCase -> YAML snake_case):

- `sqlExpression` -> `sql_expression`
- `numeratorSql` -> `numerator_sql`
- `denominatorSql` -> `denominator_sql`
- `numeratorMetricId` -> `numerator_metric`
- `denominatorMetricId` -> `denominator_metric`

Notes:
- `derived-ratio` currently supports `join_on: time`.
- `join_type` supports `inner` (default), `left`, and `full` (`FULL OUTER JOIN`).
- query request must include `timeGrain` for cross-module metrics (`timeGrain: none` is not supported for cross-module derived ratios).
- requested breakdowns must be available on both numerator and denominator metrics.

## Publishing

Release asset workflow:

- `.github/workflows/release-assets.yml`

What it does:
- Runs tests
- Builds wheel + sdist
- Uploads artifacts to the GitHub release for `v*` tags
- On manual dispatch, can attach to a specified tag or create a draft release

## CLI

Validate your semantic model:

```bash
engel-semantic-layer validate --model-path examples/model --json
```

Compact summary output for CI logs:

```bash
engel-semantic-layer validate --model-path examples/model --json --summary-only
```

Validate that all metrics are compilable with synthetic dates (uses `--time-grain monthly` by default so cross-module metrics compile too):

```bash
engel-semantic-layer validate \
  --model-path examples/model \
  --json \
  --check-compilable
```

Override synthetic compile time grain when needed:

```bash
engel-semantic-layer validate \
  --model-path examples/semantic_layer.yml \
  --check-compilable \
  --time-grain weekly
```

Use period shortcuts instead of explicit from/to:

```bash
engel-semantic-layer validate \
  --model-path examples/model \
  --check-compilable \
  --period "last 12 months"
```

Write a full validation report artifact for CI (includes warning summary, compile failures, and per-metric validation statuses):

```bash
engel-semantic-layer validate \
  --model-path examples/model \
  --check-compilable \
  --report /tmp/semantic-validate-report.json
```

Fail on semantic warnings:

```bash
engel-semantic-layer validate \
  --model-path examples/model \
  --warnings-as-errors
```

Validate with strict column linting for all metrics:

```bash
engel-semantic-layer validate \
  --model-path examples/model \
  --json \
  --strict-column-lint \
  --column-registry /tmp/registry.json
```

Fail on potential fanout-risk relationship traversals during compile checks:

```bash
engel-semantic-layer validate \
  --model-path examples/model \
  --check-compilable \
  --strict-relationships
```

List metrics:

```bash
engel-semantic-layer metrics --model-path examples/model
```

Machine-readable list + filters:

```bash
engel-semantic-layer metrics \
  --model-path examples/model \
  --format json \
  --module fact_records \
  --calculation sum
```

Inspect module join graph:

```bash
engel-semantic-layer joins --model-path examples/model --format json
```

Compile SQL from a metric + request payload:

```bash
engel-semantic-layer compile \
  --model-path examples/model \
  --metric-id metric_event_count \
  --request /tmp/request.json \
  --format sql
```

Compile a cross-module derived ratio from single-file semantic layer:

```bash
engel-semantic-layer compile \
  --model-path examples/semantic_layer.yml \
  --metric-id ratio_primary_per_entity \
  --request /tmp/request.json \
  --period "last 12 months" \
  --format sql
```

Get compiler explain output (resolved metadata + source files):

```bash
engel-semantic-layer compile \
  --model-path examples/model \
  --metric-id metric_event_count \
  --request /tmp/request.json \
  --format explain
```

Validate request compatibility without returning SQL:

```bash
engel-semantic-layer compile \
  --model-path examples/model \
  --metric-id metric_event_count \
  --request /tmp/request.json \
  --dry-run-validate
```

Use period shortcuts in compile:

```bash
engel-semantic-layer compile \
  --model-path examples/model \
  --metric-id metric_event_distinct_count \
  --request /tmp/request.json \
  --period "Q1 2025"
```

Supported period formats:
- `last x years`
- `last x months`
- `last x quarters`
- `last x days`
- `current year` (YTD)

`last x days` uses an inclusive lookback window: from `today - x days` through `today`.
Example: `last 30 days` returns 31 daily buckets when today is included, and 30 when `excludeToday` is enabled.
- `current month` (MTD)
- `QX YYYY` (e.g. `Q2 2025`)
- `YYYY` (e.g. `2025`)
- `MM-YYYY` (e.g. `02-2025`)

Time comparison options in request payload (`timeComparison`):
- Percentage comparisons: `YoY`, `MoM`
- Comparison values: `Last Year`, `Last Month`
- Daily-only percentage comparison: `YoY (Match Weekday)`

`YoY (Match Weekday)` requires `timeGrain: daily` and compares to 364 days prior.
With `timeGrain: none`, comparisons are done as full-window totals (e.g. YTD vs last YTD).

Target comparison options in request payload:
- `targetSeries` (e.g. `budget_current`, `stretch_target`)
- `targetComparisonMode`: `both` (default), `value`, `percentage`

Target output matrix:

| targetComparisonMode | `target_value` | `target_comparison_percentage` |
|---|---|---|
| `both` | ✅ | ✅ |
| `value` | ✅ | ❌ |
| `percentage` | ❌ | ✅ |

Example request snippets:

```json
{ "targetSeries": "budget_current", "targetComparisonMode": "both" }
```

```json
{ "targetSeries": "budget_current", "targetComparisonMode": "value" }
```

```json
{ "targetSeries": "budget_current", "targetComparisonMode": "percentage" }
```

Period handling options in request payload:
- `excludeOpenPeriod` (`true|false`)
- `excludeToday` (`true|false`)

When `excludeOpenPeriod` is true, the compiler excludes the currently open period bucket
for the selected `timeGrain` (e.g. current month for `monthly`, current day for `daily`).

When `excludeToday` is true, the compiler clips `toDate` by one day before other period logic.

Cumulative options in request payload:
- `cumulativeMode`: `mtd`, `ytd`, `rolling_days`
- `rollingDays`: positive integer (required only for `rolling_days`)

Validation rules:
- `mtd` requires `timeGrain: daily`
- `rolling_days` requires `timeGrain: daily`
- `ytd` works with daily/weekly/monthly/quarterly/yearly grains
- cumulative modes require a time grain (not `none`)
- cumulative + targets requires `targetComparisonMode` to include values (`value` or `both`)

Output columns for cumulative mode:
- `metric_base` (original bucket metric)
- `metric` (cumulative metric)
- if `targetSeries` is set: `target_value_base` + cumulative `target_value`

When `targetSeries` is provided, the compiler joins aggregated targets from
`demo-project-123456.analytics_prod.fct_targets_monthly` by metric identifier,
selected date range, and breakdown dimensions (if any), and returns target columns
based on `targetComparisonMode` (`both` by default):
- `target_value`
- `target_comparison_percentage`

Target source can be overridden in CLI with:
- `--target-project`
- `--target-schema`
- `--target-table`

With `--strict-column-lint`, target columns used for series/metric/time/value,
breakdowns, and dimension filters are validated against the column registry.

End-to-end request example (comparison + target + cumulative + period clipping):

```json
{
  "fromDate": "2026-01-01T00:00:00Z",
  "toDate": "2026-12-31T23:59:59Z",
  "timeGrain": "monthly",
  "timeComparison": "YoY",
  "targetSeries": "budget_current",
  "targetComparisonMode": "both",
  "cumulativeMode": "ytd",
  "excludeOpenPeriod": true,
  "excludeToday": true,
  "breakdownDimensionIds": ["segment_group"],
  "filters": [
    {
      "dimensionId": "segment_group",
      "filterValues": ["SegmentA"]
    }
  ]
}
```

You can also pipe request JSON through stdin:

```bash
echo '{"fromDate":"2026-01-01T00:00:00Z","toDate":"2026-01-31T23:59:59Z"}' | \
  engel-semantic-layer compile \
    --model-path examples/model \
    --metric-id metric_event_distinct_count \
    --request -
```

Compile with strict column linting:

```bash
engel-semantic-layer compile \
  --model-path examples/model \
  --metric-id metric_actual_secondary \
  --request /tmp/request.json \
  --strict-column-lint \
  --column-registry /tmp/registry.json
```

Compile with strict relationship checks (fails on potential fanout-risk traversals):

```bash
engel-semantic-layer compile \
  --model-path examples/model \
  --metric-id metric_actual_secondary \
  --request /tmp/request.json \
  --strict-relationships
```

## Optional strict column linting

You can enforce table+column existence checks during compile by providing a column registry.

```python
from engel_semantic_layer import ColumnRegistry, QueryRequest, SemanticLayer

registry = ColumnRegistry.from_dict(
    {
        "analytics_wide": {
            "wide_metric_count_a": ["recorded_at", "fulfilled_at", "value_secondary_amount", "status"]
        }
    }
)

layer = SemanticLayer.from_path(
    "examples/model",
    strict_column_lint=True,
    column_registry=registry,
)

sql = layer.compile_sql(
    "metric_actual_secondary",
    QueryRequest(from_date="2026-01-01T00:00:00Z", to_date="2026-01-31T23:59:59Z"),
)
```

You can also construct the registry from `INFORMATION_SCHEMA.COLUMNS` rows using
`ColumnRegistry.from_information_schema_rows(...)`.

CLI helper:

```bash
engel-semantic-layer registry from-information-schema \
  --input /tmp/information_schema_rows.json \
  --output /tmp/registry.json
```

Input supports both:
- JSON array (`[ {...}, {...} ]`)
- JSONL (one JSON object per line)

## Testing

The test suite includes SQL snapshot tests for representative metrics. Run:

```bash
uv run --extra dev pytest -q
```

If SQL output intentionally changes, update snapshot files in `tests/snapshots/`.

CLI exit codes:
- `0` success
- `2` validation/parse/compile input error

For machine-readable failures, use `--json-errors`:

```bash
engel-semantic-layer --json-errors compile ...
```

## Notes

- This compiler focuses on BigQuery SQL generation.
- Output aims to be Engel-like SQL, but this is an independent implementation.
- Strict validation is enabled for basics: duplicate IDs, unsupported operators/calculations, invalid metric field combinations, and disallowed breakdown/filter dimensions.
- Custom SQL metric expressions are normalized for BigQuery string literals (double-quoted strings are converted to single-quoted literals).
- Custom SQL reference linting is enabled: dotted references in custom expressions are validated and unknown table references fail at compile time.
- Query compilation guards against invalid breakdown requests (max 2 dimensions, no duplicates, no alias collisions).
- Missing join-path errors are actionable and include metric + module context plus reference sources (which fields caused the dependency).
