Metadata-Version: 2.4
Name: informatica-python
Version: 1.8.2
Summary: Convert Informatica PowerCenter workflow XML to Python/PySpark code
Author: Nick
License: MIT
Keywords: informatica,powercenter,etl,code-generator,pandas,pyspark,data-engineering
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Code Generators
Classifier: Topic :: Database :: Database Engines/Servers
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: lxml>=4.9.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# informatica-python

Convert Informatica PowerCenter workflow XML exports into clean, runnable Python/PySpark code.

**Author:** Nick
**License:** MIT
**PyPI:** [informatica-python](https://pypi.org/project/informatica-python/)

---

## Overview

`informatica-python` parses Informatica PowerCenter XML export files and generates equivalent Python code using your choice of data library. It handles all 72 DTD tags from the PowerCenter XML schema and produces a complete, ready-to-run Python project.

## Installation

```bash
pip install informatica-python
```

## Quick Start

### Command Line

```bash
# Generate Python files to a directory
informatica-python workflow_export.xml -o output_dir

# Generate as a zip archive
informatica-python workflow_export.xml -z output.zip

# Use a different data library
informatica-python workflow_export.xml -o output_dir --data-lib polars

# Include a parameter file
informatica-python workflow_export.xml -o output_dir --param-file workflow.param

# Enable data quality validation on type casts
informatica-python workflow_export.xml -o output_dir --validate-casts

# Parse to JSON only (no code generation)
informatica-python workflow_export.xml --json

# Save parsed JSON to file
informatica-python workflow_export.xml --json-file parsed.json
```

### Python API

```python
from informatica_python import InformaticaConverter

converter = InformaticaConverter()

# Parse and generate files
converter.convert_to_files("workflow_export.xml", "output_dir")

# Parse and generate zip
converter.convert_to_zip("workflow_export.xml", "output.zip")

# Parse to structured dict
result = converter.parse_file("workflow_export.xml")

# Use a different data library
converter.convert_to_files("workflow_export.xml", "output_dir", data_lib="polars")
```

## Generated Output Files

| File | Description |
|------|-------------|
| `helper_functions.py` | Database/file I/O helpers, Informatica expression equivalents (80+ functions), window/analytic functions, stored procedure execution, state persistence |
| `mapping_N.py` | One per mapping — transformation logic with row-count logging, source reads, target writes, inline documentation |
| `workflow.py` | Task orchestration with topological ordering, decision branching, worklet calls, and error handling |
| `config.yml` | Connection configs, source/target metadata, runtime parameters |
| `all_sql_queries.sql` | All SQL extracted from Source Qualifiers, Lookups, SQL transforms (with ANSI-translated variants) |
| `error_log.txt` | Conversion summary with unsupported transform analysis, unmapped port detection, and unknown expression function tracing |

## Supported Data Libraries

Select via `--data-lib` CLI flag or `data_lib` parameter:

| Library | Flag | Best For |
|---------|------|----------|
| **pandas** | `pandas` (default) | General-purpose, most compatible |
| **dask** | `dask` | Large datasets, parallel processing |
| **polars** | `polars` | High performance, Rust-backed |
| **vaex** | `vaex` | Out-of-core, billion-row datasets |
| **modin** | `modin` | Drop-in pandas replacement, multi-core |

## Supported Transformations

The code generator produces real, runnable Python for these transformation types:

- **Source Qualifier** — SQL override, pre/post SQL, column selection, session connection overrides
- **Expression** — Field-level expressions converted to vectorized pandas operations (`df["COL"]` style)
- **Filter** — Row filtering with conditions converted to vectorized expressions
- **Joiner** — `pd.merge()` with join type and condition parsing (inner/left/right/outer)
- **Lookup** — `pd.merge()` lookups with connection-aware DB/file reads, multiple match policies, default values
- **Aggregator** — `groupby().agg()` with SUM/COUNT/AVG/MIN/MAX/FIRST/LAST, computed aggregates
- **Sorter** — `sort_values()` with multi-key ascending/descending
- **Router** — Multi-group conditional routing with named groups
- **Union** — `pd.concat()` across multiple input groups
- **Update Strategy** — DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT routing with actual target INSERT/UPDATE/DELETE operations, dialect-aware SQL placeholders, auto-detected primary keys
- **Sequence Generator** — Auto-incrementing ID columns
- **Normalizer** — `pd.melt()` with auto-detected id/value vars
- **Rank** — `groupby().rank()` with Top-N filtering
- **Stored Procedure** — Full code generation with Oracle/MSSQL/generic support, input/output parameter mapping
- **Transaction Control** — Commit/rollback logic
- **Custom / Java** — Placeholder stubs with TODO markers
- **SQL Transform** — Direct SQL execution pass-through

## Supported XML Tags (72 Tags)

**Top-level:** POWERMART, REPOSITORY, FOLDER, FOLDERVERSION

**Source/Target:** SOURCE, SOURCEFIELD, TARGET, TARGETFIELD, TARGETINDEX, TARGETINDEXFIELD, FLATFILE, XMLINFO, XMLTEXT, GROUP, TABLEATTRIBUTE, FIELDATTRIBUTE, METADATAEXTENSION, KEYWORD, ERPSRCINFO

**Mapping/Mapplet:** MAPPING, MAPPLET, TRANSFORMATION, TRANSFORMFIELD, TRANSFORMFIELDATTR, TRANSFORMFIELDATTRDEF, INSTANCE, ASSOCIATED_SOURCE_INSTANCE, CONNECTOR, MAPDEPENDENCY, TARGETLOADORDER, MAPPINGVARIABLE, FIELDDEPENDENCY, INITPROP, ERPINFO

**Task/Session/Workflow:** TASK, TIMER, VALUEPAIR, SCHEDULER, SCHEDULEINFO, STARTOPTIONS, ENDOPTIONS, SCHEDULEOPTIONS, RECURRING, CUSTOM, DAILYFREQUENCY, REPEAT, FILTER, SESSION, CONFIGREFERENCE, SESSTRANSFORMATIONINST, SESSTRANSFORMATIONGROUP, PARTITION, HASHKEY, KEYRANGE, CONFIG, SESSIONCOMPONENT, CONNECTIONREFERENCE, TASKINSTANCE, WORKFLOWLINK, WORKFLOWVARIABLE, WORKFLOWEVENT, WORKLET, WORKFLOW, ATTRIBUTE

**Shortcut:** SHORTCUT

**SAP:** SAPFUNCTION, SAPSTRUCTURE, SAPPROGRAM, SAPOUTPUTPORT, SAPVARIABLE, SAPPROGRAMFLOWOBJECT, SAPTABLEPARAM

## Key Features

### Row-Count Logging (v1.8+)

Generated code automatically logs row counts at every step of the data pipeline:

```
Source SQ_CUSTOMERS: 10000 rows read
EXP_CALC (Expression): 10000 input rows -> 10000 output rows
FIL_ACTIVE (Filter): 10000 input rows -> 8542 output rows
AGG_TOTALS (Aggregator): 8542 input rows -> 150 output rows
Target TGT_SUMMARY: 150 rows written
```

All row-count operations are backend-safe (wrapped in try/except), so Dask and other lazy-evaluation backends won't fail.
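
As an illustration, a minimal sketch of how backend-safe counting can be implemented; the helper names here are hypothetical, not this package's internals:

```python
import logging

logger = logging.getLogger(__name__)

def _safe_row_count(df):
    """Return len(df) when the backend supports it cheaply, else None.

    Hypothetical helper: lazy backends such as Dask may raise on len(),
    so any failure is swallowed and the log line is simply skipped.
    """
    try:
        return len(df)
    except Exception:
        return None

def log_step(name, df_in, df_out):
    """Log 'N input rows -> M output rows' when both counts are available."""
    rows_in, rows_out = _safe_row_count(df_in), _safe_row_count(df_out)
    if rows_in is not None and rows_out is not None:
        logger.info("%s: %d input rows -> %d output rows", name, rows_in, rows_out)
```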

### Generated Code Documentation (v1.8+)

Every generated mapping function includes a rich docstring describing:
- Mapping name and original Informatica description
- Source and target tables/files
- Transformation pipeline with field counts per step

Each transformation block is annotated with:
- Separator headers for visual scanning
- Transform type and description (from Informatica XML)
- Input and output field lists (truncated at 10 for readability)
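
For illustration, a generated docstring takes roughly this shape (every name and count below is invented):

```python
def run_mapping_1(config):
    """Mapping: m_CUSTOMER_SUMMARY

    Description: Aggregate daily customer activity into summary rows.
    Sources: CUSTOMERS (Oracle), ORDERS_FILE (flat file)
    Targets: TGT_SUMMARY (Oracle)
    Pipeline: SQ_CUSTOMERS (12 fields) -> EXP_CALC (14 fields)
              -> FIL_ACTIVE (14 fields) -> AGG_TOTALS (5 fields)
    """
    ...
```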

### Window / Analytic Functions (v1.7+)

DataFrame-level analytic functions for aggregation transforms:
- `moving_avg_df(df, col, window)` — rolling mean via `.rolling().mean()`
- `moving_sum_df(df, col, window)` — rolling sum via `.rolling().sum()`
- `cume_df(df, col)` — cumulative sum via `.expanding().sum()`
- `percentile_df(df, col, pct)` — quantile via `.quantile()`
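
Plausible one-line implementations of these helpers, assuming they take and return pandas objects (the return conventions shown are assumptions, and `pct` is taken on a 0-100 scale):

```python
import pandas as pd

def moving_avg_df(df: pd.DataFrame, col: str, window: int) -> pd.Series:
    """Rolling mean over the trailing `window` rows (NaN until the window fills)."""
    return df[col].rolling(window).mean()

def moving_sum_df(df: pd.DataFrame, col: str, window: int) -> pd.Series:
    """Rolling sum over the trailing `window` rows."""
    return df[col].rolling(window).sum()

def cume_df(df: pd.DataFrame, col: str) -> pd.Series:
    """Running (cumulative) sum from the first row to the current row."""
    return df[col].expanding().sum()

def percentile_df(df: pd.DataFrame, col: str, pct: float) -> float:
    """The pct-th percentile of the column, with pct on a 0-100 scale."""
    return df[col].quantile(pct / 100.0)
```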

### Update Strategy with Target Operations (v1.7+)

Update Strategy transforms now generate real INSERT/UPDATE/DELETE operations:
- Static strategies (0/1/2/3) map to INSERT/UPDATE/DELETE/REJECT
- DD_INSERT/DD_UPDATE/DD_DELETE/DD_REJECT expressions parsed from conditions
- Target writer splits rows and routes to appropriate SQL operations
- Dialect-aware SQL placeholders (`?` for MSSQL, `%s` for PostgreSQL/Oracle)
- Primary key columns auto-detected from target field definitions
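
A condensed sketch of what this routing looks like; the `_STRATEGY` column, function name, and DataFrame layout are illustrative assumptions, not the exact generated code:

```python
# Hypothetical sketch: route strategy-flagged rows to INSERT/UPDATE/DELETE.
DD_INSERT, DD_UPDATE, DD_DELETE, DD_REJECT = 0, 1, 2, 3

def write_with_update_strategy(df, table, key_cols, cursor, dialect="postgresql"):
    ph = "?" if dialect == "mssql" else "%s"  # dialect-aware placeholder
    cols = [c for c in df.columns if c != "_STRATEGY"]
    inserts = df[df["_STRATEGY"] == DD_INSERT]
    updates = df[df["_STRATEGY"] == DD_UPDATE]
    deletes = df[df["_STRATEGY"] == DD_DELETE]
    # DD_REJECT rows are simply dropped.

    insert_sql = (f"INSERT INTO {table} ({', '.join(cols)}) "
                  f"VALUES ({', '.join([ph] * len(cols))})")
    cursor.executemany(insert_sql, inserts[cols].values.tolist())

    set_cols = [c for c in cols if c not in key_cols]
    update_sql = (f"UPDATE {table} SET {', '.join(f'{c} = {ph}' for c in set_cols)} "
                  f"WHERE {' AND '.join(f'{c} = {ph}' for c in key_cols)}")
    cursor.executemany(update_sql, updates[set_cols + key_cols].values.tolist())

    delete_sql = f"DELETE FROM {table} WHERE {' AND '.join(f'{c} = {ph}' for c in key_cols)}"
    cursor.executemany(delete_sql, deletes[key_cols].values.tolist())
```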

### Stored Procedure Execution (v1.7+)

Full stored procedure code generation (not just stubs):
- Oracle: `cursor.callproc()` with output parameter registration
- MSSQL: `EXEC` with output parameter capture
- Generic: `CALL` syntax for other databases
- Input/output parameter mapping from transformation fields
- Empty-input guard prevents errors on empty upstream DataFrames
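
A simplified sketch of the three dialect paths (the function name and parameter conventions are assumptions, and real MSSQL output-parameter capture needs more plumbing than shown):

```python
def call_stored_procedure(conn, dialect, proc_name, in_params, out_types=()):
    """Hypothetical sketch of dialect-specific stored procedure invocation."""
    cursor = conn.cursor()
    if dialect == "oracle":
        # cx_Oracle/oracledb: register output parameters with cursor.var()
        out_vars = [cursor.var(t) for t in out_types]
        cursor.callproc(proc_name, list(in_params) + out_vars)
        return [v.getvalue() for v in out_vars]
    if dialect == "mssql":
        placeholders = ", ".join("?" for _ in in_params)
        cursor.execute(f"EXEC {proc_name} {placeholders}", list(in_params))
    else:
        placeholders = ", ".join("%s" for _ in in_params)
        cursor.execute(f"CALL {proc_name}({placeholders})", list(in_params))
    return cursor.fetchall() if cursor.description else []
```

The documented empty-input guard would simply skip this call when the upstream DataFrame has no rows.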

### State Persistence (v1.7+)

JSON-based variable persistence between workflow runs:
- `load_persistent_state()` / `save_persistent_state()` bracketing workflow execution
- `get_persistent_variable()` / `set_persistent_variable()` scoped by workflow/mapping name
- Mapping variables marked `is_persistent="YES"` automatically load from and save to state file
- Non-persistent variables remain unaffected
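
A minimal sketch of how such a JSON state layer can work, assuming a module-level dict keyed by scope (the file name and exact signatures are assumptions):

```python
import json
import os

STATE_FILE = "workflow_state.json"  # assumed default location
_state = {}

def load_persistent_state(path=STATE_FILE):
    """Load the JSON state file at workflow start (empty if missing)."""
    global _state
    if os.path.exists(path):
        with open(path) as f:
            _state = json.load(f)
    return _state

def save_persistent_state(path=STATE_FILE):
    """Write accumulated variable values back at workflow end."""
    with open(path, "w") as f:
        json.dump(_state, f, indent=2)

def get_persistent_variable(scope, name, default=None):
    """Read a variable scoped by workflow/mapping name."""
    return _state.get(scope, {}).get(name, default)

def set_persistent_variable(scope, name, value):
    """Record a variable under its workflow/mapping scope."""
    _state.setdefault(scope, {})[name] = value
```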

### SQL Dialect Translation (v1.6+)

Automatically translates vendor-specific SQL to ANSI equivalents:
- **Oracle:** NVL→COALESCE, SYSDATE→CURRENT_TIMESTAMP, DECODE→CASE, NVL2→CASE, (+)→ANSI JOIN, ROWNUM→LIMIT
- **MSSQL:** GETDATE→CURRENT_TIMESTAMP, ISNULL→COALESCE, TOP N→LIMIT, LEN→LENGTH, CHARINDEX→POSITION
- Auto-detects source dialect; outputs both original and translated SQL
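
For a feel of the mechanics, a stripped-down regex rewriter covering two of the Oracle rules (the package's translator handles the full list, including DECODE, `(+)` joins, and ROWNUM):

```python
import re

# Illustrative Oracle -> ANSI rewrite rules (a small subset)
ORACLE_RULES = [
    (re.compile(r"\bNVL\s*\(", re.IGNORECASE), "COALESCE("),
    (re.compile(r"\bSYSDATE\b", re.IGNORECASE), "CURRENT_TIMESTAMP"),
]

def translate_oracle_to_ansi(sql: str) -> str:
    for pattern, replacement in ORACLE_RULES:
        sql = pattern.sub(replacement, sql)
    return sql

print(translate_oracle_to_ansi("SELECT NVL(name, 'n/a'), SYSDATE FROM t"))
# SELECT COALESCE(name, 'n/a'), CURRENT_TIMESTAMP FROM t
```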

### Enhanced Error Reporting (v1.6+)

Structured error log with three analysis sections:
- **Unsupported Transforms:** Lists each skipped transform with type, field count, and attributes
- **Unmapped Ports:** OUTPUT fields not connected to any downstream transform
- **Unsupported Expression Functions:** Unknown functions with location traces

### Nested Mapplet Support (v1.6+)

Recursively expands mapplet-within-mapplet instances:
- Double-underscore namespacing for nested transforms
- Depth limit of 10 with circular reference protection
- Connector rewiring through the full expansion tree
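
In outline, the recursion looks like this sketch, with invented data structures (a dict of mapplet definitions, each holding a transform list):

```python
MAX_MAPPLET_DEPTH = 10  # the documented depth limit

def expand_instance(name, mapplets, depth=0, path=()):
    """Hypothetical sketch: flatten nested mapplets into namespaced transforms."""
    if depth > MAX_MAPPLET_DEPTH:
        raise RecursionError(f"mapplet nesting deeper than {MAX_MAPPLET_DEPTH}: {path}")
    if name in path:  # circular reference protection
        raise ValueError("circular mapplet reference: " + " -> ".join(path + (name,)))
    flat = []
    for transform in mapplets[name]["transforms"]:
        if transform["type"] == "Mapplet":
            flat.extend(expand_instance(transform["name"], mapplets,
                                        depth + 1, path + (name,)))
        else:
            # double-underscore namespacing through the expansion tree
            flat.append({**transform,
                         "name": "__".join(path + (name, transform["name"]))})
    return flat
```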

### Data Quality Validation (v1.6+)

Optional `--validate-casts` flag generates null-count checks before/after type casting:
- Counts null values pre- and post-coercion per column
- Logs warnings when coercion introduces new nulls
- Helps identify data quality issues during test runs
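
The emitted checks amount to something like this sketch (the exact log wording is illustrative):

```python
import logging
import pandas as pd

logger = logging.getLogger(__name__)

def cast_with_validation(df, col):
    """Count nulls before/after coercion and warn when the cast adds new ones."""
    nulls_before = int(df[col].isna().sum())
    df[col] = pd.to_numeric(df[col], errors="coerce")  # the cast under validation
    nulls_after = int(df[col].isna().sum())
    if nulls_after > nulls_before:
        logger.warning("Cast of %s introduced %d new nulls (%d -> %d)",
                       col, nulls_after - nulls_before, nulls_before, nulls_after)
    return df
```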

### Vectorized Expression Generation (v1.5+)

Column-level pandas operations instead of row-level iteration:
- IIF → `np.where()`, NVL → `.fillna()`, UPPER/LOWER → `.str.upper()/.str.lower()`
- SUBSTR → `.str[start:end]`, TO_INTEGER → `pd.to_numeric()`, TO_DATE → `pd.to_datetime()`
- IS NULL/IS NOT NULL → `.isna()`/`.notna()`
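
For example, an expression like `IIF(ISNULL(STATUS), 'UNKNOWN', UPPER(STATUS))` becomes a single vectorized statement (the output column name is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"STATUS": ["active", None, "closed"]})
# IIF -> np.where, ISNULL -> .isna(), UPPER -> .str.upper()
df["STATUS_OUT"] = np.where(df["STATUS"].isna(), "UNKNOWN", df["STATUS"].str.upper())
# -> ["ACTIVE", "UNKNOWN", "CLOSED"]
```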

### Parameter File Support (v1.5+)

Standard Informatica `.param` file parsing:
- `[Global]` and `[folder.WF:workflow.ST:session]` section support
- `get_param(config, var_name)` resolution chain: config → env vars → defaults
- CLI `--param-file` flag for specifying parameter files
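
The resolution chain might look like this sketch (the config layout and `$`-stripping behavior shown are assumptions):

```python
import os

def get_param(config, var_name, default=None):
    """Sketch of the documented chain: config -> environment variables -> default."""
    if var_name in config.get("parameters", {}):   # 1. parameter file / config.yml
        return config["parameters"][var_name]
    env_name = var_name.strip("$")                 # e.g. $$LOAD_DATE -> LOAD_DATE
    if env_name in os.environ:                     # 2. environment variable fallback
        return os.environ[env_name]
    return default                                 # 3. hard-coded default
```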

### Session Connection Overrides (v1.4+)

When sessions define per-transform connection overrides (different database, file directory, or filename), the generated code uses those overrides instead of source/target defaults.

### Worklet Support (v1.4+)

Worklet workflows are detected and generate separate `run_worklet_NAME(config)` functions. The main workflow calls these automatically for Worklet task types.

### Type Casting at Target Writes (v1.4+)

Target field datatypes are mapped to pandas types and generate proper casting code:
- Integers: nullable `Int64`/`Int32` or `fillna(0).astype(int)` for NOT NULL
- Dates: `pd.to_datetime(errors='coerce')`
- Decimals/Floats: `pd.to_numeric(errors='coerce')`
- Booleans: `.astype('boolean')`
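
Concretely, the generated casts look like these (the column names and sample frame are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"CUSTOMER_ID": [1, None], "QTY": [2, None],
                   "CREATED_AT": ["2024-01-01", "bad"], "AMOUNT": ["3.5", "x"],
                   "IS_ACTIVE": [True, None]})

df["CUSTOMER_ID"] = df["CUSTOMER_ID"].astype("Int64")                 # nullable integer
df["QTY"] = df["QTY"].fillna(0).astype(int)                           # NOT NULL integer
df["CREATED_AT"] = pd.to_datetime(df["CREATED_AT"], errors="coerce")  # date -> NaT on failure
df["AMOUNT"] = pd.to_numeric(df["AMOUNT"], errors="coerce")           # decimal -> NaN on failure
df["IS_ACTIVE"] = df["IS_ACTIVE"].astype("boolean")                   # nullable boolean
```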

### Flat File Handling (v1.3+)

Parses FLATFILE metadata for delimiter, fixed-width layout, header lines, skipped rows, and quote/escape characters. Generates `pd.read_fwf()` for fixed-width files or an enriched `read_file()` call for delimited files, as in the sketch below.
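
Roughly what the two read paths reduce to (file names, column specs, and delimiters are invented; the delimited path shows the pandas call a helper like `read_file()` would wrap):

```python
import pandas as pd

# Delimited file with header row, skipped leading rows, and quote/escape chars
df = pd.read_csv("customers.dat", sep="|", header=0, skiprows=2,
                 quotechar='"', escapechar="\\")

# Fixed-width file with explicit column boundaries
df = pd.read_fwf("legacy.txt", colspecs=[(0, 10), (10, 30), (30, 38)],
                 names=["ID", "NAME", "AMOUNT"])
```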

### Mapplet Inlining (v1.3+)

Expands Mapplet instances into prefixed transforms, rewires connectors, and eliminates duplication.

### Decision Tasks (v1.3+)

Converts Informatica decision conditions to Python if/else branches with proper variable substitution.
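
The generated branch has roughly this shape (session and variable names are invented):

```python
def run_decision_dec_check_load(task_status, config):
    """Hypothetical shape of a generated decision branch.

    Informatica condition: $s_LOAD_CUSTOMERS.Status = SUCCEEDED
    """
    if task_status.get("s_LOAD_CUSTOMERS") == "SUCCEEDED":
        return "s_LOAD_SUMMARY"  # next task on the true branch
    return "s_SEND_ALERT"        # next task on the false branch
```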

### Expression Converter (80+ Functions)

Converts Informatica expressions to Python equivalents:

- **String:** SUBSTR, LTRIM, RTRIM, UPPER, LOWER, LPAD, RPAD, INSTR, LENGTH, CONCAT, REPLACE, REG_EXTRACT, REG_REPLACE, REVERSE, INITCAP, CHR, ASCII
- **Date:** ADD_TO_DATE, DATE_DIFF, GET_DATE_PART, SYSDATE, SYSTIMESTAMP, TO_DATE, TO_CHAR, TRUNC (date)
- **Numeric:** ROUND, TRUNC, MOD, ABS, CEIL, FLOOR, POWER, SQRT, LOG, EXP, SIGN
- **Conversion:** TO_INTEGER, TO_BIGINT, TO_FLOAT, TO_DECIMAL, TO_CHAR, TO_DATE
- **Null handling:** IIF, DECODE, NVL, NVL2, ISNULL, IS_SPACES, IS_NUMBER
- **Aggregate:** SUM, AVG, COUNT, MIN, MAX, FIRST, LAST, MEDIAN, STDDEV, VARIANCE
- **Lookup:** :LKP expressions with dynamic lookup references
- **Variable:** SETVARIABLE / mapping variable assignment

## Requirements

- Python >= 3.8
- lxml >= 4.9.0
- PyYAML >= 6.0

## Changelog

### v1.8.x (Phase 7)
- Row-count logging at every pipeline step (source reads, transforms, target writes)
- Backend-safe logging (try/except wrapped for Dask/lazy backends)
- Rich mapping function docstrings with sources, targets, and transform pipeline summary
- Per-transform documentation headers with description, input/output field lists

### v1.7.x (Phase 6)
- Window/analytic functions (rolling avg/sum, cumulative sum, percentile)
- Update Strategy routing with actual INSERT/UPDATE/DELETE target operations
- Dialect-aware SQL placeholders for MSSQL/PostgreSQL/Oracle
- Full stored procedure code generation (Oracle/MSSQL/generic)
- JSON-based state persistence for mapping and workflow variables
- Primary key auto-detection for update strategy targets

### v1.6.x (Phase 5)
- SQL dialect translation (Oracle/MSSQL → ANSI)
- Enhanced error reporting (unsupported transforms, unmapped ports, unknown functions)
- Nested mapplet expansion with circular reference protection
- Data quality validation warnings on type casting (`--validate-casts`)

### v1.5.x (Phase 4)
- Parameter file support (`.param` files with section parsing)
- Vectorized expression generation (column-level pandas operations)
- Library-specific code adapters (polars/dask/modin/vaex syntax generation)
- 72+ integration tests

### v1.4.x (Phase 3)
- Session connection overrides for sources and targets
- Worklet function generation with safe invocation
- Type casting at target writes based on TARGETFIELD datatypes
- Flat-file session path overrides properly wired

### v1.3.x (Phase 2)
- FLATFILE metadata in source reads and target writes
- Normalizer with `pd.melt()`
- Rank with group-by and Top-N filtering
- Decision tasks with real if/else branches
- Mapplet instance inlining

### v1.2.x (Phase 1)
- Core parser for all 72 XML tags
- Expression converter with 80+ functions
- Aggregator, Joiner, Lookup code generation
- Workflow orchestration with topological task ordering
- Multi-library support (pandas, dask, polars, vaex, modin)

## Development

```bash
# Clone and install in development mode
cd informatica_python
pip install -e ".[dev]"

# Run tests (136 tests)
pytest tests/ -v
```

## License

MIT License - Copyright (c) 2025 Nick

See [LICENSE](LICENSE) for details.
