Metadata-Version: 2.4
Name: informatica-python
Version: 1.8.0
Summary: Convert Informatica PowerCenter workflow XML to Python/PySpark code
Author: Nick
License: MIT
Keywords: informatica,powercenter,etl,code-generator,pandas,pyspark,data-engineering
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Code Generators
Classifier: Topic :: Database :: Database Engines/Servers
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: lxml>=4.9.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# informatica-python

Convert Informatica PowerCenter workflow XML exports into clean, runnable Python/PySpark code.

**Author:** Nick
**License:** MIT
**PyPI:** [informatica-python](https://pypi.org/project/informatica-python/)

---

## Overview

`informatica-python` parses Informatica PowerCenter XML export files and generates equivalent Python code using your choice of data library. It handles all 72 DTD tags from the PowerCenter XML schema and produces a complete, ready-to-run Python project.

## Installation

```bash
pip install informatica-python
```

## Quick Start

### Command Line

```bash
# Generate Python files to a directory
informatica-python workflow_export.xml -o output_dir

# Generate as a zip archive
informatica-python workflow_export.xml -z output.zip

# Use a different data library
informatica-python workflow_export.xml -o output_dir --data-lib polars

# Parse to JSON only (no code generation)
informatica-python workflow_export.xml --json

# Save parsed JSON to file
informatica-python workflow_export.xml --json-file parsed.json
```

### Python API

```python
from informatica_python import InformaticaConverter

converter = InformaticaConverter()

# Parse and generate files
converter.convert_to_files("workflow_export.xml", "output_dir")

# Parse and generate zip
converter.convert_to_zip("workflow_export.xml", "output.zip")

# Parse to structured dict
result = converter.parse_file("workflow_export.xml")

# Use a different data library
converter.convert_to_files("workflow_export.xml", "output_dir", data_lib="polars")
```

## Generated Output Files

| File | Description |
|------|-------------|
| `helper_functions.py` | Database/file I/O helpers, Informatica expression equivalents (80+ functions) |
| `mapping_N.py` | One per mapping — transformation logic, source reads, target writes |
| `workflow.py` | Task orchestration with topological ordering and error handling |
| `config.yml` | Connection configs, source/target metadata, runtime parameters |
| `all_sql_queries.sql` | All SQL extracted from Source Qualifiers, Lookups, SQL transforms |
| `error_log.txt` | Conversion summary, warnings, and unsupported feature notes |

## Supported Data Libraries

Select a library with the `--data-lib` CLI flag or the `data_lib` API parameter:

| Library | Flag | Best For |
|---------|------|----------|
| **pandas** | `pandas` (default) | General-purpose, most compatible |
| **dask** | `dask` | Large datasets, parallel processing |
| **polars** | `polars` | High performance, Rust-backed |
| **vaex** | `vaex` | Out-of-core, billion-row datasets |
| **modin** | `modin` | Drop-in pandas replacement, multi-core |

## Supported Transformations

The code generator produces real, runnable Python for these transformation types:

- **Source Qualifier** — SQL override, pre/post SQL, column selection
- **Expression** — Field-level expressions converted to pandas operations
- **Filter** — Row filtering with converted conditions
- **Joiner** — `pd.merge()` with join type and condition parsing
- **Lookup** — `pd.merge()` lookups with connection-aware DB/file reads
- **Aggregator** — `groupby().agg()` with SUM/COUNT/AVG/MIN/MAX/FIRST/LAST
- **Sorter** — `sort_values()` with multi-key ascending/descending
- **Router** — Multi-group conditional routing with if/elif/else
- **Union** — `pd.concat()` across multiple input groups
- **Update Strategy** — Insert/Update/Delete/Reject flag generation
- **Sequence Generator** — Auto-incrementing ID columns
- **Normalizer** — `pd.melt()` with auto-detected id/value vars
- **Rank** — `groupby().rank()` with Top-N filtering
- **Stored Procedure** — Stub generation with SP name and parameters
- **Transaction Control** — Commit/rollback logic stubs
- **Custom / Java** — Placeholder stubs with TODO markers
- **SQL Transform** — Direct SQL execution pass-through
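
As a rough illustration, the generated pandas code for a Joiner feeding an Aggregator might look like the sketch below. The frame and column names are hypothetical; the actual generated code depends on the mapping being converted.

```python
import pandas as pd

# Hypothetical frames standing in for two upstream pipelines
orders = pd.DataFrame({"CUST_ID": [1, 1, 2], "AMOUNT": [10.0, 20.0, 5.0]})
customers = pd.DataFrame({"CUST_ID": [1, 2], "NAME": ["Ann", "Bob"]})

# Joiner: pd.merge() with the parsed join type and condition
joined = pd.merge(orders, customers, how="inner", on="CUST_ID")

# Aggregator: groupby().agg() for SUM/COUNT-style output ports
agg = (
    joined.groupby("NAME", as_index=False)
    .agg(TOTAL_AMOUNT=("AMOUNT", "sum"), ORDER_COUNT=("AMOUNT", "count"))
)
```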

## Supported XML Tags (72 Tags)

**Top-level:** POWERMART, REPOSITORY, FOLDER, FOLDERVERSION

**Source/Target:** SOURCE, SOURCEFIELD, TARGET, TARGETFIELD, TARGETINDEX, TARGETINDEXFIELD, FLATFILE, XMLINFO, XMLTEXT, GROUP, TABLEATTRIBUTE, FIELDATTRIBUTE, METADATAEXTENSION, KEYWORD, ERPSRCINFO

**Mapping/Mapplet:** MAPPING, MAPPLET, TRANSFORMATION, TRANSFORMFIELD, TRANSFORMFIELDATTR, TRANSFORMFIELDATTRDEF, INSTANCE, ASSOCIATED_SOURCE_INSTANCE, CONNECTOR, MAPDEPENDENCY, TARGETLOADORDER, MAPPINGVARIABLE, FIELDDEPENDENCY, INITPROP, ERPINFO

**Task/Session/Workflow:** TASK, TIMER, VALUEPAIR, SCHEDULER, SCHEDULEINFO, STARTOPTIONS, ENDOPTIONS, SCHEDULEOPTIONS, RECURRING, CUSTOM, DAILYFREQUENCY, REPEAT, FILTER, SESSION, CONFIGREFERENCE, SESSTRANSFORMATIONINST, SESSTRANSFORMATIONGROUP, PARTITION, HASHKEY, KEYRANGE, CONFIG, SESSIONCOMPONENT, CONNECTIONREFERENCE, TASKINSTANCE, WORKFLOWLINK, WORKFLOWVARIABLE, WORKFLOWEVENT, WORKLET, WORKFLOW, ATTRIBUTE

**Shortcut:** SHORTCUT

**SAP:** SAPFUNCTION, SAPSTRUCTURE, SAPPROGRAM, SAPOUTPUTPORT, SAPVARIABLE, SAPPROGRAMFLOWOBJECT, SAPTABLEPARAM

## Key Features

### Session Connection Overrides (v1.4+)
When sessions define per-transform connection overrides (different database, file directory, or filename), the generated code uses those overrides instead of source/target defaults.

### Worklet Support (v1.4+)
Worklet workflows are detected and generate separate `run_worklet_NAME(config)` functions. The main workflow calls these automatically for Worklet task types.

### Type Casting at Target Writes (v1.4+)
Target field datatypes are mapped to pandas types and generate proper casting code:
- Integers: nullable `Int64`/`Int32` or `fillna(0).astype(int)` for NOT NULL
- Dates: `pd.to_datetime(errors='coerce')`
- Decimals/Floats: `pd.to_numeric(errors='coerce')`
- Booleans: `.astype('boolean')`
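
A minimal sketch of the casting patterns listed above, using a hypothetical frame (the generated code derives column names and dtypes from TARGETFIELD metadata):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["1", "2", None],
    "CREATED": ["2024-01-01", "bad-date", "2024-03-15"],
    "PRICE": ["9.99", "x", "1.50"],
    "ACTIVE": [True, None, False],
})

# Nullable integer dtype for a nullable target column
df["ID"] = pd.to_numeric(df["ID"], errors="coerce").astype("Int64")
# Dates: unparseable values become NaT instead of raising
df["CREATED"] = pd.to_datetime(df["CREATED"], errors="coerce")
# Decimals/floats: unparseable values become NaN
df["PRICE"] = pd.to_numeric(df["PRICE"], errors="coerce")
# Booleans: pandas nullable boolean dtype
df["ACTIVE"] = df["ACTIVE"].astype("boolean")
```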

### Flat File Handling (v1.3+)
Parses FLATFILE metadata (delimiter, fixed-width layout, header lines, skip rows, quote/escape characters) and generates `pd.read_fwf()` for fixed-width files or an enriched `read_file()` call for delimited files.
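
For fixed-width files, the generated read looks roughly like this; the column layout below is hypothetical, whereas in generated code the `colspecs` come from the field widths in the FLATFILE/SOURCEFIELD metadata:

```python
import io
import pandas as pd

# Fixed-width sample: a 5-char NAME field and a 2-char AGE field (hypothetical layout)
data = "Alice 42\nBob    7\n"

# colspecs are (start, end) character offsets per field
df = pd.read_fwf(io.StringIO(data), colspecs=[(0, 5), (6, 8)], names=["NAME", "AGE"])
```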

### Mapplet Inlining (v1.3+)
Expands Mapplet instances into prefixed transforms, rewires connectors, and eliminates duplication.

### Decision Tasks (v1.3+)
Converts Informatica decision conditions to Python if/else branches with proper variable substitution.
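
A hedged sketch of what a converted decision condition might look like; the variable names and condition are illustrative, not taken from the package's actual output:

```python
# Hypothetical workflow variables substituted from the parsed XML
workflow_vars = {"$$ROW_COUNT": 1250, "$$THRESHOLD": 1000}

# A decision condition like "$$ROW_COUNT > $$THRESHOLD" becomes a Python branch
if workflow_vars["$$ROW_COUNT"] > workflow_vars["$$THRESHOLD"]:
    decision_result = 1  # TRUE path: downstream tasks linked to this outcome run
else:
    decision_result = 0
```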

### Expression Converter (80+ Functions)

Converts Informatica expressions to Python equivalents:

- **String:** SUBSTR, LTRIM, RTRIM, UPPER, LOWER, LPAD, RPAD, INSTR, LENGTH, CONCAT, REPLACE, REG_EXTRACT, REG_REPLACE, REVERSE, INITCAP, CHR, ASCII
- **Date:** ADD_TO_DATE, DATE_DIFF, GET_DATE_PART, SYSDATE, SYSTIMESTAMP, TO_DATE, TO_CHAR, TRUNC (date)
- **Numeric:** ROUND, TRUNC, MOD, ABS, CEIL, FLOOR, POWER, SQRT, LOG, EXP, SIGN
- **Conversion:** TO_INTEGER, TO_BIGINT, TO_FLOAT, TO_DECIMAL, TO_CHAR, TO_DATE
- **Null handling:** IIF, DECODE, NVL, NVL2, ISNULL, IS_SPACES, IS_NUMBER
- **Aggregate:** SUM, AVG, COUNT, MIN, MAX, FIRST, LAST, MEDIAN, STDDEV, VARIANCE
- **Lookup:** :LKP expressions with dynamic lookup references
- **Variable:** SETVARIABLE / mapping variable assignment
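
To give a feel for the mapping, a few of these expressions have direct pandas/NumPy equivalents; the sketch below is illustrative and does not use the package's actual helper function names:

```python
import numpy as np
import pandas as pd

s = pd.Series(["  ann  ", None, "bob"])

# NVL(col, 'n/a')           -> fillna
filled = s.fillna("n/a")
# LTRIM(RTRIM(col))         -> str.strip
trimmed = s.str.strip()
# IIF(ISNULL(col), 0, 1)    -> np.where over isna()
flags = np.where(s.isna(), 0, 1)
# SUBSTR(col, 1, 3)         -> str slicing (Informatica positions are 1-based)
first3 = s.str.slice(0, 3)
```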

## Requirements

- Python >= 3.8
- lxml >= 4.9.0
- PyYAML >= 6.0

## Changelog

### v1.4.x (Phase 3)
- Session connection overrides for sources and targets
- Worklet function generation with safe invocation
- Type casting at target writes based on TARGETFIELD datatypes
- Flat-file session path overrides properly wired

### v1.3.x (Phase 2)
- FLATFILE metadata in source reads and target writes
- Normalizer with `pd.melt()`
- Rank with group-by and Top-N filtering
- Decision tasks with real if/else branches
- Mapplet instance inlining

### v1.2.x (Phase 1)
- Core parser for all 72 XML tags
- Expression converter with 80+ functions
- Aggregator, Joiner, Lookup code generation
- Workflow orchestration with topological task ordering
- Multi-library support (pandas, dask, polars, vaex, modin)

## Development

```bash
# From a clone of the repository, install in development mode
cd informatica_python
pip install -e ".[dev]"

# Run tests (25 tests)
pytest tests/test_converter.py -v
```

## License

MIT License - Copyright (c) 2025 Nick

See [LICENSE](LICENSE) for details.
