Metadata-Version: 2.4
Name: stylometry-cli
Version: 1.0.2
Summary: Quantitative author fingerprinting & stylometric analysis - offline CLI tool
Author: SpectreDeath
License: MIT
Project-URL: Homepage, https://github.com/SpectreDeath/stylometry-cli
Project-URL: Documentation, https://github.com/SpectreDeath/stylometry-cli#readme
Project-URL: Repository, https://github.com/SpectreDeath/stylometry-cli.git
Project-URL: Issues, https://github.com/SpectreDeath/stylometry-cli/issues
Keywords: stylometry,authorship,nlp,linguistics,forensics,text-analysis
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: scipy>=1.11.0
Requires-Dist: matplotlib>=3.7.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: pypdf>=3.0.0
Provides-Extra: ai
Requires-Dist: requests>=2.28.0; extra == "ai"
Provides-Extra: stylometrix
Requires-Dist: stylo_metrix>=0.1.9; extra == "stylometrix"
Requires-Dist: spacy>=3.7.0; extra == "stylometrix"
Requires-Dist: regex>=2023.0.0; extra == "stylometrix"
Provides-Extra: all
Requires-Dist: stylometry-cli[ai,stylometrix]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: build>=1.0.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Dynamic: license-file

# Stylometry CLI (local/offline) — v1.0

This is a small, **offline** Python tool to extract **stylometric artifacts/patterns** from text
and optionally compute simple **similarity** signals between corpora using character n-grams.

It’s designed to slot into your **Stylometry Orchestrator** workflow by emitting SAO-style
`ResultBundle_*.json` files plus CSV artifacts.

## What it does

For each document (and each chunk of a document), it computes:

- **Lexical**
  - word count, unique word count
  - average word length
  - **MATTR** lexical diversity (more length-robust than raw TTR)

- **Syntactic (proxy)**
  - average sentence length
  - sentence length variation (population SD)

- **Habitual**
  - function word frequencies (configurable list)
  - punctuation rates (commas, semicolons, etc., per 1000 words and per sentence)
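
The metrics listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the tool's actual code: the regex tokenizer, the sentence splitter, and the function names (`tokenize`, `mattr`, `sentence_length_stats`, `commas_per_1000_words`) are all assumptions made for the example.

```python
import re
import statistics

# Assumption: a simple regex tokenizer; the real tool may tokenize differently.
WORD_RE = re.compile(r"[A-Za-z']+")


def tokenize(text):
    """Lowercased word tokens."""
    return [w.lower() for w in WORD_RE.findall(text)]


def mattr(words, window=500):
    """Moving-average type-token ratio: mean TTR over every sliding window."""
    if not words:
        return 0.0
    if len(words) <= window:
        # Text shorter than one window: fall back to plain TTR.
        return len(set(words)) / len(words)
    ratios = [
        len(set(words[i:i + window])) / window
        for i in range(len(words) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def sentence_length_stats(text):
    """Return (mean, population SD) of sentence lengths measured in words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(tokenize(s)) for s in sentences]
    if not lengths:
        return 0.0, 0.0
    return statistics.fmean(lengths), statistics.pstdev(lengths)


def commas_per_1000_words(text):
    """Comma count normalized to a rate per 1000 words."""
    words = tokenize(text)
    return 1000 * text.count(",") / len(words) if words else 0.0
```

Because each window has a fixed size, MATTR does not shrink as a document grows the way raw TTR does, which is why it is the more length-robust of the two.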

If 2+ corpora are provided and there are enough chunks, it also computes:

- **Char n-gram TF-IDF centroid cosine similarity** across corpora (`corpus_similarity_char_ngrams.csv`)
- **Nearest-centroid chunk assignment** (`chunk_assignments_char_ngrams.csv`)

> Note: these are **signals**, not definitive authorship proof. Topic/genre/boilerplate can dominate.
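
To make the two comparison signals concrete, here is a standard-library sketch of character n-gram TF-IDF, per-corpus centroids, cosine similarity, and nearest-centroid assignment. It is an illustration under stated assumptions, not the tool's pipeline: the package depends on scikit-learn, whose `TfidfVectorizer` provides the `char` and `char_wb` analyzers, whereas this sketch uses plain trigrams, a smoothed IDF, and hypothetical function names. It also includes each chunk in its own corpus centroid, which flatters the assignment accuracy.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(chunks, n=3):
    """L2-normalized TF-IDF vectors (as dicts), one per chunk."""
    counts = [Counter(char_ngrams(c, n)) for c in chunks]
    df = Counter(g for c in counts for g in c)  # document frequency
    total = len(chunks)
    vectors = []
    for c in counts:
        vec = {
            g: tf * (math.log((1 + total) / (1 + df[g])) + 1.0)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({g: v / norm for g, v in vec.items()})
    return vectors


def centroid(vectors):
    """Component-wise mean of sparse vectors."""
    total = Counter()
    for v in vectors:
        for g, w in v.items():
            total[g] += w
    return {g: w / len(vectors) for g, w in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def compare(corpora, n=3):
    """corpora maps label -> list of chunk texts.

    Returns (centroids, [(true_label, assigned_label), ...]).
    """
    labels, chunks = [], []
    for label, texts in corpora.items():
        for t in texts:
            labels.append(label)
            chunks.append(t)
    vectors = tfidf_vectors(chunks, n)
    centroids = {
        label: centroid([v for l, v in zip(labels, vectors) if l == label])
        for label in corpora
    }
    assigned = [
        max(centroids, key=lambda lab: cosine(v, centroids[lab]))
        for v in vectors
    ]
    return centroids, list(zip(labels, assigned))
```

The centroid-to-centroid cosine corresponds to the corpus similarity table, and the per-chunk `max` corresponds to the chunk assignment table.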

## Requirements

- Windows, macOS, or Linux
- Python **3.10+** (the package metadata declares `>=3.10`; the install example below uses 3.12)
- Dependencies installed with `pip`

## Install (Windows PowerShell)

```powershell
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt
```

Quick check:

```powershell
python -c "import numpy, pandas, sklearn; print('ok')"
```

## Input formats

You provide one or more `--corpus LABEL=PATH` arguments.

`PATH` can be:
- a single `.txt` / `.md` file
- a folder containing `.txt` / `.md` files (recursively)
- a `.zip` archive containing `.txt` / `.md` files (recursively)

Examples of folder layouts that work:

**Single corpus**
```
my_corpus/
  speech1.txt
  speech2.txt
  speech3.txt
```

**Multiple corpora**
```
corpora/
  A/
    doc1.txt
    doc2.txt
  B/
    doc3.txt
    doc4.txt
```

You can point each corpus to its subfolder:
- `--corpus A=corpora/A --corpus B=corpora/B`
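
As a rough picture of how a `LABEL=PATH` argument could be resolved into a list of documents, the sketch below handles the three cases described above (single file, folder, zip). The helper names `parse_corpus_arg` and `list_documents` are hypothetical, and the tool's real discovery logic may differ, for example in ordering or in how it treats other file types.

```python
import zipfile
from pathlib import Path

# Suffixes named in this README; other types are ignored in this sketch.
TEXT_SUFFIXES = {".txt", ".md"}


def parse_corpus_arg(arg):
    """Split 'LABEL=PATH' into (label, Path)."""
    label, sep, path = arg.partition("=")
    if not sep or not label or not path:
        raise ValueError(f"expected LABEL=PATH, got {arg!r}")
    return label, Path(path)


def list_documents(path):
    """Return sorted document names found at a file, folder, or zip path."""
    path = Path(path)
    if path.is_dir():
        # Walk the folder recursively, keeping only text-like files.
        return sorted(
            p.relative_to(path).as_posix()
            for p in path.rglob("*")
            if p.is_file() and p.suffix.lower() in TEXT_SUFFIXES
        )
    if path.suffix.lower() == ".zip":
        # Archive member names already include their subfolder prefix.
        with zipfile.ZipFile(path) as zf:
            return sorted(
                name for name in zf.namelist()
                if Path(name).suffix.lower() in TEXT_SUFFIXES
            )
    if path.suffix.lower() in TEXT_SUFFIXES:
        return [path.name]
    return []
```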

## Run examples

### 1) Characterize a single document

```powershell
python stylometry_run.py --task characterize --corpus TextA=./speech1.txt --output ./out_textA
```

### 2) Build a profile from many documents (single corpus)

```powershell
python stylometry_run.py --task profile_build --corpus PersonX=./my_corpus --output ./out_personx
```

### 3) Compare two corpora

```powershell
python stylometry_run.py --task compare --corpus A=./corpora/A --corpus B=./corpora/B --output ./out_compare
```

### 4) Use zip archives

```powershell
python stylometry_run.py --task compare --corpus A=./A.zip --corpus B=./B.zip --output ./out_compare_zip
```

## Outputs

The output folder contains:

- `manifest.json` — corpus manifest (doc list + word counts + local provenance paths)
- `doc_metrics.csv` — per-document metrics
- `chunk_metrics.csv` — per-chunk metrics
- `ResultBundle_ArtifactExtractor.json` — SAO-compatible bundle describing artifacts produced
- `run_metadata.json` — parameters and reproducibility info

If 2+ corpora are provided and there are enough chunks, it also writes:
- `corpus_similarity_char_ngrams.csv`
- `chunk_assignments_char_ngrams.csv`
- `ResultBundle_Comparator.json`

If matplotlib is installed and working, it also saves:
- `plot_avg_sentence_len_boxplot.png`
- `plot_mattr_boxplot.png`
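
After a run, a short script can report which of the files listed above are missing from the output folder. This is a convenience sketch: the file names come from this README, while the grouping and the `missing_outputs` helper are assumptions for illustration.

```python
from pathlib import Path

# File names as documented above, grouped by when they are produced.
EXPECTED_OUTPUTS = {
    "always": [
        "manifest.json",
        "doc_metrics.csv",
        "chunk_metrics.csv",
        "ResultBundle_ArtifactExtractor.json",
        "run_metadata.json",
    ],
    "comparison": [
        "corpus_similarity_char_ngrams.csv",
        "chunk_assignments_char_ngrams.csv",
        "ResultBundle_Comparator.json",
    ],
    "plots": [
        "plot_avg_sentence_len_boxplot.png",
        "plot_mattr_boxplot.png",
    ],
}


def missing_outputs(out_dir):
    """Map each group to the expected files not present in out_dir."""
    out = Path(out_dir)
    return {
        group: [name for name in names if not (out / name).is_file()]
        for group, names in EXPECTED_OUTPUTS.items()
    }
```

Missing entries in the `comparison` or `plots` groups are expected for single-corpus runs or when plotting is unavailable.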

## Useful options

- `--chunk-words 1200` — set chunk size (default 1200)
- `--mattr-window 500` — MATTR window size (default 500)
- `--function-words-file path.txt` — override function word list (newline-delimited)
- `--include-chunk-text` — include chunk text in `chunk_metrics.csv` (can be large)
- `--char-analyzer char_wb|char` — default `char_wb` (often better for stylometry)
- `--max-features 50000` and `--min-df 2` — control n-gram feature size
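
To show what `--chunk-words` and `--function-words-file` control, here is a small sketch of fixed-size word chunking and function-word rate computation. The handling of the final short chunk, the lowercasing, and the helper names are assumptions; the tool may merge or drop trailing chunks and may normalize differently.

```python
from collections import Counter


def chunk_words(words, size=1200):
    """Split a token list into consecutive chunks of at most `size` words.

    Assumption: the final, shorter chunk is kept as-is.
    """
    return [words[i:i + size] for i in range(0, len(words), size)]


def load_function_words(lines):
    """Parse a newline-delimited word list, skipping blank lines."""
    return [ln.strip().lower() for ln in lines if ln.strip()]


def function_word_rates(words, function_words):
    """Relative frequency of each function word within a token list."""
    wanted = set(function_words)
    counts = Counter(w for w in words if w in wanted)
    total = len(words)
    return {fw: (counts[fw] / total if total else 0.0) for fw in function_words}
```

For example, 2500 tokens with the default size yield chunks of 1200, 1200, and 100 words.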

## Notes for political/public-figure corpora

Prepared remarks and official publications can reflect speechwriters, staff editing,
or transcript normalization. Use “channel-specific” corpora where possible
(e.g., floor speeches vs press releases vs prepared remarks).

## Troubleshooting

- If plots aren’t produced: ensure `matplotlib` is installed and that you have write permission for the output folder.
- If you see Unicode errors: convert the source files to UTF-8; otherwise the script falls back to forgiving decodes.
- If it’s slow on huge corpora: increase `--min-df`, reduce `--max-features`, or reduce corpus size.
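
For the Unicode case, a "forgiving decode" usually means trying strict encodings first and only then substituting undecodable bytes. The sketch below shows one such chain; the specific encodings tried and the function name are assumptions, not necessarily what the script does.

```python
def read_text_forgiving(data: bytes) -> str:
    """Decode bytes, preferring strict UTF-8 and degrading gracefully."""
    # Assumption: try UTF-8 (with optional BOM) first, then Windows-1252.
    for encoding in ("utf-8-sig", "cp1252"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return data.decode("utf-8", errors="replace")
```

To convert a file permanently, the decoded text can be written back with `Path.write_text(text, encoding="utf-8")`.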

