Metadata-Version: 2.2
Name: contentcoder
Version: 1.0.3
Summary: A generalized implementation of a dictionary-based content coder.
Author-email: "Ryan L. Boyd" <ryan@ryanboyd.io>
Project-URL: Homepage, https://github.com/ryanboyd/ContentCoder-Py
Project-URL: Issues, https://github.com/ryanboyd/ContentCoder-Py/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE

# ContentCoder

![AI Reading Machine](images/DALL·E%202023-02-01%2016.58.23%20-%20a%20machine%20that%20reads%20books%20and%20has%20good%20ideas,%20high%20quality%203d%20digital%20art.png)


ContentCoder is a Python-based text analysis tool that enables users to process and analyze text using custom linguistic dictionaries. It is inspired by tools like **LIWC (Linguistic Inquiry and Word Count)** and provides robust methods for tokenization, text analysis, and frequency calculations. As noted in a much older version of the README.MD, this is a stripped-down, feature-incomplete version of several tools used in past projects.

Note that like 98% of this readme was generated by ChatGPT — it may not be entirely accurate, but at a quick glance, it looks pretty spot-on 😅🤞

## 🔥 Features

- **Custom Dictionary-Based Analysis**  
- **Support for LIWC-style dictionaries (2007 & 2022 formats)**  
- **Efficient text tokenization**  
- **Wildcard and abbreviation handling**  
- **Punctuation and big word analysis**  
- **Dictionary export in multiple formats (JSON, CSV, Poster format, etc.)**  
- **High-performance wildcard matching with memory optimization**

---

## 🚀 Installation

Make sure you have Python 3.9+ installed (although it'll probably work with older versions as well). This package is pretty much entirely native Python, so it doesn't have any dependencies for installation. Well, none that I can recall, anyways 😄

```bash
pip install contentcoder
```

---

## 📁 Folder Structure

```
src/contentcoder/
│── __init__.py
│── ContentCoder.py
│── ContentCodingDictionary.py
│── happiestfuntokenizing.py
│── create_export_dir.py
```

---

## 📌 Quick Start

### **1. Import the `ContentCoder` class**
```python
from contentcoder.ContentCoder import ContentCoder
```

### **2. Initialize the Analyzer**
```python
cc = ContentCoder(dicFilename='path/to/dictionary.dic', fileEncoding='utf-8-sig')
```

### **3. Analyze a Text Sample**
```python
text = "Libraries are crucial to our society."
results = cc.Analyze(text, relativeFreq=True, dropPunct=True, retainCaptures=True, returnTokens=False, wildcardMem=True)
print(results)
```

Expected output:
```json
{
  "WC": 6,
  "Dic": 4.5,
  "BigWords": 2.0,
  "Numbers": 0.0,
  "AllPunct": 0.0,
  "Period": 0.0,
  "Comma": 0.0,
  "QMark": 0.0,
  "Exclam": 0.0,
  "Apostro": 0.0,
  "Libraries": 1.0,
  "crucial": 1.0,
  "society": 1.0
}
```

---

## 📖 **Main Functions & Usage**

### **1️⃣ `Analyze(text, **options)`**
Analyzes a given text and returns a dictionary of results.

#### **Parameters:**
- `inputText` _(str)_: The text to analyze.
- `relativeFreq` _(bool)_: If `True`, returns relative frequencies. Otherwise, raw frequencies.
- `dropPunct` _(bool)_: If `True`, punctuation is removed before processing.
- `retainCaptures` _(bool)_: If `True`, captures and stores wildcard-matched words.
- `returnTokens` _(bool)_: If `True`, returns tokenized text.
- `wildcardMem` _(bool)_: If `True`, speeds up wildcard processing by storing past matches.

#### **Example Usage:**
```python
result = cc.Analyze("Hello world! This is a test sentence.", returnTokens=relativeFreq=True)
```

---

### **2️⃣ `GetResultsHeader()`**
Returns a list of all available output categories.

#### **Example Usage:**
```python
print(cc.GetResultsHeader())
```

Expected output:
```json
["WC", "Dic", "BigWords", "Numbers", "AllPunct", "Period", "Comma", "QMark", "Exclam", "Apostro"]
```

---

### **3️⃣ `GetResultsArray(resultsDICT, rounding=4)`**
Formats the results of `Analyze()` into a CSV-friendly list.

#### **Example Usage:**
```python
text = "The government plays an important role."
result = cc.Analyze(text)
csv_row = cc.GetResultsArray(result)
print(csv_row)
```

Expected output:
```json
[6, 4.3, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

---

### **4️⃣ `ExportCaptures(filename, fileEncoding='utf-8-sig', wildcardsOnly=False, fullset=True)`**
Exports wildcard-captured words and their frequencies to a CSV file.

#### **Example Usage:**
```python
cc.ExportCaptures("captured_words.csv")
```

---

### **5️⃣ `ExportDict2007Format(dicOutFilename, fileEncoding, separateDicts=False, separateDictsFolder=None)`**
Exports the loaded dictionary in **LIWC-2007 format**.

#### **Example Usage:**
```python
cc.dict.ExportDict2007Format("dictionary_2007.dic")
```

---

### **6️⃣ `ExportDict2022Format(dicOutFilename, fileEncoding, **options)`**
Exports the loaded dictionary in **LIWC-22 format**.

#### **Example Usage:**
```python
cc.dict.ExportDict2022Format("dictionary_2022.dicx")
```

---

### **7️⃣ `ExportDictJSON(filename, fileEncoding, indent=4)`**
Exports the dictionary mapping to a JSON file.

#### **Example Usage:**
```python
cc.dict.ExportDictJSON("dictionary.json")
```

---

### **8️⃣ `UpdateCategories(dicTerm, newCategories)`**
Updates the categories associated with a dictionary term.

#### **Example Usage:**
```python
cc.dict.UpdateCategories(dicTerm="happiness", newCategories={"positive_emotion": 1.0, "joy": 0.5})
```

---

## 🔄 **Example: Processing a Large CSV File with `tqdm`**
This script reads a **large CSV file** and processes each text in the `"body"` column.

```python
import csv
from tqdm import tqdm
from contentcoder.ContentCoder import ContentCoder

cc = ContentCoder(dicFilename='dictionary.dic', fileEncoding='utf-8-sig')

with open("Comments.csv", "r", encoding="utf-8-sig") as csvfile:
    reader = csv.DictReader(csvfile)

    for row in tqdm(reader, desc="Processing", unit=" comments"):
        text = row["body"]
        result = cc.Analyze(text)

        # some other stuff to export your result here

```

---

## ⚡ Performance Optimizations

- **Uses wildcard caching** to speed up regex evaluations.
- **Tokenization is optimized** for handling social media text.
- **Processes large datasets efficiently** using streaming CSV reads.

---

## 📜 **Dictionary Formats Supported**
- **LIWC-2007 (`.dic`)**
- **LIWC-22 (`.dicx`, `.csv`)**
- **JSON Exports**
- **Custom Hierarchical Category Mapping**

---

## 🤝 **Contributing**
Pull requests are welcome! If you find bugs or have feature requests, open an issue.

---

## 📄 **License**
MIT License © 2021

---

## 📝 **Acknowledgments**
Developed by **Ryan L. Boyd, Ph.D.**  
For academic and research purposes. Or, you know, whatever.
