Metadata-Version: 2.1
Name: align4d
Version: 1.2.0
Summary: align4d: Multi-sequence alignment tools for aligning ASR and Speaker Diarization result
Author-email: Peilin Wu <pwu54@emory.edu>
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: Microsoft :: Windows :: Windows 10
Classifier: Operating System :: Microsoft :: Windows :: Windows 11
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX :: Linux
Project-URL: Bug Tracker, https://github.com/emorynlp/align4d/issues
Project-URL: Homepage, https://github.com/emorynlp/align4d

# User Instruction

## Introduction

**align4d** is a Python package for aligning the text output of Speaker Diarization and Speech Recognition to a gold standard transcript, especially when speakers overlap. This user manual provides a step-by-step guide on how to install, use, and troubleshoot the package.

## Mechanism

**align4d** uses a global alignment that is a multi-sequence variant of the Needleman-Wunsch algorithm to align the hypothesis (results generated by Speaker Diarization and Speech Recognition models) to the reference (usually a gold standard transcript, which is separated into multiple sequences if there are multiple speakers). The alignment happens at the token level. For long sequences, **align4d** automatically separates the sequence into smaller segments by finding the absolutely aligned parts (called barriers), aligns the segments separately, and finally assembles them.
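
To make the idea of token-level global alignment concrete, here is a minimal sketch of the classic two-sequence Needleman-Wunsch algorithm. It is a simplification for illustration only: it aligns the hypothesis to a single reference sequence, whereas **align4d** uses a multi-sequence variant. The function name `needleman_wunsch` and the scoring values (`match`, `mismatch`, `gap`) are assumptions and are not taken from the package.

```python
def needleman_wunsch(hyp, ref, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists; gaps are returned as empty strings."""
    rows, cols = len(hyp) + 1, len(ref) + 1

    # score[i][j] is the best score for aligning hyp[:i] with ref[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if hyp[i - 1] == ref[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two tokens
                score[i - 1][j] + gap,       # hypothesis token against a gap
                score[i][j - 1] + gap,       # reference token against a gap
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned_hyp, aligned_ref = [], []
    i, j = len(hyp), len(ref)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if hyp[i - 1] == ref[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned_hyp.append(hyp[i - 1])
                aligned_ref.append(ref[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_hyp.append(hyp[i - 1])
            aligned_ref.append("")
            i -= 1
        else:
            aligned_hyp.append("")
            aligned_ref.append(ref[j - 1])
            j -= 1
    return aligned_hyp[::-1], aligned_ref[::-1]


hyp_tokens = ["ok", "I", "am", "a", "fish"]
ref_tokens = ["I", "am", "a", "fish"]
aligned_hyp, aligned_ref = needleman_wunsch(hyp_tokens, ref_tokens)
print(aligned_hyp)
print(aligned_ref)
```

Both returned lists have the same length, so tokens at the same index are aligned to each other, which is the same convention the package's output uses.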

The **align4d** uses Levenshtein Distance as the measurement of the similarity between tokens while doing alignment. There can be 4 situations between each position of alignment:

1. Fully match. Two tokens are exactly the same (Levenshtein Distance is 0).
2. Partially match. Two tokens are not exactly the same, but the Levenshtein Distance between them is within a boundary.
3. Mismatch. Two tokens are different and the Levenshtein Distance between them exceeds the boundary.
4. Gap. Only one token is present because it is aligned to a gap (insertion or deletion of tokens).
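
The four situations above can be sketched in a few lines of Python. This is a minimal illustration, not the package's internal code: `levenshtein` and `classify` are hypothetical helper names, the comparison is assumed to be case-sensitive, and the boundary is treated as inclusive with a default of 2 (matching the `partial_bound` default described later).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def classify(hyp_token: str, ref_token: str, partial_bound: int = 2) -> str:
    """Label one aligned position; an empty string stands for a gap."""
    if hyp_token == "" or ref_token == "":
        return "gap"
    distance = levenshtein(hyp_token, ref_token)
    if distance == 0:
        return "fully match"
    if distance <= partial_bound:
        return "partially match"
    return "mismatch"


print(classify("fish", "fish"))   # identical tokens
print(classify("fish", "dish"))   # one substitution
print(classify("fish", "hello"))  # far apart
print(classify("ok", ""))         # aligned to a gap
```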

## Installation

To install **align4d**, you need to have Python version 3.10 or 3.11. Follow these steps:

1. Open your terminal or command prompt.
2. Type in the following command: `pip install align4d`
3. Wait for the package to download and install.

## Usage

### Importing Align4d

To use Align4d in your Python code, you need to import it. Here's how:

```python
from align4d import align
```

### Aligning Text Results

Align4d can align results from Speaker Diarization and Speech Recognition. For simple and straightforward usage, call the function like this:

```python
aligned_result = align.align(hypothesis, reference)
```

Here's the overview of all parameters of the function:

```python
aligned_result = align.align(hypothesis: str | list[str], reference: list[list[str]], partial_bound: int = 2, segment_length: int = None, barrier_length: int = None, strip_punctuation: bool = True)
```

The `align()` function takes 6 parameters. `hypothesis` and `reference` are required; the other 4 are optional:

1. `hypothesis`: This is either a string or a list of strings containing the tokenized text. In the list form, each string represents a word generated by the Speech Recognition model. It is suggested to remove all punctuation, escape sequences, and any other characters that are not part of natural language.
    
    ```python
    hypothesis = ["ok", "I", "am", "a", "fish", "Are", "you", "Hello", "there", "How", "are", "you", "ok"]
    # or 
    hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
    ```
    
2. `reference`: This is a nested list of strings containing the speaker labels and utterances from the gold standard text. In each inner list, the first string is the speaker label and the second string is the utterance. It is suggested to remove all punctuation, escape sequences, and any other characters that are not part of natural language.
    
    ```python
    reference = [
        ["A", "I am a fish."],
        ["B", "okay."],
        ["C", "Are you?"],
        ["D", "Hello there."],
        ["E", "How are you?"]
    ]
    ```
    
3. `partial_bound`: This is an integer that specifies the boundary between partially match and mismatch in terms of the Levenshtein Distance between the two tokens in comparison. This is an optional parameter and the default value is 2.
4. `segment_length`: This is an integer that specifies the minimum length of each segment, measured in hypothesis tokens. When both `segment_length` and `barrier_length` are provided, the program segments long sequences manually according to these values before the actual alignment.
    
    If `segment_length` and `barrier_length` are not provided and the hypothesis is longer than 100 tokens, the program automatically searches for the optimal `segment_length` between 30 and 120, and messages like the following appear during alignment:
    
    ```python
    segment length: 30 max hypothesis length: 13 max reference length: 12
    segment length: 31 max hypothesis length: 13 max reference length: 12
    segment length: 32 max hypothesis length: 13 max reference length: 12
    ...
    ...
    segment length: 117 max hypothesis length: 13 max reference length: 12
    segment length: 118 max hypothesis length: 13 max reference length: 12
    segment length: 119 max hypothesis length: 13 max reference length: 12
    optimal length: 119 optimal barrier length: 6
    ```
    
    If `segment_length` and `barrier_length` are not provided and the hypothesis is shorter than 100 tokens, no segmentation is performed.
    
    If `segment_length` and `barrier_length` are provided and both are integers less than or equal to 0, no segmentation will be performed.
    
    It is strongly suggested to use automatic or manual segmentation when the input sequences are long; otherwise the alignment may fail because it runs out of RAM.
    
    `segment_length` and `barrier_length` must be provided together to perform manual segmentation; otherwise an Exception is raised.
    
    ```python
    Exception: Segment length or barrier length parameter incorrect or missing.
    ```
    
5. `barrier_length`: This is an integer that specifies the length, in tokens, of the parts used to detect the absolutely aligned parts (barriers). This is an optional parameter and the default value is 6 if it is not specified. When both `segment_length` and `barrier_length` are provided, the program segments long sequences manually according to these values before the actual alignment.
    
    `segment_length` and `barrier_length` must be provided together to perform manual segmentation; otherwise an Exception is raised.
    
    ```python
    Exception: Segment length or barrier length parameter incorrect or missing.
    ```
    
6. `strip_punctuation`: This is a boolean that specifies whether **align4d** strips all punctuation from the hypothesis and reference before aligning, in order to produce a more accurate alignment. The default is **True**, and the output still shows the alignment with the original punctuation.
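
Because `segment_length` and `barrier_length` must be given together, it can help to check them before starting a long alignment. The sketch below is a hypothetical pre-check, not part of the package: the helper name `segmentation_kwargs` and the example values 50 and 6 are illustrative assumptions. It only builds the keyword arguments; the commented-out line shows where they would be passed to `align.align()`.

```python
def segmentation_kwargs(segment_length=None, barrier_length=None):
    """Return keyword arguments for manual segmentation, or {} for the default behaviour.

    Raises ValueError when only one of the two values is supplied, mirroring
    the documented rule that both must be provided together.
    """
    if (segment_length is None) != (barrier_length is None):
        raise ValueError("segment_length and barrier_length must be provided together")
    if segment_length is None:
        return {}
    return {"segment_length": segment_length, "barrier_length": barrier_length}


kwargs = segmentation_kwargs(segment_length=50, barrier_length=6)
print(kwargs)
# aligned_result = align.align(hypothesis, reference, **kwargs)
```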

At this stage, the alignment function also prints information relevant to the alignment calculation, including the size of the matrix used for storing alignment scores, the number of speakers, the maximum score in the matrix, and the computation time.

```python
 matrix size: 14 5 2 3 3 4  total cell: 5040 speaker num: 5 cell max score: 21
time: 0
```

The `align()` function returns a dictionary containing the aligned results. The aligned hypothesis is a list of strings (tokens) stored under the key “hypothesis”. The reference is separated into multiple sequences according to the provided speaker labels; each sequence is a list of strings (tokens) stored under its speaker label in a secondary dictionary, which is the value for the key “reference” in the primary dictionary. All lists have the same length: tokens at the same index are aligned to each other, and a gap is denoted by “” (empty string). If there is punctuation in the input, it is preserved in the output.

```python
import json

from align4d import align

hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
reference = [
        ["A", "I am a fish. "],
        ["B", "okay. "],
        ["C", "Are you? "],
        ["D", "Hello there. "],
        ["E", "How are you? "]
]
align_result = align.align(hypothesis, reference)
# Pretty-print the returned dictionary.
print(json.dumps(align_result, indent=4))
```

Sample output from `align()`:

```python
# content in align_result
{
    "hypothesis": ['ok', 'I', 'am', 'a', 'fish.', 'Are', 'you?', 'Hello', 'there.', 'How', 'are', 'you?', 'ok'],
    "reference": {
        "A": ['', 'I', 'am', 'a', 'fish.', '', '', '', '', '', '', '', ''],
        "B": ['okay.', '', '', '', '', '', '', '', '', '', '', '', ''],
        "C": ['', '', '', '', '', 'Are', 'you?', '', '', '', '', '', ''],
        "D": ['', '', '', '', '', '', '', 'Hello', 'there.', '', '', '', ''],
        "E": ['', '', '', '', '', '', '', '', '', 'How', 'are', 'you?', '']
    }
}
```
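
Since every list in the result has the same length, the output can be read column by column. The following sketch uses the sample output above as a plain dictionary (so it runs without calling `align()`) and lists, for each hypothesis position, which speakers have a reference token there. The helper name `speakers_per_token` is hypothetical and not part of the package.

```python
align_result = {
    "hypothesis": ["ok", "I", "am", "a", "fish.", "Are", "you?",
                   "Hello", "there.", "How", "are", "you?", "ok"],
    "reference": {
        "A": [""] + ["I", "am", "a", "fish."] + [""] * 8,
        "B": ["okay."] + [""] * 12,
        "C": [""] * 5 + ["Are", "you?"] + [""] * 6,
        "D": [""] * 7 + ["Hello", "there."] + [""] * 4,
        "E": [""] * 9 + ["How", "are", "you?"] + [""],
    },
}


def speakers_per_token(result):
    """For each aligned position, list the speakers with a non-gap reference token."""
    reference = result["reference"]
    labels = []
    for index in range(len(result["hypothesis"])):
        labels.append([speaker for speaker, tokens in reference.items()
                       if tokens[index] != ""])
    return labels


for token, speakers in zip(align_result["hypothesis"], speakers_per_token(align_result)):
    # An empty speaker list means the hypothesis token is aligned only to gaps.
    print(f"{token!r:10} {speakers}")
```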

### Retrieve token match result

Based on the alignment result, this tool provides a function to retrieve the matching result (fully match, partially match, mismatch, gap) for each token. Use `get_token_match_result()` to retrieve the token-level matching result.

The criteria for determining the matching result are as follows (see also **Mechanism**):

1. fully match: Levenshtein Distance = 0
2. partially match: Levenshtein Distance ≤ boundary (default to be 2)
3. mismatch: Levenshtein Distance > boundary (default to be 2)
4. gap: aligned to a gap

`get_token_match_result()` takes 2 parameters: `align_result`, which is the direct return value of the `align()` function, and an optional parameter `partial_bound`, which must be the same as the `partial_bound` used in the `align()` call.

```python
hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
reference = [
        ["A", "I am a fish. "],
        ["B", "okay. "],
        ["C", "Are you? "],
        ["D", "Hello there. "],
        ["E", "How are you? "]
]
align_result = align.align(hypothesis, reference)
token_match_result = align.get_token_match_result(align_result)
print(token_match_result)
```

The return value is a list of strings that shows the token matching result and can either be fully match, partially match, mismatch, or gap.

```python
# possible output for get_token_match_result()
['mismatch', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'fully match', 'gap']
```
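
A list like this is easy to summarize. The sketch below counts each label with `collections.Counter`, using the sample output above as hard-coded data so that it runs without calling the package; the variable names are illustrative.

```python
from collections import Counter

# Sample token match result copied from the output above.
token_match_result = ["mismatch"] + ["fully match"] * 11 + ["gap"]

counts = Counter(token_match_result)
total = len(token_match_result)
for label in ("fully match", "partially match", "mismatch", "gap"):
    # Counter returns 0 for labels that never occur.
    print(f"{label:16} {counts[label]:3d}  ({counts[label] / total:.1%})")
```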

### Retrieve mapping from reference to hypothesis

Based on the alignment result, this tool provides a function to retrieve the mapping from each token in the reference sequences to the hypothesis sequence. Each index gives the position in the hypothesis sequence of a non-gap token (fully match, partially match, or mismatch) from the separated reference sequences. An index of -1 means that the current token is not aligned to any token in the hypothesis (it is aligned to a gap).

To achieve this, use the function `get_align_indices()`. This function requires one parameter, `align_result`, which is the direct return value of the `align()` function.

```python
hypothesis = "ok I am a fish. Are you? Hello there. How are you? ok"
reference = [
        ["A", "I am a fish. "],
        ["B", "okay. "],
        ["C", "Are you? "],
        ["D", "Hello there. "],
        ["E", "How are you? "]
]
align_result = align.align(hypothesis, reference)
align_indices = align.get_align_indices(align_result)
print(align_indices)
```

The return value is a dictionary that maps each speaker label to a list of integers. Each integer is the index of the hypothesis token to which the corresponding reference token is mapped (for example, the first token in sequence “C” is mapped to the token in the hypothesis with index 5).

```python
# possible output
{
    'A': [1, 2, 3, 4], 
    'B': [0], 
    'C': [5, 6], 
    'D': [7, 8], 
    'E': [9, 10, 11]
}
```
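
The indices can be used to look up which hypothesis tokens each speaker's reference tokens were mapped to. The sketch below works on the sample values above as hard-coded data; it assumes the indices refer to positions in the hypothesis token list, and the helper name `hypothesis_tokens_by_speaker` is hypothetical.

```python
hypothesis_tokens = ["ok", "I", "am", "a", "fish.", "Are", "you?",
                     "Hello", "there.", "How", "are", "you?", "ok"]
align_indices = {
    "A": [1, 2, 3, 4],
    "B": [0],
    "C": [5, 6],
    "D": [7, 8],
    "E": [9, 10, 11],
}


def hypothesis_tokens_by_speaker(tokens, indices):
    """Map each speaker to the hypothesis tokens their reference tokens align to.

    An index of -1 (reference token aligned to a gap) becomes None.
    """
    return {
        speaker: [tokens[i] if i != -1 else None for i in positions]
        for speaker, positions in indices.items()
    }


by_speaker = hypothesis_tokens_by_speaker(hypothesis_tokens, align_indices)
for speaker, tokens in by_speaker.items():
    print(speaker, tokens)
```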

## Troubleshooting
This package currently only supports Windows 10/11 x86_64, Linux x86_64 (tested with Ubuntu 22.04), and macOS (M-series processor or Intel processor). 

If you encounter any issues while using Align4d, try the following:

1. Make sure you have installed Python version 3.10 or 3.11.
2. Make sure you have installed the latest version of Align4d.
3. Check the input data to make sure it is in the correct format.
    1. All the input strings must be encoded in the utf-8 format.
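
To check the input format before calling `align()`, a small validation routine such as the following can help. It is a sketch based only on the parameter descriptions in this manual (`hypothesis` as a string or list of strings, `reference` as a list of `[speaker label, utterance]` pairs); the function name `validate_inputs` is hypothetical and the package may perform different or additional checks.

```python
def validate_inputs(hypothesis, reference):
    """Return a list of human-readable problems; an empty list means the shapes look right."""
    problems = []

    hypothesis_ok = isinstance(hypothesis, str) or (
        isinstance(hypothesis, list) and all(isinstance(t, str) for t in hypothesis)
    )
    if not hypothesis_ok:
        problems.append("hypothesis must be a string or a list of strings")

    if not isinstance(reference, list):
        problems.append("reference must be a list of [speaker_label, utterance] pairs")
    else:
        for position, entry in enumerate(reference):
            entry_ok = (
                isinstance(entry, list)
                and len(entry) == 2
                and all(isinstance(item, str) for item in entry)
            )
            if not entry_ok:
                problems.append(f"reference[{position}] must be [speaker_label, utterance]")
    return problems


print(validate_inputs("ok I am a fish", [["A", "I am a fish."], ["B", "okay."]]))
print(validate_inputs(["ok", 3], [["A", "I am a fish."], ["B"]]))
```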
