Metadata-Version: 2.4
Name: full-text-sparse-engine
Version: 0.2.0
Author: Carlo Moro
Author-email: Carlo Moro <cnmoro@gmail.com>
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: lmdb
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: unidecode
Requires-Dist: scikit-learn
Requires-Dist: lenlp
Requires-Dist: xxhash
Dynamic: author
Dynamic: requires-python

# Full text search engine - LMDB/BM25 based

This project is a full-text search engine written in Python. It stores documents in LMDB, ranks results with BM25, and is designed to be easy to use, thread-safe, and efficient across a variety of search tasks.

## Key Features

* **Breaking Change in v1.1:** This version introduces a new on-disk format for the metadata index to enable efficient filtering. If you are upgrading from a previous version, you **must** re-index your data.
* **Easy to Use:** The `SearchEngine` class provides a simple API for storing and searching documents.
* **Metadata Support:** Store and query documents based on metadata fields.
* **Fast:** Uses a combination of `FlashText` for quick keyword matching and a `BM25` vectorizer for more complex queries.
* **Scalable:** The sharded storage backend allows the engine to handle large amounts of data.
* **Thread-Safe:** The underlying LMDB storage is thread-safe, making it suitable for multi-threaded environments.
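
The BM25 ranking mentioned above can be illustrated with a minimal pure-Python sketch. This is not the engine's implementation, which relies on a vectorizer and a sparse matrix. The function name `bm25_scores`, the whitespace tokenizer, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (illustrative only)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: number of documents that contain each term
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "The quick brown fox jumps over the lazy dog",
    "The early bird catches the worm",
    "An apple a day keeps the doctor away",
]
scores = bm25_scores("quick fox", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])  # The quick brown fox jumps over the lazy dog
```

Documents that share no term with the query score zero, which is why a sparse representation suits this kind of index.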

## Usage

Here's a quick example of how to use the search engine:

```python
from engine import SearchEngine
import os
import shutil

# Define paths for the storage directories
storage_path = "./db"
metadata_path = "./db_metadata"
metadata_index_path = "./db_metadata_index"
matrix_path = "./matrix"

# Clean up previous runs if they exist
for path in [storage_path, metadata_path, metadata_index_path, matrix_path]:
    if os.path.exists(path):
        shutil.rmtree(path)

# 1. Initialize the Search Engine
search_engine = SearchEngine(
    storage_base_path=storage_path,
    metadata_storage_base_path=metadata_path,
    metadata_index_storage_base_path=metadata_index_path,
    matrix_path=matrix_path
)

# 2. Store some documents with metadata
docs = [
    ("The quick brown fox jumps over the lazy dog", {"source": "proverb"}),
    ("A journey of a thousand miles begins with a single step", {"source": "proverb"}),
    ("The early bird catches the worm", {"source": "proverb"}),
    ("An apple a day keeps the doctor away", {"source": "health"}),
]

for text, metadata in docs:
    search_engine.store_data(text, metadata)

print(f"Stored {len(docs)} documents.")

# 3. Pre-compute the index for optimal performance
print("Building search index...")
search_engine.index()
print("Index built.")

# 4. Perform a search
query = "quick fox"
results = search_engine.search(query, {})

print(f"\nSearching for: '{query}'")
for doc_id, text, metadata in results:
    print(f"  - Found doc {doc_id[:8]} with metadata {metadata}: '{text}'")
# Expected output:
#   - Found doc ... with metadata {'source': 'proverb'}: 'The quick brown fox jumps over the lazy dog'

# 5. Perform a search with a metadata filter
query = "apple"
metadata_query = {"source": "health"}
results = search_engine.search(query, metadata_query)

print(f"\nSearching for: '{query}' with metadata filter {metadata_query}")
for doc_id, text, metadata in results:
    print(f"  - Found doc {doc_id[:8]} with metadata {metadata}: '{text}'")
# Expected output:
#   - Found doc ... with metadata {'source': 'health'}: 'An apple a day keeps the doctor away'

# 6. Perform a search with advanced metadata filtering
print("\n--- Advanced Metadata Filtering ---")

# Store a few documents with "author" and "year" metadata
docs_authors = [
    ("Text by John", {"author": "John", "year": 2020}),
    ("Text by Sarah", {"author": "Sarah", "year": 2021}),
    ("Text by Mike", {"author": "Mike", "year": 2022}),
]
for text, metadata in docs_authors:
    search_engine.store_data(text, metadata)
search_engine.index()

# Using the $in operator: match documents whose author is either "John" or "Sarah"
query_in = "Text"
metadata_in = {"author": {"$in": ["John", "Sarah"]}}
results_in = search_engine.search(query_in, metadata_in)
print(f"Searching for '{query_in}' with metadata {metadata_in}:")
for _, text, _ in results_in:
    print(f"  - Found: '{text}'")

# Using the $gte operator for a range query
query_gte = "Text"
metadata_gte = {"year": {"$gte": 2021}}
results_gte = search_engine.search(query_gte, metadata_gte)
print(f"\nSearching for '{query_gte}' with metadata {metadata_gte}:")
for _, text, _ in results_gte:
    print(f"  - Found: '{text}'")

# Example with datetime
import datetime
now = datetime.datetime.now()
docs_dates = [
    ("Event today", {"date": now}),
    ("Event tomorrow", {"date": (now + datetime.timedelta(days=1))}),
]
for text, metadata in docs_dates:
    search_engine.store_data(text, metadata)
search_engine.index()

query_date = "Event"
metadata_date = {"date": {"$gte": now}}
results_date = search_engine.search(query_date, metadata_date)
print(f"\nSearching for '{query_date}' with metadata {metadata_date}:")
for _, text, _ in results_date:
    print(f"  - Found: '{text}'")

# Clean up the storage directories
search_engine.storage.close()
search_engine.metadata_storage.close()
search_engine.metadata_index_storage.close()
for path in [storage_path, metadata_path, metadata_index_path, matrix_path]:
    if os.path.exists(path):
        shutil.rmtree(path)
```
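
The filter dictionaries in the example combine plain equality (`{"source": "health"}`) with operator forms (`$in`, `$gte`). The sketch below shows one way such a filter could be evaluated against a single metadata dict. It is a plain-Python illustration, not the engine's code, which resolves filters through its on-disk metadata index. The helper name `matches` is hypothetical, and only the operators that appear in the example above are handled.

```python
def matches(metadata, metadata_query):
    """Return True if `metadata` satisfies every condition in `metadata_query`."""
    for field, condition in metadata_query.items():
        if field not in metadata:
            return False
        value = metadata[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"$gte": 2021}
            for op, operand in condition.items():
                if op == "$in":
                    if value not in operand:
                        return False
                elif op == "$gte":
                    if not value >= operand:
                        return False
                else:
                    raise ValueError(f"Unsupported operator: {op}")
        elif value != condition:
            # Plain form, e.g. {"source": "health"}, means equality
            return False
    return True


records = [
    {"author": "John", "year": 2020},
    {"author": "Sarah", "year": 2021},
    {"author": "Mike", "year": 2022},
]
in_hits = [r["author"] for r in records
           if matches(r, {"author": {"$in": ["John", "Sarah"]}})]
gte_hits = [r["author"] for r in records
            if matches(r, {"year": {"$gte": 2021}})]
print(in_hits)   # ['John', 'Sarah']
print(gte_hits)  # ['Sarah', 'Mike']
```

Because `$gte` relies only on Python's `>=`, the same comparison works for `datetime` values, as in the date example above.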

## Low-Memory Architecture

This search engine is designed to be both fast and memory-efficient. It achieves this by using a memory-mapped sparse matrix for the search index.

- **Indexing**: The `index()` method builds the full document-term matrix in memory (a one-time cost) and then saves it to disk.
- **Searching**: For subsequent searches, the matrix is loaded back as a memory-mapped object. This allows the operating system to efficiently manage paging the index between RAM and disk, providing the speed of an in-memory index without requiring the entire matrix to be loaded into RAM at once.

This approach balances performance and memory usage. Indexing needs enough RAM to build the matrix once, while search-time memory stays low because the operating system reads in only the pages a query touches.
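
The idea of a memory-mapped sparse index can be sketched with NumPy alone. The engine's actual on-disk layout is not described here, so the sketch assumes three `.npy` files holding the arrays of a CSR matrix; the file names and the `row_dot` helper are hypothetical.

```python
import os
import tempfile

import numpy as np

# A tiny document-term matrix in CSR form (3 documents x 4 terms):
#   doc 0: [1.0, 0.0, 2.0, 0.0]
#   doc 1: [0.0, 0.0, 0.0, 3.0]
#   doc 2: [0.0, 4.0, 5.0, 0.0]
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
indices = np.array([0, 2, 3, 1, 2])
indptr = np.array([0, 2, 3, 5])

with tempfile.TemporaryDirectory() as tmp:
    # "Indexing": write the CSR arrays to disk once
    for name, arr in [("data", data), ("indices", indices), ("indptr", indptr)]:
        np.save(os.path.join(tmp, f"{name}.npy"), arr)

    # "Searching": reopen the arrays as read-only memory maps.
    # Nothing is read into RAM until a slice is actually accessed.
    m_data = np.load(os.path.join(tmp, "data.npy"), mmap_mode="r")
    m_indices = np.load(os.path.join(tmp, "indices.npy"), mmap_mode="r")
    m_indptr = np.load(os.path.join(tmp, "indptr.npy"), mmap_mode="r")
    is_mapped = isinstance(m_data, np.memmap)

    def row_dot(row, query_vec):
        """Dot product of one sparse row with a dense query vector."""
        start, end = m_indptr[row], m_indptr[row + 1]
        return float(np.dot(m_data[start:end], query_vec[m_indices[start:end]]))

    # A query that contains only term 2
    query_vec = np.array([0.0, 0.0, 1.0, 0.0])
    scores = [row_dot(i, query_vec) for i in range(3)]
    print(scores)  # [2.0, 0.0, 5.0]

    # Release the maps before the temporary directory is removed
    del m_data, m_indices, m_indptr
```

Only the slices touched by `row_dot` are paged in, which is the property the section above relies on.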

## Performance

The following performance metrics were collected on a single machine; the hardware is not specified, so treat them as a rough reference point. The memory-mapped index allows for fast search performance while keeping RAM usage low.

| Metric                | Value             |
|-----------------------|-------------------|
| Number of documents   | 1000              |
| Document size (chars) | 500               |
| Storage throughput    | 607.57 docs/sec   |
| Search throughput     | 66.11 queries/sec |

*These numbers are meant to be indicative. Actual performance will vary depending on the hardware and the nature of the data.*
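
To gather comparable numbers on your own hardware, a small timing harness along the following lines may help. It is a sketch, not the script used to produce the table above; the `throughput` helper is hypothetical, and the workload is a stand-in so the snippet runs without the engine.

```python
import time


def throughput(fn, items):
    """Call `fn` once per item and return the rate in calls per second."""
    start = time.perf_counter()
    for item in items:
        fn(item)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed if elapsed > 0 else float("inf")


# Stand-in workload of 1000 documents with 500 characters each.
# Replace the lambda with, for example,
# `lambda text: search_engine.store_data(text, {})` or
# `lambda query: search_engine.search(query, {})` to measure the engine.
corpus = ["lorem ipsum dolor sit amet, " * 18 for _ in range(1000)]
corpus = [text[:500] for text in corpus]
rate = throughput(lambda text: text.split(), corpus)
print(f"{rate:.2f} calls/sec")
```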
