Metadata-Version: 2.4
Name: fileproxy
Version: 0.1.0
Summary: File-based RPC for running Python functions across network-isolated nodes.
Author: Timothé Boulet
License: MIT
Project-URL: Homepage, https://github.com/tboulet/fileproxy
Project-URL: Repository, https://github.com/tboulet/fileproxy
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: System :: Distributed Computing
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# fileproxy

File-based RPC for running Python functions across network-isolated nodes.

Designed for HPC clusters where compute nodes lack internet access but share a filesystem with login nodes that do.

## Installation

```bash
pip install fileproxy
```

Or from source:

```bash
git clone https://github.com/tboulet/fileproxy.git
cd fileproxy
pip install -e .
```

## Quick Start

### 1. Define and start the server (login node)

Create a server script (this example proxies `litellm.completion`):

```python
# server_script.py
import fileproxy
import litellm

if __name__ == "__main__":
    fileproxy.run_server({
        "litellm_completion": litellm.completion,
    })
```

Run it on the login node:

```bash
python server_script.py
```

> **Tip**: On HPC clusters, run the server in a persistent terminal session (e.g., tmux) so it survives SSH disconnections. See [guide_TMUX.md](guide_TMUX.md) for a quick reference.

### 2. Use the proxy in your code (compute node)

```python
import fileproxy

# Create a proxy that behaves like the original function
completion = fileproxy.proxy("litellm_completion")

# Use it exactly like litellm.completion
response = completion(model="gpt-4", messages=[{"role": "user", "content": "Hello"}])
```

The proxy serializes the arguments to a file. The server picks it up, runs the real function, and writes the result back. The proxy then polls for the result file and returns the deserialized value.
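
Under the hood, the round trip looks roughly like the sketch below. This is illustrative only: the actual file names, directory layout, and serialization details are internal to fileproxy.

```python
import pickle
import time
import uuid
from pathlib import Path

# Rough sketch of what proxy("litellm_completion")(...) does internally;
# file names and layout here are assumptions for illustration.
base = Path.home() / ".cache" / "fileproxy" / "litellm_completion"
request_id = uuid.uuid4().hex

# 1. Serialize the call and drop it in the input directory the server watches.
request = {"args": (), "kwargs": {"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}}
(base / "input" / f"{request_id}.pkl").write_bytes(pickle.dumps(request))

# 2. Poll the output directory until the server writes the response, then return it.
response_path = base / "output" / f"{request_id}.pkl"
while not response_path.exists():
    time.sleep(0.1)
result = pickle.loads(response_path.read_bytes())
```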

## Multiple Functions

Register multiple functions on the same server:

```python
# Server
import fileproxy
import litellm
import requests

if __name__ == "__main__":
    fileproxy.run_server({
        "litellm_completion": litellm.completion,
        "http_post": requests.post,
        "http_get": requests.get,
    })
```

```python
# Client
import fileproxy

completion = fileproxy.proxy("litellm_completion")
http_post = fileproxy.proxy("http_post")
http_get = fileproxy.proxy("http_get")
```

## Configuration

### Data directory

By default, fileproxy stores request/response files in `~/.cache/fileproxy/`. Override with:

1. **Constructor argument**: `fileproxy.proxy("func", base_dir="/path/to/dir")`
2. **Environment variable**: `export FILEPROXY_DIR=/path/to/dir`

The server and client must use the same base directory on a shared filesystem.
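
For example, a client on a compute node can target a project-specific directory on the shared filesystem while the server is launched with the matching `FILEPROXY_DIR`; the path below is a placeholder.

```python
import fileproxy

# The directory must live on the shared filesystem and match the server's
# FILEPROXY_DIR; /shared/scratch/fileproxy-myproject is a placeholder path.
completion = fileproxy.proxy(
    "litellm_completion",
    base_dir="/shared/scratch/fileproxy-myproject",
)
```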

### Workers (parallel execution)

By default, the server processes requests sequentially. To handle multiple requests concurrently (useful when registering multiple functions or serving multiple clients):

```python
# Process up to 4 requests in parallel
fileproxy.run_server(functions, workers=4)
```

With `workers=1` (default), requests are executed one at a time. With `workers=2` or more, requests are dispatched to a thread pool. This is particularly useful when mixing slow functions (e.g., LLM calls) with fast ones (e.g., HTTP requests) — a slow call won't block unrelated requests.

> **Note**: Registered functions must be **thread-safe** when using `workers > 1`. Most common use cases (HTTP requests, API calls) are thread-safe.
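
For example, a server that mixes a slow LLM call with fast HTTP helpers (the same functions as in the earlier examples) might use a small worker pool:

```python
import fileproxy
import litellm
import requests

if __name__ == "__main__":
    # With workers=4, a long-running completion call no longer blocks the
    # quick http_get/http_post requests queued behind it.
    fileproxy.run_server(
        {
            "litellm_completion": litellm.completion,
            "http_get": requests.get,
            "http_post": requests.post,
        },
        workers=4,
    )
```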

### Timeouts

```python
# Client waits 15s for the server to acknowledge the request (default: 10s)
func = fileproxy.proxy("my_func", no_server_timeout=15.0)
```

The timeout only applies while waiting for the server to *acknowledge* the request (pick it up). Once the server starts processing, the client waits indefinitely — slow functions will not cause false timeouts.

### Poll interval

```python
# Server checks for new requests every 0.5s (default: 0.2s)
fileproxy.run_server(functions, poll_interval=0.5)

# Client checks for response every 0.2s (default: 0.1s)
func = fileproxy.proxy("my_func", poll_interval=0.2)
```

## How It Works

```
Compute Node (no internet)          Login Node (has internet)
─────────────────────────          ──────────────────────────

proxy("func")(args, kwargs)        Server polls input dir
  │                                  │
  ├─ Write request.pkl ──────────────┤
  │  to input dir                    ├─ Read request.pkl
  │                                  ├─ Create _started sentinel
  │  (client sees _started,          ├─ Call func(*args, **kwargs)
  │   disables timeout)              ├─ Write response.pkl (atomic)
  ├──────────────────────────────────┤  to output dir
  ├─ Read response.pkl               │
  ├─ Return result                   │
```
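
The server side of this diagram boils down to a polling loop. The sketch below is conceptual, not the actual implementation; the real server adds the `_started` sentinel, atomic writes, logging, heartbeats, and cleanup.

```python
import pickle
import time
from pathlib import Path

def serve(func, func_dir: Path, poll_interval: float = 0.2):
    # Conceptual sketch only: poll the input directory, run the function,
    # and write the pickled result to the output directory.
    input_dir, output_dir = func_dir / "input", func_dir / "output"
    while True:
        for request_file in input_dir.glob("*.pkl"):
            request = pickle.loads(request_file.read_bytes())
            result = func(*request["args"], **request["kwargs"])
            (output_dir / request_file.name).write_bytes(pickle.dumps(result))
            request_file.unlink()
        time.sleep(poll_interval)
```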

### Directory structure

```
~/.cache/fileproxy/
├── func_name_1/
│   ├── input/       # Request files (.pkl)
│   └── output/      # Response files (.pkl) + _started sentinels
├── func_name_2/
│   ├── input/
│   └── output/
├── logs/
│   └── server_20260310_143000.log
└── server_heartbeat.json
```

## Error Handling

fileproxy uses custom exception types to distinguish infrastructure errors from function errors:

```python
import fileproxy
from fileproxy import FileProxyError, ServerNotRunningError

func = fileproxy.proxy("my_func")

try:
    result = func(args)
except ServerNotRunningError:
    # fileproxy infrastructure problem: server is not running
    print("Start the fileproxy server!")
except FileProxyError:
    # Other fileproxy infrastructure problem
    print("Something went wrong with the file proxy")
except ValueError:
    # Exception raised by the actual function on the server side
    # (re-raised with original type)
    print("The function itself failed")
```

- `FileProxyError`: Base class for all fileproxy infrastructure errors.
- `ServerNotRunningError(FileProxyError)`: Server did not acknowledge the request within the timeout.
- Server-side function exceptions are re-raised with their **original type** (not wrapped in `FileProxyError`).

### Exception propagation details

When the proxied function raises an exception on the server, the proxy re-raises it on the client **with the original exception type** in most cases. For example, a server-side `ValueError("bad input")` becomes a client-side `ValueError("bad input")`.

However, some exception classes have non-standard `__init__` signatures that prevent Python's `pickle` from reconstructing them (e.g., `litellm.RateLimitError` requires `llm_provider` and `model` arguments). In these cases, the original exception cannot be faithfully reconstructed, so the proxy raises a **`RuntimeError`** instead, with a message of the form:

```
RuntimeError: Server-side RateLimitError: rate limited
```

In summary:
- **Standard exceptions** (e.g., `ValueError`, `TypeError`, `KeyError`, most custom exceptions with a simple `__init__(self, message)` signature): re-raised with original type and message.
- **Non-picklable exceptions** (non-standard `__init__` that fails to round-trip through pickle): raised as `RuntimeError("Server-side {OriginalType}: {original_message}")`.
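
In practice, a client that wants to handle both cases might look like the following sketch (the handled exception types depend on what the proxied function can raise):

```python
import fileproxy

completion = fileproxy.proxy("litellm_completion")

try:
    response = completion(model="gpt-4", messages=[{"role": "user", "content": "Hi"}])
except RuntimeError as e:
    # Fallback for server-side exceptions that could not be rebuilt from
    # pickle, e.g. "Server-side RateLimitError: rate limited".
    print(f"Unreconstructable server-side error: {e}")
except Exception as e:
    # Standard exceptions arrive with their original type and message.
    print(f"{type(e).__name__}: {e}")
```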

## Logs

Server logs are written to `{base_dir}/logs/server_YYYYMMDD_HHMMSS.log` and also printed to the server terminal. Each log file corresponds to one server session.

## Important Notes

### Multiple servers

Do **not** run multiple fileproxy servers with the same `base_dir`. On startup, the server checks for an existing heartbeat and raises `FileProxyError` if another server appears to be running. To override and kill the old server, use `force=True`:

```python
# force=True signals the old server to stop, waits for it to shut down,
# then starts the new server
fileproxy.run_server(functions, force=True)
```

If you need truly independent servers running simultaneously, use different `base_dir` values:

```bash
FILEPROXY_DIR=~/.cache/fileproxy-project-a python server_a.py
FILEPROXY_DIR=~/.cache/fileproxy-project-b python server_b.py
```

### Restarting the server

When you restart the server, it clears all pending request/response files. Any client calls that were in-flight will eventually time out with `ServerNotRunningError`. This is by design — it prevents stale requests from a previous session from being processed.
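
If clients may be running across a restart, a simple retry wrapper on the client side is usually enough (a hedged sketch; the retry policy is up to you):

```python
import time
import fileproxy
from fileproxy import ServerNotRunningError

completion = fileproxy.proxy("litellm_completion")

# If the call was in flight during a restart, it fails with
# ServerNotRunningError and can simply be retried once the new server is up.
for attempt in range(3):
    try:
        response = completion(model="gpt-4", messages=[{"role": "user", "content": "Hi"}])
        break
    except ServerNotRunningError:
        time.sleep(5)  # give the restarted server time to come up
```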

### Checking server status

From any node that shares the filesystem:

```python
import fileproxy

info = fileproxy.status()
print(info["alive"])       # True/False
print(info["functions"])   # ["litellm_completion", "http_post", ...]
print(info["pid"])         # Server process ID
print(info["requests_processed"])  # Total requests handled
```

## Safety Mechanisms

- **Atomic writes**: Responses are written to a `.tmp` file and then renamed, preventing clients from reading partial data (see the sketch after this list).
- **Started sentinel**: When the server begins processing a request, it creates a `_started` marker file. The client uses this to distinguish "server is processing (wait)" from "server is not running (fail fast)."
- **Exception propagation**: If the function raises an exception on the server, the exception object is pickled and re-raised on the client side with its original type.
- **Unpicklable response handling**: If the server cannot pickle the response (e.g., it contains open file handles), the client receives a `FileProxyError` instead of hanging.
- **Cleanup**: Request, response, and sentinel files are removed after processing.
- **Startup cleanup**: The server clears stale files from previous runs on startup.
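
The atomic-write pattern mentioned above is the standard write-then-rename idiom; a minimal sketch (not the library's actual code) looks like this:

```python
import os
import pickle
import tempfile

def write_atomically(obj, final_path: str) -> None:
    # Write to a temporary file in the same directory, then rename it into
    # place; os.replace() is atomic on POSIX filesystems, so readers never
    # observe a partially written response file.
    directory = os.path.dirname(final_path)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(obj, f)
    os.replace(tmp_path, final_path)
```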

## Limitations

- Arguments and return values must be **picklable**. Most Python objects are (strings, dicts, lists, numbers, dataclasses, etc.); lambdas, open file handles, and generators are not (see the quick check after this list).
- Latency overhead of ~100-200ms per call due to filesystem polling.
- Server and client must share a filesystem (e.g., NFS home directory on HPC clusters). Local-only filesystems like `/tmp` won't work across nodes.
- If the server crashes (e.g., killed by OOM) while processing a request, the client will wait indefinitely for that request. Restart the server to recover.
- If the server and client use different Python environments, server-side exceptions from libraries not installed on the client will be raised as `RuntimeError` instead of their original type.
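
A quick way to check whether a value will survive the pickle round trip before handing it to a proxied function (`is_picklable` is a hypothetical helper, not part of fileproxy):

```python
import pickle

def is_picklable(obj) -> bool:
    # Try a local pickle; anything that fails here will also fail when
    # fileproxy serializes the request or response.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False

print(is_picklable({"model": "gpt-4"}))  # True
print(is_picklable(lambda x: x))         # False: lambdas cannot be pickled
```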

## License

MIT
