This is a short overview of how this works, intended primarily for those
who work on the code. More user focused documentation is available in
separate repos.

Processes
=========

There are a few different processes involved in a running system:
main server
	python version specific runner (runner.py)
		when running a job this forks to create the actual job processes
	maybe more runners (one per py\d line in the conf)
	short lived worker pools when loading metadata about jobs.
runner (asks the server to run jobs)

Communication between the runner and the server is over http (by default
over a unix socket). You can find the code in automata_common.py and
server.py.

Communication between the server and the runners is over anonymous sockets
(from socketpair()), each packet is a length prefix followed by pickled
data. All the communications code is in runner.py.

Communication between the runner and the main job process is a single json
object written back over a pipe. Job setup is partly loaded from setup.json
in the job dir and partly inherited from the parent process (normal fork
behaviour).

Communication between the main job process and the analysis processes (the
parallell part) is over a multiprocessing.Queue object.

Yes, this has more variation than there is any reason for. Rewriting it is
probably not useful though, all the communication methods work.


setup.json
==========

The job parameters are all in setup.json in the job dir, which contains
mostly a normalized version of the same things the runner sends to the
server to dispatch the job. The valid contents are specified by the
{options, datasets, jobids} variables in the method source. There is also
some meta info about the method.

There is also a short version of the profiling information for completed
jobs.


Identification of jobs for reuse
================================

When a job is requested a matching job is first searched for. A job with
the same parameters is valid assuming the source code for the method has
not been changed since it was run, or if the new code uses
equivalent_hashes to indicate that it is compatible with the old code. This
code (in control.py, database.py, dependency.py, deptree.py, methods.py,
workspace.py) is much harder to follow than there is any reason for.
database.Database.match_exact is called from dependency.initialise_jobs
which is called from control.initialise_jobs which is called from the
submit portion of server.py.

The data about method versions and compatibility comes from the runners,
the setup.json files are loaded directly by the server (using a worker
pool).


Datasets
========

Datasets are our main data storage system, suitable for data you stream
through. (Other data is normally stored in pickles.) Each column is stored
separately in one file per slice (except for small columns, where all
slices are in one file). Each column has a single type, one of the types in
sourcedata.type2iter (which is a few more than you see at the top of the
file). Most of these types are handled through gzutil, a C extension.

Each job can contain any number of datasets. On disk each dataset is a
directory containing a pickle with metainfo and all the column files.

Look in dataset.py more details.


List of files
=============

Runnable files
	server.py
		Main server
	automatarunner.py
		Runs your automata scripts
	dsgrep.py
		Grep one or more datasets
	dsinfo.py
		Print some info about a dataset

Bookeeping around jobs
	control.py
	database.py
	dependency.py
	deptree.py
	methods.py
	workspace.py

Datasets and ds/jobchaining
	chaining.py
	dataset.py
	sourcedata.py

Launching of jobs, forking magic
	dispatch.py
	launch.py
	runner.py

Other
	automata_common.py
		support functions for automatarunner/subjobs
	extras.py
		dumping ground for a lot of useful utility functions
	status.py
		sending and recieving of status-tree messages (the ^T stuff)
	subjobs.py
		running jobs from within other jobs
	iowrapper.py
		all (print) output goes through pipes to this

Not so interesting files
	autoflush.py
	blob.py
	compat.py
		py2/py3 compat
	configfile.py
	dscmdhelper.py
	g.py
	gzwrite.py
	init.py
	jobid.py
	report.py
	safe_pool.py
	setupfile.py
	status_messaging.py
	unixhttp.py
	web.py
	workarounds.py

standard_methods
	Directory with the bundled standard methods.

test_methods
	Directory with the bundled test methods.
