Metadata-Version: 2.1
Name: databird
Version: 0.6.2
Summary: Keeps a local data repository up to date with different remote data sources.
Home-page: https://github.com/jonas-hagen/databird
Author: Jonas Hagen
Author-email: jonas.hagen@iap.unibe.ch
Maintainer: Jonas Hagen
Maintainer-email: jonas.hagen@iap.unibe.ch
License: MIT
Download-URL: https://github.com/jonas-hagen/databird/archive/databird-0.6.2.tar.gz
Platform: UNKNOWN
Classifier: Operating System :: OS Independent
Requires-Python: >=3.5.*, <4
Description-Content-Type: text/markdown
Requires-Dist: click
Requires-Dist: dict-recursive-update
Requires-Dist: frozendict
Requires-Dist: mr.bob
Requires-Dist: ruamel.yaml

[![Build Status](https://travis-ci.org/jonas-hagen/databird.svg?branch=master)](https://travis-ci.org/jonas-hagen/databird)

# databird

Periodically retrieve data from different sources.

The `databird` package provides only a framework to plan and run the tasks needed to keep a local data file store up to date with various remote sources.
The remote sources can be anything (e.g. FTP server, ECMWF, HTTP API, SQL database, ...), as long as there is a *databird-driver* available for the specific source.

## Usage

Databird is configured with configuration files and invoked by

```
$ databird retrieve -c /etc/databird/databird.conf

# or (as the above is the default)
$ databird retrieve
```

You can store the configuration files anywhere and, for example, run the above command periodically as a cron job.

## Configuration

The following example configuration defines a repository, which is populated with daily GNSS data from [ftp://cddis.nasa.gov/gnss/data/daily/](ftp://cddis.nasa.gov/gnss/data/daily/).

The main configuration file (usually `databird.conf`) could look like this:

```yml
general:
  root: /data/repos # root path for data repositories
  num-workers: 16   # max number of async workers
  include: "databird.conf.d/*.conf"  # include config files
```

Generally, you can configure anything in any file, as all configuration files are merged into one configuration tree. The `include` option is an exception, as it can only be declared in the top-level config file.
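
To illustrate what merging several files into one tree means, here is a minimal Python sketch. It is not databird's actual implementation: the helper `merge_config` is hypothetical, and the rule that a later file overrides an earlier one on conflicting keys is an assumption.

```python
from copy import deepcopy


def merge_config(base, extra):
    """Return a new dict with `extra` merged recursively into `base`.

    Hypothetical helper: nested mappings are merged key by key, and on a
    conflict the value from `extra` wins (an assumption, not confirmed
    databird behaviour).
    """
    merged = deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Parsed content of a top-level file and of one included file
# (the override of num-workers is made up for the illustration).
main = {"general": {"root": "/data/repos", "num-workers": 16}}
included = {
    "profiles": {"nasa_cddis": {"driver": "standard.FtpDriver"}},
    "general": {"num-workers": 4},
}

config = merge_config(main, included)
print(config["general"])
print(sorted(config))
```

The result contains both the `general` and the `profiles` section, which is why a profile or repository can live in any of the included files.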

Then in `databird.conf.d/cddis.conf` you can configure a profile and a repository:

```yml
profiles:
  nasa_cddis:
    driver: standard.FtpDriver
    configuration:
      host: cddis.nasa.gov
      user: anonymous
      password: ""
      tls: False

repositories:
  nasa_gnss:
    description: Data from NASAs Archive of Space Geodesy Data
    profile: nasa_cddis
    period: 1 day
    delay: 2 days
    start: 2019-01-01
    targets:
      status: "{time:%Y}/cddis_gnss_{iso_date}.status"
    configuration:
      user: anonymous  # this could override 'user' from profile
      root: "/gnss/data/daily"
      patterns:
        status: "{time:%Y}/{time:%j}/{time:%y%j}.status"
```

When calling databird with this configuration the following is achieved:

* A repository in the folder `/data/repos/nasa_gnss/` is created
* For every day, a file like `2019/cddis_gnss_2019-01-20.status` is expected
* If that file is missing, retrieve it from `ftp://cddis.nasa.gov/gnss/data/daily/2019/020/19020.status`
* If there are many files missing, the data is retrieved asynchronously

This example uses the `standard.FtpDriver`.
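
The `targets` and `patterns` entries look like Python format strings with `strftime` codes. The following sketch shows how such patterns could expand for a given day and which days would be expected. It rests on assumptions: that the patterns are expanded with `str.format`, that `iso_date` renders as `YYYY-MM-DD`, and that `delay` means a target is only expected once it is at least that old. The helpers `expand` and `expected_times` are hypothetical and not part of databird's API.

```python
from datetime import datetime, timedelta

# Patterns copied from the example configuration above.
TARGET = "{time:%Y}/cddis_gnss_{iso_date}.status"
REMOTE = "{time:%Y}/{time:%j}/{time:%y%j}.status"


def expand(pattern, time):
    """Fill a pattern for one point in time (hypothetical helper)."""
    # Assumption: iso_date is the calendar date in ISO 8601 form.
    return pattern.format(time=time, iso_date=time.date().isoformat())


def expected_times(start, period, delay, now):
    """Yield one timestamp per period, from start up to now - delay."""
    current = start
    while current <= now - delay:
        yield current
        current += period


day = datetime(2019, 1, 20)
print(expand(TARGET, day))  # 2019/cddis_gnss_2019-01-20.status
print(expand(REMOTE, day))  # 2019/020/19020.status

# period: 1 day, delay: 2 days, start: 2019-01-01, evaluated on 2019-01-05
times = list(
    expected_times(
        start=datetime(2019, 1, 1),
        period=timedelta(days=1),
        delay=timedelta(days=2),
        now=datetime(2019, 1, 5),
    )
)
print([t.date().isoformat() for t in times])
```

Under these assumptions, a run on 2019-01-05 would expect targets for 1, 2 and 3 January; any of those files missing from the repository would then be requested from the driver.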

## Drivers

Anyone can write drivers (see below). Currently, the following drivers are available:

* `standard.FilesystemDriver`: Retrieve data from the local filesystem (included)
* `standard.FtpDriver`: Retrieve data from an FTP server (included)
* `ecmwf.EcmwfDriver`: Retrieve data from the European Centre for Medium-Range Weather Forecasts (ECMWF) via their API (to be announced)
* `c3s.C3SDriver`: Retrieve data from the Copernicus Climate Change Service (C3S) via their API (to be announced)


## Development

1. Create a Python environment and activate it
   ``` shell
   $ python3 -m venv . && source bin/activate
   ```
2. Install the development environment:
   ``` shell
   (databird) $ pip install -r requirements/development.txt
   ```

### Writing a new driver

Drivers are published in the namespace package `databird-drivers`. Anyone can develop drivers and share them.

Install `databird` and run mr.bob to create a new driver package:

```
(databird) $ cd $HOME/projects
(databird) $ python -m mrbob.cli databird.blueprints:driver
```

After answering some questions, a new directory `databird-driver-<chosen_name>` is created.
Let's assume `<chosen_name> = foo`; then your driver is usually implemented in `databird/drivers/foo/foo.py` in a class named `FooDriver`.
Until more documentation is available, you have to look at the code to figure out how to write a driver.

Other people will be able to use it with `driver: foo.FooDriver`.

Tell me if you wrote a new driver, so I can include it in the list.


