Metadata-Version: 2.1
Name: dataset-creator
Version: 0.6.0
Summary: Takes SeqRecordExpanded objects and creates datasets for phylogenetic software
Home-page: https://github.com/carlosp420/dataset-creator
Author: Carlos Peña
Author-email: mycalesis@gmail.com
License: BSD
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: Unix
Classifier: Operating System :: POSIX
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Utilities
License-File: LICENSE
License-File: AUTHORS.rst
Requires-Dist: degenerate-dna ==0.0.9
Requires-Dist: seqrecord-expanded ==0.2.13
Requires-Dist: six ==1.10.0

.. image:: https://rawgit.com/carlosp420/dataset-creator/master/media/logo.svg
    :width: 240px
    :align: center
    :alt: Dataset-creator


=========================================
Dataset creator for phylogenetic software
=========================================

.. list-table::
    :stub-columns: 1

    * - tests
      - | |travis| |requires| |coveralls|
    * - package
      - |version| |wheel| |supported-versions| |supported-implementations|

.. |travis| image:: https://travis-ci.org/carlosp420/dataset-creator.svg?branch=master
    :alt: Travis-CI Build Status
    :target: https://travis-ci.org/carlosp420/dataset-creator

.. |requires| image:: https://requires.io/github/carlosp420/dataset-creator/requirements.svg?branch=master
    :alt: Requirements Status
    :target: https://requires.io/github/carlosp420/dataset-creator/requirements/?branch=master

.. |coveralls| image:: https://coveralls.io/repos/carlosp420/dataset-creator/badge.svg?branch=master&service=github
    :alt: Coverage Status
    :target: https://coveralls.io/r/carlosp420/dataset-creator

.. |version| image:: https://img.shields.io/pypi/v/dataset-creator.svg?style=flat
    :alt: PyPI Package latest release
    :target: https://pypi.python.org/pypi/dataset-creator

.. |wheel| image:: https://img.shields.io/pypi/wheel/dataset-creator.svg?style=flat
    :alt: PyPI Wheel
    :target: https://pypi.python.org/pypi/dataset-creator

.. |supported-versions| image:: https://img.shields.io/pypi/pyversions/dataset-creator.svg?style=flat
    :alt: Supported versions
    :target: https://pypi.python.org/pypi/dataset-creator

.. |supported-implementations| image:: https://img.shields.io/pypi/implementation/dataset-creator.svg?style=flat
    :alt: Supported implementations
    :target: https://pypi.python.org/pypi/dataset-creator


Dataset-Creator - easy way to create phylogenetic datasets in many formats
==========================================================================

Documentation: `dataset-creator.readthedocs.org <http://dataset-creator.readthedocs.org/en/latest/>`_
-----------------------------------------------------------------------------------------------------

Takes SeqRecordExpanded objects and creates datasets for phylogenetic software
such as MrBayes, TNT, BEAST, RAxML, MEGA, etc.

Features
--------

- Creates datasets in the following formats: FASTA, GenBankFASTA, NEXUS, TNT, MEGA
  and Phylip.
- Can generate datasets of DNA and amino acid sequences.
- Can generate datasets of degenerate sequences.
- Can partition datasets by codon position or by gene.
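
Partitioning by codon position depends on the ``reading_frame`` of each record.
The sketch below is not the library's implementation. It only illustrates the
idea, assuming that ``reading_frame`` gives the codon position (1, 2 or 3) of
the first base in the sequence. The function name ``split_by_codon_position``
is made up for this example.

.. code-block:: python

    def split_by_codon_position(seq, reading_frame=1):
        """Group the bases of `seq` by codon position (1, 2 or 3).

        Assumption: `reading_frame` is the codon position of the first base.
        """
        if reading_frame not in (1, 2, 3):
            raise ValueError("reading_frame must be 1, 2 or 3")

        positions = {1: [], 2: [], 3: []}
        for index, base in enumerate(seq):
            # Shift the index by the reading frame, then wrap around every 3 bases.
            codon_position = (index + reading_frame - 1) % 3 + 1
            positions[codon_position].append(base)

        return {pos: "".join(bases) for pos, bases in positions.items()}


    parts = split_by_codon_position("ACTACCTA", reading_frame=2)
    print(parts)

Under these assumptions, a ``1st-2nd, 3rd`` scheme would place positions 1 and 2
in one partition and position 3 in another.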

Quick start
-----------

First::

    pip install dataset_creator


Then make sure the list of SeqRecordExpanded objects is sorted by ``gene_code``
first and then by ``voucher_code``:

.. code-block:: python

    >>> from seqrecord_expanded import SeqRecordExpanded
    >>> from dataset_creator import Dataset
    >>>
    >>> # `table` is the NCBI translation table code
    >>> seq_record1 = SeqRecordExpanded('ACTACCTA', reading_frame=2, gene_code='RpS5',
    ...                                 table=1, voucher_code='CP100-10',
    ...                                 taxonomy={'genus': 'Aus', 'species': 'bus'})
    >>>
    >>> seq_record2 = SeqRecordExpanded('ACTACCTA', reading_frame=2, gene_code='RpS5',
    ...                                 table=1, voucher_code='CP100-11',
    ...                                 taxonomy={'genus': 'Aus', 'species': 'bus'})
    >>>
    >>> seq_record3 = SeqRecordExpanded('ACTACCTA', reading_frame=2, gene_code='wingless',
    ...                                 table=1, voucher_code='CP100-10',
    ...                                 taxonomy={'genus': 'Aus', 'species': 'bus'})
    >>>
    >>> seq_record4 = SeqRecordExpanded('ACTACCTA', reading_frame=2, gene_code='wingless',
    ...                                 table=1, voucher_code='CP100-11',
    ...                                 taxonomy={'genus': 'Aus', 'species': 'bus'})
    >>>
    >>> seq_records = [
    ...    seq_record1, seq_record2, seq_record3, seq_record4,
    ... ]

    >>> # codon positions can be 1st, 2nd, 3rd, 1st-2nd, ALL (default)
    >>> dataset = Dataset(seq_records, format='TNT', partitioning='by codon position',
    ...                   codon_positions='ALL')

    >>> dataset = Dataset(seq_records, format='PHYLIP', partitioning='1st-2nd, 3rd',
    ...                   codon_positions='ALL')

    >>> dataset = Dataset(seq_records, format='NEXUS', partitioning='by gene',
    ...                   codon_positions='1st')

    >>> dataset = Dataset(seq_records, format='NEXUS', partitioning='by gene',
    ...                   codon_positions='ALL', aminoacids=True)

    >>> # Produce a dataset of degenerated sequences using the 'S' method:
    >>> dataset = Dataset(seq_records, format='NEXUS', partitioning='by gene',
    ...                   codon_positions='ALL', degenerate='S')

    >>> print(dataset.dataset_str)
    #NEXUS
    blah blah ...
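
If the records do not arrive already ordered, the sorting requirement mentioned
before the example can be met with the built-in ``sorted``. The sketch below
uses a hypothetical stand-in ``Record`` type so that it runs without any
dependencies. It assumes that the real records expose ``gene_code`` and
``voucher_code`` as attributes.

.. code-block:: python

    from collections import namedtuple

    # Hypothetical stand-in that carries only the two attributes used for sorting.
    Record = namedtuple("Record", ["gene_code", "voucher_code"])

    unsorted_records = [
        Record("wingless", "CP100-11"),
        Record("RpS5", "CP100-11"),
        Record("wingless", "CP100-10"),
        Record("RpS5", "CP100-10"),
    ]

    # Sort by gene_code first, then by voucher_code within each gene.
    sorted_records = sorted(
        unsorted_records,
        key=lambda record: (record.gene_code, record.voucher_code),
    )

    for record in sorted_records:
        print(record.gene_code, record.voucher_code)

Note that string comparison is case-sensitive, so gene codes that differ only in
case will not sort next to each other.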

Further documentation can be found at
`dataset-creator.readthedocs.org <http://dataset-creator.readthedocs.org/en/latest/>`_.

Development
===========

To run all the tests, run::

    tox

Changelog
=========

0.5.0 (2021-03-20)
------------------
* Added support for bankit format.

0.4.0 (2020-06-28)
------------------
* Dropped support for Python 2.
* Added support for long taxon names in generated dataset files.

0.3.20 (2018-01-07)
-------------------
* Updated ``seqrecord-expanded``.

0.3.19 (2018-01-06)
-------------------
* Fixed version of ``seqrecord-expanded`` in setup.py.

0.3.18 (2018-01-06)
-------------------
* Support lineages for genbank fasta files.

0.3.17 (2018-01-06)
-------------------
* Avoid raising exception when translating sequence with dash.

0.3.16 (2017-10-01)
-------------------
* Fixed creating dataset with 1st, 2nd or 3rd codon positions.

0.3.14 (2016-09-11)
-------------------
* Upgraded ``seqrecord-expanded``.

0.3.13 (2016-08-27)
-------------------
* Fixed bug that failed to replace all whitespace with underscores in taxon
  names when building datasets. When taxon names contained whitespace, the NEXUS
  interpreter treated part of the name as sequence data, rendering the sequence
  invalid.
* Added some dependencies to requirements.

0.3.11 (2016-06-25)
-------------------
* Upgraded seqrecord-expanded requirement.

0.3.10 (2015-12-01)
-------------------
* Fixed bug that produced FASTA sequences with underscores. Now all voucher codes
  will have their dashes replaced by underscores.

0.3.9 (2015-11-06)
------------------
* Create datasets using the GenBankFASTA format. This format has the following
  extra info in the description of sequences:
  ``>Aus_aus_CP100-10 [org=Aus aus] [Specimen-voucher=CP100-10] [note=ArgKin gene, partial cds.] [Lineage=]``

0.3.8 (2015-10-30)
------------------
* Fixed making dataset as aminoacid seqs for MEGA format.
* Fixed making dataset as degenerated seqs for MEGA format.
* Fixed making dataset as degenerated seqs for TNT format.
* Fixed making dataset as aa seqs with specified outgroup for TNT format.
* Raise ValueError when asked to degenerate seqs that will go to partitioning
  based on codon positions.
* Dataset creator returns warnings if translated sequences have stop codons '*'.
* Cannot generate MEGA datasets with partitioning.

0.3.7 (2015-10-30)
------------------
* Fixed 2nd, 3rd codon positions bug that returned empty FASTA datasets.

0.3.6 (2015-10-30)
------------------
* Fixed 3rd codon positions bug that returned FASTA datasets with 3rd codon
  positions even if they were not needed.

0.3.5 (2015-10-29)
------------------
* If user provides outgroup, then TNT datasets will place its sequences in first
  position in the dataset blocks.

0.3.4 (2015-10-02)
------------------
* Fixed bug that did not show DATATYPE=PROTEIN in Nexus files when aminoacid
  sequences were requested by user.

0.3.3 (2015-10-02)
------------------
* Fixed bug that raised an exception when SeqRecordExpanded objects did not
  have data in the ``taxonomy`` field.

0.3.2 (2015-10-01)
------------------
* Fixed bug that raised an exception when user wanted partitioned dataset as
  1st-2nd and 3rd codon positions of only one codon.

0.3.1 (2015-10-01)
------------------
* Fixed bug that raised an exception when user wanted partitioned dataset by
  codon positions of only one codon.

0.3.0 (2015-10-01)
------------------
* Accepts voucher code as string that will be used to generate the outgroup
  string needed for NEXUS and TNT files.

0.2.0 (2015-09-30)
------------------
* Creates datasets as degenerated sequences using the method by Zwick et al.

0.1.1 (2015-09-30)
------------------

* It will issue errors for missing reading frames only when they are strictly
  necessary to build the dataset (that is, when datasets need to be divided by
  codon positions).
* Added documentation using sphinx-doc
* Creates datasets as aminoacid sequences.

0.1.0 (2015-09-23)
------------------

* Creates NEXUS, TNT, FASTA, Phylip and MEGA dataset formats.

0.0.1 (2015-06-10)
------------------

* First release on PyPI.
