Metadata-Version: 2.1
Name: idemux
Version: 0.1.5
Summary: A Lexogen tool for demultiplexing and  index error correcting fastq files. Works with Lexogen i7, i5 and i1 barcodes.
Home-page: https://github.com/lexogen-tools/idemux
Author: Falko Hofmann, Michael Moldaschl, Andreas Tuerk
Author-email: falko.hofmann@lexogen.com, michael.moldaschl@lexogen.com, andreas.tuerk@lexogen.com
License: UNKNOWN
Keywords: idemux
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6
Description-Content-Type: text/x-rst
Requires-Dist: tqdm
Requires-Dist: importlib-resources ; python_version < "3.7"
Requires-Dist: dataclasses ; python_version < "3.7"
Provides-Extra: dev
Requires-Dist: pytest ; extra == 'dev'
Requires-Dist: pytest-cov ; extra == 'dev'
Requires-Dist: coverage ; extra == 'dev'
Provides-Extra: docs
Requires-Dist: sphinx ; extra == 'docs'

======================================
idemux - inline barcode demultiplexing
======================================
.. image:: https://badge.fury.io/py/idemux.svg
   :target: https://badge.fury.io/py/idemux
   :alt: Latest Version

.. image:: https://travis-ci.org/lexogen-tools/idemux.svg?branch=master
   :target: https://travis-ci.org/lexogen-tools/idemux

.. image:: https://coveralls.io/repos/github/Lexogen-Tools/idemux/badge.svg?branch=master&service=github
   :target: https://coveralls.io/github/Lexogen-Tools/idemux?branch=master


Idemux is a command line tool designed to demultiplex paired-end FASTQ files from
`QuantSeq-Pool <https://www.lexogen.com/quantseq-pool-sample-barcoded-3mrna-sequencing/>`_.

Idemux can demultiplex based on i7, i5, and i1 inline barcodes. While this tool
can generally be used to demultiplex any barcodes (as long as they are correctly supplied
and present in the FASTQ header), it performs best when used in combination with
`Lexogen indices <https://www.lexogen.com/indexing/12nt-dual-indexing-kits/>`_, as it
will correct common sequencing errors in the sequenced barcodes. This allows you
to retain more reads from your sequencing experiment while minimizing cross-contamination.


Idemux use is permitted under the following `licence <https://github.com/Lexogen-Tools/idemux/blob/master/LICENCE>`_.

**General usage:**
::

    idemux [-h] --r1 READ1 --r2 READ2 [--sample-sheet SAMPLE_SHEET]
           --out OUTPUT_DIR [--i1-start I1_START] [--i5-rc] [-v]


**Run idemux:**
::

    idemux --r1 read_1.fastq.gz --r2 read_2.fastq.gz --sample-sheet samples.csv --out /some/output/path --i1-start pos_in_read_2

Features
--------

* FASTQ file demultiplexing based on i7, i5, and i1 barcodes.
* Correction of barcode sequencing errors to maximize read yield (only works
  with `Lexogen 12 nt UDIs <https://www.lexogen.com/indexing/12nt-dual-indexing-kits/>`_
  that have been sequenced at least 8 nt).
* Reverse complementation in case the i5 index has been sequenced as reverse complement.
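
To illustrate what reverse complementation means for an i5 index, here is a minimal Python sketch. The helper name ``reverse_complement`` is hypothetical and only serves as an illustration; it is not taken from idemux's code.

```python
# Hypothetical helper for illustration only; not idemux's own implementation.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(barcode: str) -> str:
    """Return the reverse complement of a DNA barcode made of A, C, G, T, N."""
    # Complement each base, then reverse the order of the bases.
    return barcode.upper().translate(_COMPLEMENT)[::-1]


# i5 sequence of sample_0 from the example sample sheet
print(reverse_complement("CCCCACTGAGTT"))  # AACTCAGTGGGG
```

If your sequencer reads the i5 index as reverse complement, the FASTQ header would contain the second form, while the sample sheet should still list the first.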


Getting started
---------------
To get started with demultiplexing you need to:

1. `Install idemux <1. Installation_>`_
2. `Prepare a sample sheet csv <2. Preparing the sample sheet_>`_
3. `Extract non-demultiplexed read data from a sequencing run <3. Extract non-demultiplexed read data from a sequencing run_>`_
4. `Run idemux <4. Running idemux_>`_

1. Installation
===============

Idemux is available as a `conda <https://conda.io/>`_ and `PyPI <https://pypi.org/>`_ package.

To install via bioconda:

``$ conda install -c bioconda idemux``

To install via pip:

``$ pip install idemux``

|

If you don't use conda and want to install idemux into a `virtual env <https://virtualenv.pypa.io/en/latest/>`_
(always a good idea to avoid dependency conflicts), do the following:
::

    $ cd /path/you/want/it/installed/to
    # creates the venv
    $ virtualenv idemux
    # activates the venv, run 'deactivate' to deactivate
    $ source idemux/bin/activate
    $ pip install idemux


Alternatively, you can clone this repository and install from there:
::

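    $ cd /path/you/want/it/installed/to
    # creates the venv (named differently from the cloned folder to avoid a clash)
    $ virtualenv idemux_venv
    $ source idemux_venv/bin/activate
    $ git clone https://github.com/Lexogen-Tools/idemux.git
    # change into the cloned repository before installing
    $ cd idemux
    $ python setup.py install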


2. Preparing the sample sheet
=============================
In order to run idemux on your QuantSeq-Pool data, you first need to prepare a `csv file
<https://en.wikipedia.org/wiki/Comma-separated_values>`_.
We call this csv a sample sheet and it specifies which barcodes correspond to each
sample.

This is necessary because the software needs to know which bins reads should be
sorted into during demultiplexing. A sample sheet can easily be generated by filling in an
Excel spreadsheet and exporting it as csv.


Example sample sheet (i7, i5, and i1 demultiplexing):
::

    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,AAAATCCCAGTT,CCCCTAAACGTT,AAAATCCCAGTT
    sample_2,GAAAATTTACGC,GCCCCTTTCAGA,GAAAATTTACGC
    sample_3,AAACTAACTGTC,CCCATCCATGTA,AAACTAACTGTC


A sample sheet consists of four columns and always starts with the header shown
above. The ``sample_name`` values are used as output file names, while the
sequences specified in i7, i5, and i1 are used for demultiplexing.

Therefore, only unique, specific combinations of sample names and barcodes are
allowed. This means using duplicated or ambiguous combinations will result in an error.
However, idemux will do its best to tell you where the problem lies, if this happens.
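
If you prefer to generate the sample sheet from a script rather than a spreadsheet, the following Python sketch writes the four-column layout with the standard ``csv`` module. The function name ``write_sample_sheet`` and the in-memory buffer are illustrative assumptions; in practice you would pass an opened file instead.

```python
import csv
import io

# Example rows: (sample_name, i7, i5, i1). An empty string marks an absent barcode.
samples = [
    ("sample_0", "AAAACATGCGTT", "CCCCACTGAGTT", "AAAACATGCGTT"),
    ("sample_1", "AAAATCCCAGTT", "CCCCTAAACGTT", ""),  # no i1 for this sample
]


def write_sample_sheet(rows, handle):
    """Write the header and one line per sample to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["sample_name", "i7", "i5", "i1"])
    writer.writerows(rows)


buffer = io.StringIO()  # stands in for open("samples.csv", "w", newline="")
write_sample_sheet(samples, buffer)
print(buffer.getvalue())
```

An absent barcode ends up as an empty field between (or after) commas, which is the notation the sample sheet expects.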

|

**In brief, the rules are:**

1. Sample names need to be unique.
2. Barcode combinations need to be unique.
3. i7 and/or i5 indices have to be used consistently within the csv file: each of them needs to be present either for all samples or for none at all.
4. In contrast to i7/i5 indices, i1 indices can be used for a subset of samples in the csv file.
5. Absence of a barcode needs to be indicated by an empty field (no value between
   commas ``,,``).
6. If your i5 has been sequenced as reverse complement, *do not* enter the reverse
   complement sequences in the sample sheet. Use the ``--i5-rc`` option!


See `below <Sample sheet examples_>`_ for more showcases of sample/barcode combinations that are *allowed* or
*not allowed*.
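
The rules above can be approximated with a short pre-flight check. This is a simplified Python sketch written for this README, not the validation code that idemux itself runs; the function name ``check_sample_sheet`` and the wording of the messages are assumptions.

```python
import csv
import io

EXPECTED_HEADER = ["sample_name", "i7", "i5", "i1"]


def check_sample_sheet(text: str) -> list:
    """Return a list of rule violations found in the sheet (empty if none)."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["header must be exactly: " + ",".join(EXPECTED_HEADER)]
    records = rows[1:]
    errors = [
        f"line {line_no}: expected 4 fields, got {len(row)}"
        for line_no, row in enumerate(records, start=2)
        if len(row) != 4
    ]
    if errors:
        return errors
    # Rule 1: sample names need to be unique.
    names = [row[0] for row in records]
    if len(set(names)) != len(names):
        errors.append("sample names are not unique")
    # Rule 2: barcode combinations need to be unique.
    combos = [tuple(row[1:]) for row in records]
    if len(set(combos)) != len(combos):
        errors.append("barcode combinations are not unique")
    # Rule 3: i7 and i5 are each set for all samples or for none (i1 may be sparse).
    for col, label in ((1, "i7"), (2, "i5")):
        present = [bool(row[col]) for row in records]
        if any(present) and not all(present):
            errors.append(f"{label} must be set for all samples or for none")
    # A sample without any barcode cannot be demultiplexed.
    for name, *barcodes in records:
        if not any(barcodes):
            errors.append(f"{name}: no barcodes specified")
    return errors


sheet = """sample_name,i7,i5,i1
sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
sample_1,AAAATCCCAGTT,,AAAATCCCAGTT
"""
print(check_sample_sheet(sheet))
```

Running such a check before a long demultiplexing job can catch a malformed export early; idemux still performs its own validation.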

3. Extract non-demultiplexed read data from a sequencing run
============================================================
The read input files for idemux are non-demultiplexed FASTQ files. You obtain them by using demultiplexing software to extract the reads from a sequencing run without splitting them by sample.
You can use any demultiplexing software available to you, but the resulting read file(s) must contain all reads of the sequencing run that you want to demultiplex with idemux.
In addition, the read ID of each read must contain the read-out of the i7 + i5 barcode sequences.

The following part of this section outlines how to use Illumina's bcl2fastq software to obtain the reads.
::

   # Demultiplexing with bcl2fastq:
   $ bcl2fastq -R /path/to/sequencing/run -o /path/to/output -l WARNING --no-lane-splitting --sample-sheet Illumina_EMPTY_SampleSheet.csv --barcode-mismatches 0 --mask-short-adapter-reads 10

This command instructs bcl2fastq to "demultiplex" the run at */path/to/sequencing/run* into the output directory */path/to/output*.
The content of the file *Illumina_EMPTY_SampleSheet.csv* has to match Illumina's format for the respective sequencer.

The following text is an example of the content of a SampleSheet for an Illumina NextSeq run:
::

   [Header],,,,,,,
   IEMFileVersion,4,,,,,,
   Date,30.05.2017,,,,,,
   Workflow,GenerateFASTQ,,,,,,
   Application,NextSeq FASTQ Only,,,,,,
   Assay,TruSeq RNA,,,,,,
   Description,,,,,,,
   Chemistry,Default,,,,,,
   ,,,,,,,
   [Reads],,,,,,,
   ,,,,,,,
   [Settings],,,,,,,
   ,,,,,,,
   [Data],,,,,,,

   Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
   1,1,,,9999,AAAAAAAAAAAA,9999,AAAAAAAAAAAA,,

As you can see, no settings are specified and only one 'sample' is defined, with a sequence combination that is unlikely to be close to any of the barcode sequences in use.
**You have to adjust the length of the A\* stretches to the sequenced length of the i7/i5 barcodes!**
This specification is necessary to make bcl2fastq write the i7+i5 sequence information into the header of each read in the *Undetermined_S0_R1_001.fastq.gz* (and *Undetermined_S0_R2_001.fastq.gz*) file(s).
The resulting reads in *Undetermined_S0_R1_001.fastq.gz* (and *Undetermined_S0_R2_001.fastq.gz*) should follow this formatting style:
::

   @NB502007:379:HM7H2BGXF:1:11101:19231:1159 1:N:0:TTAGGACGCAAA+GGGTCTGCCGAA
   GCTCATCCATCTTTTTGAAAACTCTTCATACTCGTTAGATCGGAAGAG
   +
   AAAAAEEEAEEEEEEAEEEEEEEEEEEEEEEEEEEE/E/EEEEE/EEE
   @NB502007:379:HM7H2BGXF:1:11101:17406:1159 1:N:0:AAGTAACAGCTT+AATCGTGGACGG
   CACACCTCCGTTCACGACGCTCTTCCGATATAGATGTAACTGGAGGAA
   +
   AAAAAEEEEEAEE/EEEEEEEEEE/EEEEAEA/EEEEEEEEEEEEEEE
   @NB502007:379:HM7H2BGXF:1:11101:18203:1159 1:N:0:CTGCCAACACGA+GCTGTGGTTCAT
   GACATGTATACAGTCTACGGATGAACGTTTAGATCGGAAGAGCACACG
   +
   AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEE
   @NB502007:379:HM7H2BGXF:1:11101:7322:1159 1:N:0:TACATGGCCACT+ATGTTCCAGTGA
   CTTGGTCACGCTACTGTACTCCAGCCAGGGCGACAGAGCAAGACCTAT
   +
   AAAAAEEEEEEEEEEEE/EEEEEEEEAEEEEEAEEEEEEEEEEEAEEE
   ...
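
To check quickly whether your reads carry the barcodes in this style, you can split a header line yourself. The Python sketch below assumes the Illumina-style layout shown above; ``barcodes_from_header`` is a hypothetical helper and not the parser used by idemux.

```python
def barcodes_from_header(header: str) -> list:
    """Return the index sequence(s) found at the end of a FASTQ header line."""
    # The part after the first space looks like "1:N:0:TTAGGACGCAAA+GGGTCTGCCGAA".
    comment = header.rstrip().split(" ")[-1]
    # The index read-out is the last colon-separated field of that part.
    index_field = comment.split(":")[-1]
    # Dual indices are joined by "+", a single index has no "+".
    return index_field.split("+")


header = "@NB502007:379:HM7H2BGXF:1:11101:19231:1159 1:N:0:TTAGGACGCAAA+GGGTCTGCCGAA"
print(barcodes_from_header(header))  # ['TTAGGACGCAAA', 'GGGTCTGCCGAA']
```

The first element is the i7 read-out and the second the i5 read-out; their lengths should match the length of the A stretches set in the bcl2fastq sample sheet.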

4. Running idemux
=================
Once you have installed the tool, you can run it by typing ``idemux`` in the terminal.

Idemux accepts the following arguments:
::

    required arguments:
      --r1 READ1                   path to gzipped read 1 FASTQ file
      --r2 READ2                   path to gzipped read 2 FASTQ file
      --sample-sheet CSV           csv file describing sample names, and barcode combinations
      --out OUTPUT_DIR             where to write the output files

    optional arguments:
      --i5-rc                      when the i5 barcode has been sequenced as reverse complement.
                                   make sure to always use non-reverse complement sequences in the sample sheet
      --i1-start POS               start position of the i1 index (1-based) on read 2 (default: 11)
      -v, --version                show program's version number and exit
      -h, --help                   show help message and exit


Example commands:
::

    # demultiplexes read 1 and 2 into the folder 'demux'
    idemux --r1 read_1.fastq.gz --r2 read_2.fastq.gz --sample-sheet samples.csv --out demux

    # demultiplexing assuming the i1 barcode starts at the first base
    idemux --r1 read_1.fastq.gz --r2 read_2.fastq.gz --sample-sheet samples.csv --out demux --i1-start 1

    # demultiplexing assuming i5 is present as reverse complement in the fastq header
    # if the i5 has been sequenced as reverse complement use this option and provide
    # the NON reverse complement sequences in the sample sheet.
    idemux --r1 read_1.fastq.gz --r2 read_2.fastq.gz --sample-sheet samples.csv --out demux --i5-rc

After a successfully completed run, idemux will write a summary report to the output folder
('demultipexing_stats.tsv').

Technicalities
---------------

When you run idemux, the following will happen:

* It will check if your sample sheet is okay. See `here <Sample sheet examples_>`_ for examples.

* It will check the FASTQ header for barcodes and it expects them in the following format:

    single index (i7 or i5): @NB502007:379:HM7H2BGXF:1:11101:24585:1069 1:N:0:TCAGGTAANNTT

    where TCAGGTAANNTT is the sequence of the i7 or i5 index

    dual index (i7 and i5): @NB502007:379:HM7H2BGXF:1:11101:24585:1069 1:N:0:TCAGGTAANNTT+NANGGNNCNNNN

    where TCAGGTAANNTT is the sequence of the i7 index and NANGGNNCNNNN is the sequence of the i5 index.

* Reads with incorrect i7, i5, or i1 index sequences that idemux can correct will be written to the
  correct output file. However, the incorrect index sequence will not be replaced in the read header. This
  allows for additional processing of the incorrect sequences.
* Reads that cannot be demultiplexed will be written to undetermined_R{1/2}.fastq.gz.

* When you demultiplex based on i1 inline barcodes, a successfully recognized barcode
  sequence of 12 nt will be cut out and removed from read 2. This will leave
  you with the 10 nt UMI + the nucleotides that potentially follow the i1 barcode.

This allows you to:

1. Use other software, such as UMI_tools, to deal with the 10 nt UMI, if desired.
2. Demultiplex lanes where QuantSeq-Pool has been pooled with other libraries and read
   2 has been sequenced longer than the actual barcode.
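
The effect of cutting out the i1 barcode can be pictured with a small Python sketch. It assumes the default layout described above (10 nt UMI, then a 12 nt i1 barcode starting at 1-based position 11) and cuts unconditionally, whereas idemux only removes a barcode it has recognized. The function name ``cut_i1`` is hypothetical.

```python
def cut_i1(read2_seq: str, read2_qual: str, i1_start: int = 11, i1_len: int = 12):
    """Cut i1_len bases starting at 1-based position i1_start out of read 2."""
    begin = i1_start - 1  # convert the 1-based start to a 0-based slice index
    end = begin + i1_len
    barcode = read2_seq[begin:end]
    # Keep everything before and after the barcode, for bases and qualities alike.
    trimmed_seq = read2_seq[:begin] + read2_seq[end:]
    trimmed_qual = read2_qual[:begin] + read2_qual[end:]
    return barcode, trimmed_seq, trimmed_qual


umi, i1, rest = "ACGTACGTAC", "AAAACATGCGTT", "TTTTGGGG"
barcode, seq, qual = cut_i1(umi + i1 + rest, "E" * 30)
print(barcode)  # AAAACATGCGTT
print(seq)      # ACGTACGTACTTTTGGGG
```

The remaining sequence starts with the UMI, followed directly by whatever was sequenced after the barcode.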

Help
------
If you are demultiplexing a large number of samples (more than 500), you might encounter the
following error:

* ``OSError: [Errno 24] Too many open files``

This error occurs because most operating systems limit how many files can be open and
written to at the same time. To temporarily increase the limit on Linux, run:
::

    # multiply your sample number*2 (as data is paired end)
    # then round to the next multiple of 1024
    $ ulimit -n the_number_above

If you are looking for a permanent solution, you can change your ulimit values
`this way <https://access.redhat.com/solutions/61334>`_.
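
The arithmetic behind the ``ulimit`` value can be written down as a short Python sketch. The function name ``required_open_files`` is hypothetical, and always stepping up to the next higher multiple of 1024 is an assumption made here to leave headroom for additional files such as the undetermined reads.

```python
def required_open_files(n_samples: int, files_per_sample: int = 2, step: int = 1024) -> int:
    """Sample count times two (paired end), raised to the next multiple of 1024."""
    needed = n_samples * files_per_sample
    # Integer division plus one always lands on the next higher multiple,
    # even when 'needed' is already an exact multiple of 'step'.
    return (needed // step + 1) * step


print(required_open_files(600))  # 2048
```

For 600 samples you would therefore run ``ulimit -n 2048`` before starting idemux.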

If you experience any issues with this software, please open an issue describing your
problem. Make sure to include the version of the tool you are running (``-v, --version``)
and your operating system.

Sample sheet examples
---------------------
*This is allowed:*
::

    # demultiplexing via full i7, i5, i1
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,AAAATCCCAGTT,CCCCTAAACGTT,AAAATCCCAGTT

    # demultiplexing via full i7, i5 and sparse i1
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,AAAATCCCAGTT,CCCCTAAACGTT,

    # demultiplexing via full i7, i5
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,
    sample_1,AAAATCCCAGTT,CCCCTAAACGTT,

    # demultiplexing via full i7, no i5 and sparse i1
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,,AAAACATGCGTT
    sample_1,AAAATCCCAGTT,,

    # demultiplexing via full i7 only
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,,
    sample_1,AAAATCCCAGTT,,

    # demultiplexing via full i5 and i1
    sample_name,i7,i5,i1
    sample_0,,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,,CCCCTAAACGTT,AAAATCCCAGTT

    # demultiplexing via full i5 and sparse i1
    sample_name,i7,i5,i1
    sample_0,,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,,CCCCTAAACGTT,

    # demultiplexing via full i5
    sample_name,i7,i5,i1
    sample_0,,CCCCACTGAGTT,
    sample_1,,CCCCTAAACGTT,

    # demultiplexing via full i1
    sample_name,i7,i5,i1
    sample_0,,,AAAACATGCGTT
    sample_1,,,AAAATCCCAGTT

*This is not allowed:*
::

    # missing i1 column (or any other)
    sample_name,i7,i5,
    sample_0,AAAACATGCGTT,CCCCACTGAGTT
    sample_1,AAAATCCCAGTT,CCCCTAAACGTT

    # duplicated barcode combination
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT

    # duplicated sample names
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
    sample_0,AAAATCCCAGTT,CCCCTAAACGTT,AAAATCCCAGTT

    # mixed, potentially ambiguous indexing (full i7 and sparse i5, i1)
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,AAAATCCCAGTT,,AAAATCCCAGTT
    sample_2,GAAAATTTACGC,GCCCCTTTCAGA,GAAAATTTACGC
    sample_3,AAACTAACTGTC,,AAACTAACTGTC

    # mixed, potentially ambiguous indexing (no i7, sparse i5 & i1)
    sample_name,i7,i5,i1
    sample_0,,CCCCACTGAGTT,
    sample_1,,,AAAATCCCAGTT

    # mixed, potentially ambiguous indexing (sparse i7, full i5 & i1)
    sample_name,i7,i5,i1
    sample_0,,CCCCACTGAGTT,AAAACATGCGTT
    sample_1,AAAATCCCAGTT,CCCCTAAACGTT,AAAATCCCAGTT
    sample_2,,GCCCCTTTCAGA,GAAAATTTACGC
    sample_3,AAACTAACTGTC,CCCATCCATGTA,AAACTAACTGTC

    # missing comma separator
    sample_name,i7,i5,i1
    sample_0,AAAACATGCGTTCCCCACTGAGTT,AAAACATGCGTT

    # no barcodes
    sample_name,i7,i5,i1
    sample_0,,,

    # wrong column headers
    wrong_col_name,i7,i5,i1
    sample_0,AAAACATGCGTT,CCCCACTGAGTT,AAAACATGCGTT


=======
History
=======


0.1.5 (2020-11-24)
------------------

* Bug fix: Idemux now prints version properly
* Bug fix: README.rst contained some formatting errors
* Bug fix: Broken licence link on pypi now works


0.1.4 (2020-11-12)
------------------

* Bug fix: Demultiplexing with i1 barcodes only raised an incorrect exception (when no barcodes were present in the fastq header)


0.1.3 (2020-08-21)
------------------

* First release on PyPI


0.1.2 (2020-08-21)
------------------

* Bumped version number to avoid upload conflicts


0.1.1 (2020-08-21)
------------------

* Fixed rst files with linter
* First release on test PyPI.


0.1.0 (2020-08-21)
------------------

* First building version.


