Metadata-Version: 2.1
Name: smallBixTools
Version: 0.0.29
Summary: A few small methods for bioinformatics
Home-page: https://bitbucket.org/hivdiversity/small_bix_tools
Author: David Matten
Author-email: david.matten@uct.ac.za
License: GNU GENERAL PUBLIC LICENSE
Description: # smallBixTools
        a few small functions for bioinformatics
        
        
        
        See readme for full details.
        
        Repo location:
        
        https://bitbucket.org/hivdiversity/small_bix_tools
        
        Docs:
        https://small-bix-tools.readthedocs.io/en/latest/
        
        List of functions:
        (INCOMPLETE)
        
        get_regions_from_panel:
        
        Slices regions out of a fasta formatted file, joins them together, and writes the resulting fasta file to the given location.
        An example call might be: get_regions_from_panel("test.fasta", [[0, 10], [20, 30]], "/tmp", "outfile.fasta")
        which would, for each sequence in the input file: "test.fasta", take the region from 0 to 10 joined with the
        region from 20 to 30, and write the result to the file: "/tmp/outfile.fasta".
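        
        A minimal sketch of the slice-and-join step on a single sequence string, not the package's own code. The
        name join_regions_sketch is hypothetical, and it assumes each region is a [start, end] pair with Python
        slice semantics (end not included), which the description above does not confirm.
        
        ```python
        def join_regions_sketch(sequence, regions):
            """Slice each [start, end] region out of sequence and join the pieces."""
            # Assumption: end is exclusive, as in ordinary Python slicing.
            return "".join(sequence[start:end] for start, end in regions)
        
        
        # Four blocks of ten identical characters make the slices easy to see.
        seq = "A" * 10 + "C" * 10 + "G" * 10 + "T" * 10
        print(join_regions_sketch(seq, [[0, 10], [20, 30]]))
        ```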
        
        find_ranges
        
        Find contiguous ranges in a list of numerical values.
        eg: data = [1,2,3,4,8,9,10]
        find_ranges(data) will return:
        [[1, 2, 3, 4], [8, 9, 10]]
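        
        One way to get this grouping is sketched below. This is an illustration only, not the package's
        implementation; find_ranges_sketch is a hypothetical name, and it assumes the input is a sorted list of
        integers.
        
        ```python
        def find_ranges_sketch(data):
            """Group a sorted list of integers into runs of consecutive values."""
            ranges = []
            for value in data:
                # Extend the current run when this value follows directly on from it.
                if ranges and value == ranges[-1][-1] + 1:
                    ranges[-1].append(value)
                else:
                    # Otherwise start a new run.
                    ranges.append([value])
            return ranges
        
        
        print(find_ranges_sketch([1, 2, 3, 4, 8, 9, 10]))
        ```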
        
        hamdist
        
        Use this after aligning sequences.
        This counts the number of differences between equal length str1 and str2
        The order of the input sequences does not matter.
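        
        A minimal sketch of such a difference count, assuming plain Python strings. The name hamdist_sketch is
        hypothetical, and raising a ValueError on unequal lengths is an assumption rather than documented
        behaviour of hamdist.
        
        ```python
        def hamdist_sketch(str1, str2):
            """Count the positions at which two equal-length strings differ."""
            if len(str1) != len(str2):
                # Assumption: unequal lengths are treated as an error.
                raise ValueError("sequences must be the same length (align them first)")
            return sum(ch1 != ch2 for ch1, ch2 in zip(str1, str2))
        
        
        # Gap characters are compared like any other character.
        print(hamdist_sketch("ACGT-A", "ACCT-T"))
        ```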
        
        fasta_to_dct
        
        Returns a dictionary of the contents of the file name given. The dictionary has the format:
        {sequence_id: sequence_string, id_2: sequence_2, etc.}
        
        dct_to_fasta
        
        :param d: dictionary in the form: {sequence_id: sequence_string, id_2: sequence_2, etc.}
        :param fn: The file name to write the fasta formatted file to.
        :return: Returns True if successfully wrote to file.
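        
        The round trip between a fasta file and a dictionary can be sketched in plain Python as below. These are
        hypothetical stand-ins (fasta_to_dct_sketch, dct_to_fasta_sketch), not the package's functions; rejecting
        duplicate ids and writing each sequence on a single line are assumptions.
        
        ```python
        import os
        import tempfile
        
        
        def fasta_to_dct_sketch(file_name):
            """Read a fasta file into {sequence_id: sequence_string}."""
            dct = {}
            seq_id = None
            with open(file_name) as handle:
                for line in handle:
                    line = line.strip()
                    if not line:
                        continue
                    if line.startswith(">"):
                        seq_id = line[1:]
                        if seq_id in dct:
                            # Assumption: duplicate ids are an error.
                            raise ValueError("duplicate sequence id: " + seq_id)
                        dct[seq_id] = ""
                    elif seq_id is not None:
                        # Sequences may be wrapped over several lines; join them.
                        dct[seq_id] += line
            return dct
        
        
        def dct_to_fasta_sketch(d, fn):
            """Write {sequence_id: sequence_string} to fn in fasta format."""
            with open(fn, "w") as handle:
                for seq_id, seq in d.items():
                    handle.write(">{}\n{}\n".format(seq_id, seq))
            return True
        
        
        sequences = {"seq_1": "ACGT", "seq_2": "AC-T"}
        path = os.path.join(tempfile.mkdtemp(), "example.fasta")
        dct_to_fasta_sketch(sequences, path)
        print(fasta_to_dct_sketch(path) == sequences)
        ```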
        
        find_duplicate_ids
        
        customdist
        
        hyphen_to_underscore_fasta
        
        auto_duplicate_removal
        
        Attempts to automatically remove duplicate sequences from the specified file. Writes results to the output file
        specified. Uses BioPython SeqIO to parse the input file specified. Replaces spaces in the sequence id with underscores.
        Iterates over all sequences found. For each one, checks whether its key already exists in an accumulating
        dictionary; if it does, checks whether the sequences which each specifies are the same. If they have the same key
        and the same sequence, keeps the second instance encountered. Once the file has been parsed, writes to the
        output file specified all sequences found which
        Will raise an exception if an error occurs during execution.
        
        build_cons_seq
        
        https://www.biostars.org/p/14026/
        
        own_cons_maker
        
        split_file_into_timepoints
        
        size_selector
        
        py2_fasta_iter
        
        From Brent Pedersen: https://www.biostars.org/p/710/#1412
        Given a fasta file, yields tuples of (header, sequence).
        
        py3_fasta_iter
        
        Modified from Brent Pedersen: https://www.biostars.org/p/710/#1412
        Given a fasta file, yields tuples of (header, sequence).
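        
        A sketch of this style of iterator, built on itertools.groupby. It is not the package's code: the name
        fasta_iter_sketch is hypothetical, it takes any iterable of lines rather than a file name, and it assumes
        well-formed input in which the first line is a header.
        
        ```python
        from itertools import groupby
        
        
        def fasta_iter_sketch(lines):
            """Yield (header, sequence) tuples from an iterable of fasta lines."""
            # groupby alternates between runs of header lines and runs of sequence lines.
            groups = (group for _, group in groupby(lines, lambda line: line.startswith(">")))
            for header_group in groups:
                header = next(header_group)[1:].strip()
                # The next group holds the sequence lines; a header at the very end gets "".
                seq = "".join(part.strip() for part in next(groups, ()))
                yield header, seq
        
        
        example = [">seq_1\n", "ACGT\n", "TTAA\n", ">seq_2\n", "GG-C\n"]
        for header, seq in fasta_iter_sketch(example):
            print(header, seq)
        ```
        
        Passing an open file handle works the same way, since a file object is also an iterable of lines.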
        
        convert_count_to_frequency_on_fasta
        
        when running vsearch as such:
        vsearch --cluster_fast {} --id 0.97 --sizeout --centroids {}
        We get a centroids.fasta file with seqid header lines like:
        >ATTCCGGTATCT_9;size=1432;
        >CATCATCGTAAG_14;size=1;
        etc.
        This method converts those count values into frequencies.
        Notes: The delimiter between sections in the sequence id must be ";".
        There must be a section in the sequence id which has exactly: "size=x" where x is an integer.
        This section must be surrounded by ";" characters.
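        
        The conversion can be sketched on the sequence ids alone, as below. This is an illustration, not the
        package's method: counts_to_frequencies_sketch and get_size_sketch are hypothetical names, and the
        "freq=" label and four decimal places are assumptions, so the real output format may differ.
        
        ```python
        import re
        
        # A section must be exactly "size=x", where x is an integer.
        SIZE_PATTERN = re.compile(r"^size=(\d+)$")
        
        
        def get_size_sketch(seq_id):
            """Return the integer from the "size=x" section of a ";"-delimited id."""
            for section in seq_id.split(";"):
                match = SIZE_PATTERN.match(section)
                if match:
                    return int(match.group(1))
            raise ValueError("no size=x section found in: " + seq_id)
        
        
        def counts_to_frequencies_sketch(seq_ids):
            """Replace each size=x section with that count's share of the total."""
            sizes = [get_size_sketch(seq_id) for seq_id in seq_ids]
            total = float(sum(sizes))
            new_ids = []
            for seq_id, size in zip(seq_ids, sizes):
                sections = [
                    # Assumption: frequencies are written as "freq=" with four decimals.
                    "freq={:.4f}".format(size / total) if SIZE_PATTERN.match(section) else section
                    for section in seq_id.split(";")
                ]
                new_ids.append(";".join(sections))
            return new_ids
        
        
        print(counts_to_frequencies_sketch(["ATTCCGGTATCT_9;size=1432;", "CATCATCGTAAG_14;size=1;"]))
        ```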
        
        countNinPrimer
        
        Motifbinner2 requires values to be specified for primer id length and primer length. It's tiresome to have to
        calculate this for many strings. So, I wrote this to help myself.
        An example of a primer sequence might be: NNNNNNNAAGGGCCAAAGGAACCCTTTAGAGACTATG
        And we would like to know how many N's there are, how many other characters there are, and what the combined
        total length is.
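        
        A minimal sketch of that count, assuming the primer is a plain string. The name count_n_in_primer_sketch
        and the returned tuple layout are hypothetical, and counting lower-case "n" as well is an assumption;
        the real countNinPrimer may report its results differently.
        
        ```python
        def count_n_in_primer_sketch(primer):
            """Return (number of N's, number of other characters, total length)."""
            # Assumption: "n" and "N" are both counted.
            n_count = primer.upper().count("N")
            other_count = len(primer) - n_count
            return n_count, other_count, len(primer)
        
        
        n_count, other_count, total = count_n_in_primer_sketch("NNNNNNNAAGGGCCAAAGGAACCCTTTAGAGACTATG")
        print("N's:", n_count, "other:", other_count, "total:", total)
        ```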
        
        compare_fasta_files
        
        Compares two fasta files, to see if they contain the same data. The sequences must be named the same. We check if
        sequence A from file 1 is the same as sequence A from file 2.
        The order in the files does not matter.
        Gaps are considered.
        
        unmake_hash_of_seqids
        
        When calling mafft - sequence ids over 253 in length are truncated. This can result in non-unique ids if the first
        253 characters of the seqid are the same, with a difference following that.
        To get around this - we can hash the sequence ids, and write a new .fasta file for mafft to work on, then
        translate the sequence ids back afterwards.
        
        This function does the translation back afterwards.
        
        This is a sibling function to: make_hash_of_seqids.
        
        Will raise an exception on error
        
        make_hash_of_seqids
        
        When calling mafft - sequence ids over 253 in length are truncated. This can result in non-unique ids if the first
        253 characters of the seqid are the same, with a difference following that.
        To get around this - we can hash the sequence ids, and write a new .fasta file for mafft to work on, then
        translate the sequence ids back afterwards.
        
        This function does the hashing and writing to file.
        
        This is a sibling function to: unmake_hash_of_seqids.
        
        Will raise an exception on error
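        
        The hash-and-translate-back idea can be sketched on dictionaries, leaving out the file handling. This is
        not the package's code: the function names below are hypothetical, and the use of hashlib.md5 and of an
        in-memory lookup dictionary are assumptions about how such a scheme could work.
        
        ```python
        import hashlib
        
        
        def make_hash_of_seqids_sketch(dct):
            """Replace each sequence id with a short hash; also return a lookup table."""
            hashed = {}
            lookup = {}
            for seq_id, seq in dct.items():
                # Assumption: md5 is used; its hex digest is always 32 characters long.
                digest = hashlib.md5(seq_id.encode("utf-8")).hexdigest()
                if digest in lookup:
                    raise ValueError("hash collision for sequence id: " + seq_id)
                hashed[digest] = seq
                lookup[digest] = seq_id
            return hashed, lookup
        
        
        def unmake_hash_of_seqids_sketch(hashed, lookup):
            """Translate hashed ids back to the original sequence ids."""
            return {lookup[digest]: seq for digest, seq in hashed.items()}
        
        
        # Two ids that share their first 253 characters and differ only after that.
        original = {"x" * 253 + "_a": "ACGT", "x" * 253 + "_b": "AC-T"}
        hashed, lookup = make_hash_of_seqids_sketch(original)
        print(unmake_hash_of_seqids_sketch(hashed, lookup) == original)
        ```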
        
Keywords: bioinformatics computational-biology
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Education
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.2
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Provides-Extra: dev
Provides-Extra: test
