Frequently Asked Questions

How to report an issue?

If you encounter a problem while running MacSyLib, please submit an issue on the dedicated page of the GitHub project.

To ensure we have all elements to help, please provide:

  • a concise description of the issue

  • the expected behavior versus the observed one

  • the exact command-line used

  • the version of MacSyLib used

  • the exact error message, and if applicable, the <macsylib>.log and <macsylib>.conf files

  • if applicable, an archive (or link to it) with the output files obtained

  • if possible, the smallest dataset there is to reproduce the issue

  • if applicable, the macsy-models (XML models plus HMM profiles) used, or the precise version of the models if they are publicly available. As above, please provide the smallest possible set of models and HMM profiles.

All these will definitely help us to help you! ;-)

Note

If you use macsylib in a higher-level script, you can replace <macsylib> with the name of your tool by setting the prog_name parameter of macsylib.config.MacsyDefaults.
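
The checklist above can be assembled by a small helper before opening an issue. The sketch below is only an illustration: build_issue_report and its parameters are hypothetical names, not part of MacSyLib. It assumes that the log and configuration files are named after prog_name and sit in the output directory.

```python
from pathlib import Path


def build_issue_report(description, expected, observed, command, version,
                       out_dir, prog_name="macsylib"):
    """Assemble a plain-text issue report from the items of the checklist.

    Hypothetical helper: it only formats text and checks which files exist.
    """
    out_dir = Path(out_dir)
    # Assumption: the log and conf files are named <prog_name>.log and <prog_name>.conf
    attachments = [out_dir / f"{prog_name}.log", out_dir / f"{prog_name}.conf"]
    lines = [
        f"Description: {description}",
        f"Expected behavior: {expected}",
        f"Observed behavior: {observed}",
        f"Command line: {command}",
        f"MacSyLib version: {version}",
        "Files to attach:",
    ]
    for path in attachments:
        status = "found" if path.is_file() else "missing"
        lines.append(f"  - {path.name} ({status})")
    return "\n".join(lines)
```

If your tool sets its own prog_name, pass the same value to the helper so that it looks for the matching file names.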

How to cite MacSyLib and published macsy-models?

  • Néron, Bertrand et al. 2023, Peer Community Journal, MacSyFinder v2: Improved modelling and search engine to identify molecular systems in genomes.

  • Abby and Rocha 2012, PLoS Genetics, for the study of the evolutionary relationship between the T3SS and the bacterial flagellum, and how the corresponding HMM protein profiles were designed.

  • Abby et al. 2016, Scientific Reports, for the description of bacterial protein secretion systems’ models (TXSScan: T1SS, T2SS, T5SS, T6SS, T9SS, Tad, T4P).

  • Denise et al. 2019, PLoS Biology, for the description of type IV-filament super-family models (TFF-SF: T2SS, T4aP, T4bP, Com, Tad, archaeal T4P).

  • Rendueles et al. 2017, PLoS Pathogens, for the CapsuleFinder set of models.

  • Couvin, Bernheim et al. 2018, Nucleic Acids Research, for the updated version of the set of Cas systems’ models, CasFinder.

How to use MacSyLib?

Here is an example of an analysis with a Python script using macsylib:

import os
import logging
from argparse import Namespace

import macsylib.config
import macsylib.registries
import macsylib.utils
import macsylib.search_systems
import macsylib.io
from macsylib.system import HitSystemTracker

defaults = macsylib.config.MacsyDefaults()
settings = Namespace(
    db_type='ordered_replicon',
    sequence_db='test.fasta',
    models=['TXSScan', 'all'],  # this model must be installed with msl_data scripts
    worker=4,
    out_dir='my_results'
)
config = macsylib.config.Config(defaults, settings)

os.makedirs(config.working_dir())  # working_dir = out_dir; you have to create this directory

macsylib.init_logger(log_file=os.path.join(config.working_dir(), config.log_file()))
macsylib.logger_set_level(level=logging.INFO)
logger = logging.getLogger('macsylib')
model_registry = macsylib.registries.ModelRegistry()

for model_dir in config.models_dir():
    models_loc_available = macsylib.registries.scan_models_dir(model_dir)
    for model_loc in models_loc_available:
        model_registry.add(model_loc)
models_def_to_detect, models_fam_name, models_version = macsylib.utils.get_def_to_detect(config.models(),
                                                                                         model_registry)

all_systems, rejected_candidates = macsylib.search_systems.search_systems(config, model_registry, models_def_to_detect,
                                                                          logger)
track_multi_systems_hit = HitSystemTracker(all_systems)

# Write the summary of the systems found.
# The model family name and version are those returned by get_def_to_detect above.
with open(os.path.join(config.working_dir(), 'all_systems.txt'), "w", encoding='utf8') as tsv_file:
    macsylib.io.systems_to_tsv(models_fam_name, models_version,
                               all_systems,
                               track_multi_systems_hit,
                               tsv_file,
                               header=lambda model_name, model_v, skipped_replicons: f'# created by {__file__} script with models {model_name}-{model_v}')

The script above will produce the file below with a summary of all systems found.

# created by /home/bneron/Projects/GEM/MacSyFinder/src/macsylib/Sandbox/msl_example.py script with models TXSScan-1.1.3
# Systems found:
replicon	hit_id	gene_name	hit_pos	model_fqn	sys_id	sys_loci	locus_num	sys_wholeness	sys_score	sys_occ	hit_gene_ref	hit_status	hit_seq_len	hit_i_eval	hit_score	hit_profile_cov	hit_seq_cov	hit_begin_match	hit_end_match	counterpart	used_in
test	VICH001.B.00001.C001_01397	T1SS_abc	11	TXSScan/bacteria/diderm/T1SS	test_T1SS_1	1	1	1.000	2.700	2	T1SS_abc	mandatory	721	2.4e-157	516.300	0.996	0.698	175	677		
test	VICH001.B.00001.C001_01398	T1SS_mfp	12	TXSScan/bacteria/diderm/T1SS	test_T1SS_1	1	1	1.000	2.700	2	T1SS_mfp	mandatory	467	2.3e-72	235.600	1.000	0.797	85	456		
test	VICH001.B.00001.C001_01399	T1SS_abc	13	TXSScan/bacteria/diderm/T1SS	test_T1SS_1	1	1	1.000	2.700	2	T1SS_abc	mandatory	720	2.1e-158	519.700	1.000	0.699	167	669		
test	VICH001.B.00001.C001_01506	T1SS_omf	23	TXSScan/bacteria/diderm/T1SS	test_T1SS_1	1	-1	1.000	2.700	2	T1SS_omf	mandatory	419	9.3e-35	111.500	0.998	0.912	25	406		

test	VICH001.B.00001.C001_01397	T1SS_abc	11	TXSScan/bacteria/diderm/T1SS	test_T1SS_2	1	1	1.000	2.700	2	T1SS_abc	mandatory	721	2.4e-157	516.300	0.996	0.698	175	677		
test	VICH001.B.00001.C001_01398	T1SS_mfp	12	TXSScan/bacteria/diderm/T1SS	test_T1SS_2	1	1	1.000	2.700	2	T1SS_mfp	mandatory	467	2.3e-72	235.600	1.000	0.797	85	456		
test	VICH001.B.00001.C001_01399	T1SS_abc	13	TXSScan/bacteria/diderm/T1SS	test_T1SS_2	1	1	1.000	2.700	2	T1SS_abc	mandatory	720	2.1e-158	519.700	1.000	0.699	167	669		
test	VICH001.B.00001.C001_01506	T1SS_omf	23	TXSScan/bacteria/diderm/T1SS	test_T1SS_2	1	-1	1.000	2.700	2	T1SS_omf	mandatory	419	9.3e-35	111.500	0.998	0.912	25	406		
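
A summary file like the one above can be read back with the Python standard library. The sketch below is a minimal illustration, not part of MacSyLib: read_systems is a hypothetical helper. It assumes that the file keeps the layout shown above: comment lines starting with '#', one tab-separated header line, and blank lines between systems.

```python
import csv
from collections import defaultdict


def read_systems(tsv_path):
    """Group the hit rows of a systems TSV file by their sys_id column."""
    systems = defaultdict(list)
    with open(tsv_path, encoding="utf8", newline="") as tsv_file:
        # Skip comment lines ('#') and the blank lines that separate systems;
        # the first remaining line is the column header.
        rows = (line for line in tsv_file
                if line.strip() and not line.startswith("#"))
        for row in csv.DictReader(rows, delimiter="\t"):
            systems[row["sys_id"]].append(row)
    return dict(systems)
```

For the file shown above, read_systems would return two entries, test_T1SS_1 and test_T1SS_2, each holding four hits. All values are kept as strings, so convert columns such as sys_score before comparing them numerically.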

The code and the data are available:

  • msl_example.py

  • test.fasta

For more details, see the Developer Guide and the MacSyLib API documentation.
For more examples, see the MacSyFinder source code.

Which search mode should be used?

Depending on the type of dataset you have, you will have to adapt MacSyLib’s search mode.

  • If you have a fasta file from a complete genome where proteins are ordered according to the order of the corresponding genes along the replicon, your dataset qualifies for the most powerful search mode (see below), ordered_replicon. Use the following option: --db-type ordered_replicon.

  • If you have a fasta file of proteins with no information on the order of the corresponding genes along the chromosome(s) or replicon(s), you will have to use the unordered search mode with the following option: --db-type unordered.

  • If you have multiple ordered replicons to analyse at once, you can follow the Gembase convention to name the proteins in the fasta file, so that the original replicons can be assessed from their name: see here for a description.

Note

  • When the gene order is known (ordered_replicon search mode) the power of the analysis is maximal, since both the genomic content and context are taken into account for the search.

  • When the gene order is unknown (unordered search mode) the power of the analysis is more limited since the presence of systems can only be suggested on the basis of the quorum of components - and not based on genomic context information.
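
The choice described above can be summarised as a small decision function. This is only a sketch: choose_db_type is a hypothetical helper, and the 'gembase' value is an assumption carried over from the db-type choices of MacSyFinder, since only ordered_replicon and unordered are named here.

```python
def choose_db_type(gene_order_known: bool, gembase_names: bool = False) -> str:
    """Return a db_type value matching the kind of dataset.

    gene_order_known: proteins are listed in the order of their genes.
    gembase_names: several ordered replicons are present and the protein
        names follow the Gembase convention.
    """
    if not gene_order_known:
        # Only the quorum of components can be used, not the genomic context.
        return "unordered"
    if gembase_names:
        # Assumption: 'gembase' is the db_type for Gembase-named datasets.
        return "gembase"
    return "ordered_replicon"
```

The returned string can then be passed as the db_type setting, as in the script of the "How to use MacSyLib?" section.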

More on MacSyLib’s functioning here.

How to deal with fragmented genomes (MAGs, SAGs, draft genomes)?

There are more and more genomes available which are not completely assembled, or are fragmented and incomplete. In this case, several options can be considered.

1. If your genome is at least partially assembled and the contigs are not too short, you might “feel lucky” and first consider running MacSyLib in the ordered_replicon mode. This can be particularly efficient if you are investigating systems encoded by compact loci (Cas systems, some secretion systems…), as they might be encoded on a single contig.

2. On top of the ordered_replicon mode, you might add the option “multi-loci” to the systems to annotate (if not already the case), in order to maximize the chance to annotate an entire system, even if encoded across several contigs.

3. The unordered mode can be used to complement the two options above, e.g. to retrieve some of the missing components. It makes it possible to assess the genetic potential and the possible presence of a system, independently of the assembly quality of the genome. It might also be the only reasonable option if the genome is too fragmented and/or too incomplete.

Note

  • The results obtained with the ordered_replicon mode on a fragmented genome have to be considered carefully, especially with respect to the contigs’ borders, as some proteins from different contigs might be artificially considered as closely encoded.

  • To retrieve “fragments” of a system that did not reach the quorum in the ordered_replicon mode, you can extract clusters of genes from the rejected_candidates.tsv file.
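
One way to act on the first point of this note is to flag the systems whose hits come from more than one contig, so that they can be inspected by hand. The sketch below is an illustration under a strong assumption: protein identifiers have the form <contig>_<gene>, so the contig name is everything before the last underscore. Both functions are hypothetical helpers, not part of MacSyLib.

```python
from collections import defaultdict


def contig_of(hit_id: str) -> str:
    """Return the part of a protein identifier before its last underscore."""
    # Assumption: identifiers look like '<contig>_<gene>'.
    return hit_id.rsplit("_", 1)[0]


def systems_spanning_contigs(hits):
    """Return {sys_id: sorted contig names} for systems found on several contigs.

    hits: iterable of (sys_id, hit_id) pairs, for example taken from the
    sys_id and hit_id columns of a systems TSV file.
    """
    contigs = defaultdict(set)
    for sys_id, hit_id in hits:
        contigs[sys_id].add(contig_of(hit_id))
    return {sys_id: sorted(names)
            for sys_id, names in contigs.items() if len(names) > 1}
```

A system reported here is not necessarily wrong, but its hits were treated as neighbours across a contig border and deserve a closer look.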

Where to find MacSyLib models?

Since version 2, there is a tool to enable the download and installation of published models from a repository: the msl_data tool.

See here for details on how to use it.

What are the rules for options precedence?

MacSyLib offers many ways to parametrize the systems’ search: through the command-line, through various configuration files (for the models, for the run, etc…). It offers a large control over the search engine. But it also means you can get lost in configuration. ;-)

Here is a recap of the rules for options precedence. In a general manner, the command line always wins.

The precedence rules between the different levels of configuration are:

system < home < model < project < --cfg-file | --previous-run < command line options

  • system: the <macsylib>.conf file in ${VIRTUAL_ENV}/etc/<macsylib>/. In the case of a virtualenv, this configuration affects only the MacSyLib version installed in that virtualenv.

  • home: the ~/.<macsylib>/<macsylib>.conf file

  • model: the model_conf.xml file at the root of the model package

  • project: the <macsylib>.conf file found in the directory where the macsylib command was run

  • cfgfile: any configuration file specified by the user on the command line

  • previous-run: the <macsylib>.conf file found in the results directory of the previous run
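
The precedence chain can be pictured as successive dictionary updates, from the lowest to the highest priority level. The sketch below only illustrates the rule and is not how MacSyLib implements it: resolve_options and the level names are hypothetical, and the --cfg-file and --previous-run levels are merged into a single entry because they occupy the same rank in the chain.

```python
# Lowest to highest priority, following the chain given above.
PRECEDENCE = ("system", "home", "model", "project",
              "cfg_file_or_previous_run", "command_line")


def resolve_options(levels):
    """Merge per-level option dicts; a higher-priority level overrides a lower one.

    levels: mapping from a level name in PRECEDENCE to a dict of options.
    """
    unknown = set(levels) - set(PRECEDENCE)
    if unknown:
        raise ValueError(f"unknown configuration level(s): {sorted(unknown)}")
    resolved = {}
    for level in PRECEDENCE:
        # Later (higher-priority) levels overwrite values set by earlier ones.
        resolved.update(levels.get(level, {}))
    return resolved
```

For example, a worker value set at the home level is replaced by the one set at the project level, and any option given on the command line overrides both.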