cluster

A cluster is an ordered set of hits related to a model which satisfy the model distance constraints.

cluster API reference

Class Cluster

class macsylib.cluster.Cluster(hits: list[CoreHit] | list[ModelHit], model, hit_weights)[source]

Handle hits related to a model which are co-located.

__contains__(m_hit: ModelHit) bool[source]
Parameters:

m_hit – The hit to test

Returns:

True if the hit is in the cluster hits, False otherwise

__init__(hits: list[CoreHit] | list[ModelHit], model, hit_weights) None[source]
Parameters:
  • hits – the hits constituting this cluster

  • model – the model associated to this cluster

  • hit_weights – the weights of the hits, used to compute the score

__str__() str[source]
Returns:

a string representation of this cluster

__weakref__

list of weak references to the object

_check_replicon_consistency() None[source]
Raises:

MacsylibError if the hits of the cluster are not all related to the same replicon
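The check described above can be sketched as follows. This is a minimal illustration and not the library code: the `Hit` dataclass is a hypothetical stand-in for the macsylib hit classes, and `ValueError` is used in place of MacsylibError so that the example runs without macsylib installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for a hit (not the macsylib class)."""
    gene_name: str
    replicon_name: str
    position: int


def check_replicon_consistency(hits: list[Hit]) -> None:
    """Raise ValueError (stand-in for MacsylibError) if the hits span several replicons."""
    replicons = {hit.replicon_name for hit in hits}
    if len(replicons) > 1:
        raise ValueError(f"hits come from several replicons: {sorted(replicons)}")


# Hits on a single replicon pass silently.
check_replicon_consistency([Hit("a", "chr1", 10), Hit("b", "chr1", 11)])
```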

fulfilled_function(*genes: ModelGene | str) frozenset[str][source]
Parameters:

genes – The genes which must be tested.

Returns:

the common functions between genes and this cluster.
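As a rough sketch of this behaviour, assuming the genes are given as plain names (the documented method also accepts ModelGene objects, which this example does not handle), the result is the intersection between the requested names and the functions of the cluster. The standalone function and its `cluster_functions` argument are hypothetical.

```python
def fulfilled_function(cluster_functions: frozenset[str], *genes: str) -> frozenset[str]:
    """Return the names in `genes` that are also functions of the cluster."""
    # frozenset.intersection accepts any iterable and returns a frozenset
    return cluster_functions.intersection(genes)


common = fulfilled_function(frozenset({"a", "b"}), "a", "z")
```

Here `common` contains only 'a', since 'z' is not a function of the cluster.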

property functions: frozenset[str]
Returns:

The set of functions encoded by this cluster. Here, function means the gene name, or the reference gene name for exchangeable genes. For instance:

<model vers="2.0">
    <gene name="a" presence="mandatory"/>
    <gene name="b" presence="accessory">
        <exchangeables>
            <gene name="c"/>
        </exchangeables>
    </gene>
</model>

The functions for a cluster corresponding to this model will be {'a', 'b'}.
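The mapping from hits to functions can be illustrated with a short sketch. It assumes the model above, where c is an exchangeable of b; the `REFERENCE_OF` table and the standalone `functions` helper are hypothetical and only mimic the documented property.

```python
# Hypothetical lookup derived from the model above:
# each gene points to the reference gene whose function it fulfils.
REFERENCE_OF = {"a": "a", "b": "b", "c": "b"}


def functions(hit_gene_names: list[str]) -> frozenset[str]:
    """Return the set of functions encoded by hits with the given gene names."""
    return frozenset(REFERENCE_OF[name] for name in hit_gene_names)


# A cluster with hits for genes a and c encodes the functions a and b.
cluster_functions = functions(["a", "c"])
```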

property hit_weights: HitWeight
Returns:

the weights of the hits, used to compute the score

property hits: list[CoreHit | ModelHit]
Returns:

the hits sorted by increasing position

property loner: bool
Returns:

True if this cluster is made only of hits representing the same gene and this gene is tagged as loner. False otherwise, that is, if the cluster:

  • contains several hits coding for different genes

  • contains one hit but the gene is not tagged as loner (max_gene_required = 1)
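The rule above can be sketched as a small predicate. The `Hit` dataclass and its `is_loner` flag are hypothetical stand-ins for the macsylib hit classes; only the documented conditions are reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for a hit (not the macsylib class)."""
    gene_name: str
    is_loner: bool


def is_loner_cluster(hits: list[Hit]) -> bool:
    """True if all hits represent the same gene and this gene is tagged as loner."""
    gene_names = {hit.gene_name for hit in hits}
    return len(gene_names) == 1 and all(hit.is_loner for hit in hits)


# Two hits of the same loner gene form a loner cluster.
same_loner = is_loner_cluster([Hit("a", True), Hit("a", True)])
```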

merge(cluster: Cluster, before: bool = False) None[source]

Merge the given cluster into this one (in place).

Parameters:
  • cluster – the cluster to merge into this one

  • before (bool) – If False, the hits of the given cluster are added at the end of this one; otherwise they are inserted before the hits of this one.

Raises:

MacError – if the two clusters do not have the same model
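The effect of the before parameter can be sketched with a simplified class. `MiniCluster` is a hypothetical stand-in for macsylib.cluster.Cluster, hits are plain strings, and `ValueError` replaces the documented exception so that the example runs without macsylib.

```python
class MiniCluster:
    """Hypothetical, simplified stand-in for macsylib.cluster.Cluster."""

    def __init__(self, hits: list[str], model: str) -> None:
        self.hits = list(hits)
        self.model = model

    def merge(self, cluster: "MiniCluster", before: bool = False) -> None:
        """Merge `cluster` into this one, in place."""
        if cluster.model != self.model:
            # ValueError is a stand-in for the exception raised by the library
            raise ValueError("cannot merge clusters from different models")
        if before:
            self.hits = cluster.hits + self.hits
        else:
            self.hits.extend(cluster.hits)


c1 = MiniCluster(["h3", "h4"], model="T2SS")
c1.merge(MiniCluster(["h5"], model="T2SS"))               # appended at the end
c1.merge(MiniCluster(["h1", "h2"], model="T2SS"), before=True)  # inserted first
```

After both calls, `c1.hits` is `['h1', 'h2', 'h3', 'h4', 'h5']`.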

property multi_system: bool
Returns:

True if this cluster is made of only one hit representing a multi_system gene. False otherwise, that is, if the cluster:

  • contains several hits

  • contains one hit but the gene is not tagged as loner (max_gene_required = 1)

replace(old: ModelHit, new: ModelHit) None[source]

Replace the hit old in this cluster by the hit new (in place). Beware: the hits in a cluster are sorted by position, so if the old and new hits do not have the same position, the order of the hits will change.

Parameters:
  • old – the hit to replace

  • new – the new hit

Returns:

None

property replicon_name: str
Returns:

The name of the replicon where this cluster is located

Return type:

str

property score: float
Returns:

The score for this cluster

cluster functions

Functions that help to build macsylib.cluster.Cluster objects.

macsylib.cluster.build_clusters(hits: list[ModelHit], rep_info: RepliconInfo, model: Model, hit_weights: HitWeight) tuple[list[Cluster], dict[str, Loner | LonerMultiSystem]][source]

From a list of filtered hits and replicon information (topology, length), build all lists of hits that satisfy the constraints:

  • max_gene_inter_space

  • loner

  • multi_system

If so, create a cluster. A cluster contains at least two hits separated by at most max_gene_inter_space, except for loner genes, which are allowed to be alone in a cluster.

Parameters:
  • hits – list of filtered hits

  • rep_info – the replicon to analyse

  • model – the model to study

  • hit_weights – the hit weight needed to compute the cluster score

Returns:

the list of regular clusters and the special clusters (loners not in a cluster and multi systems)

Return type:

tuple with 2 elements

  • true_clusters: a list of Cluster objects

  • true_loners: a dict {str function: macsylib.hit.Loner | macsylib.hit.LonerMultiSystem object}
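The shape of the returned tuple can be illustrated with a simplified sketch. It assumes the hits have already been grouped by distance; the `Hit` dataclass, its `is_loner` flag and the `partition_groups` helper are hypothetical, and the real function does more work (multi system handling, choice between several loner hits) than is shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for a model hit (not the macsylib class)."""
    gene_name: str
    position: int
    is_loner: bool = False


def partition_groups(groups: list[list[Hit]]) -> tuple[list[list[Hit]], dict[str, Hit]]:
    """Split groups of co-located hits into regular clusters and isolated loners."""
    true_clusters: list[list[Hit]] = []
    true_loners: dict[str, Hit] = {}
    for group in groups:
        if len(group) >= 2:
            true_clusters.append(group)
        elif len(group) == 1 and group[0].is_loner:
            # Simplification: keep the first loner hit seen for each function
            true_loners.setdefault(group[0].gene_name, group[0])
        # a single hit that is not a loner is discarded
    return true_clusters, true_loners


groups = [
    [Hit("a", 10), Hit("b", 11)],
    [Hit("c", 50, is_loner=True)],
    [Hit("d", 90)],
]
true_clusters, true_loners = partition_groups(groups)
```

In this example the first group becomes a regular cluster, the hit for c is kept as a loner keyed by its function, and the isolated hit for d is dropped.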

macsylib.cluster.closest_hit(hit: ModelHit, ref_hits: list[ModelHit]) ModelHit[source]
Parameters:
  • hit – the hit

  • ref_hits – The reference hits. The distance between hit and each ref_hit is computed, and the closest ref_hit is returned.

Returns:

The closest ref_hit to the hit. If two ref_hits are equidistant from the hit, the one with the lowest position is returned. For instance:

position     40    20  60
closest_hit( hit, [H1, H2])

will return H1
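The tie-breaking rule can be sketched with the built-in min and a composite key. The `Hit` dataclass is a hypothetical stand-in with only a name and a position, and the distance is taken as the absolute difference of positions, which is an assumption about how the library measures it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for a model hit (not the macsylib class)."""
    name: str
    position: int


def closest_hit(hit: Hit, ref_hits: list[Hit]) -> Hit:
    """Return the ref_hit closest to hit; on a tie, the one with the lowest position."""
    return min(ref_hits, key=lambda ref: (abs(ref.position - hit.position), ref.position))


hit = Hit("hit", 40)
h1, h2 = Hit("H1", 20), Hit("H2", 60)
closest = closest_hit(hit, [h1, h2])  # H1 and H2 are both 20 away, H1 has the lowest position
```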

macsylib.cluster.clusterize_hits_around_key_genes(key_genes: set[str], hits: list[ModelHit], model: Model, hit_weights: HitWeight, rep_info: RepliconInfo) list[Cluster][source]

Clusterize hits according to the distance between them, around the key genes.

Parameters:
  • hits (list of macsylib.model.ModelHit objects) – the hits to clusterize

  • model (macsylib.model.Model object) – the model to consider

  • hit_weights (macsylib.hit.HitWeight object) – the hit weight to compute the score

Returns:

the clusters

Return type:

list of macsylib.cluster.Cluster objects.

macsylib.cluster.clusterize_hits_on_distance_only(hits: list[ModelHit], model: Model, hit_weights: HitWeight, rep_info: RepliconInfo) list[Cluster][source]

Clusterize hits according to the distance between them.

Parameters:
  • hits – the hits to clusterize

  • model – the model to consider

  • hit_weights – the hit weight to compute the score

  • rep_info – The information on the replicon

Returns:

the clusters
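Distance-based grouping can be sketched on bare positions. This is an illustration under simplifying assumptions: the replicon is treated as linear, the distance is the difference between consecutive positions, and `group_by_distance` and `max_space` are hypothetical names, not part of macsylib.

```python
def group_by_distance(positions: list[int], max_space: int) -> list[list[int]]:
    """Group sorted positions so that consecutive members are at most max_space apart."""
    groups: list[list[int]] = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] <= max_space:
            groups[-1].append(pos)   # close enough: extend the current group
        else:
            groups.append([pos])     # too far: start a new group
    return groups


groups = group_by_distance([1, 2, 3, 10, 11, 30], max_space=2)
# groups == [[1, 2, 3], [10, 11], [30]]
```

A single-member group such as `[30]` would only be kept as a cluster if its gene is a loner, as described for build_clusters.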

macsylib.cluster.is_a(hit: ModelHit | CoreHit, ref_hits: set[str]) bool[source]
Parameters:
  • hit – The hit to check

  • ref_hits – the gene names of the reference hits

Returns:

True if the hit belongs to the reference hits, False otherwise

macsylib.cluster.scaffold_to_cluster(cluster_scaffold: list[ModelHit], model: Model, hit_weights: HitWeight) Cluster[source]

Transform a list of ModelHit into a cluster, if the hits colocalize, are not all neutral, and do not all code for the same gene, so that the new cluster can be added to the clusters.

Parameters:
  • cluster_scaffold – the model hits to transform into a cluster

  • model – The model related to this cluster

  • hit_weights – the hit weight to compute scores

Returns:

Cluster

macsylib.cluster.split_cluster_on_key_genes(key_genes: set[str], cluster: Cluster) list[Cluster][source]

Split a cluster containing several key genes into one cluster per key gene, each with its closest hits.

For instance, if a set of genes clusterizes as follows (assuming each gene is 10 positions away from the next one):

positions  10   20    30   40   50    60     70
genes       A   KG1    B    C    D    KG2     E

The resulting clusters after splitting around the 2 KGs (key genes):

c1 = [A, KG1, B, C], c2 = [D, KG2, E]

The question arises for gene C, which is equidistant from KG1 and KG2: C is clustered with the leftmost cluster.
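The example above can be reproduced with a short sketch. Hits are represented as hypothetical (gene name, position) tuples rather than macsylib objects, each key gene is assumed to appear once, and at least one key gene is assumed to be present; every hit is assigned to its nearest key gene, with ties going to the key gene with the lowest position.

```python
def split_on_key_genes(key_genes: set[str], hits: list[tuple[str, int]]) -> list[list[str]]:
    """Split (gene name, position) hits into one group per key gene."""
    # key gene hits, ordered by position (assumes at least one is present)
    keys = sorted((h for h in hits if h[0] in key_genes), key=lambda h: h[1])
    clusters: dict[str, list[str]] = {name: [] for name, _ in keys}
    for name, pos in sorted(hits, key=lambda h: h[1]):
        # nearest key gene; on a tie, the key gene with the lowest position wins
        nearest = min(keys, key=lambda k: (abs(k[1] - pos), k[1]))
        clusters[nearest[0]].append(name)
    return list(clusters.values())


hits = [("A", 10), ("KG1", 20), ("B", 30), ("C", 40), ("D", 50), ("KG2", 60), ("E", 70)]
c1, c2 = split_on_key_genes({"KG1", "KG2"}, hits)
# c1 == ['A', 'KG1', 'B', 'C'] and c2 == ['D', 'KG2', 'E']
```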

Parameters:
  • key_genes – the names of the genes used as seeds for the clusters

  • cluster – The cluster to split

Returns: