cis_gs.enrichment.idmap

Gene-ID conversion utilities. Batched MyGene.info, live NCBI Taxonomy search, and the prefix / version stripper shared across the package.

Auto-detecting gene-identifier converter.

The high-level flow is: guess the input ID type, translate to a canonical key, and return a frame with one row per input, holding user_input, ensembl_gene_id, entrez_id, symbol and species. Cis-GS replaces the ~17 GB SQLite mapping table the naive approach would need with three lightweight back-ends:

An offline regex pre-classifier (cheap, no network) that recognises the common plant + animal ID syntaxes Cis-GS encounters in practice.
MyGene.info /query and /querymany REST endpoints for any ID type the regex doesn’t catch (vertebrates + plants).
A small handcrafted Arabidopsis-locus table (TAIR uses AT[1-5MC]G\d{5}) because TAIR is the most common Cis-GS use-case and MyGene.info’s Arabidopsis coverage is uneven.

The order matters: fast regex first, network only on misses, with an optional species hint that accelerates the lookup when supplied.
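The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: resolve_ids and remote_lookup are hypothetical names, the network step is injected as a plain callable, and treating the upper-cased TAIR locus as the canonical key is an assumption.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional

# TAIR locus syntax quoted in the text above; everything else is illustrative.
TAIR_LOCUS = re.compile(r"^AT[1-5MC]G\d{5}$", re.IGNORECASE)


def resolve_ids(
    gene_ids: Iterable[str],
    remote_lookup: Callable[[List[str]], Dict[str, str]],
) -> Dict[str, Optional[str]]:
    """Resolve IDs offline where possible; send only the misses to the network."""
    resolved: Dict[str, Optional[str]] = {}
    misses: List[str] = []
    for gene_id in gene_ids:
        if TAIR_LOCUS.match(gene_id):
            # Offline hit. Assumption: the upper-cased locus is the canonical key.
            resolved[gene_id] = gene_id.upper()
        else:
            misses.append(gene_id)
    if misses:
        # One batched call covers everything the regex did not catch.
        remote = remote_lookup(misses)
        for gene_id in misses:
            resolved[gene_id] = remote.get(gene_id)
    return resolved
```

Passing the remote step in as a callable keeps the sketch testable without network access; a real back-end would issue the batched MyGene.info request at that point.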

cis_gs.enrichment.idmap.detect_id_type(gene_id)[source]

Return the first ID-type label whose regex matches gene_id.

Detection is offline (regex only, no network) and runs on the prefix-stripped form, so ‘gene-LOC123’ classifies the same as ‘LOC123’.

Parameters: gene_id (str)
Return type: str
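A first-match regex classifier of this kind might look like the sketch below. Only the TAIR pattern and the ‘gene-’ prefix come from the text above; the other patterns, the ‘rna-’ prefix, the version handling, the label strings, and the "symbol" fallback are assumptions made for illustration and may differ from what detect_id_type actually uses.

```python
import re

# "gene-" is mentioned in the text; "rna-" is a hypothetical extra prefix.
_PREFIX = re.compile(r"^(?:gene-|rna-)", re.IGNORECASE)
# Trailing version suffix such as ".2" (assumed handling).
_VERSION = re.compile(r"\.\d+$")

# Ordered (label, pattern) pairs: the first match wins. Labels are hypothetical.
_PATTERNS = [
    ("tair_locus", re.compile(r"^AT[1-5MC]G\d{5}$", re.IGNORECASE)),
    ("ensembl_gene", re.compile(r"^ENS[A-Z]*G\d{11}$")),
    ("ncbi_loc", re.compile(r"^LOC\d+$")),
    ("entrez", re.compile(r"^\d+$")),
]


def strip_id(gene_id: str) -> str:
    """Drop a known prefix and a trailing version suffix."""
    core = _PREFIX.sub("", gene_id.strip())
    return _VERSION.sub("", core)


def detect_id_type_sketch(gene_id: str) -> str:
    """Return the first label whose pattern matches the stripped ID."""
    core = strip_id(gene_id)
    for label, pattern in _PATTERNS:
        if pattern.match(core):
            return label
    # Assumed fallback: anything unrecognised is treated as a gene symbol.
    return "symbol"
```

Because the list is ordered, more specific patterns should come before more permissive ones; here the purely numeric pattern sits last for that reason.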

cis_gs.enrichment.idmap.consensus_id_type(gene_ids)[source]

Take a vote across a gene list.

Useful for picking a single scopes= value to send to MyGene.info /querymany.

Parameters: gene_ids (Iterable[str])
Return type: str
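The vote can be pictured as a simple majority count over per-ID labels. The sketch below assumes a classify callable (any function mapping an ID to a label) and a "symbol" default for empty input; both are hypothetical and not taken from the package.

```python
from collections import Counter
from typing import Callable, Iterable


def consensus_id_type_sketch(
    gene_ids: Iterable[str],
    classify: Callable[[str], str],
    default: str = "symbol",
) -> str:
    """Return the most frequent label among the classified IDs."""
    counts = Counter(classify(gene_id) for gene_id in gene_ids)
    if not counts:
        # Assumed behaviour for an empty list.
        return default
    # most_common(1) returns a one-element list of (label, count).
    return counts.most_common(1)[0][0]
```

A single stray symbol in a list of locus IDs therefore does not change the label chosen for the batch.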

class cis_gs.enrichment.idmap.IDMapping(user_input, ensembl_gene_id, entrez_id, symbol, species)[source]

Bases: object

A single canonical row of the conversion result.

Parameters:

user_input (str)
ensembl_gene_id (str | None)
entrez_id (str | None)
symbol (str | None)
species (str | None)

user_input: str

ensembl_gene_id: str | None

entrez_id: str | None

symbol: str | None

species: str | None
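A record with these five fields could be modelled as a frozen dataclass, as sketched below. Whether IDMapping is actually implemented as a dataclass is not stated here, so IDMappingSketch is a hypothetical stand-in that only mirrors the documented field names and types.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class IDMappingSketch:
    """Hypothetical stand-in mirroring the documented IDMapping fields."""

    user_input: str
    ensembl_gene_id: Optional[str]
    entrez_id: Optional[str]
    symbol: Optional[str]
    species: Optional[str]


# An input that could not be resolved keeps user_input and leaves the rest as None.
unresolved = IDMappingSketch("not_a_gene", None, None, None, None)

# asdict() yields the keys in field order, which lines up with the frame columns.
columns = list(asdict(unresolved))
```

Keeping every field except user_input optional lets unresolved inputs stay in the result instead of being dropped silently.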

class cis_gs.enrichment.idmap.IDConverter(species=None, timeout=12.0, cache=None)[source]

Bases: object

Auto-detecting ID converter.

Usage

>>> conv = IDConverter(species="arabidopsis_thaliana")
>>> df = conv.convert(["AT1G01010", "AT2G18790", "PHYB"])
>>> df.columns
Index(['user_input', 'ensembl_gene_id', 'entrez_id', 'symbol', 'species'])

species (str | int | None): Either a taxonomy ID (integer or numeric string), a MyGene.info species shortcut (“human”, “mouse”), or the binomial with an underscore (“arabidopsis_thaliana”). None lets MyGene.info auto-detect.
timeout (float): Per-request HTTP timeout in seconds.
cache (dict | None): Optional in-memory cache to amortise repeated lookups across calls.

convert(gene_ids, progress_callback=None)[source]

Translate a gene list to a canonical DataFrame.

progress_callback(done: int, total: int, label: str) is called repeatedly during the run so that a GUI can display a live progress bar.

Parameters: gene_ids (Iterable[str])
Return type: DataFrame
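The callback contract can be illustrated without the converter itself. In the sketch below, run_in_batches is a hypothetical stand-in for the batched work inside convert; the batch size and the label text are assumptions, and only the (done, total, label) signature comes from the description above.

```python
from typing import Callable, Iterable, List, Optional

ProgressCallback = Callable[[int, int, str], None]


def run_in_batches(
    gene_ids: Iterable[str],
    batch_size: int = 2,
    progress_callback: Optional[ProgressCallback] = None,
) -> List[List[str]]:
    """Split the IDs into batches and report progress after each one."""
    ids = list(gene_ids)
    total = len(ids)
    batches: List[List[str]] = []
    for start in range(0, total, batch_size):
        batch = ids[start:start + batch_size]
        batches.append(batch)  # a real converter would resolve the batch here
        if progress_callback is not None:
            done = start + len(batch)
            progress_callback(done, total, f"batch {len(batches)}")
    return batches


def print_progress(done: int, total: int, label: str) -> None:
    """Console callback; a GUI would update a progress bar instead."""
    percent = 100 * done // total if total else 100
    print(f"[{percent:3d}%] {done}/{total} {label}")
```

A callback with the same signature as print_progress could then be passed as progress_callback to convert.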

Parameters:

species (str | int | None)
timeout (float)
cache (dict[tuple[str, str], IDMapping] | None)

cis_gs.enrichment.idmap.search_ncbi_taxonomy(query, max_results=25, timeout=15.0)[source]

Search NCBI’s Taxonomy database for any organism name and return [{taxid, scientific_name, common_name}, …].

Used to power a real-time auto-complete in the ID-Convert / GO panels: the user types ‘oryza’ or ‘rice’, we hit NCBI Taxonomy esearch+esummary and surface every match with its taxon ID.

Parameters:

query (str)
max_results (int)
timeout (float)

Return type:

list[dict]
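The esearch + esummary round trip might be put together along the lines below. This is a sketch under assumptions: the helper names are hypothetical, the JSON field names (idlist, uids, scientificname, commonname) reflect the E-utilities JSON output as commonly returned for the taxonomy database, and returning taxid as an integer is a guess about the real function's output. The HTTP calls themselves are left out so the parsing can be shown offline.

```python
from typing import Dict, List
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(query: str, max_results: int = 25) -> str:
    """Build the esearch URL that turns a free-text query into taxon IDs."""
    params = {"db": "taxonomy", "term": query, "retmax": max_results, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def esummary_url(taxids: List[str]) -> str:
    """Build the esummary URL that fetches names for the given taxon IDs."""
    params = {"db": "taxonomy", "id": ",".join(taxids), "retmode": "json"}
    return f"{EUTILS}/esummary.fcgi?{urlencode(params)}"


def parse_esearch(payload: dict) -> List[str]:
    """Extract the list of taxon IDs from an esearch JSON response."""
    return list(payload.get("esearchresult", {}).get("idlist", []))


def parse_esummary(payload: dict) -> List[Dict[str, object]]:
    """Flatten an esummary JSON response into one dict per taxon."""
    result = payload.get("result", {})
    rows = []
    for uid in result.get("uids", []):
        entry = result.get(uid, {})
        rows.append({
            "taxid": int(uid),  # assumption: integer taxon IDs
            "scientific_name": entry.get("scientificname", ""),
            "common_name": entry.get("commonname", ""),
        })
    return rows
```

In a live setting each URL would be fetched with the configured timeout and the decoded JSON passed to the matching parser.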