cis_gs.enrichment.idmap
Gene-ID conversion utilities: batched MyGene.info lookups, live NCBI Taxonomy search, and the prefix/version stripper shared across the package.
Auto-detecting gene-identifier converter.
The high-level flow is: guess the input ID type, translate to a canonical key, return a (user_input, ensembl_gene_id, species) frame. Cis-GS replaces the ~17 GB SQLite mapping table the naive approach would need with three lightweight back-ends:
- An offline regex pre-classifier (cheap, no network) that recognises the common plant and animal ID syntaxes Cis-GS encounters in practice.
- The MyGene.info /query and /querymany REST endpoints for any ID type the regex doesn’t catch (vertebrates and plants).
- A small handcrafted Arabidopsis-locus table (TAIR uses AT[1-5MC]G\d{5}), because TAIR is the most common Cis-GS use-case and MyGene.info’s Arabidopsis coverage is uneven.
The order matters: fast regex first, network only on misses, with an optional species hint that accelerates the lookup when supplied.
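The ordering described above can be sketched as a small dispatcher. This is a minimal illustration and not the package's actual code: `resolve_ids`, `local_table`, and `remote_lookup` are hypothetical names, and only the TAIR pattern is taken from the text.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional

# TAIR locus pattern quoted from the text; every other name here is hypothetical.
TAIR_LOCUS = re.compile(r"AT[1-5MC]G\d{5}", re.IGNORECASE)


def resolve_ids(
    gene_ids: Iterable[str],
    local_table: Dict[str, str],
    remote_lookup: Callable[[List[str]], Dict[str, str]],
) -> Dict[str, Optional[str]]:
    """Resolve IDs offline where possible; batch the misses into one remote call."""
    resolved: Dict[str, Optional[str]] = {}
    misses: List[str] = []
    for gene_id in gene_ids:
        key = gene_id.strip().upper()
        if TAIR_LOCUS.fullmatch(key) and key in local_table:
            resolved[gene_id] = local_table[key]  # offline hit, no network
        else:
            misses.append(gene_id)
    if misses:  # the network is touched only when something was missed
        remote = remote_lookup(misses)
        for gene_id in misses:
            resolved[gene_id] = remote.get(gene_id)  # None if still unresolved
    return resolved
```

Passing the remote lookup in as a callable keeps the offline path testable without network access; in a real converter that callable would presumably wrap the MyGene.info batch query.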
- cis_gs.enrichment.idmap.detect_id_type(gene_id)[source]
Return the first ID-type label whose regex matches gene_id.
Detection is offline and runs on the prefix-stripped form, so ‘gene-LOC123’ classifies the same as ‘LOC123’.
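A first-match regex classifier of this kind might look like the sketch below. The prefixes, labels, and all patterns except the TAIR one are assumptions made for illustration; the real function's pattern table and return values may differ.

```python
import re
from typing import Optional

# Hypothetical prefix and version rules; the package's own stripper may differ.
_PREFIX = re.compile(r"^(?:gene|transcript)[-:]", re.IGNORECASE)
_VERSION = re.compile(r"\.\d+$")

# Ordered (label, pattern) pairs: the first full match wins.
# Labels are made up for this sketch; only the TAIR pattern comes from the text.
_PATTERNS = [
    ("tair_locus", re.compile(r"AT[1-5MC]G\d{5}", re.IGNORECASE)),
    ("ensembl_gene", re.compile(r"ENS[A-Z]*G\d{11}")),
    ("ncbi_loc", re.compile(r"LOC\d+")),
    ("entrez_id", re.compile(r"\d+")),
]


def strip_id(gene_id: str) -> str:
    """Drop a leading 'gene-' style prefix and a trailing '.N' version suffix."""
    cleaned = _PREFIX.sub("", gene_id.strip())
    return _VERSION.sub("", cleaned)


def classify(gene_id: str) -> Optional[str]:
    """Return the first label whose regex matches the stripped ID, else None."""
    cleaned = strip_id(gene_id)
    for label, pattern in _PATTERNS:
        if pattern.fullmatch(cleaned):
            return label
    return None
```

Because matching is first-wins, the more specific patterns are listed before the bare-integer one, which would otherwise be the only pattern loose enough to shadow nothing but still worth keeping last.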
- cis_gs.enrichment.idmap.consensus_id_type(gene_ids)[source]
Take a vote across a gene list and return the most common ID type. Useful for picking a single scopes= value to send to MyGene.info /querymany.
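The vote itself can be sketched with `collections.Counter`. The function name, the toy classifier, and the label-to-scope mapping below are assumptions for illustration; `entrezgene`, `ensembl.gene`, and `symbol` are MyGene.info field names, but how Cis-GS maps its own labels onto them is not stated here.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

# Hypothetical mapping from classifier labels to MyGene.info scopes values.
SCOPES = {
    "entrez_id": "entrezgene",
    "ensembl_gene": "ensembl.gene",
    "symbol": "symbol",
}


def vote_id_type(
    gene_ids: Iterable[str],
    classify: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the most frequent non-None label, or None if nothing classified."""
    votes = Counter(
        label for label in map(classify, gene_ids) if label is not None
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def toy_classify(gene_id: str) -> Optional[str]:
    """Stand-in classifier used only to exercise the vote."""
    if gene_id.isdigit():
        return "entrez_id"
    if gene_id.startswith("ENSG"):
        return "ensembl_gene"
    return None


winner = vote_id_type(["7157", "672", "ENSG00000139618", "???"], toy_classify)
scopes = SCOPES.get(winner) if winner else None
```

Unclassifiable IDs are left out of the count, so a few stray entries do not sway the result.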
- class cis_gs.enrichment.idmap.IDMapping(user_input, ensembl_gene_id, entrez_id, symbol, species)[source]
Bases: object
Single canonical row of the conversion result.
- Parameters:
- class cis_gs.enrichment.idmap.IDConverter(species=None, timeout=12.0, cache=None)[source]
Bases: object
Auto-detecting ID converter.
Usage
>>> conv = IDConverter(species="arabidopsis_thaliana")
>>> df = conv.convert(["AT1G01010", "AT2G18790", "PHYB"])
>>> df.columns
Index(['user_input', 'ensembl_gene_id', 'entrez_id', 'symbol', 'species'])
- Parameters:
  - species (str | int | None): A taxonomy ID (integer or numeric string), a MyGene.info species shortcut (“human”, “mouse”), or the binomial with an underscore (“arabidopsis_thaliana”). None lets MyGene.info auto-detect.
  - timeout (float): Per-request HTTP timeout in seconds.
  - cache (dict | None): Optional in-memory cache to amortise repeated lookups across calls.
- cis_gs.enrichment.idmap.search_ncbi_taxonomy(query, max_results=25, timeout=15.0)[source]
Search NCBI’s Taxonomy database for any organism name and return [{taxid, scientific_name, common_name}, …].
Used to power a real-time auto-complete in the ID-Convert / GO panels: the user types ‘oryza’ or ‘rice’, we hit NCBI Taxonomy esearch+esummary and surface every match with its taxon ID.
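The esearch+esummary round trip can be sketched with the standard library alone. This is an illustration under assumptions and not the package's implementation: the helper names are hypothetical, and the JSON field names (`idlist`, `uids`, `scientificname`, `commonname`) reflect the E-utilities JSON output as commonly returned for the taxonomy database.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_url(endpoint: str, **params) -> str:
    """Build an E-utilities URL against the taxonomy database, JSON output."""
    query = urllib.parse.urlencode({"db": "taxonomy", "retmode": "json", **params})
    return f"{EUTILS}/{endpoint}.fcgi?{query}"


def parse_summary(payload: dict) -> List[Dict[str, str]]:
    """Flatten an esummary JSON payload into taxid / name records."""
    result = payload.get("result", {})
    rows = []
    for uid in result.get("uids", []):
        doc = result.get(uid, {})
        rows.append({
            "taxid": uid,
            "scientific_name": doc.get("scientificname", ""),
            "common_name": doc.get("commonname", ""),
        })
    return rows


def search_taxonomy(query: str, max_results: int = 25,
                    timeout: float = 15.0) -> List[Dict[str, str]]:
    """Two-step lookup: esearch for matching IDs, then esummary for names."""
    search_url = build_url("esearch", term=query, retmax=max_results)
    with urllib.request.urlopen(search_url, timeout=timeout) as resp:
        ids = json.load(resp)["esearchresult"]["idlist"]
    if not ids:
        return []
    summary_url = build_url("esummary", id=",".join(ids))
    with urllib.request.urlopen(summary_url, timeout=timeout) as resp:
        return parse_summary(json.load(resp))
```

Keeping URL construction and response parsing separate from the network calls means both can be checked against canned payloads. An auto-complete front end would also want to debounce keystrokes and respect NCBI's request-rate limits.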