cis_gs.enrichment.idmap

Gene-ID conversion utilities. Batched MyGene.info, live NCBI Taxonomy search, and the prefix / version stripper shared across the package.

Auto-detecting gene-identifier converter.

The high-level flow is: guess the input ID type, translate it to a canonical key, and return a frame with user_input, ensembl_gene_id, entrez_id, symbol, and species columns. Cis-GS replaces the ~17 GB SQLite mapping table that a naive approach would need with three lightweight back-ends:

  1. An offline regex pre-classifier (cheap, no network) that recognises the common plant + animal ID syntaxes Cis-GS encounters in practice.

  2. MyGene.info /query and /querymany REST endpoints for any ID type the regex doesn’t catch (vertebrates + plants).

  3. A small handcrafted Arabidopsis-locus table (TAIR uses AT[1-5MC]G\d{5}) because TAIR is the most common Cis-GS use-case and MyGene.info’s Arabidopsis coverage is uneven.

The order matters: fast regex first, network only on misses, with an optional species hint that accelerates the lookup when supplied.
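
The ordering described above can be sketched as follows. This is a minimal illustration, not the package's implementation: `resolve_ids` and `remote_lookup` are hypothetical names, only the TAIR pattern comes from the text, and treating the upper-cased locus as the canonical key is an assumption made for the example.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional

# TAIR locus syntax quoted in the module description.
TAIR_LOCUS = re.compile(r"^AT[1-5MC]G\d{5}$", re.IGNORECASE)

# Hypothetical signature for the network back-end: takes the IDs the
# offline step could not resolve and returns {input_id: canonical_id}.
RemoteLookup = Callable[[List[str]], Dict[str, Optional[str]]]


def resolve_ids(gene_ids: Iterable[str], remote_lookup: RemoteLookup) -> Dict[str, Optional[str]]:
    """Resolve IDs offline where possible, then batch the misses."""
    resolved: Dict[str, Optional[str]] = {}
    misses: List[str] = []
    for gid in gene_ids:
        if TAIR_LOCUS.match(gid):
            # Assumption: the locus itself serves as the canonical key.
            resolved[gid] = gid.upper()
        else:
            misses.append(gid)
    if misses:
        # One batched call for everything the regex did not recognise.
        resolved.update(remote_lookup(misses))
    return resolved
```

When every input is recognised offline, `remote_lookup` is never called, which is the point of putting the regex step first.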

cis_gs.enrichment.idmap.detect_id_type(gene_id)[source]

Return the first ID-type label whose regex matches gene_id.

Detection is offline and regex-based. It runs on the prefix-stripped form, so ‘gene-LOC123’ classifies the same as ‘LOC123’.

Parameters:

gene_id (str)

Return type:

str
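
A minimal sketch of a first-match regex classifier of this kind is shown below. The label names, the prefix list, the fallback label, and every pattern except the TAIR one are assumptions chosen for illustration; the real function's table may differ.

```python
import re

# Assumed prefixes and version suffix; the package's own stripper may differ.
_PREFIX = re.compile(r"^(gene-|gene:)", re.IGNORECASE)
_VERSION = re.compile(r"\.\d+$")

# Ordered (label, pattern) pairs: the first match wins.
_PATTERNS = [
    ("tair_locus", re.compile(r"^AT[1-5MC]G\d{5}$", re.IGNORECASE)),
    ("ensembl_gene", re.compile(r"^ENS[A-Z]*G\d{11}$")),
    ("ncbi_loc", re.compile(r"^LOC\d+$")),
    ("entrez", re.compile(r"^\d+$")),
]


def strip_id(gene_id: str) -> str:
    """Drop a leading prefix and a trailing .N version, if present."""
    core = _PREFIX.sub("", gene_id.strip())
    return _VERSION.sub("", core)


def detect_id_type_sketch(gene_id: str) -> str:
    """Return the first label whose pattern matches the stripped ID."""
    core = strip_id(gene_id)
    for label, pattern in _PATTERNS:
        if pattern.match(core):
            return label
    return "symbol"  # assumed fallback label for unrecognised input
```

Because classification happens after stripping, `detect_id_type_sketch("gene-LOC123")` and `detect_id_type_sketch("LOC123")` return the same label.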

cis_gs.enrichment.idmap.consensus_id_type(gene_ids)[source]

Take a vote across a gene list to settle on one ID type. Useful for picking a single scopes= value to send to MyGene.info /querymany.

Parameters:

gene_ids (Iterable[str])

Return type:

str
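
One way to implement such a vote is a simple plurality count, sketched below. The function name, the `classify` parameter, and the `default` label are hypothetical; how the real function breaks ties or handles an empty list is not stated in this reference.

```python
from collections import Counter
from typing import Callable, Iterable


def consensus_id_type_sketch(
    gene_ids: Iterable[str],
    classify: Callable[[str], str],
    default: str = "symbol",
) -> str:
    """Return the most frequent label that classify() assigns to gene_ids."""
    counts = Counter(classify(gid) for gid in gene_ids)
    if not counts:
        return default  # assumed behaviour for an empty list
    # most_common(1) yields the single (label, count) pair with the top count.
    return counts.most_common(1)[0][0]
```

Passing the per-ID classifier in as an argument keeps the sketch self-contained; in the package, detect_id_type would presumably play that role.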

class cis_gs.enrichment.idmap.IDMapping(user_input, ensembl_gene_id, entrez_id, symbol, species)[source]

Bases: object

A single canonical row of the conversion result.

Parameters:
  • user_input (str)

  • ensembl_gene_id (str | None)

  • entrez_id (str | None)

  • symbol (str | None)

  • species (str | None)

user_input: str
ensembl_gene_id: str | None
entrez_id: str | None
symbol: str | None
species: str | None
class cis_gs.enrichment.idmap.IDConverter(species=None, timeout=12.0, cache=None)[source]

Bases: object

Auto-detecting ID converter.

Usage

>>> conv = IDConverter(species="arabidopsis_thaliana")
>>> df = conv.convert(["AT1G01010", "AT2G18790", "PHYB"])
>>> df.columns
Index(['user_input', 'ensembl_gene_id', 'entrez_id', 'symbol', 'species'])

Parameters:
  • species (str | int | None): A taxonomy ID (integer or numeric string), a MyGene.info species shortcut (“human”, “mouse”), or the binomial name with an underscore (“arabidopsis_thaliana”). None lets MyGene.info auto-detect.

  • timeout (float): Per-request HTTP timeout in seconds.

  • cache (dict | None): Optional in-memory cache to amortise repeated lookups across calls.

convert(gene_ids, progress_callback=None)[source]

Translate a gene list to a canonical DataFrame.

progress_callback(done: int, total: int, label: str) is invoked repeatedly during the run so a GUI can show a real-time progress bar.

Parameters:

gene_ids (Iterable[str])

Return type:

DataFrame
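
The callback contract can be illustrated with a stand-in loop. Only the (done, total, label) signature comes from this reference; `run_in_batches`, the batch size, and the label text are assumptions, and the sketch says nothing about how convert actually batches its work.

```python
from typing import Callable, Iterable, List, Optional, Tuple

ProgressCallback = Callable[[int, int, str], None]


def run_in_batches(
    items: Iterable[str],
    batch_size: int,
    progress_callback: Optional[ProgressCallback] = None,
) -> int:
    """Stand-in for a batched run that reports progress after each batch."""
    pending = list(items)
    total = len(pending)
    done = 0
    for start in range(0, total, batch_size):
        batch = pending[start:start + batch_size]
        # A real converter would perform the lookup for `batch` here.
        done += len(batch)
        if progress_callback is not None:
            progress_callback(done, total, f"batch {start // batch_size + 1}")
    return done


# A callback that records every event; a GUI would update its bar instead.
events: List[Tuple[int, int, str]] = []
run_in_batches(["a", "b", "c", "d", "e"], 2, lambda d, t, label: events.append((d, t, label)))
```

A callback with the same three-argument shape could be passed to convert as progress_callback.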

cis_gs.enrichment.idmap.search_ncbi_taxonomy(query, max_results=25, timeout=15.0)[source]

Search NCBI’s Taxonomy database for any organism name and return [{taxid, scientific_name, common_name}, …].

Used to power a real-time auto-complete in the ID-Convert / GO panels: the user types ‘oryza’ or ‘rice’, we hit NCBI Taxonomy esearch+esummary and surface every match with its taxon ID.

Return type:

list[dict]
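
The esearch+esummary round trip might look roughly like the sketch below, which builds the two request URLs and reshapes an esummary JSON payload without touching the network. The helper names are hypothetical, and the JSON field names (`uids`, `scientificname`, `commonname`) are an assumption about the E-utilities response for the taxonomy database that should be checked against a live reply.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(query: str, max_results: int = 25) -> str:
    """URL that asks esearch for taxonomy IDs matching a free-text query."""
    params = {"db": "taxonomy", "term": query, "retmax": max_results, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def esummary_url(taxids: Iterable[str]) -> str:
    """URL that asks esummary for name records of the given taxonomy IDs."""
    params = {"db": "taxonomy", "id": ",".join(taxids), "retmode": "json"}
    return f"{EUTILS}/esummary.fcgi?{urlencode(params)}"


def parse_esummary(payload: Dict) -> List[Dict[str, str]]:
    """Reshape an esummary JSON payload into taxid/name dictionaries."""
    result = payload.get("result", {})
    rows = []
    for uid in result.get("uids", []):
        record = result.get(uid, {})
        rows.append({
            "taxid": uid,  # kept as the string key used in the payload
            "scientific_name": record.get("scientificname", ""),
            "common_name": record.get("commonname", ""),
        })
    return rows
```

Keeping URL construction and parsing separate from the HTTP call makes both halves easy to test offline; the actual request, timeout handling, and rate limiting are left out of the sketch.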