cis_gs.enrichment
KEGG ORA + the shared hypergeometric kernel.
cis_gs.enrichment.core
Generic over-representation analysis (ORA) primitives shared by every enrichment back-end (KEGG, custom gene sets, motif targets, …).
Statistical kernel
Standard one-sided hypergeometric over-representation test with Benjamini-Hochberg FDR correction:
pval = scipy.stats.hypergeom.sf(k - 1, totalN, n, listN)
over-rep filter = keep terms where k/n > listN/totalN
fold-enrichment = (k / listN) / (n / totalN)
bh_fdr() re-uses Cis-GS’s from-scratch Benjamini-Hochberg routine
from app_v4_open.py (no statsmodels.stats.multitest dependency).
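The three kernel formulas above can be sketched without any third-party dependency. This is a minimal illustration, not the Cis-GS implementation: `hypergeom_sf` is a hypothetical stand-in that mirrors the argument order of `scipy.stats.hypergeom.sf` by summing the exact tail with `math.comb`, and the numbers in the example are made up.

```python
from math import comb


def hypergeom_sf(k_minus_1, total_n, n, list_n):
    """P(X > k_minus_1) for a hypergeometric X.

    Hypothetical stand-in for scipy.stats.hypergeom.sf, same argument order:
    total_n = universe size, n = pathway size, list_n = query-list size.
    """
    lo = max(k_minus_1 + 1, 0, list_n - (total_n - n))  # smallest feasible overlap
    hi = min(n, list_n)                                  # largest feasible overlap
    tail = sum(comb(n, x) * comb(total_n - n, list_n - x) for x in range(lo, hi + 1))
    return tail / comb(total_n, list_n)


def ora_term(k, list_n, n, total_n):
    """Return (p_value, passes_over_rep_filter, fold_enrichment) for one term."""
    p_value = hypergeom_sf(k - 1, total_n, n, list_n)
    over_represented = k / n > list_n / total_n
    fold = (k / list_n) / (n / total_n)
    return p_value, over_represented, fold


# Illustrative numbers only: 8 of 40 query genes fall in a 100-gene pathway,
# drawn from a 10,000-gene universe.
p, keep, fe = ora_term(k=8, list_n=40, n=100, total_n=10_000)
print(f"p={p:.3g} keep={keep} fold_enrichment={fe:.1f}")
```

Note that the over-representation filter k/n > listN/totalN is algebraically the same condition as fold-enrichment > 1.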
Two implementation notes specific to Cis-GS
A single vectorised pass over thousands of pathways - we treat the whole pathway table as a NumPy column operation instead of looping one category at a time.
Optional minimum-overlap and gene-symbol-cleaning guards that prevent the spurious “1-gene-overlap pathway with q~0” hits that surface on small query lists.
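To see why the minimum-overlap guard matters, the sketch below works through a hypothetical small case in which a single shared gene is enough to look significant. The sizes are invented for illustration; only the hypergeometric null and the fold-enrichment definition come from this page.

```python
from math import comb


def p_at_least_one(list_n, n, total_n):
    """P(k >= 1) under the hypergeometric null.

    Computed as 1 - P(no query gene lands in the pathway).
    """
    return 1 - comb(total_n - n, list_n) / comb(total_n, list_n)


# Hypothetical case: 5-gene query, 5-gene pathway, 20,000-gene universe.
p = p_at_least_one(list_n=5, n=5, total_n=20_000)
fold = (1 / 5) / (5 / 20_000)  # fold-enrichment for k = 1
print(f"p={p:.2g} fold_enrichment={fold:.0f}")
```

With these sizes a one-gene overlap already yields a p-value on the order of 1e-3 and a fold-enrichment in the hundreds, which is the kind of hit a min_overlap of 2 is meant to drop.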
- cis_gs.enrichment.core.bh_fdr(pvals)[source]
Benjamini-Hochberg step-up FDR.
This is the same from-scratch implementation Cis-GS already ships in app_v4_open.py for motif-hit q-values; reused verbatim so we don’t add a statsmodels dependency just for one call.
order = argsort(p)
ranks = invert(order) + 1
adj = min(1, p · n / rank)
adj[i] = min(adj[i], adj[i+1]) (monotone step-up sweep)
- Parameters:
pvals (array-like of float) – Raw p-values (any length, any order).
- Returns:
BH-adjusted q-values, same order as input.
- Return type:
np.ndarray
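The step-up procedure described above can be sketched in plain Python. This is an illustrative list-based version, not the routine from app_v4_open.py: `bh_fdr_sketch` is a hypothetical name, and it returns a list rather than an np.ndarray to stay dependency-free.

```python
def bh_fdr_sketch(pvals):
    """Benjamini-Hochberg q-values, returned in the same order as the input."""
    p = [float(x) for x in pvals]
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])  # indices from smallest to largest p
    adj = [0.0] * m
    running_min = 1.0  # also caps every q-value at 1
    # Walk from the largest p-value down so each q-value is bounded by
    # the one ranked just above it (the monotone sweep).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p[i] * m / rank)
        adj[i] = running_min
    return adj


q = bh_fdr_sketch([0.04, 0.01, 0.03, 0.5])
print([round(x, 4) for x in q])
```

In this example the raw adjustment for 0.03 (rank 2) would be 0.06, but the sweep lowers it to the value of its rank-3 neighbour, so the q-values stay monotone in the p-values.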
- cis_gs.enrichment.core.fold_enrichment(k, list_n, n, total_n)[source]
Fold-enrichment definition:
FE = (k / listN) / (n / totalN)
- where
k = #genes shared by query list and pathway
listN = size of query list
n = #genes in pathway (within the background)
totalN = size of the background universe
- class cis_gs.enrichment.core.EnrichmentResult(table, n_query, n_universe, method='hypergeometric', notes=<factory>)[source]
Bases:
object
Tidy container returned by hypergeometric_enrichment().
- cis_gs.enrichment.core.hypergeometric_enrichment(query_genes, gene_sets, universe=None, *, set_descriptions=None, min_overlap=2, min_set_size=5, max_set_size=2000, one_sided=True)[source]
Vectorised over-representation analysis.
- Parameters:
query_genes (iterable of str) – The user’s gene list (e.g. K-means cluster from Cis-GS Module 6, or the gene list a user pastes into the GO/KEGG dialog).
gene_sets (mapping {term_id: iterable of gene IDs}) – The annotation database - GO terms, KEGG pathways, custom MSigDB-style sets, etc. Term IDs are used as the table primary key.
universe (iterable of str, optional) – Background gene set (totalN). If None (the default), taken as the union of every gene appearing in any value of gene_sets. The recommended background is the protein-coding gene complement; pass that explicitly when you have it.
set_descriptions (mapping {term_id: human-readable name}, optional) – For the Description column. KEGG’s pathway_name table or QuickGO’s name field both fit naturally here.
min_overlap (int) – Drop terms with fewer than this many query-genes in them. Defaults to 2 (raised from 1 to silence single-gene noise on small K-means clusters).
min_set_size (int) – Drop pathways/categories with fewer genes than this. Defaults to 5. Filters out near-empty terms that produce uninterpretable p-values.
max_set_size (int) – Drop pathways/categories with more genes than this. Defaults to 2000. Filters out near-universal terms that produce uninterpretable p-values.
one_sided (bool) – If True, apply a one-sided over-representation guard (k/n > listN/totalN). If False, every term gets a p-value.
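The filtering parameters above can be pictured as a pre-pass that runs before any p-value is computed. The sketch below is an assumption about how such a pre-pass might look, not the library's code: `prefilter_terms` is a hypothetical helper, and restricting the query list to the universe is an illustrative choice that this page does not confirm.

```python
def prefilter_terms(query_genes, gene_sets, universe=None,
                    min_overlap=2, min_set_size=5, max_set_size=2000):
    """Return {term: (k, list_n, n, total_n)} for terms that survive the guards."""
    if universe is None:
        # Documented default: union of every gene in any gene set.
        universe = set().union(*gene_sets.values())
    else:
        universe = set(universe)
    query = set(query_genes) & universe  # assumption: off-universe genes are ignored
    kept = {}
    for term, members in gene_sets.items():
        members = set(members) & universe  # n is counted within the background
        n = len(members)
        if not (min_set_size <= n <= max_set_size):
            continue  # outside the size window
        k = len(query & members)
        if k < min_overlap:
            continue  # too few query genes in this term
        kept[term] = (k, len(query), n, len(universe))
    return kept


# Toy gene sets with made-up identifiers.
toy_sets = {
    "T1": ["g1", "g2", "g3", "g4", "g5", "g6"],
    "T2": ["g1", "g7"],                       # dropped: below min_set_size
    "T3": ["g1", "g8", "g9", "g10", "g11"],   # dropped: overlap of 1
}
survivors = prefilter_terms(["g1", "g2", "g99"], toy_sets)
print(survivors)
```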
- Returns:
An EnrichmentResult whose .table is a pandas.DataFrame with columns:
term, description, k, list_n, n, total_n, fold_enrichment, p_value, q_value, genes
- Return type:
EnrichmentResult
cis_gs.enrichment.kegg
KEGG pathway over-representation analysis.
Internal data model
Per-species KEGG pathway tables:
pathway - gene to pathway-ID mapping
pathwayInfo - pathway-ID to (name, gene_count, URL)
categories - pathway-ID to high-level category
We keep the same three logical tables that a bundled SQLite dump would expose, just materialised on-the-fly from KEGG REST instead of shipped as a 5 GB blob.
KEGG identifies every species by a three- or four-letter organism code
(ath = Arabidopsis, hsa = Human, mmu = Mouse, …) and we
keep all enrichment scoped to one organism at a time.
What’s original here
Direct REST queries to https://rest.kegg.jp/ - three endpoints:
/list/pathway/<org> - pathway IDs + descriptions
/link/<org>/pathway - gene to pathway membership
/find/genes/<symbol> - resolve gene symbols to KEGG IDs
Together those provide everything a per-species SQLite dump would, streamed over HTTP in under 2 s per organism.
Two-tier disk + memory cache so the second enrich-kegg call on the same organism is instantaneous.
Auto fallback: if the user’s gene IDs are gene symbols / Ensembl / Entrez (KEGG itself uses NCBI Gene ID for animals and locus tags for plants), we route through KEGG’s conv endpoint to translate.
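As a rough sketch of how the /link/<org>/pathway response could be turned into the gene_sets mapping that hypergeometric_enrichment() expects, the code below parses tab-separated rows. It assumes each row is a pathway ID followed by an organism-prefixed gene ID; `fetch_text` and `parse_link_table` are hypothetical helpers, and the sample rows are made up rather than real KEGG memberships.

```python
from collections import defaultdict
from urllib.request import urlopen

KEGG_REST = "https://rest.kegg.jp"


def fetch_text(path, timeout=30.0):
    """Fetch one KEGG REST resource as text (not called in the example below)."""
    with urlopen(f"{KEGG_REST}{path}", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_link_table(text):
    """Build {pathway_id: set of gene IDs} from tab-separated link rows.

    Assumed row format: "path:ath00010<TAB>ath:AT1G01010".
    """
    gene_sets = defaultdict(set)
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        pathway_id, gene_id = fields[0], fields[1]
        # Drop the organism prefix ("ath:") so IDs match what users pass in.
        gene_sets[pathway_id].add(gene_id.split(":", 1)[-1])
    return dict(gene_sets)


# Made-up rows in the assumed format; not real pathway memberships.
sample = (
    "path:ath00010\tath:AT1G01010\n"
    "path:ath00010\tath:AT2G18790\n"
    "path:ath00020\tath:AT1G01010\n"
)
gene_sets = parse_link_table(sample)
print(gene_sets)
```

In real use the text would come from something like fetch_text("/link/ath/pathway"), ideally behind the cache described above.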
- class cis_gs.enrichment.kegg.KEGGClient(cache_dir=PosixPath('/home/runner/.cis-gs/kegg'), timeout=30.0, retries=3)[source]
Bases:
object
Thin retrying wrapper around the four KEGG REST endpoints we need.
- Parameters:
cache_dir (str | os.PathLike)
timeout (float)
retries (int)
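The combination of cache_dir and retries suggests a pattern along the following lines. This is a generic sketch of a two-tier (memory, then disk) cache with retrying fetches, not KEGGClient's actual code: `CachedFetcher`, its JSON file layout, and the exponential backoff are all assumptions, and cache keys are assumed to be filename-safe.

```python
import json
import tempfile
import time
from pathlib import Path


class CachedFetcher:
    """Memory-then-disk cache around a fetch callable that may fail transiently."""

    def __init__(self, fetch, cache_dir, retries=3, backoff=0.0):
        self._fetch = fetch
        self._dir = Path(cache_dir)
        self._dir.mkdir(parents=True, exist_ok=True)
        self._retries = max(1, retries)
        self._backoff = backoff
        self._memory = {}

    def get(self, key):
        if key in self._memory:            # tier 1: in-process dict
            return self._memory[key]
        path = self._dir / f"{key}.json"
        if path.exists():                  # tier 2: file on disk
            value = json.loads(path.read_text(encoding="utf-8"))
        else:
            value = self._fetch_with_retries(key)
            path.write_text(json.dumps(value), encoding="utf-8")
        self._memory[key] = value
        return value

    def _fetch_with_retries(self, key):
        last_error = None
        for attempt in range(self._retries):
            try:
                return self._fetch(key)
            except OSError as err:         # urllib's URLError is an OSError subclass
                last_error = err
                time.sleep(self._backoff * (2 ** attempt))
        raise last_error


calls = []


def flaky_fetch(key):
    """Simulated fetch that fails on the first call only."""
    calls.append(key)
    if len(calls) == 1:
        raise OSError("simulated network failure")
    return {"organism": key}


with tempfile.TemporaryDirectory() as tmp:
    first = CachedFetcher(flaky_fetch, tmp).get("ath")   # fails once, retries, writes to disk
    second = CachedFetcher(flaky_fetch, tmp).get("ath")  # fresh instance, served from disk
print(first, second, len(calls))
```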
- list_pathways(organism)[source]
Return a mapping of pathway ID to description, e.g. {path:ath00010 → ‘Glycolysis / Gluconeogenesis - Arabidopsis thaliana’}.
- class cis_gs.enrichment.kegg.KEGGEnricher(organism, client=None, background=None)[source]
Bases:
object
Drop-in KEGG enrichment.
>>> e = KEGGEnricher(organism="ath") # Arabidopsis
>>> result = e.enrich(["AT1G01010", "AT2G18790", ...])
>>> result.table.head()
For animals the input is typically NCBI Gene IDs:
>>> e = KEGGEnricher(organism="hsa") # Human
>>> result = e.enrich(["7157", "672"]) # TP53, BRCA1
- Parameters:
organism (str)
client (KEGGClient | None)
- client: KEGGClient | None = None
cis_gs.enrichment.plots
Matplotlib renditions of the two canonical enrichment views (dot-plot + bar-plot).
Provenance
We render the same two canonical plots (dot-plot ordered by fold enrichment, top-N bar of −log10 q) in matplotlib so no R bridge is needed.
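A top-N bar of −log10 q might be drawn roughly as follows. This is a sketch under assumptions rather than the module's own plotting code: `top_terms` and `plot_bar` are hypothetical names, rows are plain dicts carrying the description and q_value columns of the result table, every q-value is assumed to be greater than zero, and the demo rows are invented.

```python
from math import log10


def top_terms(rows, top_n=10):
    """Return (description, -log10 q) for the top_n rows with the smallest q-value."""
    ranked = sorted(rows, key=lambda r: r["q_value"])[:top_n]
    return [(r["description"], -log10(r["q_value"])) for r in ranked]


def plot_bar(rows, path, top_n=10):
    """Save a horizontal bar chart of -log10 q to `path` (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend, no display needed
    import matplotlib.pyplot as plt

    terms = top_terms(rows, top_n)
    if not terms:
        return
    labels, scores = zip(*terms)
    fig, ax = plt.subplots(figsize=(6, 0.4 * len(labels) + 1))
    ax.barh(labels, scores)
    ax.invert_yaxis()  # most significant term at the top
    ax.set_xlabel("-log10(q)")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)


# Invented rows for illustration.
demo_rows = [
    {"description": "Pathway A", "q_value": 1e-4},
    {"description": "Pathway B", "q_value": 1e-2},
    {"description": "Pathway C", "q_value": 1e-6},
]
top_two = top_terms(demo_rows, top_n=2)
print(top_two)
```

Keeping the ranking step separate from the drawing step makes the selection logic checkable without a plotting backend; plot_bar(demo_rows, "bar.png") would then write the figure.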