cis_gs.enrichment

KEGG ORA + the shared hypergeometric kernel.

cis_gs.enrichment.core

Generic over-representation analysis (ORA) primitives shared by every enrichment back-end (KEGG, custom gene sets, motif targets, …).

Statistical kernel

Standard one-sided hypergeometric over-representation test with Benjamini-Hochberg FDR correction:

pval            = scipy.stats.hypergeom.sf(k - 1, totalN, n, listN)
over-rep filter = keep terms where k/n > listN/totalN
fold-enrichment = (k / listN) / (n / totalN)
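The survival-function call above is the upper tail P(X ≥ k), so it can be cross-checked with exact binomial coefficients from the standard library. The sketch below is illustrative only: the helper name `upper_tail` and the toy counts are assumptions, not part of Cis-GS.

```python
from math import comb


def upper_tail(k, total_n, n, list_n):
    """P(X >= k) when list_n genes are drawn from total_n, n of which are in the pathway.

    Same quantity as scipy.stats.hypergeom.sf(k - 1, total_n, n, list_n).
    """
    hits = range(max(k, 0), min(n, list_n) + 1)
    favourable = sum(comb(n, i) * comb(total_n - n, list_n - i) for i in hits)
    return favourable / comb(total_n, list_n)


# Toy counts (hypothetical): 20-gene universe, 5-gene pathway, 4-gene query, 2 shared.
k, total_n, n, list_n = 2, 20, 5, 4

pval = upper_tail(k, total_n, n, list_n)       # 1205 / 4845
over_represented = k / n > list_n / total_n    # 0.40 > 0.20
fold = (k / list_n) / (n / total_n)            # 0.50 / 0.25

print(f"p = {pval:.4f}, over-represented = {over_represented}, FE = {fold:.1f}")
```

A term can pass the over-representation filter and still have an unremarkable p-value, as here; the filter only fixes the direction of the test.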

bh_fdr() re-uses Cis-GS’s from-scratch Benjamini-Hochberg routine from app_v4_open.py (no statsmodels.stats.multitest dependency).

Two implementation notes specific to Cis-GS

  1. A single vectorised pass over thousands of pathways - we treat the whole pathway table as a NumPy column operation instead of looping one category at a time.

  2. Optional minimum-overlap and gene-symbol-cleaning guards that prevent the spurious “1-gene-overlap pathway with q~0” hits that surface on small query lists.
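The two notes above can be sketched with a boolean membership matrix, so that overlap counts, fold enrichment and both guards are computed for every term in one pass. This is a minimal sketch under stated assumptions: the toy genes and terms are placeholders, the cleaning rule (strip whitespace, drop blanks and duplicates) is a guess at what "gene-symbol cleaning" covers, and the actual Cis-GS implementation may differ.

```python
import numpy as np

# Hypothetical toy data; gene and term names are placeholders.
universe = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
gene_sets = {
    "termA": ["g1", "g2", "g3"],
    "termB": ["g1", "g10"],
    "termC": ["g4", "g5", "g6", "g7", "g8"],
}
raw_query = [" g1", "g2 ", "g3", "g9", "", "g2"]

# Symbol cleaning (assumed rule): strip whitespace, drop blanks and duplicates.
query = {g.strip() for g in raw_query if g.strip()}

index = {gene: i for i, gene in enumerate(universe)}
terms = list(gene_sets)

# Boolean membership matrix: one row per term, one column per universe gene.
member = np.zeros((len(terms), len(universe)), dtype=bool)
for row, term in enumerate(terms):
    cols = [index[g] for g in gene_sets[term] if g in index]
    member[row, cols] = True

in_query = np.zeros(len(universe), dtype=bool)
in_query[[index[g] for g in query if g in index]] = True

# Column operations over every term at once.
k = (member & in_query).sum(axis=1)   # overlap per term
n = member.sum(axis=1)                # term size within the universe
list_n, total_n = int(in_query.sum()), len(universe)

fold = (k / list_n) / (n / total_n)
# Minimum-overlap guard (k >= 2) combined with the over-representation filter.
keep = (k >= 2) & (k / n > list_n / total_n)
```

Here `termB` is nominally over-represented (k/n = 0.5 against 0.4) but shares only one gene with the query, so the minimum-overlap guard drops it.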

cis_gs.enrichment.core.bh_fdr(pvals)[source]

Benjamini-Hochberg step-up FDR.

This is the same from-scratch implementation Cis-GS already ships in app_v4_open.py for motif-hit q-values; reused verbatim so we don’t add a statsmodels dependency just for one call.

order  = argsort(p)
ranks  = invert(order) + 1
adj    = min(1, p · n / rank)
adj[i] = min(adj[i], adj[i+1])    (monotone step-up sweep)

Parameters:

pvals (array-like of float) – Raw p-values (any length, any order).

Returns:

BH-adjusted q-values, same order as input.

Return type:

np.ndarray
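For readers who want to see the step-up procedure end to end, the following is a minimal NumPy sketch of the recipe described above. It is not the code from app_v4_open.py; the name `bh_fdr_sketch` is hypothetical and the p-values are made up for illustration.

```python
import numpy as np


def bh_fdr_sketch(pvals):
    """Benjamini-Hochberg q-values, returned in the same order as the input."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    if n == 0:
        return p.copy()
    order = np.argsort(p)                           # indices that sort p ascending
    scaled = p[order] * n / np.arange(1, n + 1)     # p_(i) * n / rank
    # Sweep from the largest rank downwards so q-values never decrease with rank.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    scaled = np.minimum(scaled, 1.0)                # cap at 1
    q = np.empty(n)
    q[order] = scaled                               # undo the sort
    return q


q = bh_fdr_sketch([0.01, 0.04, 0.03, 0.005])
print(q)
```

The sweep is what makes the result monotone: the raw scaled values for the two smallest p-values are both 0.02, and the two largest both land on 0.04.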

cis_gs.enrichment.core.fold_enrichment(k, list_n, n, total_n)[source]

Fold-enrichment definition:

FE = (k / listN) / (n / totalN)

where

k       = #genes shared by query list and pathway
listN   = size of query list
n       = #genes in pathway (within the background)
totalN  = size of the background universe

Return type:

ndarray

class cis_gs.enrichment.core.EnrichmentResult(table, n_query, n_universe, method='hypergeometric', notes=<factory>)[source]

Bases: object

Tidy container returned by hypergeometric_enrichment().

Parameters:
table: DataFrame
n_query: int
n_universe: int
method: str = 'hypergeometric'
notes: list[str]
cis_gs.enrichment.core.hypergeometric_enrichment(query_genes, gene_sets, universe=None, *, set_descriptions=None, min_overlap=2, min_set_size=5, max_set_size=2000, one_sided=True)[source]

Vectorised over-representation analysis.

Parameters:
  • query_genes (iterable of str) – The user’s gene list (e.g. a K-means cluster from Cis-GS Module 6, or the gene list a user pastes into the GO/KEGG dialog).

  • gene_sets (mapping {term_id: iterable of gene IDs}) – The annotation database - GO terms, KEGG pathways, custom MSigDB-style sets, etc. Term IDs are used as the table primary key.

  • universe (iterable of str, optional) – Background gene set (totalN). If None (the default), taken as the union of every gene appearing in any value of gene_sets. The protein-coding gene complement is the preferred background; pass it explicitly when you have it.

  • set_descriptions (mapping {term_id: human-readable name}, optional) – For the Description column. KEGG’s pathway_name table or QuickGO’s name field both fit naturally here.

  • min_overlap (int) – Drop terms with fewer than this many query-genes in them. Defaults to 2 (raised from 1 to silence single-gene noise on small K-means clusters).

  • min_set_size (int) – Drop pathways/categories with fewer than this many genes. Defaults to 5. Filters out near-empty terms that produce uninterpretable p-values.

  • max_set_size (int) – Drop pathways/categories with more than this many genes. Defaults to 2000. Filters out near-universal terms that produce uninterpretable p-values.

  • one_sided (bool) – If True, apply a one-sided over-representation guard (k/n > listN/totalN). If False, every term gets a p-value.

Returns:

.table is a pandas.DataFrame with columns:

term, description, k, list_n, n, total_n, fold_enrichment, p_value, q_value, genes

Return type:

EnrichmentResult
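To make the interaction between `universe`, the size window and `min_overlap` concrete, here is a pure-Python sketch of the filtering stage only (no p-values). The function name `candidate_terms` is hypothetical, and clipping the query to the background is an assumption that the documentation above does not state explicitly.

```python
def candidate_terms(query_genes, gene_sets, universe=None,
                    min_overlap=2, min_set_size=5, max_set_size=2000):
    """Return {term_id: (k, list_n, n, total_n)} for terms that pass the filters."""
    sets = {term: set(genes) for term, genes in gene_sets.items()}
    if universe is None:
        # Documented default: union of every gene in any gene set.
        background = set().union(*sets.values())
    else:
        background = set(universe)

    query = set(query_genes) & background   # assumption: query is clipped to the background
    list_n, total_n = len(query), len(background)

    kept = {}
    for term, genes in sets.items():
        genes_in_bg = genes & background    # n is counted within the background
        n = len(genes_in_bg)
        k = len(genes_in_bg & query)
        if min_set_size <= n <= max_set_size and k >= min_overlap:
            kept[term] = (k, list_n, n, total_n)
    return kept


# Placeholder gene sets and query, for illustration only.
toy_sets = {
    "small": ["a", "b"],
    "mid": ["a", "b", "c", "d", "e"],
    "other": ["f", "g", "h", "i", "j", "k"],
}
result = candidate_terms(["a", "b", "f", "zzz"], toy_sets)
print(result)
```

With the defaults, `small` falls below `min_set_size`, `other` has only one query gene, and the unknown ID `zzz` does not count towards listN.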

cis_gs.enrichment.kegg

KEGG pathway over-representation analysis.

Internal data model

Per-species KEGG pathway tables:

  • pathway - gene to pathway-ID mapping

  • pathwayInfo - pathway-ID to (name, gene_count, URL)

  • categories - pathway-ID to high-level category

We keep the same three logical tables that a bundled SQLite dump would expose, just materialised on-the-fly from KEGG REST instead of shipped as a 5 GB blob.

KEGG identifies every species by a three-letter organism code (ath = Arabidopsis, hsa = Human, mmu = Mouse, …) and we keep all enrichment scoped to one organism at a time.

What’s original here

  • Direct REST queries to https://rest.kegg.jp/ - three endpoints:

    • /list/pathway/<org> - pathway IDs + descriptions

    • /link/<org>/pathway - gene to pathway membership

    • /find/genes/<symbol> - resolve gene symbols to KEGG IDs

    Together those provide everything a per-species SQLite dump would, streamed over HTTP in under 2 s per organism.

  • Two-tier disk + memory cache so the second enrich-kegg call on the same organism is instantaneous.

  • Auto fallback: if the user’s gene IDs are gene symbols / Ensembl / Entrez (KEGG itself uses NCBI Gene ID for animals and locus tags for plants), we route through KEGG’s conv endpoint to translate.
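KEGG REST responses are plain tab-delimited text, so the membership table can be assembled with a few lines of parsing. The sketch below works on a hand-written response body rather than a live request; the helper names, the placeholder gene IDs and the column order of the sample are assumptions, and the parser accepts either column order for that reason.

```python
from collections import defaultdict

KEGG_REST = "https://rest.kegg.jp"


def endpoint_urls(organism):
    """URLs for the pathway list and gene-to-pathway membership of one organism."""
    return {
        "pathways": f"{KEGG_REST}/list/pathway/{organism}",
        "membership": f"{KEGG_REST}/link/{organism}/pathway",
    }


def parse_two_columns(text):
    """Yield (left, right) pairs from a tab-delimited response body."""
    for line in text.splitlines():
        if "\t" in line:
            left, right = line.split("\t", 1)
            yield left.strip(), right.strip()


def membership_table(link_text):
    """Build {pathway_id: set(gene_ids)} from a /link response."""
    table = defaultdict(set)
    for left, right in parse_two_columns(link_text):
        # Accept either column order; the pathway entry is the one prefixed "path:".
        pathway, gene = (left, right) if left.startswith("path:") else (right, left)
        table[pathway].add(gene)
    return dict(table)


# Illustrative response body (hand-written placeholders, not fetched from KEGG).
sample = (
    "path:ath00010\tath:GENE_A\n"
    "path:ath00010\tath:GENE_B\n"
    "path:ath00020\tath:GENE_A\n"
)
pathways = membership_table(sample)
```

The resulting mapping has the same shape as the return value documented for `pathway_genes`, which is what the enrichment core expects as `gene_sets`.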

class cis_gs.enrichment.kegg.KEGGClient(cache_dir=PosixPath('/home/runner/.cis-gs/kegg'), timeout=30.0, retries=3)[source]

Bases: object

Thin retrying wrapper around the four KEGG REST endpoints we need (list, link, find, conv).
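The retry and two-tier caching behaviour can be pictured with the sketch below. It is not the KEGGClient source: the class name `TwoTierCache`, the file-naming scheme and the linear back-off are all assumptions, and the network call is replaced by a fake fetcher so the example runs offline.

```python
import tempfile
import time
from pathlib import Path


class TwoTierCache:
    """Memory dict in front of a directory of text files, filled by a retrying fetch."""

    def __init__(self, cache_dir, fetch, retries=3, backoff=0.0):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch
        self.retries = max(1, retries)
        self.backoff = backoff
        self._memory = {}

    def get(self, key):
        if key in self._memory:                       # tier 1: memory
            return self._memory[key]
        path = self.cache_dir / (key.replace("/", "_") + ".txt")
        if path.exists():                             # tier 2: disk
            text = path.read_text(encoding="utf-8")
        else:
            text = self._fetch_with_retries(key)
            path.write_text(text, encoding="utf-8")
        self._memory[key] = text
        return text

    def _fetch_with_retries(self, key):
        last_error = None
        for attempt in range(self.retries):
            try:
                return self.fetch(key)
            except OSError as exc:                    # e.g. timeouts, connection errors
                last_error = exc
                time.sleep(self.backoff * (attempt + 1))
        raise last_error


calls = []


def flaky_fetch(key):
    """Stand-in for an HTTP request: fails once, then succeeds."""
    calls.append(key)
    if len(calls) == 1:
        raise OSError("simulated timeout")
    return "path:ath00010\tath:GENE_A\n"


with tempfile.TemporaryDirectory() as tmp:
    cache = TwoTierCache(tmp, flaky_fetch)
    first = cache.get("link/ath/pathway")                            # fetched on the second attempt
    second = cache.get("link/ath/pathway")                           # served from memory
    third = TwoTierCache(tmp, flaky_fetch).get("link/ath/pathway")   # served from disk
```

Only two fetch attempts are made in total: one failure and one success. Every later lookup, including one from a fresh object pointing at the same directory, is answered from a cache tier.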

list_pathways(organism)[source]

Return {pathway_id: description} for one organism, e.g. {path:ath00010 → ‘Glycolysis / Gluconeogenesis - Arabidopsis thaliana’}.

Parameters:

organism (str)

Return type:

dict[str, str]

pathway_genes(organism)[source]

{pathway_id → set(gene_ids)} for one organism. Returns KEGG’s native gene IDs (e.g. ath:AT1G01010 for Arabidopsis, hsa:7157 for Human TP53). Caller is responsible for matching the query gene list to the same namespace; see convert_to_kegg.

Parameters:

organism (str)

Return type:

dict[str, set[str]]

convert_to_kegg(organism, gene_ids)[source]

Translate Ensembl / NCBI-Gene / UniProt → KEGG gene IDs via /conv. Returns {original_id: kegg_id} with misses left out.

Return type:

dict[str, str]
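The "misses left out" contract can be illustrated by parsing a conversion table into a mapping restricted to the requested IDs. This is a sketch under assumptions: `parse_conv` is a hypothetical helper, the response body is hand-written, and the external-ID-first column order is assumed rather than taken from the Cis-GS source.

```python
def parse_conv(conv_text, gene_ids, source_prefix):
    """Return {original_id: kegg_id} for the requested IDs; unmatched IDs are omitted."""
    wanted = set(gene_ids)
    prefix = source_prefix + ":"
    mapping = {}
    for line in conv_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue                        # skip blank or malformed lines
        external, kegg_id = (p.strip() for p in parts)
        original = external[len(prefix):] if external.startswith(prefix) else external
        if original in wanted:
            mapping[original] = kegg_id
    return mapping


# Hand-written response body using the two human genes from the examples in this reference.
sample_conv = "ncbi-geneid:7157\thsa:7157\nncbi-geneid:672\thsa:672\n"
converted = parse_conv(sample_conv, ["7157", "672", "0000"], "ncbi-geneid")
print(converted)
```

Because unmatched IDs are silently dropped, comparing `len(converted)` with the size of the input list is a quick way to spot a namespace mismatch before running the enrichment.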

class cis_gs.enrichment.kegg.KEGGEnricher(organism, client=None, background=None)[source]

Bases: object

Drop-in KEGG enrichment.

>>> e = KEGGEnricher(organism="ath")        # Arabidopsis
>>> result = e.enrich(["AT1G01010", "AT2G18790", ...])
>>> result.table.head()

For animals the input is typically NCBI Gene IDs:

>>> e = KEGGEnricher(organism="hsa")        # Human
>>> result = e.enrich(["7157", "672"])      # TP53, BRCA1
Parameters:
organism: str
client: KEGGClient | None = None
background: list[str] | None = None
enrich(query_genes, **kwargs)[source]
Parameters:

query_genes (Iterable[str])

Return type:

EnrichmentResult

cis_gs.enrichment.plots

Matplotlib renditions of the two canonical enrichment views (dot-plot + bar-plot).

Provenance

We render the same two canonical plots (dot-plot ordered by fold enrichment, top-N bar of −log10 q) in matplotlib so no R bridge is needed.

cis_gs.enrichment.plots.dot_plot(table, top_n=20, out_path=None, title='Enrichment dot plot')[source]

Top-N enriched terms as a dot plot.

X-axis : fold enrichment
Y-axis : term description (ordered by q ascending)
Size   : k (overlap count)
Colour : −log10(q)

Return type:

Figure
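The mapping from table columns to visual channels can be sketched without any plotting library. The helper below is hypothetical and takes plain dicts rather than a DataFrame; the floor applied to q before taking the logarithm is an assumption added to avoid log10(0), and the rows are invented for illustration.

```python
from math import log10


def dot_plot_layout(rows, top_n=20, q_floor=1e-300):
    """Pick the top_n rows by ascending q and derive the four visual channels."""
    top = sorted(rows, key=lambda r: r["q_value"])[:top_n]
    return {
        "x": [r["fold_enrichment"] for r in top],                  # X-axis
        "y": [r["description"] for r in top],                      # Y-axis labels
        "size": [r["k"] for r in top],                             # marker size
        "colour": [-log10(max(r["q_value"], q_floor)) for r in top],
    }


# Invented rows using the documented column names.
rows = [
    {"description": "term A", "fold_enrichment": 3.0, "k": 6, "q_value": 0.01},
    {"description": "term B", "fold_enrichment": 5.0, "k": 4, "q_value": 0.001},
    {"description": "term C", "fold_enrichment": 1.5, "k": 9, "q_value": 0.1},
]
layout = dot_plot_layout(rows, top_n=2)
```

With matplotlib available, these four lists would map naturally onto the positional, `s` and `c` arguments of `Axes.scatter`, with `k` rescaled to a readable marker size.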

cis_gs.enrichment.plots.bar_plot(table, top_n=20, out_path=None, title='Top enriched terms')[source]

Top-N bar plot of −log10(q).

Return type:

Figure