cis_gs.enrichment

KEGG ORA + the shared hypergeometric kernel.

cis_gs.enrichment.core

Generic over-representation analysis (ORA) primitives shared by every enrichment back-end (KEGG, custom gene sets, motif targets, …).

Statistical kernel

Standard one-sided hypergeometric over-representation test with Benjamini-Hochberg FDR correction:

pval            = scipy.stats.hypergeom.sf(k - 1, totalN, n, listN)
over-rep filter = keep terms where k/n > listN/totalN
fold-enrichment = (k / listN) / (n / totalN)
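As a worked illustration of the three formulas above, the sketch below computes the upper tail P(X >= k) directly from binomial coefficients, which is the quantity the `hypergeom.sf(k - 1, totalN, n, listN)` call returns. The helper name `hypergeom_upper_tail` and the toy numbers are assumptions made for this example; they are not part of Cis-GS.

```python
from math import comb


def hypergeom_upper_tail(k, total_n, n, list_n):
    """P(X >= k) when list_n genes are drawn from total_n, of which n are in the pathway."""
    denom = comb(total_n, list_n)
    upper = min(n, list_n)  # the overlap cannot exceed either set
    numer = sum(
        comb(n, i) * comb(total_n - n, list_n - i)
        for i in range(k, upper + 1)
    )
    return numer / denom


# Toy numbers (hypothetical): 100 background genes, a 10-gene pathway,
# a 20-gene query list, 5 genes shared.
total_n, n, list_n, k = 100, 10, 20, 5

pval = hypergeom_upper_tail(k, total_n, n, list_n)
is_over_represented = k / n > list_n / total_n      # over-rep filter
fold = (k / list_n) / (n / total_n)                 # fold-enrichment

print(pval, is_over_represented, fold)
```

With these numbers the filter passes (0.5 > 0.2) and the fold-enrichment is 2.5; the filter condition is algebraically the same as requiring fold-enrichment > 1.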

bh_fdr() re-uses Cis-GS’s from-scratch Benjamini-Hochberg routine from app_v4_open.py (no statsmodels dependency).

Two implementation notes specific to Cis-GS

A single vectorised pass over thousands of pathways - we treat the whole pathway table as a NumPy column operation instead of looping one category at a time.
Optional minimum-overlap and gene-symbol-cleaning guards that prevent the spurious “1-gene-overlap pathway with q~0” hits that surface on small query lists.

cis_gs.enrichment.core.bh_fdr(pvals)[source]

Benjamini-Hochberg step-up FDR.

This is the same from-scratch implementation Cis-GS already ships in app_v4_open.py for motif-hit q-values; reused verbatim so we don’t add a statsmodels dependency just for one call.

order  = argsort(p)
ranks  = invert(order) + 1
adj    = min(1, p · n / rank)
adj[i] = min(adj[i], adj[i+1])    (monotone step-up sweep)

Parameters: pvals (array-like of float) – Raw p-values (any length, any order).
Returns: BH-adjusted q-values, same order as input.
Return type: np.ndarray
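The step-up procedure described above can be sketched in a few lines. This is a minimal pure-Python illustration under the stated recipe, not the routine from app_v4_open.py; the name `bh_adjust` is hypothetical, and it returns a plain list rather than an np.ndarray.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = [float(x) for x in pvals]
    m = len(p)
    # Indices of p sorted ascending; position in this list + 1 is the rank.
    order = sorted(range(m), key=p.__getitem__)
    adj = [0.0] * m
    running_min = 1.0  # also enforces the cap at 1
    # Sweep from the largest p-value down so each adjusted value is the
    # minimum of its own p * m / rank and every value ranked above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p[i] * m / rank)
        adj[i] = running_min
    return adj


print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

Sweeping from the largest rank downwards is what makes the result monotone: a smaller raw p-value can never end up with a larger q-value than a bigger one.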

cis_gs.enrichment.core.fold_enrichment(k, list_n, n, total_n)[source]

Fold-enrichment definition:

FE = (k / listN) / (n / totalN)

where:
k      = #genes shared by query list and pathway
listN  = size of query list
n      = #genes in pathway (within the background)
totalN = size of the background universe

Parameters:

k (ndarray)
list_n (int)
n (ndarray)
total_n (int)

Return type:

ndarray
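Because k and n are arrays, the definition applies to every pathway at once. The following sketch shows that element-wise evaluation with NumPy on made-up counts; it illustrates the formula only and is not the library's implementation.

```python
import numpy as np

# Hypothetical counts for four pathways tested against one query list.
k = np.array([5, 1, 0, 8])        # overlap with the query list
n = np.array([10, 50, 30, 40])    # pathway sizes within the background
list_n, total_n = 20, 100

# One vectorised expression covers all pathways.
fe = (k / list_n) / (n / total_n)

# The over-representation guard from the statistical kernel, also vectorised.
keep = k / n > list_n / total_n

print(fe)
print(keep)
```

A pathway whose overlap matches the background rate exactly (the last one here) has a fold-enrichment of 1 and does not pass the strict inequality.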

class cis_gs.enrichment.core.EnrichmentResult(table, n_query, n_universe, method='hypergeometric', notes=<factory>)[source]

Bases: object

Tidy container returned by hypergeometric_enrichment().

Parameters:

table (DataFrame)
n_query (int)
n_universe (int)
method (str)
notes (list[str])

table: DataFrame

n_query: int

n_universe: int

method: str = 'hypergeometric'

notes: list[str]

cis_gs.enrichment.core.hypergeometric_enrichment(query_genes, gene_sets, universe=None, *, set_descriptions=None, min_overlap=2, min_set_size=5, max_set_size=2000, one_sided=True)[source]

Vectorised over-representation analysis.

Parameters:

query_genes (iterable of str) – The user’s gene list (e.g. K-means cluster from Cis-GS Module 6, or the gene list a user pastes into the GO/KEGG dialog).
gene_sets (mapping {term_id: iterable of gene IDs}) – The annotation database - GO terms, KEGG pathways, custom MSigDB-style sets, etc. Term IDs are used as the table primary key.
universe (iterable of str, optional) – Background gene set (totalN). If None (the default), taken as the union of every gene appearing in any value of gene_sets. The recommended background is the protein-coding gene complement; pass that explicitly when you have it.
set_descriptions (mapping {term_id: human-readable name}, optional) – For the Description column. KEGG’s pathway_name table or QuickGO’s name field both fit naturally here.
min_overlap (int) – Drop terms with fewer than this many query-genes in them. Defaults to 2 (raised from 1 to silence single-gene noise on small K-means clusters).
min_set_size (int) – Lower bound of the size window: drop pathways/categories smaller than this. Defaults to 5. Filters out near-empty terms that produce uninterpretable p-values.
max_set_size (int) – Upper bound of the size window: drop pathways/categories larger than this. Defaults to 2000. Filters out near-universal terms that produce uninterpretable p-values.
one_sided (bool) – If True, apply a one-sided over-representation guard (k/n > listN/totalN). If False, every term gets a p-value.

Returns:

.table is a pandas.DataFrame with columns: term, description, k, list_n, n, total_n, fold_enrichment, p_value, q_value, genes

Return type:

EnrichmentResult
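To make the interaction of universe, min_overlap, min_set_size and max_set_size concrete, here is a minimal sketch of the counting and filtering stage only (no p-values). The function name `count_overlaps` is hypothetical, and two details are assumptions rather than documented behaviour: the query list and each gene set are intersected with the universe, and the size window is treated as inclusive.

```python
def count_overlaps(query_genes, gene_sets, universe=None,
                   min_overlap=2, min_set_size=5, max_set_size=2000):
    """Return one row of counts per term that survives the guards."""
    if universe is None:
        # Default background: every gene seen in any gene set.
        universe = set().union(*gene_sets.values()) if gene_sets else set()
    else:
        universe = set(universe)

    query = set(query_genes) & universe  # assumption: ignore genes outside the background
    rows = []
    for term, members in gene_sets.items():
        members = set(members) & universe
        n = len(members)
        if not (min_set_size <= n <= max_set_size):
            continue  # outside the size window
        shared = query & members
        if len(shared) < min_overlap:
            continue  # too few query genes in this term
        rows.append({
            "term": term, "k": len(shared), "list_n": len(query),
            "n": n, "total_n": len(universe), "genes": sorted(shared),
        })
    return rows


# Toy annotation (hypothetical IDs).
gene_sets = {
    "A": ["g1", "g2", "g3", "g4", "g5", "g6"],
    "B": ["g1", "g2"],                               # below min_set_size
    "C": ["g7", "g8", "g9", "g10", "g11", "g12"],    # only one query gene
}
rows = count_overlaps(["g1", "g2", "g7", "unknown"], gene_sets)
print(rows)
```

Only term A survives: B falls below the size window and C has a single-gene overlap, which is the kind of hit the min_overlap default of 2 is meant to suppress.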

cis_gs.enrichment.kegg

KEGG pathway over-representation analysis.

Internal data model

Per-species KEGG pathway tables:

pathway - gene to pathway-ID mapping
pathwayInfo - pathway-ID to (name, gene_count, URL)
categories - pathway-ID to high-level category

We keep the same three logical tables that a bundled SQLite dump would expose, just materialised on-the-fly from KEGG REST instead of shipped as a 5 GB blob.

KEGG identifies every species by a three-letter organism code (ath = Arabidopsis, hsa = Human, mmu = Mouse, …) and we keep all enrichment scoped to one organism at a time.

What’s original here

Direct REST queries to https://rest.kegg.jp/ - three endpoints:
- /list/pathway/<org> - pathway IDs + descriptions
- /link/<org>/pathway - gene to pathway membership
- /find/genes/<symbol> - resolve gene symbols to KEGG IDs
Together those provide everything a per-species SQLite dump would, streamed over HTTP in under 2 s per organism.
Two-tier disk + memory cache so the second enrich-kegg call on the same organism is instantaneous.
Auto fallback: KEGG itself uses NCBI Gene IDs for animals and locus tags for plants. If the user’s gene IDs are in another namespace (gene symbols, Ensembl, or Entrez IDs where those are not the organism’s native KEGG IDs), we route through KEGG’s conv endpoint to translate.
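As a rough illustration of how a /link response could become the pathway membership table, the sketch below builds the URL and parses tab-separated text into {pathway_id: set(gene_ids)}. The helper names `link_url` and `parse_link` are hypothetical, the sample text is hand-written in the two-column shape the endpoint returns rather than real KEGG output, and the column order (pathway first, gene second) is an assumption to check against a live response.

```python
from collections import defaultdict

KEGG_REST = "https://rest.kegg.jp"


def link_url(organism):
    """URL of the gene-to-pathway membership endpoint for one organism code."""
    return f"{KEGG_REST}/link/{organism}/pathway"


def parse_link(text):
    """Parse 'pathway_id<TAB>gene_id' lines into {pathway_id: {gene_id, ...}}."""
    membership = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing newline
        pathway_id, gene_id = line.split("\t")[:2]
        membership[pathway_id].add(gene_id)
    return dict(membership)


# Hand-written sample (illustrative only).
sample = (
    "path:ath00010\tath:AT1G01090\n"
    "path:ath00010\tath:AT1G04410\n"
    "path:ath00020\tath:AT1G04410\n"
)
pathways = parse_link(sample)
print(link_url("ath"))
print(pathways)
```

Keeping the parser separate from the HTTP call means it can be exercised offline and reused for cached responses.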

class cis_gs.enrichment.kegg.KEGGClient(cache_dir=PosixPath('/home/runner/.cis-gs/kegg'), timeout=30.0, retries=3)[source]

Bases: object

Thin retrying wrapper around the four KEGG REST endpoints we need.

Parameters:

cache_dir (str | os.PathLike)
timeout (float)
retries (int)
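The cache_dir and retries parameters suggest a pattern along the following lines: look in memory first, then on disk, and only then fetch with a bounded number of attempts. This is a minimal sketch of that pattern, not KEGGClient itself. The class `TwoTierCache`, the injected `fetch` callable, the JSON-per-key file layout and the linear back-off are all assumptions made for illustration, and keys are assumed to be safe as file names.

```python
import json
import tempfile
import time
from pathlib import Path


class TwoTierCache:
    """Memory-then-disk cache in front of a retrying fetch function."""

    def __init__(self, cache_dir, fetch, retries=3, backoff=0.0):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch
        self.retries = retries
        self.backoff = backoff
        self._memory = {}

    def get(self, key):
        if key in self._memory:               # tier 1: in-process dict
            return self._memory[key]
        path = self.cache_dir / f"{key}.json"
        if path.exists():                     # tier 2: file on disk
            value = json.loads(path.read_text())
        else:
            value = self._fetch_with_retries(key)
            path.write_text(json.dumps(value))
        self._memory[key] = value
        return value

    def _fetch_with_retries(self, key):
        last_error = None
        for attempt in range(max(1, self.retries)):
            try:
                return self.fetch(key)
            except OSError as exc:            # network-style failures only
                last_error = exc
                time.sleep(self.backoff * (attempt + 1))
        raise last_error


calls = []

def flaky_fetch(key):
    """Stand-in for an HTTP request: fails once, then succeeds."""
    calls.append(key)
    if len(calls) == 1:
        raise OSError("simulated timeout")
    return {"key": key, "payload": "ok"}


with tempfile.TemporaryDirectory() as tmp:
    cache = TwoTierCache(tmp, flaky_fetch)
    first = cache.get("ath_pathways")                          # fetched after one retry
    second = cache.get("ath_pathways")                         # served from memory
    third = TwoTierCache(tmp, flaky_fetch).get("ath_pathways")  # served from disk

print(len(calls), first == second == third)
```

The fetch function is called twice in total (one failure, one success); the second and third lookups never reach it.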

list_pathways(organism)[source]

Return {pathway_id → description} for one organism, e.g. {path:ath00010 → ‘Glycolysis / Gluconeogenesis - Arabidopsis thaliana’}.

Parameters: organism (str)
Return type: dict[str, str]

pathway_genes(organism)[source]

{pathway_id → set(gene_ids)} for one organism. Returns KEGG’s native gene IDs (e.g. ath:AT1G01010 for Arabidopsis, hsa:7157 for Human TP53). Caller is responsible for matching the query gene list to the same namespace; see convert_to_kegg.

Parameters: organism (str)
Return type: dict[str, set[str]]

convert_to_kegg(organism, gene_ids)[source]

Translate Ensembl / NCBI-Gene / UniProt → KEGG gene IDs via /conv. Returns {original_id: kegg_id} with misses left out.

Parameters:

organism (str)
gene_ids (Iterable[str])

Return type:

dict[str, str]

class cis_gs.enrichment.kegg.KEGGEnricher(organism, client=None, background=None)[source]

Bases: object

Drop-in KEGG enrichment.

>>> e = KEGGEnricher(organism="ath")        # Arabidopsis
>>> result = e.enrich(["AT1G01010", "AT2G18790", ...])
>>> result.table.head()

For animals the input is typically NCBI Gene IDs:

>>> e = KEGGEnricher(organism="hsa")        # Human
>>> result = e.enrich(["7157", "672"])      # TP53, BRCA1

Parameters:

organism (str)
client (KEGGClient | None)
background (list[str] | None)

organism: str

client: KEGGClient | None = None

background: list[str] | None = None

enrich(query_genes, **kwargs)[source]

Parameters: query_genes (Iterable[str])
Return type: EnrichmentResult

cis_gs.enrichment.plots

Matplotlib renditions of the two canonical enrichment views (dot-plot + bar-plot).

Provenance

We render the same two canonical plots (dot-plot ordered by fold enrichment, top-N bar of −log10 q) in matplotlib so no R bridge is needed.

cis_gs.enrichment.plots.dot_plot(table, top_n=20, out_path=None, title='Enrichment dot plot')[source]

Top-N enriched terms as a dot plot.

X-axis : fold enrichment
Y-axis : term description (ordered by q ascending)
Size   : k (overlap count)
Colour : −log10(q)

Parameters:

table (DataFrame)
top_n (int)
out_path (str | None)
title (str)

Return type:

Figure

cis_gs.enrichment.plots.bar_plot(table, top_n=20, out_path=None, title='Top enriched terms')[source]

Top-N bar plot of −log10(q).

Parameters:

table (DataFrame)
top_n (int)
out_path (str | None)
title (str)

Return type:

Figure
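Both plots rest on the same preparation step: pick the top_n terms by q-value and transform q to −log10(q). The sketch below shows that step on plain dictionaries so it runs without pandas or matplotlib. The helper `top_terms` and the `floor` guard against log10(0) are assumptions for illustration, and the rows are invented, not real enrichment output.

```python
import math


def top_terms(rows, top_n=20, floor=1e-300):
    """Return (description, -log10(q)) for the top_n rows with the smallest q."""
    ranked = sorted(rows, key=lambda r: r["q_value"])[:top_n]
    # Clamp q at a tiny floor so a q-value of exactly 0 does not raise.
    return [
        (r["description"], -math.log10(max(r["q_value"], floor)))
        for r in ranked
    ]


# Invented rows using the column names of the result table.
rows = [
    {"description": "A", "q_value": 0.01},
    {"description": "B", "q_value": 0.001},
    {"description": "C", "q_value": 0.5},
]
bars = top_terms(rows, top_n=2)
print(bars)
```

The resulting pairs are what a horizontal bar chart would draw: one bar per description, with length −log10(q).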