Step 1 — Promoter Extraction
Pulls the upstream region of every gene out of a genome FASTA, using the coordinates and strand information from a matching GFF3 annotation.
Inputs
Genome FASTA: any standard .fa / .fasta / .fa.gz file.
Annotation GFF3: NCBI RefSeq, Ensembl, or any GFF3 that contains gene features with ID= attributes.
Promoter length (bp): default 2000.
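To make the annotation requirement concrete, here is a minimal sketch of how gene features with ID= attributes could be read from a GFF3 file. It is not the tool's actual parser: the function name parse_gene_features is hypothetical, and the sketch assumes the standard nine tab-separated GFF3 columns with semicolon-separated key=value attributes.

```python
from urllib.parse import unquote


def parse_gene_features(gff_lines):
    """Yield (gene_id, seqid, start, end, strand) for each gene feature.

    Hypothetical helper: rows that are not 'gene' features, or that carry
    no ID= attribute, are skipped silently.
    """
    for line in gff_lines:
        # Skip blank lines, directives (##...) and comments (#...).
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Column 9 holds attributes such as "ID=gene-ABC1;Name=ABC1".
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        gene_id = attrs.get("ID")
        if gene_id is None:
            continue
        # GFF3 percent-encodes reserved characters in attribute values.
        yield unquote(gene_id), cols[0], int(cols[3]), int(cols[4]), cols[6]
```

The start and end values returned here are the GFF3 coordinates as written in the file, which are 1-based and inclusive.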
What it does
For every gene feature in the GFF3:
1. Read the gene’s chromosome, start, end, and strand.
2. On the + strand, take [start - upstream, start).
3. On the - strand, take (end, end + upstream] and reverse-complement.
4. Clip at neighbouring gene boundaries (intergenic only) if the “Avoid overlapping genes” box is checked.
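The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: all function names are hypothetical, coordinates are treated as 1-based inclusive (as in GFF3), windows are clipped at the chromosome ends (an assumption, since the text does not say what happens there), and the overlap rule is one plausible reading of "Avoid overlapping genes".

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def promoter_window(start, end, strand, upstream=2000, chrom_len=None):
    """Return the promoter as a 1-based inclusive (lo, hi), or None if empty."""
    if strand == "+":
        lo, hi = start - upstream, start - 1   # [start - upstream, start)
    elif strand == "-":
        lo, hi = end + 1, end + upstream       # (end, end + upstream]
    else:
        raise ValueError(f"unsupported strand: {strand!r}")
    lo = max(lo, 1)                            # assumed: clip at chromosome start
    if chrom_len is not None:
        hi = min(hi, chrom_len)                # assumed: clip at chromosome end
    return (lo, hi) if lo <= hi else None


def clip_to_intergenic(window, strand, genes):
    """Trim the window so it stops at the nearest neighbouring gene.

    `genes` is a list of (start, end) intervals on the same chromosome.
    The part of the window adjacent to the gene of interest is kept.
    """
    lo, hi = window
    for g_start, g_end in genes:
        if g_end < lo or g_start > hi:
            continue                           # neighbour lies outside the window
        if strand == "+":
            lo = max(lo, g_end + 1)
        else:
            hi = min(hi, g_start - 1)
    return (lo, hi) if lo <= hi else None


def extract_promoter(chrom_seq, start, end, strand, upstream=2000):
    """Slice the promoter out of a chromosome string, oriented 5'->3'."""
    window = promoter_window(start, end, strand, upstream, len(chrom_seq))
    if window is None:
        return ""
    lo, hi = window
    seq = chrom_seq[lo - 1:hi]                 # 1-based inclusive -> Python slice
    return reverse_complement(seq) if strand == "-" else seq
```

Because the promoter window never includes the gene itself, the full list of genes on the chromosome can be passed to clip_to_intergenic without filtering out the gene of interest first.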
Output
promoters.fa: one entry per gene; the FASTA header is the gene ID, optionally prefixed with chromosome and coordinate range.
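As an illustration of the output layout, the sketch below formats a single FASTA entry. The exact form of the optional prefix is not specified here, so the chrom:lo-hi|gene_id layout, the 60-column line wrap, and the function name format_fasta_entry are all assumptions made for the example.

```python
def format_fasta_entry(gene_id, seq, chrom=None, lo=None, hi=None, width=60):
    """Return one FASTA record as a string, ending with a newline."""
    header = gene_id
    if chrom is not None and lo is not None and hi is not None:
        # Hypothetical prefix layout; the real tool may format this differently.
        header = f"{chrom}:{lo}-{hi}|{gene_id}"
    # Wrap the sequence at a fixed width, as is conventional for FASTA files.
    body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{header}\n{body}\n"
```

For example, format_fasta_entry("gene-ABC1", "ACGTAC", width=4) produces a header line with the bare gene ID followed by the sequence split over two lines.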
CLI equivalent
cis-gs extract --fasta genome.fa --gff annot.gff3 \
--upstream 2000 --avoid-overlap --out promoters.fa