Reference Genomes
The ScaleBio RNA workflow requires a reference genome for analysis. This page explains how to set up and use reference genomes with the workflow.
A reference consists of two parts: the genome sequence, used to align the RNA reads, and a gene annotation, used to assign aligned reads to genes and quantify gene expression.
The workflow uses a simple JSON configuration file to define all genome-related settings, making it easy to switch between different species or genome versions.
Quick Start
- Choose your genome: Use a pre-built genome (recommended) or create your own
- Download and extract: Get the genome files and place them on your system
- Run the workflow: Point to the genome configuration file using `--genome`
Example:
# Download and extract a pre-built human genome
wget http://scale.pub.s3.amazonaws.com/genomes/rna/grch38.tgz
tar -xzf grch38.tgz
# Run the workflow with this genome
nextflow run ScaleBio/ScaleRna --genome /path/to/grch38/grch38.json ...
Pre-built Genomes
We provide pre-built reference genomes for common species. These are ready to use and include all necessary files.
Available Genomes
Species | Genome Version | Download Link | Size |
---|---|---|---|
Human | GRCh38 | Download | ~3.5 GB |
Mouse | mm39 | Download | ~2.8 GB |
Human/Mouse Barnyard | GRCh38 + mm39 | Download | ~6.3 GB |
Tip
Pre-built genomes are the easiest option for most users. They're tested and ready to use.
Creating Custom Genomes
If you need a genome that's not available pre-built, you can create your own.
- Create a directory for your genome files and gather sequence (`.fasta`) and annotation (`.gtf`)
- Create a STAR index and place it in a subdirectory
- Create `genome.json` with the required fields
Please see a more detailed explanation on how to generate a custom genome here.
Building a STAR Index
Requirements
- STAR version: ≥ 2.7.4a
- Genome FASTA: Primary assembly recommended (no alt. haplotypes)
- Gene annotation: GTF format with exon entries
STAR --runMode genomeGenerate \
--runThreadN 16 \
--genomeDir /path/to/star_index \
--genomeFastaFiles /path/to/genome.fa \
--sjdbGTFfile /path/to/genes.gtf
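The version requirement above can be checked before starting a long index build. The following is a minimal sketch, not part of the workflow: `version_ge` is a hypothetical helper, and it assumes that `sort -V` (version sort) is available and that `STAR --version` prints a bare version string such as `2.7.10b`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Uses version sort (sort -V); how letter suffixes such as "4a" are ordered
# is left to sort -V, which is an assumption about STAR's release naming.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

required="2.7.4a"
if command -v STAR >/dev/null 2>&1; then
    # Assumption: STAR --version prints only the version string.
    installed="$(STAR --version)"
    if version_ge "$installed" "$required"; then
        echo "STAR $installed meets the minimum ($required)"
    else
        echo "STAR $installed is older than $required" >&2
    fi
else
    echo "STAR not found on PATH" >&2
fi
```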
Creating the Configuration File
All genome settings are defined in a `genome.json` file. This file tells the workflow where to find the reference files and how to interpret them.
Required Fields
Field | Description | Example |
---|---|---|
`name` | Species/genome version identifier | `GRCh38`, `mm39` |
`star_index` | Path to the STAR index directory | `/path/to/star_index/` |
Optional Fields
Field | Description | Example |
---|---|---|
`gtf` | Path to gene annotation file | `genes.gtf` |
`speciesName` | Full species name (for OrgDb) | `Homo sapiens` |
`isBarnyard` | Multi-species genome flag | `false` |
Example Configuration
{
"name": "GRCh38",
"star_index": "/path/to/star_index/",
"gtf": "genes.gtf",
"speciesName": "Homo sapiens",
"isBarnyard": false
}
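Before launching a run, it can help to confirm that a configuration file contains the two required fields. The sketch below is a rough text-level check under stated assumptions: `check_genome_json` is a hypothetical helper, it only looks for the key names `name` and `star_index`, and it does not parse or validate the JSON itself.

```shell
#!/usr/bin/env bash
# Hypothetical sanity check: confirms the required keys appear in genome.json.
# This is a text match, not a JSON parser, so malformed JSON is not detected.
check_genome_json() {
    local json="$1" key missing=0
    for key in name star_index; do
        if ! grep -Eq "\"${key}\"[[:space:]]*:" "$json"; then
            echo "missing required field: ${key}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (path is a placeholder):
#   check_genome_json /path/to/genome.json && echo "required fields present"
```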
File Paths
You can specify file paths in three ways:
- Absolute path: `/full/path/to/genome/`
- Relative path: `genes.gtf` (relative to the `genome.json` file location)
- AWS S3 URL: `s3://bucket/path/to/genome/`
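The three path forms can be illustrated with a small helper that shows how a relative entry would be interpreted against the location of `genome.json`. This is only a sketch of the rule described above, not the workflow's own resolution code; `resolve_ref_path` and the example paths are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the three path forms.
# $1: path to genome.json, $2: a path value taken from that file.
resolve_ref_path() {
    local genome_json="$1" path="$2"
    case "$path" in
        # Absolute paths and S3 URLs are used unchanged.
        /*|s3://*) printf '%s\n' "$path" ;;
        # Anything else is taken relative to the directory holding genome.json.
        *) printf '%s/%s\n' "$(dirname "$genome_json")" "$path" ;;
    esac
}

# Usage (paths are placeholders):
#   resolve_ref_path /refs/grch38/grch38.json genes.gtf
```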
Gene Annotation Considerations
GTF File Requirements
- Format: Standard GTF format
- Required entries: `exon` features (used by STAR)
- Included in analysis: All transcripts in the GTF will appear in your gene expression matrix
Filtering Annotations
To exclude certain gene types (e.g., pseudogenes, non-coding RNAs):
- Filter the GTF before building the STAR index
- Use biotype filtering if available in your annotation source
- Check the annotation source for quality control recommendations
Example filtering command:
# Filter to keep only protein-coding genes
# gene_type is the GENCODE attribute name; Ensembl GTFs use gene_biotype instead
grep -E "gene_type \"protein_coding\"" genes.gtf > filtered_genes.gtf
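A plain `grep` filter also drops any `#` comment lines at the top of the GTF. If you want to keep them, a variant along the following lines may be useful. It is a sketch under stated assumptions: `filter_gtf_by_biotype` is a hypothetical helper, the default attribute name `gene_type` follows the GENCODE convention, and the match is a simple substring test on each line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep comment/header lines plus records whose attribute
# column contains <attr> "<biotype>".
# $1: input GTF, $2: output GTF, $3: biotype (default protein_coding),
# $4: attribute name (default gene_type; use gene_biotype for Ensembl GTFs).
filter_gtf_by_biotype() {
    local in_gtf="$1" out_gtf="$2"
    local biotype="${3:-protein_coding}" attr="${4:-gene_type}"
    awk -v pat="${attr} \"${biotype}\"" \
        '/^#/ || index($0, pat) > 0' "$in_gtf" > "$out_gtf"
}

# Usage (file names are placeholders):
#   filter_gtf_by_biotype genes.gtf filtered_genes.gtf
#   filter_gtf_by_biotype genes.gtf filtered_genes.gtf protein_coding gene_biotype
```

As with the `grep` command, run the filter before building the STAR index so that the index and the annotation stay consistent.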
Multi-Species Genomes (Barnyard)
For experiments with mixed species (e.g., human + mouse), use the barnyard genome:
- Set `isBarnyard: true` in your `genome.json`
- Use the combined genome that includes both species
- The workflow will automatically separate reads by species
Example barnyard configuration:
{
"name": "GRCh38_mm39",
"star_index": "/path/to/barnyard_index/",
"gtf": "combined_genes.gtf",
"speciesName": "Homo sapiens + Mus musculus",
"isBarnyard": true
}
Troubleshooting
Common Issues
Problem | Solution |
---|---|
STAR index not found | Check the path in genome.json |
GTF file missing | Ensure the file exists and the path is correct |
Permission errors | Check file/directory permissions |
Memory issues | Use a machine with sufficient RAM for STAR |
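The first three rows of the table can be checked together before a run. The sketch below assumes local paths only (it does not handle `s3://` URLs), and `check_reference_files` is a hypothetical helper rather than anything shipped with the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for local reference files.
# $1: STAR index directory, $2: GTF file.
# Returns non-zero if either is missing or not readable by the current user.
check_reference_files() {
    local star_index="$1" gtf="$2" status=0
    if [ ! -d "$star_index" ]; then
        echo "STAR index directory not found: $star_index" >&2
        status=1
    elif [ ! -r "$star_index" ]; then
        echo "STAR index directory is not readable: $star_index" >&2
        status=1
    fi
    if [ ! -f "$gtf" ]; then
        echo "GTF file not found: $gtf" >&2
        status=1
    elif [ ! -r "$gtf" ]; then
        echo "GTF file is not readable: $gtf" >&2
        status=1
    fi
    return "$status"
}

# Usage (paths are placeholders):
#   check_reference_files /path/to/star_index /path/to/genes.gtf
```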
Further Reading
Example Genome Files
See the examples directory for complete genome configuration examples.
Need Help?
For more information, please contact support@scale.bio or visit our support website.