Reference Genomes

The ScaleBio RNA workflow requires a reference genome for analysis. This page explains how to set up and use reference genomes with the workflow.

A reference consists of two parts: the genome sequence, used to align the RNA reads, and a gene annotation, used to assign aligned reads to genes and quantify gene expression.

The workflow uses a simple JSON configuration file to define all genome-related settings, making it easy to switch between different species or genome versions.


Quick Start

  1. Choose your genome: Use a pre-built genome (recommended) or create your own
  2. Download and extract: Get the genome files and place them on your system
  3. Run the workflow: Point to the genome configuration file using --genome

Example:

# Download and extract a pre-built human genome
wget http://scale.pub.s3.amazonaws.com/genomes/rna/grch38.tgz
tar -xzf grch38.tgz

# Run the workflow with this genome
nextflow run ScaleBio/ScaleRna --genome /path/to/grch38/grch38.json ...

Pre-built Genomes

We provide pre-built reference genomes for common species. These are ready to use and include all necessary files.

Available Genomes

| Species | Genome Version | Download Link | Size |
|---|---|---|---|
| Human | GRCh38 | Download | ~3.5 GB |
| Mouse | mm39 | Download | ~2.8 GB |
| Human/Mouse Barnyard | GRCh38 + mm39 | Download | ~6.3 GB |

Tip

Pre-built genomes are the easiest option for most users. They're tested and ready to use.

Creating Custom Genomes

If you need a genome that's not available pre-built, you can create your own.

  1. Create a directory for your genome files and gather sequence (.fasta) and annotation (.gtf)
  2. Create a STAR index and place it in a subdirectory
  3. Create genome.json with the required fields

Please see a more detailed explanation on how to generate a custom genome here.
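As a rough illustration of the three steps above, a custom genome directory could be laid out as follows. The directory name `my_genome` and the relative `star_index/` path are hypothetical placeholders; the layout is an assumption for illustration, not a requirement of the workflow.

```shell
# Hypothetical layout; names are placeholders, not workflow requirements.
genome_dir="my_genome"

# Step 1: directory for the sequence (.fasta) and annotation (.gtf) files
mkdir -p "$genome_dir"

# Step 2: subdirectory that will hold the STAR index
mkdir -p "$genome_dir/star_index"

# Step 3: genome.json with the two required fields (name, star_index).
# The star_index path here is written relative to genome.json.
cat > "$genome_dir/genome.json" <<'EOF'
{
  "name": "my_genome",
  "star_index": "star_index/"
}
EOF

# Show the resulting layout
ls -R "$genome_dir"
```

The FASTA and GTF files would then be copied into `my_genome/` before building the STAR index.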


Building a STAR Index

Requirements

  • STAR version: ≥ 2.7.4a
  • Genome FASTA: Primary assembly recommended (no alt. haplotypes)
  • Gene annotation: GTF format with exon entries

Example command:

STAR --runMode genomeGenerate \
  --runThreadN 16 \
  --genomeDir /path/to/star_index \
  --genomeFastaFiles /path/to/genome.fa \
  --sjdbGTFfile /path/to/genes.gtf

Creating the Configuration File

All genome settings are defined in a genome.json file. This file tells the workflow where to find the reference files and how to interpret them.

Required Fields

| Field | Description | Example |
|---|---|---|
| name | Species/genome version identifier | GRCh38, mm39 |
| star_index | Path to the STAR index directory | /path/to/star_index/ |

Optional Fields

| Field | Description | Example |
|---|---|---|
| gtf | Path to gene annotation file | genes.gtf |
| speciesName | Full species name (for OrgDb) | Homo sapiens |
| isBarnyard | Multi-species genome flag | false |

Example Configuration

{
  "name": "GRCh38",
  "star_index": "/path/to/star_index/",
  "gtf": "genes.gtf",
  "speciesName": "Homo sapiens",
  "isBarnyard": false
}
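Before running the workflow, it can help to confirm that a configuration file contains the two required fields. The sketch below is one way to do that with `grep` alone. The function name `check_genome_json` is hypothetical, and the check only looks for the key names as text; it does not parse or validate the JSON.

```shell
# Hypothetical helper: confirm the required keys appear in a genome.json.
# This is a plain-text check with grep, not a full JSON validation.
check_genome_json() {
  config="$1"
  rc=0
  for field in name star_index; do
    # Look for "field" followed by optional whitespace and a colon
    if ! grep -q "\"$field\"[[:space:]]*:" "$config"; then
      echo "missing required field: $field" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage with a throwaway example file
tmp_config="$(mktemp)"
cat > "$tmp_config" <<'EOF'
{
  "name": "GRCh38",
  "star_index": "/path/to/star_index/",
  "gtf": "genes.gtf"
}
EOF
check_genome_json "$tmp_config" && echo "required fields present"
```

If `jq` or another JSON tool is available, a real parser would be a more robust choice than this text match.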

File Paths

You can specify file paths in three ways:

  • Absolute path: /full/path/to/genome/
  • Relative path: genes.gtf (relative to the genome.json file location)
  • AWS S3 URL: s3://bucket/path/to/genome/
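To make the three path styles concrete, the sketch below shows how a value from genome.json could be turned into a final location. The function `resolve_genome_path` and the `/refs/grch38/` directory are hypothetical; this illustrates the rule described above and is not the workflow's own resolution code.

```shell
# Hypothetical helper illustrating the three path styles described above.
# It is a sketch, not the workflow's own resolution logic.
resolve_genome_path() {
  config="$1"   # path to genome.json
  value="$2"    # path value taken from genome.json
  case "$value" in
    s3://*) printf '%s\n' "$value" ;;   # AWS S3 URL: used as-is
    /*)     printf '%s\n' "$value" ;;   # absolute path: used as-is
    *)      printf '%s/%s\n' "$(dirname "$config")" "$value" ;;   # relative to genome.json
  esac
}

resolve_genome_path /refs/grch38/grch38.json genes.gtf
resolve_genome_path /refs/grch38/grch38.json /full/path/to/genome/
resolve_genome_path /refs/grch38/grch38.json s3://bucket/path/to/genome/
```

Only the relative value changes: `genes.gtf` is joined to the directory that holds the configuration file, while the absolute path and the S3 URL pass through untouched.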

Gene Annotation Considerations

GTF File Requirements

  • Format: Standard GTF format
  • Required entries: exon features (used by STAR)
  • Included in analysis: All genes annotated in the GTF will appear in your gene expression matrix

Filtering Annotations

To exclude certain gene types (e.g., pseudogenes, non-coding RNAs):

  1. Filter the GTF before building the STAR index
  2. Use biotype filtering if available in your annotation source
  3. Check the annotation source for quality control recommendations

Example filtering command:

# Filter to keep only protein-coding genes
# (GENCODE GTFs use the gene_type attribute; Ensembl GTFs use gene_biotype)
grep -E "gene_type \"protein_coding\"" genes.gtf > filtered_genes.gtf
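A variation on the command above, shown as a sketch: it keeps the `#` header lines of the GTF and takes the attribute name as an argument, since that name differs between annotation sources. The function name `filter_gtf_by_biotype` and the three-line sample annotation are hypothetical.

```shell
# Sketch: keep header lines plus records whose attribute column (column 9)
# contains the requested biotype. The attribute name is an argument because
# it differs between sources (gene_type in GENCODE, gene_biotype in Ensembl).
filter_gtf_by_biotype() {
  attr="$1"
  biotype="$2"
  gtf="$3"
  awk -F '\t' -v pat="$attr \"$biotype\"" '/^#/ || index($9, pat)' "$gtf"
}

# Usage on a tiny made-up annotation
tmp_gtf="$(mktemp)"
{
  printf '##description: example annotation\n'
  printf 'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";\n'
  printf 'chr1\tHAVANA\texon\t300\t400\t.\t+\t.\tgene_id "G2"; gene_type "lncRNA";\n'
} > "$tmp_gtf"

filter_gtf_by_biotype gene_type protein_coding "$tmp_gtf"
```

On the sample file this keeps the header line and the `G1` exon and drops the `G2` exon. As with the `grep` version, filter the GTF before building the STAR index.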

Multi-Species Genomes (Barnyard)

For experiments with mixed species (e.g., human + mouse), use the barnyard genome:

  1. Set isBarnyard: true in your genome.json
  2. Use the combined genome that includes both species
  3. The workflow will automatically separate reads by species

Example barnyard configuration:

{
  "name": "GRCh38_mm39",
  "star_index": "/path/to/barnyard_index/",
  "gtf": "combined_genes.gtf",
  "speciesName": "Homo sapiens + Mus musculus",
  "isBarnyard": true
}

Troubleshooting

Common Issues

| Problem | Solution |
|---|---|
| STAR index not found | Check the path in genome.json |
| GTF file missing | Ensure the file exists and path is correct |
| Permission errors | Check file/directory permissions |
| Memory issues | Use a machine with sufficient RAM for STAR |
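The first three problems in the table can often be caught before a run with a short check of the reference files. The sketch below is one possible approach. The function name `preflight_check` is hypothetical, the paths are passed in by hand rather than read from genome.json, and it only handles local paths, not S3 URLs.

```shell
# Hypothetical pre-flight check for local reference files.
# Arguments: STAR index directory, GTF file.
preflight_check() {
  star_index="$1"
  gtf="$2"
  rc=0
  if [ ! -d "$star_index" ]; then
    echo "STAR index not found: $star_index (check the path in genome.json)" >&2
    rc=1
  elif [ ! -r "$star_index" ]; then
    echo "Permission error: cannot read $star_index" >&2
    rc=1
  fi
  if [ ! -f "$gtf" ]; then
    echo "GTF file missing: $gtf" >&2
    rc=1
  elif [ ! -r "$gtf" ]; then
    echo "Permission error: cannot read $gtf" >&2
    rc=1
  fi
  return "$rc"
}

# Usage with throwaway files standing in for a real genome directory
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/star_index"
touch "$demo_dir/genes.gtf"
preflight_check "$demo_dir/star_index" "$demo_dir/genes.gtf" && echo "reference files look OK"
```

The function returns a non-zero status when any check fails, so it can gate a launch script.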

Further Reading

Example Genome Files

See the examples directory for complete genome configuration examples.


Need Help?

For more information, please contact support@scale.bio or visit our support website.