Reference Genomes

The ScaleBio RNA workflow requires a reference genome for analysis. This page explains how to set up and use reference genomes with the workflow.

A reference consists of two parts: the genome sequence, against which the RNA reads are aligned, and a gene annotation, which assigns aligned reads to genes so that gene expression can be quantified.

The workflow uses a simple JSON configuration file to define all genome-related settings, making it easy to switch between different species or genome versions.

Quick Start

Choose your genome: Use a pre-built genome (recommended) or create your own
Download and extract: Get the genome files and place them on your system
Run the workflow: Point to the genome configuration file using --genome

Example:

# Download and extract a pre-built human genome
wget http://scale.pub.s3.amazonaws.com/genomes/rna/grch38.tgz
tar -xzf grch38.tgz

# Run the workflow with this genome
nextflow run ScaleBio/ScaleRna --genome /path/to/grch38/grch38.json ...

Pre-built Genomes

We provide pre-built reference genomes for common species. These are ready to use and include all necessary files.

Available Genomes

Species	Genome Version	Download Link	Size
Human	GRCh38	Download	~3.5 GB
Mouse	mm39	Download	~2.8 GB
Human/Mouse Barnyard	GRCh38 + mm39	Download	~6.3 GB

Tip

Pre-built genomes are the easiest option for most users. They're tested and ready to use.

Creating Custom Genomes

If you need a genome that's not available pre-built, you can create your own.

Create a directory for your genome files and place the genome sequence (FASTA) and gene annotation (GTF) in it
Create a STAR index and place it in a subdirectory
Create genome.json with the required fields

Please see a more detailed explanation on how to generate a custom genome here.
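
The three steps above could be laid out as in the following sketch. The directory name `my_genome`, the genome name `my_species`, and the file names are placeholders chosen for illustration, not names the workflow requires. The STAR step is left as a comment because it is covered under "Building a STAR Index".

```shell
#!/usr/bin/env bash

# Hypothetical location for the custom genome; adjust to your system.
GENOME_DIR="${GENOME_DIR:-$PWD/my_genome}"

# Step 1: create the directory that will hold sequence, annotation and index.
mkdir -p "$GENOME_DIR/star_index"
# Copy your own files into it, for example:
#   cp /path/to/genome.fa /path/to/genes.gtf "$GENOME_DIR/"

# Step 2: build the STAR index into "$GENOME_DIR/star_index"
# (see "Building a STAR Index").

# Step 3: write genome.json with the two required fields plus the GTF.
# "genes.gtf" is given relative to the location of genome.json.
cat > "$GENOME_DIR/genome.json" <<EOF
{
  "name": "my_species",
  "star_index": "$GENOME_DIR/star_index/",
  "gtf": "genes.gtf"
}
EOF

echo "Wrote $GENOME_DIR/genome.json"
```

The workflow would then be pointed at this file with `--genome /path/to/my_genome/genome.json`.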

Building a STAR Index

Requirements

STAR version: ≥ 2.7.4a
Genome FASTA: Primary assembly recommended (no alt. haplotypes)
Gene annotation: GTF format with exon entries
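
The version requirement can be checked before building an index. The following is a sketch rather than part of the workflow: `version_ge` is a hypothetical helper, it assumes a `sort` that supports `-V` (version sort, as in GNU coreutils), and it assumes `STAR --version` prints a plain version string such as `2.7.10b`.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds if version $1 is >= version $2.
# Relies on `sort -V` (version sort), which is assumed to be available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

MIN_STAR_VERSION="2.7.4a"

if command -v STAR >/dev/null 2>&1; then
  # Assumes the output is a bare version string.
  installed="$(STAR --version)"
  if version_ge "$installed" "$MIN_STAR_VERSION"; then
    echo "STAR $installed meets the minimum ($MIN_STAR_VERSION)"
  else
    echo "STAR $installed is older than $MIN_STAR_VERSION" >&2
  fi
else
  echo "STAR not found on PATH" >&2
fi
```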

STAR --runMode genomeGenerate \
  --runThreadN 16 \
  --genomeDir /path/to/star_index \
  --genomeFastaFiles /path/to/genome.fa \
  --sjdbGTFfile /path/to/genes.gtf

Creating the Configuration File

All genome settings are defined in a genome.json file. This file tells the workflow where to find the reference files and how to interpret them.

Required Fields

Field	Description	Example
`name`	Species/genome version identifier	`GRCh38`, `mm39`
`star_index`	Path to the STAR index directory	`/path/to/star_index/`

Optional Fields

Field	Description	Example
`gtf`	Path to gene annotation file	`genes.gtf`
`speciesName`	Full species name (for OrgDb)	`Homo sapiens`
`isBarnyard`	Multi-species genome flag	`false`

Example Configuration

{
  "name": "GRCh38",
  "star_index": "/path/to/star_index/",
  "gtf": "genes.gtf",
  "speciesName": "Homo sapiens",
  "isBarnyard": false
}

File Paths

You can specify file paths in three ways:

Absolute path: /full/path/to/genome/
Relative path: genes.gtf (relative to the genome.json file location)
AWS S3 URL: s3://bucket/path/to/genome/
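
The three path styles can be illustrated with a short shell function. This is only a sketch of the rule described above, not the workflow's own code; `resolve_genome_path` is a hypothetical name and the `/data/grch38` and `/refs` locations are placeholders.

```shell
#!/usr/bin/env bash

# Hypothetical helper illustrating the three path styles.
#   $1: path to genome.json
#   $2: a path value taken from genome.json
resolve_genome_path() {
  local config="$1" value="$2"
  case "$value" in
    # Absolute path or S3 URL: used as-is.
    /*|s3://*) printf '%s\n' "$value" ;;
    # Anything else: interpreted relative to the directory of genome.json.
    *) printf '%s/%s\n' "$(dirname "$config")" "$value" ;;
  esac
}

resolve_genome_path /data/grch38/grch38.json genes.gtf
resolve_genome_path /data/grch38/grch38.json /refs/star_index/
resolve_genome_path /data/grch38/grch38.json s3://bucket/path/to/genome/
```

Under this rule, only the first call changes the value: `genes.gtf` becomes `/data/grch38/genes.gtf`, while the absolute path and the S3 URL are returned unchanged.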

Gene Annotation Considerations

GTF File Requirements

Format: Standard GTF format
Required entries: exon features (used by STAR)
Included in analysis: Every gene annotated in the GTF will appear in your gene expression matrix

Filtering Annotations

To exclude certain gene types (e.g., pseudogenes, non-coding RNAs):

Filter the GTF before building the STAR index
Use biotype filtering if available in your annotation source
Check the annotation source for quality control recommendations

Example filtering command:

# Keep only protein-coding entries.
# "gene_type" is the GENCODE attribute name; Ensembl GTFs use "gene_biotype".
grep -E "gene_type \"protein_coding\"" genes.gtf > filtered_genes.gtf
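
A plain `grep` also discards the `#` comment lines at the top of the GTF. The sketch below keeps those lines and accepts either attribute spelling (`gene_type` or `gene_biotype`). `filter_gtf` is a hypothetical helper, and the four-line input is toy data for illustration, not real annotation.

```shell
#!/usr/bin/env bash

# Hypothetical helper: keep header comments plus records of one biotype.
#   $1: input GTF   $2: biotype to keep (e.g. protein_coding)
filter_gtf() {
  awk -F '\t' -v biotype="$2" '
    # Keep header comment lines unchanged.
    /^#/ { print; next }
    # The attributes are in column 9; match gene_type or gene_biotype.
    $9 ~ ("gene_(bio)?type \"" biotype "\"") { print }
  ' "$1"
}

# Toy input for illustration only.
tmp="$(mktemp -d)"
{
  printf '##format: gtf\n'
  printf 'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";\n'
  printf 'chr1\tsrc\tgene\t200\t300\t.\t+\t.\tgene_id "G2"; gene_type "lncRNA";\n'
  printf 'chr1\tsrc\tgene\t400\t500\t.\t-\t.\tgene_id "G3"; gene_biotype "protein_coding";\n'
} > "$tmp/genes.gtf"

filter_gtf "$tmp/genes.gtf" protein_coding > "$tmp/filtered_genes.gtf"
cat "$tmp/filtered_genes.gtf"
```

With the toy input, the header line and the records for G1 and G3 are kept, and the lncRNA record for G2 is dropped. As noted above, the filtered file should be produced before the STAR index is built.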

Multi-Species Genomes (Barnyard)

For experiments with mixed species (e.g., human + mouse), use the barnyard genome:

Set isBarnyard: true in your genome.json
Use the combined genome that includes both species
The workflow will automatically separate reads by species

Example barnyard configuration:

{
  "name": "GRCh38_mm39",
  "star_index": "/path/to/barnyard_index/",
  "gtf": "combined_genes.gtf",
  "speciesName": "Homo sapiens + Mus musculus",
  "isBarnyard": true
}

Troubleshooting

Common Issues

Problem	Solution
STAR index not found	Check the path in `genome.json`
GTF file missing	Ensure the file exists and path is correct
Permission errors	Check file/directory permissions
Memory issues	Use a machine with sufficient RAM for STAR (roughly 10 bytes per genome base, so about 32 GB for a human genome)
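
The first three problems can usually be caught before launching the workflow with a quick pre-flight check. The following is a minimal sketch: `check_genome_files` is a hypothetical helper, it takes the two paths as arguments instead of parsing `genome.json`, and it only applies to a local (non-S3) genome. The test for `genomeParameters.txt` relies on STAR writing that file into the index directory it generates.

```shell
#!/usr/bin/env bash

# Hypothetical pre-flight check for a local genome.
#   $1: STAR index directory   $2: GTF file
# Prints one message per problem and returns non-zero if any was found.
check_genome_files() {
  local index_dir="$1" gtf="$2" status=0
  if [ ! -d "$index_dir" ]; then
    echo "STAR index not found: $index_dir" >&2
    status=1
  elif [ ! -r "$index_dir/genomeParameters.txt" ]; then
    echo "Not a readable STAR index (no genomeParameters.txt): $index_dir" >&2
    status=1
  fi
  if [ ! -e "$gtf" ]; then
    echo "GTF file missing: $gtf" >&2
    status=1
  elif [ ! -r "$gtf" ]; then
    echo "GTF file not readable (check permissions): $gtf" >&2
    status=1
  fi
  return "$status"
}

# Example call with placeholder paths.
check_genome_files /path/to/star_index /path/to/genes.gtf \
  || echo "Fix the issues above before running the workflow"
```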

Example Genome Files

See the examples directory for complete genome configuration examples.

Need Help?

For more information, please contact support@scale.bio or visit our support website.