Custom Reference Genome Setup

The ScaleBio RNA workflow supports custom reference genomes for non-standard analyses. This guide explains how to build, configure, and use your own genome and annotation files, including technical details and best practices.

1. Overview

The workflow requires a STAR (v2.7.4 or higher) genome index, built from a genome FASTA and a gene annotation GTF file.
You can use pre-built genomes (recommended for human/mouse) or create your own for other species, custom assemblies, or to add transgenes.

2. Using Pre-built Genomes (Recommended)

ScaleBio provides pre-built reference genomes for:
- Human (GRCh38): Download
- Mouse (mm39): Download
- Human/Mouse Barnyard: Download

Instructions:
1. Download and unpack:

   tar -xzf grch38.tgz

2. Use the included JSON file (e.g. grch38/grch38.json) with --genome or in your runParams.yml.

Important

Do not use the URLs directly in the --genome parameter. Always unpack locally first. The example genome (docs/examples/genome.json) is for test runs only and should not be used for real data.

3. Building a Custom STAR Genome Index

3.1. Prepare Input Files

Genome FASTA: Should contain only primary contigs (no alt haplotypes or patches).
Gene Annotation GTF: Must match the genome version and chromosome naming (e.g. chr1 vs 1).
Filter the GTF to exclude unwanted biotypes (e.g. pseudogenes) if needed.
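
A mismatch in chromosome naming is easy to miss until reads fail to be assigned to genes. The following Python sketch compares contig names between the two input files before the index is built. The function names are illustrative and not part of the workflow, and it assumes uncompressed, tab-separated input files.

```python
def fasta_contigs(fasta_path):
    """Return the set of contig names (first word of each '>' header line)."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_contigs(gtf_path):
    """Return the set of sequence names in column 1 of a GTF, skipping comments."""
    names = set()
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def missing_from_fasta(fasta_path, gtf_path):
    """List contigs that are annotated in the GTF but absent from the FASTA."""
    return sorted(gtf_contigs(gtf_path) - fasta_contigs(fasta_path))
```

An empty list from missing_from_fasta means every annotated contig has a sequence. A list containing names such as 1 or MT when the FASTA uses chr1 and chrM points to a naming mismatch.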

3.2. Example STAR Index Command

STAR --runMode genomeGenerate \
  --runThreadN 16 \
  --genomeDir star.ref \
  --genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
  --sjdbGTFfile Homo_sapiens.GRCh38.103.biotypeFiltered.gtf

star.ref is the output directory that holds the generated index; the workflow requires it for analysis.
The entire directory, not just individual files from it, must be accessible to the workflow.
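
Because the whole directory is needed, a quick completeness check after copying or moving an index can catch truncated transfers. The sketch below looks only for a few of the core files that STAR writes during genomeGenerate; the list is deliberately partial, and the function name is illustrative.

```python
from pathlib import Path

# A subset of the files STAR writes when generating an index.
# STAR produces additional files; this list is intentionally not exhaustive.
CORE_INDEX_FILES = ("Genome", "SA", "SAindex", "genomeParameters.txt", "chrName.txt")


def missing_index_files(index_dir):
    """Return the core index files that are absent from index_dir."""
    index_dir = Path(index_dir)
    return [name for name in CORE_INDEX_FILES if not (index_dir / name).is_file()]
```

Calling missing_index_files("star.ref") returns an empty list when all of the listed files are present.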

Tip

See the STAR manual for advanced options.

4. Creating the Genome Configuration File (`genome.json`)

The workflow uses a JSON file to specify genome resources. Example:

{
  "name": "MyGenome",
  "star_index": "/path/to/star.ref/",
  "gtf": "/path/to/genes.gtf",
  "speciesName": "Custom species",
  "isBarnyard": false
}

See the Reference Genomes documentation for all options and details.
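
Hand-edited JSON files often end up with a typo in a path. As a minimal sketch, the configuration can be generated from Python with a check that both paths exist. It writes only the five fields shown in the example above. The helper name write_genome_json is hypothetical, and writing absolute paths is a choice made here, not a documented requirement.

```python
import json
from pathlib import Path


def write_genome_json(out_path, name, star_index, gtf, species_name, is_barnyard=False):
    """Write a genome.json with the fields from the example, after checking paths."""
    star_index = Path(star_index)
    gtf = Path(gtf)
    if not star_index.is_dir():
        raise FileNotFoundError(f"STAR index directory not found: {star_index}")
    if not gtf.is_file():
        raise FileNotFoundError(f"GTF file not found: {gtf}")
    config = {
        "name": name,
        "star_index": str(star_index.resolve()),  # absolute path (assumption)
        "gtf": str(gtf.resolve()),
        "speciesName": species_name,
        "isBarnyard": is_barnyard,
    }
    Path(out_path).write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For example, write_genome_json("genome.json", "MyGenome", "star.ref", "genes.gtf", "Custom species") writes the file and returns the dictionary that was saved.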

5. Adding Extra Sequences or Transgenes

If your experiment includes a transgene or custom sequence:

5.1. Add to FASTA

Append the sequence as a new contig:

>chr1
TAACCCTAACCCTAACC...
>transgene1
ATGGTGAGCAAGGGCGAGGAGGATAACAT
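
Appending by hand can produce a duplicate contig name, or a header glued to the previous sequence line if the file lacks a trailing newline. The following sketch guards against both. The function name and the 60-character line width are illustrative choices, and the FASTA is assumed to be uncompressed.

```python
def append_contig(fasta_path, name, sequence, width=60):
    """Append a new FASTA record and return its sequence length."""
    sequence = "".join(sequence.split()).upper()  # strip whitespace and line breaks
    if not sequence:
        raise ValueError("sequence is empty")

    # Collect existing contig names so the new one cannot collide with them.
    existing = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    existing.add(fields[0])
    if name in existing:
        raise ValueError(f"contig {name!r} already exists in {fasta_path}")

    # Check whether the file already ends with a newline.
    with open(fasta_path, "rb") as handle:
        handle.seek(0, 2)
        size = handle.tell()
        if size:
            handle.seek(-1, 2)
            needs_newline = handle.read(1) != b"\n"
        else:
            needs_newline = False

    with open(fasta_path, "a") as handle:
        if needs_newline:
            handle.write("\n")
        handle.write(f">{name}\n")
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")
    return len(sequence)
```

The returned length is the value to use as the exon end coordinate when adding the matching GTF entry.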

5.2. Add to GTF

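Add an exon entry for the new contig. GTF fields must be tab-separated, and the end coordinate (1000 in this example) must not exceed the length of the transgene sequence: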

transgene1 custom exon 1 1000 . + . gene_id "transgene1"; transcript_id "transgene1";

Because the single exon spans the contig, all reads mapping to the transgene are counted as expression.
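
To keep the exon coordinates in step with the sequence, the GTF line can be generated from the contig length. This is a sketch under the same conventions as the example entry above: a single exon from position 1 to the end, with gene_id and transcript_id both set to the contig name. The function name and the default source label are illustrative.

```python
def transgene_gtf_line(contig, length, source="custom", strand="+"):
    """Build one tab-separated GTF exon line covering the whole contig."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    attributes = f'gene_id "{contig}"; transcript_id "{contig}";'
    # GTF coordinates are 1-based and inclusive, so the exon ends at `length`.
    fields = [contig, source, "exon", "1", str(length), ".", strand, ".", attributes]
    return "\t".join(fields)
```

The resulting line can then be appended to the GTF, for example with open("genes.gtf", "a") followed by a write of the line plus a newline.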

5.3. Rebuild STAR Index

Use the modified FASTA and GTF to build a new index with the command in section 3.2.

6. Technical Notes & Best Practices

GTF requirements: Only exon features are used by STAR. All fields are case-sensitive.
Gene naming: gene_id, transcript_id, and optionally gene_name are used for output.
Chromosome naming: Ensure consistency between FASTA and GTF (e.g. chr1 vs 1).
Filtering: To exclude certain gene types, filter the GTF before index generation.
Transgenes: Ensure the sequence is unique enough to avoid ambiguous mapping.
STAR version: Use v2.7.4 or higher for compatibility.
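
The filtering step mentioned above can be sketched as a small Python function that drops GTF records by gene biotype before index generation. It assumes the biotype is stored in a gene_biotype attribute (Ensembl style) or a gene_type attribute (GENCODE style); records without either attribute are kept. The function name and the excluded biotypes are illustrative.

```python
import re

# Matches gene_biotype "x" (Ensembl) or gene_type "x" (GENCODE) in column 9.
BIOTYPE_RE = re.compile(r'\b(?:gene_biotype|gene_type) "([^"]+)"')


def filter_gtf_by_biotype(in_path, out_path, excluded):
    """Copy a GTF, dropping records whose gene biotype is in `excluded`.

    Returns a (kept, dropped) tuple counting non-comment lines.
    """
    excluded = set(excluded)
    kept = dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)  # keep header comments unchanged
                continue
            fields = line.rstrip("\n").split("\t")
            match = BIOTYPE_RE.search(fields[8]) if len(fields) >= 9 else None
            if match and match.group(1) in excluded:
                dropped += 1
                continue
            dst.write(line)
            kept += 1
    return kept, dropped
```

For example, filter_gtf_by_biotype("genes.gtf", "genes.filtered.gtf", {"pseudogene"}) writes a filtered copy and reports how many records were kept and dropped. Biotype names differ between annotation sources, so check the values present in your GTF before choosing what to exclude.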

7. References & Further Reading

Need Help?

For more information, please contact support@scale.bio or visit our support website.