
Custom Reference Genome Setup

The ScaleBio RNA workflow supports custom reference genomes for non-standard analyses. This guide explains how to build, configure, and use your own genome and annotation files, including technical details and best practices.


1. Overview

  • The workflow requires a STAR (v2.7.4 or higher) genome index, built from a genome FASTA and a gene annotation GTF file.
  • You can use pre-built genomes (recommended for human/mouse) or create your own for other species, custom assemblies, or to add transgenes.


2. Pre-built Reference Genomes

ScaleBio provides pre-built reference genomes for:

  • Human (GRCh38): Download
  • Mouse (mm39): Download
  • Human/Mouse Barnyard: Download

Instructions:

  1. Download and unpack the archive:

tar -xzf grch38.tgz

  2. Use the included JSON file (e.g. grch38/grch38.json) with --genome or in your runParams.yml.

Important

Do not pass the download URLs directly to the --genome parameter; always download and unpack the archive locally first. The example genome (docs/examples/genome.json) is for test runs only and should not be used for real data.


3. Building a Custom STAR Genome Index

3.1. Prepare Input Files

  • Genome FASTA: Should contain only primary contigs (no alt haplotypes or patches).
  • Gene Annotation GTF: Must match the genome version and chromosome naming (e.g. chr1 vs 1).
  • Filter GTF to exclude unwanted biotypes (e.g. pseudogenes) if needed.
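
The biotype filtering step above can be sketched in Python. This is a minimal illustration, not the filter ScaleBio uses for its pre-built genomes: the function names and the set of excluded biotypes are assumptions chosen for the example. It relies only on the fact that Ensembl GTFs carry a gene_biotype attribute and GENCODE GTFs carry gene_type.

```python
import re

# Biotypes to drop. This set is an illustrative assumption, not a ScaleBio default.
EXCLUDED_BIOTYPES = frozenset({"processed_pseudogene", "unprocessed_pseudogene"})

# Ensembl GTFs use gene_biotype; GENCODE GTFs use gene_type.
_BIOTYPE_RE = re.compile(r'gene_(?:biotype|type) "([^"]+)"')


def keep_gtf_line(line, excluded=EXCLUDED_BIOTYPES):
    """Return True if a GTF line should be kept."""
    if line.startswith("#"):
        return True  # keep header comments
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return True  # leave malformed lines for downstream tools to report
    match = _BIOTYPE_RE.search(fields[8])
    # Records without a biotype attribute are kept unchanged.
    return match is None or match.group(1) not in excluded


def filter_gtf(in_path, out_path, excluded=EXCLUDED_BIOTYPES):
    """Copy in_path to out_path, dropping records with an excluded biotype.

    Returns the number of lines written.
    """
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if keep_gtf_line(line, excluded):
                dst.write(line)
                kept += 1
    return kept
```

Calling filter_gtf("genes.gtf", "genes.biotypeFiltered.gtf") would write the filtered annotation; both file names are placeholders. Check the biotype names in your own GTF before choosing what to exclude.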

3.2. Example STAR Index Command

STAR --runMode genomeGenerate \
  --runThreadN 16 \
  --genomeDir star.ref \
  --genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
  --sjdbGTFfile Homo_sapiens.GRCh38.103.biotypeFiltered.gtf
  • star.ref is the output directory; the workflow reads the index from it at analysis time.
  • Keep the entire directory intact and accessible to the workflow, not just individual files.
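
Because the whole index directory has to be present, a quick check before launching the workflow can catch an incomplete copy. The sketch below is a hypothetical helper, not part of the ScaleBio workflow. It looks for a handful of files that STAR writes during genomeGenerate; the list is a deliberately small subset, so an empty result does not prove the index is valid.

```python
from pathlib import Path

# A subset of the files STAR writes into --genomeDir, used here as a sanity check.
EXPECTED_INDEX_FILES = (
    "Genome",
    "SA",
    "SAindex",
    "chrName.txt",
    "genomeParameters.txt",
)


def missing_index_files(index_dir, expected=EXPECTED_INDEX_FILES):
    """Return the expected STAR index files that are absent from index_dir."""
    root = Path(index_dir)
    return [name for name in expected if not (root / name).is_file()]
```

For example, missing_index_files("star.ref") returning an empty list suggests the directory was copied in full.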

Tip

See the STAR manual for advanced options.


4. Creating the Genome Configuration File (genome.json)

The workflow uses a JSON file to specify genome resources. Example:

{
  "name": "MyGenome",
  "star_index": "/path/to/star.ref/",
  "gtf": "/path/to/genes.gtf",
  "speciesName": "Custom species",
  "isBarnyard": false
}
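
If you generate this file from a script, something like the following may help. It is a sketch that assumes the five keys in the example above are all you need; the function names are hypothetical, and the path check is an added convenience rather than something the workflow requires.

```python
import json
from pathlib import Path


def build_genome_config(name, star_index, gtf, species_name, is_barnyard=False):
    """Return a dict with the same keys as the example genome.json."""
    return {
        "name": name,
        "star_index": str(star_index),
        "gtf": str(gtf),
        "speciesName": species_name,
        "isBarnyard": bool(is_barnyard),
    }


def write_genome_config(config, out_path, check_paths=True):
    """Write config as JSON, optionally verifying the referenced paths first."""
    if check_paths:
        if not Path(config["star_index"]).is_dir():
            raise FileNotFoundError(f"STAR index directory not found: {config['star_index']}")
        if not Path(config["gtf"]).is_file():
            raise FileNotFoundError(f"GTF file not found: {config['gtf']}")
    Path(out_path).write_text(json.dumps(config, indent=2) + "\n")
```

Absolute paths are used in the example so that the file does not depend on the directory the workflow is launched from.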

5. Adding Extra Sequences or Transgenes

If your experiment includes a transgene or custom sequence:

5.1. Add to FASTA

Append the sequence as a new contig:

>chr1
TAACCCTAACCCTAACC...
>transgene1
ATGGTGAGCAAGGGCGAGGAGGATAACAT
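
Appending by hand is easy to get wrong on a multi-gigabyte FASTA (for example, a missing trailing newline merges the new header into the last sequence line). The helper below is an illustrative sketch with a hypothetical name. It refuses to add a contig whose name already exists and returns the sequence length, which is useful for the GTF entry in the next step.

```python
def append_contig(fasta_path, contig_name, sequence, line_width=60):
    """Append sequence to fasta_path as a new contig and return its length."""
    sequence = "".join(sequence.split()).upper()  # drop whitespace and newlines
    if not sequence:
        raise ValueError("sequence is empty")

    existing = set()
    last_line = ""
    with open(fasta_path) as handle:
        for line in handle:
            last_line = line
            if line.startswith(">"):
                parts = line[1:].split()
                if parts:
                    existing.add(parts[0])  # contig name is the text up to the first space
    if contig_name in existing:
        raise ValueError(f"contig {contig_name!r} already exists in {fasta_path}")

    with open(fasta_path, "a") as handle:
        if last_line and not last_line.endswith("\n"):
            handle.write("\n")  # make sure the new header starts on its own line
        handle.write(f">{contig_name}\n")
        for start in range(0, len(sequence), line_width):
            handle.write(sequence[start:start + line_width] + "\n")
    return len(sequence)
```

Work on a copy of the genome FASTA so the original download stays unmodified.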

5.2. Add to GTF

Add an exon entry for the new contig:

transgene1 custom exon 1 1000 . + . gene_id "transgene1"; transcript_id "transgene1";
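  • This counts all reads mapping to the transgene as expression.
  • In the actual file the nine GTF fields must be separated by tabs (they are shown with spaces here). Set the end coordinate (1000 in this example) to the length of your transgene sequence so the exon spans the whole contig.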
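
A single-exon record like the one above can be generated from the contig name and sequence length. This is a minimal sketch with a hypothetical function name; it reuses the contig name as both gene_id and transcript_id, as in the example, and joins the nine GTF columns with tabs.

```python
def transgene_gtf_line(contig, length, gene_id=None, source="custom", strand="+"):
    """Return one tab-separated GTF exon record covering the whole contig."""
    if length < 1:
        raise ValueError("length must be at least 1")
    gene_id = gene_id or contig
    attributes = f'gene_id "{gene_id}"; transcript_id "{gene_id}";'
    # GTF columns: seqname, source, feature, start, end, score, strand, frame, attributes
    fields = [contig, source, "exon", "1", str(length), ".", strand, ".", attributes]
    return "\t".join(fields)
```

Append the returned line, followed by a newline, to the end of the GTF that you pass to STAR.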

5.3. Rebuild STAR Index

Use the modified FASTA and GTF to build a new index (see above).


6. Technical Notes & Best Practices

  • GTF requirements: Only exon features are used by STAR. All fields are case-sensitive.
  • Gene naming: gene_id, transcript_id, and optionally gene_name are used for output.
  • Chromosome naming: Ensure consistency between FASTA and GTF (e.g. chr1 vs 1).
  • Filtering: To exclude certain gene types, filter the GTF before index generation.
  • Transgenes: Ensure the sequence is unique enough to avoid ambiguous mapping.
  • STAR version: Use v2.7.4 or higher for compatibility.
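
The chromosome-naming check above is worth automating, since a chr1 vs 1 mismatch is easy to miss by eye. The following sketch uses hypothetical helper names and reports contig names that appear in the GTF but not in the FASTA. It assumes plain-text, uncompressed files.

```python
def fasta_contig_names(fasta_path):
    """Collect contig names (header text up to the first whitespace)."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                if parts:
                    names.add(parts[0])
    return names


def gtf_contig_names(gtf_path):
    """Collect the first-column values of all non-comment GTF records."""
    names = set()
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def gtf_contigs_missing_from_fasta(fasta_path, gtf_path):
    """Return a sorted list of GTF contig names that the FASTA does not contain."""
    return sorted(gtf_contig_names(gtf_path) - fasta_contig_names(fasta_path))
```

An empty list means every annotated contig exists in the FASTA. A non-empty list is not always an error (for example, when the annotation covers alt contigs you removed on purpose), but it should be reviewed before building the index.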


Need Help?

For more information, please contact support@scale.bio or visit our support website.