Custom Reference Genome Setup
The ScaleBio RNA workflow supports custom reference genomes for non-standard analyses. This guide explains how to build, configure, and use your own genome and annotation files, including technical details and best practices.
1. Overview
- The workflow requires a STAR (v2.7.4 or higher) genome index, built from a genome FASTA and a gene annotation GTF file.
- You can use pre-built genomes (recommended for human/mouse) or create your own for other species, custom assemblies, or to add transgenes.
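Because the workflow needs STAR v2.7.4 or higher, it can help to check the installed version before building an index. The following is a minimal sketch and not part of the ScaleBio workflow: the function names `version_ge` and `check_star_version` are hypothetical, and it assumes `sort -V` (version sort) is available and that `STAR --version` prints a version string, optionally prefixed with `STAR_`.

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Assumes a sort that supports -V (version sort), e.g. GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the STAR found on PATH against the minimum version named in this guide.
check_star_version() {
    min="2.7.4"
    if ! command -v STAR >/dev/null 2>&1; then
        echo "STAR not found on PATH" >&2
        return 1
    fi
    found="$(STAR --version)"
    found="${found#STAR_}"   # drop a leading "STAR_" prefix if present
    if version_ge "$found" "$min"; then
        echo "STAR $found meets the minimum ($min)"
    else
        echo "STAR $found is older than $min" >&2
        return 1
    fi
}
```

Calling `check_star_version` before index generation would then stop early on an outdated installation.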
2. Using Pre-built Genomes (Recommended)
ScaleBio provides pre-built reference genomes for:
- Human (GRCh38): Download
- Mouse (mm39): Download
- Human/Mouse Barnyard: Download
Instructions:
1. Download and unpack:
tar -xzf grch38.tgz
2. Use the included JSON file (e.g. grch38/grch38.json) with --genome or in your runParams.yml.
Important
Do not use the URLs directly in the --genome parameter. Always unpack locally first. The example genome (docs/examples/genome.json) is for test runs only and should not be used for real data.
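The unpack step can be wrapped in a small helper that also reports where the genome JSON ended up. This is only a sketch under stated assumptions: `unpack_genome` is a hypothetical name, and it assumes the archive contains a single top-level directory holding the JSON file, as in the grch38/grch38.json example above.

```shell
# Unpack a downloaded genome archive into a local directory and print the
# path of any JSON file found up to two levels below it.
# Usage: unpack_genome grch38.tgz /path/to/genomes
unpack_genome() {
    tarball="$1"
    dest="${2:-.}"
    mkdir -p "$dest" || return 1
    tar -xzf "$tarball" -C "$dest" || return 1
    find "$dest" -maxdepth 2 -type f -name '*.json'
}
```

The printed path is the local file to pass to --genome. If the destination already holds other genomes, their JSON files are listed as well.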
3. Building a Custom STAR Genome Index
3.1. Prepare Input Files
- Genome FASTA: Should contain only primary contigs (no alt haplotypes or patches).
- Gene Annotation GTF: Must match the genome version and chromosome naming (e.g. chr1 vs 1).
  - Filter GTF to exclude unwanted biotypes (e.g. pseudogenes) if needed.
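A quick way to catch a chromosome naming mismatch (such as chr1 in the FASTA and 1 in the GTF) is to compare the contig names in both files before building the index. The sketch below is illustrative only: `missing_contigs` is a hypothetical helper, and it assumes an uncompressed FASTA and a tab-separated GTF with nine columns.

```shell
# Print contig names that appear in the GTF but not in the FASTA.
# Empty output means every annotated contig exists in the genome.
# Usage: missing_contigs genome.fa genes.gtf
missing_contigs() {
    awk -F '\t' '
        # First file (FASTA): record each header name up to the first whitespace.
        NR == FNR {
            if (/^>/) { name = substr($0, 2); sub(/[ \t].*/, "", name); in_fasta[name] = 1 }
            next
        }
        # Second file (GTF): skip comments and lines without nine columns.
        /^#/ || NF < 9 { next }
        # Report each unknown contig once.
        !($1 in in_fasta) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}
```

Any name printed here points to annotation records on a contig that is missing from the FASTA, which usually means the two files use different naming schemes or genome versions.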
3.2. Example STAR Index Command
STAR --runMode genomeGenerate \
--runThreadN 16 \
--genomeDir star.ref \
--genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--sjdbGTFfile Homo_sapiens.GRCh38.103.biotypeFiltered.gtf
- star.ref is the output directory (required for analysis).
- The entire directory must be available for the workflow.
Tip
See the STAR manual for advanced options.
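Since the workflow needs the whole index directory, a simple completeness check after the build can catch an interrupted run. This is a hedged sketch: `check_star_index` is a hypothetical helper, and the file list covers only a few of the files STAR normally writes to the genome directory, so it is a sanity check rather than a full validation.

```shell
# Verify that a STAR genome directory contains some of the core index files.
# Usage: check_star_index star.ref
check_star_index() {
    dir="$1"
    status=0
    for f in Genome SA SAindex genomeParameters.txt chrName.txt; do
        # -s is true only for files that exist and are not empty.
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

A non-zero exit status suggests the index should be rebuilt before it is referenced from a genome configuration file.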
4. Creating the Genome Configuration File (genome.json)
The workflow uses a JSON file to specify genome resources. Example:
{
"name": "MyGenome",
"star_index": "/path/to/star.ref/",
"gtf": "/path/to/genes.gtf",
"speciesName": "Custom species",
"isBarnyard": false
}
- See Reference Genomes documentation for all options and details.
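The configuration file can also be generated from a script, which makes it easy to confirm that the referenced paths exist first. The sketch below is an assumption-laden illustration: `write_genome_json` is a hypothetical helper, it emits only the five fields shown in the example above, and it does not escape quotes or backslashes in its arguments.

```shell
# Print a genome.json document for a custom genome after checking its paths.
# Usage: write_genome_json NAME STAR_INDEX_DIR GTF SPECIES_NAME > genome.json
write_genome_json() {
    name="$1"; star_index="$2"; gtf="$3"; species="$4"
    if [ ! -d "$star_index" ]; then
        echo "STAR index directory not found: $star_index" >&2
        return 1
    fi
    if [ ! -f "$gtf" ]; then
        echo "GTF file not found: $gtf" >&2
        return 1
    fi
    # Values are inserted as-is, so they must not contain quotes or backslashes.
    cat <<EOF
{
  "name": "$name",
  "star_index": "$star_index",
  "gtf": "$gtf",
  "speciesName": "$species",
  "isBarnyard": false
}
EOF
}
```

For example, `write_genome_json MyGenome /path/to/star.ref/ /path/to/genes.gtf "Custom species" > genome.json` would reproduce the example above, provided both paths exist.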
5. Adding Extra Sequences or Transgenes
If your experiment includes a transgene or custom sequence:
5.1. Add to FASTA
Append the sequence as a new contig:
>chr1
TAACCCTAACCCTAACC...
>transgene1
ATGGTGAGCAAGGGCGAGGAGGATAACAT
5.2. Add to GTF
Add an exon entry for the new contig:
transgene1 custom exon 1 1000 . + . gene_id "transgene1"; transcript_id "transgene1";
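- With a single exon covering the whole contig, all reads mapping to the transgene are counted as its expression.
- GTF fields must be tab-separated, and the end coordinate (1000 in this example) should equal the length of the transgene sequence.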
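The GTF entry for a transgene can be derived from its FASTA record, so that the contig name and exon end coordinate always match the sequence. The following is a sketch rather than a ScaleBio tool: the helper names are hypothetical, the `custom` source label simply follows the example above, and the input is assumed to be a FASTA file containing a single record.

```shell
# Print the name of the first record in a FASTA file (text after ">" up to
# the first whitespace).
fasta_name() {
    awk '/^>/ { name = substr($0, 2); sub(/[ \t].*/, "", name); print name; exit }' "$1"
}

# Print the total sequence length of a single-record FASTA file.
fasta_length() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Print one tab-separated GTF exon line spanning the whole transgene.
# Usage: transgene_gtf_line transgene1.fa
transgene_gtf_line() {
    tg_name="$(fasta_name "$1")"
    tg_len="$(fasta_length "$1")"
    printf '%s\tcustom\texon\t1\t%s\t.\t+\t.\tgene_id "%s"; transcript_id "%s";\n' \
        "$tg_name" "$tg_len" "$tg_name" "$tg_name"
}
```

With these helpers, `cat transgene1.fa >> genome.fa` followed by `transgene_gtf_line transgene1.fa >> genes.gtf` would extend both files consistently (the file names here are placeholders).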
5.3. Rebuild STAR Index
Use the modified FASTA and GTF to build a new index (see section 3.2).
6. Technical Notes & Best Practices
- GTF requirements: Only exon features are used by STAR. All fields are case-sensitive.
- Gene naming: gene_id, transcript_id, and optionally gene_name are used for output.
- Chromosome naming: Ensure consistency between FASTA and GTF (e.g. chr1 vs 1).
- Filtering: To exclude certain gene types, filter the GTF before index generation.
- Transgenes: Ensure the sequence is unique enough to avoid ambiguous mapping.
- STAR version: Use v2.7.4 or higher for compatibility.
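The filtering step mentioned above can be sketched with awk. This is one possible approach under stated assumptions, not the method used to build the ScaleBio pre-built genomes: `filter_gtf_by_biotype` is a hypothetical helper, and it assumes the biotype is stored in a `gene_biotype` (Ensembl-style) or `gene_type` (GENCODE-style) attribute. Records without either attribute, such as a hand-written transgene entry, are kept.

```shell
# Keep comment lines, records whose gene biotype is in the allow-list, and
# records that carry no biotype attribute at all; drop everything else.
# Usage: filter_gtf_by_biotype genes.gtf protein_coding lncRNA > filtered.gtf
filter_gtf_by_biotype() {
    gtf="$1"; shift
    allow="$*"   # remaining arguments: space-separated biotypes to keep
    awk -F '\t' -v allow="$allow" '
        BEGIN { n = split(allow, a, " "); for (i = 1; i <= n; i++) keep[a[i]] = 1 }
        /^#/ { print; next }
        {
            if (match($9, /gene_(biotype|type) "[^"]+"/)) {
                # Extract the quoted value of the matched attribute.
                value = substr($9, RSTART, RLENGTH)
                sub(/^[^"]*"/, "", value)
                sub(/"$/, "", value)
                if (value in keep) print
            } else {
                print   # no biotype information: keep the record
            }
        }
    ' "$gtf"
}
```

An allow-list is used here because pseudogene biotypes come in many variants; the filtered file is then the one to pass to --sjdbGTFfile and to reference from genome.json.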
7. References & Further Reading
- STAR Manual (PDF)
- Example genome.json
- ScaleBio Reference Genomes Guide
- ScaleBio/ScaleCRISPR for barcode/CRISPR workflows
Need Help?
For more information, please contact support@scale.bio or visit our support website.