Scale RNA 1.6 to 2.1 Changes

Overview

Version 2 of the ScaleBio Seq Suite: RNA Workflow (ScaleRNA) adds support for the QuantumScale RNA assay and for Ultima Genomics sequencing data, along with general performance enhancements. It continues to support the previous (3-level) Scale RNA assay. This document gives a high-level summary of the differences in input, output, and data processing between versions 1 and 2. For a detailed list of changes, refer to the changelog file included in the workflow.

Input Changes

For QuantumScale data, the workflow no longer requires input FASTQ filenames to match the libName specified in the sample table (samples.csv). Instead, each FASTQ file is assigned to the correct sub-library based on the index2 read sequence. Only the libIndex2 column in samples.csv is required for this purpose (see samplesCsv.md). The only requirement for input FASTQ files is that the sub-libraries are demultiplexed during FASTQ generation, which can be achieved using the example samplesheets (fastqGeneration.md).
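As an illustration of index2-based assignment, the following minimal sketch groups FASTQ files by sub-library using a lookup on the index2 sequence. It is not the workflow's implementation: the sequences, sub-library names, file names, and both function names are hypothetical, and exact matching is assumed.

```python
# Hypothetical index2 sequences and sub-library names, for illustration only.
INDEX2_TO_SUBLIBRARY = {
    "ACGTACGT": "sublib-A",
    "TTGCAAGC": "sublib-B",
}


def assign_sublibrary(index2_seq, index2_map):
    """Return the sub-library for an index2 sequence, or None if it is unknown.

    Assumes exact matching; the real workflow may handle mismatches differently.
    """
    return index2_map.get(index2_seq.upper())


def group_fastqs_by_sublibrary(fastq_index2, index2_map):
    """Group FASTQ file names by sub-library, given each file's index2 sequence."""
    groups = {}
    for fastq_name, index2_seq in fastq_index2.items():
        sublib = assign_sublibrary(index2_seq, index2_map)
        groups.setdefault(sublib, []).append(fastq_name)
    return groups


# File names are arbitrary here: they play no role in the assignment.
fastqs = {
    "run1_S1_R1.fastq.gz": "ACGTACGT",
    "anything_S2_R1.fastq.gz": "ttgcaagc",
    "other_S3_R1.fastq.gz": "ACGTACGT",
}
groups = group_fastqs_by_sublibrary(fastqs, INDEX2_TO_SUBLIBRARY)
```

The point of the sketch is that the file name never enters the lookup, which is why the file names no longer need to match libName.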

Output Changes

Metrics

The primary outputs, such as the gene expression matrix and QC reports, remain the same in version 2. However, the definitions of several QC metrics have been updated:

Reads Mapped to Genome: Includes all multimapping reads
Passing Read Alignments: The fraction of mapped reads retained after alignment filtering (e.g. ribosomal RNA multimappers)
Reads Mapped to Transcriptome: The fraction of passing read alignments that match an annotated gene
Exonic Reads: The fraction of transcriptome reads overlapping an exon in the sense orientation
Saturation: Now consistently defined as the ratio of unique to total transcriptome reads
Reads per Cell: Calculated as Total Sample Reads divided by Cells Called (includes unaligned reads, reads not assigned to a passing cell, etc.). This definition is also used for the x-axis in Complexity plots.

See qcReports.md for a full list of all QC metric definitions.
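The metric definitions above form a chain in which each fraction uses the previous category as its denominator. The sketch below shows that arithmetic on made-up read counts. The function and argument names are hypothetical, the denominator for Reads Mapped to Genome is an assumption (total sample reads), and Saturation is computed literally as worded above. qcReports.md remains the authoritative reference.

```python
def qc_metrics(total_sample_reads, mapped_reads, passing_alignments,
               transcriptome_reads, exonic_reads,
               unique_transcriptome_reads, cells_called):
    """Compute the QC fractions from raw read counts (illustrative only)."""
    return {
        # Assumption: expressed as a fraction of all reads in the sample.
        "reads_mapped_to_genome": mapped_reads / total_sample_reads,
        # Fraction of mapped reads kept after alignment filtering.
        "passing_read_alignments": passing_alignments / mapped_reads,
        # Fraction of passing alignments that match an annotated gene.
        "reads_mapped_to_transcriptome": transcriptome_reads / passing_alignments,
        # Fraction of transcriptome reads overlapping an exon (sense strand).
        "exonic_reads": exonic_reads / transcriptome_reads,
        # Taken literally from the wording above: unique / total.
        "saturation": unique_transcriptome_reads / transcriptome_reads,
        # Uses ALL sample reads, including unaligned and non-cell reads.
        "reads_per_cell": total_sample_reads / cells_called,
    }


# Made-up counts, chosen only to keep the arithmetic easy to follow.
metrics = qc_metrics(
    total_sample_reads=1000,
    mapped_reads=800,
    passing_alignments=600,
    transcriptome_reads=480,
    exonic_reads=360,
    unique_transcriptome_reads=240,
    cells_called=10,
)
```

Note that Reads per Cell divides the total sample reads by the number of called cells, so it is larger than the average number of reads actually assigned to each cell.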

Cell Metadata

Only called cells are included in the allCells.csv cell-metadata files. Information for non-cell barcodes is stored in the allBarcodes files, which use the binary Parquet format for space efficiency.

The cell-barcode (cell name) now uses aliases of the different barcodes rather than their sequences, e.g., QSR-1+03+15+42+1A instead of TCGTCTAT+CCGAATAG+...+CAAGAGTC. The alias of the sub-library barcode is the name of the PCR primer pool, and the alias of the RT barcode is the RT plate well of the cell.
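Because the aliases are joined with "+", a cell name can be split back into its component barcodes with ordinary string handling. The sketch below is a hypothetical helper, not part of the workflow. Which position holds which barcode is not stated here, so the field index is left to the caller and should be checked against the workflow documentation. The second and third cell names are made up.

```python
from collections import Counter


def split_cell_barcode(cell_name):
    """Split a cell name into its list of barcode aliases."""
    return cell_name.split("+")


def count_by_field(cell_names, position):
    """Count cells by the alias found at a given position in the cell name.

    `position` is supplied by the caller; the mapping of positions to
    barcode levels is an assumption to verify against the documentation.
    """
    return Counter(split_cell_barcode(name)[position] for name in cell_names)


aliases = split_cell_barcode("QSR-1+03+15+42+1A")

# The first name is the example from the text; the other two are invented.
cells = ["QSR-1+03+15+42+1A", "QSR-1+07+02+11+3C", "QSR-2+03+15+42+1A"]
first_field_counts = count_by_field(cells, 0)
```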

Workflow Changes

Cell-barcode processing and sample demultiplexing are performed in the bcParser step of the workflow. In version 2.1, this step generates unaligned BAM files containing all RNA reads per sample, with barcode information stored in BAM tags. This output replaces the previous per-sample FASTQ files. Note that the workflow still requires FASTQ files as input, as produced by Illumina FASTQ generation tools (e.g., bcl_convert), and continues to produce per-sample aligned BAM files from STAR as before.

Starting with version 2.1, for analyses combining data from multiple sub-libraries (e.g., QuantumScale RNA Large or XL kits), cell-calling is performed separately for each sub-library. The called cells are then merged across sub-libraries into a single output. This approach may result in slightly different total cell counts compared to the previous method of combining all cell-barcodes across sub-libraries before running cell-calling on the merged data.
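The toy sketch below shows why calling cells per sub-library and then merging can give a different total than calling on the merged data. The caller here is a deliberately crude stand-in (a threshold relative to the top barcode's UMI count), not the workflow's cell-calling method, and all names and counts are hypothetical.

```python
def call_cells(umi_counts, fraction=0.1):
    """Stand-in caller: keep barcodes with at least `fraction` of the
    highest UMI count in the given set. Not the workflow's method."""
    if not umi_counts:
        return set()
    threshold = fraction * max(umi_counts.values())
    return {bc for bc, n in umi_counts.items() if n >= threshold}


def call_per_sublibrary(counts_by_sublib, caller=call_cells):
    """Call cells within each sub-library, then merge the called cells."""
    called = set()
    for sublib, umi_counts in counts_by_sublib.items():
        called |= {(sublib, bc) for bc in caller(umi_counts)}
    return called


def call_on_merged(counts_by_sublib, caller=call_cells):
    """Previous approach: pool all barcodes first, then call cells once."""
    merged = {
        (sublib, bc): n
        for sublib, umi_counts in counts_by_sublib.items()
        for bc, n in umi_counts.items()
    }
    return caller(merged)


# Made-up UMI counts: sublib-B is sequenced about ten times more shallowly.
counts = {
    "sublib-A": {"bc1": 5000, "bc2": 4000, "bc3": 100},
    "sublib-B": {"bc1": 400, "bc2": 300, "bc3": 10},
}
per_sublib_cells = call_per_sublibrary(counts)
merged_cells = call_on_merged(counts)
```

With these counts, the per-sub-library approach keeps two cells from each sub-library, while the pooled approach keeps only the two deep cells from sublib-A, because the threshold is set by the deeper sub-library.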

Additionally, CellFinder is now the default cell-calling method. Similar to EmptyDrops, it compares the expression profile of each potential cell with the ambient RNA profile. Together with general optimizations to the cell-calling methods, this can increase the number of cells called, particularly in barnyard experiments or datasets with higher RNA background.

Need Help?

For more information, please contact support@scale.bio or visit our support website.