Cell Typing and Clustering Analysis

The ScaleBio RNA workflow includes cell typing and clustering capabilities using Seurat and Azimuth. This guide explains how to enable, configure, and interpret these analyses for single-cell RNA sequencing data.

Overview

The cell typing workflow provides:

Seurat-based clustering: Unsupervised clustering and dimension reduction
Azimuth cell type annotation: Reference-based cell type classification
Co-analysis: Combine multiple samples for integrated analysis
AnnData output: Export data for downstream analysis in Python

Configuration

Seurat Clustering Analysis

What is Seurat?

Seurat performs normalization, dimension reduction, and clustering of the UMI count matrix for each sample. It is the foundation for unsupervised cell type discovery. See the Seurat documentation for more information.

Enable Seurat Analysis

# In your runParams.yml
seurat: true

Large Dataset Handling

For datasets with >50,000 cells, Seurat automatically performs sketch-based analysis:

Selects a subset (20% by default) of cells for analysis
Projects results back to the full dataset
Maintains computational efficiency while preserving biological insights
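
As a rough illustration of the sizing rule above, the following sketch computes how many cells would be analyzed directly. The function name and its parameters are hypothetical; only the 50,000-cell threshold and the 20% default come from this guide, and the code does not reproduce how Seurat actually chooses the subset.

```python
def cells_analyzed(n_cells, threshold=50_000, fraction=0.20):
    """Return how many cells are clustered directly for a dataset of n_cells.

    Datasets at or below the threshold are analyzed in full; larger ones
    are reduced to a subset (the "sketch") of the given fraction.
    """
    if n_cells <= threshold:
        return n_cells
    return round(n_cells * fraction)


# Compare a small, a borderline, and a large dataset
for n in (30_000, 50_000, 120_000):
    print(n, "->", cells_analyzed(n))
```

This only helps estimate the size of the sketched subset when planning memory and runtime; results are still projected back to every cell in the dataset.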

Azimuth Cell Type Annotation

What is Azimuth?

Azimuth provides reference-based cell type annotation by mapping your data to reference datasets.

Enable Azimuth Analysis

# In your runParams.yml
azimuth: true

Requirements

Sample must have a relevant Azimuth reference dataset available
Compatible with Seurat clustering (can be used together)

AnnData Output

Enable AnnData Export

# In your runParams.yml
annData: true

Output Format

Creates AnnData objects containing UMI count matrices
One AnnData file per sample in your samplesheet
Compatible with Python-based analysis tools (scanpy, etc.)

Co-Analysis of Multiple Samples

Enable Co-Analysis

# In your runParams.yml
compSamples: true

How It Works

Concatenates filtered UMI count matrices from all samples
Performs clustering on the combined dataset
Optionally applies Azimuth annotation to the combined data

Sample Grouping

Add a compSamples column to your samplesheet to control which samples are combined:

Example 1: Basic Sample Combination

sample	barcodes	compSamples
Foo	1A-6H	combine
Bar	7A-12H	combine

Example 2: Multiple Groups

sample	barcodes	compSamples
Foo1	1A-6H	group1
Bar1	7A-12H	group2
Foo2	13A-18H	group1
Bar2	19A-24H	group2

Example 3: Reporting Workflow Integration

For combining samples from multiple runs using the reporting workflow:

sample	resultDir	compSamples
Foo1	resultDir_1	fooSamples
Bar1	resultDir_2	barSamples
Foo2	resultDir_3	fooSamples
Bar2	resultDir_4	barSamples
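
To preview which samples will be combined before launching a run, the grouping can be sketched in a few lines of Python. This is a minimal illustration, not part of the workflow: the helper name `group_samples` is hypothetical, the tab delimiter is an assumption (pass "," for a comma-separated samplesheet), and skipping rows with an empty compSamples value is also an assumption.

```python
import csv
import io
from collections import defaultdict


def group_samples(samplesheet_text, delimiter="\t"):
    """Map each compSamples label to the list of samples that share it."""
    reader = csv.DictReader(io.StringIO(samplesheet_text), delimiter=delimiter)
    groups = defaultdict(list)
    for row in reader:
        label = (row.get("compSamples") or "").strip()
        if label:  # assumption: rows without a label are not co-analyzed
            groups[label].append(row["sample"])
    return dict(groups)


# Example 2 from above, written with tab separators
sheet = (
    "sample\tbarcodes\tcompSamples\n"
    "Foo1\t1A-6H\tgroup1\n"
    "Bar1\t7A-12H\tgroup2\n"
    "Foo2\t13A-18H\tgroup1\n"
    "Bar2\t19A-24H\tgroup2\n"
)
for label, samples in group_samples(sheet).items():
    print(label, samples)
```

Samples that share a label end up in the same list, which mirrors how the compSamples column controls the combined datasets.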

Parameter Combinations

Common Use Cases

Individual sample clustering:

seurat: true
azimuth: false
compSamples: false
annData: false

Multi-sample co-analysis with annotation:

seurat: true
azimuth: true
compSamples: true
annData: true

Python-based downstream analysis:

seurat: false
azimuth: false
compSamples: false
annData: true

Parameter Dependencies

azimuth: true requires a compatible reference dataset
compSamples: true works with both seurat and azimuth
annData: true can be used independently or with other parameters
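
The dependencies above can be turned into a small pre-flight check. The sketch below is illustrative only: `review_params` is a hypothetical helper, it works on an already-parsed dictionary of parameters rather than reading runParams.yml, and treating a missing key as disabled is an assumption.

```python
CELL_TYPING_KEYS = ("seurat", "azimuth", "compSamples", "annData")


def review_params(params):
    """Return a list of human-readable notes about the cell typing options."""
    notes = []
    for key in CELL_TYPING_KEYS:
        # Each option is expected to be a plain boolean
        if key in params and not isinstance(params[key], bool):
            notes.append(f"{key}: expected true or false, got {params[key]!r}")
    if params.get("azimuth") is True:
        notes.append("azimuth: confirm a compatible reference dataset is available")
    if not any(params.get(key) is True for key in CELL_TYPING_KEYS):
        notes.append("no cell typing option is enabled")
    return notes


# The "multi-sample co-analysis with annotation" combination from above
params = {"seurat": True, "azimuth": True, "compSamples": True, "annData": True}
for note in review_params(params):
    print(note)
```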

Output Files and Structure

The cellTyping directory contains automated cell type annotation results and analysis files for each sample in the analysis.

Directory Structure

cellTyping
└── QS-SmallKit-PBMCs.QSR-2
    ├── QS-SmallKit-PBMCs.QSR-2.html
    ├── QS-SmallKit-PBMCs.QSR-2_AzimuthObject.rds
    ├── QS-SmallKit-PBMCs.QSR-2_SeuratObject.rds
    ├── QS-SmallKit-PBMCs.QSR-2_azimuth_mapping_results.csv
    ├── QS-SmallKit-PBMCs.QSR-2_bpcells
    │   └── out
    │       ├── col_names
    │       ├── idxptr
    │       ├── index_data
    │       ├── index_idx
    │       ├── index_idx_offsets
    │       ├── index_starts
    │       ├── row_names
    │       ├── shape
    │       ├── storage_order
    │       ├── val
    │       └── version
    ├── QS-SmallKit-PBMCs.QSR-2_cellTypingResults.csv
    ├── QS-SmallKit-PBMCs.QSR-2_seurat_cluster_markers.csv
    ├── QS-SmallKit-PBMCs.QSR-2_seurat_clustering_results.csv
    ├── l1.celltype_summary.csv
    ├── l2.celltype_summary.csv
    ├── l3.celltype_summary.csv
    ├── outlier_summary.csv
    └── seurat.cluster_summary.csv
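
Because each sample has its own subdirectory and the per-sample files are prefixed with the sample name, the outputs can be located programmatically. The following is a minimal sketch assuming the layout shown above; `find_outputs` and the dictionary keys are hypothetical names chosen for illustration.

```python
from pathlib import Path

# File-name suffixes taken from the directory listing above
SUFFIXES = {
    "seurat_object": "_SeuratObject.rds",
    "azimuth_object": "_AzimuthObject.rds",
    "cell_typing": "_cellTypingResults.csv",
    "clusters": "_seurat_clustering_results.csv",
    "markers": "_seurat_cluster_markers.csv",
}


def find_outputs(cell_typing_dir):
    """Return {sample: {kind: Path}} for the files present under cellTyping."""
    found = {}
    for sample_dir in sorted(Path(cell_typing_dir).iterdir()):
        if not sample_dir.is_dir():
            continue
        files = {}
        for kind, suffix in SUFFIXES.items():
            # Per-sample files are named <sample><suffix>
            candidate = sample_dir / f"{sample_dir.name}{suffix}"
            if candidate.exists():
                files[kind] = candidate
        found[sample_dir.name] = files
    return found


# Only list results when a cellTyping directory exists in the working directory
if Path("cellTyping").is_dir():
    for sample, files in find_outputs("cellTyping").items():
        print(sample, sorted(files))
```

Files that were not produced (for example the Azimuth object when azimuth is disabled) are simply absent from the returned dictionary.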

Key Files and Directories

Interactive Report (`*.html`)

Comprehensive HTML report containing:

Cell type distribution plots
Quality metrics and statistics
Interactive visualizations
Summary tables and figures
Methodology and parameters used

R Analysis Objects

Seurat Object (*_SeuratObject.rds)

Complete Seurat object with all analysis results
Contains normalized data, clustering, and cell type annotations
Can be loaded directly into R for further analysis

Azimuth Object (*_AzimuthObject.rds)

Azimuth reference mapping results
Contains reference-based cell type predictions
Includes confidence scores and mapping quality metrics

Cell Type Annotations (PBMCs only)

Primary Results (*_cellTypingResults.csv)

Primary cell type annotation file containing:

Cell barcodes
Predicted cell types
Confidence scores
Reference mapping quality metrics
Alternative cell type predictions
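
Confidence scores are typically used to set aside low-confidence calls before downstream analysis. The sketch below shows one way to do that with the standard library. The column names (`barcode`, `predicted_cell_type`, `confidence`), the 0.5 cutoff, and the rows themselves are made up for illustration; check the header of your own *_cellTypingResults.csv for the actual names.

```python
import csv
import io


def confident_calls(csv_text, score_col, min_score=0.5):
    """Keep rows whose confidence score is at least min_score."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row for row in rows if float(row[score_col]) >= min_score]


# Made-up rows with hypothetical column names, for illustration only
example = (
    "barcode,predicted_cell_type,confidence\n"
    "cell_1,B cell,0.92\n"
    "cell_2,CD4 T,0.41\n"
    "cell_3,NK,0.77\n"
)
kept = confident_calls(example, "confidence", min_score=0.5)
print([row["barcode"] for row in kept])
```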

Reference Mapping (*_azimuth_mapping_results.csv)

Detailed Azimuth reference mapping results:

Reference dataset used
Mapping scores and confidence
Cell type hierarchy levels
Quality metrics per cell

Clustering Analysis

Cluster Assignments (*_seurat_clustering_results.csv)

Unsupervised clustering results:

Cluster assignments for each cell
Cluster quality metrics
Cell distribution across clusters

Marker Genes (*_seurat_cluster_markers.csv)

Differential expression analysis:

Marker genes for each cluster
Statistical significance metrics
Expression fold changes
Gene ontology enrichment

Summary Statistics

Cell Type Counts (l*.celltype_summary.csv)

Cell type counts at different annotation levels
Level 1: Major cell types (e.g., T cells, B cells)
Level 2: Subtypes (e.g., CD4+ T cells, CD8+ T cells)
Level 3: Detailed subtypes (e.g., Naive CD4+ T cells)

Cluster Summary (seurat.cluster_summary.csv)

Summary statistics for each cluster
Cell counts per cluster
Average gene expression metrics
Quality indicators

Outlier Analysis (outlier_summary.csv)

Cells that could not be confidently annotated
Quality metrics for outlier cells
Potential reasons for annotation failure

Optimized Matrix (`*_bpcells/`)

Optimized sparse matrix format for efficient analysis:

col_names: Cell barcodes
row_names: Gene names
val: Expression values
idxptr: Column pointers for sparse matrix
shape: Matrix dimensions
storage_order: Data storage format
version: BPCells format version

See the BPCells documentation for full details.
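
To make the role of the column pointers concrete, the sketch below expands a tiny compressed-sparse-column matrix into a dense one. It illustrates the general idea only: the function and variable names are hypothetical, the values are a toy example, and the code does not read the files in the `*_bpcells/` directory, which may store their index data in a compressed encoding.

```python
def csc_to_dense(idxptr, row_index, val, shape):
    """Expand compressed-sparse-column arrays into a list-of-lists matrix."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # Entries for this column sit in val[idxptr[col]:idxptr[col + 1]]
        for k in range(idxptr[col], idxptr[col + 1]):
            dense[row_index[k]][col] = val[k]
    return dense


# Toy matrix with 3 genes (rows) and 2 cells (columns):
# cell 0 has counts for genes 0 and 2, cell 1 has a count for gene 1
idxptr = [0, 2, 3]
row_index = [0, 2, 1]
val = [5, 1, 7]
print(csc_to_dense(idxptr, row_index, val, shape=(3, 2)))
```

Storing only the non-zero values and their positions is what keeps single-cell count matrices, which are mostly zeros, compact on disk.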

Analysis Methods

Azimuth Reference Mapping

Reference Datasets: Human PBMC, mouse brain, and other tissue references
Mapping Algorithm: Seurat's reference mapping approach
Quality Metrics: Prediction confidence scores
Hierarchical Annotation: Multiple levels of cell type specificity

Unsupervised Clustering

Clustering Algorithm: Louvain community detection
Resolution Parameters: Optimized for cell type discovery
Marker Gene Analysis: Differential expression to identify cell types
Quality Assessment: Cluster stability and separation metrics

File Formats

CSV Files

Cell Type Results: Cell barcodes, annotations, confidence scores
Summary Statistics: Counts and proportions by cell type
Marker Genes: Differential expression results
Quality Metrics: Various quality indicators

RDS Files

Seurat Objects: Complete analysis objects for R
Azimuth Objects: Reference mapping results
Compatibility: Can be loaded directly into R/Seurat

Usage Examples

Loading Results in R

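library(Seurat)

# readRDS() and read.csv() do not expand wildcards, so resolve the paths first.
# Files sit in one subdirectory per sample: cellTyping/<sample>/
rds_file <- Sys.glob("cellTyping/*/*_SeuratObject.rds")[1]
csv_file <- Sys.glob("cellTyping/*/*_cellTypingResults.csv")[1]

# Load Seurat object
seurat_obj <- readRDS(rds_file)

# Load cell type annotations
cell_types <- read.csv(csv_file)

# View cell type distribution (check colnames(cell_types) for the exact column name)
table(cell_types$predicted_cell_type)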

Python Analysis

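from glob import glob

import pandas as pd

# pd.read_csv() does not expand wildcards, so resolve the paths first.
# Files sit in one subdirectory per sample: cellTyping/<sample>/
results_file = sorted(glob("cellTyping/*/*_cellTypingResults.csv"))[0]
summary_file = sorted(glob("cellTyping/*/l1.celltype_summary.csv"))[0]

# Load cell type results
cell_types = pd.read_csv(results_file)

# Load summary statistics
summary = pd.read_csv(summary_file)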

Technical Notes & Best Practices

Memory requirements: Large datasets may require significant RAM
Computational time: Sketch-based analysis reduces runtime for large datasets
Reference compatibility: Ensure Azimuth reference matches your species/tissue
Sample quality: Poor quality samples may affect co-analysis results
Batch effects: Consider batch correction for multi-sample analyses

Need Help?

For more information, please contact support@scale.bio or visit our support website.