Cell Typing and Clustering Analysis
The ScaleBio RNA workflow includes cell typing and clustering capabilities using Seurat and Azimuth. This guide explains how to enable, configure, and interpret these analyses for single-cell RNA sequencing data.
Overview
The cell typing workflow provides:
- Seurat-based clustering: Unsupervised clustering and dimension reduction
- Azimuth cell type annotation: Reference-based cell type classification
- Co-analysis: Combine multiple samples for integrated analysis
- AnnData output: Export data for downstream analysis in Python
Configuration
Seurat Clustering Analysis
What is Seurat?
Seurat performs normalization, dimension reduction, and clustering of UMI count matrices for each sample. It's the foundation for unsupervised cell type discovery. For more detail, see the Seurat documentation.
Enable Seurat Analysis
# In your runParams.yml
seurat: true
Large Dataset Handling
For datasets with >50,000 cells, Seurat automatically performs sketch-based analysis:
- Selects a subset (20% by default) of cells for analysis
- Projects results back to the full dataset
- Maintains computational efficiency while preserving biological insights
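To make the sketch-then-project idea concrete, here is a minimal Python illustration. It is not the workflow's or Seurat's code: uniform random sampling and nearest-neighbor label transfer are simplifying assumptions, and the function names `choose_sketch` and `project_labels` are hypothetical. Only the 50,000-cell threshold and the 20% fraction come from the text above.

```python
import random


def choose_sketch(n_cells, fraction=0.2, threshold=50_000, seed=0):
    """Return sorted indices of the cells to analyze directly."""
    if n_cells <= threshold:
        return list(range(n_cells))  # small dataset: analyze every cell
    rng = random.Random(seed)
    k = max(1, round(n_cells * fraction))
    # Assumption: uniform sampling stands in for Seurat's own sketching method.
    return sorted(rng.sample(range(n_cells), k))


def project_labels(embedding, sketch_idx, sketch_labels):
    """Give every cell the label of its nearest sketched cell.

    embedding: list of coordinate tuples, one per cell (e.g. PCA coordinates).
    sketch_idx: indices of the sketched cells within `embedding`.
    sketch_labels: cluster label of each sketched cell, in the same order.
    """
    labels = []
    for point in embedding:
        nearest = min(
            range(len(sketch_idx)),
            key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(point, embedding[sketch_idx[j]])
            ),
        )
        labels.append(sketch_labels[nearest])
    return labels


# Toy data: two well-separated groups, one sketched cell from each.
toy_embedding = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
toy_labels = project_labels(toy_embedding, [0, 2], ["A", "B"])
print(toy_labels)
```

The brute-force nearest-neighbor search is quadratic and only suitable for illustration; it shows why clustering a subset and transferring labels is cheaper than clustering every cell.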
Azimuth Cell Type Annotation
What is Azimuth?
Azimuth provides reference-based cell type annotation by mapping your data to reference datasets.
Enable Azimuth Analysis
# In your runParams.yml
azimuth: true
Requirements
- Sample must have a relevant Azimuth reference dataset available
- Compatible with Seurat clustering (can be used together)
AnnData Output
Enable AnnData Export
# In your runParams.yml
annData: true
Output Format
- Creates AnnData objects containing UMI count matrices
- One AnnData file per sample in your samplesheet
- Compatible with Python-based analysis tools (scanpy, etc.)
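As a starting point for Python work, the sketch below locates AnnData files and reads one of them. Several things here are assumptions rather than documented behavior: that the exported files use the conventional `.h5ad` extension, that they sit somewhere under the results directory you pass in, and that the file stem is a usable sample label. `find_h5ad` and `load_counts` are hypothetical helper names; `anndata.read_h5ad` is the standard reader from the `anndata` package.

```python
from pathlib import Path


def find_h5ad(results_dir):
    """Recursively collect *.h5ad files under results_dir, keyed by file stem."""
    return {p.stem: p for p in sorted(Path(results_dir).rglob("*.h5ad"))}


def load_counts(path):
    """Read one AnnData file; requires the third-party `anndata` package."""
    import anndata  # imported lazily so find_h5ad works without it installed

    return anndata.read_h5ad(path)
```

A typical session would call `find_h5ad` on your output directory, pick a path from the returned dictionary, and pass it to `load_counts` before handing the object to scanpy.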
Co-Analysis of Multiple Samples
Enable Co-Analysis
# In your runParams.yml
compSamples: true
How It Works
- Concatenates filtered UMI count matrices from all samples
- Performs clustering on the combined dataset
- Optionally applies Azimuth annotation to the combined data
Sample Grouping
Add a compSamples column to your samplesheet to control which samples are combined. Samples that share the same value in this column are analyzed together:
Example 1: Basic Sample Combination
sample | barcodes | compSamples |
---|---|---|
Foo | 1A-6H | combine |
Bar | 7A-12H | combine |
Example 2: Multiple Groups
sample | barcodes | compSamples |
---|---|---|
Foo1 | 1A-6H | group1 |
Bar1 | 7A-12H | group2 |
Foo2 | 13A-18H | group1 |
Bar2 | 19A-24H | group2 |
Example 3: Reporting Workflow Integration
For combining samples from multiple runs using the reporting workflow:
sample | resultDir | compSamples |
---|---|---|
Foo1 | resultDir_1 | fooSamples |
Bar1 | resultDir_2 | barSamples |
Foo2 | resultDir_3 | fooSamples |
Bar2 | resultDir_4 | barSamples |
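Before launching a run, it can help to preview which samples will end up in each group. The sketch below assumes the samplesheet is a comma-separated file with a header row containing `sample` and `compSamples` columns; `group_samples` is a hypothetical helper, not part of the workflow.

```python
import csv
import io
from collections import defaultdict


def group_samples(samplesheet_text, column="compSamples"):
    """Map each compSamples label to the list of sample names that share it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        label = (row.get(column) or "").strip()
        if label:  # this helper simply skips rows with no label
            groups[label].append(row["sample"].strip())
    return dict(groups)


# Same layout as Example 2 above, written as CSV text.
example_sheet = """sample,barcodes,compSamples
Foo1,1A-6H,group1
Bar1,7A-12H,group2
Foo2,13A-18H,group1
Bar2,19A-24H,group2
"""
groups = group_samples(example_sheet)
print(groups)
```

For a samplesheet on disk, read the file contents first (for example with `Path("samples.csv").read_text()`, where the file name is whatever you pass to the workflow) and hand the text to `group_samples`.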
Parameter Combinations
Common Use Cases
Individual sample clustering:
seurat: true
azimuth: false
compSamples: false
annData: false
Multi-sample co-analysis with annotation:
seurat: true
azimuth: true
compSamples: true
annData: true
Python-based downstream analysis:
seurat: false
azimuth: false
compSamples: false
annData: true
Parameter Dependencies
- azimuth: true requires a compatible reference dataset
- compSamples: true works with both seurat and azimuth
- annData: true can be used independently or with other parameters
Output Files and Structure
The cellTyping directory contains automated cell type annotation results and analysis files for each sample in the analysis.
Directory Structure
cellTyping
└── QS-SmallKit-PBMCs.QSR-2
    ├── QS-SmallKit-PBMCs.QSR-2.html
    ├── QS-SmallKit-PBMCs.QSR-2_AzimuthObject.rds
    ├── QS-SmallKit-PBMCs.QSR-2_SeuratObject.rds
    ├── QS-SmallKit-PBMCs.QSR-2_azimuth_mapping_results.csv
    ├── QS-SmallKit-PBMCs.QSR-2_bpcells
    │   └── out
    │       ├── col_names
    │       ├── idxptr
    │       ├── index_data
    │       ├── index_idx
    │       ├── index_idx_offsets
    │       ├── index_starts
    │       ├── row_names
    │       ├── shape
    │       ├── storage_order
    │       ├── val
    │       └── version
    ├── QS-SmallKit-PBMCs.QSR-2_cellTypingResults.csv
    ├── QS-SmallKit-PBMCs.QSR-2_seurat_cluster_markers.csv
    ├── QS-SmallKit-PBMCs.QSR-2_seurat_clustering_results.csv
    ├── l1.celltype_summary.csv
    ├── l2.celltype_summary.csv
    ├── l3.celltype_summary.csv
    ├── outlier_summary.csv
    └── seurat.cluster_summary.csv
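The layout above lends itself to a quick completeness check after a run. The sketch below walks each sample subdirectory and reports which of a chosen set of files are absent. `missing_outputs` is a hypothetical helper, and the default suffix list is only an illustrative selection from the tree above; which files a given run actually produces depends on the parameters you enabled, so adjust the list accordingly.

```python
from pathlib import Path

# Illustrative selection only; Azimuth outputs are left out on purpose because
# they would not be expected unless azimuth was enabled.
EXPECTED_SUFFIXES = [
    ".html",
    "_SeuratObject.rds",
    "_seurat_clustering_results.csv",
    "_seurat_cluster_markers.csv",
]


def missing_outputs(cell_typing_dir, suffixes=EXPECTED_SUFFIXES):
    """Return {sample_name: [missing file names]} for each sample subdirectory."""
    report = {}
    for sample_dir in sorted(Path(cell_typing_dir).iterdir()):
        if not sample_dir.is_dir():
            continue
        sample = sample_dir.name  # files are prefixed with the directory name
        report[sample] = [
            sample + suffix
            for suffix in suffixes
            if not (sample_dir / (sample + suffix)).exists()
        ]
    return report
```

Calling `missing_outputs("cellTyping")` returns an empty list for every sample whose selected files are all present.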
Key Files and Directories
Interactive Report (*.html)
Comprehensive HTML report containing:
- Cell type distribution plots
- Quality metrics and statistics
- Interactive visualizations
- Summary tables and figures
- Methodology and parameters used
R Analysis Objects
Seurat Object (*_SeuratObject.rds)
- Complete Seurat object with all analysis results
- Contains normalized data, clustering, and cell type annotations
- Can be loaded directly into R for further analysis
Azimuth Object (*_AzimuthObject.rds)
- Azimuth reference mapping results
- Contains reference-based cell type predictions
- Includes confidence scores and mapping quality metrics
Cell Type Annotations (PBMCs only)
Primary Results (*_cellTypingResults.csv)
Primary cell type annotation file containing:
- Cell barcodes
- Predicted cell types
- Confidence scores
- Reference mapping quality metrics
- Alternative cell type predictions
Reference Mapping (*_azimuth_mapping_results.csv)
Detailed Azimuth reference mapping results:
- Reference dataset used
- Mapping scores and confidence
- Cell type hierarchy levels
- Quality metrics per cell
Clustering Analysis
Cluster Assignments (*_seurat_clustering_results.csv)
Unsupervised clustering results:
- Cluster assignments for each cell
- Cluster quality metrics
- Cell distribution across clusters
Marker Genes (*_seurat_cluster_markers.csv)
Differential expression analysis:
- Marker genes for each cluster
- Statistical significance metrics
- Expression fold changes
- Gene ontology enrichment
Summary Statistics
Cell Type Counts (l*.celltype_summary.csv)
- Cell type counts at different annotation levels
- Level 1: Major cell types (e.g., T cells, B cells)
- Level 2: Subtypes (e.g., CD4+ T cells, CD8+ T cells)
- Level 3: Detailed subtypes (e.g., Naive CD4+ T cells)
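To compare the annotation levels side by side, the per-level summary files can be read into one dictionary. The sketch below makes no assumption about column names and simply returns each file's rows as dictionaries; `load_level_summaries` is a hypothetical helper name, and the directory you pass should be a single sample's folder inside cellTyping.

```python
import csv
from pathlib import Path


def load_level_summaries(sample_dir):
    """Return {"l1": rows, "l2": rows, ...} for the summary files that exist."""
    summaries = {}
    for path in sorted(Path(sample_dir).glob("l*.celltype_summary.csv")):
        level = path.name.split(".")[0]  # "l1.celltype_summary.csv" -> "l1"
        with path.open(newline="") as handle:
            summaries[level] = list(csv.DictReader(handle))
    return summaries
```

Inspecting `summaries["l1"][0].keys()` is a quick way to see which columns your version of the workflow writes before building plots or tables on top of them.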
Cluster Summary (seurat.cluster_summary.csv)
- Summary statistics for each cluster
- Cell counts per cluster
- Average gene expression metrics
- Quality indicators
Outlier Analysis (outlier_summary.csv)
- Cells that could not be confidently annotated
- Quality metrics for outlier cells
- Potential reasons for annotation failure
Optimized Matrix (*_bpcells/)
Optimized sparse matrix format for efficient analysis:
- col_names: Cell barcodes
- row_names: Gene names
- val: Expression values
- idxptr: Column pointers for sparse matrix
- shape: Matrix dimensions
- storage_order: Data storage format
- version: BPCells format version
For full details, see the BPCells documentation.
Analysis Methods
Azimuth Reference Mapping
- Reference Datasets: Human PBMC, mouse brain, and other tissue references
- Mapping Algorithm: Seurat's reference mapping approach
- Quality Metrics: Prediction confidence scores
- Hierarchical Annotation: Multiple levels of cell type specificity
Unsupervised Clustering
- Clustering Algorithm: Louvain community detection
- Resolution Parameters: Optimized for cell type discovery
- Marker Gene Analysis: Differential expression to identify cell types
- Quality Assessment: Cluster stability and separation metrics
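Louvain community detection searches for a partition of a graph that maximizes modularity, a score comparing the number of edges inside each community with the number expected by chance. The sketch below only computes that score for a partition you supply; it does not perform the Louvain search and is not Seurat's implementation. The function name `modularity` and the toy graph are illustrative assumptions, and the graph is treated as undirected and unweighted for simplicity.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each edge listed once, no self-loops.
    community: dict mapping node -> community label.
    Uses Q = sum over communities of  L_c / m - (d_c / (2 m)) ** 2,
    where L_c is the number of internal edges and d_c the total degree.
    """
    m = 0
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # summed degree of the community's nodes
    for u, v in edges:
        m += 1
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}
q_split = modularity(toy_edges, split)
q_merged = modularity(toy_edges, merged)
print(q_split > q_merged)
```

Separating the two triangles scores higher than lumping all nodes together, which is the kind of improvement a Louvain-style search looks for when it moves nodes between communities.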
File Formats
CSV Files
- Cell Type Results: Cell barcodes, annotations, confidence scores
- Summary Statistics: Counts and proportions by cell type
- Marker Genes: Differential expression results
- Quality Metrics: Various quality indicators
RDS Files
- Seurat Objects: Complete analysis objects for R
- Azimuth Objects: Reference mapping results
- Compatibility: Can be loaded directly into R/Seurat
Usage Examples
Loading Results in R
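library(Seurat)
# readRDS() and read.csv() do not expand wildcards, so resolve the paths first.
# Results live in one subdirectory per sample: cellTyping/<sample>/
seurat_path <- Sys.glob("cellTyping/*/*_SeuratObject.rds")[1]
types_path <- Sys.glob("cellTyping/*/*_cellTypingResults.csv")[1]
# Load Seurat object
seurat_obj <- readRDS(seurat_path)
# Load cell type annotations
cell_types <- read.csv(types_path)
# View cell type distribution (check colnames(cell_types) if this column name differs)
table(cell_types$predicted_cell_type)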
Python Analysis
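from glob import glob
import pandas as pd
# pd.read_csv() does not expand wildcards, so resolve the paths first.
# Results live in one subdirectory per sample: cellTyping/<sample>/
types_path = glob("cellTyping/*/*_cellTypingResults.csv")[0]
summary_path = glob("cellTyping/*/l1.celltype_summary.csv")[0]
# Load cell type results
cell_types = pd.read_csv(types_path)
# Load summary statistics
summary = pd.read_csv(summary_path)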
Technical Notes & Best Practices
- Memory requirements: Large datasets may require significant RAM
- Computational time: Sketch-based analysis reduces runtime for large datasets
- Reference compatibility: Ensure Azimuth reference matches your species/tissue
- Sample quality: Poor quality samples may affect co-analysis results
- Batch effects: Consider batch correction for multi-sample analyses
References & Further Reading
- Seurat Documentation
- Seurat Tutorial
- Sketch-based Analysis
- Azimuth Cell Type Reference
- Azimuth Documentation
- AnnData Format
- Analysis Parameters
- Output Documentation
Need Help?
For more information, please contact support@scale.bio or visit our support website.