Cell Typing and Clustering Analysis

The ScaleBio RNA workflow includes cell typing and clustering capabilities using Seurat and Azimuth. This guide explains how to enable, configure, and interpret these analyses for single-cell RNA sequencing data.

Overview

The cell typing workflow provides:

  • Seurat-based clustering: Unsupervised clustering and dimension reduction
  • Azimuth cell type annotation: Reference-based cell type classification
  • Co-analysis: Combine multiple samples for integrated analysis
  • AnnData output: Export data for downstream analysis in Python

Configuration

Seurat Clustering Analysis

What is Seurat?

Seurat performs normalization, dimension reduction, and clustering of the UMI count matrix for each sample. It is the foundation for unsupervised cell type discovery. See the Seurat documentation for more details.

Enable Seurat Analysis

# In your runParams.yml
seurat: true

Large Dataset Handling

For datasets with more than 50,000 cells, Seurat automatically performs sketch-based analysis:

  • Selects a subset of cells (20% by default) for analysis
  • Projects the results back to the full dataset
  • Maintains computational efficiency while preserving biological insights
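
The threshold and subset size described above can be illustrated with a minimal sketch. It only shows the size decision (analyse everything up to 50,000 cells, otherwise take 20%). Uniform random sampling is a stand-in chosen for this illustration and is not necessarily how Seurat selects its sketch; the function name and the seed parameter are hypothetical.

```python
import random

SKETCH_THRESHOLD = 50_000  # cells; datasets larger than this are sketched
SKETCH_FRACTION = 0.20     # default share of cells kept in the sketch


def choose_sketch(cell_barcodes, threshold=SKETCH_THRESHOLD,
                  fraction=SKETCH_FRACTION, seed=0):
    """Return the cells that would be analysed directly.

    Datasets at or below the threshold are returned whole. Larger ones are
    subsampled uniformly at random (an illustrative stand-in only).
    """
    cells = list(cell_barcodes)
    if len(cells) <= threshold:
        return cells
    n_sketch = round(len(cells) * fraction)
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    return rng.sample(cells, n_sketch)


small = choose_sketch(f"cell{i}" for i in range(1_000))
large = choose_sketch(f"cell{i}" for i in range(60_000))
print(len(small), len(large))  # 1000 12000
```

In the workflow, the results computed on the subset are then projected back to the remaining cells; that projection step is not reproduced here.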

Azimuth Cell Type Annotation

What is Azimuth?

Azimuth provides reference-based cell type annotation by mapping your data to reference datasets.

Enable Azimuth Analysis

# In your runParams.yml
azimuth: true

Requirements

  • Sample must have a relevant Azimuth reference dataset available
  • Compatible with Seurat clustering (can be used together)

AnnData Output

Enable AnnData Export

# In your runParams.yml
annData: true

Output Format

  • Creates AnnData objects containing UMI count matrices
  • One AnnData file per sample in your samplesheet
  • Compatible with Python-based analysis tools (scanpy, etc.)

Co-Analysis of Multiple Samples

Enable Co-Analysis

# In your runParams.yml
compSamples: true

How It Works

  • Concatenates filtered UMI count matrices from all samples
  • Performs clustering on the combined dataset
  • Optionally applies Azimuth annotation to the combined data
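
As a rough illustration of the concatenation step, the sketch below merges per-sample count tables into one table before any clustering would run. The nested-dictionary layout and the sample-name prefix on each barcode are assumptions made for this example, not the workflow's internal representation.

```python
def concatenate_counts(samples):
    """Merge per-sample count tables into a single table.

    `samples` maps sample name -> {barcode: {gene: umi_count}}.
    Barcodes are prefixed with the sample name so that identical barcodes
    from different samples stay distinct (a convention assumed here).
    Returns the combined table and a barcode -> sample lookup.
    """
    combined = {}
    origin = {}
    for sample_name, cells in samples.items():
        for barcode, counts in cells.items():
            key = f"{sample_name}_{barcode}"
            combined[key] = dict(counts)  # copy so inputs are not mutated
            origin[key] = sample_name
    return combined, origin


samples = {
    "Foo": {"AAAC": {"CD3E": 4, "MS4A1": 0}, "AAAG": {"CD3E": 0, "MS4A1": 7}},
    "Bar": {"AAAC": {"CD3E": 2}},
}
combined, origin = concatenate_counts(samples)
print(sorted(combined))  # ['Bar_AAAC', 'Foo_AAAC', 'Foo_AAAG']
```

Keeping the sample of origin for each cell makes it possible to inspect how samples distribute across clusters after co-analysis.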

Sample Grouping

Add a compSamples column to your samplesheet to control which samples are combined:

Example 1: Basic Sample Combination

sample  barcodes  compSamples
Foo     1A-6H     combine
Bar     7A-12H    combine

Example 2: Multiple Groups

sample  barcodes  compSamples
Foo1    1A-6H     group1
Bar1    7A-12H    group2
Foo2    13A-18H   group1
Bar2    19A-24H   group2

Example 3: Reporting Workflow Integration

For combining samples from multiple runs using the reporting workflow:

sample  resultDir    compSamples
Foo1    resultDir_1  fooSamples
Bar1    resultDir_2  barSamples
Foo2    resultDir_3  fooSamples
Bar2    resultDir_4  barSamples
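
To preview which samples would be combined, the compSamples column can be grouped before launching a run. The sketch below assumes the samplesheet is a comma-separated file with a header row; the function name is hypothetical, and treating rows with an empty compSamples value as "not combined" is an assumption of this example.

```python
import csv
import io
from collections import defaultdict


def group_samples(samplesheet_text):
    """Map each compSamples label to the list of samples that share it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        label = (row.get("compSamples") or "").strip()
        if label:  # rows without a label are skipped (assumed behaviour)
            groups[label].append(row["sample"])
    return dict(groups)


sheet = """sample,barcodes,compSamples
Foo1,1A-6H,group1
Bar1,7A-12H,group2
Foo2,13A-18H,group1
Bar2,19A-24H,group2
"""
print(group_samples(sheet))
# {'group1': ['Foo1', 'Foo2'], 'group2': ['Bar1', 'Bar2']}
```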

Parameter Combinations

Common Use Cases

Individual sample clustering:

seurat: true
azimuth: false
compSamples: false
annData: false

Multi-sample co-analysis with annotation:

seurat: true
azimuth: true
compSamples: true
annData: true

Python-based downstream analysis:

seurat: false
azimuth: false
compSamples: false
annData: true

Parameter Dependencies

  • azimuth: true requires a compatible reference dataset
  • compSamples: true works with both seurat and azimuth
  • annData: true can be used independently or with other parameters
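
The dependencies above can be checked before a run with a small helper like the one below. This is a sketch of one reading of the list, not the workflow's own validation: the function name and the reference_available flag are hypothetical, and the second rule (that compSamples needs seurat or azimuth to have any effect) is an interpretation of "works with both seurat and azimuth".

```python
def check_params(params, reference_available):
    """Return a list of human-readable problems with a parameter set.

    `params` holds the runParams.yml switches as booleans.
    `reference_available` says whether a suitable Azimuth reference exists.
    """
    problems = []
    if params.get("azimuth") and not reference_available:
        problems.append("azimuth: true needs a compatible reference dataset")
    # Assumed rule: co-analysis has nothing to run without an analysis step.
    if params.get("compSamples") and not (
        params.get("seurat") or params.get("azimuth")
    ):
        problems.append("compSamples: true is set without seurat or azimuth")
    return problems


co_analysis = {"seurat": True, "azimuth": True, "compSamples": True, "annData": True}
export_only = {"seurat": False, "azimuth": False, "compSamples": False, "annData": True}
print(check_params(co_analysis, reference_available=True))  # []
print(check_params(export_only, reference_available=False))  # []
```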

Output Files and Structure

The cellTyping directory contains the clustering and cell type annotation results, with one subdirectory per sample.

Directory Structure

cellTyping
└── QS-SmallKit-PBMCs.QSR-2
    ├── QS-SmallKit-PBMCs.QSR-2.html
    ├── QS-SmallKit-PBMCs.QSR-2_AzimuthObject.rds
    ├── QS-SmallKit-PBMCs.QSR-2_SeuratObject.rds
    ├── QS-SmallKit-PBMCs.QSR-2_azimuth_mapping_results.csv
    ├── QS-SmallKit-PBMCs.QSR-2_bpcells
    │   └── out
    │       ├── col_names
    │       ├── idxptr
    │       ├── index_data
    │       ├── index_idx
    │       ├── index_idx_offsets
    │       ├── index_starts
    │       ├── row_names
    │       ├── shape
    │       ├── storage_order
    │       ├── val
    │       └── version
    ├── QS-SmallKit-PBMCs.QSR-2_cellTypingResults.csv
    ├── QS-SmallKit-PBMCs.QSR-2_seurat_cluster_markers.csv
    ├── QS-SmallKit-PBMCs.QSR-2_seurat_clustering_results.csv
    ├── l1.celltype_summary.csv
    ├── l2.celltype_summary.csv
    ├── l3.celltype_summary.csv
    ├── outlier_summary.csv
    └── seurat.cluster_summary.csv
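
A quick way to confirm that a run produced the files shown in this tree is to compare the sample directory against the expected names. The sketch below takes its file list from the example tree above; which files actually appear presumably depends on the enabled parameters (for example, Azimuth outputs only when azimuth is enabled), so a reported "missing" file is not necessarily an error. The function name is hypothetical.

```python
import tempfile
from pathlib import Path

# File names taken from the example directory tree above.
PER_SAMPLE_SUFFIXES = [
    ".html",
    "_AzimuthObject.rds",
    "_SeuratObject.rds",
    "_azimuth_mapping_results.csv",
    "_cellTypingResults.csv",
    "_seurat_cluster_markers.csv",
    "_seurat_clustering_results.csv",
]
SHARED_NAMES = [
    "l1.celltype_summary.csv",
    "l2.celltype_summary.csv",
    "l3.celltype_summary.csv",
    "outlier_summary.csv",
    "seurat.cluster_summary.csv",
]


def missing_outputs(cell_typing_dir, sample):
    """List the expected file names not found under cellTyping/<sample>/."""
    sample_dir = Path(cell_typing_dir) / sample
    expected = [sample_dir / f"{sample}{suffix}" for suffix in PER_SAMPLE_SUFFIXES]
    expected += [sample_dir / name for name in SHARED_NAMES]
    return [path.name for path in expected if not path.exists()]


# Demonstration on a throwaway directory that contains only the HTML report.
with tempfile.TemporaryDirectory() as tmp:
    sample = "QS-SmallKit-PBMCs.QSR-2"
    sample_dir = Path(tmp) / "cellTyping" / sample
    sample_dir.mkdir(parents=True)
    (sample_dir / f"{sample}.html").touch()
    missing = missing_outputs(Path(tmp) / "cellTyping", sample)
    print(len(missing))  # 11
```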

Key Files and Directories

Interactive Report (*.html)

Comprehensive HTML report containing:

  • Cell type distribution plots
  • Quality metrics and statistics
  • Interactive visualizations
  • Summary tables and figures
  • Methodology and parameters used

R Analysis Objects

Seurat Object (*_SeuratObject.rds)

  • Complete Seurat object with all analysis results
  • Contains normalized data, clustering, and cell type annotations
  • Can be loaded directly into R for further analysis

Azimuth Object (*_AzimuthObject.rds)

  • Azimuth reference mapping results
  • Contains reference-based cell type predictions
  • Includes confidence scores and mapping quality metrics

Cell Type Annotations (PBMCs only)

Primary Results (*_cellTypingResults.csv)

Primary cell type annotation file containing:

  • Cell barcodes
  • Predicted cell types
  • Confidence scores
  • Reference mapping quality metrics
  • Alternative cell type predictions
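
The fields listed above lend themselves to a simple summary: count cells per predicted type, optionally keeping only confident calls. The sketch below uses the predicted_cell_type column name that appears in this guide's R usage example; the confidence column name and the 0.5 cutoff are hypothetical, so check the header of your own file before reusing it.

```python
import csv
import io
from collections import Counter


def count_cell_types(csv_text, type_col="predicted_cell_type",
                     score_col="confidence", min_score=0.0):
    """Count cells per predicted type, keeping rows with score >= min_score."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if float(row[score_col]) >= min_score:
            counts[row[type_col]] += 1
    return counts


# Tiny made-up table, for illustration only.
example = """barcode,predicted_cell_type,confidence
AAAC,B,0.95
AAAG,CD4 T,0.40
AAAT,CD4 T,0.88
"""
print(dict(count_cell_types(example)))                 # {'B': 1, 'CD4 T': 2}
print(dict(count_cell_types(example, min_score=0.5)))  # {'B': 1, 'CD4 T': 1}
```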

Reference Mapping (*_azimuth_mapping_results.csv)

Detailed Azimuth reference mapping results:

  • Reference dataset used
  • Mapping scores and confidence
  • Cell type hierarchy levels
  • Quality metrics per cell

Clustering Analysis

Cluster Assignments (*_seurat_clustering_results.csv)

Unsupervised clustering results:

  • Cluster assignments for each cell
  • Cluster quality metrics
  • Cell distribution across clusters
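
Cluster assignments become easier to interpret when cross-tabulated against the cell type annotations. The sketch below joins the two tables on the cell barcode and counts cells per (cluster, cell type) pair. All column names here (barcode, cluster, predicted_cell_type) are assumptions for illustration and may differ from the actual CSV headers.

```python
import csv
import io
from collections import Counter


def cluster_by_cell_type(clusters_csv, types_csv, barcode_col="barcode",
                         cluster_col="cluster",
                         type_col="predicted_cell_type"):
    """Count cells for each (cluster, cell type) pair, joined on barcode."""
    types = {row[barcode_col]: row[type_col]
             for row in csv.DictReader(io.StringIO(types_csv))}
    table = Counter()
    for row in csv.DictReader(io.StringIO(clusters_csv)):
        # Cells absent from the annotation table are kept under a placeholder.
        cell_type = types.get(row[barcode_col], "unannotated")
        table[(row[cluster_col], cell_type)] += 1
    return table


# Tiny made-up tables, for illustration only.
clusters = "barcode,cluster\nAAAC,0\nAAAG,1\nAAAT,1\nAACC,1\n"
cell_types = "barcode,predicted_cell_type\nAAAC,B\nAAAG,CD4 T\nAAAT,CD4 T\n"
for (cluster, cell_type), n in sorted(cluster_by_cell_type(clusters, cell_types).items()):
    print(cluster, cell_type, n)
# 0 B 1
# 1 CD4 T 2
# 1 unannotated 1
```

A cluster dominated by one annotated type suggests agreement between the unsupervised and reference-based results; a cluster split across several types may deserve a closer look at its marker genes.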

Marker Genes (*_seurat_cluster_markers.csv)

Differential expression analysis:

  • Marker genes for each cluster
  • Statistical significance metrics
  • Expression fold changes
  • Gene ontology enrichment

Summary Statistics

Cell Type Counts (l*.celltype_summary.csv)

  • Cell type counts at different annotation levels
  • Level 1: Major cell types (e.g., T cells, B cells)
  • Level 2: Subtypes (e.g., CD4+ T cells, CD8+ T cells)
  • Level 3: Detailed subtypes (e.g., Naive CD4+ T cells)

Cluster Summary (seurat.cluster_summary.csv)

  • Summary statistics for each cluster
  • Cell counts per cluster
  • Average gene expression metrics
  • Quality indicators

Outlier Analysis (outlier_summary.csv)

  • Cells that could not be confidently annotated
  • Quality metrics for outlier cells
  • Potential reasons for annotation failure

Optimized Matrix (*_bpcells/)

Optimized sparse matrix format for efficient analysis:

  • col_names: Cell barcodes
  • row_names: Gene names
  • val: Expression values
  • idxptr: Column pointers for sparse matrix
  • shape: Matrix dimensions
  • storage_order: Data storage format
  • version: BPCells format version

See the BPCells documentation for full details.

Analysis Methods

Azimuth Reference Mapping

  • Reference Datasets: Human PBMC, mouse brain, and other tissue references
  • Mapping Algorithm: Seurat's reference mapping approach
  • Quality Metrics: Prediction confidence scores
  • Hierarchical Annotation: Multiple levels of cell type specificity

Unsupervised Clustering

  • Clustering Algorithm: Louvain community detection
  • Resolution Parameters: Optimized for cell type discovery
  • Marker Gene Analysis: Differential expression to identify cell types
  • Quality Assessment: Cluster stability and separation metrics

File Formats

CSV Files

  • Cell Type Results: Cell barcodes, annotations, confidence scores
  • Summary Statistics: Counts and proportions by cell type
  • Marker Genes: Differential expression results
  • Quality Metrics: Various quality indicators

RDS Files

  • Seurat Objects: Complete analysis objects for R
  • Azimuth Objects: Reference mapping results
  • Compatibility: Can be loaded directly into R/Seurat

Usage Examples

Loading Results in R

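library(Seurat)

# readRDS() and read.csv() do not expand wildcards, so resolve the paths first.
# Outputs sit in one subdirectory per sample: cellTyping/<sample>/
rds_path <- Sys.glob("cellTyping/*/*_SeuratObject.rds")[1]
csv_path <- Sys.glob("cellTyping/*/*_cellTypingResults.csv")[1]

# Load Seurat object
seurat_obj <- readRDS(rds_path)

# Load cell type annotations
cell_types <- read.csv(csv_path)

# View cell type distribution
# (check colnames(cell_types) if the column is named differently)
table(cell_types$predicted_cell_type)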

Python Analysis

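import glob

import pandas as pd

# pd.read_csv() does not expand wildcards, so resolve the paths first.
# Outputs sit in one subdirectory per sample: cellTyping/<sample>/
results_path = glob.glob("cellTyping/*/*_cellTypingResults.csv")[0]
summary_path = glob.glob("cellTyping/*/l1.celltype_summary.csv")[0]

# Load cell type results
cell_types = pd.read_csv(results_path)

# Load summary statistics
summary = pd.read_csv(summary_path)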

Technical Notes & Best Practices

  • Memory requirements: Large datasets may require significant RAM
  • Computational time: Sketch-based analysis reduces runtime for large datasets
  • Reference compatibility: Ensure Azimuth reference matches your species/tissue
  • Sample quality: Poor quality samples may affect co-analysis results
  • Batch effects: Consider batch correction for multi-sample analyses


Need Help?

For more information, please contact support@scale.bio or visit our support website.