Nextflow Basics for ScaleRna
This guide covers the fundamental concepts of using Nextflow with ScaleRna, including configuration, execution, and troubleshooting. For topics beyond this guide, see the official Nextflow documentation.
What is Nextflow?
Nextflow is a workflow management system for creating scalable, portable, and reproducible workflows. It uses a dataflow programming model that simplifies writing distributed workflows by allowing you to focus on data flow and computation. Nextflow can deploy workflows on a variety of execution platforms, including your local machine, HPC schedulers, and cloud environments.
Key Nextflow Concepts
- Processes: Individual computational steps in your workflow
- Channels: Data structures that connect processes
- Executors: Backends that determine where and how processes are executed
- Profiles: Pre-configured settings for different execution environments
For detailed information on these concepts, see:
- Nextflow Overview
- Processes
- Channels
- Executors
Basic ScaleRna Command
A typical command to run the ScaleRna workflow:
nextflow run /PATH/TO/ScaleRna/ -profile docker --samples samples.csv --genome /PATH/TO/GRCh38/grch38.json --runFolder /PATH/TO/runFolder --outDir output
Command Line Syntax
- Nextflow options use a single dash (-profile, -params-file)
- Workflow parameters use a double dash (--outDir, --samples)
- See Nextflow CLI Reference for complete command-line options
Configuration
Configuration Hierarchy
ScaleRna uses Nextflow configuration files to manage system settings, compute resources, and execution parameters. The configuration hierarchy is:
- Command-line parameters (highest priority)
- runParams.yml file (recommended for analysis parameters)
- nextflow.config (system settings and defaults)
For detailed configuration information, see:
- Nextflow Configuration
- Configuration Options
Specifying Analysis Parameters
Analysis parameters can be defined in a runParams.yml file or directly on the command line. See Analysis Parameters for details on available options.
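As a rough sketch, the workflow parameters from the basic command shown earlier can be moved into a runParams.yml file. The keys below are the same parameter names without the leading double dash, and the values are the placeholder paths from that command, so adjust them to your own data. Parameters beyond these four are not shown here.

```shell
#!/bin/sh
# Write a minimal runParams.yml holding the workflow parameters
# (samples, genome, runFolder, outDir) from the basic command.
# The paths are placeholders and must be replaced with real locations.
cat > runParams.yml <<'EOF'
samples: samples.csv
genome: /PATH/TO/GRCh38/grch38.json
runFolder: /PATH/TO/runFolder
outDir: output
EOF

# The run would then be launched with the Nextflow option -params-file:
#   nextflow run /PATH/TO/ScaleRna/ -profile docker -params-file runParams.yml
```

Any of these values can still be overridden for a single run by passing the matching double-dash parameter on the command line, since command-line parameters take priority over the file.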
Custom Configuration
Use a custom configuration file for system-specific settings:
nextflow run -c path/to/user.config ScaleRna -profile docker -params-file runParams.yml --outDir output
Resource Management
Optimized Workflow-Level Resources
The default resource allocations are tuned to handle datasets ranging from small to large, so we recommend modifying resource requests only when absolutely necessary. The workflow detects process failures caused by insufficient resources and retries the task automatically, multiplying the resource request by the task attempt number. The default is 3 attempts (retries).
Decrease Resources
To decrease the maximum resources that any task can request, use the following workflow-level command-line parameters:
--taskMaxMemory 124.GB --taskMaxCpus 12
Increase Resources
This requires increasing the individual task cpus and memory, and also increasing taskMaxMemory and taskMaxCpus if the intended resource allocation is higher than the default.
Default configuration:
taskMaxMemory = 256.GB
taskMaxCpus = 16
...
withName:BclConvert {
    container = 'nfcore/bclconvert:3.9.3'
    cpus = { max_cpu(16) }
    memory = { max_mem(120.GB * task.attempt) }
}
Increased resources (custom.nextflow.config):
taskMaxMemory = 500.GB
taskMaxCpus = 64
...
withName:BclConvert {
    container = 'nfcore/bclconvert:3.9.3'
    cpus = { max_cpu(32) }
    memory = { max_mem(250.GB * task.attempt) }
}
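To see how the retry multiplier interacts with the task maximum, the memory arithmetic for BclConvert can be traced by hand. The sketch below assumes that max_mem(...) returns the smaller of the requested value and taskMaxMemory. That behavior is inferred from the parameter names rather than taken from the workflow source, and capped_request is a hypothetical helper written only for this illustration.

```shell
#!/bin/sh
# capped_request BASE_GB ATTEMPT CAP_GB
# Hypothetical helper mirroring `max_mem(BASE.GB * task.attempt)`, under the
# assumption that max_mem() clamps the request to taskMaxMemory.
capped_request() {
    base_gb=$1
    attempt=$2
    cap_gb=$3
    request=$((base_gb * attempt))
    # Clamp the request when it exceeds the cap.
    if [ "$request" -gt "$cap_gb" ]; then
        request=$cap_gb
    fi
    echo "$request"
}

# Default configuration: 120 GB base request, 256 GB cap, three attempts.
for attempt in 1 2 3; do
    echo "attempt $attempt: $(capped_request 120 "$attempt" 256) GB"
done
```

Under this assumption the default configuration requests 120 GB, 240 GB, and then 256 GB, so the third attempt is limited by taskMaxMemory rather than by the multiplier. This is why raising a task's base memory usually also requires raising taskMaxMemory.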
For detailed process configuration, see:
- Process Directives
- Resource Limits
Cluster Execution
Setting Up Cluster Executors
Using a scheduler is highly recommended if available. Runtimes and compute costs can decrease dramatically with proper configuration.
Useful Nextflow resources for executor configuration:
- Executors Overview
- Basic Training - Executors
Example SLURM configuration:
process {
    executor = 'slurm'
    queue = 'short'
    memory = '10 GB'
    time = '30 min'
    cpus = 4
}
Note: Queues are preset configurations that are defined system-wide. Contact your system administrator or IT help desk to find out which queues are available to you.
For detailed executor documentation, see:
- SLURM Executor
- SGE Executor
- AWS Batch
- Azure Batch
Understanding the .nextflow.log file
Log Files
The .nextflow.log file contains verbose information about your Nextflow run and is essential for debugging. This file provides detailed insights into workflow execution, process configuration, and system resources.
Command:
nextflow run ScaleRna -profile docker -params-file runParams.yml --outDir QuantumScale_v2.1_OUT
File structure:
├── .nextflow.log
└── QuantumScale_v2.1_OUT
    ├── alignment
    ├── barcodes
    ├── cellTyping
    ├── fastq
    ├── reports
    ├── samples
    ├── samples.csv
    └── workflow_info.json
Custom log location:
nextflow run -log /path/to/myRun.nextflow.log ScaleRna -profile docker -params-file runParams.yml --outDir QuantumScale_v2.1_OUT
Log File Contents
The .nextflow.log file contains several key sections:
1. Session Information
Version: 24.10.2 build 5932
Created: 27-11-2024 21:23 UTC
System: Linux 4.18.0-477.27.1.el8_8.x86_64
Runtime: Groovy 4.0.23 on OpenJDK 64-Bit Server VM 17.0.9+8-LTS
CPUs: 192 - Mem: 754.9 GB (505.7 GB) - Swap: 32 GB (19.2 GB)
2. Workflow Configuration and Input Parameters
Core Nextflow Options
Workflow Directory:: /home/ScaleRna-master
Workflow Version: : 2.1.0
Command Line : nextflow run ScaleRna/ -profile singularity -params-file runParams.yml
Nextflow RunName : kickass_gautier
Profile : singularity
Container Engine : apptainer
Work Directory : /home/work
...
fastqDir : /home/fastqs
samples : /home/samples.csv
genome : /home/genome.json
libStructure : libQuantumV1.0.json
3. Configuration and Executor Information
Config settings `withLabel:small` matches labels `small` for process with name RegularizeSamplesCsv
Config settings `withName:TrimFq` matches process END_TO_END:INPUT_READS:TrimFq
Config settings `withName:StarSolo` matches process END_TO_END:ALIGNMENT:StarSolo
...
Creating local task monitor for executor 'local' > cpus=192; memory=754.9 GB; capacity=192
Creating task monitor for executor 'slurm' > capacity: 100; pollInterval: 5s
Common Error Patterns
1. Interrupted Run
[SIGINT handler] DEBUG nextflow.Session - Session aborted -- Cause: SIGINT
Cause: The run was manually interrupted (Ctrl+C)
2. Missing Input File Errors
ERROR: Must specify --samples (e.g. samples.csv)
Cause: Missing required input parameters
Solution: Ensure all required parameters are provided
3. Environment Creation Failures
Failed to create Conda environment
command: conda env create --prefix /path/to/env --file /path/to/scalerna.conda.yml
status : 143
Cause: Conda environment creation failed (often due to package conflicts or network issues)
Solution: Check conda configuration, network connectivity, or try using a different profile
4. Data Validation Errors
ERROR: No valid PCR pool found in index2 fastq files
Cause: Input data doesn't meet expected format or quality criteria
Solution: Verify input file format and content; see fastq generation details
5. Resource Allocation Errors
- Memory allocation failures
- CPU limit exceeded
- Disk space issues
6. Process Exit Codes
- Exit code 1: General error - indicates a program terminated due to an error condition. This is the most common exit code and usually means the process encountered an error during execution (e.g., invalid input, missing files, permission issues, or application-specific errors).
- Exit code 137: Out of memory (OOM) - the process was killed by the system's out-of-memory killer (OOM killer). This typically occurs when the process exceeds the allocated memory limit or when the system runs out of available memory.
- Exit code 143: Process killed - the process received a SIGTERM signal, usually sent by the system or scheduler when terminating a job (e.g., when a job exceeds its time limit or is manually cancelled).
- Exit code 139: Segmentation fault - the process crashed due to a memory access violation, typically indicating a potential bug in the software or incompatible system libraries.
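Exit codes above 128 follow the common shell convention of 128 plus the number of the signal that ended the process, which is how 137, 139, and 143 map to SIGKILL (9), SIGSEGV (11), and SIGTERM (15). The sketch below decodes an exit code into a signal name. The function name signal_for_exit_code is a hypothetical helper for illustration and is not part of Nextflow or ScaleRna.

```shell
#!/bin/sh
# signal_for_exit_code CODE
# Prints the signal name for an exit status above 128, or "none" otherwise.
signal_for_exit_code() {
    code=$1
    if [ "$code" -gt 128 ]; then
        # kill -l with a signal number prints the matching signal name.
        kill -l "$((code - 128))"
    else
        echo "none"
    fi
}

# Decode the exit codes discussed in this section.
for code in 1 137 139 143; do
    echo "exit code $code -> signal: $(signal_for_exit_code "$code")"
done
```

On a typical Linux shell this reports none, KILL, SEGV, and TERM in that order. A KILL result does not by itself prove an out-of-memory event, so check the scheduler or system logs to confirm the cause.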
Note: The .nextflow.log file is a "hidden" file and is therefore typically invisible in GUI file browsers. Use ls -al from the command line to view it.
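When a run fails, a quick first pass is to search the log for the messages quoted in the error patterns above. The sketch below is one way to do that with grep. The function name scan_nextflow_log is a hypothetical helper, and the pattern list covers only the messages shown in this guide, so extend it as needed.

```shell
#!/bin/sh
# scan_nextflow_log LOGFILE
# Prints matching lines, prefixed with their line numbers, for the error
# messages quoted in this guide.
scan_nextflow_log() {
    grep -n -E 'ERROR|Session aborted|Failed to create Conda environment' "$1"
}

# Scan the default log in the launch directory when it exists.
if [ -f .nextflow.log ]; then
    scan_nextflow_log .nextflow.log
fi
```

For a log written to a custom location with -log, pass that path to the function instead.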
Parameter Priority
The parameter priority order is:
- Command-line (highest priority)
- runParams.yml (-params-file)
- nextflow.config (lowest priority)
Recommended usage:
- nextflow.config: Use for global changes you want to make for every workflow run in the future (e.g., executor setup, resource increases for specific jobs)
- runParams.yml (RECOMMENDED): Preferred for setting boolean pipeline parameters (e.g., cellFinder, quantum, seurat, annData) and path variables for input files
- Command-line: Use only for one-off run settings specific to this execution (e.g., changing log location with -log, using custom config with -c)
For detailed parameter precedence, see:
- Configuration Precedence
Workflow Information
All settings and parameters used for a workflow run are available in the workflow_info.json file that is exported to the outDir.
Running in the Cloud
Nextflow supports execution on various cloud platforms, including AWS Batch and Azure Batch.
In addition, Seqera Platform offers another simple way to manage and execute Nextflow workflows on AWS.
Offline Execution
Running Without Internet Access
Steps to pre-download Singularity images and set NXF_SINGULARITY_LIBRARYDIR:
- Pre-download containers:
singularity pull docker://public.ecr.aws/o5i3p364/scale_rna:v1.4
- Set environment variable:
export NXF_SINGULARITY_LIBRARYDIR=/path/to/containers
- Run with custom config:
nextflow run -c offline.config ScaleRna -profile singularity
- In the "offline.config":
process.container = "file:///path/to/scale_rna_v1.4.sif"
For more information on offline execution, see:
- Singularity
- Caching and Resuming
Additional Resources
Nextflow Documentation
Training and Community
Version Information
See the ScaleRna change log for version updates and changes.
Need Help?
For more information, please contact support@scale.bio or visit our support website.