Nextflow Basics for ScaleRna

This guide covers the fundamental concepts of using Nextflow with ScaleRna, including configuration, execution, and troubleshooting. For comprehensive Nextflow documentation, visit the official Nextflow documentation.

What is Nextflow?

Nextflow is a workflow management system for creating scalable, portable, and reproducible workflows. It uses a dataflow programming model that simplifies writing distributed workflows by allowing you to focus on data flow and computation. Nextflow can deploy workflows on a variety of execution platforms, including your local machine, HPC schedulers, and cloud environments.

Key Nextflow Concepts

Processes: Individual computational steps in your workflow
Channels: Data structures that connect processes
Executors: Backends that determine where and how processes are executed
Profiles: Pre-configured settings for different execution environments

For detailed information on these concepts, see: Nextflow Overview, Processes, Channels, Executors.

Basic ScaleRna Command

A typical command to run the ScaleRna workflow:

nextflow run /PATH/TO/ScaleRna/ -profile docker --samples samples.csv --genome /PATH/TO/GRCh38/grch38.json --runFolder /PATH/TO/runFolder --outDir output

Command Line Syntax

Nextflow options use single dash (-profile, -params-file)
Workflow parameters use double dash (--outDir, --samples)
See Nextflow CLI Reference for complete command-line options

Configuration

Configuration Hierarchy

ScaleRna uses Nextflow configuration files to manage system settings, compute resources, and execution parameters. The configuration hierarchy is:

Command-line parameters (highest priority)
runParams.yml file (recommended for analysis parameters)
nextflow.config (system settings and defaults)

For detailed configuration information, see: Nextflow Configuration, Configuration Options.

Specifying Analysis Parameters

Analysis parameters can be defined in a runParams.yml file or directly on the command line. See Analysis Parameters for details on available options.
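As a rough illustration, the parameters from the basic command shown earlier could be moved into a runParams.yml file. The sketch below assumes that the YAML keys match the command-line parameter names without the leading dashes, which is the usual convention for a Nextflow -params-file. The paths are the same placeholders used in the basic command and must be replaced with real locations.

```shell
# Write a minimal runParams.yml holding the parameters from the basic command.
# Assumption: keys mirror the --samples, --genome, --runFolder and --outDir options.
cat > runParams.yml <<'EOF'
samples: samples.csv
genome: /PATH/TO/GRCh38/grch38.json
runFolder: /PATH/TO/runFolder
outDir: output
EOF

# Print the file so it can be checked before launching the workflow.
cat runParams.yml
```

The file would then be passed with the single-dash Nextflow option, for example: nextflow run /PATH/TO/ScaleRna/ -profile docker -params-file runParams.yml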

Custom Configuration

Use a custom configuration file for system-specific settings:

nextflow run -c path/to/user.config ScaleRna -profile docker -params-file runParams.yml --outDir output

Resource Management

Optimized Workflow-Level Resources

In general, the default resource allocations are tuned to handle small to large datasets, and we strongly recommend changing a resource request only when absolutely necessary. When a process fails for lack of resources, the workflow detects the failure and retries the task automatically, multiplying the resource request by the task attempt number. By default, up to 3 attempts (retries) are made.
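The retry scaling can be sketched with plain shell arithmetic. This is an illustration, not the workflow's own code: it uses the BclConvert default listed in this guide (120.GB multiplied by task.attempt, with a taskMaxMemory of 256.GB) and assumes that max_mem() caps each request at taskMaxMemory. The function name requested_memory_gb is hypothetical.

```shell
# Illustrative arithmetic only, not the workflow's own code.
# Values are the BclConvert defaults listed in this guide; the capping
# behaviour of max_mem() is an assumption.
task_max_memory_gb=256   # taskMaxMemory default
base_memory_gb=120       # memory = { max_mem(120.GB * task.attempt) }

# Print the memory in GB requested on the given attempt, capped at the maximum.
requested_memory_gb() {
  request=$(( base_memory_gb * $1 ))
  if [ "$request" -gt "$task_max_memory_gb" ]; then
    request=$task_max_memory_gb
  fi
  echo "$request"
}

for attempt in 1 2 3; do
  echo "attempt $attempt: $(requested_memory_gb "$attempt") GB"
done
```

Under these assumptions the three attempts request 120, 240, and 256 GB, which is why raising an individual task's memory may also require raising taskMaxMemory.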

Decrease Resources

To lower the maximum resources that any single task can request, use the following workflow-level command-line parameters:

--taskMaxMemory 124.GB --taskMaxCpus 12

Increase Resources

To give a task more resources, increase its cpus and memory settings. If the new request exceeds the default maximums, also increase taskMaxMemory and taskMaxCpus.

Default configuration:

taskMaxMemory = 256.GB
taskMaxCpus = 16
...
withName:BclConvert {
        container = 'nfcore/bclconvert:3.9.3'
        cpus = { max_cpu(16) }
        memory = { max_mem(120.GB * task.attempt) }
    }

Increased resources (custom.nextflow.config):

taskMaxMemory = 500.GB
taskMaxCpus = 64
...
withName:BclConvert {
        container = 'nfcore/bclconvert:3.9.3'
        cpus = { max_cpu(32) }
        memory = { max_mem(250.GB * task.attempt) }
    }

For detailed process configuration, see: Process Directives, Resource Limits.

Cluster Execution

Setting Up Cluster Executors

Using a scheduler is highly recommended if available. Runtimes and compute costs can decrease dramatically with proper configuration.

Useful Nextflow resources for executor configuration: Executors Overview and Basic Training - Executors.

Example SLURM configuration:

process {
  executor = 'slurm'
  queue = 'short'
  memory = '10 GB'
  time = '30 min'
  cpus = 4
}

Note: Queues are preset configurations defined system-wide on your cluster. Contact your system administrator or IT help desk to find out which queues are available.
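On a SLURM cluster, the available queues (called partitions in SLURM) can usually be listed with the standard sinfo client command. The helper below is a small sketch under that assumption; the function name list_slurm_partitions is hypothetical, and other schedulers have their own equivalents.

```shell
# Print a one-line-per-partition summary, or fail clearly when SLURM
# client tools are not installed on this machine.
list_slurm_partitions() {
  if command -v sinfo >/dev/null 2>&1; then
    sinfo -s   # -s prints a summary per partition
  else
    echo "sinfo not found: SLURM client tools are not on PATH" >&2
    return 1
  fi
}
```

A partition name from that listing is what the queue setting in the example configuration would refer to.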

For detailed executor documentation, see: SLURM Executor, SGE Executor, AWS Batch, Azure Batch.

Understanding the .nextflow.log file

Log Files

The .nextflow.log file contains verbose information about your Nextflow run and is essential for debugging. This file provides detailed insights into workflow execution, process configuration, and system resources.

Command:

nextflow run ScaleRna -profile docker -params-file runParams.yml --outDir QuantumScale_v2.1_OUT

File structure:

├── .nextflow.log
└── QuantumScale_v2.1_OUT
  ├── alignment
  ├── barcodes
  ├── cellTyping
  ├── fastq
  ├── reports
  ├── samples
  ├── samples.csv
  └── workflow_info.json

Custom log location:

nextflow -log /path/to/myRun.nextflow.log run ScaleRna -profile docker -params-file runParams.yml --outDir QuantumScale_v2.1_OUT

Log File Contents

The .nextflow.log file contains several key sections:

1. Session Information

Version: 24.10.2 build 5932
Created: 27-11-2024 21:23 UTC
System: Linux 4.18.0-477.27.1.el8_8.x86_64
Runtime: Groovy 4.0.23 on OpenJDK 64-Bit Server VM 17.0.9+8-LTS
CPUs: 192 - Mem: 754.9 GB (505.7 GB) - Swap: 32 GB (19.2 GB)

2. Workflow Configuration and Input Parameters

Core Nextflow Options
  Workflow Directory:: /home/ScaleRna-master
  Workflow Version:  : 2.1.0
  Command Line       : nextflow run ScaleRna/ -profile singularity -params-file runParams.yml
  Nextflow RunName   : kickass_gautier
  Profile            : singularity
  Container Engine   : apptainer
  Work Directory     : /home/work
  ...
  fastqDir           : /home/fastqs
  samples            : /home/samples.csv
  genome             : /home/genome.json
  libStructure       : libQuantumV1.0.json

3. Configuration and Executor Information

Config settings `withLabel:small` matches labels `small` for process with name RegularizeSamplesCsv
Config settings `withName:TrimFq` matches process END_TO_END:INPUT_READS:TrimFq
Config settings `withName:StarSolo` matches process END_TO_END:ALIGNMENT:StarSolo
...
Creating local task monitor for executor 'local' > cpus=192; memory=754.9 GB; capacity=192
Creating task monitor for executor 'slurm' > capacity: 100; pollInterval: 5s

Common Error Patterns

1. Interrupted Run

[SIGINT handler] DEBUG nextflow.Session - Session aborted -- Cause: SIGINT

Logged when a run is interrupted manually (Ctrl+C).

2. Missing Input File Errors

ERROR: Must specify --samples (e.g. samples.csv)

Cause: Missing required input parameters

Solution: Ensure all required parameters are provided

3. Environment Creation Failures

Failed to create Conda environment
command: conda env create --prefix /path/to/env --file /path/to/scalerna.conda.yml
status : 143

Cause: Conda environment creation failed (often due to package conflicts or network issues)

Solution: Check conda configuration, network connectivity, or try using a different profile

4. Data Validation Errors

ERROR: No valid PCR pool found in index2 fastq files

Cause: Input data doesn't meet expected format or quality criteria

Solution: Verify input file format and content; see fastq generation details

5. Resource Allocation Errors

Memory allocation failures
CPU limit exceeded
Disk space issues

6. Process Exit Codes

Exit code 1: General error - indicates a program terminated due to an error condition. This is the most common exit code and usually means the process encountered an error during execution (e.g., invalid input, missing files, permission issues, or application-specific errors).
Exit code 137: Out of memory (OOM) - the process was killed with SIGKILL (128 + 9), most often by the system's out-of-memory killer (OOM killer). This typically occurs when the process exceeds its allocated memory limit or when the system runs out of available memory.
Exit code 139: Segmentation fault - the process crashed with SIGSEGV (128 + 11) due to a memory access violation, typically indicating a bug in the software or incompatible system libraries.
Exit code 143: Process killed - the process received SIGTERM (128 + 15), usually sent by the system or scheduler when terminating a job (e.g., when a job exceeds its time limit or is manually cancelled).
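Exit codes above 128 conventionally mean that the process was terminated by a signal, with the code equal to 128 plus the signal number. The helper below is a minimal sketch of that rule for the codes listed here; the function name explain_exit_code is hypothetical and the wording of each message is only illustrative.

```shell
# Map a process exit code to a short explanation.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "success" ;;
    1)   echo "general error" ;;
    137) echo "SIGKILL (128 + 9), often the OOM killer" ;;
    139) echo "SIGSEGV (128 + 11), segmentation fault" ;;
    143) echo "SIGTERM (128 + 15), terminated by system or scheduler" ;;
    *)
      # Any other code above 128 encodes 128 + signal number.
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $(( code - 128 ))"
      else
        echo "application-specific error"
      fi
      ;;
  esac
}

explain_exit_code 137
explain_exit_code 130
```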

Note: The .nextflow.log file is a hidden file, so GUI file browsers typically do not show it. Use ls -al from the command line to see it.
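When a run fails, a quick first step is to search the log for the messages described above. The following sketch defines a hypothetical helper, scan_nextflow_log, that prints matching lines with their line numbers. The pattern list is taken from the examples in this guide and is not exhaustive.

```shell
# scan_nextflow_log [LOGFILE]
# Print log lines that match the error patterns described in this guide.
# Defaults to .nextflow.log in the current directory.
scan_nextflow_log() {
  log_file=${1:-.nextflow.log}
  if [ ! -f "$log_file" ]; then
    echo "log file not found: $log_file" >&2
    return 1
  fi
  # -n prefixes each match with its line number; -E enables alternation.
  grep -n -E 'Session aborted|ERROR|Failed to create Conda environment' "$log_file"
}
```

For example, scan_nextflow_log /path/to/myRun.nextflow.log searches a log written to a custom location. Like grep, the helper returns a non-zero status when no line matches.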

Parameter Priority

The parameter priority order is:

Command-line (highest priority)
runParams.yml (-params-file)
nextflow.config (lowest priority)
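The precedence order can be pictured as a simple fallback chain. The sketch below only illustrates the rule and is not how Nextflow resolves parameters internally; resolve_param and its example values are hypothetical.

```shell
# resolve_param CLI_VALUE PARAMS_FILE_VALUE CONFIG_VALUE
# Print the first non-empty value, checking the command line first,
# then runParams.yml, then nextflow.config.
resolve_param() {
  if [ -n "$1" ]; then
    echo "$1"
  elif [ -n "$2" ]; then
    echo "$2"
  else
    echo "$3"
  fi
}

# outDir set only in runParams.yml and nextflow.config: the YAML value wins.
resolve_param "" "yml_output" "config_output"

# outDir also given on the command line: the command-line value wins.
resolve_param "cli_output" "yml_output" "config_output"
```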

Recommended usage:

nextflow.config: Use for global changes you want to make for every workflow run in the future (e.g., executor setup, resource increases for specific jobs)
runParams.yml (RECOMMENDED): Preferred for setting boolean pipeline parameters (e.g., cellFinder, quantum, seurat, annData) and path variables for input files
Command-line: Use only for one-off run settings specific to this execution (e.g., changing log location with -log, using custom config with -c)

For detailed parameter precedence, see: Configuration Precedence.

Workflow Information

All settings and parameters used for a workflow run are recorded in workflow_info.json, which is written to the outDir.

Running in the Cloud

Nextflow supports execution on various cloud platforms, including AWS Batch and Azure Batch (see the executor documentation listed under Cluster Execution).

In addition, Seqera Platform offers a simple way to manage and execute Nextflow workflows on AWS.

Offline Execution

Running Without Internet Access

Steps to pre-download Singularity images and set NXF_SINGULARITY_LIBRARYDIR:

Pre-download containers:

singularity pull docker://public.ecr.aws/o5i3p364/scale_rna:v1.4

Set environment variable:

export NXF_SINGULARITY_LIBRARYDIR=/path/to/containers

Run with custom config:

nextflow run -c offline.config ScaleRna -profile singularity

In the "offline.config":

process.container = "file:///path/to/scale_rna_v1.4.sif"

For more information on offline execution, see: Singularity, Caching and Resuming.

Additional Resources

Nextflow Documentation

Training and Community

Version Information

See the ScaleRna change log for version updates and changes.

Need Help?

For more information, please contact support@scale.bio or visit our support website.