Workflow Conversion Best Practices¶

This guide provides best practices for converting workflows between different execution environments using wf2wf.

Understanding Execution Environment Differences¶

Shared Filesystem vs Distributed Computing¶

Workflow engines make different assumptions about their execution environment:

Aspect	Shared Filesystem	Distributed Computing
File Access	All files accessible via shared filesystem	Files must be explicitly transferred
Resources	Minimal specifications, system defaults	Explicit CPU, memory, disk allocation
Environment	System-wide software or conda	Container specifications required
Error Handling	Basic retry mechanisms	Sophisticated retry policies
Parallelization	Implicit (wildcards)	Explicit scatter/gather

Pre-Conversion Analysis¶

1. Understand Your Source Workflow¶

Before converting, analyze your source workflow:

# Get detailed information about your workflow
wf2wf info workflow.smk

# Check for potential issues
wf2wf validate workflow.smk

# Preview conversion with dry-run mode
wf2wf convert -i workflow.smk -o workflow.dag --dry-run

2. Identify Target Environment Requirements¶

Understand what your target environment needs:

HTCondor/DAGMan: Explicit resources, containers, file transfers
Nextflow: Container specifications, resource limits
CWL: Resource requirements, software requirements
Snakemake: Conda environments, resource specifications

Interactive Conversion Workflow¶

Step 1: Initial Conversion with Analysis¶

# Convert with interactive mode for guided assistance
wf2wf convert -i workflow.smk -o workflow.dag --interactive --verbose

# This enables:
# - Interactive resource specification
# - Guided environment configuration
# - Error handling setup
# - File transfer optimization

Step 2: Review Configuration Analysis¶

The conversion report will show:

## Configuration Analysis

### Resource Analysis
* **Memory**: 2 tasks without explicit memory requirements
* **CPU**: 1 task without CPU specifications
* **Disk**: 3 tasks without disk requirements

### Environment Analysis
* **Containers**: 3 tasks without container/conda specifications
* **Software**: 2 tasks with system dependencies

### Error Handling Analysis
* **Retry Policies**: 3 tasks without retry specifications
* **Error Recovery**: 1 task without error handling

### File Transfer Analysis
* **Transfer Modes**: 6 files with auto-detected transfer modes
* **Dependencies**: 2 files with missing dependency specifications

**Recommendations:**
* Add explicit resource requirements for all tasks
* Specify container images or conda environments for environment isolation
* Configure retry policies for fault tolerance
* Use `--infer-resources` to automatically detect resource requirements
* Use `--resource-profile cluster` for standard cluster specifications
* Enable `--auto-env` for automatic container generation
* Review file transfer modes for distributed execution
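The counts in a report like this can be reproduced by scanning each task for unset fields. A minimal sketch, assuming tasks are plain dictionaries; the task records and the count_missing helper are hypothetical and do not reflect wf2wf's internal data model:

```python
# Hypothetical task records; wf2wf's real internal representation may differ.
tasks = [
    {"id": "align", "cpu": 4, "mem_mb": 8192, "disk_mb": None,
     "container": None, "retries": 0},
    {"id": "sort", "cpu": None, "mem_mb": None, "disk_mb": None,
     "container": "docker://python:3.9-slim", "retries": 2},
    {"id": "report", "cpu": 1, "mem_mb": None, "disk_mb": None,
     "container": None, "retries": 0},
]


def count_missing(tasks, field):
    """Count tasks whose `field` is unset (None) or zero."""
    return sum(1 for task in tasks if not task.get(field))


summary = {field: count_missing(tasks, field)
           for field in ("cpu", "mem_mb", "disk_mb", "container", "retries")}

for field, missing in summary.items():
    print(f"{field}: {missing} task(s) without an explicit value")
```

Running a check like this before conversion shows how many prompts to expect in interactive mode.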

Step 3: Address Issues with Interactive Assistance¶

Use the interactive prompts to:

Add default resource specifications
Specify container environments
Configure retry policies
Review file transfer modes

# Interactive resource specification
Found 3 tasks without explicit resource requirements.
Distributed systems require explicit resource allocation.
Add default resource specifications? [Y/n]: Y

Applied default resources: CPU=1, Memory=2048MB, Disk=4096MB

# Interactive environment specification
Found 2 tasks without container specifications.
Distributed systems typically require explicit environment isolation.
Add container specifications or conda environments? [Y/n]: Y

Enable --auto-env to automatically build containers for these tasks.

# Interactive error handling
Found 3 tasks without retry specifications.
Distributed systems benefit from explicit error handling.
Add retry specifications for failed tasks? [Y/n]: Y

Applied default retry settings (2 retries)

Best Practices by Conversion Type¶

Snakemake → DAGMan¶

Before Conversion:

# Add resource specifications to your Snakefile
rule process_data:
    input: "data/{sample}.txt"
    output: "processed/{sample}.txt"
    resources:
        mem_mb=4096,
        disk_mb=4096
    shell: "python process.py {input} > {output}"

Enhanced Conversion:

# Convert with all enhanced features
wf2wf convert -i workflow.smk -o workflow.dag \
    --interactive \
    --infer-resources \
    --validate-resources \
    --resource-profile cluster \
    --target-env distributed \
    --report-md

After Conversion:

# Review generated DAGMan files
cat workflow.dag
cat process_data_*.sub

# Check conversion report for any issues
cat conversion_report.md

CWL → Nextflow¶

Before Conversion:

# Ensure resource requirements are specified
requirements:
  - class: ResourceRequirement
    coresMin: 4
    ramMin: 8192
    tmpdirMin: 4096

Enhanced Conversion:

# Convert with enhanced CWL processing
wf2wf convert -i workflow.cwl -o main.nf \
    --interactive \
    --infer-resources \
    --validate-resources \
    --resource-profile cloud \
    --target-env distributed

After Conversion:

// Review generated Nextflow process
process PROCESS_DATA {
    cpus 4
    memory 8.GB
    disk 4.GB
    
    input:
    path input_file
    
    output:
    path output_file
    
    script:
    """
    python process.py ${input_file} > ${output_file}
    """
}
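The mapping from the CWL requirement to the Nextflow directives shown here can be sketched as follows. This is an illustration, not wf2wf's converter: the function name is hypothetical, and it assumes ramMin and tmpdirMin are given in mebibytes, as the CWL specification defines them.

```python
def cwl_resources_to_nextflow(req):
    """Render Nextflow directives from a CWL ResourceRequirement mapping.

    CWL expresses ramMin and tmpdirMin in mebibytes; this sketch emits them
    as Nextflow MB values, or as GB when they divide evenly by 1024.
    """
    def size(mib):
        return f"{mib // 1024}.GB" if mib % 1024 == 0 else f"{mib}.MB"

    lines = []
    if "coresMin" in req:
        lines.append(f"cpus {req['coresMin']}")
    if "ramMin" in req:
        lines.append(f"memory {size(req['ramMin'])}")
    if "tmpdirMin" in req:
        lines.append(f"disk {size(req['tmpdirMin'])}")
    return lines


requirement = {"class": "ResourceRequirement",
               "coresMin": 4, "ramMin": 8192, "tmpdirMin": 4096}
print("\n".join(cwl_resources_to_nextflow(requirement)))
```

Comparing this expected output against the generated process is a quick way to confirm that no resource field was dropped.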

DAGMan → Snakemake¶

Before Conversion:

# Review submit file specifications
cat job.sub
# Look for: request_cpus, request_memory, transfer_input_files

Enhanced Conversion:

# Convert with enhanced DAGMan processing
wf2wf convert -i workflow.dag -o workflow.smk \
    --interactive \
    --infer-resources \
    --validate-resources \
    --resource-profile shared \
    --target-env distributed

After Conversion:

# Review generated Snakefile
# Ensure resource specifications are preserved
rule job:
    input: "input.txt"
    output: "output.txt"
    resources:
        mem_mb=4096,
        disk_mb=4096
    shell: "python job.py {input} > {output}"
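To check that resources survived the conversion, it can help to extract the requests from the original submit file and compare them with the generated rule. A minimal sketch with a hypothetical helper; it assumes unit-less values, for which HTCondor interprets request_memory in MiB and request_disk in KiB, and it maps request_cpus to Snakemake's threads directive:

```python
def parse_submit_resources(text):
    """Extract resource requests from HTCondor submit-file text.

    Assumes unit-less values: request_memory in MiB and request_disk in KiB,
    which are HTCondor's defaults. Values with suffixes such as "4GB" are
    not handled in this sketch.
    """
    resources = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        key = key.lower()
        if key == "request_cpus":
            resources["threads"] = int(value)
        elif key == "request_memory":
            resources["mem_mb"] = int(value)
        elif key == "request_disk":
            resources["disk_mb"] = int(value) // 1024  # KiB -> MiB
    return resources


submit_text = """
executable = job.sh
request_cpus = 2
request_memory = 4096
request_disk = 4194304
transfer_input_files = input.txt
queue
"""
print(parse_submit_resources(submit_text))
```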

Loss Detection and Reporting¶

Comprehensive Loss Analysis¶

# Generate detailed loss report
wf2wf convert -i workflow.smk -o workflow.dag --report-md

# The report includes:
# - Information preservation analysis
# - Potential loss identification
# - Conversion recommendations
# - Quality metrics

Loss Report Example¶

# Enhanced Conversion Report

## Information Loss Analysis

### Preserved Information (100%)
- ✅ Task definitions and dependencies
- ✅ Resource specifications (enhanced)
- ✅ Container/environment specifications
- ✅ Input/output file specifications
- ✅ Error handling configurations

### Potential Loss (Minimal)
- ⚠️ Snakemake wildcards → DAGMan parameter substitution
- ⚠️ Snakemake conda environments → DAGMan container specifications

### Quality Metrics
- **Compliance Score**: 95/100
- **Resource Coverage**: 100%
- **Environment Coverage**: 100%
- **Error Handling Coverage**: 100%

### Recommendations
- Review wildcard substitutions for correctness
- Verify container specifications match conda environments
- All resource specifications are properly converted

Enhanced Troubleshooting¶

Common Issues and Solutions¶

Resource Validation Failures

# Adjust resource specifications
wf2wf convert -i workflow.smk -o workflow.dag --resource-profile shared

# Or use interactive mode to adjust manually
wf2wf convert -i workflow.smk -o workflow.dag --interactive

Container Generation Issues

# Enable automatic container generation
wf2wf convert -i workflow.smk -o workflow.dag --auto-env

# Or specify containers manually
wf2wf convert -i workflow.smk -o workflow.dag --interactive

File Transfer Problems

# Review file transfer modes
wf2wf convert -i workflow.smk -o workflow.dag --interactive

# Optimize for distributed computing
wf2wf convert -i workflow.smk -o workflow.dag --target-env distributed

Getting Enhanced Help¶

# Get detailed information about your workflow
wf2wf info workflow.smk

# Validate workflow with enhanced checks
wf2wf validate workflow.smk

# Preview conversion with all enhancements
wf2wf convert -i workflow.smk -o workflow.dag --dry-run --verbose

# Check for potential issues
wf2wf convert -i workflow.smk -o workflow.dag --interactive

Resource Specification Guidelines¶

Automatic Resource Inference¶

With --infer-resources, wf2wf estimates resource requirements from the command each task runs:

# Enable automatic resource inference
wf2wf convert -i workflow.smk -o workflow.dag --infer-resources

# Inference examples:
# - "bwa mem" → 8GB memory, 4 CPU
# - "samtools sort" → 4GB memory, 2 CPU
# - "python script.py" → 1GB memory, 1 CPU
# - "Rscript analysis.R" → 2GB memory, 1 CPU
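Command-based inference of this kind amounts to a pattern lookup. The sketch below uses the four examples listed here as its rule table; the matching logic (first substring match wins) is an assumption for illustration, not wf2wf's actual algorithm.

```python
# Pattern -> (memory in MB, CPUs), taken from the inference examples above.
INFERENCE_RULES = [
    ("bwa mem", (8192, 4)),
    ("samtools sort", (4096, 2)),
    ("Rscript", (2048, 1)),
    ("python", (1024, 1)),
]


def infer_resources(command):
    """Return {'mem_mb', 'cpu'} for the first rule found in `command`, else None."""
    for pattern, (mem_mb, cpu) in INFERENCE_RULES:
        if pattern in command:
            return {"mem_mb": mem_mb, "cpu": cpu}
    return None


print(infer_resources("bwa mem ref.fa reads.fq > aln.sam"))
print(infer_resources("echo done"))
```

Inferred values are starting points; tasks with unusually large inputs still need explicit specifications.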

Resource Profiles¶

Use predefined resource profiles for different environments:

# Apply specific resource profile
wf2wf convert -i workflow.smk -o workflow.dag --resource-profile cluster

# Available profiles:
# - shared: Light resources (1 CPU, 512MB RAM, 1GB disk)
# - cluster: Standard cluster (1 CPU, 2GB RAM, 4GB disk)
# - cloud: Cloud-optimized (2 CPU, 4GB RAM, 8GB disk)
# - hpc: High-performance (4 CPU, 8GB RAM, 16GB disk)
# - gpu: GPU-enabled (4 CPU, 16GB RAM, 32GB disk)
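The profile values listed here can be written as a lookup table. The sketch below assumes that explicit per-task values take precedence over profile defaults; that precedence and the resolve_resources helper are assumptions for illustration, so confirm the actual behavior with a dry run.

```python
# Values copied from the profile list above; memory and disk are in MB.
PROFILES = {
    "shared": {"cpu": 1, "mem_mb": 512, "disk_mb": 1024},
    "cluster": {"cpu": 1, "mem_mb": 2048, "disk_mb": 4096},
    "cloud": {"cpu": 2, "mem_mb": 4096, "disk_mb": 8192},
    "hpc": {"cpu": 4, "mem_mb": 8192, "disk_mb": 16384},
    "gpu": {"cpu": 4, "mem_mb": 16384, "disk_mb": 32768},
}


def resolve_resources(task_resources, profile):
    """Merge a task's explicit resources over the named profile's defaults."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    explicit = {k: v for k, v in task_resources.items() if v is not None}
    return {**PROFILES[profile], **explicit}


print(resolve_resources({"mem_mb": 8192, "cpu": None}, "cluster"))
```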

Memory Requirements¶

Task Type	Recommended Memory	Notes
Light processing	1-2 GB	Text processing, simple scripts
Medium analysis	4-8 GB	Data analysis, moderate datasets
Heavy computation	16-32 GB	Machine learning, large datasets
Genomics	32-64 GB	Sequence alignment, variant calling

Disk Requirements¶

Task Type	Recommended Disk	Notes
Text processing	1-2 GB	Small input/output files
Data analysis	4-8 GB	Moderate datasets
Genomics	10-50 GB	Large sequence files
Machine learning	5-20 GB	Model files and datasets

CPU Requirements¶

Task Type	Recommended CPUs	Notes
Single-threaded	1	Simple scripts, basic processing
Multi-threaded	4-8	Data analysis, moderate parallelism
High-performance	16-32	Machine learning, genomics

Container and Environment Best Practices¶

Automatic Container Generation¶

# Enable automatic container generation
wf2wf convert -i workflow.smk -o workflow.dag --auto-env

# This will:
# - Analyze software dependencies
# - Generate appropriate container specifications
# - Create conda environment files
# - Optimize for target environment

Container Specifications¶

# Snakemake with container
rule process:
    container: "docker://python:3.9-slim"
    shell: "python process.py"

# CWL with Docker requirement
requirements:
  - class: DockerRequirement
    dockerPull: python:3.9-slim

Conda Environments¶

# Snakemake with conda
rule process:
    conda: "environment.yaml"
    shell: "python process.py"

# environment.yaml
name: analysis
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.9
  - pandas
  - numpy
  - biopython

Error Handling and Retry Policies¶

Retry Specifications¶

# Snakemake with retries
rule process:
    retries: 3
    shell: "python process.py"

# DAGMan with retry
RETRY process_job 3

Interactive Error Handling Configuration¶

# Enable automatic error handling setup
wf2wf convert -i workflow.smk -o workflow.dag --interactive

# The system will:
# - Detect tasks without retry specifications
# - Suggest appropriate retry policies
# - Configure error recovery mechanisms
# - Optimize for target environment

Error Strategies¶

Transient failures: 2-3 retries with exponential backoff
Resource failures: Retry with different resource specifications
Data corruption: Validate inputs before processing
Network issues: Retry with longer timeouts
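For transient failures, exponential backoff means doubling the wait after each failed attempt. The sketch below shows the idea inside a single Python process; it is not how Snakemake or DAGMan schedule retries, and the function and parameter names are hypothetical.

```python
import time


def run_with_retries(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying up to `retries` times with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2x, 4x, ...
    The last exception is re-raised once all retries are used up.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Usage: a task that fails twice before succeeding.
calls = {"count": 0}


def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


delays = []  # record delays instead of actually sleeping
result = run_with_retries(flaky_task, retries=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Passing the sleep function as a parameter keeps the example fast to run and makes the delay schedule easy to inspect.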

Error Handling Examples¶

Snakemake with enhanced error handling¶

rule process:
    input: "data.txt"
    output: "result.txt"
    retries: 3  # Auto-configured
    shell: "python process.py {input} > {output}"

# DAGMan with enhanced error handling
# Auto-generated submit file includes:
# retry = 3
# on_exit_remove = (ExitCode != 0)
# max_retries = 3
File Transfer Optimization¶

Transfer Mode Selection¶

from wf2wf.core import ParameterSpec

# Input files - always transfer
input_file = ParameterSpec(
    id="data/input.txt",
    type="File",
    transfer_mode="always"
)

# Reference data - shared storage
reference = ParameterSpec(
    id="/shared/genomes/hg38.fa",
    type="File",
    transfer_mode="shared"
)

# Temporary files - never transfer
temp_file = ParameterSpec(
    id="temp.log",
    type="File",
    transfer_mode="never"
)

Transfer Optimization Tips¶

Minimize transfers: Use shared mode for large reference files
Batch transfers: Group related files together
Compress data: Use compressed formats when possible
Local processing: Use never mode for temporary files
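These tips can be turned into a simple rule for picking a transfer mode from a file path. A minimal sketch: the /shared/ prefix and the temporary-file suffixes are site-specific assumptions, and choose_transfer_mode is a hypothetical helper, not part of wf2wf.

```python
from pathlib import PurePosixPath

# Hypothetical conventions: adjust the prefix and suffixes to your site.
SHARED_PREFIXES = ("/shared/",)
TEMP_SUFFIXES = (".log", ".tmp")


def choose_transfer_mode(path):
    """Pick 'shared', 'never', or 'always' for a file path."""
    if path.startswith(SHARED_PREFIXES):
        return "shared"   # already visible on every node
    if PurePosixPath(path).suffix in TEMP_SUFFIXES:
        return "never"    # scratch output, not needed elsewhere
    return "always"       # must travel with the job


for path in ("data/input.txt", "/shared/genomes/hg38.fa", "temp.log"):
    print(path, "->", choose_transfer_mode(path))
```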

Validation and Testing¶

Post-Conversion Validation¶

# Validate the converted workflow
wf2wf validate workflow.dag

# Check for any unresolved losses
cat workflow.loss.json

# Test with a small dataset
condor_submit_dag workflow.dag

Testing Checklist¶

[ ] All tasks have appropriate resource specifications
[ ] Container/environment specifications are correct
[ ] File transfer modes are appropriate
[ ] Retry policies are configured
[ ] Dependencies are correctly specified
[ ] Output files are properly defined

Troubleshooting Common Issues¶

Resource Issues¶

Problem: Jobs fail due to insufficient memory
Solution: Increase memory specifications or optimize memory usage

Problem: Jobs fail due to insufficient disk space
Solution: Increase disk specifications or clean up temporary files

Container Issues¶

Problem: Container not found
Solution: Ensure container image exists and is accessible

Problem: Container permissions issues
Solution: Check container user and file permissions

File Transfer Issues¶

Problem: Files not found on compute nodes
Solution: Check transfer modes and ensure files are in transfer lists

Problem: Unnecessary large file transfers
Solution: Mark reference data as shared mode

Advanced Configuration¶

Custom Resource Patterns¶

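# Custom resource detection patterns
# Assumption: ResourceSpec is importable from wf2wf.core, like ParameterSpec
from wf2wf.core import ResourceSpec

def custom_resource_detection(task):
    """Choose resources from keywords in the task ID."""
    if "alignment" in task.id:
        return ResourceSpec(mem_mb=16384, cpu=8)
    elif "qc" in task.id:
        return ResourceSpec(mem_mb=2048, cpu=2)
    else:
        return ResourceSpec(mem_mb=4096, cpu=4)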

Environment-Specific Configurations¶

# Different configurations for different environments
wf2wf convert -i workflow.smk -o workflow.dag \
    --default-memory 8GB \
    --default-disk 10GB \
    --default-cpus 4
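Defaults written as 8GB or 10GB correspond to the mem_mb and disk_mb values used elsewhere in this guide. A small conversion sketch, assuming 1 GB = 1024 MB (consistent with the 2GB cluster profile and the 2048MB default shown earlier); size_to_mb is a hypothetical helper:

```python
import re

_UNITS_IN_MB = {"KB": 1 / 1024, "MB": 1, "GB": 1024, "TB": 1024 * 1024}


def size_to_mb(text):
    """Convert a size such as '8GB' or '512 MB' to whole megabytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]B)\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = float(match.group(1)), match.group(2).upper()
    return int(number * _UNITS_IN_MB[unit])


print(size_to_mb("8GB"), size_to_mb("10GB"), size_to_mb("512 MB"))
```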
