Workflow Conversion Best Practices

This guide provides best practices for converting workflows between different execution environments using wf2wf.

Understanding Execution Environment Differences

Shared Filesystem vs Distributed Computing

Workflow engines make different assumptions about their execution environment:

| Aspect | Shared Filesystem | Distributed Computing |
|---|---|---|
| File Access | All files accessible via shared filesystem | Files must be explicitly transferred |
| Resources | Minimal specifications, system defaults | Explicit CPU, memory, disk allocation |
| Environment | System-wide software or conda | Container specifications required |
| Error Handling | Basic retry mechanisms | Sophisticated retry policies |
| Parallelization | Implicit (wildcards) | Explicit scatter/gather |

Pre-Conversion Analysis

1. Understand Your Source Workflow

Before converting, analyze your source workflow:

# Get detailed information about your workflow
wf2wf info workflow.smk

# Check for potential issues
wf2wf validate workflow.smk

# Preview conversion with dry-run mode
wf2wf convert -i workflow.smk -o workflow.dag --dry-run

2. Identify Target Environment Requirements

Understand what your target environment needs:

  • HTCondor/DAGMan: Explicit resources, containers, file transfers

  • Nextflow: Container specifications, resource limits

  • CWL: Resource requirements, software requirements

  • Snakemake: Conda environments, resource specifications
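As a rough illustration of the list above, the following sketch checks a task description against what each target expects. The requirement table and the field names (`cpu`, `mem_mb`, `container`, and so on) are assumptions made for this example; they are not wf2wf identifiers, and wf2wf's own analysis may work differently.

```python
# Hypothetical requirement table derived from the list above.
# Field names are illustrative only, not part of the wf2wf API.
TARGET_REQUIREMENTS = {
    "dagman": {"cpu", "mem_mb", "disk_mb", "container", "transfer_files"},
    "nextflow": {"container", "cpu", "mem_mb"},
    "cwl": {"cpu", "mem_mb", "software"},
    "snakemake": {"conda", "cpu", "mem_mb"},
}


def missing_fields(task: dict, target: str) -> list:
    """Return the fields the target expects but the task leaves unset, sorted."""
    required = TARGET_REQUIREMENTS[target]
    return sorted(field for field in required if task.get(field) is None)


# A task that only declares memory and disk, as in a minimal Snakefile rule.
task = {"id": "process_data", "mem_mb": 4096, "disk_mb": 4096}
for target in ("dagman", "snakemake"):
    print(target, "is missing:", missing_fields(task, target))
```

Running a check like this before conversion gives a quick list of what to add to the source workflow for a given target.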

Interactive Conversion Workflow

Step 1: Initial Conversion with Analysis

# Convert with interactive mode for guided assistance
wf2wf convert -i workflow.smk -o workflow.dag --interactive --verbose

# This enables:
# - Interactive resource specification
# - Guided environment configuration
# - Error handling setup
# - File transfer optimization

Step 2: Review Configuration Analysis

The conversion report will show:

## Configuration Analysis

### Resource Analysis
* **Memory**: 2 tasks without explicit memory requirements
* **CPU**: 1 task without CPU specifications
* **Disk**: 3 tasks without disk requirements

### Environment Analysis
* **Containers**: 3 tasks without container/conda specifications
* **Software**: 2 tasks with system dependencies

### Error Handling Analysis
* **Retry Policies**: 3 tasks without retry specifications
* **Error Recovery**: 1 task without error handling

### File Transfer Analysis
* **Transfer Modes**: 6 files with auto-detected transfer modes
* **Dependencies**: 2 files with missing dependency specifications

**Recommendations:**
* Add explicit resource requirements for all tasks
* Specify container images or conda environments for environment isolation
* Configure retry policies for fault tolerance
* Use `--infer-resources` to automatically detect resource requirements
* Use `--resource-profile cluster` for standard cluster specifications
* Enable `--auto-env` for automatic container generation
* Review file transfer modes for distributed execution

Step 3: Address Issues with Interactive Assistance

Use the interactive prompts to:

  • Add default resource specifications

  • Specify container environments

  • Configure retry policies

  • Review file transfer modes

# Interactive resource specification
Found 3 tasks without explicit resource requirements.
Distributed systems require explicit resource allocation.
Add default resource specifications? [Y/n]: Y

Applied default resources: CPU=1, Memory=2048MB, Disk=4096MB

# Interactive environment specification
Found 2 tasks without container specifications.
Distributed systems typically require explicit environment isolation.
Add container specifications or conda environments? [Y/n]: Y

Enable --auto-env to automatically build containers for these tasks.

# Interactive error handling
Found 3 tasks without retry specifications.
Distributed systems benefit from explicit error handling.
Add retry specifications for failed tasks? [Y/n]: Y

Applied default retry settings (2 retries)

Best Practices by Conversion Type

Snakemake → DAGMan

Before Conversion:

# Add resource specifications to your Snakefile
rule process_data:
    input: "data/{sample}.txt"
    output: "processed/{sample}.txt"
    resources:
        mem_mb=4096,
        disk_mb=4096
    shell: "python process.py {input} > {output}"

Enhanced Conversion:

# Convert with all enhanced features
wf2wf convert -i workflow.smk -o workflow.dag \
    --interactive \
    --infer-resources \
    --validate-resources \
    --resource-profile cluster \
    --target-env distributed \
    --report-md

After Conversion:

# Review generated DAGMan files
cat workflow.dag
cat process_data_*.sub

# Check conversion report for any issues
cat conversion_report.md

CWL → Nextflow

Before Conversion:

# Ensure resource requirements are specified
requirements:
  - class: ResourceRequirement
    coresMin: 4
    ramMin: 8192
    tmpdirMin: 4096

Enhanced Conversion:

# Convert with enhanced CWL processing
wf2wf convert -i workflow.cwl -o main.nf \
    --interactive \
    --infer-resources \
    --validate-resources \
    --resource-profile cloud \
    --target-env distributed

After Conversion:

// Review generated Nextflow process
process PROCESS_DATA {
    cpus 4
    memory 8.GB
    disk 4.GB
    
    input:
    path input_file
    
    output:
    path output_file
    
    script:
    """
    python process.py ${input_file} > ${output_file}
    """
}
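The CWL requirement and the Nextflow directives above differ in units: CWL's `ramMin` and `tmpdirMin` are given in mebibytes, while the Nextflow process uses values such as `8.GB`. The sketch below shows the arithmetic behind that mapping. It is an illustration only; the function names are hypothetical, and mapping `tmpdirMin` to `disk` follows the example above rather than any documented wf2wf rule.

```python
def format_mib(mib: int) -> str:
    """Render a mebibyte count as a Nextflow-style size literal."""
    if mib % 1024 == 0:
        return f"{mib // 1024}.GB"
    return f"{mib}.MB"


def cwl_to_nextflow_directives(req: dict) -> list:
    """Translate a CWL ResourceRequirement dict into Nextflow directive lines."""
    lines = []
    if "coresMin" in req:
        lines.append(f"cpus {req['coresMin']}")
    if "ramMin" in req:
        lines.append(f"memory {format_mib(req['ramMin'])}")
    if "tmpdirMin" in req:
        lines.append(f"disk {format_mib(req['tmpdirMin'])}")
    return lines


requirement = {"coresMin": 4, "ramMin": 8192, "tmpdirMin": 4096}
for line in cwl_to_nextflow_directives(requirement):
    print(line)
```

Comparing the generated process against this kind of hand calculation is a simple way to confirm that no resource value was dropped or rescaled during conversion.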

DAGMan → Snakemake

Before Conversion:

# Review submit file specifications
cat job.sub
# Look for: request_cpus, request_memory, transfer_input_files
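For workflows with many submit files, the manual review above can be scripted. The following sketch pulls the three settings out of submit-file text with plain string handling. It assumes simple `key = value` lines and does not handle macros, conditionals, or line continuations; the function name is hypothetical.

```python
def read_submit_keys(text, keys=("request_cpus", "request_memory", "transfer_input_files")):
    """Collect selected `key = value` settings from HTCondor submit-file text."""
    found = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and commands without a value (e.g. "queue").
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip().lower()  # submit commands are case-insensitive
        if key in keys:
            found[key] = value.strip()
    return found


sample = """\
executable = job.sh
request_cpus = 4
request_memory = 8GB
transfer_input_files = input.txt
queue
"""
print(read_submit_keys(sample))
```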

Enhanced Conversion:

# Convert with enhanced DAGMan processing
wf2wf convert -i workflow.dag -o workflow.smk \
    --interactive \
    --infer-resources \
    --validate-resources \
    --resource-profile shared \
    --target-env distributed

After Conversion:

# Review generated Snakefile
# Ensure resource specifications are preserved
rule job:
    input: "input.txt"
    output: "output.txt"
    resources:
        mem_mb=4096,
        disk_mb=4096
    shell: "python job.py {input} > {output}"

Loss Detection and Reporting

Comprehensive Loss Analysis

# Generate detailed loss report
wf2wf convert -i workflow.smk -o workflow.dag --report-md

# The report includes:
# - Information preservation analysis
# - Potential loss identification
# - Conversion recommendations
# - Quality metrics

Loss Report Example

# Enhanced Conversion Report

## Information Loss Analysis

### Preserved Information (100%)
- ✅ Task definitions and dependencies
- ✅ Resource specifications (enhanced)
- ✅ Container/environment specifications
- ✅ Input/output file specifications
- ✅ Error handling configurations

### Potential Loss (Minimal)
- ⚠️ Snakemake wildcards → DAGMan parameter substitution
- ⚠️ Snakemake conda environments → DAGMan container specifications

### Quality Metrics
- **Compliance Score**: 95/100
- **Resource Coverage**: 100%
- **Environment Coverage**: 100%
- **Error Handling Coverage**: 100%

### Recommendations
- Review wildcard substitutions for correctness
- Verify container specifications match conda environments
- All resource specifications are properly converted

Enhanced Troubleshooting

Common Issues and Solutions

  1. Resource Validation Failures

    # Adjust resource specifications
    wf2wf convert -i workflow.smk -o workflow.dag --resource-profile shared
    
    # Or use interactive mode to adjust manually
    wf2wf convert -i workflow.smk -o workflow.dag --interactive
    
  2. Container Generation Issues

    # Enable automatic container generation
    wf2wf convert -i workflow.smk -o workflow.dag --auto-env
    
    # Or specify containers manually
    wf2wf convert -i workflow.smk -o workflow.dag --interactive
    
  3. File Transfer Problems

    # Review file transfer modes
    wf2wf convert -i workflow.smk -o workflow.dag --interactive
    
    # Optimize for distributed computing
    wf2wf convert -i workflow.smk -o workflow.dag --target-env distributed
    

Getting Enhanced Help

# Get detailed information about your workflow
wf2wf info workflow.smk

# Validate workflow with enhanced checks
wf2wf validate workflow.smk

# Preview conversion with all enhancements
wf2wf convert -i workflow.smk -o workflow.dag --dry-run --verbose

# Check for potential issues
wf2wf convert -i workflow.smk -o workflow.dag --interactive

Resource Specification Guidelines

Automatic Resource Inference

wf2wf can infer resource requirements from the commands each task runs:

# Enable automatic resource inference
wf2wf convert -i workflow.smk -o workflow.dag --infer-resources

# Inference examples:
# - "bwa mem" → 8GB memory, 4 CPU
# - "samtools sort" → 4GB memory, 2 CPU
# - "python script.py" → 1GB memory, 1 CPU
# - "Rscript analysis.R" → 2GB memory, 1 CPU
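To make the idea concrete, here is a minimal sketch of command-based inference using the four examples listed above. The rule table, the substring matching, and the function name are assumptions for illustration; the actual heuristics behind `--infer-resources` are not described here and may differ. Sizes are converted with 1 GB = 1024 MB.

```python
# Illustrative rules taken from the examples above; checked in order.
INFERENCE_RULES = [
    ("bwa mem", {"mem_mb": 8192, "cpu": 4}),
    ("samtools sort", {"mem_mb": 4096, "cpu": 2}),
    ("Rscript", {"mem_mb": 2048, "cpu": 1}),
    ("python", {"mem_mb": 1024, "cpu": 1}),
]


def infer_resources(command: str):
    """Return resources for the first rule whose pattern occurs in the command."""
    for pattern, resources in INFERENCE_RULES:
        if pattern in command:
            return dict(resources)  # copy so callers cannot mutate the table
    return None  # no rule matched; the caller decides on a fallback


for cmd in ("bwa mem ref.fa reads.fq > out.sam", "python script.py", "echo done"):
    print(cmd, "->", infer_resources(cmd))
```

Inferred values are only a starting point; explicit specifications in the source workflow remain the more reliable option.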

Resource Profiles

Use predefined resource profiles for different environments:

# Apply specific resource profile
wf2wf convert -i workflow.smk -o workflow.dag --resource-profile cluster

# Available profiles:
# - shared: Light resources (1 CPU, 512MB RAM, 1GB disk)
# - cluster: Standard cluster (1 CPU, 2GB RAM, 4GB disk)
# - cloud: Cloud-optimized (2 CPU, 4GB RAM, 8GB disk)
# - hpc: High-performance (4 CPU, 8GB RAM, 16GB disk)
# - gpu: GPU-enabled (4 CPU, 16GB RAM, 32GB disk)
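The profiles can be thought of as a table of defaults. The sketch below encodes the values listed above and merges them with a task's own settings. Letting explicit task values take precedence over the profile is an assumption of this example, as are the field names; check the conversion report to see what wf2wf actually applied.

```python
# Profile values copied from the list above (sizes in MB, 1 GB = 1024 MB).
PROFILES = {
    "shared": {"cpu": 1, "mem_mb": 512, "disk_mb": 1024},
    "cluster": {"cpu": 1, "mem_mb": 2048, "disk_mb": 4096},
    "cloud": {"cpu": 2, "mem_mb": 4096, "disk_mb": 8192},
    "hpc": {"cpu": 4, "mem_mb": 8192, "disk_mb": 16384},
    "gpu": {"cpu": 4, "mem_mb": 16384, "disk_mb": 32768},
}


def with_profile(task_resources: dict, profile: str) -> dict:
    """Fill unset resources from the profile; explicit task values win."""
    merged = dict(PROFILES[profile])
    merged.update({k: v for k, v in task_resources.items() if v is not None})
    return merged


print(with_profile({"mem_mb": 6000}, "cloud"))
```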

Memory Requirements

| Task Type | Recommended Memory | Notes |
|---|---|---|
| Light processing | 1-2 GB | Text processing, simple scripts |
| Medium analysis | 4-8 GB | Data analysis, moderate datasets |
| Heavy computation | 16-32 GB | Machine learning, large datasets |
| Genomics | 32-64 GB | Sequence alignment, variant calling |

Disk Requirements

| Task Type | Recommended Disk | Notes |
|---|---|---|
| Text processing | 1-2 GB | Small input/output files |
| Data analysis | 4-8 GB | Moderate datasets |
| Genomics | 10-50 GB | Large sequence files |
| Machine learning | 5-20 GB | Model files and datasets |

CPU Requirements

| Task Type | Recommended CPUs | Notes |
|---|---|---|
| Single-threaded | 1 | Simple scripts, basic processing |
| Multi-threaded | 4-8 | Data analysis, moderate parallelism |
| High-performance | 16-32 | Machine learning, genomics |

Container and Environment Best Practices

Automatic Container Generation

# Enable automatic container generation
wf2wf convert -i workflow.smk -o workflow.dag --auto-env

# This will:
# - Analyze software dependencies
# - Generate appropriate container specifications
# - Create conda environment files
# - Optimize for target environment

Container Specifications

# Snakemake with container
rule process:
    container: "docker://python:3.9-slim"
    shell: "python process.py"

# CWL with Docker requirement
requirements:
  - class: DockerRequirement
    dockerPull: python:3.9-slim

Conda Environments

# Snakemake with conda
rule process:
    conda: "environment.yaml"
    shell: "python process.py"

# environment.yaml
name: analysis
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.9
  - pandas
  - numpy
  - biopython

Error Handling and Retry Policies

Retry Specifications

# Snakemake with retries
rule process:
    retries: 3
    shell: "python process.py"

# DAGMan with retry
RETRY process_job 3

Interactive Error Handling Configuration

# Enable automatic error handling setup
wf2wf convert -i workflow.smk -o workflow.dag --interactive

# The system will:
# - Detect tasks without retry specifications
# - Suggest appropriate retry policies
# - Configure error recovery mechanisms
# - Optimize for target environment

Error Strategies

  • Transient failures: 2-3 retries with exponential backoff

  • Resource failures: Retry with different resource specifications

  • Data corruption: Validate inputs before processing

  • Network issues: Retry with longer timeouts
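The first strategy, retrying with exponential backoff, can be sketched in plain Python as follows. This illustrates the policy itself, not how Snakemake, DAGMan, or wf2wf implement retries; the function and parameter names are hypothetical.

```python
import time


def run_with_retries(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure, retry up to `retries` times, doubling the delay."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay
            attempt += 1


# Demonstration: a step that fails twice before succeeding.
state = {"calls": 0}


def flaky_step():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


print(run_with_retries(flaky_step, retries=3, base_delay=0.01))
```

The `sleep` parameter is injectable so the delay schedule can be checked in tests without waiting.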

Error Handling Examples

# Snakemake with enhanced error handling
rule process:
    input: "data.txt"
    output: "result.txt"
    retries: 3  # Auto-configured
    shell: "python process.py {input} > {output}"

# DAGMan with enhanced error handling
# Auto-generated submit file includes:
# retry = 3
# on_exit_remove = (ExitCode != 0)
# max_retries = 3

File Transfer Optimization

Transfer Mode Selection

from wf2wf.core import ParameterSpec

# Input files - always transfer
input_file = ParameterSpec(
    id="data/input.txt",
    type="File",
    transfer_mode="always"
)

# Reference data - shared storage
reference = ParameterSpec(
    id="/shared/genomes/hg38.fa",
    type="File",
    transfer_mode="shared"
)

# Temporary files - never transfer
temp_file = ParameterSpec(
    id="temp.log",
    type="File",
    transfer_mode="never"
)

Transfer Optimization Tips

  1. Minimize transfers: Use shared mode for large reference files

  2. Batch transfers: Group related files together

  3. Compress data: Use compressed formats when possible

  4. Local processing: Use never mode for temporary files
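A simple path-based heuristic can apply the first and last tips automatically. The sketch below returns one of the three transfer modes shown earlier (`always`, `shared`, `never`). The shared-storage prefix and the temporary-file suffixes are assumptions chosen to match the examples in this guide; adjust them to your site, and note that this is not how wf2wf's own auto-detection is documented to work.

```python
def choose_transfer_mode(path, shared_prefixes=("/shared/",), temp_suffixes=(".log", ".tmp")):
    """Pick a transfer mode from the file path alone."""
    if path.startswith(tuple(shared_prefixes)):
        return "shared"  # already visible on every node, no transfer needed
    if path.endswith(tuple(temp_suffixes)):
        return "never"   # scratch output that stays on the compute node
    return "always"      # default: ship the file with the job


for p in ("data/input.txt", "/shared/genomes/hg38.fa", "temp.log"):
    print(p, "->", choose_transfer_mode(p))
```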

Validation and Testing

Post-Conversion Validation

# Validate the converted workflow
wf2wf validate workflow.dag

# Check for any unresolved losses
cat workflow.loss.json

# Test with a small dataset
condor_submit_dag workflow.dag

Testing Checklist

  • [ ] All tasks have appropriate resource specifications

  • [ ] Container/environment specifications are correct

  • [ ] File transfer modes are appropriate

  • [ ] Retry policies are configured

  • [ ] Dependencies are correctly specified

  • [ ] Output files are properly defined

Troubleshooting Common Issues

Resource Issues

Problem: Jobs fail due to insufficient memory
Solution: Increase memory specifications or optimize memory usage

Problem: Jobs fail due to insufficient disk space
Solution: Increase disk specifications or clean up temporary files

Container Issues

Problem: Container not found
Solution: Ensure container image exists and is accessible

Problem: Container permissions issues
Solution: Check container user and file permissions

File Transfer Issues

Problem: Files not found on compute nodes
Solution: Check transfer modes and ensure files are in transfer lists

Problem: Unnecessary large file transfers
Solution: Mark reference data as shared mode

Advanced Configuration

Custom Resource Patterns

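# Custom resource detection patterns
# Import path assumed to match ParameterSpec (wf2wf.core); adjust if it differs.
from wf2wf.core import ResourceSpec


def custom_resource_detection(task):
    """Choose resources from keywords in the task ID."""
    if "alignment" in task.id:
        return ResourceSpec(mem_mb=16384, cpu=8)
    elif "qc" in task.id:
        return ResourceSpec(mem_mb=2048, cpu=2)
    else:
        # Fallback for tasks that match no keyword
        return ResourceSpec(mem_mb=4096, cpu=4)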

Environment-Specific Configurations

# Different configurations for different environments
wf2wf convert -i workflow.smk -o workflow.dag \
    --default-memory 8GB \
    --default-disk 10GB \
    --default-cpus 4

Performance Optimization

Resource Optimization

  1. Profile your workflows: Measure actual resource usage

  2. Right-size resources: Match specifications to actual needs

  3. Use resource limits: Prevent runaway jobs

  4. Monitor usage: Track resource utilization over time
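Steps 1 and 2 amount to a small calculation: take the peak usage observed across runs, add headroom, and round up to a convenient request size. The sketch below shows one way to do this. The 20% headroom and 512 MB rounding step are illustrative defaults, not recommendations from wf2wf.

```python
import math


def right_size(peak_mb_samples, headroom=0.2, round_to_mb=512):
    """Suggest a memory request from measured peak usage (all values in MB)."""
    peak = max(peak_mb_samples)
    padded = peak * (1 + headroom)  # leave room for run-to-run variation
    return math.ceil(padded / round_to_mb) * round_to_mb


# Peak memory measured over three runs of the same task.
print(right_size([3100, 3400, 2900]))
```

The same function works for disk requests if the samples are peak scratch usage instead of memory.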

Transfer Optimization

  1. Use shared storage: Minimize data movement

  2. Compress data: Reduce transfer sizes

  3. Batch operations: Group related transfers

  4. Cache frequently used data: Store on shared storage