Workflow Conversion Best Practices¶
This guide provides best practices for converting workflows between different execution environments using wf2wf.
Understanding Execution Environment Differences¶
Pre-Conversion Analysis¶
1. Understand Your Source Workflow¶
Before converting, analyze your source workflow:
# Get detailed information about your workflow
wf2wf info workflow.smk
# Check for potential issues
wf2wf validate workflow.smk
# Preview conversion with dry-run mode
wf2wf convert -i workflow.smk -o workflow.dag --dry-run
2. Identify Target Environment Requirements¶
Understand what your target environment needs:
HTCondor/DAGMan: Explicit resources, containers, file transfers
Nextflow: Container specifications, resource limits
CWL: Resource requirements, software requirements
Snakemake: Conda environments, resource specifications
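The requirements above can be checked mechanically before converting. The following is a minimal sketch and not part of wf2wf: the `REQUIRED_FIELDS` table, the field names, and the `missing_fields` helper are hypothetical, chosen only to mirror the list above.

```python
from __future__ import annotations

# Hypothetical pre-conversion check; field names are illustrative, not wf2wf's API.
REQUIRED_FIELDS = {
    "dagman": ["resources", "container", "file_transfers"],
    "nextflow": ["container", "resources"],
    "cwl": ["resources", "software"],
    "snakemake": ["conda", "resources"],
}


def missing_fields(task: dict, target: str) -> list[str]:
    """Return the fields the target environment expects but the task lacks."""
    return [field for field in REQUIRED_FIELDS[target] if not task.get(field)]


task = {"id": "process_data", "resources": {"mem_mb": 4096}}
print(missing_fields(task, "dagman"))  # ['container', 'file_transfers']
```

A non-empty result suggests which specifications to add to the source workflow before running the conversion.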
Interactive Conversion Workflow¶
Step 1: Initial Conversion with Analysis¶
# Convert with interactive mode for guided assistance
wf2wf convert -i workflow.smk -o workflow.dag --interactive --verbose
# This enables:
# - Interactive resource specification
# - Guided environment configuration
# - Error handling setup
# - File transfer optimization
Step 2: Review Configuration Analysis¶
The conversion report will show:
## Configuration Analysis
### Resource Analysis
* **Memory**: 2 tasks without explicit memory requirements
* **CPU**: 1 task without CPU specifications
* **Disk**: 3 tasks without disk requirements
### Environment Analysis
* **Containers**: 3 tasks without container/conda specifications
* **Software**: 2 tasks with system dependencies
### Error Handling Analysis
* **Retry Policies**: 3 tasks without retry specifications
* **Error Recovery**: 1 task without error handling
### File Transfer Analysis
* **Transfer Modes**: 6 files with auto-detected transfer modes
* **Dependencies**: 2 files with missing dependency specifications
**Recommendations:**
* Add explicit resource requirements for all tasks
* Specify container images or conda environments for environment isolation
* Configure retry policies for fault tolerance
* Use `--infer-resources` to automatically detect resource requirements
* Use `--resource-profile cluster` for standard cluster specifications
* Enable `--auto-env` for automatic container generation
* Review file transfer modes for distributed execution
Step 3: Address Issues with Interactive Assistance¶
Use the interactive prompts to:
Add default resource specifications
Specify container environments
Configure retry policies
Review file transfer modes
# Interactive resource specification
Found 3 tasks without explicit resource requirements.
Distributed systems require explicit resource allocation.
Add default resource specifications? [Y/n]: Y
Applied default resources: CPU=1, Memory=2048MB, Disk=4096MB
# Interactive environment specification
Found 2 tasks without container specifications.
Distributed systems typically require explicit environment isolation.
Add container specifications or conda environments? [Y/n]: Y
Enable --auto-env to automatically build containers for these tasks.
# Interactive error handling
Found 3 tasks without retry specifications.
Distributed systems benefit from explicit error handling.
Add retry specifications for failed tasks? [Y/n]: Y
Applied default retry settings (2 retries)
Best Practices by Conversion Type¶
Snakemake → DAGMan¶
Before Conversion:
# Add resource specifications to your Snakefile
rule process_data:
    input: "data/{sample}.txt"
    output: "processed/{sample}.txt"
    resources:
        mem_mb=4096,
        disk_mb=4096
    shell: "python process.py {input} > {output}"
Enhanced Conversion:
# Convert with all enhanced features
wf2wf convert -i workflow.smk -o workflow.dag \
--interactive \
--infer-resources \
--validate-resources \
--resource-profile cluster \
--target-env distributed \
--report-md
After Conversion:
# Review generated DAGMan files
cat workflow.dag
cat process_data_*.sub
# Check conversion report for any issues
cat conversion_report.md
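When reviewing a generated DAG file by hand becomes tedious, a short script can list the jobs and their dependencies. This is a minimal sketch that relies only on the standard DAGMan `JOB` and `PARENT ... CHILD` statements; the `parse_dag` helper and the sample job names are hypothetical, and the exact contents wf2wf writes may differ.

```python
from __future__ import annotations


def parse_dag(text: str) -> tuple[dict[str, str], list[tuple[str, str]]]:
    """Return ({job: submit_file}, [(parent, child), ...]) from DAG file text."""
    jobs: dict[str, str] = {}
    edges: list[tuple[str, str]] = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = tokens[0].upper()
        if keyword == "JOB" and len(tokens) >= 3:
            jobs[tokens[1]] = tokens[2]
        elif keyword == "PARENT":
            upper = [t.upper() for t in tokens]
            if "CHILD" in upper:
                split = upper.index("CHILD")
                parents, children = tokens[1:split], tokens[split + 1:]
                # Every parent must finish before every child starts.
                edges.extend((p, c) for p in parents for c in children)
    return jobs, edges


# Illustrative DAG text; real output will contain your own job names.
sample = """\
JOB prep prep.sub
JOB process_data_A process_data_A.sub
PARENT prep CHILD process_data_A
"""
jobs, edges = parse_dag(sample)
print(jobs)
print(edges)
```

Comparing the job list against the rules in the source Snakefile is a quick way to spot tasks that were dropped or duplicated.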
CWL → Nextflow¶
Before Conversion:
# Ensure resource requirements are specified
requirements:
  - class: ResourceRequirement
    coresMin: 4
    ramMin: 8192
    tmpdirMin: 4096
Enhanced Conversion:
# Convert with enhanced CWL processing
wf2wf convert -i workflow.cwl -o main.nf \
--interactive \
--infer-resources \
--validate-resources \
--resource-profile cloud \
--target-env distributed
After Conversion:
// Review generated Nextflow process
process PROCESS_DATA {
    cpus 4
    memory 8.GB
    disk 4.GB
    input:
    path input_file
    output:
    path output_file
    script:
    """
    python process.py ${input_file} > ${output_file}
    """
}
DAGMan → Snakemake¶
Before Conversion:
# Review submit file specifications
cat job.sub
# Look for: request_cpus, request_memory, transfer_input_files
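The same review can be scripted when there are many submit files. This is a minimal sketch that assumes plain `key = value` lines, as HTCondor submit files use; the `read_submit_settings` helper and the sample file contents are hypothetical.

```python
from __future__ import annotations

# Settings worth checking before conversion (from the comment above).
WANTED = ("request_cpus", "request_memory", "transfer_input_files")


def read_submit_settings(text: str) -> dict[str, str]:
    """Extract the WANTED `key = value` settings from submit file text."""
    settings: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and statements such as `queue`
        key, value = line.split("=", 1)
        key = key.strip().lower()  # compare command names case-insensitively
        if key in WANTED:
            settings[key] = value.strip()
    return settings


# Illustrative submit file; replace with the contents of job.sub.
sample = """\
executable = job.sh
request_cpus = 4
request_memory = 8GB
transfer_input_files = input.txt, ref.fa
queue
"""
print(read_submit_settings(sample))
```

Any of the three keys missing from the result is a candidate for `--infer-resources` or a manual addition before converting.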
Enhanced Conversion:
# Convert with enhanced DAGMan processing
wf2wf convert -i workflow.dag -o workflow.smk \
--interactive \
--infer-resources \
--validate-resources \
--resource-profile shared \
--target-env distributed
After Conversion:
# Review generated Snakefile
# Ensure resource specifications are preserved
rule job:
    input: "input.txt"
    output: "output.txt"
    resources:
        mem_mb=4096,
        disk_mb=4096
    shell: "python job.py {input} > {output}"
Loss Detection and Reporting¶
Comprehensive Loss Analysis¶
# Generate detailed loss report
wf2wf convert -i workflow.smk -o workflow.dag --report-md
# The report includes:
# - Information preservation analysis
# - Potential loss identification
# - Conversion recommendations
# - Quality metrics
Loss Report Example¶
# Enhanced Conversion Report
## Information Loss Analysis
### Preserved Information (100%)
- ✅ Task definitions and dependencies
- ✅ Resource specifications (enhanced)
- ✅ Container/environment specifications
- ✅ Input/output file specifications
- ✅ Error handling configurations
### Potential Loss (Minimal)
- ⚠️ Snakemake wildcards → DAGMan parameter substitution
- ⚠️ Snakemake conda environments → DAGMan container specifications
### Quality Metrics
- **Compliance Score**: 95/100
- **Resource Coverage**: 100%
- **Environment Coverage**: 100%
- **Error Handling Coverage**: 100%
### Recommendations
- Review wildcard substitutions for correctness
- Verify container specifications match conda environments
- All resource specifications are properly converted
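To review wildcard substitutions, it can help to compute the file names you expect and compare them with those in the generated submit files. This is a minimal sketch, assuming simple `{name}` wildcards without constraints; `expand_pattern` is a hypothetical helper, not a wf2wf or Snakemake function.

```python
from __future__ import annotations

from itertools import product


def expand_pattern(pattern: str, **wildcards: list[str]) -> list[str]:
    """Expand a simple `{name}` pattern over every combination of values."""
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(wildcards[name] for name in names))
    ]


expected = expand_pattern("processed/{sample}.txt", sample=["A", "B"])
print(expected)  # ['processed/A.txt', 'processed/B.txt']
```

Each expected path should appear as an output of exactly one generated job.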
Enhanced Troubleshooting¶
Common Issues and Solutions¶
Resource Validation Failures
# Adjust resource specifications
wf2wf convert -i workflow.smk -o workflow.dag --resource-profile shared
# Or use interactive mode to adjust manually
wf2wf convert -i workflow.smk -o workflow.dag --interactive
Container Generation Issues
# Enable automatic container generation
wf2wf convert -i workflow.smk -o workflow.dag --auto-env
# Or specify containers manually
wf2wf convert -i workflow.smk -o workflow.dag --interactive
File Transfer Problems
# Review file transfer modes
wf2wf convert -i workflow.smk -o workflow.dag --interactive
# Optimize for distributed computing
wf2wf convert -i workflow.smk -o workflow.dag --target-env distributed
Getting Enhanced Help¶
# Get detailed information about your workflow
wf2wf info workflow.smk
# Validate workflow with enhanced checks
wf2wf validate workflow.smk
# Preview conversion with all enhancements
wf2wf convert -i workflow.smk -o workflow.dag --dry-run --verbose
# Check for potential issues
wf2wf convert -i workflow.smk -o workflow.dag --interactive
Resource Specification Guidelines¶
Automatic Resource Inference¶
The enhanced system can automatically infer resource requirements:
# Enable automatic resource inference
wf2wf convert -i workflow.smk -o workflow.dag --infer-resources
# Inference examples:
# - "bwa mem" → 8GB memory, 4 CPU
# - "samtools sort" → 4GB memory, 2 CPU
# - "python script.py" → 1GB memory, 1 CPU
# - "Rscript analysis.R" → 2GB memory, 1 CPU
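The idea behind command-based inference can be illustrated with a small lookup table. This is a sketch only and not wf2wf's actual implementation: the rule list reuses the four examples above (taking 1 GB as 1024 MB), and `infer_resources` with its first-match-wins behavior is an assumption.

```python
from __future__ import annotations

# Patterns and values copied from the inference examples above; first match wins.
INFERENCE_RULES = [
    ("bwa mem", {"mem_mb": 8192, "cpu": 4}),
    ("samtools sort", {"mem_mb": 4096, "cpu": 2}),
    ("Rscript", {"mem_mb": 2048, "cpu": 1}),
    ("python", {"mem_mb": 1024, "cpu": 1}),
]


def infer_resources(command: str) -> dict[str, int] | None:
    """Return resources for the first matching pattern, or None if none match."""
    for pattern, resources in INFERENCE_RULES:
        if pattern in command:
            return dict(resources)  # copy so callers cannot mutate the table
    return None


print(infer_resources("bwa mem ref.fa reads.fq > out.sam"))
print(infer_resources("echo hello"))
```

Inferred values are starting points; measured usage from real runs should replace them where available.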
Resource Profiles¶
Use predefined resource profiles for different environments:
# Apply specific resource profile
wf2wf convert -i workflow.smk -o workflow.dag --resource-profile cluster
# Available profiles:
# - shared: Light resources (1 CPU, 512MB RAM, 1GB disk)
# - cluster: Standard cluster (1 CPU, 2GB RAM, 4GB disk)
# - cloud: Cloud-optimized (2 CPU, 4GB RAM, 8GB disk)
# - hpc: High-performance (4 CPU, 8GB RAM, 16GB disk)
# - gpu: GPU-enabled (4 CPU, 16GB RAM, 32GB disk)
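The profile values above can be written out as data to see how they combine with per-task settings. This sketch assumes that a profile only fills in missing values and that explicit task resources take precedence; whether wf2wf applies profiles this way is not confirmed here, and `apply_profile` is a hypothetical helper.

```python
from __future__ import annotations

# Values taken from the profile list above, expressed in MB (1 GB = 1024 MB).
PROFILES = {
    "shared": {"cpu": 1, "mem_mb": 512, "disk_mb": 1024},
    "cluster": {"cpu": 1, "mem_mb": 2048, "disk_mb": 4096},
    "cloud": {"cpu": 2, "mem_mb": 4096, "disk_mb": 8192},
    "hpc": {"cpu": 4, "mem_mb": 8192, "disk_mb": 16384},
    "gpu": {"cpu": 4, "mem_mb": 16384, "disk_mb": 32768},
}


def apply_profile(resources: dict, profile: str) -> dict:
    """Fill in missing or None fields from the profile; explicit values win."""
    merged = dict(PROFILES[profile])
    merged.update({k: v for k, v in resources.items() if v is not None})
    return merged


print(apply_profile({"mem_mb": 4096}, "cluster"))
```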
Memory Requirements¶
| Task Type | Recommended Memory | Notes |
|---|---|---|
| Light processing | 1-2 GB | Text processing, simple scripts |
| Medium analysis | 4-8 GB | Data analysis, moderate datasets |
| Heavy computation | 16-32 GB | Machine learning, large datasets |
| Genomics | 32-64 GB | Sequence alignment, variant calling |
Disk Requirements¶
| Task Type | Recommended Disk | Notes |
|---|---|---|
| Text processing | 1-2 GB | Small input/output files |
| Data analysis | 4-8 GB | Moderate datasets |
| Genomics | 10-50 GB | Large sequence files |
| Machine learning | 5-20 GB | Model files and datasets |
CPU Requirements¶
| Task Type | Recommended CPUs | Notes |
|---|---|---|
| Single-threaded | 1 | Simple scripts, basic processing |
| Multi-threaded | 4-8 | Data analysis, moderate parallelism |
| High-performance | 16-32 | Machine learning, genomics |
Container and Environment Best Practices¶
Automatic Container Generation¶
# Enable automatic container generation
wf2wf convert -i workflow.smk -o workflow.dag --auto-env
# This will:
# - Analyze software dependencies
# - Generate appropriate container specifications
# - Create conda environment files
# - Optimize for target environment
Container Specifications¶
# Snakemake with container
rule process:
    container: "docker://python:3.9-slim"
    shell: "python process.py"
# CWL with Docker requirement
requirements:
  - class: DockerRequirement
    dockerPull: python:3.9-slim
Conda Environments¶
# Snakemake with conda
rule process:
    conda: "environment.yaml"
    shell: "python process.py"
# environment.yaml
name: analysis
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.9
  - pandas
  - numpy
  - biopython
Error Handling and Retry Policies¶
Retry Specifications¶
# Snakemake with retries
rule process:
    retries: 3
    shell: "python process.py"
# DAGMan with retry
RETRY process_job 3
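If a generated DAG file lacks retry lines for some jobs, they can be appended with a small script. This is a minimal sketch built on the DAGMan `JOB` and `RETRY` statements shown above; `add_missing_retries` is a hypothetical helper, and the default of 2 retries is an assumption that matches the interactive default mentioned earlier.

```python
def add_missing_retries(dag_text: str, retries: int = 2) -> str:
    """Append `RETRY <job> <retries>` for every JOB that has no RETRY line."""
    lines = dag_text.splitlines()
    jobs, has_retry = [], set()
    for line in lines:
        tokens = line.split()
        if len(tokens) >= 2 and tokens[0].upper() == "JOB":
            jobs.append(tokens[1])
        elif len(tokens) >= 2 and tokens[0].upper() == "RETRY":
            has_retry.add(tokens[1])
    # Existing RETRY lines are left untouched.
    lines += [f"RETRY {job} {retries}" for job in jobs if job not in has_retry]
    return "\n".join(lines) + "\n"


print(add_missing_retries("JOB a a.sub\nJOB b b.sub\nRETRY a 3\n"), end="")
```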
Interactive Error Handling Configuration¶
# Enable automatic error handling setup
wf2wf convert -i workflow.smk -o workflow.dag --interactive
# The system will:
# - Detect tasks without retry specifications
# - Suggest appropriate retry policies
# - Configure error recovery mechanisms
# - Optimize for target environment
Error Strategies¶
Transient failures: 2-3 retries with exponential backoff
Resource failures: Retry with different resource specifications
Data corruption: Validate inputs before processing
Network issues: Retry with longer timeouts
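The exponential backoff strategy for transient failures can be sketched in plain Python. This is an illustration of the technique only, not how wf2wf or any workflow engine implements retries; `run_with_retries` and its parameters are hypothetical.

```python
import time


def run_with_retries(task, retries=3, base_delay=1.0, sleep=time.sleep):
    """Run task(); after each failure wait base_delay * 2**attempt, then retry.

    The last exception is re-raised once all retries have been used up.
    """
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Usage: a task that fails twice, then succeeds. Delays are recorded, not slept.
attempts = {"count": 0}
waits = []


def flaky_task():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "done"


print(run_with_retries(flaky_task, sleep=waits.append), waits)  # done [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the function testable without real delays.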
Error Handling Examples¶
Snakemake with enhanced error handling¶
rule process:
    input: "data.txt"
    output: "result.txt"
    retries: 3  # Auto-configured
    shell: "python process.py {input} > {output}"
```bash
# DAGMan with enhanced error handling
# Auto-generated submit file includes:
# retry = 3
# on_exit_remove = (ExitCode != 0)
# max_retries = 3
```
File Transfer Optimization¶
Transfer Mode Selection¶
from wf2wf.core import ParameterSpec
# Input files - always transfer
input_file = ParameterSpec(
    id="data/input.txt",
    type="File",
    transfer_mode="always"
)
# Reference data - shared storage
reference = ParameterSpec(
    id="/shared/genomes/hg38.fa",
    type="File",
    transfer_mode="shared"
)
# Temporary files - never transfer
temp_file = ParameterSpec(
    id="temp.log",
    type="File",
    transfer_mode="never"
)
Transfer Optimization Tips¶
Minimize transfers: Use `shared` mode for large reference files
Batch transfers: Group related files together
Compress data: Use compressed formats when possible
Local processing: Use `never` mode for temporary files
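The tips above amount to a simple classification rule per file. This is a minimal sketch with made-up heuristics: the `/shared/` prefix, the temporary-file suffixes, and the `choose_transfer_mode` helper are assumptions for illustration, while the mode names `always`, `shared`, and `never` come from the examples above.

```python
from __future__ import annotations

SHARED_PREFIXES = ("/shared/",)   # assumed shared-filesystem mount points
TEMP_SUFFIXES = (".log", ".tmp")  # assumed markers of scratch files


def choose_transfer_mode(path: str) -> str:
    """Pick a transfer mode for a file path using simple path heuristics."""
    if path.startswith(SHARED_PREFIXES):
        return "shared"  # already visible on compute nodes
    if path.endswith(TEMP_SUFFIXES):
        return "never"   # scratch output, no need to move it
    return "always"      # default: ship the file with the job


for path in ("data/input.txt", "/shared/genomes/hg38.fa", "temp.log"):
    print(path, "->", choose_transfer_mode(path))
```

Adjust the prefixes and suffixes to match your site's storage layout before relying on such a rule.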
Validation and Testing¶
Post-Conversion Validation¶
# Validate the converted workflow
wf2wf validate workflow.dag
# Check for any unresolved losses
cat workflow.loss.json
# Test with a small dataset
condor_submit_dag workflow.dag
Testing Checklist¶
[ ] All tasks have appropriate resource specifications
[ ] Container/environment specifications are correct
[ ] File transfer modes are appropriate
[ ] Retry policies are configured
[ ] Dependencies are correctly specified
[ ] Output files are properly defined
Troubleshooting Common Issues¶
Resource Issues¶
Problem: Jobs fail due to insufficient memory
Solution: Increase memory specifications or optimize memory usage
Problem: Jobs fail due to insufficient disk space
Solution: Increase disk specifications or clean up temporary files
Container Issues¶
Problem: Container not found
Solution: Ensure container image exists and is accessible
Problem: Container permissions issues
Solution: Check container user and file permissions
File Transfer Issues¶
Problem: Files not found on compute nodes
Solution: Check transfer modes and ensure files are in transfer lists
Problem: Unnecessary large file transfers
Solution: Mark reference data as shared mode
Advanced Configuration¶
Custom Resource Patterns¶
# Custom resource detection patterns
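# Assumes ResourceSpec is importable from wf2wf.core, like ParameterSpec
from wf2wf.core import ResourceSpec

def custom_resource_detection(task):
    """Pick resources based on keywords in the task ID."""
    if "alignment" in task.id:
        return ResourceSpec(mem_mb=16384, cpu=8)
    elif "qc" in task.id:
        return ResourceSpec(mem_mb=2048, cpu=2)
    else:
        return ResourceSpec(mem_mb=4096, cpu=4)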
Environment-Specific Configurations¶
# Different configurations for different environments
wf2wf convert -i workflow.smk -o workflow.dag \
--default-memory 8GB \
--default-disk 10GB \
--default-cpus 4
Performance Optimization¶
Resource Optimization¶
Profile your workflows: Measure actual resource usage
Right-size resources: Match specifications to actual needs
Use resource limits: Prevent runaway jobs
Monitor usage: Track resource utilization over time
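Right-sizing can be reduced to a small calculation once peak usage has been measured. This is a minimal sketch under stated assumptions: the 20% headroom, the 512 MB rounding step, and the `recommend_mem_mb` helper are illustrative choices, not wf2wf defaults.

```python
import math


def recommend_mem_mb(peak_mb_samples, headroom=0.2, step_mb=512):
    """Largest observed peak plus headroom, rounded up to a multiple of step_mb."""
    if not peak_mb_samples:
        raise ValueError("need at least one measurement")
    target = max(peak_mb_samples) * (1 + headroom)
    return math.ceil(target / step_mb) * step_mb


# Peak memory (MB) observed across three runs of the same task.
print(recommend_mem_mb([3100, 3400, 2900]))  # 4096
```

The same approach applies to disk: measure, add headroom, and round up to a value the scheduler can match easily.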
Transfer Optimization¶
Use shared storage: Minimize data movement
Compress data: Reduce transfer sizes
Batch operations: Group related transfers
Cache frequently used data: Store on shared storage