# Workflow Conversion Best Practices This guide provides best practices for converting workflows between different execution environments using wf2wf. ## Understanding Execution Environment Differences ### Shared Filesystem vs Distributed Computing Workflow engines make different assumptions about their execution environment: | Aspect | Shared Filesystem | Distributed Computing | |--------|------------------|---------------------| | **File Access** | All files accessible via shared filesystem | Files must be explicitly transferred | | **Resources** | Minimal specifications, system defaults | Explicit CPU, memory, disk allocation | | **Environment** | System-wide software or conda | Container specifications required | | **Error Handling** | Basic retry mechanisms | Sophisticated retry policies | | **Parallelization** | Implicit (wildcards) | Explicit scatter/gather | ## Pre-Conversion Analysis ### 1. Understand Your Source Workflow Before converting, analyze your source workflow: ```bash # Get detailed information about your workflow wf2wf info workflow.smk # Check for potential issues wf2wf validate workflow.smk # Preview conversion with dry-run mode wf2wf convert -i workflow.smk -o workflow.dag --dry-run ``` ### 2. Identify Target Environment Requirements Understand what your target environment needs: - **HTCondor/DAGMan**: Explicit resources, containers, file transfers - **Nextflow**: Container specifications, resource limits - **CWL**: Resource requirements, software requirements - **Snakemake**: Conda environments, resource specifications ## Interactive Conversion Workflow ### Step 1: Initial Conversion with Analysis ```bash # Convert with interactive mode for guided assistance wf2wf convert -i workflow.smk -o workflow.dag --interactive --verbose # This enables: # - Interactive resource specification # - Guided environment configuration # - Error handling setup # - File transfer optimization ``` ### Step 2: Review Configuration Analysis The conversion report will show: ```markdown ## Configuration Analysis ### Resource Analysis * **Memory**: 2 tasks without explicit memory requirements * **CPU**: 1 task without CPU specifications * **Disk**: 3 tasks without disk requirements ### Environment Analysis * **Containers**: 3 tasks without container/conda specifications * **Software**: 2 tasks with system dependencies ### Error Handling Analysis * **Retry Policies**: 3 tasks without retry specifications * **Error Recovery**: 1 task without error handling ### File Transfer Analysis * **Transfer Modes**: 6 files with auto-detected transfer modes * **Dependencies**: 2 files with missing dependency specifications **Recommendations:** * Add explicit resource requirements for all tasks * Specify container images or conda environments for environment isolation * Configure retry policies for fault tolerance * Use `--infer-resources` to automatically detect resource requirements * Use `--resource-profile cluster` for standard cluster specifications * Enable `--auto-env` for automatic container generation * Review file transfer modes for distributed execution ``` ### Step 3: Address Issues with Interactive Assistance Use the interactive prompts to: - Add default resource specifications - Specify container environments - Configure retry policies - Review file transfer modes ```bash # Interactive resource specification Found 3 tasks without explicit resource requirements. Distributed systems require explicit resource allocation. Add default resource specifications? [Y/n]: Y Applied default resources: CPU=1, Memory=2048MB, Disk=4096MB # Interactive environment specification Found 2 tasks without container specifications. Distributed systems typically require explicit environment isolation. Add container specifications or conda environments? [Y/n]: Y Enable --auto-env to automatically build containers for these tasks. # Interactive error handling Found 3 tasks without retry specifications. Distributed systems benefit from explicit error handling. Add retry specifications for failed tasks? [Y/n]: Y Applied default retry settings (2 retries) ``` ## Best Practices by Conversion Type ### Snakemake → DAGMan **Before Conversion:** ```python # Add resource specifications to your Snakefile rule process_data: input: "data/{sample}.txt" output: "processed/{sample}.txt" resources: mem_mb=4096, disk_mb=4096 shell: "python process.py {input} > {output}" ``` **Enhanced Conversion:** ```bash # Convert with all enhanced features wf2wf convert -i workflow.smk -o workflow.dag \ --interactive \ --infer-resources \ --validate-resources \ --resource-profile cluster \ --target-env distributed \ --report-md ``` **After Conversion:** ```bash # Review generated DAGMan files cat workflow.dag cat process_data_*.sub # Check conversion report for any issues cat conversion_report.md ``` ### CWL → Nextflow **Before Conversion:** ```yaml # Ensure resource requirements are specified requirements: - class: ResourceRequirement coresMin: 4 ramMin: 8192 tmpdirMin: 4096 ``` **Enhanced Conversion:** ```bash # Convert with enhanced CWL processing wf2wf convert -i workflow.cwl -o main.nf \ --interactive \ --infer-resources \ --validate-resources \ --resource-profile cloud \ --target-env distributed ``` **After Conversion:** ```groovy // Review generated Nextflow process process PROCESS_DATA { cpus 4 memory 8.GB disk 4.GB input: path input_file output: path output_file script: """ python process.py ${input_file} > ${output_file} """ } ``` ### DAGMan → Snakemake **Before Conversion:** ```bash # Review submit file specifications cat job.sub # Look for: request_cpus, request_memory, transfer_input_files ``` **Enhanced Conversion:** ```bash # Convert with enhanced DAGMan processing wf2wf convert -i workflow.dag -o workflow.smk \ --interactive \ --infer-resources \ --validate-resources \ --resource-profile shared \ --target-env distributed ``` **After Conversion:** ```python # Review generated Snakefile # Ensure resource specifications are preserved rule job: input: "input.txt" output: "output.txt" resources: mem_mb=4096, disk_mb=4096 shell: "python job.py {input} > {output}" ``` ## Loss Detection and Reporting ### Comprehensive Loss Analysis ```bash # Generate detailed loss report wf2wf convert -i workflow.smk -o workflow.dag --report-md # The report includes: # - Information preservation analysis # - Potential loss identification # - Conversion recommendations # - Quality metrics ``` ### Loss Report Example ```markdown # Enhanced Conversion Report ## Information Loss Analysis ### Preserved Information (100%) - ✅ Task definitions and dependencies - ✅ Resource specifications (enhanced) - ✅ Container/environment specifications - ✅ Input/output file specifications - ✅ Error handling configurations ### Potential Loss (Minimal) - ⚠️ Snakemake wildcards → DAGMan parameter substitution - ⚠️ Snakemake conda environments → DAGMan container specifications ### Quality Metrics - **Compliance Score**: 95/100 - **Resource Coverage**: 100% - **Environment Coverage**: 100% - **Error Handling Coverage**: 100% ### Recommendations - Review wildcard substitutions for correctness - Verify container specifications match conda environments - All resource specifications are properly converted ``` ## Enhanced Troubleshooting ### Common Issues and Solutions 1. **Resource Validation Failures** ```bash # Adjust resource specifications wf2wf convert -i workflow.smk -o workflow.dag --resource-profile shared # Or use interactive mode to adjust manually wf2wf convert -i workflow.smk -o workflow.dag --interactive ``` 2. **Container Generation Issues** ```bash # Enable automatic container generation wf2wf convert -i workflow.smk -o workflow.dag --auto-env # Or specify containers manually wf2wf convert -i workflow.smk -o workflow.dag --interactive ``` 3. **File Transfer Problems** ```bash # Review file transfer modes wf2wf convert -i workflow.smk -o workflow.dag --interactive # Optimize for distributed computing wf2wf convert -i workflow.smk -o workflow.dag --target-env distributed ``` ### Getting Enhanced Help ```bash # Get detailed information about your workflow wf2wf info workflow.smk # Validate workflow with enhanced checks wf2wf validate workflow.smk # Preview conversion with all enhancements wf2wf convert -i workflow.smk -o workflow.dag --dry-run --verbose # Check for potential issues wf2wf convert -i workflow.smk -o workflow.dag --interactive ``` ## Resource Specification Guidelines ### Automatic Resource Inference The enhanced system can automatically infer resource requirements: ```bash # Enable automatic resource inference wf2wf convert -i workflow.smk -o workflow.dag --infer-resources # Inference examples: # - "bwa mem" → 8GB memory, 4 CPU # - "samtools sort" → 4GB memory, 2 CPU # - "python script.py" → 1GB memory, 1 CPU # - "Rscript analysis.R" → 2GB memory, 1 CPU ``` ### Resource Profiles Use predefined resource profiles for different environments: ```bash # Apply specific resource profile wf2wf convert -i workflow.smk -o workflow.dag --resource-profile cluster # Available profiles: # - shared: Light resources (1 CPU, 512MB RAM, 1GB disk) # - cluster: Standard cluster (1 CPU, 2GB RAM, 4GB disk) # - cloud: Cloud-optimized (2 CPU, 4GB RAM, 8GB disk) # - hpc: High-performance (4 CPU, 8GB RAM, 16GB disk) # - gpu: GPU-enabled (4 CPU, 16GB RAM, 32GB disk) ``` ### Memory Requirements | Task Type | Recommended Memory | Notes | |-----------|-------------------|-------| | Light processing | 1-2 GB | Text processing, simple scripts | | Medium analysis | 4-8 GB | Data analysis, moderate datasets | | Heavy computation | 16-32 GB | Machine learning, large datasets | | Genomics | 32-64 GB | Sequence alignment, variant calling | ### Disk Requirements | Task Type | Recommended Disk | Notes | |-----------|------------------|-------| | Text processing | 1-2 GB | Small input/output files | | Data analysis | 4-8 GB | Moderate datasets | | Genomics | 10-50 GB | Large sequence files | | Machine learning | 5-20 GB | Model files and datasets | ### CPU Requirements | Task Type | Recommended CPUs | Notes | |-----------|-----------------|-------| | Single-threaded | 1 | Simple scripts, basic processing | | Multi-threaded | 4-8 | Data analysis, moderate parallelism | | High-performance | 16-32 | Machine learning, genomics | ## Container and Environment Best Practices ### Automatic Container Generation ```bash # Enable automatic container generation wf2wf convert -i workflow.smk -o workflow.dag --auto-env # This will: # - Analyze software dependencies # - Generate appropriate container specifications # - Create conda environment files # - Optimize for target environment ``` ### Container Specifications ```python # Snakemake with container rule process: container: "docker://python:3.9-slim" shell: "python process.py" # CWL with Docker requirement requirements: - class: DockerRequirement dockerPull: python:3.9-slim ``` ### Conda Environments ```python # Snakemake with conda rule process: conda: "environment.yaml" shell: "python process.py" ``` ```yaml # environment.yaml name: analysis channels: - conda-forge - bioconda dependencies: - python=3.9 - pandas - numpy - biopython ``` ## Error Handling and Retry Policies ### Retry Specifications ```python # Snakemake with retries rule process: retries: 3 shell: "python process.py" ``` ```bash # DAGMan with retry RETRY process_job 3 ``` ### Interactive Error Handling Configuration ```bash # Enable automatic error handling setup wf2wf convert -i workflow.smk -o workflow.dag --interactive # The system will: # - Detect tasks without retry specifications # - Suggest appropriate retry policies # - Configure error recovery mechanisms # - Optimize for target environment ``` ### Error Strategies - **Transient failures**: 2-3 retries with exponential backoff - **Resource failures**: Retry with different resource specifications - **Data corruption**: Validate inputs before processing - **Network issues**: Retry with longer timeouts ### Error Handling Examples # Snakemake with enhanced error handling rule process: input: "data.txt" output: "result.txt" retries: 3 # Auto-configured shell: "python process.py {input} > {output}" ``` ```bash # DAGMan with enhanced error handling # Auto-generated submit file includes: # retry = 3 # on_exit_remove = (ExitCode != 0) # max_retries = 3 ``` ## File Transfer Optimization ### Transfer Mode Selection ```python from wf2wf.core import ParameterSpec # Input files - always transfer input_file = ParameterSpec( id="data/input.txt", type="File", transfer_mode="always" ) # Reference data - shared storage reference = ParameterSpec( id="/shared/genomes/hg38.fa", type="File", transfer_mode="shared" ) # Temporary files - never transfer temp_file = ParameterSpec( id="temp.log", type="File", transfer_mode="never" ) ``` ### Transfer Optimization Tips 1. **Minimize transfers**: Use `shared` mode for large reference files 2. **Batch transfers**: Group related files together 3. **Compress data**: Use compressed formats when possible 4. **Local processing**: Use `never` mode for temporary files ## Validation and Testing ### Post-Conversion Validation ```bash # Validate the converted workflow wf2wf validate workflow.dag # Check for any unresolved losses cat workflow.loss.json # Test with a small dataset condor_submit_dag workflow.dag ``` ### Testing Checklist - [ ] All tasks have appropriate resource specifications - [ ] Container/environment specifications are correct - [ ] File transfer modes are appropriate - [ ] Retry policies are configured - [ ] Dependencies are correctly specified - [ ] Output files are properly defined ## Troubleshooting Common Issues ### Resource Issues **Problem**: Jobs fail due to insufficient memory **Solution**: Increase memory specifications or optimize memory usage **Problem**: Jobs fail due to insufficient disk space **Solution**: Increase disk specifications or clean up temporary files ### Container Issues **Problem**: Container not found **Solution**: Ensure container image exists and is accessible **Problem**: Container permissions issues **Solution**: Check container user and file permissions ### File Transfer Issues **Problem**: Files not found on compute nodes **Solution**: Check transfer modes and ensure files are in transfer lists **Problem**: Unnecessary large file transfers **Solution**: Mark reference data as `shared` mode ## Advanced Configuration ### Custom Resource Patterns ```python # Custom resource detection patterns def custom_resource_detection(task): if "alignment" in task.id: return ResourceSpec(mem_mb=16384, cpu=8) elif "qc" in task.id: return ResourceSpec(mem_mb=2048, cpu=2) else: return ResourceSpec(mem_mb=4096, cpu=4) ``` ### Environment-Specific Configurations ```bash # Different configurations for different environments wf2wf convert -i workflow.smk -o workflow.dag \ --default-memory 8GB \ --default-disk 10GB \ --default-cpus 4 ``` ## Performance Optimization ### Resource Optimization 1. **Profile your workflows**: Measure actual resource usage 2. **Right-size resources**: Match specifications to actual needs 3. **Use resource limits**: Prevent runaway jobs 4. **Monitor usage**: Track resource utilization over time ### Transfer Optimization 1. **Use shared storage**: Minimize data movement 2. **Compress data**: Reduce transfer sizes 3. **Batch operations**: Group related transfers 4. **Cache frequently used data**: Store on shared storage ## Related Documentation - [File Transfer Handling](file_transfers.md) - Detailed file transfer guide - [Installation Guide](installation.md) - Setting up wf2wf - [Engine Overview](engines/overview.md) - Understanding workflow engines - [Troubleshooting](troubleshooting.md) - Common issues and solutions