# File Transfer Handling for Distributed Computing ## Overview One of the critical challenges in workflow conversion is handling the fundamental difference between **shared filesystem workflows** (like Snakemake, CWL) and **distributed computing workflows** (like HTCondor/DAGMan). This document explains how wf2wf addresses this challenge. ## The Problem Different workflow systems make different assumptions about file accessibility: ### Shared Filesystem Workflows (Snakemake, CWL, Nextflow) - **Assumption**: All compute nodes share the same filesystem - **File Handling**: Files can be left in place between tasks - **Intermediate Files**: Can be accessed directly by path - **Reference Data**: Assumed to be accessible from all nodes ### Distributed Computing Workflows (HTCondor/DAGMan) - **Assumption**: Compute nodes may not share filesystems - **File Handling**: Files must be explicitly transferred to/from compute nodes - **Intermediate Files**: Must be transferred between dependent tasks - **Reference Data**: May need to be on shared storage or transferred ## wf2wf's Solution: Transfer Modes wf2wf introduces **transfer modes** in the intermediate representation to capture file transfer requirements: ### Transfer Modes | Mode | Description | Use Case | |------|-------------|----------| | `auto` | Automatically determine transfer need (default) | Most regular files | | `always` | Always transfer, regardless of environment | Critical files that must be local | | `never` | Never transfer, local only | Temporary/log files | | `shared` | On shared storage, accessible to all nodes | Reference genomes, large databases | ## Conversion Behavior ### Snakemake → DAGMan When converting from Snakemake to DAGMan, wf2wf: 1. **Analyzes file paths** to detect likely shared storage locations 2. **Applies heuristics** to determine appropriate transfer modes: - `/shared/`, `/nfs/`, `/data/` → `shared` - `.tmp`, `/tmp/`, `.log` → `never` - Reference file extensions (`.fa`, `.bam`, `.gtf`) → `shared` - Regular files → `auto` (will be transferred) ### DAGMan → Snakemake When converting from DAGMan to Snakemake: 1. **Preserves transfer specifications** from HTCondor submit files 2. **Maps transfer directives** to appropriate transfer modes 3. **Removes transfer specifications** in Snakemake output (not needed) ## Examples ### Basic Usage ```python from wf2wf.core import ParameterSpec # Regular file - will be transferred (auto mode) input_file = "data.txt" # Reference on shared storage - no transfer needed reference = ParameterSpec( id="/shared/genomes/hg38.fa", type="File", transfer_mode="shared" ) # Critical config file - always transfer config = ParameterSpec( id="analysis.conf", type="File", transfer_mode="always" ) # Temporary file - local only temp = ParameterSpec( id="temp.log", type="File", transfer_mode="never" ) ``` ### DAGMan Output When exported to DAGMan, only `auto` and `always` files appear in transfer lists: ```bash # In generated .sub file: transfer_input_files = data.txt,analysis.conf # /shared/genomes/hg38.fa and temp.log are excluded ``` ## Automatic Detection ### Snakemake Import wf2wf automatically detects transfer modes based on file path patterns: **Shared Storage Patterns:** - `/nfs/`, `/shared/`, `/data/`, `/storage/` - `/lustre/`, `/gpfs/`, `/beegfs/` - Cloud URLs: `gs://`, `s3://`, `https://` **Local/Temporary Patterns:** - `/tmp/`, `.tmp`, `temp_` - `.log`, `.err`, `.out` - `/dev/`, `/proc/`, `/sys/` **Reference Data Patterns:** - `.fa`, `.fasta`, `.genome`, `.gtf`, `.gff` - `.bam`, `.sam`, `.bed` - Directories: `reference/`, `genome/`, `annotation/` ### Example Automatic Detection ```python # Input: /shared/data/genome.fa # → Detected as transfer_mode="shared" # Input: temp_analysis.tmp # → Detected as transfer_mode="never" # Input: sample_data.txt # → Detected as transfer_mode="auto" ``` ## File Transfer Best Practices ### For Workflow Authors 1. **Be Explicit**: Use `ParameterSpec` with explicit `transfer_mode` for clarity 2. **Consider Environment**: Think about where files will be stored 3. **Reference Data**: Mark large reference files as `shared` 4. **Temporary Files**: Mark temp/log files as `never` ### For System Administrators 1. **Shared Storage**: Ensure shared paths are consistently mounted 2. **Reference Data**: Place large datasets on shared storage 3. **Transfer Optimization**: Use appropriate transfer modes to minimize data movement ## Troubleshooting ### Common Issues **Problem**: Files not found on compute nodes - **Solution**: Check transfer modes, ensure `auto` or `always` for required files **Problem**: Unnecessary large file transfers - **Solution**: Mark reference data as `shared` mode **Problem**: Log files filling transfer directories - **Solution**: Mark log files as `never` mode ### Debugging Transfer Specifications Use verbose output to see transfer decisions: ```bash wf2wf convert -i workflow.smk --out-format dagman --verbose ``` Examine generated `.sub` files to verify transfer specifications: ```bash grep "transfer_.*_files" *.sub ``` ## Advanced Configuration ### Custom Patterns You can extend the automatic detection by modifying the patterns in: - `wf2wf/importers/snakemake.py` → `_detect_transfer_mode()` ### Engine-Specific Handling Different engines may interpret transfer modes differently: - **DAGMan**: Generates `transfer_input_files`/`transfer_output_files` - **Snakemake**: Ignores transfer modes (assumes shared filesystem) - **CWL**: Could map to staging directives (future enhancement) ## Related Documentation - [Installation Guide](installation.md) - Setting up shared storage - [Engine Overview](engines/overview.md) - Understanding different workflow engines - [Troubleshooting](troubleshooting.md) - Common conversion issues --- ## Workflow Conversion: Shared Filesystem vs Distributed Computing This section demonstrates the key differences between shared filesystem workflows (like Snakemake) and distributed computing workflows (like HTCondor DAGMan), and how wf2wf handles these conversions. ### Key Differences #### 1. File Transfer Assumptions **Shared Filesystem (Snakemake)** - Assumes all files are accessible on a shared filesystem - No explicit file transfer needed - Files referenced by relative paths **Distributed Computing (HTCondor)** - Jobs run on different machines - Files must be explicitly transferred to/from execution nodes - Requires `transfer_input_files` and `transfer_output_files` directives #### 2. Resource Requirements **Shared Filesystem** - Often minimal or implicit resource specifications - Relies on system defaults or queue limits **Distributed Computing** - Explicit resource allocation required - `request_cpus`, `request_memory`, `request_disk` must be specified - GPU resources need explicit allocation #### 3. Environment Isolation **Shared Filesystem** - Often uses system-wide installations or conda environments - Environment setup handled outside workflow **Distributed Computing** - Requires explicit container specifications - Environment must be portable across execution nodes - Docker/Singularity containers preferred #### 4. Error Handling **Shared Filesystem** - Basic retry mechanisms - Often relies on external monitoring **Distributed Computing** - Sophisticated retry policies - Job-level error handling and recovery - Priority and preemption support #### 5. Scatter/Gather Patterns **Shared Filesystem** - Implicit parallelization through wildcards - Dynamic job generation **Distributed Computing** - Explicit scatter specifications needed - Static job definitions ### Example: Snakemake to DAGMan Conversion #### Input: Snakemake Workflow ```python # Snakefile rule all: input: "results/final_report.txt" rule process_data: input: "data/{sample}.txt" output: "processed/{sample}.txt" shell: "python process.py {input} > {output}" rule analyze: input: "processed/{sample}.txt" output: "results/{sample}_analysis.txt" shell: "python analyze.py {input} > {output}" rule report: input: expand("results/{sample}_analysis.txt", sample=["A", "B", "C"]) output: "results/final_report.txt" shell: "python report.py {input} > {output}" ``` #### Output: DAGMan Workflow ```text # HTCondor DAGMan file JOB process_data_A process_data_A.sub JOB process_data_B process_data_B.sub JOB process_data_C process_data_C.sub JOB analyze_A analyze_A.sub JOB analyze_B analyze_B.sub JOB analyze_C analyze_C.sub JOB report report.sub PARENT process_data_A CHILD analyze_A PARENT process_data_B CHILD analyze_B PARENT process_data_C CHILD analyze_C PARENT analyze_A analyze_B analyze_C CHILD report ``` ```bash # process_data_A.sub executable = scripts/process_data_A.sh request_cpus = 1 request_memory = 4096MB request_disk = 4096MB transfer_input_files = data/A.txt transfer_output_files = processed/A.txt universe = vanilla queue ``` ### Configuration Issues and Solutions #### 1. Missing Resource Requirements **Problem**: Snakemake workflow has no explicit resource specifications. **Solution**: wf2wf prompts user to add default resources: ```bash $ wf2wf convert -i Snakefile -o workflow.dag --interactive Found 3 tasks without explicit resource requirements. Distributed systems require explicit resource allocation. Add default resource specifications? (y)es/(n)o/(a)lways/(q)uit: y ``` #### 2. Missing Container Specifications **Problem**: No environment isolation specified. **Solution**: wf2wf prompts for container specifications: ```bash Found 3 tasks without container or conda specifications. Distributed systems typically require explicit environment isolation. Add container specifications or conda environments? (y)es/(n)o/(a)lways/(q)uit: y ``` #### 3. Missing Error Handling **Problem**: No retry specifications for fault tolerance. **Solution**: wf2wf adds default retry policies: ```bash Found 3 tasks without retry specifications. Distributed systems benefit from explicit error handling. Add retry specifications for failed tasks? (y)es/(n)o/(a)lways/(q)uit: y ``` #### 4. File Transfer Modes **Problem**: Files need explicit transfer specifications. **Solution**: wf2wf automatically detects and sets transfer modes: ```python # Auto-detected transfer modes ParameterSpec(id="data/A.txt", type="File", transfer_mode="always") # Input file ParameterSpec(id="processed/A.txt", type="File", transfer_mode="always") # Output file ParameterSpec(id="/shared/reference.fa", type="File", transfer_mode="shared") # Shared reference ``` ### Interactive Mode Features #### Automatic Configuration Detection When using `--interactive`, wf2wf automatically detects: 1. **Resource Gaps**: Tasks without memory/disk specifications 2. **Environment Issues**: Tasks without container/conda specifications 3. **Error Handling**: Tasks without retry policies 4. **File Transfer**: Files with auto-detected transfer modes #### Smart Defaults wf2wf applies intelligent defaults: - **Memory**: 4GB default for compute tasks - **Disk**: 4GB default for data processing tasks - **Retry**: 2 retries for fault tolerance - **Transfer Mode**: Auto-detected based on file paths #### Configuration Validation The conversion report includes a "Configuration Analysis" section: ```markdown ## Configuration Analysis ### Potential Issues for Distributed Computing * **Memory**: 2 tasks without explicit memory requirements * **Containers**: 3 tasks without container/conda specifications * **Error Handling**: 3 tasks without retry specifications * **File Transfer**: 6 files with auto-detected transfer modes **Recommendations:** * Add explicit resource requirements for all tasks * Specify container images or conda environments for environment isolation * Configure retry policies for fault tolerance * Review file transfer modes for distributed execution ``` ### Best Practices #### For Shared Filesystem Workflows 1. **Add Resource Specifications**: Even if not required, specify memory/disk needs 2. **Use Containers**: Specify conda environments or container images 3. **Add Retry Logic**: Include retry specifications for robustness 4. **Document Dependencies**: Make file dependencies explicit #### For Distributed Computing Workflows 1. **Explicit Resources**: Always specify CPU, memory, and disk requirements 2. **Container Isolation**: Use Docker or Singularity containers 3. **Error Handling**: Configure retry policies and error strategies 4. **File Transfer**: Review and optimize file transfer patterns 5. **Monitoring**: Set up proper logging and monitoring #### Conversion Workflow 1. **Analyze Source**: Understand the source workflow's assumptions 2. **Interactive Review**: Use `--interactive` to review configuration gaps 3. **Apply Defaults**: Let wf2wf apply intelligent defaults 4. **Customize**: Adjust configurations based on your infrastructure 5. **Validate**: Test the converted workflow thoroughly ### Example Commands ```bash # Basic conversion with warnings wf2wf convert -i Snakefile -o workflow.dag # Interactive conversion with configuration prompts wf2wf convert -i Snakefile -o workflow.dag --interactive # Automatic environment handling wf2wf convert -i Snakefile -o workflow.dag --auto-env build # Generate detailed report wf2wf convert -i Snakefile -o workflow.dag --report-md conversion_report.md # Validate the conversion wf2wf validate workflow.dag ``` This comprehensive approach ensures that workflows converted between different execution environments maintain their functionality while adapting to the target system's requirements.