File Transfer Handling for Distributed Computing

Overview

One of the critical challenges in workflow conversion is handling the fundamental difference between shared filesystem workflows (like Snakemake, CWL) and distributed computing workflows (like HTCondor/DAGMan). This document explains how wf2wf addresses this challenge.

The Problem

Different workflow systems make different assumptions about file accessibility:

Shared Filesystem Workflows (Snakemake, CWL, Nextflow)

  • Assumption: All compute nodes share the same filesystem

  • File Handling: Files can be left in place between tasks

  • Intermediate Files: Can be accessed directly by path

  • Reference Data: Assumed to be accessible from all nodes

Distributed Computing Workflows (HTCondor/DAGMan)

  • Assumption: Compute nodes may not share filesystems

  • File Handling: Files must be explicitly transferred to/from compute nodes

  • Intermediate Files: Must be transferred between dependent tasks

  • Reference Data: May need to be on shared storage or transferred

wf2wf’s Solution: Transfer Modes

wf2wf introduces transfer modes in the intermediate representation to capture file transfer requirements:

Transfer Modes

Mode

Description

Use Case

auto

Automatically determine transfer need (default)

Most regular files

always

Always transfer, regardless of environment

Critical files that must be local

never

Never transfer, local only

Temporary/log files

shared

On shared storage, accessible to all nodes

Reference genomes, large databases

Conversion Behavior

Snakemake → DAGMan

When converting from Snakemake to DAGMan, wf2wf:

  1. Analyzes file paths to detect likely shared storage locations

  2. Applies heuristics to determine appropriate transfer modes:

    • /shared/, /nfs/, /data/shared

    • .tmp, /tmp/, .lognever

    • Reference file extensions (.fa, .bam, .gtf) → shared

    • Regular files → auto (will be transferred)

DAGMan → Snakemake

When converting from DAGMan to Snakemake:

  1. Preserves transfer specifications from HTCondor submit files

  2. Maps transfer directives to appropriate transfer modes

  3. Removes transfer specifications in Snakemake output (not needed)

Examples

Basic Usage

from wf2wf.core import ParameterSpec

# Regular file - will be transferred (auto mode)
input_file = "data.txt"

# Reference on shared storage - no transfer needed
reference = ParameterSpec(
    id="/shared/genomes/hg38.fa",
    type="File",
    transfer_mode="shared"
)

# Critical config file - always transfer
config = ParameterSpec(
    id="analysis.conf",
    type="File", 
    transfer_mode="always"
)

# Temporary file - local only
temp = ParameterSpec(
    id="temp.log",
    type="File",
    transfer_mode="never"
)

DAGMan Output

When exported to DAGMan, only auto and always files appear in transfer lists:

# In generated .sub file:
transfer_input_files = data.txt,analysis.conf
# /shared/genomes/hg38.fa and temp.log are excluded

Automatic Detection

Snakemake Import

wf2wf automatically detects transfer modes based on file path patterns:

Shared Storage Patterns:

  • /nfs/, /shared/, /data/, /storage/

  • /lustre/, /gpfs/, /beegfs/

  • Cloud URLs: gs://, s3://, https://

Local/Temporary Patterns:

  • /tmp/, .tmp, temp_

  • .log, .err, .out

  • /dev/, /proc/, /sys/

Reference Data Patterns:

  • .fa, .fasta, .genome, .gtf, .gff

  • .bam, .sam, .bed

  • Directories: reference/, genome/, annotation/

Example Automatic Detection

# Input: /shared/data/genome.fa
# → Detected as transfer_mode="shared"

# Input: temp_analysis.tmp  
# → Detected as transfer_mode="never"

# Input: sample_data.txt
# → Detected as transfer_mode="auto"

File Transfer Best Practices

For Workflow Authors

  1. Be Explicit: Use ParameterSpec with explicit transfer_mode for clarity

  2. Consider Environment: Think about where files will be stored

  3. Reference Data: Mark large reference files as shared

  4. Temporary Files: Mark temp/log files as never

For System Administrators

  1. Shared Storage: Ensure shared paths are consistently mounted

  2. Reference Data: Place large datasets on shared storage

  3. Transfer Optimization: Use appropriate transfer modes to minimize data movement

Troubleshooting

Common Issues

Problem: Files not found on compute nodes

  • Solution: Check transfer modes, ensure auto or always for required files

Problem: Unnecessary large file transfers

  • Solution: Mark reference data as shared mode

Problem: Log files filling transfer directories

  • Solution: Mark log files as never mode

Debugging Transfer Specifications

Use verbose output to see transfer decisions:

wf2wf convert -i workflow.smk --out-format dagman --verbose

Examine generated .sub files to verify transfer specifications:

grep "transfer_.*_files" *.sub

Advanced Configuration

Custom Patterns

You can extend the automatic detection by modifying the patterns in:

  • wf2wf/importers/snakemake.py_detect_transfer_mode()

Engine-Specific Handling

Different engines may interpret transfer modes differently:

  • DAGMan: Generates transfer_input_files/transfer_output_files

  • Snakemake: Ignores transfer modes (assumes shared filesystem)

  • CWL: Could map to staging directives (future enhancement)


Workflow Conversion: Shared Filesystem vs Distributed Computing

This section demonstrates the key differences between shared filesystem workflows (like Snakemake) and distributed computing workflows (like HTCondor DAGMan), and how wf2wf handles these conversions.

Key Differences

1. File Transfer Assumptions

Shared Filesystem (Snakemake)

  • Assumes all files are accessible on a shared filesystem

  • No explicit file transfer needed

  • Files referenced by relative paths

Distributed Computing (HTCondor)

  • Jobs run on different machines

  • Files must be explicitly transferred to/from execution nodes

  • Requires transfer_input_files and transfer_output_files directives

2. Resource Requirements

Shared Filesystem

  • Often minimal or implicit resource specifications

  • Relies on system defaults or queue limits

Distributed Computing

  • Explicit resource allocation required

  • request_cpus, request_memory, request_disk must be specified

  • GPU resources need explicit allocation

3. Environment Isolation

Shared Filesystem

  • Often uses system-wide installations or conda environments

  • Environment setup handled outside workflow

Distributed Computing

  • Requires explicit container specifications

  • Environment must be portable across execution nodes

  • Docker/Singularity containers preferred

4. Error Handling

Shared Filesystem

  • Basic retry mechanisms

  • Often relies on external monitoring

Distributed Computing

  • Sophisticated retry policies

  • Job-level error handling and recovery

  • Priority and preemption support

5. Scatter/Gather Patterns

Shared Filesystem

  • Implicit parallelization through wildcards

  • Dynamic job generation

Distributed Computing

  • Explicit scatter specifications needed

  • Static job definitions

Example: Snakemake to DAGMan Conversion

Input: Snakemake Workflow

# Snakefile
rule all:
    input: "results/final_report.txt"

rule process_data:
    input: "data/{sample}.txt"
    output: "processed/{sample}.txt"
    shell: "python process.py {input} > {output}"

rule analyze:
    input: "processed/{sample}.txt"
    output: "results/{sample}_analysis.txt"
    shell: "python analyze.py {input} > {output}"

rule report:
    input: expand("results/{sample}_analysis.txt", sample=["A", "B", "C"])
    output: "results/final_report.txt"
    shell: "python report.py {input} > {output}"

Output: DAGMan Workflow

# HTCondor DAGMan file
JOB process_data_A process_data_A.sub
JOB process_data_B process_data_B.sub
JOB process_data_C process_data_C.sub
JOB analyze_A analyze_A.sub
JOB analyze_B analyze_B.sub
JOB analyze_C analyze_C.sub
JOB report report.sub

PARENT process_data_A CHILD analyze_A
PARENT process_data_B CHILD analyze_B
PARENT process_data_C CHILD analyze_C
PARENT analyze_A analyze_B analyze_C CHILD report
# process_data_A.sub
executable = scripts/process_data_A.sh
request_cpus = 1
request_memory = 4096MB
request_disk = 4096MB
transfer_input_files = data/A.txt
transfer_output_files = processed/A.txt
universe = vanilla
queue

Configuration Issues and Solutions

1. Missing Resource Requirements

Problem: Snakemake workflow has no explicit resource specifications.

Solution: wf2wf prompts user to add default resources:

$ wf2wf convert -i Snakefile -o workflow.dag --interactive

Found 3 tasks without explicit resource requirements. 
Distributed systems require explicit resource allocation. 
Add default resource specifications? (y)es/(n)o/(a)lways/(q)uit: y

2. Missing Container Specifications

Problem: No environment isolation specified.

Solution: wf2wf prompts for container specifications:

Found 3 tasks without container or conda specifications. 
Distributed systems typically require explicit environment isolation. 
Add container specifications or conda environments? (y)es/(n)o/(a)lways/(q)uit: y

3. Missing Error Handling

Problem: No retry specifications for fault tolerance.

Solution: wf2wf adds default retry policies:

Found 3 tasks without retry specifications. 
Distributed systems benefit from explicit error handling. 
Add retry specifications for failed tasks? (y)es/(n)o/(a)lways/(q)uit: y

4. File Transfer Modes

Problem: Files need explicit transfer specifications.

Solution: wf2wf automatically detects and sets transfer modes:

# Auto-detected transfer modes
ParameterSpec(id="data/A.txt", type="File", transfer_mode="always")  # Input file
ParameterSpec(id="processed/A.txt", type="File", transfer_mode="always")  # Output file
ParameterSpec(id="/shared/reference.fa", type="File", transfer_mode="shared")  # Shared reference

Interactive Mode Features

Automatic Configuration Detection

When using --interactive, wf2wf automatically detects:

  1. Resource Gaps: Tasks without memory/disk specifications

  2. Environment Issues: Tasks without container/conda specifications

  3. Error Handling: Tasks without retry policies

  4. File Transfer: Files with auto-detected transfer modes

Smart Defaults

wf2wf applies intelligent defaults:

  • Memory: 4GB default for compute tasks

  • Disk: 4GB default for data processing tasks

  • Retry: 2 retries for fault tolerance

  • Transfer Mode: Auto-detected based on file paths

Configuration Validation

The conversion report includes a “Configuration Analysis” section:

## Configuration Analysis

### Potential Issues for Distributed Computing

* **Memory**: 2 tasks without explicit memory requirements
* **Containers**: 3 tasks without container/conda specifications
* **Error Handling**: 3 tasks without retry specifications
* **File Transfer**: 6 files with auto-detected transfer modes

**Recommendations:**
* Add explicit resource requirements for all tasks
* Specify container images or conda environments for environment isolation
* Configure retry policies for fault tolerance
* Review file transfer modes for distributed execution

Best Practices

For Shared Filesystem Workflows

  1. Add Resource Specifications: Even if not required, specify memory/disk needs

  2. Use Containers: Specify conda environments or container images

  3. Add Retry Logic: Include retry specifications for robustness

  4. Document Dependencies: Make file dependencies explicit

For Distributed Computing Workflows

  1. Explicit Resources: Always specify CPU, memory, and disk requirements

  2. Container Isolation: Use Docker or Singularity containers

  3. Error Handling: Configure retry policies and error strategies

  4. File Transfer: Review and optimize file transfer patterns

  5. Monitoring: Set up proper logging and monitoring

Conversion Workflow

  1. Analyze Source: Understand the source workflow’s assumptions

  2. Interactive Review: Use --interactive to review configuration gaps

  3. Apply Defaults: Let wf2wf apply intelligent defaults

  4. Customize: Adjust configurations based on your infrastructure

  5. Validate: Test the converted workflow thoroughly

Example Commands

# Basic conversion with warnings
wf2wf convert -i Snakefile -o workflow.dag

# Interactive conversion with configuration prompts
wf2wf convert -i Snakefile -o workflow.dag --interactive

# Automatic environment handling
wf2wf convert -i Snakefile -o workflow.dag --auto-env build

# Generate detailed report
wf2wf convert -i Snakefile -o workflow.dag --report-md conversion_report.md

# Validate the conversion
wf2wf validate workflow.dag

This comprehensive approach ensures that workflows converted between different execution environments maintain their functionality while adapting to the target system’s requirements.