File Transfer Handling for Distributed Computing
Overview
One of the critical challenges in workflow conversion is handling the fundamental difference between shared filesystem workflows (like Snakemake, CWL) and distributed computing workflows (like HTCondor/DAGMan). This document explains how wf2wf addresses this challenge.
The Problem
Different workflow systems make different assumptions about file accessibility:
Distributed Computing Workflows (HTCondor/DAGMan)
Assumption: Compute nodes may not share filesystems
File Handling: Files must be explicitly transferred to/from compute nodes
Intermediate Files: Must be transferred between dependent tasks
Reference Data: May need to be on shared storage or transferred
wf2wf’s Solution: Transfer Modes
wf2wf introduces transfer modes in the intermediate representation to capture file transfer requirements:
Transfer Modes
| Mode | Description | Use Case |
|---|---|---|
| auto | Automatically determine transfer need (default) | Most regular files |
| always | Always transfer, regardless of environment | Critical files that must be local |
| never | Never transfer, local only | Temporary/log files |
| shared | On shared storage, accessible to all nodes | Reference genomes, large databases |
Conversion Behavior
Snakemake → DAGMan
When converting from Snakemake to DAGMan, wf2wf:
Analyzes file paths to detect likely shared storage locations
Applies heuristics to determine appropriate transfer modes:
/shared/, /nfs/, /data/ → shared
.tmp, /tmp/, .log → never
Reference file extensions (.fa, .bam, .gtf) → shared
Regular files → auto (will be transferred)
DAGMan → Snakemake
When converting from DAGMan to Snakemake:
Preserves transfer specifications from HTCondor submit files
Maps transfer directives to appropriate transfer modes
Omits transfer specifications from the Snakemake output, since Snakemake assumes a shared filesystem and does not need them
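The steps above can be illustrated with a minimal sketch that reads transfer-related commands from HTCondor submit-file text and assigns a transfer mode to each listed file. The function names `read_transfer_directives` and `modes_from_directives` are hypothetical, and the mapping itself (`should_transfer_files = YES` becomes `always`, anything else becomes `auto`) is an assumption made for illustration; the text does not state which mapping wf2wf actually applies.

```python
TRANSFER_KEYS = ("should_transfer_files", "transfer_input_files", "transfer_output_files")


def read_transfer_directives(submit_text):
    """Collect transfer-related `key = value` commands from submit-file text."""
    directives = {}
    for raw in submit_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and statements without a value (e.g. `queue`).
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip().lower()  # submit commands are case-insensitive
        if key in TRANSFER_KEYS:
            directives[key] = value.strip()
    return directives


def modes_from_directives(directives):
    """Map each listed file to a transfer mode (assumed mapping, not wf2wf's)."""
    policy = directives.get("should_transfer_files", "").upper()
    mode = "always" if policy == "YES" else "auto"
    modes = {}
    for key in ("transfer_input_files", "transfer_output_files"):
        for path in directives.get(key, "").split(","):
            path = path.strip()
            if path:
                modes[path] = mode
    return modes


example_sub = """
executable = align.sh
should_transfer_files = YES
transfer_input_files = data.txt, analysis.conf
queue
"""
print(modes_from_directives(read_transfer_directives(example_sub)))
```

Since the Snakemake output carries no transfer specifications, a mapping like this would only matter for the intermediate representation, not for the generated Snakefile.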
Examples
Basic Usage
from wf2wf.core import ParameterSpec

# Regular file - will be transferred (auto mode)
input_file = "data.txt"

# Reference on shared storage - no transfer needed
reference = ParameterSpec(
    id="/shared/genomes/hg38.fa",
    type="File",
    transfer_mode="shared"
)

# Critical config file - always transfer
config = ParameterSpec(
    id="analysis.conf",
    type="File",
    transfer_mode="always"
)

# Temporary file - local only
temp = ParameterSpec(
    id="temp.log",
    type="File",
    transfer_mode="never"
)
DAGMan Output
When exported to DAGMan, only auto and always files appear in transfer lists:
# In generated .sub file:
transfer_input_files = data.txt,analysis.conf
# /shared/genomes/hg38.fa and temp.log are excluded
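The filtering rule described above can be sketched in a few lines. This is an illustration only: `FileSpec` and `transfer_input_line` are hypothetical names standing in for whatever wf2wf uses internally, and the sketch covers just the rule that `auto` and `always` files are listed while `shared` and `never` files are left out.

```python
from dataclasses import dataclass

# Modes that the text says end up in DAGMan transfer lists.
TRANSFERRED_MODES = {"auto", "always"}


@dataclass
class FileSpec:
    """Hypothetical stand-in for a file entry; not wf2wf's ParameterSpec."""
    path: str
    transfer_mode: str = "auto"


def transfer_input_line(files):
    """Build a transfer_input_files line, or return None if nothing is transferred."""
    paths = [f.path for f in files if f.transfer_mode in TRANSFERRED_MODES]
    if not paths:
        return None
    return "transfer_input_files = " + ",".join(paths)


inputs = [
    FileSpec("data.txt"),                           # auto by default
    FileSpec("/shared/genomes/hg38.fa", "shared"),  # excluded
    FileSpec("analysis.conf", "always"),
    FileSpec("temp.log", "never"),                  # excluded
]
print(transfer_input_line(inputs))
# prints: transfer_input_files = data.txt,analysis.conf
```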
Automatic Detection
Snakemake Import
wf2wf automatically detects transfer modes based on file path patterns:
Shared Storage Patterns:
/nfs/, /shared/, /data/, /storage/
/lustre/, /gpfs/, /beegfs/
Cloud URLs: gs://, s3://, https://
Local/Temporary Patterns:
/tmp/, .tmp, temp_
.log, .err, .out
/dev/, /proc/, /sys/
Reference Data Patterns:
.fa, .fasta, .genome, .gtf, .gff
.bam, .sam, .bed
Directories: reference/, genome/, annotation/
Example Automatic Detection
# Input: /shared/data/genome.fa
# → Detected as transfer_mode="shared"
# Input: temp_analysis.tmp
# → Detected as transfer_mode="never"
# Input: sample_data.txt
# → Detected as transfer_mode="auto"
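A pattern-based classifier along these lines might look like the sketch below. It is not the code in wf2wf; `guess_transfer_mode` is a hypothetical name, the pattern lists are copied from the section above, and two details are assumptions: path patterns are matched as prefixes, and local/temporary patterns take precedence over shared and reference patterns when both match.

```python
SHARED_PREFIXES = ("/nfs/", "/shared/", "/data/", "/storage/",
                   "/lustre/", "/gpfs/", "/beegfs/")
CLOUD_SCHEMES = ("gs://", "s3://", "https://")
LOCAL_PREFIXES = ("/tmp/", "/dev/", "/proc/", "/sys/")
LOCAL_SUFFIXES = (".tmp", ".log", ".err", ".out")
REFERENCE_SUFFIXES = (".fa", ".fasta", ".genome", ".gtf", ".gff",
                      ".bam", ".sam", ".bed")
REFERENCE_DIRS = {"reference", "genome", "annotation"}


def guess_transfer_mode(path):
    """Classify a path as 'never', 'shared', or 'auto' from its name alone."""
    parts = path.split("/")
    dirs, name = parts[:-1], parts[-1]

    # Assumed precedence: temporary and log files win over everything else.
    if (path.startswith(LOCAL_PREFIXES)
            or name.endswith(LOCAL_SUFFIXES)
            or name.startswith("temp_")):
        return "never"

    # Shared mounts, cloud URLs, reference extensions, or reference directories.
    if (path.startswith(SHARED_PREFIXES + CLOUD_SCHEMES)
            or name.endswith(REFERENCE_SUFFIXES)
            or REFERENCE_DIRS.intersection(dirs)):
        return "shared"

    # Everything else is a regular file and will be transferred.
    return "auto"


for example in ("/shared/data/genome.fa", "temp_analysis.tmp", "sample_data.txt"):
    print(example, "->", guess_transfer_mode(example))
```

With these assumptions the three examples above come out as shared, never, and auto respectively. A different precedence would change the result for paths such as a log file stored under /shared/.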
File Transfer Best Practices
For System Administrators
Shared Storage: Ensure shared paths are consistently mounted
Reference Data: Place large datasets on shared storage
Transfer Optimization: Use appropriate transfer modes to minimize data movement
Troubleshooting
Common Issues
Problem: Files not found on compute nodes
Solution: Check the transfer modes and make sure required files use auto or always
Problem: Unnecessary large file transfers
Solution: Mark reference data as shared
Problem: Log files filling transfer directories
Solution: Mark log files as never
Debugging Transfer Specifications
Use verbose output to see transfer decisions:
wf2wf convert -i workflow.smk --out-format dagman --verbose
Examine generated .sub files to verify transfer specifications:
grep "transfer_.*_files" *.sub
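When there are many submit files, the same check can be scripted. The sketch below is a hypothetical helper, not part of wf2wf: it scans a directory for .sub files and collects every `transfer_*_files` line into a dictionary, which makes it easier to compare against the expected file lists.

```python
import re
from pathlib import Path

# Same idea as the grep pattern above: transfer_input_files, transfer_output_files, ...
TRANSFER_LINE = re.compile(r"^\s*(transfer_\w+_files)\s*=\s*(.*)$", re.IGNORECASE)


def collect_transfer_lines(directory):
    """Return {submit file name: {directive: [files]}} for each .sub file found."""
    report = {}
    for sub_file in sorted(Path(directory).glob("*.sub")):
        found = {}
        for line in sub_file.read_text().splitlines():
            match = TRANSFER_LINE.match(line)
            if match:
                # Split the comma-separated value and drop empty entries.
                files = [f.strip() for f in match.group(2).split(",") if f.strip()]
                found[match.group(1).lower()] = files
        report[sub_file.name] = found
    return report
```

Calling `collect_transfer_lines(".")` in the output directory returns one entry per submit file; an empty inner dictionary means that file declares no transfer lists.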
Advanced Configuration
Custom Patterns
You can extend the automatic detection by modifying the patterns in:
wf2wf/importers/snakemake.py → _detect_transfer_mode()
Engine-Specific Handling
Different engines may interpret transfer modes differently:
DAGMan: Generates transfer_input_files / transfer_output_files
Snakemake: Ignores transfer modes (assumes shared filesystem)
CWL: Could map to staging directives (future enhancement)