Use case for standalone scripts
If you process POD5 files on shared high-performance computing (HPC) resources, e.g., on a Slurm worker node on a university cluster, you will benefit from scripts that optimize file IO. Neglecting file IO can make programs MUCH slower than they need to be.
If you are struggling with slow POD5 processing, these scripts might help you. Our goal is to keep your cluster CPUs and/or GPUs working as close to 100% as possible.
Problems and solutions
The standalone scripts share a common approach: batched analysis, i.e., processing a portion of your total POD5 files at a time. This solves many problems.
Problem | Solution / Strategy |
---|---|
File IO to/from nodes is slow | Batched analysis and parallel processing allow file transfers to/from nodes and ONT commands to run largely concurrently |
POD5 file actions are IO-limited and HDDs are very slow | Batched analysis allows use of shared-memory drives, i.e., /dev/shm (or small SSDs), for maximal processing speed |
ONT data sets are very large | Batched analysis makes it unnecessary for all of your run data files to reside on your fastest drives at the same time |
Basecalling sometimes crashes | Batched writing of BAM files allows jobs to be restarted from where they failed |
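
The batching strategy in the table can be illustrated with a minimal Python sketch. It is not the code of the standalone scripts themselves. The names `make_batches`, `process_in_batches`, and `process_batch`, as well as the `batch_00000.bam` naming scheme, are hypothetical and chosen only for this example. Each batch is copied to a fast scratch directory (e.g., one under /dev/shm), processed there, and its output is moved to the final location only after the batch succeeds, so a rerun skips batches that already finished.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterator, List


def make_batches(files: List[Path], batch_size: int) -> Iterator[List[Path]]:
    """Yield consecutive slices of `files`, each holding at most `batch_size` items."""
    for start in range(0, len(files), batch_size):
        yield files[start:start + batch_size]


def process_in_batches(
    input_dir: Path,
    output_dir: Path,
    scratch_dir: Path,
    batch_size: int,
    process_batch: Callable[[Path, Path], None],
) -> List[Path]:
    """Process the POD5 files in `input_dir` one batch at a time.

    `process_batch(staging_dir, output_file)` is a hypothetical callback that
    would wrap whatever command does the real work on one staged batch.
    Returns the outputs written during this call; batches whose output
    already exists are skipped, which is what makes a restart cheap.
    """
    input_dir, output_dir, scratch_dir = Path(input_dir), Path(output_dir), Path(scratch_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Sorting keeps batch membership stable between runs. Restarting assumes
    # the input file list and batch size have not changed since the last run.
    files = sorted(input_dir.glob("*.pod5"))

    written: List[Path] = []
    for index, batch in enumerate(make_batches(files, batch_size)):
        final_output = output_dir / f"batch_{index:05d}.bam"
        if final_output.exists():
            continue  # finished in an earlier run

        # The temporary directory is deleted when the block exits, so only
        # one batch occupies the fast (and usually small) drive at a time.
        with tempfile.TemporaryDirectory(dir=scratch_dir) as tmp:
            staging = Path(tmp)
            for source in batch:
                shutil.copy2(source, staging / source.name)

            partial = staging / "partial.bam"
            process_batch(staging, partial)

            # Copy off the scratch drive under a temporary name, then rename.
            # A crash therefore never leaves a file that looks complete.
            pending = output_dir / (final_output.name + ".tmp")
            shutil.move(str(partial), str(pending))
            pending.replace(final_output)

        written.append(final_output)
    return written
```

This sketch runs the copy and processing steps one after another for clarity. Overlapping the transfer of the next batch with the processing of the current one, as described in the table, would need an additional worker thread or process that is not shown here.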