Recently I started using the OPIG servers to run the algorithm I have developed (CRANkS) on datasets from DUDE (Database of Useful Decoys Enhanced).
This required learning how to run jobs in parallel. Previously I had been using computer clusters with their own queuing system (Torque/PBS), which allowed me to submit each molecule to be scored by the algorithm as a separate job. The queuing system would then automatically allocate nodes to jobs and execute them accordingly. As a side note, I learnt how to submit these jobs as an array, which was preferable to submitting ~150,000 separate jobs:
qsub -t 1-X array_submit.sh
where the contents of array_submit.sh would be:
#!/bin/bash
# $SGE_TASK_ID is set by the queuing system to this task's index in the array
./$SGE_TASK_ID.sh
which would submit jobs 1.sh to X.sh, where X is the total number of jobs.
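For completeness, here is a rough sketch of how the numbered job scripts themselves might be generated, one per molecule. The score_molecule command and the ligands/ and results/ directory layout are placeholders for illustration, not the actual CRANkS invocation:

#!/bin/bash
# Sketch: write one self-contained job script per molecule,
# so the jobs can be picked up by index (1.sh, 2.sh, ...).
mkdir -p results
i=1
for ligand in ligands/*.mol2; do
    printf '#!/bin/bash\n./score_molecule %s > results/%d.out\n' "$ligand" "$i" > "$i.sh"
    chmod +x "$i.sh"
    i=$((i + 1))
done

Each generated script then contains a single scoring command, which keeps both the array submission above and the GNU parallel approach below trivially simple.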
However, the OPIG servers do not have a global queuing system, so I needed a way to run the code I already had in parallel, with minimal changes to the workflow or the code itself. There are many ways to run jobs in parallel, but to minimise work for myself I decided to use GNU parallel [1].
This is an easy-to-use shell tool, which was quick to install onto my home server, allowing me to access it from each of the OPIG servers.
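For anyone wanting to do the same without root access, one way (a sketch; the exact tarball name and version will differ) is to build it from source into your home directory:

wget https://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
tar xjf parallel-latest.tar.bz2
cd parallel-*/
./configure --prefix=$HOME
make && make install
export PATH=$HOME/bin:$PATH    # add this line to ~/.bashrc so every server picks it up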
To use it I simply run the command:
cat submit.sh | parallel -j Y
where Y is the number of cores to run the jobs on, and submit.sh contains:
./1.sh
./2.sh
...
./X.sh
This executes the jobs listed in submit.sh, running up to Y of them in parallel and starting the next job as soon as a core becomes available.
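Since the job scripts are simply numbered 1.sh to X.sh, the same thing can be done without writing submit.sh at all, by piping the indices straight into parallel and using its {} replacement string:

seq 1 X | parallel -j Y ./{}.sh

Here {} is replaced by each number read from stdin, so parallel runs ./1.sh through ./X.sh, Y at a time.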
Quick, easy, simple and minimal modifications needed! Thanks to Jin for introducing me to GNU Parallel!
[1] O. Tange (2011): GNU Parallel – The Command-Line Power Tool, The USENIX Magazine, February 2011:42-47.