When developing a pipeline for processing, annotating, or analyzing data, you will probably find yourself re-running it again and again as you tweak your code. This becomes a problem with long pipelines, large datasets, and CPUs begging you not to run certain pieces of code yet again.
Luckily, you are not the first one to have been annoyed by this and other related struggles. Some people were actually so annoyed that they created Snakemake. Snakemake can be used to create workflows and help solve problems, such as the one mentioned above. This is done using a Snakefile, which helps you split your pipeline into “rules”. To illustrate how this helps you create a better workflow, we will be looking at the example below.
Installing Snakemake
First we need to install Snakemake. Snakemake can be installed using either pip or conda. However, conda is recommended, as the full version of Snakemake includes non-Python dependencies not included with pip, meaning the pip version has limited functionality.
conda install -c conda-forge -c bioconda snakemake
Creating a Snakefile
The next step is to create the Snakefile, in which you define the rules of your workflow. A rule describes one step of the workflow: the input files it needs, the output files it creates, and the command that turns one into the other. Below is a simple example of a Snakefile with two rules. Snakefiles also support wildcards, meaning {csvdata} will be filled in according to the name of the file you are working with.
rule add:
    input:
        "{csvdata}.csv"
    output:
        "{csvdata}_added.csv"
    shell:
        "python ./tools.py {input} {output} add"

rule multiply:
    input:
        "{csvdata}_added.csv"
    output:
        "{csvdata}_multiplied.csv"
    shell:
        "python ./tools.py {input} {output} multiply"
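To make the wildcard idea a bit more concrete, here is a minimal Python sketch of how a pattern such as {csvdata}_multiplied.csv could be matched against a requested file name. The function name match_wildcards is made up for this illustration, and this is not Snakemake's actual implementation, only a rough approximation of the behaviour described above.

```python
import re


def match_wildcards(pattern, filename):
    """Return the wildcard values if `filename` fits `pattern`, else None.

    Hypothetical helper for illustration only; it assumes each wildcard
    name appears at most once in the pattern.
    """
    regex = ""
    pos = 0
    # Turn every {name} placeholder into a named regex group and
    # escape the literal text in between.
    for placeholder in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:placeholder.start()])
        regex += f"(?P<{placeholder.group(1)}>.+)"
        pos = placeholder.end()
    regex += re.escape(pattern[pos:])

    found = re.fullmatch(regex, filename)
    return found.groupdict() if found else None


if __name__ == "__main__":
    print(match_wildcards("{csvdata}_multiplied.csv", "data_multiplied.csv"))
    print(match_wildcards("{csvdata}_added.csv", "data.csv"))
```

The first call fills csvdata with 'data', while the second finds no match because data.csv does not end in _added.csv.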
Running Snakemake
Instead of the usual approach of providing an input and receiving an output, with Snakemake you name the output file you want to create. Snakemake then works out which rule produces that file and executes it.
Using the above Snakefile, if we told Snakemake to create data_multiplied.csv, it would match data_multiplied.csv with {csvdata}_multiplied.csv ({csvdata} is now replaced by 'data'). However, creating data_multiplied.csv requires the file data_added.csv. If this file does not exist, Snakemake will look for another rule that produces it and execute that one first. In our example, the 'add' rule produces data_added.csv when given data.csv, which is the data file we start with. Snakemake will therefore first execute the 'add' rule to create data_added.csv and then the 'multiply' rule to create data_multiplied.csv. If we already had an up-to-date data_added.csv file, Snakemake would only run the 'multiply' rule, saving us some computation.
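The backward reasoning described above can be sketched in a few lines of plain Python. This is only a toy model under simplifying assumptions: the names RULES, match and plan are invented for the illustration, the set `existing` stands in for the files on disk, and file timestamps (which the real Snakemake also takes into account) are ignored.

```python
# The two rules from the Snakefile: (name, input pattern, output pattern).
RULES = [
    ("add", "{csvdata}.csv", "{csvdata}_added.csv"),
    ("multiply", "{csvdata}_added.csv", "{csvdata}_multiplied.csv"),
]


def match(pattern, filename):
    """Return the value of {csvdata} if filename fits pattern, else None."""
    prefix, suffix = pattern.split("{csvdata}")
    if not (filename.startswith(prefix) and filename.endswith(suffix)):
        return None
    value = filename[len(prefix):len(filename) - len(suffix)]
    return value or None  # an empty wildcard does not count as a match


def plan(target, existing):
    """Return the rule names to run, in order, to create `target`."""
    if target in existing:
        return []  # the file is already there, nothing to do
    for name, input_pattern, output_pattern in RULES:
        value = match(output_pattern, target)
        if value is None:
            continue
        needed = input_pattern.replace("{csvdata}", value)
        try:
            # First make sure the input exists, then run this rule.
            return plan(needed, existing) + [name]
        except FileNotFoundError:
            continue  # this rule cannot be satisfied, try the next one
    raise FileNotFoundError(f"no rule produces {target}")


if __name__ == "__main__":
    print(plan("data_multiplied.csv", {"data.csv"}))
    print(plan("data_multiplied.csv", {"data.csv", "data_added.csv"}))
```

With only data.csv available the plan is 'add' followed by 'multiply'; once data_added.csv is present, only 'multiply' remains.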
To get a better feel for it, let us run our example. To do so, we first need to create the tools.py file.
import sys

import pandas as pd


def add(x):
    return x + 2


def multiply(x):
    return x * 2


if __name__ == '__main__':
    # Usage: python tools.py <input csv> <output csv> <add|multiply>
    inputFilename = sys.argv[1]
    outputFilename = sys.argv[2]
    method = sys.argv[3]

    a_input = pd.read_csv(inputFilename)

    # Apply the chosen operation to every column and save the result.
    if method == 'add':
        a_input.apply(add).to_csv(outputFilename, index=False)
    elif method == 'multiply':
        a_input.apply(multiply).to_csv(outputFilename, index=False)
And our data.csv file.
X
1
2
3
4
5
6
7
8
9
10
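To run Snakemake, you need to be in a folder containing the Snakefile, tools.py and data.csv, and then run the command below. Recent Snakemake releases require you to state how many cores may be used, hence --cores 1.
snakemake --cores 1 data_multiplied.csv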
This will first generate the data_added.csv file and then the data_multiplied.csv file. If you run the command again, it will tell you that there is nothing new to run.
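If you want to check the results by hand, the arithmetic is easy to reproduce without Snakemake or pandas. The sketch below simply applies the same two functions from tools.py to the values 1 to 10; the variable names are chosen for this illustration.

```python
def add(x):
    return x + 2


def multiply(x):
    return x * 2


# The X column of data.csv.
data = list(range(1, 11))

# What the 'add' rule should write to data_added.csv.
added = [add(x) for x in data]

# What the 'multiply' rule should write to data_multiplied.csv.
multiplied = [multiply(x) for x in added]

if __name__ == "__main__":
    print(added)       # the values 3 up to 12
    print(multiplied)  # the even values 6 up to 24
```

So data_added.csv should hold the values 3 to 12 under the header X, and data_multiplied.csv the even numbers from 6 to 24.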
Snakemake also includes many useful options, such as '-n', which shows each step needed to create the output file without running anything, '-F', which forces all steps to run again, and '-j', which sets how many jobs may run in parallel. This barely scratches the surface of what Snakemake has to offer, but I hope this short blog post has illustrated the usefulness of this tool.