snakeMAKE better workflows with your code | Oxford Protein Informatics Group

When developing a pipeline for processing, annotating and/or analyzing data, you will probably find yourself re-running it again and again as you play around with your code. This becomes a problem when working with long pipelines, large datasets and CPUs begging you not to run some pieces of code again.

Luckily, you are not the first one to have been annoyed by this and other related struggles. Some people were actually so annoyed that they created Snakemake. Snakemake can be used to create workflows and help solve problems, such as the one mentioned above. This is done using a Snakefile, which helps you split your pipeline into “rules”. To illustrate how this helps you create a better workflow, we will be looking at the example below.

Installing Snakemake

First we need to install Snakemake. Snakemake can be installed using either pip or conda. However, conda is recommended, as the full version of Snakemake includes non-Python dependencies not included with pip, meaning the pip version has limited functionality.

conda install -c conda-forge -c bioconda snakemake

Creating a Snakefile

The next step is to create the Snakefile, in which you define the rules of your workflow. A rule describes one step in the workflow, together with the input files it needs and the output files it creates. Below is a simple example of a Snakefile with two rules. Snakefiles also support wildcards, meaning {csvdata} will be filled in according to the name of the file you are working with.

rule add:
    input:
        "{csvdata}.csv"
    output:
        "{csvdata}_added.csv"
    shell:
        "python ./tools.py {input} {output} add"

rule multiply:
    input:
        "{csvdata}_added.csv"
    output:
        "{csvdata}_multiplied.csv"
    shell:
        "python ./tools.py {input} {output} multiply"

Running Snakemake

Normally you give a program an input and receive an output. With Snakemake it is the other way around: you tell it which output file you want, and it executes the rule that produces that file.

Using the above Snakefile, if we told Snakemake to create data_multiplied.csv, it would match data_multiplied.csv with {csvdata}_multiplied.csv ({csvdata} will now be replaced by 'data'). However, to create data_multiplied.csv the file data_added.csv is needed. If this file does not exist, Snakemake will look for another rule which produces the needed file and execute that one first. In our example, the 'add' rule produces data_added.csv when given data.csv, which in our case is the data file we start with. Snakemake will therefore first execute the 'add' rule to create data_added.csv and then the 'multiply' rule to create data_multiplied.csv. If we already had the data_added.csv file, Snakemake would only run the 'multiply' rule, saving us some computation.
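To make the matching and back-tracking a bit more concrete, here is a minimal sketch of the idea in plain Python. It is not how Snakemake is implemented; the RULES table and the match_output and plan functions are hypothetical names made up for this illustration, and the sketch only looks at whether files exist.

```python
import re

# Hypothetical, simplified stand-ins for the two rules in the Snakefile:
# rule name -> (input pattern, output pattern).
RULES = {
    "add": ("{csvdata}.csv", "{csvdata}_added.csv"),
    "multiply": ("{csvdata}_added.csv", "{csvdata}_multiplied.csv"),
}


def match_output(pattern, filename):
    """Return the wildcard values if filename fits pattern, else None."""
    # Splitting on "{name}" leaves literal text at even positions and
    # wildcard names at odd positions.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for position, part in enumerate(parts):
        if position % 2:
            regex += f"(?P<{part}>.+)"  # wildcard: capture any text
        else:
            regex += re.escape(part)  # literal text: match exactly
    found = re.fullmatch(regex, filename)
    return found.groupdict() if found else None


def plan(target, existing_files):
    """List the rules to run, in order, to create target."""
    if target in existing_files:
        return []  # the file is already there, nothing to run
    for name, (input_pattern, output_pattern) in RULES.items():
        wildcards = match_output(output_pattern, target)
        if wildcards is not None:
            # Fill the wildcard into the input pattern and make sure
            # that file can be created before this rule runs.
            needed = input_pattern.format(**wildcards)
            return plan(needed, existing_files) + [name]
    raise FileNotFoundError(f"No rule produces {target}")


print(plan("data_multiplied.csv", {"data.csv"}))
print(plan("data_multiplied.csv", {"data.csv", "data_added.csv"}))
```

With only data.csv present, the first call lists both rules; once data_added.csv exists, the second call lists only the multiply rule.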

To get a better feeling of it, let us try and run our example. In order to do so, we first need to create the tools.py file.

import sys
import pandas as pd

def add(x):
    return x+2

def multiply(x):
    return x*2

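if __name__ == '__main__':
    # Usage: python tools.py <input.csv> <output.csv> <add|multiply>
    inputFilename = sys.argv[1]
    outputFilename = sys.argv[2]
    method = sys.argv[3]

    # Read the input table; every column is expected to be numeric.
    a_input = pd.read_csv(inputFilename)

    # Apply the chosen operation to each column and write the result.
    if method == 'add':
        a_input.apply(add).to_csv(outputFilename, index=False)
    elif method == 'multiply':
        a_input.apply(multiply).to_csv(outputFilename, index=False)
    else:
        sys.exit(f"Unknown method: {method}")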

And our data.csv file, which should contain only numeric columns, since tools.py applies the arithmetic to every column.
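If you do not have a suitable file at hand, a small one can be generated with the sketch below. The column names and values are made up for illustration and are not the data used in the original example; write_sample_csv is a hypothetical helper name.

```python
import csv

# Hypothetical contents: these column names and values are invented
# purely so that there is something numeric to run the workflow on.
SAMPLE_ROWS = [
    {"a": 1, "b": 10},
    {"a": 2, "b": 20},
    {"a": 3, "b": 30},
]


def write_sample_csv(path="data.csv"):
    """Write a small, all-numeric CSV file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["a", "b"])
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
```

Calling write_sample_csv() in the folder holding the Snakefile creates data.csv. With these made-up values, the add step should turn the first row into 3,12 and the multiply step should then turn it into 6,24.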

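To run Snakemake, move to the folder containing the Snakefile, tools.py and data.csv, and run the command below. Recent versions of Snakemake require you to state how many cores it may use, which is what --cores 1 does here.

snakemake --cores 1 data_multiplied.csv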

This will first generate the data_added.csv file and then the data_multiplied.csv file. If you run the command again, it will tell you that there is nothing new to run.
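One way to picture why the second run has nothing to do is a timestamp comparison: an output that exists and is no older than its inputs does not need rebuilding. The sketch below shows that check in plain Python, under the assumption that only modification times matter; Snakemake's own decision may take more into account, and is_up_to_date is a hypothetical name.

```python
import os


def is_up_to_date(output_path, input_paths):
    """Return True if the output exists and is no older than any input."""
    if not os.path.exists(output_path):
        return False  # a missing output always has to be created
    output_time = os.path.getmtime(output_path)
    # Any input modified after the output makes the output stale.
    return all(os.path.getmtime(path) <= output_time for path in input_paths)
```

For example, is_up_to_date("data_added.csv", ["data.csv"]) would become False again as soon as data.csv is edited after data_added.csv was written.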

Snakemake also includes many useful options, such as '-n', which shows each step needed for creating the output file without running anything, '-F', which forces all steps to run again, and '-j', which lets Snakemake run multiple jobs in parallel. This barely scratches the surface of what Snakemake has to offer, but I hope this short blog post has illustrated the usefulness of this tool.

Author

Tobias Olsen
