A common way of deploying a Flask web application in a production environment is to use an Apache server with the mod_wsgi module, which allows Apache to host any application that supports Python’s Web Server Gateway Interface (WSGI), making it quick and easy to get an application up and running. In this post, we’ll go through configuring your Apache server to host multiple Python apps in a stable manner, including how to run apps in daemon mode and avoiding hanging processes due to Python C extensions not working well with Python sub-interpreters (I’m looking at you, numpy).
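Before you click through, here is a minimal sketch of the kind of WSGI entry point mod_wsgi serves for a Flask app. The package name myapp and the path are placeholders of my own, not from the post; the one real convention relied on here is that mod_wsgi looks for a module-level callable named application.

# wsgi.py -- a minimal WSGI entry point for a Flask app (illustrative sketch)
import sys

# make the application package importable; '/var/www/myapp' is a placeholder path
sys.path.insert(0, '/var/www/myapp')

# mod_wsgi serves the module-level callable named 'application'
from myapp import app as application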
Continue reading
Doing rigid receptor docking? Consider using multiple structures!
Here it is. It’s finally happening. I’m actually writing a blog post about docking. Are the end times upon us? Perhaps. If by my next post I’m not back to my usual techie self, the horsemen may well be on their way.
If you’ve ever used, read about, or listened to a lab mate complain about protein-ligand docking, you’re likely familiar with the rigid receptor assumption. In this model, the active site of the protein is treated as completely rigid, with no side chain flexibility, and only the rotatable bonds in the ligand are allowed to move. The motivation behind this assumption is simple. The computational cost of sampling the conformational space of a ligand within a protein’s active site, and doing so with sufficient rigour to sample a near-native binding mode, grows rapidly with the number of rotatable bonds in the ligand. Further increasing the degrees of freedom of the system by incorporating receptor side chain flexibility compounds this problem, making the sampling of accurate binding modes for the ligand incredibly expensive and difficult.
One compromise, if multiple structures with different active site conformations are available for the target protein, is to simply dock your ligands into each of these structures, and trust your scoring function (!!!) to pick out the best binding mode from across the different structures. This is a crude approximation to true flexible-receptor docking which won’t fully capture any induced-fit effects caused by a particular ligand, but if the structures are available, it may offer a more computationally feasible alternative to flexible docking.
A study earlier this year by Cleves and Jain illustrates this approach nicely. They dock the ligands of the DUD-E database into multiple structures for each target, in each case treating the receptor structure as completely rigid. Unsurprisingly, when the target is rigid and there is little structural variation in the active site across the structures, the choice of structure has little influence on the docking results. However, when the receptor is flexible, with clear structural variation across the active sites of the different structures, there is a strong impact on the poses generated by rigid-receptor docking. This effect translates directly into improved virtual screening performance when docking into multiple different structures, illustrating the value of considering the conformational space of the receptor even when it is treated as rigid during docking.
From Jupyter to Slides using RISE
In preparation for remote teaching this year, I’ve spent the last few weeks converting the Doctoral Training Centre’s ‘Introduction to Computer Programming’ course into a series of Jupyter notebooks so that the course can be run entirely using Google Colaboratory.
Continue reading
Editors for remote development
The ongoing COVID-19 situation has forced us all to dramatically rethink how we work, with many industries struggling to adjust their on-site procedures to ensure the safety of workers, and many more adapting to support much of their workforce in working from home. As a largely computational research group, we are incredibly fortunate in our ability to carry out most of our work remotely, and our department’s wonderful IT and administrative support staff have enabled a smooth transition to remote working.
Continue reading
Streamlining SSH for remote work
With the university now working remotely, and our group working entirely on Linux systems, I figured that now would be a good time to share some useful SSH commands to streamline remote access. This is far from an exhaustive list, but will hopefully serve as a useful starting point for anybody who finds themselves needing to work remotely on a Linux system.
Continue reading
Bringing practical bioinformatics to high school classrooms
Back in July a litter of OPIGlets went rooting for interesting science at ISMB/ECCB 2019 in Basel, Switzerland. When not presenting, working on my sunburn, or paying nine Francs for a beer, I made a point to attend talks outside my usual bubble of machine learning and drug discovery. In particular, I spent the latter half of the conference in the Education track, and am very glad I did. I love teaching, and am always excited to learn from more experienced educators and trainers. Today I’m going to talk about a fantastic presentation by Stevie Bain from the University of Edinburgh about introducing practical bioinformatics to high school biology classrooms through the 4273pi project.
Continue reading
Why you should care about type hints in Python
Duck typing is great. Knowing that as long as my object does what the function expects it to, I can pass it to the function and get my results without having to worry about exactly what else my object might do. Coming from statically-typed languages such as Java and C++, this is incredibly liberating, and makes it easy to rapidly prototype complex and expressive code without worrying about checking types everywhere. This expressiveness, however, comes with a cost: type errors are only caught at runtime, and can be hard to debug if the original author didn’t document what that one variable in that one function signature is expected to look like.
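To make that concrete, here is a small sketch of my own (not from the original post): with a hint in the signature, a static type checker such as mypy can flag the bad call below before the code ever runs.

from typing import List

def mean(values: List[float]) -> float:
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

mean([1.0, 2.0, 3.0])   # fine
mean("not a list")      # a type checker such as mypy flags this before runtime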
Continue reading
docopt for dummies
Parsing command line arguments is an annoying piece of boilerplate we all have to do. Documenting our code is either an absolutely essential part of software engineering, or a frivolous waste of research time, depending on who you ask. But what if I told you that we can combine the two? That you can handle your argument parsing simply by documenting how your code works? Well, the dream is now reality. Continue reading
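To give a flavour of what that looks like, here is a minimal toy example of my own (not taken from the post): the module docstring is the interface definition, and docopt parses the command-line arguments against it.

"""Greet someone from the command line.

Usage:
  greet.py <name> [--shout]

Options:
  --shout  Print the greeting in capitals.
"""
from docopt import docopt

if __name__ == '__main__':
    # docopt parses sys.argv against the usage pattern in the docstring
    args = docopt(__doc__)
    greeting = f"Hello, {args['<name>']}!"
    print(greeting.upper() if args['--shout'] else greeting)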
Working with Jupyter notebook on a remote server
To celebrate the recent beta release of Jupyter Lab (try it out if you haven’t already), today we’re going to look at how to run a Jupyter session (Notebook or Lab) on a remote server.
Suppose you have lots of data which lives on a remote server and you want to play with it in a Jupyter notebook. You can’t copy the data to your local machine (well, you can, but you’re sensible so you won’t), but you can run your Jupyter session on the remote server. There’s just one problem – since Jupyter notebook is browser-based and works by connecting to the Jupyter session running locally, you can’t just run Jupyter remotely and forward X11 like you would a traditional graphical IDE. Fortunately, the solution is simple: we run Jupyter remotely, create an ssh tunnel connecting a local port to the one used by the Jupyter session, and connect directly to the Jupyter session using our local browser. The best part is that once you’ve set up the Jupyter session and created an ssh tunnel, you can connect to it from any browser on any machine, without worrying about X11 forwarding.
Here’s how to do it.
1: First, connect to the remote server if you haven’t already:
ssh fergus@funkyserver
1.5: Jupyter takes browser security very seriously, so in order to access a remote session from a local browser we need to set up a password associated with the remote Jupyter session. This is stored in jupyter_notebook_config.py, which by default lives in ~/.jupyter. You can edit this manually, but the easiest option is to set the password by running Jupyter with the password argument:
jupyter notebook password
Enter password:
This password will be used to access any Jupyter session running from this installation, so pick something sensible. You can set a new password at any time on the remote server in exactly the same way.
2: Launch a Jupyter session on the remote server. You can specify the access port using the --port option. This might be useful on a shared server where others might be doing the same thing. You’ll also want to run this without launching a browser on the remote server, since a browser there is of no use to you.
jupyter lab --port=9000 --no-browser &
Here I’m using Jupyter Lab, but this works in exactly the same way for Jupyter Notebook.
3: Now for the fun part. Jupyter is running on our remote server, but what we really want is to work in our favourite browser on our local machine. To do this we just need to create an ssh tunnel between a port on our machine and the port our Jupyter session is using on the remote server. On our local machine:
ssh -N -f -L 8888:localhost:9000 fergus@funkyserver
For those not familiar with ssh tunneling, we’ve just created a secure, encrypted connection between port 8888 on our local machine and port 9000 on our remote server.
- -N tells ssh we won’t be running any remote processes using the connection. This is useful for situations like this where all we want to do is port forwarding.
- -f runs ssh in the background, so we don’t need to keep a terminal session running just for the tunnel.
- -L specifies that we’ll be forwarding a local port to a remote address and port. In this case, we’re forwarding port 8888 on our machine to port 9000 on the remote server. The name ‘localhost’ just means ‘this computer’. If you’re a Java programmer who lives for verbosity, you could equivalently pass -L localhost:8888:localhost:9000.
4: If you’ve done everything correctly, you should now be able to access your Jupyter session via port 8888 on your machine. Fire up your favourite browser and type localhost:8888 into the address bar. This should bring up a Jupyter session and prompt you for a password. Enter the password you specified for Jupyter on the remote server.
Congratulations! You now have a Jupyter session running remotely which you can connect to anytime, anywhere, from any machine.
Disclaimer: I haven’t tried this on Windows, nor do I intend to. I value my sanity.
Using Random Forests in Python with Scikit-Learn
I spend a lot of time experimenting with machine learning tools in my research; in particular I seem to spend a lot of time chasing data into random forests and watching the other side to see what comes out. In my many hours of Googling “random forest foobar” a disproportionate number of hits offer solutions implemented in R. As a young Pythonista in the present year I find this a thoroughly unacceptable state of affairs, so I decided to write a crash course in how to build random forest models in Python using the machine learning library scikit-learn (or sklearn to friends). This is far from exhaustive, and I won’t be delving into the machinery of how and why we might want to use a random forest. Rather, the hope is that this will be useful to anyone looking for a hands-on introduction to random forests (or machine learning in general) in Python.
In the future I’ll write a more in-depth post on how a few libraries turn Python into a powerful environment for data handling and machine learning. Until then, though, let’s jump into random forests!
Toy datasets
Sklearn comes with several nicely formatted real-world toy data sets which we can use to experiment with the tools at our disposal. We’ll be using the venerable iris dataset for classification and the Boston housing set for regression. Sklearn also provides tools for generating synthetic data, all of which are well-documented. Now, let’s write some Python!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets

iris = datasets.load_iris()
Classification using random forests
First we’ll look at how to solve a simple classification problem using a random forest. The iris dataset is probably the most widely-used example for this problem, and nicely illustrates the problem of classification when some classes are not linearly separable from the others.
First we’ll load the iris dataset into a pandas dataframe. Pandas is a nifty Python library which provides a data structure comparable to the dataframes found in R with database style querying. As an added bonus, the seaborn visualization library integrates nicely with pandas allowing us to generate a nice scatter matrix of our data with minimal fuss.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# sklearn provides the iris species as integer values since this is required for classification
# here we're just adding a column with the species names to the dataframe for visualisation
df['species'] = np.array([iris.target_names[i] for i in iris.target])

sns.pairplot(df, hue='species')
Neat. Notice that iris-setosa is easily identifiable by petal length and petal width, while the other two species are much more difficult to distinguish. We could do all sorts of pre-processing and exploratory analysis at this stage, but since this is such a simple dataset let’s just fire on. We’ll do a bit of pre-processing later when we come to the Boston data set.
First, let’s split the data into training and test sets. We’ll use stratified sampling by iris class to ensure both the training and test sets contain a balanced number of representatives of each of the three classes. Sklearn requires that all features and targets be numeric, so the three classes are represented as integers (0, 1, 2). Here we’re doing a simple 50/50 split because the data are so nicely behaved. Typically, however, we might use a 75/25 or even 80/20 training/test split to ensure we have enough training data. In true Python style this is a one-liner.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[iris.feature_names], iris.target, test_size=0.5, stratify=iris.target, random_state=123456)
Now let’s fit a random forest classifier to our training set. For the most part we’ll use the default settings since they’re quite robust. One exception is the out-of-bag estimate: by default an out-of-bag error estimate is not computed, so we need to tell the classifier object that we want this.
If you’re used to the R implementation, or you ever find yourself having to compare results using the two, be aware that some parameter names and default settings are different between the two. Fortunately both have excellent documentation so it’s easy to ensure you’re using the right parameters if you ever need to compare models.
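If you ever want to see exactly which defaults sklearn is using, every estimator will report its parameters. A quick way to dump them (my addition, not part of the original walkthrough) is:

from sklearn.ensemble import RandomForestClassifier

# print every parameter name and its sklearn default value for comparison against R's randomForest
print(RandomForestClassifier().get_params())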
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123456)
rf.fit(X_train, y_train)
Let’s see how well our model performs when classifying our unseen test data. For a random forest classifier, the out-of-bag score computed by sklearn is an estimate of the classification accuracy we might expect to observe on new data. We’ll compare this to the actual score obtained on our test data.
from sklearn.metrics import accuracy_score

predicted = rf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)

print(f'Out-of-bag score estimate: {rf.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')
Out-of-bag score estimate: 0.973
Mean accuracy score: 0.933
Not bad. However, a single accuracy score doesn’t really tell us anything about where the model is going wrong. A useful technique for visualising performance is the confusion matrix. This is simply a matrix whose diagonal entries count the correctly classified examples of each class, while the off-diagonal entries count the examples of one class that were misclassified as another.
from sklearn.metrics import confusion_matrix

cm = pd.DataFrame(confusion_matrix(y_test, predicted), columns=iris.target_names, index=iris.target_names)
sns.heatmap(cm, annot=True)
This lets us know that our model correctly separates the setosa examples, but exhibits a small amount of confusion when attempting to distinguish between versicolor and virginica.
Random forest regression
Now let’s look at using a random forest to solve a regression problem. The Boston housing data set consists of census housing price data in the region of Boston, Massachusetts, together with a series of values quantifying various properties of the local area such as crime rate, air pollution, and student-teacher ratio in schools. The question for us is whether we can use these data to accurately predict median house prices. One caveat of this data set is that the median house price is truncated at $50,000, which suggests that there may be considerable noise in this region of the data. You might want to remove all data with a median house price of $50,000 from the set and see if the regression improves at all.
As before we’ll load the data into a pandas dataframe. This time, however, we’re going to do some pre-processing of our data by independently transforming each feature to have zero mean and unit variance. The values of different features vary greatly in order of magnitude. If we were to analyse the raw data as-is, we run the risk of our analysis being skewed by certain features dominating the variance. This isn’t strictly necessary for a random forest, but will enable us to perform a more meaningful principal component analysis later. Performing this transformation in sklearn is super simple using the StandardScaler class of the preprocessing module. This time we’re going to use an 80/20 split of our data. You could bin the house prices to perform stratified sampling, but we won’t worry about that for now.
boston = datasets.load_boston()
features = pd.DataFrame(boston.data, columns=boston.feature_names)
targets = boston.target
As before, we’ve loaded our data into a pandas dataframe. Notice how I have to construct new dataframes from the transformed data. This is because sklearn is built around numpy arrays. While it’s possible to return a view of a dataframe as an array, transforming the contents of a dataframe requires a little more work. Of course, there’s a library for that, but I’m lazy so I didn’t use it this time.
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(features, targets, train_size=0.8, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train), index=X_train.index.values, columns=X_train.columns.values)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), index=X_test.index.values, columns=X_test.columns.values)
With the data standardised, let’s do a quick principal-component analysis to see if we could reduce the dimensionality of the problem. This is quick and easy in sklearn using the PCA class of the decomposition module.
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_train)
cpts = pd.DataFrame(pca.transform(X_train))
x_axis = np.arange(1, pca.n_components_ + 1)

pca_scaled = PCA()
pca_scaled.fit(X_train_scaled)
# note: use the PCA fitted on the scaled data to transform the scaled data
cpts_scaled = pd.DataFrame(pca_scaled.transform(X_train_scaled))

# matplotlib boilerplate goes here
Notice how without data standardisation the variance is completely dominated by the first principal component. With standardisation, however, we see that we must in fact consider multiple features in order to explain a significant proportion of the variance. You might want to experiment with building regression models using the principal components (or indeed just combinations of the raw features) to see how well you can do with less information. For now, though, we’re going to use all of the (scaled) features as the regressors for our model. As with the classification problem, fitting the random forest is simple using the RandomForestRegressor class.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
# fit to the standardised features, as described above
rf.fit(X_train_scaled, y_train)
Now let’s see how we do on our test set. As before we’ll compare the out-of-bag estimate (this time it’s an R-squared score) to the R-squared score for our predictions. We’ll also compute Spearman rank and Pearson correlation coefficients for our predictions to get a feel for how we’re doing.
from sklearn.metrics import r2_score
from scipy.stats import spearmanr, pearsonr

predicted_train = rf.predict(X_train_scaled)
predicted_test = rf.predict(X_test_scaled)

test_score = r2_score(y_test, predicted_test)
spearman = spearmanr(y_test, predicted_test)
pearson = pearsonr(y_test, predicted_test)

print(f'Out-of-bag R-2 score estimate: {rf.oob_score_:>5.3}')
print(f'Test data R-2 score: {test_score:>5.3}')
print(f'Test data Spearman correlation: {spearman[0]:.3}')
print(f'Test data Pearson correlation: {pearson[0]:.3}')
Out-of-bag R-2 score estimate: 0.841
Test data R-2 score: 0.886
Test data Spearman correlation: 0.904
Test data Pearson correlation: 0.942
Not too bad, though there are a few outliers that would be worth looking into. Your challenge, should you choose to accept it, is to see if removing the $50,000 data improves the regression.
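If you fancy taking that challenge on, here is a sketch of the filtering step (my addition, not from the original post; the Boston target is recorded in thousands of dollars, so the cap shows up as a value of 50.0):

# keep only the rows whose median price lies below the $50,000 cap
mask = targets < 50.0
features_trimmed = features[mask]
targets_trimmed = targets[mask]
# then repeat the train/test split, scaling, and model fitting steps above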
Wrapping up
Congratulations on making it this far. Now you know how to pre-process your data and build random forest models all from the comfort of your IPython session. I plan on writing more in the future about how to use Python for machine learning, and in particular how to make use of some of the powerful tools available in sklearn (a pipeline for data preparation, model fitting, prediction, in one line of Python? Yes please!), and how to make sklearn and pandas play nicely with minimal hassle. If you’re lucky, and if I can bring myself to process the data nicely, I might include some fun examples from less well-behaved real-world data sets.
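As a teaser for that pipeline idea, here is a sketch of my own (not a preview of the future post) that bundles the scaling and the regression model from above into a single estimator:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

# scaling and regression chained into one estimator; the pipeline scales inside fit and predict
model = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=500, random_state=0))
model.fit(X_train, y_train)
predictions = model.predict(X_test)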
Until then, though, happy Pythoning!