Suppose we need to do some interactive analysis in a Jupyter notebook, but our local machine lacks the power. We have access to a slurm cluster, but we can’t SSH from the head node to the worker node; we can only SSH from the worker node to the head node. Can we still interact with a Jupyter notebook running on the worker node? As it happens, the answer is “yes” – we just need to do some reverse SSH tunnelling.
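To give a taste of the fix, here is a minimal sketch; the hostname, username, and ports are hypothetical. From the worker node, we open a reverse tunnel that exposes our local Jupyter port on the head node, and then browse from a machine that can reach the head node:

# On the worker node: make our local port 8888 (where Jupyter runs)
# reachable as port 8888 on the head node (-R creates a reverse tunnel)
ssh -N -f -R 8888:localhost:8888 fergus@headnode

With the tunnel in place, pointing a browser at localhost:8888 on the head node (or chaining a second tunnel onward to your laptop) reaches the Jupyter session on the worker.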
Visualising macromolecules and grids in Jupyter Notebooks with nglview
If you do most of your work in Jupyter notebooks, it can be convenient to have a quick visualisation tool to view the results of your latest computation from within the notebook, without having to flick between the notebook and your favourite molecule viewer.
I have recently started using NGLview, an IPython/Jupyter widget, to do this. It is based on the NGL viewer, an embeddable webapp for macromolecular visualisation. The nglview module documentation can be found here, and in addition to handling the usual formats for molecular structure (.pdb, .mol2, .sdf, .pqr, etc.) and map density (.ccp4 and more), it supports visualising trajectories and even making movies.
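As a minimal sketch of typical usage (the PDB ID and file path here are just illustrative), viewing a structure takes only a couple of lines in a notebook cell:

import nglview as nv

# Fetch a structure straight from the PDB and display it inline
view = nv.show_pdbid("1hvr")
view

# Or load a local structure file (path is hypothetical)
# view = nv.show_structure_file("my_protein.pdb")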
Storing variables in Jupyter Notebooks using %store magic
We’ve all been there. You’ve just run an expensive computation in your Jupyter Notebook and are about to draw those conclusions which will prove that your theories were right all along (until you find the sixteen bugs in your code which render them invalid, but that’s an issue for a different time). Then at the critical moment, your flatmate begins streaming their Lord Of The Rings marathon in 4k and your already temperamental Wi-Fi severs your connection to the department servers in protest, crashing your Jupyter Notebook, leaving your hopes and dreams in tatters.
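The %store magic guards against exactly this disaster: it persists variables to disk, so they survive a kernel crash. A minimal illustration (the variable and function names are hypothetical):

result = expensive_computation()   # hypothetical long-running call
%store result                      # persist the variable to IPython's store

# Later, in a fresh kernel after the crash:
%store -r result                   # restore it without recomputing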
Mol2vec: Finding Chemical Meaning in 300 Dimensions
A recent publication by one of my former InhibOx colleagues, Simone Fulle, and her co-workers, Sabrina Jaeger and Samo Turk, shows how we can embed molecular substructures and chemical compounds into a similarly high-dimensional, continuous vectorial representation, which they dubbed “mol2vec”. They also released a Python implementation, available on Samo Turk’s GitHub repository.
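The core idea, loosely: treat Morgan-fingerprint substructure identifiers as “words” and each molecule as a “sentence”, then train a word2vec model over a corpus of molecules. Below is a rough conceptual sketch, not the authors' exact pipeline; the SMILES strings and training parameters are purely illustrative:

from rdkit import Chem
from rdkit.Chem import AllChem
from gensim.models import Word2Vec

def mol_to_sentence(smiles, radius=1):
    # Represent a molecule as a "sentence" of Morgan substructure identifiers
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprint(mol, radius)
    return [str(identifier) for identifier in fp.GetNonzeroElements()]

sentences = [mol_to_sentence(s) for s in ["CCO", "c1ccccc1O", "CC(=O)O"]]
# vector_size=300 echoes the 300 dimensions of the title (gensim >= 4 API)
model = Word2Vec(sentences, vector_size=300, window=10, min_count=1)
vector = model.wv[sentences[0][0]]   # 300-dimensional embedding of one substructure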
ISMB 2018: Collaborative Structural Biology using Machine Learning and Jupyter Notebook
This post is a summary of the talk, Collaborative Structural Biology using Machine Learning and Jupyter Notebook, given by Fergus Imrie and Fergus Boyles at ISMB 2018. Materials for the experiments can be found here and here.
Four other members of the Oxford Protein Informatics Group (a.k.a. OPIGlets) and I recently had the pleasure of attending the Intelligent Systems for Molecular Biology (ISMB) conference in Chicago. Organised by the International Society for Computational Biology (ISCB), ISMB is the largest computational biology conference in the world, with several thousand attendees.
Spread over four action-packed days in July (not including workshops/tutorial sessions), it was an eye-opening experience, showcasing the depth and breadth of computational biology research; particularly striking was the range of problems tackled, techniques applied, and data sources used.
I was fortunate enough to have the opportunity to present alongside my colleague, Fergus Boyles, as part of the 3DSIG Community of Special Interest (COSI). We led the first hands-on practical demonstration at 3DSIG, entitled “Collaborative Structural Biology using Machine Learning and Jupyter Notebook”. While this was a new format for the conference, and our presentation something of an experiment, I understand the organising committee is keen to repeat it next year.
In what follows, I’ll briefly outline the key themes and outcomes of our presentation. Full materials to reproduce all of the results presented can be found here and here.
Reproducibility crisis?
In a survey of 1,500 scientists by Nature in 2016 (link), more than 70% of participants had tried and failed to reproduce another scientist’s experiments, while 90% said there was a reproducibility crisis to some extent. Most striking, perhaps, was the revelation that “more than half have failed to reproduce their own experiments”!
While the focus of the survey was, admittedly, on traditional, lab-based, experimental research, this is certainly also an issue in computational approaches, with the machine learning community under the heaviest scrutiny.
This is clearly unsustainable, and many efforts are being made to address it across the scientific world. As one example, Nature has introduced a code submission checklist that requires authors to submit custom algorithms or software central to the paper for peer review and editorial assessment. While this directly affects only a small portion of research, it is a big step in the right direction, and I think we’re only going to see more of this in the future.
Software to the rescue?
With the rise of cloud computing, the open-source community, and much more, there is a plethora of software available that can be used to improve the accessibility of methods and the reproducibility of computational experiments. Below, I touch on a couple of general areas that are increasingly used in computational pipelines and setups.
- Cloud computing (such as Amazon Web Services, Google Cloud, and Microsoft Azure) provides widely accessible, standardised compute environments, and allows the use of anything from a single core to near-HPC-level resources for a short period of time at relatively low cost.
- Container solutions (such as Docker and Kubernetes) allow developers to package an application, together with all required libraries and dependencies, into a single image that the end user can run with no further dependencies.
Our approach
We didn’t use any of the above tools for the purposes of our talk, but instead built our pipeline on three other widely-used solutions: Conda, Project Jupyter, and Git/GitHub. For those unfamiliar, here is a brief overview of each.
- Conda is an open-source package and environment management system. It works by creating distinct virtual environments and installing a standalone interpreter or compiler within each one. You can then install additional packages into that virtual environment, completely isolated both from your system default packages and from other virtual environments (see the command-line sketch after this list).
- For those of you familiar with the IPython notebook, Jupyter is an extension of this format to multiple languages. Jupyter provides an interactive, browser-based coding environment in the form of a notebook, which can be thought of as a lightweight IDE. The power of Jupyter notebooks comes from the combination of (1) the ability to intersperse code with markdown, which is much more human-readable and friendly on the eye than traditional comments; (2) the cell-based format, in which small pieces of code are contained in cells that can be run, and re-run, individually without re-running the remainder of your code; and (3) the ability to display figures, tables, and other output inline, rendered in HTML.
- Git is an open-source version control system. Version control is an essential bedrock of good programming that we don’t have time to go into in more detail, but long-story short, Git takes any headache out of version control.
- GitHub is a code hosting platform built for collaboration with Git at its core. Beyond being a simple code repository, GitHub enables collaboration and development through two key features. “Forking” allows you to clone other projects, and either develop them yourself or keep a record of a fixed version for integration within another project. “Pull requests” make large-scale community collaboration possible, with users providing code for specific modifications to the original project, which the owners/admins of the original project can choose to merge or reject.
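To give a flavour of how these pieces fit together in practice, here is a hedged command-line sketch; the environment name, package versions, and repository URL are hypothetical:

# Create and activate an isolated environment with its own interpreter
conda create -n 3dsig-demo python=3.6 jupyter scikit-learn
conda activate 3dsig-demo    # on older conda versions: source activate 3dsig-demo

# Fetch the materials under version control and launch the notebook server
git clone https://github.com/<user>/<repo>.git
cd <repo>
jupyter notebook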
Experiments
As a toy problem to showcase this approach to building a reproducible pipeline, we address protein classification according to the SCOP classification scheme. While the dataset we have shared contains examples of protein pairs that are in the same fold, superfamily, and family (as well as none of these), we focussed on the most straightforward task: determining whether or not a pair of proteins belongs to the same family.
Our dataset is based on the Astral data set (06.02.2016 build), and consists of 8 pairwise features computed from the sequences of the two proteins. We won’t go into the details of the exact features here.
Using a simple random forest on these 8 pairwise features between the target and template protein, we achieved an accuracy of 88.0% and an area under the receiver operating characteristic (ROC) curve of 0.95. A confusion matrix and ROC curve summarising our results can be found below.
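For a flavour of the modelling step, here is a minimal scikit-learn sketch; the synthetic data below merely stands in for our 8 pairwise features and same-family labels:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: protein pairs with 8 pairwise features and a binary label
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit a simple random forest and report accuracy and area under the ROC curve
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))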
Instructions to reproduce these results, together with all materials needed, can be found here and here.
Conclusions
Reproducibility in science is facing a challenging time. All stakeholders, from researchers to funders and publishers, are placing more emphasis on work being reproducible, and are taking measures to ensure this. In computational research, particularly with stochastic algorithms such as those prevalent throughout machine learning, the problem is no less serious, yet on the face of it should be readily solvable.
In our demonstration, we have illustrated one approach to tackling this in a simple, efficient way. We only set out to tackle one possible problem or question, and only used a subset of the overall dataset. Please feel free to explore the dataset and pose your own questions. We’d love to hear from you if you do!
Acknowledgements
I’d like to thank all of OPIG for providing feedback on an early version of the talk. Crucially, I’d like to thank Dr Saulo de Oliveira, who provided us with the dataset used in our exploratory analysis. Finally, I’d like to thank my co-presenter Fergus Boyles, without whom I couldn’t have done this.
Working with Jupyter notebook on a remote server
To celebrate the recent beta release of Jupyter Lab (try it out if you haven’t already), today we’re going to look at how to run a Jupyter session (Notebook or Lab) on a remote server.
Suppose you have lots of data which lives on a remote server and you want to play with it in a Jupyter notebook. You can’t copy the data to your local machine (well, you can, but you’re sensible so you won’t), but you can run your Jupyter session on the remote server. There’s just one problem – since Jupyter notebook is browser-based and works by connecting to a Jupyter session running locally, you can’t just run Jupyter remotely and forward X11 like you would a traditional graphical IDE. Fortunately, the solution is simple: we run Jupyter remotely, create an ssh tunnel connecting a local port to the one used by the Jupyter session, and connect directly to the Jupyter session using our local browser. The best part is that you can set up the Jupyter session once, then connect to it from any browser on any machine simply by creating an ssh tunnel, without worrying about X11 forwarding.
Here’s how to do it.
1. First, connect to the remote server if you haven’t already
ssh fergus@funkyserver
1.5. Jupyter takes browser security very seriously, so in order to access a remote session from a local browser we need to set up a password associated with the remote Jupyter session. This is stored in jupyter_notebook_config.py, which by default lives in ~/.jupyter. You can edit this file manually, but the easiest option is to set the password by running Jupyter with the password argument:
jupyter notebook password
>>> Enter password:
This password will be used to access any Jupyter session running from this installation, so pick something sensible. You can set a new password at any time on the remote server in exactly the same way.
2: Launch a Jupyter session on the remote server. You can specify the access port using the --port option. This might be useful on a shared server where others might be doing the same thing. You’ll also want to run this without launching a browser on the remote server, since that is of no use to you.
jupyter lab --port=9000 --no-browser &
Here I’m using Jupyter Lab, but this works in exactly the same way for Jupyter Notebook.
3: Now for the fun part. Jupyter is running on our remote server, but what we really want is to work in our favourite browser on our local machine. To do this we just need to create an ssh tunnel between a port on our machine and the port our Jupyter session is using on the remote server. On our local machine:
ssh -N -f -L 8888:localhost:9000 fergus@funkyserver
For those not familiar with ssh tunneling, we’ve just created a secure, encrypted connection between port 8888 on our local machine and port 9000 on our remote server.
- -N tells ssh we won’t be running any remote processes using the connection. This is useful for situations like this where all we want to do is port forwarding.
- -f runs ssh in the background, so we don’t need to keep a terminal session running just for the tunnel.
- -L specifies that we’ll be forwarding a local port to a remote address and port. In this case, we’re forwarding port 8888 on our machine to port 9000 on the remote server. The name ‘localhost’ just means ‘this computer’. If you’re a Java programmer who lives for verbosity, you could equivalently pass -L localhost:8888:localhost:9000.
4: If you’ve done everything correctly, you should now be able to access your Jupyter session via port 8888 on your machine. Fire up your favourite browser and type localhost:8888 into the address bar. This should bring up a Jupyter session and prompt you for a password. Enter the password you specified for Jupyter on the remote server.
Congratulations! You now have a Jupyter session running remotely which you can connect to anytime, anywhere, from any machine.
Disclaimer: I haven’t tried this on Windows, nor do I intend to. I value my sanity.
Interesting Jupyter and IPython Notebooks
Here’s a treasure trove of interesting Jupyter and IPython notebooks, with lots of diverse examples relevant to OPIG, including an RDKit notebook, but also:
- Entire books or other large collections of notebooks on a topic, covering: Introductory Tutorials; Programming and Computer Science; Statistics, Machine Learning and Data Science; Mathematics, Physics, Chemistry, Biology; Linguistics and Text Mining; Signal Processing; Scientific computing and data analysis with the SciPy Stack; General topics in scientific computing; Machine Learning, Statistics and Probability; Physics, Chemistry and Biology; Data visualization and plotting; Mathematics; Signal, Sound and Image Processing; Natural Language Processing; and Pandas for data analysis
- General Python Programming
- Notebooks in languages other than Python (Julia, Haskell, Ruby, Perl, F#, C#)
- Miscellaneous topics about doing various things with the Notebook itself
- Reproducible academic publications
- and lots more!
Using RDKit to load ligand SDFs into Pandas DataFrames
If you have downloaded lots of ligand SDF files from the PDB, a good way of viewing and comparing all their properties is to load them into a Pandas DataFrame.
RDKit has a very handy function just for this, found in the PandasTools module.
I show an example below within a Jupyter notebook, in which I load in the SDF file, view the table of molecules, and apply other RDKit functions to the molecules.
First import the PandasTools module:
from rdkit.Chem import PandasTools
Read in the SDF file:
SDFFile = "./Ligands_noHydrogens_noMissing_59_Instances.sdf"
BRDLigs = PandasTools.LoadSDF(SDFFile)
You can see the whole table by calling the dataframe:
BRDLigs
The ligand properties in the SDF file are stored as columns. You can view what these properties are; in my case I have loaded 59 ligands, each having up to 26 properties:
BRDLigs.info()
It is also very easy to perform other RDKit functions on the dataframe. For instance, I noticed there is no heavy atom column, so I added my own called ‘NumHeavyAtoms’:
BRDLigs['NumHeavyAtoms']=BRDLigs.apply(lambda x: x['ROMol'].GetNumHeavyAtoms(), axis=1)
Here is the column added to the table, alongside columns containing the molecules’ SMILES and RDKit molecule:
BRDLigs[['NumHeavyAtoms','SMILES','ROMol']]
Viewing 3D molecules interactively in Jupyter IPython notebooks
Greg Landrum, curator of the invaluable open source cheminformatics API, RDKit, recently blogged about viewing molecules in a 3D window within a Jupyter-hosted IPython notebook (as long as your browser supports WebGL, that is).
The trick is to use py3Dmol. It’s easy to install:
pip install py3Dmol
This is built on 3Dmol.js, an object-oriented, WebGL-based JavaScript library for online molecular visualization (Rego & Koes, 2015); here's a nice summary of the capabilities of 3Dmol.js. Its features include:
- support for pdb, sdf, mol2, xyz, and cube formats
- parallelized molecular surface computation
- sphere, stick, line, cross, cartoon, and surface styles
- atom property based selection and styling
- labels
- clickable interactivity with molecular data
- geometric shapes including spheres and arrows
I tried a simple example and it worked beautifully:
import py3Dmol
view = py3Dmol.view(query='pdb:1hvr')
view.setStyle({'cartoon':{'color':'spectrum'}})
view
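Building on that example, selections can be styled independently and surfaces added. Here is a small sketch; the chain and residue range are chosen purely for illustration:

import py3Dmol

view = py3Dmol.view(query='pdb:1hvr')
view.setStyle({'cartoon': {'color': 'spectrum'}})
# Style a selection independently: render residues 25-30 of chain A as sticks
view.setStyle({'chain': 'A', 'resi': '25-30'}, {'stick': {}})
# Add a translucent van der Waals surface over the whole structure
view.addSurface(py3Dmol.VDW, {'opacity': 0.7})
view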
The 3Dmol.js website summarizes how to view molecules, along with how to choose representations, how to embed it, and even how to develop with it.
References
Nicholas Rego & David Koes (2015). “3Dmol.js: molecular visualization with WebGL”.
Bioinformatics, 31 (8): 1322-1324. doi:10.1093/bioinformatics/btu829