Catching up on the literature is one of the highlights of my job as a scientist. True, sometimes you can be overwhelmed by the amount of information you don’t have, or wonder if we really need another paper showing that protein-ligand scoring functions don’t work. And yet, sometimes you find excellent research that you can’t help but regard with a mixture of awe and envy. At a recent group meeting, I discussed one such paper from the research group of Aviv Regev at MIT, where the authors combine computation and experiment in an impressive way to address some basic questions in gene regulation and evolution. Here is why I think it’s excellent.
The authors are interested in promoters, short sequences of DNA that precede genes and regulate how frequently those genes are expressed. In short, promoters are binding sites for transcription factors, a family of proteins that in turn recruit RNA polymerase to transcribe DNA to RNA. The rate of gene transcription then determines, albeit not directly, the rate at which a protein is produced. If this sounds simple, that is because this is where our understanding stops. The human genome encodes some 1.6k different transcription factors (~6-7% of protein-coding genes) and their inner workings are still not well understood.
Broadly, promoters are the simplest of a group of DNA patterns known as cis-regulatory elements, which dictate the expression of genes closely downstream. However, understanding how changes in promoter sequences, whether due to individual mutations or to evolutionary drift more generally, affect gene expression has been an elusive research question. Like many other problems in modern bioinformatics, the main challenge lies in the quality of the data: most available datasets are highly biased towards archetypal examples, often local mutational explorations around natural promoters.
Here is where the work of Vaishnav, de Boer and collaborators comes in. The authors leverage a specialised high-throughput assay to study the effect of over 50,000 randomly generated promoters on the expression of a target protein. In short (see the diagram below), the authors generate a large number of mutants of S. cerevisiae, each containing a random promoter sequence (consisting of 80 base pairs) preceding an inserted gene coding for yellow fluorescent protein (YFP). After transformation, cells are sorted by fluorescence using flow cytometry and then sequenced, allowing the authors to generate a large dataset in which every promoter is mapped to a quantitative measure of expression.
There is of course variability in the measured intensity owing to both instrumental sensitivity and differences between individual yeast cells, but the average over the bins into which cells bearing a given promoter are sorted serves as a reasonable proxy for the expression of the YFP gene. Armed with this high-quality, unbiased dataset, the authors set out to train a neural network to predict the expression driven by a random promoter. They use a relatively simple convolutional neural network which achieves impressive results, obtaining a Pearson’s correlation coefficient of 0.96 on a test dataset not used in training.
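To make the bin-averaging idea concrete, here is a minimal sketch in plain Python. It assumes that, for each promoter, we have the number of sequencing reads recovered from each fluorescence bin. The function names, the number of bins and the read counts are all made up for illustration and are not taken from the paper. The second function is the textbook Pearson correlation, the metric quoted above for comparing predicted and measured expression.

```python
from math import sqrt


def mean_bin_expression(bin_counts):
    """Read-count-weighted average of bin indices for one promoter.

    bin_counts[i] is the number of reads of this promoter recovered from
    sorting bin i, with bins ordered by increasing fluorescence.
    (Hypothetical data layout, assumed for illustration.)
    """
    total = sum(bin_counts)
    if total == 0:
        raise ValueError("promoter was not observed in any bin")
    return sum(i * n for i, n in enumerate(bin_counts)) / total


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up read counts over four bins for three hypothetical promoters.
counts = {
    "promoter_a": [0, 3, 1, 0],
    "promoter_b": [5, 1, 0, 0],
    "promoter_c": [0, 0, 2, 6],
}
measured = [mean_bin_expression(c) for c in counts.values()]
```

With real data, `measured` would be compared against the network’s predictions for the same promoters using `pearson_r`.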
The authors then demonstrate that this surrogate fitness function can be used to engineer new promoter sequences. Using a simple off-the-shelf evolutionary algorithm, the authors find hundreds of promoter sequences with more extreme expression behaviour than any in the original dataset. Finally, the authors apply the model to the evolvability of these promoter sequences and to the promoter fitness landscape. With this one simple model, they manage to examine several interesting hypotheses across a variety of fields related to evolution and gene regulation.
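For readers who have not met evolutionary algorithms before, the sketch below shows the general idea of searching sequence space with a scoring function as the fitness. It is not the authors’ algorithm or model. The scoring function here is a deliberately trivial, hypothetical stand-in (the fraction of A/T bases) for a trained network, and the population size, number of parents and number of generations are arbitrary choices.

```python
import random

BASES = "ACGT"
PROMOTER_LENGTH = 80  # length of the random promoters described in the text


def toy_expression_model(seq):
    """Hypothetical stand-in for a trained network: fraction of A/T bases."""
    return sum(base in "AT" for base in seq) / len(seq)


def mutate(seq, rng):
    """Return a copy of seq with one randomly chosen base substituted."""
    pos = rng.randrange(len(seq))
    new_base = rng.choice([b for b in BASES if b != seq[pos]])
    return seq[:pos] + new_base + seq[pos + 1:]


def evolve(model, generations=50, population_size=100, n_parents=20, seed=0):
    """Evolve random sequences towards higher model scores; return the best."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(BASES) for _ in range(PROMOTER_LENGTH))
        for _ in range(population_size)
    ]
    for _ in range(generations):
        # Keep the highest-scoring sequences unchanged (elitism) ...
        parents = sorted(population, key=model, reverse=True)[:n_parents]
        # ... and refill the population with single-base mutants of them.
        children = [
            mutate(rng.choice(parents), rng)
            for _ in range(population_size - n_parents)
        ]
        population = parents + children
    return max(population, key=model)


best = evolve(toy_expression_model)
```

Because the best sequences are carried over unchanged in each generation, the top score can never decrease. Swapping `toy_expression_model` for a real predictor, and negating it to search for low expression, would be the natural next step.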
And this is the reason why I became interested in the paper in the first place: it is a prime example of how cooperation between experimental and computational science can pay enormous dividends. Clever experimental designs can yield high-quality, unbiased “machine learning grade” data which can be combined with computational tools to reach biological insights that would be unattainable for either experiment or computation alone. While the conclusions of this article are at present limited to a single promoter region affecting a specific reporter protein, this work is likely to become a milestone in the study of gene regulation. And it poses a question to all of us working in computational biology: how can we best achieve synergy between computation and experiment, so that data-intensive projects of this kind become routine?