Author Archives: Oliver Crook

ggPlotting tips with OPIG data

Ever wondered whether opiglets keep their ketchup in the fridge or cupboard? Perhaps you’ve wanted to know how to create a nice figure that displays lots of information simultaneously. Publication-quality figures are easy in R with the ggplot2 package. We may also learn some good visualisation practice along the way.

Pitfalls of using Pearson’s correlation for comparing model performance

Pearson’s R (correlation coefficient) is a measure of the linear correlation between two variables, giving a value between -1 and 1, where 1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation. It is a useful statistic for understanding the relationship between two variables, but it is also often used to compare the performance of two or more models. For example, imagine we have a set of experimental values and several models’ predictions of them. Obviously, we would prefer the model with the highest Pearson’s R … or perhaps not?
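One pitfall of this kind, which may or may not be the one the full post goes on to discuss, is that Pearson’s R is unchanged when predictions are shifted and rescaled. The sketch below uses made-up numbers (the values, `model_a` and `model_b` are all hypothetical) to show a badly biased model scoring a higher R than a model whose predictions are much closer to the experimental values.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rmse(x, y):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Hypothetical data, invented purely for illustration.
experimental = [1.0, 2.0, 3.0, 4.0, 5.0]
model_a = [1.2, 1.9, 3.3, 3.8, 5.1]       # close to the truth, slightly noisy
model_b = [12.0, 14.0, 16.0, 18.0, 20.0]  # exactly 2 * truth + 10: linear but far off

r_a, r_b = pearson_r(experimental, model_a), pearson_r(experimental, model_b)
rmse_a, rmse_b = rmse(experimental, model_a), rmse(experimental, model_b)

print(f"model A: r = {r_a:.3f}, RMSE = {rmse_a:.3f}")
print(f"model B: r = {r_b:.3f}, RMSE = {rmse_b:.3f}")
```

Ranking by R alone would pick model B here, even though its predictions are much further from the experimental values. Reporting an error measure such as RMSE alongside R is one way to catch this.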

Am I better? Performance metrics unravelled

What’s the deal with all these numbers? Accuracy, Precision, Recall, Sensitivity, AUC and ROCs.

The basic stuff:

Given a method that produces a numerical outcome, either categorical (classification) or continuous (regression), we want to know how well our method did. Let’s start simple:

True positives (TP): You said something was a cow and it was in fact a cow – duh.

False positives (FP): You said it was a cow and it wasn’t – sad.

True negatives (TN): You said it was not a cow and it was not – good job.

False negatives (FN): You said it was not a cow but it was a cow – do better.

I can optimise these metrics artificially. Just call everything a cow and I have a 100% true positive rate. We are usually interested in a trade-off, something like the relative value of metrics. This gives us:
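The full list of metrics is in the complete post. As a small illustration of the "call everything a cow" point, here is a sketch using a hypothetical field of ten animals, three of which are cows. The data and the helper `confusion_counts` are invented for this example. The formulas are the standard definitions of recall (true positive rate), precision and accuracy.

```python
def confusion_counts(is_cow, said_cow):
    """Count TP, FP, TN, FN from true labels and predictions (both lists of bools)."""
    pairs = list(zip(is_cow, said_cow))
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fp = sum(1 for truth, pred in pairs if not truth and pred)
    tn = sum(1 for truth, pred in pairs if not truth and not pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    return tp, fp, tn, fn

# Hypothetical data: 3 cows and 7 non-cows.
is_cow = [True] * 3 + [False] * 7
# The lazy classifier: call everything a cow.
said_cow = [True] * len(is_cow)

tp, fp, tn, fn = confusion_counts(is_cow, said_cow)

recall = tp / (tp + fn)             # true positive rate
precision = tp / (tp + fp)          # fraction of "cow" calls that were right
accuracy = (tp + tn) / len(is_cow)  # fraction of all calls that were right

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"recall={recall:.2f} precision={precision:.2f} accuracy={accuracy:.2f}")
```

The true positive rate is perfect, but precision and accuracy collapse to the share of cows in the field. That is the trade-off between metrics in miniature.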

Tackling horizontal and vertical limitations

A blog post about reviewing papers and preparing papers for publication.

We start with the following premise: all papers have limitations. There is not a single paper without limitations. A method may not be generally applicable, a result may not be completely justified by the data or a theory may make restrictive assumptions. To cover all limitations would make a paper infinitely long, so we must stop somewhere.

A lot of limitations fall into the following scenario: the results or methods are presented, but they could have been extended in some way. Suppose we obtain results on a particular cell type using an immortalized cell line. Would the results still hold if we performed the experiments on primary or patient-derived cells? If the signal from the original cells was sufficiently robust then we would hope so. However, we cannot be one hundred percent sure. A similar example is a method that can be applied to a certain type of data. It may be possible to extend the method to other data types, but this may require some new methodology. I call this flavor of limitations vertical limitations. They are vertical in the sense that they build upon an already developed result in the manuscript. Certain journals will require that you tackle vertical limitations by adapting the original idea or method, either to demonstrate broad appeal or to show that the idea could permeate multiple fields. Most of the time, however, the premise of an approach is not to keep extending it. It works. Leave it alone. Do not ask for more. An idea done well does not need more.

Why can a man not lift himself by pulling up on his bootstrap hypothesis test?

This blogpost highlights a typical mistake when performing the bootstrap hypothesis test. Bootstrapping is a method of resampling data to estimate measures of variability, such as confidence intervals or variance. 

In the simplest form of the bootstrap, assume you have a set of values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You want to estimate the mean and variability of the mean using these data. The recipe is as follows:
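The post’s own recipe is not reproduced in this excerpt. As a rough illustration only, the standard nonparametric bootstrap for the mean of these ten values might be sketched as follows. The number of resamples, the fixed seed and the percentile interval are assumptions of this sketch, not choices taken from the post.

```python
import random
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

rng = random.Random(0)   # fixed seed so the sketch is reproducible (assumption)
n_resamples = 10_000     # number of bootstrap resamples (assumption)

# 1. Resample the data with replacement, keeping the original sample size.
# 2. Compute the statistic of interest (here the mean) on each resample.
boot_means = []
for _ in range(n_resamples):
    resample = rng.choices(data, k=len(data))
    boot_means.append(sum(resample) / len(resample))

# 3. Summarise the spread of the bootstrap statistics.
boot_se = statistics.stdev(boot_means)            # bootstrap standard error
ordered = sorted(boot_means)
ci_low = ordered[int(0.025 * n_resamples)]        # 2.5th percentile
ci_high = ordered[int(0.975 * n_resamples) - 1]   # 97.5th percentile

print(f"sample mean      : {statistics.mean(data)}")
print(f"bootstrap SE     : {boot_se:.3f}")
print(f"95% percentile CI: ({ci_low:.2f}, {ci_high:.2f})")
```

This estimates the variability of the mean. It is not yet a hypothesis test, which needs the resampling to be carried out under the null hypothesis.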

An idea by any other name would smell as sweet.

A blog post about ideas.

Ideation is the formation of an idea, but how do we ideate? 

The root of the word is “to see”, so when we have an idea we see something. In that moment of realization, we hold on to something quite abstract. Some describe it as a click or pattern or insight. This “seeing” is with the mind, however, not the eyes. Idea also implies sentiment or direction – a path one might say. It’s this last point that resonates with me most. When we are lost in the sea of thoughts, most of the time the consequences are immediate (no consciousness required). However, sometimes we must pause and ideate. Our path, the next step, is unclear.

How do I do regression when my predictors have multicollinearity?

A quick summary of the key idea of principal components regression (PCR), its advantages and extensions.

Sometimes we find ourselves in a dire situation. We have measured some response y and a set of predictors W. Unfortunately, W is a wide but short matrix, say 10×100 or worse 10×100000. We’ve made only 10 observations. Standard regression is simply not going to work, because WᵀW is singular. Some would say p is bigger than n.

So what can we do? Many of us would jump to LASSO or ridge regression. However, there is another way that is often overlooked.
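The key idea named in the summary above, principal components regression, is to regress y on the first few principal components of W instead of on W itself. The sketch below is a minimal pure-Python illustration on simulated data, not the post’s implementation. The function names, the simulated one-factor data and the use of power iteration on the small n×n Gram matrix are all assumptions made for this example.

```python
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def top_eigenpairs(G, k, iters=500, seed=0):
    """Leading k eigenpairs of a symmetric PSD matrix, found by power
    iteration with deflation (adequate for a tiny n x n matrix)."""
    rng = random.Random(seed)
    n = len(G)
    G = [row[:] for row in G]              # work on a copy
    pairs = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(n)]
        for _ in range(iters):
            w = [dot(row, v) for row in G]
            norm = dot(w, w) ** 0.5
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in G])
        pairs.append((lam, v))
        for i in range(n):                 # deflate: remove the component just found
            for j in range(n):
                G[i][j] -= lam * v[i] * v[j]
    return pairs

def pcr_fit(W, y, k):
    """Regress y on the first k principal components of W (k well below n).
    Returns (intercept, beta) expressed in the original predictor space."""
    n, p = len(W), len(W[0])
    col_means = [sum(row[j] for row in W) / n for j in range(p)]
    Wc = [[row[j] - col_means[j] for j in range(p)] for row in W]
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    # n x n Gram matrix; its eigenvectors are the left singular vectors of Wc.
    G = [[dot(a, b) for b in Wc] for a in Wc]
    beta = [0.0] * p
    for lam, u in top_eigenpairs(G, k):
        weight = dot(u, yc) / lam
        direction = [sum(Wc[i][j] * u[i] for i in range(n)) for j in range(p)]
        beta = [b + weight * d for b, d in zip(beta, direction)]
    intercept = y_mean - dot(col_means, beta)
    return intercept, beta

# Simulated data (hypothetical): 10 observations, 100 predictors driven by
# a single latent factor, so the predictors are strongly collinear.
rng = random.Random(1)
n, p = 10, 100
latent = [rng.gauss(0, 1) for _ in range(n)]
loadings = [rng.gauss(0, 1) for _ in range(p)]
W = [[t * l + 0.1 * rng.gauss(0, 1) for l in loadings] for t in latent]
y = [3.0 * t + 0.1 * rng.gauss(0, 1) for t in latent]

intercept, beta = pcr_fit(W, y, k=1)
fitted = [intercept + dot(row, beta) for row in W]
rss = sum((a - b) ** 2 for a, b in zip(y, fitted))
tss = sum((a - sum(y) / n) ** 2 for a in y)
print(f"R^2 with one component: {1 - rss / tss:.3f}")
```

Working with the 10×10 Gram matrix avoids forming the 100×100 matrix WᵀW, which matters more as p grows. In practice one would more likely use an SVD routine from a numerical library than hand-rolled power iteration.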
