Have you ever had an annoying dataset that looks something like this?
Or, even worse, just several of them?
In this blog post, I will introduce basic techniques you can use and implement with Python to identify and clean outliers. The objective is to get something more eye-pleasing (and, above all, less troublesome for further data analysis) like this
I will look specifically at the case of timeseries of instant conductance values of simulated ion channels (*). At the end, I will share a code snippet you can steal and adapt to deal with similar timeseries.
(*) I will probably write another blog post in the future to expand on my simulations and on how I computed instant conductance for hundreds of simulation trajectories.
Identifying Outliers
Outliers can be defined as data points that significantly differ from other observations.
In general, how to remove outliers will depend on the specific characteristics of your data and the desired outcome. I will not elaborate on that here, but this can get really complicated: the analysis of timeseries is an old topic that spans several disciplines. Check this reference if you want to get a sense of how deep the rabbit hole goes.
Here, I will focus on a few approaches relevant to the kind of timeseries whose valid values can simply be bounded within an interval.
In this case, to spot outliers one can simply plot the data (visual approach) and then manually set threshold values to filter them out. But, because this process gets impractical for hundreds of timeseries, drawing information from the data distributions comes in handy (statistical approach). I will cover both cases.
Visual approach
Just plot it
Timeseries and scatter plots are useful means to visualise outliers.
In the timeseries below, we can see that the bulk of the data lives between 0.5 and 2.0.
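For reference, here is a minimal plotting sketch (assuming timeseries is a 1-D NumPy array or pandas Series of conductance values; the variable name is mine):

import matplotlib.pyplot as plt

# Quick visual inspection: a timeseries plot and a scatter plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.plot(timeseries, lw=0.5)
ax1.set(xlabel='Frame', ylabel='Conductance')
ax2.scatter(range(len(timeseries)), timeseries, s=2)
ax2.set(xlabel='Frame', ylabel='Conductance')
plt.show()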
To clean our data, we can set these as thresholds, replace outliers with NaN values, and fill them in with interpolated data. Using pandas, it would look something like this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'original': timeseries})
# Mask values outside the [0.5, 2.0] interval with NaN
df_with_NaNs = df.mask((df < 0.5) | (df > 2.0))
# Fill in the gaps by linear interpolation (and pad the edges)
df_new = df_with_NaNs.interpolate(method='linear', axis=0).ffill().bfill()
Box plots
Using a plain plot of our timeseries does the job. However, we can use another graphical representation, a more statistical one: a box plot.
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
sns.boxplot(x=df['original'], ax=ax)
Box plots graphically represent the anatomy of your data’s distribution. The central box indicates the quartiles Q1, Q2, and Q3, which represent the 25th percentile, the median, and the 75th percentile of the distribution. The two whiskers extend to the most extreme data points within 1.5 times the interquartile range from the box; all data points beyond them are considered outliers. For a pretty picture of the parts of a box plot, check this.
Again, looking at the box plot, we can see that 0.5 and 2.0 are indeed quite sensible threshold values for filtering out outliers.
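To check this numerically rather than by eye, we can compute the box and whisker positions directly (a short sketch; timeseries is assumed to be a 1-D array or Series as before):

import numpy as np

# Quartiles define the box; 1.5*IQR beyond the box defines the whiskers
Q1, Q2, Q3 = np.percentile(timeseries, [25, 50, 75])
IQR = Q3 - Q1
print(f'Box: Q1={Q1:.2f}, median={Q2:.2f}, Q3={Q3:.2f}')
print(f'Whisker limits: [{Q1 - 1.5 * IQR:.2f}, {Q3 + 1.5 * IQR:.2f}]')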
Statistical Approach
Visualisation is great for quickly spotting outliers. However, if you have to deal with hundreds or even thousands of timeseries, clutter will make it impractical to pick suitable thresholds by eye.
Statistical approaches provide a methodical way to overcome this limitation. Two very well-known approaches I will look at are:
- The Z-score
- The Inter-Quartile Range
The Z-score
This method is based on a simple intuition: just use the arithmetic mean and standard deviation of each timeseries to define an interval
[np.mean(timeseries) - np.std(timeseries), np.mean(timeseries) + np.std(timeseries)]
outside which outliers must live. After all, the bulk of the data must fall within this interval. Set this into a script, and “Boom!” Outliers gone!
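As a sketch of that naive filter (before the z-score formulation below; timeseries is assumed to be a 1-D array or Series):

import numpy as np

# Flag everything outside [mean - std, mean + std] as an outlier
lower = np.mean(timeseries) - np.std(timeseries)
upper = np.mean(timeseries) + np.std(timeseries)
outliers = timeseries[(timeseries < lower) | (timeseries > upper)]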
The Z-score method does exactly the same thing as setting thresholds for outlier filtering with an interval of length 2*sigma centred at the arithmetic mean. But instead of setting thresholds on the timeseries values themselves, this is done on their normalisation:
z_scores = (timeseries - np.mean(timeseries))/np.std(timeseries)
So, any threshold on the z_scores will represent a multiple of the standard deviation.
Use this snippet to capture the outlier values to be cleaned following this method:
import numpy as np
from scipy import stats

# Z-score each point; flag values more than `threshold` sigmas from the mean
z_scores = stats.zscore(timeseries)
threshold = 1
outliers = timeseries[np.abs(z_scores) > threshold].values
Inter-Quartile Range
This is one of the most trusted methods for dealing with outliers in research. And, just in case you didn’t notice, it is exactly what box plots visualise.
To define the endpoints of the interval for outlier filtering, instead of using the mean and the standard deviation, this approach uses the interval
[Q1 - 1.5*IQR, Q3 + 1.5*IQR]
where Q1 and Q3 are the positions of the first and third quartiles, and IQR is the interquartile range, given by their difference Q3 - Q1.
Unlike in the Z-score method, the filtering interval is defined around the median, not the mean, hence taking into account any asymmetry in the data distribution. Another reason why IQR is more robust is that the mean and the standard deviation can get badly skewed if you have several outliers with unusually high values, something I saw when trying Z-scores on my own timeseries.
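Here is a toy illustration of that fragility (synthetic numbers, not my simulation data): a single extreme value drags the mean and inflates the standard deviation, while the median and IQR barely move.

import numpy as np

clean = np.array([0.9, 1.0, 1.0, 1.1, 1.2, 0.8, 1.3])
corrupted = np.append(clean, 100.0)  # one extreme outlier

for name, data in [('clean', clean), ('corrupted', corrupted)]:
    Q1, Q3 = np.percentile(data, [25, 75])
    print(f'{name}: mean={data.mean():.2f}, std={data.std():.2f}, '
          f'median={np.median(data):.2f}, IQR={Q3 - Q1:.2f}')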
Use this snippet to capture the outlier values to be cleaned following the IQR method:
import numpy as np

# Define first and third quartiles and the IQR
Q1 = np.percentile(timeseries, 25, method='midpoint')
Q3 = np.percentile(timeseries, 75, method='midpoint')
IQR = Q3 - Q1

# Boolean masks for upper and lower outliers
upper_outliers_indices = timeseries >= (Q3 + 1.5*IQR)
lower_outliers_indices = timeseries <= (Q1 - 1.5*IQR)

# Extract the outlier values
upper_outliers = timeseries[upper_outliers_indices].values
lower_outliers = timeseries[lower_outliers_indices].values
outliers = np.concatenate([lower_outliers, upper_outliers])
Back to our original timeseries
OK, we have talked through how different visual and statistical methods deal with outliers, along with their limitations.
Now, it’s time to go back to our original problem of filtering outliers out from our conductance timeseries.
Trade-offs
You might think that the way to go should be the IQR method. However, if you take a careful look at the timeseries above after applying IQR, you will see that not only the zero and large-value outliers are removed, but also some non-zero data points that tell us something about the evolution of our system, i.e., low-conductance states transiently visited by the ion channel during the simulation.
When dealing with outliers, you must consider whether any data points removed carry some valuable information in the context of your data.
My Python function
For my original data, the particular strategy I used required removing all the upper outliers along with all zeros, while keeping the lower “outliers” flagged by the IQR method, and, again, filling in the removed points with interpolated values.
Steal this:
import numpy as np

def clean_outliers(df_timeseries):
    # df_timeseries: a pandas Series of conductance values
    # Determine the interquartile range (IQR)
    Q1 = np.percentile(df_timeseries, 25, method='midpoint')
    Q3 = np.percentile(df_timeseries, 75, method='midpoint')
    IQR = Q3 - Q1

    # Flag only the upper outliers; lower "outliers" are kept on purpose
    upper_outliers_indices = df_timeseries >= (Q3 + 1.5*IQR)
    upper_outliers = df_timeseries[upper_outliers_indices].values

    # Add zero to the values to be removed
    outliers = np.concatenate([upper_outliers, np.zeros(1)])

    # Replace outliers with NaNs and fill them in via interpolation
    df_with_NaNs = df_timeseries.replace(outliers.tolist(), np.nan)
    df_interpolated = df_with_NaNs.interpolate(method='linear', axis=0).ffill().bfill()
    return df_interpolated
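Applied to the single-column DataFrame from earlier (column name 'original' as before; 'cleaned' is just a name I picked), usage looks like:

df['cleaned'] = clean_outliers(df['original'])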
And the outcome:
The bottom line
Dealing with outliers doesn’t have to be a pain if you use the right approach. Sometimes just plotting your data and discarding points outside an interval is enough. Other times you may need the statistical information in your dataset to identify and remove outliers more methodically. Regardless of the method, always consider the trade-off between what information you want to keep and how much you can afford to wipe off. Think about the context of your problem to judge whether you are removing informative data.