Combining Multiple Comparisons Similarity plots for statistical tests

Following on from my previous blopig post, Garrett gave the very helpful suggestion of combining Multiple Comparisons Similarity (MCSim) plots to reduce information redundancy. For example, this is an MCSim plot from my previous blog post:

This plot shows effect sizes from a statistical test (specifically Tukey HSD) between mean absolute error (MAE) scores for different molecular featurization methods on a benchmark dataset. Red shows that the method on the y-axis has a greater average MAE score than the method on the x-axis; blue shows the inverse. There is redundancy in this plot, as the same information is displayed in both the upper and lower triangles. Instead, we could plot both the effect sizes and the p-values from a single test on the same MCSim plot.

First, let’s import the necessary Python modules for an example:

import lightgbm as lgbm
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np

from scipy.stats import tukey_hsd
import seaborn as sns

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import GroupKFold
from sklearn.metrics import mean_absolute_error

from tqdm import tqdm

Now, let’s make some dummy data using make_regression from sklearn:

X, y = make_regression(n_samples=1000, n_features=50, noise=0.1)

Sticking with the themes from my previous blopig post, let’s presume this is molecular data and we’ve grouped the molecules based on similarity with the Butina algorithm:

groups = np.random.randint(0, 10, 1000)
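In a real workflow the group labels would come from clustering the molecules themselves rather than being drawn at random. Here is a minimal sketch of Butina clustering on Morgan fingerprints with RDKit, assuming smiles is a list of SMILES strings and using an illustrative Tanimoto distance cutoff of 0.4 (both of these are assumptions, not part of the example above):

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

# Fingerprint each molecule (Morgan, radius 2, 2048 bits)
mols = [Chem.MolFromSmiles(s) for s in smiles]  # smiles: your list of SMILES strings
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

# Build the condensed Tanimoto distance matrix that Butina.ClusterData expects
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1 - s for s in sims)

# Cluster with the (illustrative) distance cutoff and use cluster indices as group labels
clusters = Butina.ClusterData(dists, len(fps), 0.4, isDistData=True)
groups = np.zeros(len(fps), dtype=int)
for cluster_id, cluster in enumerate(clusters):
    for idx in cluster:
        groups[idx] = cluster_id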

Using repeated GroupKFold splits to generate many training and test sets, let’s train and test three different model types on this data: a light gradient boosting machine (LGBM), a random forest (RF), and gradient boosting (GB). Since scikit-learn has no repeated variant of GroupKFold, we loop over the repeats ourselves and re-seed the splitter each time:

# Set the number of repeats and folds for train and test splits
repeats = 10
folds = 5

# Create arrays to store the scores
lgbm_scores = np.zeros((repeats*folds))
rf_scores = np.zeros((repeats*folds))
gb_scores = np.zeros((repeats*folds))

# Loop through the repeats
count = 0
pbar = tqdm(total=repeats*folds)
for i in range(repeats):
    # Create a GroupKFold splitter with a new random state for each repeat
    # (shuffle and random_state for GroupKFold require a recent scikit-learn release)
    splitter = GroupKFold(n_splits=folds, random_state=i, shuffle=True)
    # Loop through the folds
    for train_index, test_index in splitter.split(X, y, groups=groups):
        # Split the data into training and test sets
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Fit and score LGBM model
        lgbm_model = lgbm.LGBMRegressor(verbose=-1)
        lgbm_model.fit(X_train, y_train)
        preds = lgbm_model.predict(X_test)
        mae = mean_absolute_error(y_test, preds)
        lgbm_scores[count] = mae

        # Fit and score RF model
        rf_model = RandomForestRegressor()
        rf_model.fit(X_train, y_train)
        preds = rf_model.predict(X_test)
        mae = mean_absolute_error(y_test, preds)
        rf_scores[count] = mae

        # Fit and score GB model
        gb_model = GradientBoostingRegressor()
        gb_model.fit(X_train, y_train)
        preds = gb_model.predict(X_test)
        mae = mean_absolute_error(y_test, preds)
        gb_scores[count] = mae

        count += 1
        pbar.update(1)
pbar.close()
[out]: 100%|██████████| 50/50 [02:11<00:00,  2.63s/it]

We now have an array of test set MAE scores for each of the three models over the 50 train-test splits. We can now compare these scores using the Tukey HSD test to evaluate whether there is a significant difference between the average scores of the models.

# pairwise comparison of models
tukey_results = tukey_hsd(lgbm_scores, rf_scores, gb_scores)
print(tukey_results)
[out]: Tukey's HSD Pairwise Group Comparisons (95.0% Confidence Interval)
Comparison  Statistic  p-value  Lower CI  Upper CI
 (0 - 1)    -20.255     0.000   -21.527   -18.983
 (0 - 2)     -0.379     0.760    -1.651     0.893
 (1 - 0)     20.255     0.000    18.983    21.527
 (1 - 2)     19.876     0.000    18.604    21.148
 (2 - 0)      0.379     0.760    -0.893     1.651
 (2 - 1)    -19.876     0.000   -21.148   -18.604
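Besides the printed table, the result object also exposes these values programmatically: tukey_results.statistic and tukey_results.pvalue are matrices of pairwise effect sizes and p-values, and the confidence intervals can be retrieved with the confidence_interval method. For example:

# Pull the confidence intervals out of the result object as arrays
ci = tukey_results.confidence_interval(confidence_level=0.95)
print(ci.low)   # matrix of lower confidence bounds
print(ci.high)  # matrix of upper confidence bounds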

There we go, we have performed a pairwise statistical test, producing effect sizes and p-values. We can visualise these values in an MCSim plot, like the one above. However, this time we will plot both the effect sizes and the p-values onto a single plot:

# Create a heatmap of the effect size and p-values
# First, create a figure and axis
fig, ax = plt.subplots(1, 1, figsize=(8, 8))

# Set the headings for the heatmap
headings = ['LGBM', 'RF', 'GB']

# Get the effect sizes to plot
effect_size = tukey_results.statistic.copy()

# Create a mask to hide the lower triangle
mask = np.zeros_like(effect_size).astype(bool)
mask[np.tril_indices_from(mask)] = True

# Plot the effect size heatmap
sns.heatmap(
    effect_size, # the effect size matrix
    annot=True, fmt='.2f', # show the values in the cells
    cmap='coolwarm', # use the coolwarm colormap
    ax=ax, # use the axis we created
    mask=mask, # only show the upper triangle
    vmax=25, vmin=-25, # set the limits of the color scale
    cbar_kws={
        'label': 'Effect Size', 'orientation': 'vertical',
        'location': 'right', 'aspect':40, 'pad': 0.1, 
        'fraction': 0.023, 'anchor': (0.0, 1.0)
    }, # set the colorbar parameters
)

# adjust the colorbar ticks
colorbar = ax.collections[0].colorbar
colorbar.set_ticks([i for i in range(-25, 26, 5)])

# Get the p-values to plot
pvals = tukey_results.pvalue

# Convert the p-values to discrete significance bins for plotting
# (np.select avoids overwriting values in place and also bins p-values of exactly 0 correctly)
pvals = np.select(
    [pvals > 0.05, pvals > 0.01, pvals > 0.001],
    [0, 1, 2],
    default=3,
)

# Create a mask to only show the lower triangle
mask = np.zeros_like(pvals).astype(bool)
mask[np.triu_indices_from(mask)] = True

# Create a custom colormap for the p-values
colors = ["#d3d3d3", "#90ee90", "#008000", "#004d00"]  # Custom colors for 0, 1, 2, 3
cmap = ListedColormap(colors)

# Plot the p-values on the same axis
sns.heatmap(
    pvals, # the p-value matrix
    ax=ax, # plot on the same axis as the effect size
    mask=mask, # mask the upper triangle
    cmap=cmap, # use the custom colormap
    vmin=-0.5, vmax=3.5, # set the limits of the color scale
    cbar_kws={
        'label': 'P-values', 'orientation': 'horizontal',
        'location': 'bottom', 'aspect':40, 'pad': 0.1, 
    }, # set the colorbar parameters
    xticklabels=headings, yticklabels=headings, # set the labels for the axes
    )

colorbar = ax.collections[1].colorbar
# adjust the colorbar ticks
colorbar.set_ticks([0, 1, 2, 3,])

# adjust the colorbar tick labels (raw strings avoid warnings about the LaTeX escapes)
colorbar.set_ticklabels([
    r'$5\cdot 10^{-2} < p$',
    r'$10^{-2} < p < 5\cdot 10^{-2}$',
    r'$10^{-3} < p < 10^{-2}$',
    r'$p < 10^{-3}$'
])

# Place ticks on all sides of the colorbar
ax.tick_params(top=True, labeltop=True, right=True, labelright=True, )
fig.tight_layout(rect=[0, 0, .9, 1])
plt.show()

We now have a plot where the lower triangle displays the p-values from the statistical test and the upper triangle displays the effect sizes, reducing information redundancy. For the effect sizes, red indicates that the method on the y-axis has a higher MAE than the method on the x-axis; blue indicates the inverse. For the p-values, grey indicates no significant difference (p > 0.05), and progressively darker greens indicate smaller p-values.
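If you find yourself making these combined plots for several benchmarks, the plotting code above can be wrapped into a small helper. Here is a condensed sketch; the function name plot_mcsim, the default colour limit, and the dropped colourbar tick formatting are my own simplifications rather than anything from the example above, and it assumes the same imports:

def plot_mcsim(statistic, pvalue, labels, vlim=25):
    """Plot effect sizes (upper triangle) and binned p-values (lower triangle)."""
    fig, ax = plt.subplots(1, 1, figsize=(8, 8))

    # Upper triangle: effect sizes
    upper_mask = np.zeros_like(statistic, dtype=bool)
    upper_mask[np.tril_indices_from(upper_mask)] = True
    sns.heatmap(
        statistic, annot=True, fmt='.2f', cmap='coolwarm',
        vmin=-vlim, vmax=vlim, mask=upper_mask, ax=ax,
        cbar_kws={'label': 'Effect Size', 'location': 'right'},
    )

    # Lower triangle: p-values binned into significance levels
    bins = np.select([pvalue > 0.05, pvalue > 0.01, pvalue > 0.001], [0, 1, 2], default=3)
    lower_mask = np.zeros_like(bins, dtype=bool)
    lower_mask[np.triu_indices_from(lower_mask)] = True
    sns.heatmap(
        bins, mask=lower_mask, ax=ax,
        cmap=ListedColormap(["#d3d3d3", "#90ee90", "#008000", "#004d00"]),
        vmin=-0.5, vmax=3.5,
        cbar_kws={'label': 'P-values', 'orientation': 'horizontal', 'location': 'bottom'},
        xticklabels=labels, yticklabels=labels,
    )

    ax.tick_params(top=True, labeltop=True, right=True, labelright=True)
    return fig, ax

# For example:
# fig, ax = plot_mcsim(tukey_results.statistic, tukey_results.pvalue, ['LGBM', 'RF', 'GB'])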
