Following on from my previous blopig post, Garrett gave the very helpful suggestion of combining Multiple Comparisons Similarity (MCSim) plots to reduce information redundancy. For example, this is an MCSim plot from my previous blog post:

This plot shows effect sizes from a statistical test (specifically Tukey HSD) between mean absolute error (MAE) scores for different molecular featurization methods on a benchmark dataset. Red shows that the method on the y-axis has a greater average MAE score than the method on the x-axis; blue shows the inverse. There is redundancy in this plot, as the same information is displayed in both the upper and lower triangles. Instead, we could plot both the effect size and the p-values from a test on the same MCSim.
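The way we'll do this below is by drawing two heatmaps on the same axes and masking complementary triangles of each. As a minimal sketch of the masking idea (the small random matrices here are placeholders standing in for the real effect sizes and p-values):

import numpy as np

# Placeholder matrices standing in for the effect sizes and p-values
effect_size = np.random.randn(3, 3)
pvals = np.random.rand(3, 3)

# seaborn's heatmap hides any cell where the mask is True
lower_mask = np.zeros_like(effect_size, dtype=bool)
lower_mask[np.tril_indices_from(lower_mask)] = True  # hides the lower triangle (and diagonal)

upper_mask = np.zeros_like(pvals, dtype=bool)
upper_mask[np.triu_indices_from(upper_mask)] = True  # hides the upper triangle (and diagonal)

Passing lower_mask to the effect-size heatmap and upper_mask to the p-value heatmap puts each quantity in its own triangle, which is exactly what the full example below does.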
First, let’s import the necessary Python modules for an example:
import lightgbm as lgbm
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np
from scipy.stats import tukey_hsd
import seaborn as sns
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import GroupKFold
from sklearn.metrics import mean_absolute_error
from tqdm import tqdm
Now, let’s make some dummy data using make_regression from sklearn:
X, y = make_regression(n_samples=1000, n_features=50, noise=0.1)
Sticking with the themes from my previous blopig post, let’s presume this is molecular data and that we’ve grouped the molecules by structural similarity using Butina clustering:
groups = np.random.randint(0, 10, 1000)
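With real molecular data, these groups would come from clustering the molecules themselves rather than being assigned at random. Below is a rough sketch of what that might look like with RDKit's Butina implementation; it assumes you have a list of SMILES strings called smiles, and the Morgan fingerprint settings and distance cutoff are illustrative choices, not prescriptions:

import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

# Hypothetical input: one SMILES string per molecule in the dataset
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

# Butina clustering takes the condensed distance matrix (1 - Tanimoto similarity)
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend([1 - s for s in sims])

clusters = Butina.ClusterData(dists, len(fps), distThresh=0.6, isDistData=True)

# Map each molecule index to its cluster ID to use as the `groups` array
groups = np.zeros(len(fps), dtype=int)
for cluster_id, members in enumerate(clusters):
    groups[list(members)] = cluster_id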
Using repeated GroupKFold to generate many training and testing splits, let’s train and test three different model types on this data: light gradient boosted machines (LGBM), random forest (RF), and gradient boosting (GB):
# Set the number of repeats and folds for train and test splits
repeats = 10
folds = 5

# Create arrays to store the scores
lgbm_scores = np.zeros((repeats*folds))
rf_scores = np.zeros((repeats*folds))
gb_scores = np.zeros((repeats*folds))

# Loop through the repeats
count = 0
pbar = tqdm(total=repeats*folds)
for i in range(repeats):
    # Create a GroupKFold object with new random state
    splitter = GroupKFold(n_splits=folds, random_state=i, shuffle=True)
    # Loop through the folds
    for train_index, test_index in splitter.split(X, y, groups=groups):
        # Split the data into training and test sets
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Fit and score LGBM model
        lgbm_model = lgbm.LGBMRegressor(verbose=-1)
        lgbm_model.fit(X_train, y_train)
        preds = lgbm_model.predict(X_test)
        mae = mean_absolute_error(y_test, preds)
        lgbm_scores[count] = mae

        # Fit and score RF model
        rf_model = RandomForestRegressor()
        rf_model.fit(X_train, y_train)
        preds = rf_model.predict(X_test)
        mae = mean_absolute_error(y_test, preds)
        rf_scores[count] = mae

        # Fit and score GB model
        gb_model = GradientBoostingRegressor()
        gb_model.fit(X_train, y_train)
        preds = gb_model.predict(X_test)
        mae = mean_absolute_error(y_test, preds)
        gb_scores[count] = mae

        count += 1
        pbar.update(1)

pbar.close()
[out]: 100%|██████████| 50/50 [02:11<00:00, 2.63s/it]
We now have an array of test-set MAE scores for each of the three models across the 50 train-test splits. We can compare these scores using the Tukey HSD test to evaluate whether there is a significant difference between the models' average scores.
# pairwise comparison of models
tukey_results = tukey_hsd(lgbm_scores, rf_scores, gb_scores)
print(tukey_results)
[out]:
Tukey's HSD Pairwise Group Comparisons (95.0% Confidence Interval)
Comparison  Statistic  p-value  Lower CI  Upper CI
 (0 - 1)      -20.255    0.000   -21.527   -18.983
 (0 - 2)       -0.379    0.760    -1.651     0.893
 (1 - 0)       20.255    0.000    18.983    21.527
 (1 - 2)       19.876    0.000    18.604    21.148
 (2 - 0)        0.379    0.760    -0.893     1.651
 (2 - 1)      -19.876    0.000   -21.148   -18.604
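As an aside, the printed table isn't the only way to get at these numbers: the result object returned by tukey_hsd also exposes them programmatically, which is what the plotting code below relies on. A quick sketch of pulling out the arrays (attribute and method names as documented for scipy's TukeyHSDResult):

# Pairwise effect sizes and p-values as (n_groups x n_groups) arrays
print(tukey_results.statistic)
print(tukey_results.pvalue)

# Confidence intervals for each pairwise difference (default 95%)
ci = tukey_results.confidence_interval(confidence_level=0.95)
print(ci.low)
print(ci.high)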
There we go, we have performed a pairwise statistical test, producing effect sizes and p-values. We can visualise these values in an MCSim plot, like the one above. However, this time we will plot both the effect sizes and the p-values onto a single plot:
# Create a heatmap of the effect size and p-values
# First, create a figure and axis
fig, ax = plt.subplots(1, 1, figsize=(8, 8))

# Set the headings for the heatmap
headings = ['LGBM', 'RF', 'GB']

# Get the effect sizes to plot
effect_size = tukey_results.statistic.copy()

# Create a mask to hide the lower triangle
mask = np.zeros_like(effect_size).astype(bool)
mask[np.tril_indices_from(mask)] = True

# Plot the effect size heatmap
sns.heatmap(
    effect_size,            # the effect size matrix
    annot=True, fmt='.2f',  # show the values in the cells
    cmap='coolwarm',        # use the coolwarm colormap
    ax=ax,                  # use the axis we created
    mask=mask,              # only show the upper triangle
    vmax=25, vmin=-25,      # set the limits of the color scale
    cbar_kws={
        'label': 'Effect Size',
        'orientation': 'vertical',
        'location': 'right',
        'aspect': 40,
        'pad': 0.1,
        'fraction': 0.023,
        'anchor': (0.0, 1.0)
    },                      # set the colorbar parameters
)

# adjust the colorbar ticks
colorbar = ax.collections[0].colorbar
colorbar.set_ticks([i for i in range(-25, 26, 5)])

# Get the p-values to plot
pvals = tukey_results.pvalue.copy()

# Convert the p-values to discrete values for plotting
pvals[pvals > 0.05] = 0
pvals[(pvals > 0.01) & (pvals <= 0.05)] = 1
pvals[(pvals > 0.001) & (pvals <= 0.01)] = 2
pvals[(pvals > 0.) & (pvals <= 0.001)] = 3

# Create a mask to only show the lower triangle
mask = np.zeros_like(pvals).astype(bool)
mask[np.triu_indices_from(mask)] = True

# Create a custom colormap for the p-values
colors = ["#d3d3d3", "#90ee90", "#008000", "#004d00"]  # Custom colors for 0, 1, 2, 3
cmap = ListedColormap(colors)

# Plot the p-values on the same axis
sns.heatmap(
    pvals,                  # the p-value matrix
    ax=ax,                  # plot on the same axis as the effect size
    mask=mask,              # mask the upper triangle
    cmap=cmap,              # use the custom colormap
    vmin=-0.5, vmax=3.5,    # set the limits of the color scale
    cbar_kws={
        'label': 'P-values',
        'orientation': 'horizontal',
        'location': 'bottom',
        'aspect': 40,
        'pad': 0.1,
    },                      # set the colorbar parameters
    xticklabels=headings, yticklabels=headings,  # set the labels for the axes
)

# adjust the colorbar ticks
colorbar = ax.collections[1].colorbar
colorbar.set_ticks([0, 1, 2, 3])

# adjust the colorbar tick labels
colorbar.set_ticklabels([
    r'$5\cdot 10^{-2} < p$',
    r'$10^{-2} < p < 5\cdot 10^{-2}$',
    r'$10^{-3} < p < 10^{-2}$',
    r'$p < 10^{-3}$'
])

# Place ticks and labels on all sides of the heatmap axes
ax.tick_params(top=True, labeltop=True, right=True, labelright=True)

fig.tight_layout(rect=[0, 0, .9, 1])
plt.show()

We now have a plot where the upper triangle displays the effect sizes and the lower triangle displays the p-values from the statistical test, reducing information redundancy. For the effect sizes, red indicates that the method on the y-axis has a higher MAE than the method on the x-axis; blue indicates the inverse. For the p-values, green indicates a significant difference (p < 0.05), with darker shades marking smaller p-values.
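If you find yourself making these combined plots often, the masking logic is easy to wrap up. Below is a minimal sketch of a reusable helper following the same approach as above; the function name and defaults are my own (not from any library), and the p-value binning at the exact boundaries is handled slightly differently via np.digitize:

def combined_mcsim(effect_size, pvals, labels, ax=None, vlim=25):
    """Plot effect sizes (upper triangle) and binned p-values (lower triangle)."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 8))

    # Upper triangle: effect sizes
    upper_mask = np.zeros_like(effect_size, dtype=bool)
    upper_mask[np.tril_indices_from(upper_mask)] = True
    sns.heatmap(
        effect_size, annot=True, fmt='.2f', cmap='coolwarm',
        mask=upper_mask, vmin=-vlim, vmax=vlim, ax=ax,
        cbar_kws={'label': 'Effect Size'},
    )

    # Lower triangle: p-values binned into significance levels
    binned = np.digitize(pvals, bins=[0.001, 0.01, 0.05])  # 0: p < 0.001, ..., 3: p >= 0.05
    binned = 3 - binned  # flip so 3 = most significant, matching the colour scheme above
    lower_mask = np.zeros_like(pvals, dtype=bool)
    lower_mask[np.triu_indices_from(lower_mask)] = True
    sns.heatmap(
        binned, mask=lower_mask,
        cmap=ListedColormap(["#d3d3d3", "#90ee90", "#008000", "#004d00"]),
        vmin=-0.5, vmax=3.5, ax=ax,
        cbar_kws={'label': 'P-values', 'orientation': 'horizontal'},
        xticklabels=labels, yticklabels=labels,
    )
    return ax

# Example usage with the results from above
combined_mcsim(tukey_results.statistic, tukey_results.pvalue, headings)
plt.show()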