Histograms are frequently used to visualize the distribution of a data set or to compare multiple distributions. Python, via matplotlib.pyplot, contains convenient functions for plotting histograms; the default plots it generates, however, leave much to be desired in terms of visual appeal and clarity.
The two code blocks below generate histograms of two normally distributed data sets. The first uses the default matplotlib.pyplot.hist settings; the second adds a few lines to improve the data presentation. See the comments to determine what each individual line is doing.
    ## DEFAULT HISTOGRAMS
    import matplotlib.pyplot as plt
    import numpy as np

    # set random seed so we get reproducible behavior
    np.random.seed(1)

    # generate two data series each containing 1,000 normally distributed values
    d1 = np.random.normal(5.0, 2.0, 1000)
    d2 = np.random.normal(6.0, 2.0, 1000)

    # make the plot with default settings
    plt.clf()
    plt.hist(d1)
    plt.hist(d2)
    plt.savefig('default_hist.png', dpi=300)
The output of this program is:
And now for the slightly longer but much improved histogram code:
    ## BETTER HISTOGRAMS
    import matplotlib.pyplot as plt
    import numpy as np

    # set random seed so we get reproducible behavior
    np.random.seed(1)

    # generate two data series each containing 1,000 normally distributed values
    d1 = np.random.normal(5.0, 2.0, 1000)
    d2 = np.random.normal(6.0, 2.0, 1000)

    # make the plot
    plt.clf()

    # generate subplot object so we can modify axis lines easily
    ax = plt.subplot(111)

    # updated histogram commands
    # use colors that can be differentiated by the colorblind from Paul Tol's notes
    # do not use "filled" histograms so all bin heights can be seen clearly
    plt.hist(d1, histtype='step', color='#EE8026', label='Data Set 1', alpha=0.7)
    plt.hist(d2, histtype='step', color='#BA8DB4', label='Data Set 2', alpha=0.7)

    # new things
    ax.spines['top'].set_visible(False)    # turn off top line
    ax.spines['right'].set_visible(False)  # turn off right line
    plt.ylabel('Counts')                   # label the y axis
    plt.xlabel('Values')                   # label the x axis
    plt.xlim(-2, 14)                       # set x limits that span full data range
    plt.ylim(-10, 300)                     # set y limits so that full range can be seen
    plt.legend(loc='best', fancybox=True)  # add a legend
    plt.savefig('better_hist.png', dpi=300)
The result of this program is:
This second plot is easier to read, has less visual clutter thanks to the removal of the "filled" histograms, and has labeled axes. The choice of histogram bins is an important consideration that I am not going to touch on here. You can experiment yourself to see how adding, for example, bins='fd' (the Freedman-Diaconis rule) to the plt.hist calls in the second program above changes the visual depiction of the results with all else held constant.
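As a starting point for that experiment, here is a minimal sketch of the second program with Freedman-Diaconis bins. It departs from the suggestion above in one way, which is my own assumption rather than part of the original recipe: instead of passing bins='fd' to each plt.hist call separately (which would give each data set its own bin edges), it computes a single set of edges from the pooled data with np.histogram_bin_edges so the two outlines share bins and can be compared directly. The output filename fd_hist.png and the Agg backend line are also just illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# same reproducible data as in the programs above
np.random.seed(1)
d1 = np.random.normal(5.0, 2.0, 1000)
d2 = np.random.normal(6.0, 2.0, 1000)

# assumption: one shared set of Freedman-Diaconis bin edges computed from the pooled data
edges = np.histogram_bin_edges(np.concatenate([d1, d2]), bins='fd')

plt.clf()
ax = plt.subplot(111)

# same styling as the improved plot, now with explicit bin edges
n1, _, _ = plt.hist(d1, bins=edges, histtype='step', color='#EE8026',
                    label='Data Set 1', alpha=0.7)
n2, _, _ = plt.hist(d2, bins=edges, histtype='step', color='#BA8DB4',
                    label='Data Set 2', alpha=0.7)

ax.spines['top'].set_visible(False)    # turn off top line
ax.spines['right'].set_visible(False)  # turn off right line
plt.ylabel('Counts')
plt.xlabel('Values')
plt.legend(loc='best', fancybox=True)
plt.savefig('fd_hist.png', dpi=300)    # hypothetical output filename
```

The y limits are left to matplotlib here because narrower bins produce lower counts per bin than the default ten bins, so the hard-coded plt.ylim(-10, 300) from the earlier program would leave a lot of empty space.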