This week I presented Ian Tickle’s 2012 paper “Statistical quality indicators for electron-density maps”, which introduces new, statistically robust metrics for describing the agreement between an atomic model and the experimentally derived electron density.
Previous metrics such as the Real-Space R factor (RSR) and the Real-Space Correlation Coefficient (RSCC) (Brändén & Jones, 1990; Jones et al., 1991; and others) are popular electron-density metrics, and can inform on the quality of an atomic model. However, as Tickle argues, they cannot tell us in what way a model is good or bad, as they give no indication of either the accuracy or the precision of the electron density.
Accuracy:
Ian Tickle describes accuracy as – “How close are the results on average to the truth (regardless of precision)?” This is more often referred to as ‘error’. The most accurate model is the one that best agrees with the electron density.
Precision:
Precision is described as – “If you were to repeat the experiment, how much would you expect the results to vary (regardless of accuracy)?” This is more often described as ‘uncertainty’. Precision is a property of the crystal and the experiment. It is independent of the model.
A pictographic representation is shown below –
Before discussing the proposed new metrics, there are several assumptions that must be made and several influencing factors to consider.
Assumptions:
- The electron density, and the phases used to generate it, are accurate. This assumption is reasonable because density-based validation is generally done near to the end of refinement when the model is mostly correct.
Metric usefulness depends critically on:
- Accurate calculation and representation of the electron density from our atomic model.
- Accurate scaling of the observed and model density (neither the calculated nor the observed density is on an absolute scale).
- Accurate determination of the area to be considered for the calculation of the metric. If too large an area is considered, noise and density from other features will influence the metric. Too small an area will not encompass the whole model and its environment.
Calculating the Model Density:
Accurate calculation of the model’s electron density is essential, as the profile of the atoms will of course affect the comparison of the model to the experimental density. Often (as in Jones et al., 1991, and others) a fixed profile is assumed for all atoms; in reality, the profile depends on atom type, B-factors, data completeness, and resolution limits.
Because of the resolution limit, the electron density of an atom is the convolution of a 3D Gaussian with the Fourier transform of a sphere of constant scattering power, i.e. the resolution cut-off sphere in reciprocal space (Blundell & Johnson, 1976). The truncated density function for an atom follows from this convolution.
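As a sketch of the point-atom case (a simplification on my part; the paper gives the full expression for a Gaussian atom), the Fourier transform of the resolution-limiting sphere of radius $1/d_{\min}$ gives the shape function

$$
G(r) \propto \frac{3\left(\sin x - x\cos x\right)}{x^{3}}, \qquad x = \frac{2\pi r}{d_{\min}},
$$

and the truncated density of a Gaussian atom is then (approximately) its Gaussian profile convolved with $G(r)$.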
Scaling the calculated density:
Fortunately, the required scale factor is already calculated by refinement programs (when producing 2mFo – DFc maps): it is the resolution-dependent factor D.
Visualising the quality of a model:
To demonstrate how the (global) quality of a model can easily be seen, Tickle calculates and normalises difference density maps for a good and a bad model. If the model is ‘correct’, the difference density should be Gaussian noise; if the model is ‘incorrect’, it will be skewed. This is clear in Figure 8 of the paper.
The main feature that appears for a ‘bad’ model is increased weight in the tails of the distribution. Extra weight on the left-hand side indicates model density that is not supported by the evidence, and extra weight on the right-hand side indicates under-modelled density.
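A minimal sketch of this kind of check, assuming the difference map is already held in a NumPy array (the array and function names here are illustrative, not from the paper):

```python
import numpy as np

def tail_weights(diff_map, cutoff=3.0):
    """Normalise a difference-density map and report the weight in its tails.

    For a 'correct' model the normalised values should behave like standard
    Gaussian noise, with only ~0.13% of points beyond each of +/- 3 sigma.
    """
    z = (diff_map - diff_map.mean()) / diff_map.std()  # crude normalisation
    neg_tail = np.mean(z < -cutoff)  # unjustified model density
    pos_tail = np.mean(z > cutoff)   # under-modelled (missing) density
    return neg_tail, pos_tail

# Synthetic example: pure noise vs. noise plus some unmodelled density.
rng = np.random.default_rng(0)
noise_only = rng.normal(size=100_000)
with_error = np.concatenate([noise_only, rng.normal(5.0, 1.0, size=500)])
print(tail_weights(noise_only))  # both tails ~0.0013
print(tail_weights(with_error))  # positive tail noticeably inflated
```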
The New Accuracy Metric
Using the ideas above (that the difference density between the model and the experimental density should be Gaussian-distributed), Tickle goes on to develop metrics for determining the likelihood that a model density and an experimental density differ only by the addition of random noise.
The metric he describes tests a hypothesis – Does the distribution of the difference density reflect that obtained from the propagation of random errors in the experimental data (and phases)?
To do this, statistical tests are developed. First, we define the difference density Z-score (ZD).
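In symbols, with Δρ the difference density and σ(Δρ) its estimated error, this is

$$
Z_D = \frac{\Delta\rho}{\sigma(\Delta\rho)}
$$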
This quantity is the difference between the observed (experimental) density and the calculated model density (Δρ), divided by the estimated error in the difference density, giving the usual definition of a normalised Z-score.
The difference density (the numerator) has been discussed above, so we now discuss the error in the difference density. Assuming that the experimental data and the phases are ‘correct’, any non-random errors arise only from the model.
That is, errors arising in the experimental data will appear only as random noise, whereas errors in the model will manifest as the signal that we are trying to detect.
To estimate the strength of the noise (that of the experimental data and the phases), we look at the bulk-solvent region. Here the solvent is disordered and so should give uniform density; any deviation from uniformity is then approximately the random noise from the experimental data and the phases.
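A sketch of how such a noise estimate could be formed, assuming a boolean solvent mask on the same grid as the difference map (the names are illustrative):

```python
import numpy as np

def noise_sigma_from_solvent(diff_map, solvent_mask):
    """Estimate sigma(delta-rho) from the bulk-solvent region.

    The solvent region contains no ordered atoms, so any structure in the
    difference density there is treated as random noise from the data and
    phases; its RMS value estimates the noise level.
    """
    solvent_values = diff_map[solvent_mask]
    return np.sqrt(np.mean(solvent_values ** 2))
```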
Maximum Z-score analysis
Tickle considers using the maximum ZD of a sample as a test for model accuracy, and discusses its merits and failings. In brief, if we were to sample from a difference density distribution and take only the most significant ZD score, “focusing only on the maximum value inevitably overstates the significance of the results”.
A Chi-Squared test for ZD scores
The solution that Tickle proposes is to allow that all sample values may be significant (rather than just the largest). He constructs the joint probability density function of the absolute sample values (assumed half-normal and iid); the sum of their squares then follows a chi-squared distribution.
By calculating the CDF of this chi-squared distribution (a regularised lower incomplete gamma function), Tickle obtains p-values for a set of observations.
These can then be converted back to Z-scores, which crystallographers are more comfortable using. As Tickle notes, expressing the metric in terms of Z-scores does not mean that the underlying distribution is normal (here it is clearly chi-squared).
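Putting the last few paragraphs together, a minimal sketch of the basic test might look like the following (an illustration of the statistics described above, not Tickle’s implementation):

```python
import numpy as np
from scipy import stats

def chi2_equivalent_z(z_values):
    """Score a sample of difference-density Z-scores with a chi-squared test.

    Under the null hypothesis the Z_D values are iid standard normal, so the
    sum of their squares follows a chi-squared distribution with n degrees of
    freedom.  The cumulative probability (a regularised lower incomplete
    gamma function) is then re-expressed as an equivalent normal deviate.
    """
    z = np.asarray(z_values, dtype=float)
    chi2_stat = np.sum(z ** 2)
    p = stats.chi2.cdf(chi2_stat, df=z.size)  # regularised lower incomplete gamma
    return stats.norm.ppf(p)                  # equivalent normal Z-score

rng = np.random.default_rng(1)
print(chi2_equivalent_z(rng.normal(size=50)))  # pure noise: typically near zero
print(chi2_equivalent_z(np.concatenate([rng.normal(size=50), [4.0, 5.0]])))
# two genuine outliers push the equivalent Z-score well above zero
```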
Problem: Large Samples
The problem with the above is that, for large samples, a small number of significant values will be drowned out by the noise and the signal may be missed. Tickle attributes the failure of the test in this situation to the choice of null hypothesis: multiple null hypotheses are needed to cover all the possibilities.
When distinguishing between multiple hypotheses, we must aim to avoid type II errors wherever possible, whilst attempting to minimise type I errors. We must select the hypothesis “that maximises the probability of obtaining a result less extreme than the one actually observed (…) or equivalently the one that minimises the probability of obtaining a result more extreme than that observed”.
Solution: A new JPDF and CDF
To solve this, Tickle takes a subset of the highest values of the original sample of n, say the ordered values from i = k to n (the n − k + 1 highest scores), and calculates the chi-squared statistic and its associated cumulative probability. The value of k is then chosen to give the highest cumulative probability.
However, the cumulative probability of the chi-squared statistic is no longer the regularised gamma function, because of the bias introduced by selecting the largest values. Recalculating the JPDF and integrating (analytically and numerically) to obtain the CDF would, in principle, give a result; however, the high dimensionality of the integral requires very accurate Monte Carlo integration (an accuracy much better than 0.27% is needed, since we are interested in p-values between 0.9973 and 1, i.e. greater than 3 sigma).
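The selection bias is easy to demonstrate with a small simulation (this illustrates the problem, not Tickle’s solution): keeping only the largest values of each pure-noise sample makes the naive chi-squared probability ‘significant’ far more often than it should be.

```python
import numpy as np
from scipy import stats

# For pure-noise samples of size n, keep the m largest |Z| values and apply
# the naive chi-squared test with m degrees of freedom.  If that test were
# still valid, 'significant' results (p > 0.9973) would occur ~0.27% of the
# time; selecting the largest values inflates this enormously.
rng = np.random.default_rng(2)
n, m, trials = 100, 10, 10_000
false_positives = 0
for _ in range(trials):
    z = np.abs(rng.normal(size=n))
    top = np.sort(z)[-m:]                       # the m largest values
    p = stats.chi2.cdf(np.sum(top ** 2), df=m)  # naive regularised gamma CDF
    false_positives += p > 0.9973
print(false_positives / trials)  # far larger than the nominal 0.0027
```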
Fortunately, an approximation can be made to bring about a practical solution.
Examples of Significant Distributions:
Tickle gives a table of several scenarios that produce a significant result, for different extremes of Z-value and different sample sizes. One particularly important point is:
…small samples are statistically less reliable so require a higher proportion of significant data points to achieve the same overall level of significance. Large samples require relatively fewer data points but they must have higher values to overcome the ‘multiple comparisons’ effect, where large values are more likely to occur purely as a result of random error.
Summary
B-factors:
Tickle shows early in the paper that the RSR and RSCC are correlated with the B-factor of the model. RSZD shows no correlation with the B-factor, as is desired.
RSZD+ & RSZD−:
More useful scores can be generated by scoring the negative and positive values of delta-rho separately. This gives two scores, RSZD+ and RSZD-. RSZD+ gives the significance/prevalence of unexplained density (missing atoms) and RSZD- gives the significance/prevalence of unjustified model/misplaced atoms.
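As a rough sketch of the idea (ignoring the order-statistics refinement Tickle actually applies, and using the same simple chi-squared scoring as the earlier example):

```python
import numpy as np
from scipy import stats

def rszd_plus_minus(z_values):
    """Simplified sketch: score positive and negative difference-density
    Z-scores separately.  RSZD+ reflects positive delta-rho (unexplained
    density / missing atoms); RSZD- reflects negative delta-rho
    (unjustified or misplaced model)."""
    z = np.asarray(z_values, dtype=float)

    def equivalent_z(vals):
        # chi-squared cumulative probability expressed as a normal deviate
        return stats.norm.ppf(stats.chi2.cdf(np.sum(vals ** 2), df=vals.size))

    return equivalent_z(z[z > 0]), equivalent_z(z[z < 0])
```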
Precision & Reliability:
Although not discussed in as much depth in the paper, Tickle also proposes a metric, RSZO, to account for the precision of the electron density.
This is clearly independent of the model, and is the signal-to-noise ratio of the average observed density in a specified region. Weak density (or large noise) will lead to a small RSZO, implying that any model placed here should be considered unreliable.
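Based on the description above, one plausible form (an assumption on my part, not necessarily Tickle’s exact definition) is

$$
\mathrm{RSZO} = \frac{\langle \rho_{\mathrm{obs}} \rangle}{\sigma(\rho)},
$$

where $\langle \rho_{\mathrm{obs}} \rangle$ is the average observed density over the region of interest and $\sigma(\rho)$ is the noise level estimated from the bulk-solvent region, as before.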