A SOUND SECOND OPINION…

The idea of using traditional statistics as a broad comparison measure is controversial (at best). If the idea of a definitive test for "statistically significant differences" is implied, the metrics fail on several fronts, as aptly discussed below in William Huber's review of the column "Comparing Map Surfaces" (GeoWorld, December 1999).

_________________________

The t-Test is Inappropriate

Let's begin with the easy stuff. The t-test as described in your column is not usually appropriate. To relate the t-statistic to a p-value you need to verify many preconditions, almost none of which hold for maps generally:

The data must be independent,
The data should be identically distributed, and
The random components should be approximately normally distributed.

Second, in the rare event these conditions do hold, your readers should consider using a pairwise comparison test because it will be much more incisive. Your Table 1 shows computations for an unpaired comparison, which is inferior in its ability to identify real differences. You are allowed to construct any pairing you want for a t-test; the only requirement is that you have the same quantity of numbers in each sample. However, the pairing must be made independently of the data values themselves. For example, a random pairing will do, but pairing the ordered results will not. A pairing based on other information, such as a common X,Y location, is also ok. An extension of this idea is to pair data based on proximity when there is not a common set of sample locations. GISes make such operations easy to do. (And please--this is a nitpick--avoid reporting Excel's P-values when they're so ridiculously low. Whenever an Excel P-value is less than about 10^-6 it is unlikely to be more than a crude approximation. Consider replacing those P-values with something like "<0.0001%".)
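To illustrate why pairing matters, here is a minimal Python sketch that computes both statistics on the same numbers. The sample values are hypothetical (not taken from Table 1), the unpaired version assumes the textbook pooled-variance form, and the function names are made up for this example.

```python
import math
import statistics


def unpaired_t(a, b):
    """Pooled-variance two-sample t-statistic (equal variances assumed)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb))


def paired_t(a, b):
    """Paired t-statistic computed from the point-by-point differences."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical values from two maps at the same six X,Y locations.
map_a = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
map_b = [9.0, 18.5, 29.0, 38.0, 49.5, 58.0]

t_unpaired = unpaired_t(map_a, map_b)
t_paired = paired_t(map_a, map_b)
print(f"unpaired t = {t_unpaired:.2f}, paired t = {t_paired:.2f}")
```

With these made-up numbers the location-to-location spread swamps the unpaired statistic, while the paired statistic looks only at the differences and so picks up the consistent offset. Neither number is a valid test unless the preconditions listed above hold.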

Because mapped data usually exhibit strong spatial dependence, the closest you can come to a correct statistical test would have to be a geostatistical one: a block-kriged mean of the map area, together with the kriging standard deviation, are the statistics you are looking for. The kriging algorithms will not give you degrees of freedom, but there are ways to approximate them for use in a modified t-test. You could also eschew the t-test altogether and conclude that the block-kriged mean M of the map differences, having kriging standard deviation of S, is significantly different from zero whenever |M| > Z*S and Z, as usual, is defined relative to the standard normal distribution by Prob(x > Z or x < -Z) = 5% or 1% or whatever significance level you want. This would be the geostatistical analog of the paired-comparison t-test. It copes with deviations from the first assumption above (data independence) but does not help you with deviations from the second and third (changes in distribution over space and departures from normal distribution).
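The decision rule |M| > Z*S can be sketched in a few lines of Python. This is only an illustration: M and S must come from your kriging software (they are not computed here), the numbers below are hypothetical, and the function name is invented for the example.

```python
from statistics import NormalDist


def kriged_mean_differs_from_zero(mean_diff, kriging_sd, alpha=0.05):
    """Two-sided rule: is |M| > Z*S at significance level alpha?"""
    # Z satisfies Prob(x > Z or x < -Z) = alpha for a standard normal x.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(mean_diff) > z * kriging_sd, z


# Hypothetical block-kriged mean of the map differences and its kriging SD.
significant, z = kriged_mean_differs_from_zero(mean_diff=2.4, kriging_sd=1.0)
print(f"Z = {z:.2f}, significant: {significant}")
```

Changing alpha to 0.01 raises Z and makes the rule correspondingly harder to satisfy.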

For those looking for something quick and dirty (speed and simplicity are helpful when you’re exploring data), you might try these steps:

a) Sample your map data based on a grid whose spacing is large enough to limit possible spatial correlation among data.

b) View a scatterplot of the map data at the sample points.

c) If the scatterplot looks like a horizontal elliptical cloud, construct the box-and-whisker plot of the point-by-point differences in the data.

d) If the box in that plot does not overlap zero, conclude there is a significant difference (assuming there are ten or more sample points).

e) If the median in the box is large enough to be of concern, conclude that the difference is meaningful.
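The numerical parts of steps (a), (c) and (d) can be sketched in Python as follows. This is a rough illustration under several assumptions: the two maps are equal-sized grids stored as lists of rows, the "box" is taken to be the first-to-third quartile range, and the grids and function names are hypothetical. The visual scatterplot check in steps (b) and (c) and the judgment call in step (e) are left to the analyst.

```python
import statistics


def sample_grid(grid, spacing):
    """Step (a): keep every `spacing`-th cell in both directions."""
    return [row[c] for row in grid[::spacing] for c in range(0, len(row), spacing)]


def box_excludes_zero(map_a, map_b, spacing):
    """Steps (c)-(d): does the box of the point-by-point differences miss zero?"""
    a = sample_grid(map_a, spacing)
    b = sample_grid(map_b, spacing)
    diffs = [x - y for x, y in zip(a, b)]
    if len(diffs) < 10:
        raise ValueError("need ten or more sample points")
    q1, median, q3 = statistics.quantiles(diffs, n=4, method="inclusive")
    return (q1 > 0 or q3 < 0), (q1, median, q3)


# Hypothetical 12 x 12 map surfaces; map_b sits roughly one unit below map_a.
map_a = [[float(r + c) for c in range(12)] for r in range(12)]
map_b = [[map_a[r][c] - 1.0 - 0.1 * ((r * c) % 5) for c in range(12)]
         for r in range(12)]

differs, (q1, median, q3) = box_excludes_zero(map_a, map_b, spacing=3)
print(f"box = [{q1:.2f}, {q3:.2f}], median = {median:.2f}, box excludes zero: {differs}")
```

The spacing of 3 cells is arbitrary here; in practice it would be chosen large enough to limit spatial correlation, as step (a) requires.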

This prescription assumes that your map data are NOT derived from other data by an interpolation algorithm (such as contouring, kriging, resampling, splining, etc.). (If they are, you will usually be better off comparing original data to original data, not mapped data to mapped data.) It does cope with deviations from assumptions (1) and (3) above, but not (2)—identical distributions. The way you deal with (2) is to map the differences between the two data sets. You might symbolize the differenced map cells by whether they are in the box, in a whisker, or outside the whiskers, using different colors for above and below the median. Look for clusters of points that are "outside" in one direction (positive or negative). If these clusters do not exist, and no strong patchiness appears in the map, then probably assumption (2) is not violated. Otherwise, your conclusion risks being based on some localized variations in data that do not reflect overall properties of the maps.
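The symbolization described above might be sketched like this in Python. It assumes the common convention that whiskers reach 1.5 times the box height beyond each quartile (the text does not specify a convention), and the difference values and function name are hypothetical.

```python
import statistics


def classify_differences(diffs):
    """Label each difference by box-plot zone and by side of the median."""
    q1, median, q3 = statistics.quantiles(diffs, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed whisker limits: 1.5 box heights beyond each quartile.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    labels = []
    for d in diffs:
        if q1 <= d <= q3:
            zone = "box"
        elif low_fence <= d <= high_fence:
            zone = "whisker"
        else:
            zone = "outside"
        # Values equal to the median are grouped with "below".
        side = "above" if d > median else "below"
        labels.append((zone, side))
    return labels


# Hypothetical differences for ten map cells; the last one is an outlier.
diffs = [-0.2, -0.1, 0.0, 0.1, 0.2, 0.3, -0.3, 0.05, -0.05, 5.0]
labels = classify_differences(diffs)
for d, (zone, side) in zip(diffs, labels):
    print(f"{d:6.2f}  {zone:8s} {side}")
```

Each (zone, side) pair would map to one color on the differenced map; clusters of "outside" cells on one side are the pattern to look for.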

I encourage you to introduce your readers to statistical graphics rather than statistical tests. Graphics are revealing; tests rarely are. An appropriate graphic for comparing two maps is a scatterplot. Only when the scatterplot looks like horizontal "white noise" would a t-test be truly appropriate for obtaining a statistical blessing for the conclusion you want to draw. What is particularly powerful—and this I think is the theme you have been pushing in many of your recent articles--is the coupling of the map and the scatterplot. This enables outlying (or otherwise interesting) points in the scatterplot to be mapped. When they fall into clusters or regions of interest, something useful has been revealed. A technique called "scatterplot brushing" is available in most high-end statistical software for simultaneously selecting interesting points in a scatterplot and observing them on a map. GISes like ArcView implement similar functionality by dynamically linking scatterplots to mapped data.

Invalid One-Third Rule

I can identify neither empirical nor statistical validity for the "one-third (33 percent) difference" rule of thumb you mention. Ultimately, the size of a difference that matters depends on the genesis of the mapped data and on the decision at hand. For example, if your data are two sets of overlapping digital elevations and the intended use is for air navigation, accepting a 33 percent difference in altitudes in the mountains would be perilous. If instead your data are interpolated maps of hydraulic permeability used to model groundwater flow, random differences of up to an order of magnitude (900 percent) are common and acceptable. Thus, no rule of thumb of this nature will be applicable except for clearly and narrowly defined situations. Maybe the 33 percent rule works in your practice for the kinds of data you assess, but it would be at best misleading to recommend it for any general purpose to the Beyond Mapping readers.

Interpreting Box-and-Whisker Plots

The statement about interpreting side-by-side box and whisker plots ("Generally speaking, if the boxes tend to align there isn't a significant difference between data groups") is true when each plot represents about 10 items. In your example this is not true. You have been unnecessarily cautious. Really, only a slight difference in medians (on the order of 1/20th of the height of a box) is needed for a significant difference—(assuming the data are independent, which they are not!). Thus, in your example, the lack of overlap of the BOXES of the plots (forget about the whiskers) indicates a STRONGLY significant difference in medians. What you might say is that non-overlapping boxes, when each plot represents 10 or more data items, usually indicate significant differences in median values.

I think it's great that you're using box-and-whisker plots (it's consistent with my recommendation to introduce readers to statistical graphics). It would be a real service to your readers to spend a little more time in the future with these plots, to explain in a little more detail what they mean and how to interpret them.

Unsatisfactory Sanctification by Statistics

I can't resist sharing with you an obscure item prompted by your final statement. It reminds me strongly of an interesting philosophical paper by John Tukey ("Data Analysis and Behavioral Science ...," from the Collected Works of John Tukey (1986) Wadsworth, Inc., Belmont, CA). Tukey lists "unwise statements which ... may be ... so obviously 'bad'" [that they don't even deserve to be written down]. He arranges them in a hierarchy. At the top is

"If it's messy, sweep it under the rug."

The next level includes

"The one and only proper use of statistics is for sanctification."

A variant of the latter could be your (tongue-in-cheek?) phrase "…leading the visually malleable with quantitative analysis." I think my recommendation to look at the data using an appropriate graphic, such as a scatterplot, fits squarely within this philosophical framework. DON'T rely on the "numbers" (statistics); LOOK at the data fairly and let them tell their own story.

Regards,
Bill

William Huber, Ph.D.
Quantitative Decisions
Merion Station, PA
http://members.home.net/whuber

_______________________________
More Thoughts…

Bill-- great stuff; your comments as always make a very helpful and interesting column supplement. In reviewing your review, a couple of thoughts come to mind…

First, while your arguments that the t-Test as a test is suspect at best (sounds like a mantra) are sound, the t-Statistic is an interesting metric unto itself. It "benchmarks" the relationship between the central tendencies of two data sets. In a sense, it quantifies what one is looking for in the box-and-whisker plots… how close the Means are considering their Standard Deviations and the size of the sample.

Quite possibly the substitution of Median and Quartile Range would produce a better comparison metric as they don't presume normally distributed data. I suggest the following as a generalized comparison metric (sort of a knock-off of the t-Statistic)…

   c-Metric = (MedA - MedB) / ((NA * QRA + NB * QRB) / (NA + NB - 2))
                     where Med is the Median, QR is the Quartile Range, and N is the number
                     of samples for data sets A and B

The metric is sensitive to the difference in typical values, data dispersions, number of data values and doesn't assume a normal distribution. Is this "thinking outside the box" too far out? Of course, the statistic would preclude the already shaky "testing" aspect of the t-Test statistic, but it might help folks formulate a quantitative "benchmark" that helps them objectively communicate what they see in the box-and-whisker plots.
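A minimal Python sketch of the proposed c-Metric follows. It assumes that "Quartile Range" means the interquartile range (third quartile minus first quartile) and that the denominator is grouped as (NA + NB - 2), by analogy with the pooled variance in the t-Statistic; the sample data and the function name are hypothetical.

```python
import statistics


def c_metric(a, b):
    """Median difference scaled by a sample-size-weighted quartile range."""
    qa = statistics.quantiles(a, n=4, method="inclusive")
    qb = statistics.quantiles(b, n=4, method="inclusive")
    qr_a = qa[2] - qa[0]  # assumed meaning of "Quartile Range": Q3 - Q1
    qr_b = qb[2] - qb[0]
    # Assumed grouping of the denominator: (NA*QRA + NB*QRB) / (NA + NB - 2)
    pooled = (len(a) * qr_a + len(b) * qr_b) / (len(a) + len(b) - 2)
    return (statistics.median(a) - statistics.median(b)) / pooled


# Hypothetical data sets A and B.
set_a = [1, 2, 3, 4, 5]
set_b = [3, 4, 5, 6, 7]
print(f"c-Metric = {c_metric(set_a, set_b):.2f}")
```

Note that the function fails with a division by zero when both quartile ranges are zero, so constant data sets would need separate handling.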

Secondly, I am confused by the first step in the quick and dirty approach for exploring data...

"a) Sample your map data based on a grid whose spacing is large enough to limit possible spatial correlation among data."

Isn't spatial correlation natural in many (most?) mapped variables (e.g., fir trees inherently grow on north slopes; no fir trees seem to grow in lakes; etc.)? How does one set a spacing that limits inherent spatial correlation among variables? I understand mechanisms for limiting spatial autocorrelation within a variable (based on the reach of a variogram), but can't think of a mechanism for limiting spatial correlation among the data layers.

Personally, I think the spatial autocorrelation and correlation encapsulated in a mapped data set (whether directly measured as in remotely sensed data or modeled through spatial interpolation of field samples) is not only useful but critical in assessing relationships. They characterize inherent properties of mapped variables that often are very well-portrayed in mapped data. But sampling discards this information for the sake of traditional statistical assumptions-- yielding a set of "independent" samples with small N's versus the original mapped variables with very large N's that track the geographic continuum of spatial autocorrelation and spatial variable correlation within the data layers.

If maps fail to track these conditions, then they are worthless and must simply be abstract renderings from the Geographer of Oz with no real-world reality. If the maps are erroneous or unregistered, how will sampling alleviate these problems? If the spatial autocorrelation and correlation are good estimates, how will sampling to gain spatial independence better represent the spatial relationships?

Joe

___________________________________

Letter to the Editor…

Dear Editor:

Joseph K. Berry is not correct when he states that "it's relatively safe to say that the larger the 't-Stat' value, the greater the difference between data groups" (Beyond Mapping, October 1999). The calculated t statistic depends not only upon the difference in means between two sample populations but upon sample variances of the populations and the sample size. A large calculated t statistic can, for example, be produced by a small difference in means if variances are sufficiently small or sample sizes are sufficiently large. The t statistic cannot be used to draw inferences about the magnitude, on either an absolute or proportional basis, of the difference in means between two populations.
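A short Python sketch makes the point concrete. It assumes two groups of equal size n that share one standard deviation, so the two-sample t reduces to a simple closed form; all values are hypothetical and the function name is invented for the example.

```python
import math


def two_sample_t(mean_a, mean_b, sd, n):
    """t for two groups of equal size n sharing standard deviation sd."""
    return (mean_a - mean_b) / (sd * math.sqrt(2 / n))


# The same hypothetical 0.1-unit difference in means in every case.
for sd, n in [(1.0, 10), (0.01, 10), (1.0, 100000)]:
    t = two_sample_t(10.1, 10.0, sd, n)
    print(f"sd = {sd:<5} n = {n:<7} t = {t:.2f}")
```

The difference in means never changes, yet shrinking the standard deviation or inflating the sample size raises t by a factor of about one hundred.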

The magnitude of the calculated t statistic tells us the probability of being wrong should we reject the hypothesis of no difference in means between the two populations. Should this probability be sufficiently low, we are justified in concluding that the means are different. This point is no cavil; rather, proper interpretation of test statistics is at the heart of our understanding of statistical hypothesis testing and inference.

Dr. Berry gets the interpretation correct in the sentence following the one I quoted and would have been better off to have restricted the discussion to that point.

Wayne Richter
Research Associate
Department of Biology
Skidmore College
Saratoga Springs, New York 12866
wrichter@skidmore.edu

Response:

You're right… my statement was far too general. It was made in the context of the data under discussion (inside/outside of a partition of the data defining a single soil nutrient surface map with comparable N's and similar means/variances). I agree that the statement in the "sentence following" is more useful. However, a later statement is critical… "While the t-test example might serve as a reasonable instance of 'blindly applying' non-spatial, statistical tests to mapped data, it suggests this approach is a bit shaky as it seldom provides a reliable test like it does in traditional, non-spatial statistics (see author's notes)."

Interested readers should check out the October, 1999 column supplement at www.innovativegis.com/basis where additional concerns about statistical tests are raised, as well as the Excel spreadsheet containing the calculations used for all three approaches discussed in the column. The spatially dependent approaches of Percent Difference and Surface Configuration are preferable for comparing map surfaces... and they generate "really cool" maps, as well as really useful statistical summaries and indices.

Joe Berry

_______________________________

Anyone else out there in cyberland want to comment? Send comments to jberry@innovativegis.com.