A SOUND SECOND OPINION…
The idea of using traditional statistics as a broad comparison measure is controversial (at best). If a definitive test for "statistically significant differences" is implied, the metrics fail on several fronts, as aptly discussed below in William Huber's review of the column "Comparing Map Surfaces" (GeoWorld, December 1999).
_________________________
The t-Test is Inappropriate
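Let's begin with the easy stuff. The t-test as described in your column is not usually appropriate. To relate the t-statistic to a p-value you need to verify many preconditions, almost none of which hold for maps generally:
(1) the data values are independent of one another;
(2) the data are identically distributed (no changes in distribution over space);
(3) the data are normally distributed.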
Second, in the rare event these conditions do hold, your readers should consider using a pairwise comparison test because it will be much more incisive. Your Table 1 shows computations for an unpaired comparison, which is inferior in its ability to identify real differences. You are allowed to construct any pairing you want for a t-test; the only requirement is that you have the same quantity of numbers in each sample. However, the pairing must be made independently of the data values themselves. For example, a random pairing will do, but pairing the ordered results will not. A pairing based on other information, such as a common X,Y location, is also ok. An extension of this idea is to pair data based on proximity when there is not a common set of sample locations. GISes make such operations easy to do. (And please--this is a nitpick--avoid reporting Excel's P-values when they're so ridiculously low. Whenever an Excel P-value is less than about 10^-6 it is unlikely to be more than a crude approximation. Consider replacing those P-values with something like "<0.0001%".)
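The gain from pairing can be illustrated with a minimal sketch. The six values per map below are hypothetical, the pairing is by shared X,Y location, and the unpaired form shown is the pooled-variance t statistic (an assumption about what the column's Table 1 computed). Only the standard library is used.

```python
import math
import statistics

def unpaired_t(a, b):
    """Pooled-variance two-sample t statistic (ignores any pairing)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb))

def paired_t(a, b):
    """Paired t statistic: a one-sample t on the location-by-location differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical values read from two maps at six shared X,Y locations.
map_a = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
map_b = [11.0, 21.5, 31.0, 41.5, 51.0, 61.5]

t_unpaired = unpaired_t(map_a, map_b)
t_paired = paired_t(map_a, map_b)
print(f"unpaired t = {t_unpaired:.2f}, paired t = {t_paired:.2f}")
```

Because the location-to-location variation is large compared with the small, consistent offset between the maps, the unpaired statistic stays small in magnitude while the paired statistic is large.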
Because mapped data usually exhibit strong spatial dependence, the closest you can come to a correct statistical test would have to be a geostatistical one: a block-kriged mean of the map area, together with the kriging standard deviation, are the statistics you are looking for. The kriging algorithms will not give you degrees of freedom, but there are ways to approximate them for use in a modified t-test. You could also eschew the t-test altogether and conclude that the block-kriged mean M of the map differences, having kriging standard deviation of S, is significantly different from zero whenever |M| > Z*S, where Z, as usual, is defined relative to the standard normal distribution by Prob(x > Z or x < -Z) = 5% or 1% or whatever significance level you want. This would be the geostatistical analog of the paired-comparison t-test. It copes with deviations from the first assumption above (data independence) but does not help you with deviations from the second and third (changes in distribution over space and departures from normal distribution).
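The |M| > Z*S decision rule can be sketched as follows. The block-kriged mean M and kriging standard deviation S are assumed to come from whatever geostatistical package is in use; this sketch does not compute them, and the function name is hypothetical.

```python
from statistics import NormalDist

def differs_from_zero(m, s, alpha=0.05):
    """Two-sided check of |M| > Z*S, with Z taken from the standard normal.

    m: block-kriged mean of the map differences (supplied by the caller)
    s: kriging standard deviation of that mean (supplied by the caller)
    alpha: significance level, e.g. 0.05 or 0.01
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Prob(x > Z or x < -Z) = alpha
    return abs(m) > z * s

# Hypothetical values: M = 2.5, S = 1.0
print(differs_from_zero(2.5, 1.0, alpha=0.05))
print(differs_from_zero(2.5, 1.0, alpha=0.01))
```

With these made-up numbers the difference passes at the 5% level (Z is about 1.96) but not at the 1% level (Z is about 2.58).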
For those looking for something quick and dirty (speed and simplicity are helpful when you’re exploring data), you might try these steps:
a) Sample your map data based on a grid whose spacing is large enough to limit possible spatial correlation among data.
b) View a scatterplot of the map data at the sample points.
c) If the scatterplot looks like a horizontal elliptical cloud, construct the box-and-whisker plot of the point-by-point differences in the data.
d) If the box in that plot does not overlap zero, conclude there is a significant difference (assuming there are ten or more sample points).
e) If the median in the box is large enough to be of concern, conclude that the difference is meaningful.
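Steps (a), (c) and (d) above can be sketched in a few lines. This is only an illustration under stated assumptions: the two 12 x 12 grids are made up, the sampling spacing of 3 cells is arbitrary (in practice it would be chosen from knowledge of the data's spatial correlation), and the quartiles use Python's "inclusive" method. Step (b), viewing the scatterplot, and step (e), judging whether the median matters, remain with the analyst.

```python
import statistics

def sample_grid(grid, spacing):
    """Step (a): keep every `spacing`-th cell along rows and columns."""
    return [value for row in grid[::spacing] for value in row[::spacing]]

def box_of_differences(a, b):
    """Step (c): lower quartile, median, upper quartile of point-by-point differences."""
    diffs = [x - y for x, y in zip(a, b)]
    q1, med, q3 = statistics.quantiles(diffs, n=4, method="inclusive")
    return q1, med, q3

def box_excludes_zero(q1, q3):
    """Step (d): True when the box (q1 to q3) does not overlap zero."""
    return q1 > 0 or q3 < 0

# Hypothetical maps: map_b sits 2 to 3 units below map_a everywhere.
map_a = [[10 + r + 2 * c for c in range(12)] for r in range(12)]
map_b = [[map_a[r][c] - 2 - (r % 2) for c in range(12)] for r in range(12)]

samples_a = sample_grid(map_a, spacing=3)
samples_b = sample_grid(map_b, spacing=3)
q1, median, q3 = box_of_differences(samples_a, samples_b)

# The rule of thumb assumes ten or more sample points.
significant = len(samples_a) >= 10 and box_excludes_zero(q1, q3)
print(len(samples_a), q1, median, q3, significant)
```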
This prescription assumes that your map data are NOT derived from other data by
an interpolation algorithm (such as contouring, kriging, resampling, splining,
etc.). (If they are, you will usually be better off comparing original
data to original data, not mapped data to mapped data.) It does cope with deviations from
assumptions (1) and (3) above, but not (2)—identical distributions. The way
you deal with (2) is to map the differences between the two data sets.
You might symbolize the differenced map cells by whether they are in the box,
in a whisker, or outside the whiskers, using different colors for above and
below the median. Look for clusters of points that are
"outside" in one direction (positive or negative). If these
clusters do not exist, and no strong patchiness appears in the map, then
probably assumption (2) is not violated. Otherwise, your conclusion risks
being based on some localized variations in data that do not reflect overall
properties of the maps.
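One way to produce the classes for symbolizing a differenced map is sketched below. It assumes the common Tukey convention that whiskers reach at most 1.5 box-heights beyond the box (the text does not specify a whisker rule), and the difference values are hypothetical. Each cell gets a zone (box, whisker, outside) and a side relative to the median, which can then be mapped to colors in a GIS.

```python
import statistics

def classify_differences(diffs, whisker=1.5):
    """Label each difference by box-plot zone and by side of the median.

    The whisker length of 1.5 box-heights is an assumed convention.
    """
    q1, med, q3 = statistics.quantiles(diffs, n=4, method="inclusive")
    reach = whisker * (q3 - q1)
    low_fence, high_fence = q1 - reach, q3 + reach
    labels = []
    for d in diffs:
        if q1 <= d <= q3:
            zone = "box"
        elif low_fence <= d <= high_fence:
            zone = "whisker"
        else:
            zone = "outside"
        side = "above" if d > med else "below" if d < med else "at"
        labels.append((zone, side))
    return labels

# Hypothetical cell-by-cell differences between two maps.
cell_diffs = [-0.5, -0.2, 0.0, 0.1, 0.3, 0.4, 0.6, 5.0]
labels = classify_differences(cell_diffs)
for d, label in zip(cell_diffs, labels):
    print(d, label)
```

Clusters of neighboring cells labeled ("outside", "above") or ("outside", "below") are the pattern to look for on the resulting map.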
I encourage you to introduce your readers to statistical graphics rather than statistical tests. Graphics are revealing; tests rarely are. An appropriate graphic for comparing two maps is a scatterplot. Only when the scatterplot looks like horizontal "white noise" would a t-test be truly appropriate for obtaining a statistical blessing for the conclusion you want to draw. What is particularly powerful (and this I think is the theme you have been pushing in many of your recent articles) is the coupling of the map and the scatterplot. This enables outlying (or otherwise interesting) points in the scatterplot to be mapped. When they fall into clusters or regions of interest, something useful has been revealed. A technique called "scatterplot brushing" is available in most high-end statistical software for simultaneously selecting interesting points in a scatterplot and observing them on a map. GISes like ArcView implement similar functionality by dynamically linking scatterplots to mapped data.
Invalid One-Third Rule
I can identify neither empirical nor statistical validity for the "one-third
(33 percent) difference" rule of thumb you mention. Ultimately, the
size of a difference that matters depends on the genesis of the mapped data and
on the decision at hand. For example, if your data are two sets of
overlapping digital elevations and the intended use is for air navigation,
accepting a 33 percent difference in altitudes in the mountains would be
perilous. If instead your data are
interpolated maps of hydraulic permeability used to model groundwater flow,
random differences of up to an order of magnitude (900 percent) are common and
acceptable. Thus, no rule of thumb of
this nature will be applicable except for clearly and narrowly defined
situations. Maybe the 33 percent rule works in your practice for the
kinds of data you assess, but it would be at best misleading to recommend it
for any general purpose to the Beyond Mapping readers.
Interpreting Box-and-Whisker Plots
The statement about interpreting side-by-side box-and-whisker plots ("Generally speaking, if the boxes tend to align there isn't a significant difference between data groups") is true when each plot represents about 10 items. Your example involves far more items than that, so you have been unnecessarily cautious. Really, only a slight difference in medians (on the order of 1/20th of the height of a box) is needed for a significant difference (assuming the data are independent, which they are not!). Thus, in your example, the lack of overlap of the BOXES of the plots (forget about the whiskers) indicates a STRONGLY significant difference in medians. What you might say is that non-overlapping boxes, when each plot represents 10 or more data items, usually indicate significant differences in median values.
I think it's great that you're using box-and-whisker plots (it's consistent with
my recommendation to introduce readers to statistical graphics). It would
be a real service to your readers to spend a little more time in the future
with these plots, to explain in a little more detail what they mean and how to
interpret them.
Unsatisfactory Sanctification by Statistics
I can't resist sharing with you an obscure item prompted by your final
statement. It reminds me strongly of an interesting philosophical paper
by John Tukey ("Data Analysis and Behavioral Science ...," from the
Collected Works of John Tukey (1986) Wadsworth, Inc., Belmont, CA). Tukey
lists "unwise statements which ... may be ... so obviously 'bad'"
[that they don't even deserve to be written down]. He arranges them
in a hierarchy. At the top is
"If it's messy, sweep it under the rug."
The next level includes
"The one and only proper use of statistics is for sanctification."
A variant of the latter could be your (tongue-in-cheek?) phrase "…leading
the visually malleable with quantitative analysis." I think my
recommendation to look at the data using an appropriate graphic, such as a
scatterplot, fits squarely within this philosophical framework. DON'T
rely on the "numbers" (statistics); LOOK at the data fairly and let
them tell their own story.
Regards,
Bill
William Huber, Ph.D.
Quantitative Decisions
Merion Station, PA
http://members.home.net/whuber
_______________________________
More Thoughts…
Bill-- great stuff; your comments as always make a very helpful and
interesting column supplement. In reviewing your review, a couple of
thoughts come to mind…
First, while your arguments that
the t-Test as a test is suspect at best (sounds like a mantra) are
sound, the t-Statistic is an interesting metric unto
itself. It "benchmarks" the
relationship between the central tendencies of two data sets. In a
sense, it quantifies what one is looking for in the box-and-whisker plots… how
close the Means are considering their Standard Deviations and the size of the
sample.
Quite possibly the substitution
of Median and Quartile Range would produce a better comparison metric as they
don't presume normally distributed data.
I suggest the following as a generalized comparison metric (sort of a
knock-off of the t-Statistic)…
c-Metric = (MedA - MedB) / ((NA * QRA + NB * QRB) / (NA + NB - 2))
where Med is the
Median, QR is the Quartile Range, and N is the number
of samples for data
sets A and B
The metric is sensitive to
the difference in typical values, the data dispersions, and the number of data
values, and it doesn't assume a normal distribution.
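A minimal sketch of the suggested c-Metric follows. It reads the denominator as (NA * QRA + NB * QRB) / (NA + NB - 2), by analogy with the pooled t-Statistic (an assumption about the intended grouping), and takes "Quartile Range" to mean the upper quartile minus the lower quartile, computed with Python's "inclusive" quantile method. The function name and sample values are hypothetical.

```python
import statistics

def c_metric(a, b):
    """Median difference scaled by a sample-size-weighted quartile range.

    Assumes Quartile Range = Q3 - Q1 and the pooled-style denominator
    (NA*QRA + NB*QRB) / (NA + NB - 2).
    """
    q1a, med_a, q3a = statistics.quantiles(a, n=4, method="inclusive")
    q1b, med_b, q3b = statistics.quantiles(b, n=4, method="inclusive")
    na, nb = len(a), len(b)
    pooled_range = (na * (q3a - q1a) + nb * (q3b - q1b)) / (na + nb - 2)
    return (med_a - med_b) / pooled_range

# Hypothetical data sets A and B
data_a = [1, 2, 3, 4, 5]
data_b = [3, 4, 5, 6, 7]
print(c_metric(data_a, data_b))
```

For these values the medians are 3 and 5, both quartile ranges are 2, and the weighted denominator is 20 / 8 = 2.5, so the metric works out to -0.8.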
Is this "thinking outside the box" too far out? Of course, the statistic would preclude the
already shaky "testing" aspect of the t-Test statistic, but
it might help folks formulate a quantitative "benchmark"
that helps them objectively communicate what they see in the
box-and-whisker plots.
Secondly, I am confused
by the first step in the quick and dirty approach for exploring data...
"a) Sample your
map data based on a grid whose spacing is large enough to limit possible
spatial correlation among data."
Isn't spatial correlation
natural in many (most?) mapped variables (e.g., fir trees inherently grow on
north slopes; no fir trees seem to grow in lakes; etc.)? How
does one set a spacing that limits inherent spatial correlation among
variables? I understand mechanisms for limiting spatial autocorrelation
within a variable (based on the reach of a variogram), but can't think of a
mechanism for limiting spatial correlation among the data layers.
Personally, I think the
spatial autocorrelation and correlation encapsulated in a mapped data set
(whether directly measured as in remotely sensed data or modeled through
spatial interpolation of field samples) is not only useful but critical in
assessing relationships. They
characterize inherent properties of mapped variables that often are very
well-portrayed in mapped data. But
sampling discards this information for the sake of traditional statistical
assumptions-- yielding a set of "independent" samples with small N's
versus the original mapped variables with very large N's that track the
geographic continuum of spatial autocorrelation and spatial variable
correlation within the data layers.
If maps fail to track
these conditions, then they are worthless and must simply be abstract
renderings from the Geographer of Oz with no real-world reality. If the maps are erroneous or unregistered,
how will sampling alleviate these problems?
If the spatial autocorrelation and correlation are good estimates, how
will sampling to gain spatial independence better represent the spatial
relationships?
Joe
___________________________________
Letter to the Editor…
Dear Editor:
Joseph K. Berry is not correct when he states that "it's relatively safe to say
that the larger the 't-Stat' value, the greater the difference between data
groups" (Beyond Mapping, October 1999). The calculated t statistic
depends not only upon the difference in means between two sample populations
but upon sample variances of the populations and the sample size. A large
calculated t statistic can, for example, be produced by a small difference in
means if variances are sufficiently small or sample sizes are sufficiently
large. The t statistic cannot be used to draw inferences about the magnitude,
on either an absolute or proportional basis, of the difference in means
between two populations.
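The point can be checked numerically with a small sketch. The data are hypothetical, and the unequal-variance (Welch) form of the t statistic is used here for simplicity; the letter does not say which form was calculated.

```python
import math
import statistics

def t_stat(a, b):
    """Two-sample t statistic, unequal-variance (Welch) form."""
    std_err = math.sqrt(statistics.variance(a) / len(a)
                        + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err

# Small difference in means (about 0.1) with very small variances.
tight_a = [10.00, 10.01, 10.02, 10.01, 10.00, 10.02]
tight_b = [10.10, 10.11, 10.12, 10.11, 10.10, 10.12]

# Large difference in means (10) with large variances.
loose_a = [5.0, 20.0, 35.0, 10.0, 40.0, 25.0]
loose_b = [15.0, 30.0, 45.0, 20.0, 50.0, 35.0]

t_tight = t_stat(tight_a, tight_b)
t_loose = t_stat(loose_a, loose_b)
print(f"small difference, small variance: t = {t_tight:.1f}")
print(f"large difference, large variance: t = {t_loose:.1f}")
```

The first pair differs in means by only about 0.1 yet yields the larger t statistic in magnitude, while the second pair differs by 10 and yields a much smaller one.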
The magnitude of the calculated t statistic tells us the probability of being
wrong should we reject the hypothesis of no difference in means between the two
populations. Should this probability be sufficiently low, we are justified in
concluding that the means are different. This point is no cavil; rather, proper
interpretation of test statistics is at the heart of our understanding of
statistical hypothesis testing and inference.
Dr. Berry gets the interpretation correct in the sentence following the one I
quoted and would have been better off to have restricted the discussion to that
point.
Wayne Richter
Research Associate
Department of Biology
Skidmore College
Saratoga Springs, New York 12866
wrichter@skidmore.edu
Response:
You're right… my statement was far too general. It was made in context of the data under discussion (inside/outside of a partition of the data defining a single soil nutrient surface map with comparable N's and similar means/variances).
I agree that the statement in the "sentence following" is more
useful. However, a later statement is
critical… "While the t-test example might
serve as a reasonable instance of 'blindly applying' non-spatial,
statistical tests to mapped data, it suggests this approach is a bit shaky as
it seldom provides a reliable test like it does in traditional, non-spatial
statistics (see author's notes)."
Interested readers should check out the October 1999 column supplement at www.innovativegis.com/basis
where additional concerns about statistical tests are raised, as well as the
Excel spreadsheet containing the calculations used for all three approaches
discussed in the column. The spatially
dependent approaches of Percent Difference and Surface Configuration are
preferable for comparing map surfaces... and they generate "really
cool" maps, as well as really useful statistical summaries and indices.
Joe Berry
_______________________________
Anyone else out there in cyberland want to comment? Send comments to jberry@innovativegis.com.