Topic 10 – Spatial Data Mining
Map Analysis book/CD
Statistically Compare Discrete Maps — discusses procedures for comparing discrete maps
Statistically Compare Continuous Map Surfaces — discusses procedures for comparing continuous map surfaces
Geographic Software Removes Guesswork from Map Similarity — discusses basic considerations and procedures for generating similarity maps
Use Similarity to Identify Data Zones — describes level-slicing for classifying areas into zones containing a specified data pattern
Use Statistics to Map Data Clusters — discusses clustering for partitioning an area into separate data groups
Spatial Data Mining “Down on the Farm” — discusses process for moving from Whole-Field to Site-Specific management
Further Reading — ten additional sections organized into three parts
Statistically Compare Discrete Maps
(GeoWorld, July 2006)
One of the most fundamental techniques in map analysis is the comparison of two maps. Questions like…
“…how different are they?”
“…how are they different?”
“…where are they different?”
…immediately spring to mind. Quantitative answers are needed because visual comparison cannot fully consider all of the spatial detail in an objective manner.
The two maps shown in figure 1 identify crop yield for successive seasons (1997 and 1998) on the central-pivot cornfield. Note that the maps have a common legend from 0 to 300 bushels per acre grouped into five 60bu contour intervals. How different are the maps? How are they different? And where are they different? While your eyes flit back and forth in an attempt to visually compare the maps, the computer approaches the problem much more methodically.
Figure 1. Discrete yield maps for consecutive years.
Precision Agriculture application, thousands of point values for yield are collected “on-the-fly” as a
The next step, as shown in figure 2, combines the two maps into a single map indicating the “joint condition” for both years. Since the two maps have an identical grid configuration, the computer simply retrieves the two class assignments for a grid location, and then converts them to a single number by computing the “first value times ten plus the second value” to form a two-digit code. In the example shown in the figure, the value “forty-three” is interpreted as class 4 in the first year but decreasing to class 3 in the next year.
The final step sums up the changes to generate the Coincidence Table shown in figure 3. The columns and rows in the table represent the class assignments on the 1997 and 1998 yield maps, respectively. The body of the table reports the number of cells for each joint condition. For example, column 4 and row 3 notes that there are 905 occurrences where the yield class slipped from level four (180-240bu/ac) to level three (120-180bu/ac).
Figure 2. Coincidence map identifying the joint conditions for both years.
Figure 3. Coincidence summary table.
The off-diagonal entries indicate changes between the two maps—the values indicate the relative importance of the change. For example, the 905 count for the “four-three” change is the largest and therefore identifies the most frequently occurring change in the field. The 0 statistic for the “four-one” combination indicates that level four never slipped all the way to level 1. Since the sum of the values above the diagonal (1224) is much larger than those below the diagonal (215), it clearly indicates that the downgrading of the yield classes dominates the change occurring in the field.
The diagonal entries summarize the agreement between the two maps. Generally speaking, the maps are very different as only a little more than half the field didn’t change (40+144+1648+18= 1850; 1850/3289= 56.25%). The greatest portion of the field that didn’t change occurs for yield class 3 (“three-three” with 1648 out of 1895 cells). The greatest difference occurred for class 4 (“four-four” with only 18 unchanged cells, or 18/3289= 0.55% of the field). The statistics in the table are simply summaries of the detailed spatial patterns of change depicted in the coincidence map shown in figure 2.
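The coincidence procedure described above can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch, not the software used for the figures: the tiny 3x4 grids, the function name to_class and the rule that a value on an interval boundary goes to the higher class are all assumptions made for the example.

```python
import numpy as np

# Hypothetical 3x4 yield grids (bu/ac); the field in the text has 3289 cells.
yield_97 = np.array([[ 45.0, 130.0, 200.0, 190.0],
                     [100.0, 150.0, 185.0, 170.0],
                     [ 30.0, 125.0, 210.0, 160.0]])
yield_98 = np.array([[ 50.0, 140.0, 170.0, 195.0],
                     [ 70.0, 155.0, 150.0, 175.0],
                     [ 20.0,  90.0, 165.0, 150.0]])

def to_class(surface, width=60.0, n_classes=5):
    """Assign classes 1..n_classes from equal-width intervals (0-60 -> 1, 60-120 -> 2, ...)."""
    classes = np.floor(surface / width).astype(int) + 1
    return np.clip(classes, 1, n_classes)  # a value of exactly 300 stays in class 5

class_97 = to_class(yield_97)
class_98 = to_class(yield_98)

# Joint condition: "first value times ten plus the second value".
coincidence = class_97 * 10 + class_98

# Coincidence table: columns are 1997 classes, rows are 1998 classes.
n = 5
table = np.zeros((n, n), dtype=int)
np.add.at(table, (class_98.ravel() - 1, class_97.ravel() - 1), 1)

unchanged = np.trace(table)             # diagonal: same class in both years
declined = np.triu(table, k=1).sum()    # above diagonal: 1998 class lower than 1997
improved = np.tril(table, k=-1).sum()   # below diagonal: 1998 class higher than 1997
percent_unchanged = 100.0 * unchanged / table.sum()
```

With real data, the same three sums correspond to the diagonal, above-diagonal and below-diagonal totals discussed for the Coincidence Table.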
That’s a lot more meat in the answers to the basic map comparison questions (how much, how and where) than visceral viewing and eye-flickering impressions can provide. The next section focuses on even more precise procedures for quantifying differences using two continuous map surfaces.
Statistically Compare Continuous Map Surfaces
(GeoWorld, November 2006)
Contour maps are the most frequently used and familiar form of presenting precision agriculture data. The two 3D perspective-plots in the top of figure 1 show the color-coded ranges of yield (0-60, 60-120, etc. bushels per acre) and are identical to the discrete maps discussed in the previous section. The color-coding of the contours is draped for cross-reference onto the continuous 3D surface of the actual yield data.
Note the “spikes and pits” in the surfaces that graphically portray the variance in yield data for each of the contour intervals. While discrete map comparison identifies shifts in broadly defined yield classes, continuous surface comparison identifies the precise difference at each location.
For example, a yield value of 179 bushels on one map and 121 on the other are both assigned to the third contour interval (120 to 180; yellow). A discrete map comparison would suggest that no change in yield occurred for the location because the contour interval remained constant. A continuous surface comparison would report a fairly significant 58-bushel decline.
Figure 1. 3-D Views of yield surfaces for consecutive years.
Figure 2 shows the calculations using the actual values for the same location highlighted in the previous section’s discussion. The discrete map comparison reported a decline from yield level 4 (180 to 240) to level 3 (120 to 180).
The MapCalc command, “Compute Yield_98 minus Yield_97 for Difference” generates the difference surface. If the simple “map algebra” equation is expanded to “Compute (((Yield_98 minus Yield_97) / Yield_97) *100)” a percent difference surface would be generated. Keep in mind that a map surface is merely a spatially organized set of numbers that awaits detailed analysis then transformation to generalized displays and reports for human consumption.
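For readers without MapCalc, the same two map algebra equations can be approximated with NumPy arrays. The sketch below is an assumption-laden stand-in, not the MapCalc implementation: the 2x2 grids are made up, and setting the percent difference to NaN where the 1997 yield is zero is a choice made here to avoid dividing by zero.

```python
import numpy as np

# Hypothetical 2x2 yield surfaces (bu/ac); the first cell repeats the 179 vs. 121 example.
yield_97 = np.array([[179.0, 200.0],
                     [150.0,   0.0]])
yield_98 = np.array([[121.0, 210.0],
                     [120.0,  40.0]])

# Difference surface: Yield_98 minus Yield_97, computed cell by cell.
difference = yield_98 - yield_97

# Percent difference surface: ((Yield_98 - Yield_97) / Yield_97) * 100.
# Cells with a 1997 yield of zero are left as NaN (an assumption of this sketch).
percent = np.full_like(difference, np.nan)
np.divide(difference * 100.0, yield_97, out=percent, where=yield_97 != 0)
```

The first cell shows the point made earlier: both values fall in the same 120 to 180 contour interval, yet the difference surface records a 58-bushel decline.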
In figure 2, note that the wildest differences (side-by-side green spikes and red pits) occur at the field edges and along the access road—from an increase of 165 bushels to a decrease of 191 bushels between the two harvests.
Figure 2. A difference surface identifies the actual change in crop yield at each map location.
However, notice that most of the change is about a 25-bushel decline (mean= -22.6; median= -26.3) as identified in the summary table shown in the upper-right portion of figure 3.
The continuous surface comparison precisely reports the change for the example location as negative 38.1 bushels. The differences for all 3,289 grid cells are computed to derive a Difference Surface that tracks the subtle variations in the spatial pattern of the changes in yield.
The histogram of the yield differences in the figure shows the numerical distribution of the difference data. Note that it is approximately normally distributed and that the bulk of the data is centered about a 25-bushel decline. The vertical lines in the histogram locate the contour intervals used in the 2D display of the difference map in the left portion of figure 3.
Figure 3. A 2-D map and statistics summarize the differences in crop yield between two periods.
The detailed legend links the color-coding of the map intervals to some basic frequency statistics. The example location with the calculated decline of –38.1 is assigned to the –39 to –30 contour range and is displayed as a mid-range red tone. The display uses an Equal Count method with seven intervals, each representing approximately 15% of the field. Green is locked for the only interval of increased yield. The decreased yield intervals form a color-gradient from yellow to red. All in all, surface map comparison provides more information in a more effective manner than discrete map comparison. Both approaches, however, are far superior to simply viewing a couple of yield maps side-by-side and guessing at the magnitude and pattern of the changes.
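An Equal Count legend of the kind described above can be approximated with quantiles. The sketch below uses synthetic difference values (random numbers centered on a 25-bushel decline), so the break points it produces are not those in figure 3. The exact rule the display software uses to place its breaks is not stated, so quantile breaks are an assumption.

```python
import numpy as np

# Synthetic stand-in for the difference surface: 3289 values centered on a 25 bu decline.
rng = np.random.default_rng(0)
difference = rng.normal(loc=-25.0, scale=30.0, size=3289)

# Equal Count: choose break points so each interval holds about the same number of cells.
n_intervals = 7
breaks = np.quantile(difference, np.linspace(0.0, 1.0, n_intervals + 1))

# Assign each cell an interval index 0..6 using the six interior break points.
interval = np.searchsorted(breaks[1:-1], difference, side="right")
counts = np.bincount(interval, minlength=n_intervals)
share = 100.0 * counts / counts.sum()  # each entry is close to 100/7, about 14.3 percent
```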
The ability to quantitatively evaluate continuous surfaces is fundamental to precision agriculture. A difference surface is one of the simplest and most intuitive forms. While the math and stat of other procedures are fairly basic, the initial thought of “you can’t do that to a map” is usually a reflection of our non-spatial statistics and paper-map legacies. In most instances, precision agriculture is simply an extension of current research and management practices from a few sample plots to extensive mapped data sets. The remainder of this case study investigates some of these extensions.
Geographic Software Removes Guesswork from Map Similarity
(GeoWorld, October 2001)
often have you seen a
But just how similar is one location to another? Really similar, or just a little similar? And just how dissimilar are all of the other areas? While visceral analysis can identify broad relationships, it takes a quantitative map analysis approach to handle the detailed scrutiny demanded in site-specific management.
Figure 1. Map surfaces identifying the spatial distribution of P, K and N throughout a field.
Consider the three maps shown in figure 1. What areas identify similar patterns? If you focus your attention on a location in the southeastern portion, how similar are all of the other locations?
The answers to these questions are much too complex for visual analysis and certainly beyond the geo-query and display procedures of standard desktop mapping packages.
While the data in the example shows the relative amounts of phosphorus, potassium and nitrogen throughout a cornfield, it could as easily be demographic data representing income, education and property values. Or sales data tracking three different products. Or public health maps representing different disease incidences. Or crime statistics representing different types of felonies or misdemeanors.
Regardless of the data and application arena, the map-ematical procedure for assessing similarity is the same. In visual analysis you move your eye among the maps to summarize the color assignments at different locations. The difficulty in this approach is two-fold— remembering the color patterns and calculating the difference. The map analysis procedure does the same thing except it uses map values in place of the colors. In addition, the computer doesn’t tire as easily and completes the comparison for all of the locations throughout the map window (3289 in this example) in a couple seconds.
Figure 2. Conceptually linking geographic space and data space.
The upper-left portion of figure 2 illustrates capturing the data patterns of two locations for comparison. The “data spear” at map location 45column, 18row identifies the P-level as 11.0ppm, the K-level as 177.0 and the N-level as 32.9. This step is analogous to your eye noting a color pattern of burnt-red, dark-orange and light-green. The other location for comparison (32c, 62r) has a data pattern of P= 53.2, K= 412.0 and N= 27.9. Or as your eye sees it, a color pattern of dark-green, dark-green and yellow.
The right side of figure 2 conceptually depicts how the computer calculates a similarity value for the two response patterns. The realization that mapped data can be expressed in both geographic space and data space is key to understanding the procedure.
Geographic space uses coordinates, such as latitude and longitude, to locate things in the real world, such as the southeast and extreme north points identified in the example. The geographic expression of the complete set of measurements depicts their spatial distribution in familiar map form.
Data space, on the other hand, is a bit less familiar. While you can’t stroll through data space you can conceptualize it as a box with a bunch of balls floating within it. In the example, the three axes defining the extent of the box correspond to the P, K and N levels measured in the field. The floating balls represent grid cells defining the geographic space—one for each grid cell. The coordinates locating the floating balls extend from the data axes—11.0, 177.0 and 32.9 for the comparison point. The other point has considerably higher values in P and K with slightly lower N (53.2, 412.0, 27.9) so it plots at a different location in data space.
The bottom line is that the position of any point in data space identifies its numerical pattern—low, low, low is in the back-left corner, while high, high, high is in the upper-right corner. Points that plot in data space close to each other are similar; those that plot farther away are less similar.
In the example, the floating ball closest to you is the farthest one (least similar) from the comparison point. This distance becomes the reference for “most different” and sets the bottom value of the similarity scale (0%). A point with an identical data pattern plots at exactly the same position in data space resulting in a data distance of 0 that equates to the highest similarity value (100%).
The similarity map shown in figure 3 applies the similarity scale to the data distances calculated between the comparison point and all of the other points in data space. The green tones indicate field locations with fairly similar P, K and N levels. The red tones indicate dissimilar areas. It is interesting to note that most of the very similar locations are in the western portion of the field.
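The data-distance idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce figure 3: the function name similarity_map is hypothetical, the distance is taken to be Euclidean, and each layer is standardized first so that K (in the hundreds) does not swamp P and N. The 2x2 grids reuse the two data patterns quoted in the text plus one made-up intermediate cell.

```python
import numpy as np

def similarity_map(stack, row, col, weights=None):
    """Percent similarity of every cell to the cell at (row, col).

    stack has shape (layers, rows, cols). The farthest cell in data space
    scores 0 and an identical data pattern scores 100.
    """
    layers = stack.shape[0]
    # Standardize each layer (an assumption; the text does not specify the scaling).
    mean = stack.mean(axis=(1, 2), keepdims=True)
    std = stack.std(axis=(1, 2), keepdims=True)
    z = (stack - mean) / np.where(std == 0, 1.0, std)

    w = np.ones(layers) if weights is None else np.asarray(weights, dtype=float)
    target = z[:, row, col].reshape(layers, 1, 1)
    # Weighted Euclidean distance in data space, computed for every cell at once.
    distance = np.sqrt((w.reshape(layers, 1, 1) * (z - target) ** 2).sum(axis=0))

    farthest = distance.max()
    if farthest == 0:  # every cell has the same data pattern
        return np.full(distance.shape, 100.0)
    return 100.0 * (1.0 - distance / farthest)

# Hypothetical 2x2 layers built around the two patterns in the text.
p = np.array([[11.0, 53.2], [11.0, 30.0]])
k = np.array([[177.0, 412.0], [177.0, 300.0]])
n = np.array([[32.9, 27.9], [32.9, 30.0]])
stack = np.array([p, k, n])

similarity = similarity_map(stack, row=0, col=0)
```

The optional weights argument reflects the idea that layers can be weighted by relative importance; any number of layers can be stacked, not just three.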
Figure 3. A similarity map identifying how related locations are to a given point.
A similarity map can be an invaluable tool for investigating spatial patterns in any complex set of mapped data. While humans are unable to conceptualize more than three variables (the data space box), a similarity index can handle any number of input maps. The different layers can be weighted to reflect relative importance in determining overall similarity.
In effect, a similarity map replaces a lot of laser-pointer waving and subjective suggestions of similar/dissimilar areas with a concrete, quantitative measurement at each map location. The technique moves map analysis well beyond the old “I’d never have seen it if I hadn’t believed it” mode of cartographic interpretation.
Use Similarity to Identify Data Zones
(GeoWorld, November 2001)
The previous discussion introduced the concept of “data distance” as a means to measure similarity within a map. One simply mouse-clicks a location and all of the other locations are assigned a similarity value from 0 (zero percent similar) to 100 (identical) based on a set of specified maps. The statistic replaces difficult visual interpretation of map displays with an exact quantitative measure at each location.
An extension to the technique allows you to circle an area then compute similarity based on the typical data pattern within the delineated area. In this instance, the computer calculates the average value within the area for each map layer to establish the comparison data pattern, and then determines the normalized data distance for each map location. The result is a map showing how similar things are to the area of interest.
In the same way, a marketer could use an existing sales map to identify areas of unusually high sales for a product, and then generate a map of similarity based on demographic data. The result will identify locations with a similar demographic pattern elsewhere in the city. Or a forester might identify areas with similar terrain and soil conditions to those of a rare vegetation type to identify other areas to encourage regeneration.
The link between Geographic Space and Data Space is key. As shown in figure 1, spatial data can be viewed as a map or a histogram. While a map shows us “where is what,” a histogram summarizes “how often” measurements occur (regardless of where they occur).
The top-left portion of the figure shows a 2D/3D map display of the relative amount of phosphorus (P) throughout a farmer’s field. Note the spikes of high measurements along the edge of the field, with a particularly big spike in the north portion.
Figure 1. Identifying areas of unusually high measurements.
The histogram to the right of the map view forms a different perspective of the data. Rather than positioning the measurements in geographic space, it summarizes their relative frequency of occurrence in data space. The X-axis of the graph corresponds to the Z-axis of the map (amount of phosphorus). In this case, the spikes in the graph indicate measurements that occur more frequently. Note the single high occurrence spike of phosphorus around 11ppm, while the potassium indicates several spikes below 200ppm.
Now to put the geographic-data space link to use. The shaded area in the histogram view identifies measurements that are unusually high—more than one standard deviation above the mean. This statistical cutoff is used to isolate locations of high measurements as shown in the map on the right. The procedure is repeated for the potassium (K) map surface to identify its locations of unusually high measurements.
Figure 2 illustrates combining the P and K data to locate areas in the field that have high measurements in both. The graphic on the left is termed a scatter diagram or plot. It graphically summarizes the joint occurrence of both sets of mapped data.
Figure 2. Identifying joint coincidence in both data and geographic space.
Each ball in the scatter plot schematically represents a location in the field. Its position in the plot identifies the P and K measurements at that location. The balls plotting in the shaded area of the diagram identify field locations that have both high P and high K. The upper-left partition identifies joint conditions in which neither P nor K are high. The off-diagonal partitions in the scatter plot identify locations that are high in one but not the other.
The aligned maps on the right show the geographic solution for areas that are high in both of the soil nutrients. A simple map-ematical way to generate the solution is to assign 1 to all locations of high measurements in the P and K map layers (bright green). Zero is assigned to locations that aren’t high (light gray). When the two binary maps (0/1) are multiplied, a zero on either map computes to zero. Locations that are high on both maps equate to 1 (1*1 = 1). In effect, this “level-slice” technique maps any data pattern you specify… just assign 1 to the data interval of interest for each map variable.
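The level-slice steps just described (flag values more than one standard deviation above the mean, then multiply the binary maps) might look like this in NumPy. The 3x3 grids and the function name high_mask are made up for illustration, and the population standard deviation (NumPy's default) is assumed since the text does not say which form was used.

```python
import numpy as np

def high_mask(surface, n_std=1.0):
    """Return 1 where a value exceeds mean + n_std * standard deviation, else 0."""
    cutoff = surface.mean() + n_std * surface.std()
    return (surface > cutoff).astype(int)

# Hypothetical 3x3 nutrient surfaces.
p = np.array([[10.0, 11.0, 12.0],
              [11.0, 40.0, 45.0],
              [10.0, 12.0, 11.0]])
k = np.array([[150.0, 160.0, 155.0],
              [150.0, 400.0, 160.0],
              [420.0, 150.0, 155.0]])

high_p = high_mask(p)    # binary map: 1 = unusually high P
high_k = high_mask(k)    # binary map: 1 = unusually high K
both = high_p * high_k   # 1 only where both maps are 1 (1*1 = 1)
```

Any other data interval of interest can be sliced the same way by changing the condition inside high_mask.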
Figure 3 depicts level slicing for areas that are unusually high in P, K and N (nitrogen). In this instance the data pattern coincidence is a box in 3-dimensional scatter plot space.
Figure 3. Level-slice classification using three map variables.
However, a map-ematical trick was employed to get the map solution shown in the figure. On the individual maps, high areas were set to P= 1, K= 2 and N= 4, then the maps were added together.
The result is a range of coincidence values from zero (0+0+0= 0; gray= no high areas) to seven (1+2+4= 7; red= high P, high K, high N). The map values in between identify the map layers having high measurements. For example, the yellow areas with the value 3 have high P and K but not N (1+2+0= 3). If four or more maps are combined, the areas of interest are assigned increasing binary progression values (…8, 16, 32, etc)—the sum will always uniquely identify the combinations.
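The binary progression trick can be sketched as below. The 2x2 binary maps and the helper name decode are hypothetical; the point is only that powers of two make every sum decode to exactly one combination of layers.

```python
import numpy as np

# Hypothetical binary maps (1 = unusually high, 0 = not high).
high_p = np.array([[0, 1], [1, 1]])
high_k = np.array([[0, 1], [0, 1]])
high_n = np.array([[0, 0], [0, 1]])

# Layer i is weighted by 2**i, giving P=1, K=2, N=4 (then 8, 16, 32 for more maps).
layers = [("P", high_p), ("K", high_k), ("N", high_n)]
combined = sum(mask * (2 ** i) for i, (_, mask) in enumerate(layers))

def decode(value):
    """List the layers that are high for a given coincidence value."""
    return [name for i, (name, _) in enumerate(layers) if int(value) & (2 ** i)]

high_in_yellow = decode(3)                # value 3 = 1 + 2 + 0
high_in_last_cell = decode(combined[1, 1])
```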
While level-slicing isn’t a very sophisticated classifier, it does illustrate the useful link between data space and geographic space. This fundamental concept forms the basis for most geostatistical analysis… including map clustering and regression to be tackled in the next couple of sections.
Use Statistics to Map Data Clusters
(GeoWorld, December 2001)
The last couple of sections have focused on analyzing data similarities within a stack of maps. The first technique, termed Map Similarity, generates a map showing how similar all other areas are to a selected location. A user simply clicks on an area and all other map locations are assigned a value from 0 (0% similar—as different as you can get) to 100 (100% similar—exactly the same data pattern).
The other technique, Level Slicing, enables a user to specify a data range of interest for each map in the stack then generates a map identifying the locations meeting the criteria. Level Slice output identifies combinations of the criteria met—from only one criterion (and which one it is), to those locations where all of the criteria are met.
While both of these techniques are useful in examining spatial relationships, they require the user to specify data analysis parameters. But what if you don’t know what Level Slice intervals to use or which locations in the field warrant Map Similarity investigation? Can the computer on its own identify groups of similar data? How would such a classification work? How well would it work?
Figure 1 shows some examples derived from Map Clustering. The “floating” maps on the left show the input map stack used for the cluster analysis. The maps are the same P, K, and N maps identifying phosphorus, potassium and nitrogen levels throughout a cornfield that were used for the examples in the previous topics. However, keep in mind that the input maps could be crime, pollution or sales data, or any other set of application-related data. Clustering simply looks at the numerical pattern at each map location and “sorts” them into discrete groups.
Figure 1. Examples of Map Clustering.
The map in the center of the figure shows the results of classifying the P, K and N map stack into two clusters. The data pattern for each cell location is used to partition the field into two groups that are 1) as different as possible between groups and 2) as similar as possible within a group. If all went well, any other division of the field into two groups would not be as good at balancing the two criteria.
The two smaller maps at the right show the division of the data set into three and four clusters. In all three of the cluster maps red is assigned to the cluster with relatively low responses and green to the one with relatively high responses. Note the encroachment on these marginal groups by the added clusters that are formed by data patterns at the boundaries.
The mechanics of generating cluster maps are quite simple. Simply specify the input maps and the number of clusters you want then miraculously a map appears with discrete data groupings. So how is this miracle performed? What happens inside cluster’s black box?
The schematic in figure 2 depicts the process. The floating balls identify the data patterns for each map location (geographic space) plotted against the P, K and N axes (data space). For example, the large ball appearing closest to you depicts a location with high values on all three input maps. The tiny ball in the opposite corner (near the plot origin) depicts a map location with small map values. It seems sensible that these two extreme responses would belong to different data groupings.
Figure 2. Data patterns for map locations are depicted as floating balls in data space.
The specific algorithm used in clustering is discussed in one of the further references at the end of this topic (see Identify Data Patterns, in the Underlying Spatial Data Mining Concepts section of the references). However, for this discussion it suffices to note that “data distances” between the floating balls are used to identify cluster membership: groups of balls that are relatively far from other groups and relatively close to each other form separate data clusters. In this example, the red balls identify relatively low responses while green ones have relatively high responses. The geographic pattern of the classification is shown in the map at the lower right of the figure.
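To make the black box a little less mysterious, here is a sketch of one common distance-based approach, plain k-means. It is a swapped-in stand-in and not necessarily the algorithm behind the figures, which the text defers to the further reading. The function names, the standardization step and the synthetic 4x4 map stack are all assumptions of this example.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means on an (n, d) array; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Data distance from every point to every cluster center.
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties out
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def cluster_map(stack, k):
    """Cluster a (layers, rows, cols) map stack and return a (rows, cols) label map."""
    layers, rows, cols = stack.shape
    points = stack.reshape(layers, -1).T  # one row per grid cell
    std = points.std(axis=0)
    z = (points - points.mean(axis=0)) / np.where(std == 0, 1.0, std)
    labels, _ = kmeans(z, k)
    return labels.reshape(rows, cols)

# Synthetic stack: low responses in the two left columns, high in the two right columns.
base = np.zeros((4, 4))
base[:, 2:] = 10.0
jitter = np.arange(16).reshape(4, 4) * 0.01
stack = np.array([base + jitter, 2.0 * base + jitter, base + 2.0 * jitter])

clusters = cluster_map(stack, k=2)
```

The cluster numbers themselves are arbitrary labels; only the grouping of cells carries meaning.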
Identifying groups of neighboring data points to form clusters can be tricky business. Ideally, the clusters will form distinct “clouds” in data space. But that rarely happens and the clustering technique has to enforce decision rules that slice a boundary between nearly identical responses. Also, extended techniques can be used to impose weighted boundaries based on data trends or expert knowledge. Treatment of categorical data and leveraging spatial autocorrelation are other considerations.
So how do you know if the clustering results are acceptable? Most statisticians would respond, “you can’t tell for sure.” While there are some elaborate procedures focusing on the cluster assignments at the boundaries, the most frequently used benchmarks use standard statistical indices.
Figure 3 shows the performance table and box-and-whisker plots for the map containing two clusters. The average, standard deviation, minimum and maximum values within each cluster are calculated. Ideally the averages would be radically different and the standard deviations small—large difference between groups and small differences within groups.
Figure 3. Clustering results can be roughly evaluated using basic statistics.
Box-and-whisker plots enable us to visualize these differences. The box is centered on the average (position) and extends above and below one standard deviation (width) with the whiskers drawn to the minimum and maximum values to provide a visual sense of the data range.
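The performance table and the box comparison can be sketched as follows. The function names and the sample values are hypothetical, and the population standard deviation is assumed. The overlap check simply asks whether the two mean plus-or-minus one standard deviation boxes share any range along a given axis.

```python
import numpy as np

def cluster_summary(values, labels):
    """Return {cluster: (mean, std, min, max)} for one map layer."""
    summary = {}
    for c in np.unique(labels):
        v = values[labels == c]
        summary[int(c)] = (float(v.mean()), float(v.std()), float(v.min()), float(v.max()))
    return summary

def boxes_overlap(a, b):
    """True if the (mean - std, mean + std) boxes of two clusters overlap."""
    lo_a, hi_a = a[0] - a[1], a[0] + a[1]
    lo_b, hi_b = b[0] - b[1], b[0] + b[1]
    return lo_a <= hi_b and lo_b <= hi_a

labels = np.array([0, 0, 0, 1, 1, 1])

# Made-up layer where the clusters separate cleanly.
distinct = cluster_summary(np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0]), labels)
# Made-up layer where the clusters blur together.
blurred = cluster_summary(np.array([1.0, 5.0, 9.0, 4.0, 8.0, 12.0]), labels)

distinct_overlap = boxes_overlap(distinct[0], distinct[1])
blurred_overlap = boxes_overlap(blurred[0], blurred[1])
```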
When the diagrams for the two clusters overlap, as they do for the phosphorus responses, it tells us that the clusters aren’t very distinct along this axis. The separation between the boxes for the K and N axes suggests greater distinction between the clusters. Given the results a practical
Spatial Data Mining “Down on the Farm”
(GeoWorld, August 2006)
Until the 1990s, maps played a minor role in production agriculture. Most soil maps and topographic sheets were too generalized for use at the farm level. As a result, the principle of Whole-Field Management, based on broad averages of field data, dominated management actions. Weigh-wagon and grain elevator measurements established a field's overall yield performance, and soil sampling determined the typical/average nutrient levels for a field. Farmers used such data to determine best overall seed varieties, fertilization rates and a bushel of other decisions that all treated an entire field as uniform within its boundaries.
Precision Agriculture, on the other hand, recognizes the variability within a field and involves doing the right thing, in the right way, at the right place and time: Site-Specific Management. The approach involves assessing and reacting to field variability by tailoring management actions, such as fertilization levels, seeding rates and variety selection, to match changing field conditions. It assumes that managing field variability leads to cost savings and production increases as well as improved stewardship and environmental benefits.
Figure 1 outlines the major steps in transforming spatial data and derived relationships into on-the-fly variable rate maps that put a little here, more over there and none at other places in the field. The prescription maps are derived by applying spatial data mining techniques to uncover the relationship between crop production and management variables, such as fertility applications of phosphorus, potassium and nitrogen (P, K and N).
Figure 1. Relationships among yield and field nutrient levels are analyzed to derive a prescription map that identifies site-specific adjustments to nutrient application.
detailed map of crop yield is constructed by on-the-fly yield monitoring as a harvester moves through a field. A record of the yield volume and
data are analyzed to relate the spatial variations in yield to the nutrient data patterns in the field (spatial dependency/correlation) using such techniques as map comparison, similarity analysis, zoning, clustering, regression and other statistical approaches (see Author’s note). Once viable relationships are identified, a prescription map is generated that instructs on-the-fly application of varying nutrient inputs as a
might think precision farming is an oxymoron when reality is mud up to the axles and 400 acres to plow; however, sophisticated uses of geotechnology are rapidly changing production agriculture. In just ten years it has become nearly impossible to purchase equipment without wiring for
Figure 2. A similar process for mining and utilizing spatial relationships can be applied to other disciplinary fields.
What really is revolutionary is the changing mindset from whole-field to site-specific management and the spatial data mining process used to derive spatial relationships that translate geographic variation in data patterns into prescription maps. Figure 2 depicts a generalized flowchart of the process that can be applied to a number of disciplinary fields.
For example, my first encounter with spatial data mining procedures was in extending a test market project for a phone company in the early 1990s. Customer addresses were used to geo-code map coordinates for sales of a new product enabling different rings to be assigned to a single phone line, one for the kids and another for the parents. Like pushpins on a map, the pattern of sales throughout the test market area emerged with some areas doing very well, while in other areas sales were few and far between.
demographic data for the city was analyzed to calculate a prediction equation between product sales and census block data. The equation then was applied to another city by evaluating existing demographics to “solve the equation” for a predicted sales map. In turn, the predicted sales map was combined with a wire-exchange map to identify switching facilities that required upgrading before release of the product in the
Crop yield and sales yield at first might seem worlds apart, as do the soil nutrient and demographic variables that drive them. However, from an analytical point of view the process is identical; just the variables are changed to protect the innocent. It leads one to wonder what other application opportunities await a paradigm shift from traditional to spatial statistics in other disciplines.
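The fit-here, apply-there pattern in the phone company example can be sketched with an ordinary least-squares fit. The text does not say what form the prediction equation took, so a linear equation is an assumption, as are the function names and the synthetic "city" grids, which stand in for demographic layers and a sales map.

```python
import numpy as np

def fit_prediction_equation(predictors, response):
    """Least-squares fit of response = b0 + b1*x1 + b2*x2 + ...

    predictors has shape (layers, rows, cols); response has shape (rows, cols).
    """
    X = predictors.reshape(predictors.shape[0], -1).T  # one row per grid cell
    X = np.column_stack([np.ones(len(X)), X])          # leading column for the intercept
    coef, *_ = np.linalg.lstsq(X, response.ravel(), rcond=None)
    return coef

def predict_map(predictors, coef):
    """'Solve the equation' for every cell of another predictor stack."""
    return coef[0] + np.tensordot(coef[1:], predictors, axes=1)

# Synthetic data: two predictor layers per city and an exactly linear response.
rng = np.random.default_rng(1)
city_a = rng.uniform(0.0, 10.0, size=(2, 5, 5))
sales_a = 5.0 + 2.0 * city_a[0] - 3.0 * city_a[1]

coef = fit_prediction_equation(city_a, sales_a)

city_b = rng.uniform(0.0, 10.0, size=(2, 5, 5))
predicted_sales_b = predict_map(city_b, coef)
```

Swapping the predictor layers for P, K and N and the response for yield gives the farm version of the same process.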
Author’s Notes: For more information on Precision Agriculture, see feature article “Who’s Minding the Farm” (GeoWorld, February, 1998) at www.innovativegis.com/basis/present/GW98_PrecisionAg/GW98_PrecisionAg.htm and The Precision Farming Primer, a compilation of "Inside the GIS Toolbox" columns by J. K. Berry published in the @gInnovator newsletter from 1993 to 2000 posted at www.innovativegis.com/basis/pfprimer/.
Further Online Reading: (Chronological listing posted at www.innovativegis.com/basis/BeyondMappingSeries/)
(Underlying Spatial Data Mining Concepts)
Beware the Slippery Surfaces of GIS Modeling — discusses the relationships among maps, map surfaces and data distributions (May 1998)
Link Data and Geographic Distributions — describes the direct link between numeric and geographic distributions (June 1998)
Explore Data Space — establishes the concept of "data space" and how mapped data conforms to this fundamental view (July 1998)
Identify Data Patterns — discusses data clustering and its application in identifying spatial patterns (August 1998)
(Advanced Map Comparison Techniques)
Compare Maps by the Numbers — describes several techniques for comparing discrete maps (September 1999)
Use Statistics to Compare Map Surfaces — describes several techniques for comparing continuous map surfaces (October 1999)
(Approaches Used in Deriving Prediction Maps)
Use Scatterplots to Understand Map Correlation — discusses the underlying concepts in assessing correlation among maps (November 1999)
Can Predictable Maps Work for You? — describes a procedure for deriving a spatial prediction model (December 1999)
Spatial Data Mining Allows Users to Predict Maps — describes the basic concepts and procedures for deriving equations that can be used to derive prediction maps (January 2002)
Stratify Maps to Make Better Predictions — illustrates a procedure for subdividing an area into smaller more homogenous groups prior to generating prediction equations (February 2002)