Is Technology Ahead of Science?
Joseph K. Berry
Berry & Associates // Spatial Information Systems, Inc. (BASIS), Fort Collins, Colorado
Agriculture is inherently a spatial endeavor. The newly emerging technologies of Geographic Information Systems (GIS), Global Positioning System (GPS) and Intelligent Devices and Implements (IDI) are enabling farmers to more effectively organize, update, query and analyze mapped data about their fields. The melding of these technologies marks a turning point in the collection and analysis of field data from a "whole-field" to a "site-specific" perspective of the inherent differences throughout a field. In many ways, the new technological environment is as different as it is similar to traditional mapping and data analysis. For agriculture research it appears to propagate as many questions as answers. Is the "scientific method" relevant in the data-rich age of knowledge engineering? Is the "random thing" pertinent in deriving mapped data? Are geographic distributions a natural extension of numerical distributions? Can spatial dependencies be modeled? How can "on-farm studies" augment agriculture research? This paper explores the conceptual differences between spatial and non-spatial data, its analysis and the opportunities and challenges it poses.
Site-specific management, commonly referred to as precision farming, is about doing the right thing, in the right way, at the right place and time. It involves assessing and reacting to field variability and tailoring management actions, such as fertilization levels, seeding rates and variety selection, to match changing field conditions. It assumes that managing field variability leads to both cost savings and production increases. Site-specific management isn't just a bunch of pretty maps, but a set of new procedures that link mapped variables to appropriate management actions. This conceptual linkage between crop productivity and field conditions requires the technical integration of several elements.
Site-specific management consists of four basic elements: global positioning system (GPS), data collection devices, geographic information systems (GIS) and intelligent implements. Modern GPS receivers are able to establish positions within a field to about a meter. When attached to a harvester and connected to a data collection device, such as a yield/moisture meter, these data can be "stamped" with geographic coordinates. A GIS is used to map the yield data so a farmer can see the variations in productivity throughout a field.
The GIS also can be used to extend map visualization of yield to "map-ematical" analysis of the relationships among yield variability and field conditions. Once established these relationships can be used to derive a "prescription" map of management actions required for each location in a field. The final element, intelligent implements, reads the prescription map as a tractor moves through a field and varies the application rate of field inputs in accordance with the precise instructions for each location. The combining of GPS, GIS and IDI (intelligent devices and implements) provides a foothold for both the understanding and the management of field variability.
To date, most analyses of yield maps have been visual interpretations. By viewing a map, all sorts of potential relationships between yield variability and field conditions spring to mind based on a farmer's indigenous knowledge. Data visualization can be extended through GIS analysis linking digital maps of yield to field conditions. This "map-ematical" processing involves the same three levels required by visual analysis: cognitive, analysis and synthesis. At the cognitive level (termed desktop mapping) computer maps of variables, such as crop yield and soil nutrients, are generated. These graphical descriptions form the foundation of site-specific management. The analysis level uses the GIS's analytical toolbox to discover relationships among the mapped variables. This step is analogous to a farmer's visceral visions of relationships, but uses the computer to establish mathematical and statistical connections. The synthesis level of processing uses spatial modeling to translate the newly discovered relationships into management actions (prescriptions). The result is a prescription map linking management actions, such as variable rate control of inputs, to the unique pattern of field conditions.
Differences Between Spatial and Non-Spatial Data
Data is fundamental to site-specific management. All of these data are geographical in their occurrence (collected in real-world space), but the nature of the data and how it is processed determine its spatial character. Some data, such as the number of times an aspiring assistant professor is cited in the literature, is inherently non-spatial. However, if you simultaneously recorded the location in the library of the publications containing the citations, the data would take on a geographic component. The utilization of the spatial component during processing determines the extent of the spatial nature of the data.
The coupling of descriptive information (what, how much, etc.) with location information (where) identifies geographical data. Descriptive information, when expressed numerically, forms four basic data types: nominal (values are merely exclusive), ordinal (values imply a hierarchical ordering, such as small, bigger, biggest), interval (values are ordered within a scale containing a constant interval, such as 42, 50 and 54 degrees centigrade), and ratio (values are ordered, contain a constant interval and have an absolute reference, such as zero degrees Kelvin).
Traditionally, location information has been depicted graphically as map features composed of points, lines and polygons. Within a computer, these features are defined by a sequence of numeric strings identifying geographic coordinates (X,Y,Z) and a descriptive value, such as an ID number linking the feature to tables of descriptive information. This additional data type, termed choropleth, characterizes discrete objects, such as property lines, roads and pipelines.
However, some things do not exhibit sharp boundaries in geographic space. This data type, termed isopleth, forms a continuous distribution in geographic space. Gradients result from the nature of things, such as air temperature, chemical concentrations and terrain elevation, or from how we characterize them, such as soil nutrient distribution from point samples.
A final data type, termed binary, identifies things that exist only in two states, such as land/water, present/absent and good/bad. Sharp borders are formed, but they often reflect interpretation and conditions more than the inherent nature of the data (e.g., "suitable areas" and a reservoir's "shoreline").
Discrete objects can be directly digitized into a computer from existing maps, air photos, GPS records, or other sources, and their descriptive information can be expressed in any of the four basic data types. Spatially defined binary data most often result from reclassification of existing spatial data and are usually expressed as nominal data. Spatial gradients are derived from discrete measurements through aggregating, smoothing and interpolation of interval or ratio data types. Map contouring reverses the effect by dividing continuous data into a set of discrete polygons, or response zones, implying sharp boundaries.
By its very nature, site-specific management primarily involves continuous spatial variables. While much of the infrastructure is discrete, such as roads, fences, and ditches, the focus of a farm database is on expansive fields with production factors that vary with geographic space. As a result, surface modeling plays a dominant role in site-specific management.
Surface Modeling Using Continuously Logged Data
Map surfaces, also termed spatial gradients, can be derived by summarizing continuously logged data and assigning the summary value to regularly spaced grid spaces. For example, yield mapping systems summarize point measurements (i.e., yield monitor readings) falling within each grid cell of an imaginary grid laid over a field. The average, standard deviation and other statistics are used to characterize the yield within the "spatially aggregated unit." There are two main advantages to this approach: 1) it provides consistent geographic referencing among various mapped data layers, and 2) it smoothes out measurement "noise" while reporting statistics on localized yield variation.
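The grid summary described above can be illustrated with a minimal sketch. It assumes readings arrive as (easting, northing, value) tuples and that a square cell size has already been chosen; the function name `grid_summary` and the sample readings are hypothetical and not drawn from any particular yield mapping system.

```python
import math
from collections import defaultdict

def grid_summary(points, cell_size):
    """Summarize (x, y, value) readings by grid cell.

    Returns {(col, row): (count, mean, std)} using the population
    standard deviation of the readings that fall in each cell.
    """
    cells = defaultdict(list)
    for x, y, value in points:
        # Integer cell index of the imaginary grid laid over the field.
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append(value)

    summary = {}
    for key, values in cells.items():
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        summary[key] = (n, mean, math.sqrt(var))
    return summary

# Hypothetical yield-monitor readings: (easting m, northing m, yield).
readings = [(1.0, 2.0, 150.0), (4.0, 3.0, 170.0), (12.0, 1.0, 120.0),
            (14.0, 8.0, 130.0), (3.0, 15.0, 90.0)]
yield_grid = grid_summary(readings, cell_size=10.0)
```

Keeping the count and standard deviation alongside the mean preserves some information about localized variation that the average alone would hide.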
However, there are some technical issues that need to be considered. First, the positioning of the samples is assumed to be exact. In the case of yield monitors, there are several sources of error, such as mass flow time lags, that contribute to imprecise positioning. Also, the gridding resolution (cell size) can greatly affect summary calculations. If the grid pattern is too large, some of the information in the data will be lost ("averaged-over"). If it is too small, undue credence might be attached to differences arising simply from measurement and positioning errors.
Other technical issues involve the configuration of the summary window and the summary procedure employed. A single cell window design uses only those measurements that actually fall within a cell for the summary calculations. An inline design uses the direction of travel to "filter" surrounding data, thereby helping to smooth out measurement errors resulting from the "coughs and spits" of uneven grain flows through a combine. The technique is analogous to "moving averages" used in analyzing time series data, such as commodity prices or stock market indices. Both the single and inline techniques have been used for "on-the-fly" data compression: simply keep the summary statistics and discard the pile of raw measurements (not recommended). The nearest-neighbors technique involves post-processing of the raw data. It moves a window around the field sequentially centered on each grid cell (spatial summary unit). At each stop, it calculates a summary of the points falling within the window and assigns that value to that grid cell. In addition to window design, the summary procedure can vary, such as simple or weighted averaging. For example, a distance-weighted average is influenced more by nearby measurements than those that are farther away.
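As a rough illustration of the inline design, the sketch below applies a centered moving average to readings in the order they were logged along a single pass. The function name `inline_smooth`, the window half-width and the sample readings are assumptions chosen for illustration; actual yield monitors may filter differently.

```python
def inline_smooth(readings, half_width=1):
    """Centered moving average along the direction of travel.

    `readings` are values in the order they were logged; the window
    is truncated at the start and end of the pass.
    """
    smoothed = []
    for i in range(len(readings)):
        lo = max(0, i - half_width)
        hi = min(len(readings), i + half_width + 1)
        window = readings[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Hypothetical pass with one "cough" of grain flow at the second reading.
pass_readings = [100.0, 160.0, 100.0, 100.0]
smoothed = inline_smooth(pass_readings)
```

The spike is spread across its neighbors rather than removed, which is why retaining the raw measurements for later post-processing remains advisable.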
Surface Modeling Using Point Sampled Data
Whereas continuously logged data is analogous to a "census" of spatial units, point sampling derives a statistical estimate for each spatial unit based on a set of dispersed measurements. Sampling design is critical to interpolation success and involves four distinct considerations: 1) stratification, 2) sample size (intensity), 3) sampling grid, and 4) sampling pattern. It is important to note that traditional non-spatial sampling considerations (a representative number of random samples) are inappropriate for surface modeling as they focus on assessing the typical response (average) in numeric space. Surface modeling, on the other hand, seeks to map the geographic distribution (variance) and requires a sampling design that proportions samples as a function of spatial extent.
Two broad approaches are used in deriving geographic distributions of spatial variables: map generalization and spatial interpolation. Map generalization fits a functional form to an entire set of sample points. The simplest is a flat X,Y plane (geographic axes) positioned so that the deviations of the points above it balance the deviations of the points below it as viewed along the Z-axis (measurement axis). The Z value of this plane corresponds to the arithmetic "average" in non-spatial statistics, which is assumed to have a uniform distribution in geographic space (flat plane).
If the plane is allowed to tilt while minimizing its deviations to the data points (best fit of a 1st degree polynomial in three-space), it will identify the spatial trend in the data. This procedure is analogous to fitting a "regression line" in non-spatial statistics. Relaxing the assumption of a plane (linear relationship) involves fitting Nth degree polynomials as curved surfaces which is similar to non-linear regression and other prediction equation fitting techniques used in traditional data analysis.
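A tilted trend plane of this kind can be sketched as an ordinary least-squares fit of z = a + b*x + c*y. The example below is a minimal version that solves the three normal equations with Cramer's rule; the function name `fit_trend_plane` and the sample values are hypothetical.

```python
def fit_trend_plane(samples):
    """Least-squares fit of z = a + b*x + c*y to (x, y, z) samples."""
    n = len(samples)
    sx = sum(x for x, _, _ in samples)
    sy = sum(y for _, y, _ in samples)
    sz = sum(z for _, _, z in samples)
    sxx = sum(x * x for x, _, _ in samples)
    syy = sum(y * y for _, y, _ in samples)
    sxy = sum(x * y for x, y, _ in samples)
    sxz = sum(x * z for x, _, z in samples)
    syz = sum(y * z for _, y, z in samples)

    # Normal equations A * [a, b, c] = rhs, solved by Cramer's rule.
    A = [[n, sx, sy], [sx, sxx, sxy], [sy, sxy, syy]]
    rhs = [sz, sxz, syz]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(A)
    if d == 0:
        raise ValueError("samples are collinear; plane is not unique")
    coeffs = []
    for k in range(3):
        m = [row[:] for row in A]
        for r in range(3):
            m[r][k] = rhs[r]  # replace column k with the right-hand side
        coeffs.append(det3(m) / d)
    return tuple(coeffs)

# Hypothetical samples lying exactly on z = 10 + 2x - y.
samples = [(0.0, 0.0, 10.0), (1.0, 0.0, 12.0), (0.0, 1.0, 9.0),
           (1.0, 1.0, 11.0), (2.0, 1.0, 13.0)]
a, b, c = fit_trend_plane(samples)
```

The coefficients b and c give the direction and steepness of the spatial trend; setting both to zero recovers the flat "average" plane.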
Instead of fitting a functional form to an entire data set, spatial interpolation fits a localized function within a "roving window" moved throughout a spatial extent. Thiessen polygons are formed by assigning the value of the nearest sample point to each spatial unit resulting in a map of the perpendicular bisectors between neighboring sample points. Nearest-neighbors technique simply averages the set of sample points within a specified radius of each spatial unit. If done repeatedly, the technique results in consistent smoothing of the geographic distribution of the data and ultimately approaches the flat "average" plane. Inverse-distance-weighted procedures use the distance from the spatial unit to each sample within the summary window to weight-average with closer samples having more influence.
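The inverse-distance-weighted procedure can be sketched for a single spatial unit as follows. The circular window, the default power of 2 (inverse distance squared) and the function name `idw_estimate` are illustrative assumptions; in practice the estimate would be repeated for every grid cell.

```python
import math

def idw_estimate(x, y, samples, radius, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    Uses only samples within `radius`; returns None if there are none.
    """
    num = 0.0
    den = 0.0
    for sx, sy, value in samples:
        d = math.hypot(sx - x, sy - y)
        if d > radius:
            continue
        if d == 0.0:
            return value  # location coincides with a sample
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den if den > 0.0 else None

# Hypothetical soil samples: (easting, northing, nutrient level).
soil = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0), (10.0, 10.0, 100.0)]
midway = idw_estimate(1.0, 0.0, soil, radius=5.0)
nearer_first = idw_estimate(0.5, 0.0, soil, radius=5.0)
```

Midway between the first two samples the estimate is their simple average; moving closer to the first sample pulls the estimate toward its value.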
Kriging techniques determine window configuration and weighting factors as a function of the spatial autocorrelation in the sample set. Although a detailed discussion of spatial autocorrelation is beyond the scope of this paper, it should be noted that it relates the similarity among sample points (i.e., the inverse of the variance) to the distance between samples. In essence, it expresses to what degree "nearby things are more related than distant things" (Tobler's first law of geography), the assumption behind all spatial interpolation techniques. Whereas the Geary and Moran indices report an overall measure of spatial autocorrelation, a variogram plot captures its functional form over a range of inter-sample distances.
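The functional form a variogram captures can be approximated from samples by an empirical semivariogram: half the mean squared difference between pairs of samples, grouped by separation distance. The sketch below assumes equal-width lag bins; the function name `empirical_semivariogram` and the transect values are hypothetical, and fitting a variogram model for kriging is a further step not shown here.

```python
import math
from itertools import combinations

def empirical_semivariogram(samples, lag_width, n_lags):
    """Return [(lag_center, semivariance, pair_count), ...].

    Bin k collects sample pairs separated by a distance in
    [k * lag_width, (k + 1) * lag_width); empty bins are omitted.
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for (x1, y1, z1), (x2, y2, z2) in combinations(samples, 2):
        d = math.hypot(x2 - x1, y2 - y1)
        k = int(d // lag_width)
        if k < n_lags:
            sums[k] += (z1 - z2) ** 2
            counts[k] += 1
    result = []
    for k in range(n_lags):
        if counts[k]:
            center = (k + 0.5) * lag_width
            result.append((center, sums[k] / (2 * counts[k]), counts[k]))
    return result

# Hypothetical samples along a transect: (x, y, value).
transect = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (2.0, 0.0, 4.0), (3.0, 0.0, 7.0)]
variogram = empirical_semivariogram(transect, lag_width=1.0, n_lags=3)
```

Semivariance that rises with lag distance, as in this example, indicates that nearby samples are more alike than distant ones.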
There are numerous procedures for analyzing relationships within (univariate) and among (multivariate) map surfaces. Surface analysis techniques provide insight into subtle patterns contained in the surfaces, such as areas where yields are significantly higher or lower than typical levels in a field. In addition, other techniques can summarize the degree of coincidence among surfaces and/or test hypotheses, such as which wheat variety performs best within areas of high nematode concentrations. Surface analysis also can be used to derive spatially responsive prediction equations, such as a regression equation relating yield (dependent variable) to soil nutrient surfaces (independent variables).
Underlying all of these techniques is the realization that a map surface is a set of spatially organized numbers first, colorful image (traditional map) later. The important outcome is a "map-ematical" treatment of the numbers that respects the spatial autocorrelation and spatial dependencies captured in the surfaces.
Many of the traditional mathematical and statistical procedures in non-spatial data analysis translate to surface analysis. The accompanying tables identify several of the procedures for univariate (within a single surface, Table 1) and multivariate (among two or more surfaces, Table 2) mapped data.
Spatially organizing data provides an additional set of analytical operations beyond the extensions of traditional data analysis techniques. Geographically dependent procedures, such as optimal path delineation, inter-visibility among map features, binary masking, and effective proximity are examples of an entirely new set of map-ematical operations. Although a detailed discussion of spatial analysis is beyond the scope of this paper, it should be noted that these procedures arise from the spatial character of mapped data and do not have direct lineage to non-spatial mathematics and statistics.
For example, the distance between two points is easily measured with a ruler. In mathematics, the manual procedure is replicated by evaluating the Pythagorean theorem ($c^2 = a^2 + b^2$) using the X,Y coordinates to define the sides of a right triangle, then solving for the hypotenuse. Both approaches, however, assume the "shortest, straight line distance between two points," a constrained definition of distance that is rarely the actual path of movement connecting things in the real world.
The traditional concept of distance is expanded to one of proximity by relaxing the assumption that connections are always "between two points"; they can also be among sets of points (e.g., current location to everywhere within a spatial extent). Further expansion of the concept relaxes the assumption of "straight line" connectivity by introducing absolute and relative barriers that must be respected in determining the "shortest" movement path connecting locations.
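One way to sketch effective proximity is as an accumulated-cost surface computed with Dijkstra's shortest-path algorithm over a grid. The friction-grid encoding (numbers for relative barriers, None for absolute barriers), the four-neighbor movement rule and the function name `effective_proximity` are assumptions made for illustration rather than a description of any specific GIS package.

```python
import heapq

def effective_proximity(friction, start):
    """Least accumulated cost from `start` to every cell of a grid.

    `friction[r][c]` is the relative cost of crossing a cell (a relative
    barrier); None marks an absolute barrier that cannot be crossed.
    Moves are to the four orthogonal neighbors only.
    """
    rows, cols = len(friction), len(friction[0])
    cost = [[float("inf")] * cols for _ in range(rows)]
    r0, c0 = start
    cost[r0][c0] = 0.0
    heap = [(0.0, r0, c0)]
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > cost[r][c]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if friction[nr][nc] is None:
                continue
            # Step cost: average friction of the two cells involved.
            nd = d + (friction[r][c] + friction[nr][nc]) / 2.0
            if nd < cost[nr][nc]:
                cost[nr][nc] = nd
                heapq.heappush(heap, (nd, nr, nc))
    return cost

# Hypothetical field with an absolute barrier (e.g., a ditch) in the middle.
field = [
    [1, 1,    1],
    [1, None, 1],
    [1, None, 1],
]
proximity = effective_proximity(field, start=(2, 0))
```

In this small grid the cell two columns to the right of the start is only two cells away in a straight line, but its effective proximity is six steps because movement must go around the barrier.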
In site-specific management, effective proximity can be used to identify locations that are "uphill," "up-wind," or "up along surface or ground water flows." Intervening terrain can be considered in determining sediment loading potential, resulting in "effective environmental buffers" allowing farming activity close to streams when conditions warrant. Current laws delineating a "fixed" buffer, such as 100 feet around class 2 streams, might be easy to draft, but serve neither fish nor farmer, as real world conditions affecting sediment loading potential vary along a stream.
Most resource and environmental processes do not adhere to simple geographic concepts, such as Euclidean distance. The GIS modeling "toolbox" contains a myriad of new spatial statistics and spatial analysis procedures that promise to revolutionize agricultural research as much as they impact farm management and operations. As technology relaxes simplifying assumptions, understanding of complex spatial relationships increases. The technology is in place, but the science supporting its application is not. The transition from non-spatial science to spatially-driven science is the missing link needed for successful site-specific management.
Opportunities and Challenges
The new spatial technologies mark a turning point in the collection and analysis of field data from a "whole-field" to a "site-specific" perspective of the inherent differences throughout a field. In many ways, this new technology is as different as it is similar to traditional mapping and data analysis. For agriculture research it appears to propagate as many questions as it answers.
Is the "scientific method" relevant in the data-rich age of knowledge engineering?
The first step in the scientific method is the statement of a hypothesis. It reflects a "possible" relationship or new understanding of a phenomenon. Once a hypothesis is established, a methodology for testing it is developed. The data needed for evaluation are collected and analyzed and, as a result, the hypothesis is accepted or rejected. Each completion of the process contributes to the body of science, stimulates new hypotheses, and furthers knowledge.
The scientific method has served science well. Above all else, it is efficient in a data constrained environment. However, technology has changed the nature of that environment. Yield monitors, for example, can "census" an entire field easier and at less cost than manually sampling a few plots. Surface modeling can map the spatial autocorrelation within a set of soil samples. Effective proximity to influential features can be derived. Terrain conditions, such as slope and aspect, can be mapped. Localized variation in a variable, such as moisture content, can be spatially characterized. Remotely sensed data encapsulates the conditions and characteristics of a scene as spectral response values.
The result is a robust database composed of thousands of spatially-registered locations (spatial units) relating a diverse set of variables. In this data-rich environment, the focus of the scientific method shifts from efficiency in data collection and analysis to the derivation of alternative hypotheses. Hypothesis building results from "mining" the data under various spatial and thematic partitions. The radical change is that the data collection and initial analysis steps precede the hypothesis statement, in effect turning the traditional scientific method on its head.
Is the "random thing" pertinent in deriving mapped data?
A cornerstone of traditional data analysis is randomness. In data collection it seeks to minimize the effects of autocorrelation and dependence among variables. Historically, "census" of a variable was prohibitive and randomness provided an unbiased sample set for estimating the typical state of a variable (i.e., average). Calculation of the variation within the sample set (i.e., variance) establishes how typical the typical is. In multivariate analysis, the mean vector and covariance matrix are used to assess the dependency among variables (i.e., correlation).
For questions of central tendency and non-spatial dependency in data, randomness is essential, as it supports the basic assumptions about analyzing data in numeric space (devoid of spatial interactions). However, in geographic space, randomness rarely exists. The ambient temperatures for neighboring locations are not random. Nor is crop yield. The spatial relationships among mapped variables, such as crop yield and soil nutrient levels, rarely display random patterns of coincidence.
Spatial interactions are fundamental to site-specific management and research. Adherence to the "random thing" runs counter to continuous spatial expression of variables. This is particularly true in sampling design. While efficiently establishing the central tendency, random sampling fails to consistently examine the spatial pattern of variations in a variable. An underlying systematic sampling design, such as systematic unaligned, is needed to ensure a consistent distribution of samples over the spatial extent.
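A systematic unaligned design can be sketched as follows. This is one common formulation, assumed here for illustration: the extent is divided into square cells, every cell in a grid row shares one random x offset, and every cell in a grid column shares one random y offset. The function name `systematic_unaligned` and its parameters are hypothetical.

```python
import random

def systematic_unaligned(n_rows, n_cols, cell_size, seed=None):
    """One sample per grid cell, returned as a list of (x, y) points.

    Every cell in a grid row shares one random x offset and every cell
    in a grid column shares one random y offset, so the samples cover
    the extent evenly without lining up on a regular lattice.
    """
    rng = random.Random(seed)
    x_offset = [rng.random() * cell_size for _ in range(n_rows)]
    y_offset = [rng.random() * cell_size for _ in range(n_cols)]
    points = []
    for row in range(n_rows):
        for col in range(n_cols):
            x = col * cell_size + x_offset[row]
            y = row * cell_size + y_offset[col]
            points.append((x, y))
    return points

# Hypothetical 3 x 4 sampling grid with 30 m cells.
sample_points = systematic_unaligned(3, 4, cell_size=30.0, seed=42)
```

Because exactly one sample falls in each cell, no part of the field is left unsampled, which a purely random draw cannot guarantee.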
Are geographic distributions a natural extension of numerical distributions?
To characterize a variable in numeric space, density functions, such as the standard normal curve, are used. They translate the pattern of discrete measurements along a "number line" into a continuous numeric distribution. Statistics describing the functional form of the distribution determine the central tendency of the variable and ultimately its probability of occurrence. If two variables are considered simultaneously, a three-dimensional probability surface is derived. Consideration of additional variables results in an N-dimensional numerical distribution.
The geographic distribution of a variable can be derived from discrete sample points positioned in geographic space. The map generalization and spatial interpolation techniques described earlier can be used to form a continuous distribution, in a manner similar to deriving a numeric distribution. In effect, the Gaussian, Poisson and binomial density functions used in non-spatial statistics are analogous to the polynomial, inverse-distance-squared and kriging density functions used in spatial statistics.
Although the mechanical expressions are similar, the information contained in numeric and geographic distributions is different. Whereas numeric distributions provide insight into the central tendency of a variable, geographic distributions provide information about the pattern of variations. Generally speaking, non-spatial characterization supports "whole-field" management, while spatial characterization supports "site-specific" management. It can be argued that research using non-spatial techniques provides minimal guidance for site-specific management; in fact, it might even be dysfunctional.
Can spatial dependencies be modeled?
Non-spatial modeling, such as linear regression derived from point sampled data, assumes spatially independent data and seeks to implement the "best overall" action everywhere. Site-specific management assumes spatially dependent data and seeks to evaluate "IF <spatial condition> THEN <spatial action>" rules for the specific conditions at each location throughout a field. The underlying philosophies of the two approaches are at odds. However, the "mechanics" of their expression spring from the same roots.
Within a traditional mathematical context, each map represents a "variable," each cell or polygon represents a "case" and the value at that location represents a "measurement." In a sense, each cell or polygon can be conceptualized as a sample plot; it is just that sample plots are everywhere. A yield monitor (and remotely sensed data for that matter) provides a direct measurement for each spatial unit (average of several yield readings or integration of electromagnetic energy). Point sampling, such as for soil nutrients and elevation, uses surface modeling techniques to statistically estimate a response for each spatial unit.
The result is a data structure that tracks spatial autocorrelation and spatial dependency. The structure can be conceptualized as a stack of maps in which a vertical pin spears a sequence of values defining each variable for that location, sort of a data shish kebab. Regression, or similar techniques, can be applied to the data vectors, uncovering a spatially-dependent model of the relationships.
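The map stack and its data vectors can be sketched with plain nested lists. The example below "drills" through two hypothetical co-registered grids and fits a simple one-predictor regression to the resulting vectors; the layer names, values and function names are assumptions for illustration. Note that ordinary least squares as shown does not itself adjust for spatial autocorrelation among neighboring cells.

```python
def drill_stack(layers):
    """Yield one data vector per cell from a dict of equal-sized grids."""
    names = list(layers)
    first = layers[names[0]]
    for r in range(len(first)):
        for c in range(len(first[0])):
            yield (r, c), {name: layers[name][r][c] for name in names}

def simple_regression(xs, ys):
    """Ordinary least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Hypothetical co-registered 2 x 2 map layers.
stack = {
    "nitrogen": [[10.0, 20.0], [30.0, 40.0]],
    "yield":    [[105.0, 110.0], [115.0, 120.0]],
}
vectors = [v for _, v in drill_stack(stack)]
intercept, slope = simple_regression([v["nitrogen"] for v in vectors],
                                     [v["yield"] for v in vectors])
```

Because each vector keeps its (row, column) position, a fitted equation can be evaluated back onto the grid to produce a predicted surface or a map of residuals.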
Admittedly, imprecise, inaccurate or poorly modeled surfaces may incorrectly track the spatial relationships. But, given good data, the map-ematical approach has the capability to model the spatial character inherent in the data. What is needed is a concerted effort by the scientific community to identify guidelines for spatial modeling and develop techniques for assessing the accuracy of mapped data and the results of its analysis.
How can "on-farm studies" augment agriculture research?
Agriculture research has historically focused on intensive investigations implemented at experimental fields. These studies are well-designed and methodically executed by researchers who are close to the data. As a result, the science performed is both rigorous and professional. However, it is extremely limited in both time and space. The findings might accurately reflect relationships for the experimental field during the study period, but offer minimal information for a farmer 70 miles away under different biological agent, soil and climatic conditions.
Farmers, on the other hand, manage large tracts of land for long periods of time, but are generally unaccustomed to administering scientific projects. As a result, farm operations and on-farm studies are often incompatible. On-farm variety trials, to a limited degree, bridge this gap. The growing popularity of site-specific management has the potential to fill the gap. Overhead, on-board and proximal sensors are poised to collect detailed farm-wide data that a couple of years ago would have required an army of graduate students.
It is recognized that sophisticated instrumentation and the databases it generates are required to implement site-specific management. But often overlooked is the reality that these data form the scientific fodder needed to build the spatial relationships demanded by the process. Site-specific management has changed farming operations; now it must change farm research. A close alliance between researchers and farmers is fundamental to this change. Without it, constrained research (viz. esoteric) mismatches the needs of evolving farm technologies, and heuristic (viz. unscientific) rules-of-thumb are substituted. The farmer has the data and the researcher has the methodology; both are key to successfully implementing site-specific management.
Agriculture is inherently a spatial endeavor. Emerging technologies enable farmers to effectively organize, update, query and analyze mapped data about their fields. The melding of these technologies marks a turning point in the collection and analysis of field data from a "whole-field" to a "site-specific" perspective of the inherent differences throughout a field. In many ways, the new technological environment is as different as it is similar to traditional mapping and data analysis. It provides new capabilities for characterizing spatial relationships, such as spatial statistics, surface modeling and spatial analysis. These techniques allow better understanding of the spatial dependencies within and among mapped data. In addition to new opportunities, the techniques pose new challenges for farmers and researchers alike. In a sense, technology is ahead of science, sort of the cart before the horse. Site-specific management can map spatial patterns and reactions to a meter (technological cart), but our historical science base has been calibrated for the entire field (scientific horse).
Berry, J.K., 1993. Beyond Mapping: Concepts, Algorithms and Issues in GIS, published by GIS World Books. A compilation of columns on map analysis considerations published in GIS World from 1989 to 1993. Companion software tMAP contains "hands-on" exercises in map analysis capabilities that are cross-referenced to Beyond Mapping and Spatial Reasoning books.
Berry, J.K., 1995. Spatial Reasoning for Effective GIS, published by GIS World Books. A compilation of columns on map analysis considerations published in GIS World from 1993 to 1995. Companion software gCON contains digital slide shows on GIS concepts that are cross-referenced to Beyond Mapping and Spatial Reasoning books.
Berry, J.K., in press. Precision Farming Primer. A compilation of columns on site-specific crop management considerations published in Successful Farming's ag/INNOVATOR newsletter from 1994 to 1998. Companion software pfMAP contains "hands-on" exercises and digital slide shows in precision farming data analysis that are cross-referenced to the Precision Farming Primer book.
Cressie, N.A., 1993. Statistics for Spatial Data, published by John Wiley and Sons. Aimed at scientists and engineers, the book uses spatial data to illustrate spatial theory and methods. Includes highly detailed treatments of such integral areas as geostatistical data, models of spatial lattice data, asymptotics and spatial point patterns.
Fotheringham, S. and P. Rogerson, 1994. Spatial Analysis and GIS, published by Taylor and Francis. The book focuses on the relative lack of research into spatial analysis and GIS integration and its potential benefits. It examines GIS and spatial analysis integration issues, research emphasizing methods of spatial analysis, and mathematical modeling and GIS issues.
Shaw, G. and D. Wheeler, 1994. Statistical Techniques in Geographical Analysis, published by Halsted Press. Covers a range of techniques, from simple descriptive to parametric and nonparametric methods, in bivariate and multivariate settings. It sequentially introduces topics and reinforces them with appropriate application examples.
Note1: the above books are available from the GIS World Warehouse, GIS World, 400 N. College, Suite 100, Fort Collins, CO, USA 80524; Phone (970) 221-0037; Fax (970) 221-5150; Email email@example.com, Web http://www.geoplace.com/books/.
Note2: an online PowerPoint presentation of the slides used in presenting this paper is available on the Worldwide Web at http://www.innovativegis.com/basis/. Several additional papers and presentations on this and related topics are available as well.
Table 1. Annotated Listing of Example Univariate Surface Analysis Techniques
Table 2. Annotated Listing of Example Multivariate Surface Analysis Techniques