Beyond Mapping I

Topic 1 – Maps as Data and Data Structure Implications





Maps as Data: a 'Map-ematics' is emerging  describes the differences between Discrete and Continuous mapped data

It Depends: Implications of data structure  discusses and compares the similarities and differences between Vector and Raster data structure applications

GIS Technology Is Technical Oz  discusses and compares the relative advantages/disadvantages between Vector and Raster processing


Maps as Data: a 'Map-ematics' is emerging

(GIS World, March 1989)  



  Old Proverb: A picture is worth a thousand words...

  New Proverb: A map is worth (and most often exceeds) a thousand numbers.


Our historical perspective of maps is one of accurate location of physical features primarily for travel through unfamiliar areas.  Early explorers used them to avoid angry serpents, alluring sirens and even the edge of the earth.  The mapping process evokes images of map sheets, and drafting aids such as pens, rub-on shading, rulers, planimeters, dot grids, and acetate transparencies for light-table overlays.  This perspective views maps as analog mediums composed of lines, colors and cryptic symbols that are manually created and analyzed.  As manual analysis is difficult and often limited, the focus of the analog map and manual processing is descriptive— recording the occurrence and distribution of features.


More recently, the analysis of mapped data has become an integral part of resource and land planning.  By the 1960's, manual procedures for overlaying maps were popularized.  These techniques marked a turning point in the use of maps— from one emphasizing the physical descriptors of geographic space, to one spatially characterizing appropriate management actions.  The movement from 'descriptive' to 'prescriptive' mapping has set the stage for modern computer-assisted map analysis.


Since the 1960's, decision-making has become much more quantitative.  Mathematical models for non-spatial analyses are becoming commonplace.  However, the tremendous volumes of data used in spatial analysis limit the application of traditional statistics and mathematics to spatial models.  Non-spatial procedures require that maps be generalized to typical values before they can be used.  The spatial detail for large areas is reduced to a single value expressing the 'central tendency' of a variable at that location, a tremendous reduction from the spatial specificity in the original map.


Recognition of this problem led to the 'stratification' of maps at the onset of analysis by dividing geographic space into assumed homogeneous response parcels.  Heated arguments often arise as to whether a standard normal, binomial or Poisson distribution best characterizes the typical value in numeric space.  However, relatively little attention is given to the broad assumption that this value must be presumed to be uniformly distributed in geographic space.  The area-weighted average of several parcels' typical values is used to statistically characterize an entire study area.  Mathematical modeling of spatial systems has followed an approach similar to that of spatial statistics: aggregating spatial variation in model variables.  Most ecosystem models, for example, identify 'level' and 'flow' variables presumed to be typical for vast geographic expanses.
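
The area-weighted average mentioned above can be sketched in a few lines.  This is only an illustration: the parcel areas and typical values are hypothetical, and the function name is ours.

```python
# Hypothetical parcels: (area in hectares, typical value assumed for the parcel).
parcels = [(120.0, 14.0), (300.0, 9.5), (80.0, 21.0)]


def area_weighted_average(parcels):
    """Return sum(area * value) / sum(area) over all parcels."""
    total_area = sum(area for area, _ in parcels)
    return sum(area * value for area, value in parcels) / total_area


# One number now stands in for the whole study area.
study_area_value = area_weighted_average(parcels)
print(round(study_area_value, 2))
```

Note that the result says nothing about where within the 500 hectares the higher or lower values occur, which is exactly the loss of spatial detail discussed here.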



Figure 1.  Conventional elevation (topographic) contour map versus three-dimensional terrain representation.


However, maps actually map the details of spatial variation.  Manual cartographic techniques allow manipulation of these detailed data; yet they are fundamentally limited by their non-digital nature.  Traditional statistics and mathematics are digital; yet they are fundamentally limited by their generalizing of the data.  This dichotomy has led to the revolutionary concepts of map structure, content and use forming the foundation of GIS technology.  It radically changes our perspective.  Maps move from analog images describing the location of features to digital mapped data quantifying a physical, social or economic system in prescriptive terms. 


This revolution is founded in the recognition of the digital nature of computerized maps: maps as data, maps as numbers.  To illustrate, consider the tabular and graphic information in Figure 1.  The upper left inset is a typical topographic map.  One-hundred-foot contour lines show the pattern of the elevation gradient over the area.  The human eye quickly assesses the flat areas, the steep areas, the peaks and the depressions.  However, in this form, the elevation information is incompatible with any quantitative model requiring input of this variable.


Traditional statistics can be used to generalize the elevation gradient as shown in the table in the upper right.  We note that the elevation ranges from 500 to 2500 feet with an average of 1293 feet.  The standard deviation of ±595 feet tells us how typical this average is: most often (about two-thirds of the time) you can expect to encounter elevations from 698 to 1888 feet.  But where would I expect higher elevations; where would I expect lower?  The statistic offers no insight other than that the larger the variation, the less 'typical' is the average; the smaller the better.  In this instance, it's not very good as the standard deviation is nearly half the mean (coefficient of variation = 0.46).
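
The arithmetic behind these figures is easy to check.  The sketch below recomputes the 'typical' range and the coefficient of variation from the mean and standard deviation quoted in the text; the summarize function is a hypothetical helper, and its use of the population standard deviation is our assumption, since the text does not say which form was used.

```python
import statistics


def summarize(elevations):
    """Collapse a list of elevations to non-spatial summary statistics."""
    mean = statistics.mean(elevations)
    std = statistics.pstdev(elevations)  # population SD (an assumption)
    return {
        "min": min(elevations),
        "max": max(elevations),
        "mean": mean,
        "std": std,
        "typical_range": (mean - std, mean + std),
        "cv": std / mean,
    }


# Check the figures quoted in the text from the reported mean and SD.
mean_ft, std_ft = 1293, 595
typical_range = (mean_ft - std_ft, mean_ft + std_ft)  # (698, 1888)
cv = round(std_ft / mean_ft, 2)                       # 0.46
print(typical_range, cv)
```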


The larger centered inset is a 3-dimensional plot of the elevation data.  The gridded data contain an estimate of the elevation at each hectare throughout the area.  In this form, your eye sees the variability in the terrain: the flat area in the NW, the highlands in the NE.  For contrast, the average elevation is represented as the horizontal plane intersecting the surface at 1293 feet.  Its standard deviation can be conceptualized as two additional planes 'floating' ±595 feet above and below the average plane (arrows along the 'Z' axis).


A non-spatial model must assume the actual elevation for any parcel is somewhere between these variation planes, most likely about 1293 feet.  But your eye notes the eastern portion is above the mean, while the western portion is below.  The digital representation stored in a GIS maps this variation in quantitative terms.  Thus the average and variance are the conceptual link between spatial and non-spatial data.  The average of traditional statistics reduces the complexity of geographic space to a single value.  Spatial statistics retains this complexity as a map of the variation in the data.
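
To make the contrast concrete, here is a minimal sketch that keeps the variation as a map rather than collapsing it.  The small elevation grid is hypothetical (it is not the data behind Figure 1); it simply rises from west to east.

```python
# Hypothetical elevation grid in feet; columns run west to east.
elevation = [
    [500, 700, 1500, 2500],
    [600, 900, 1700, 2300],
    [500, 800, 1400, 2100],
]

cells = [value for row in elevation for value in row]
mean = sum(cells) / len(cells)  # the single non-spatial 'typical' value

# The spatial view: a map of each cell's departure from that average.
deviation = [[value - mean for value in row] for row in elevation]
sign_map = [["+" if d > 0 else "-" for d in row] for row in deviation]

for row in sign_map:
    print(" ".join(row))
```

The printed map shows the western columns below the mean and the eastern columns above it, information the average alone cannot carry.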


In computer-assisted map analysis all maps are viewed as an organized set of numbers.  These numbers have numerical significance, as well as conventional spatial positioning concerns, such as scale or projection.  It is the numerical attribute of GIS maps that fuels the concepts of 'map-ematics'.  For example, the first derivative of the elevation surface in the figure creates a slope map.  The second derivative creates a terrain roughness map (where slope is changing).  An aspect map (azimuthal orientation) indicates the direction of terrain slope at each hectare parcel. 
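
One way to sketch the first derivative of a gridded surface is with central differences.  The code below is an illustration under stated assumptions, not the algorithm of any particular GIS: row 0 is taken as the northern edge, slope is expressed in percent, aspect is the downslope direction in degrees clockwise from north, and edge cells are left undefined.

```python
import math


def slope_aspect(z, cell_size):
    """Return (slope %, aspect degrees) grids for the interior cells of z."""
    rows, cols = len(z), len(z[0])
    slope = [[None] * cols for _ in range(rows)]
    aspect = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Rate of change toward the east and toward the north.
            dz_east = (z[r][c + 1] - z[r][c - 1]) / (2 * cell_size)
            dz_north = (z[r - 1][c] - z[r + 1][c]) / (2 * cell_size)
            slope[r][c] = 100 * math.hypot(dz_east, dz_north)
            if slope[r][c] > 0:  # aspect is undefined on flat cells
                downhill = math.atan2(-dz_east, -dz_north)
                aspect[r][c] = math.degrees(downhill) % 360
    return slope, aspect


# Hypothetical surface rising 10 units per 100-unit cell toward the east.
surface = [[10 * c for c in range(4)] for _ in range(3)]
slope, aspect = slope_aspect(surface, cell_size=100)
print(slope[1][1], aspect[1][1])
```

For this tilted plane each interior cell has a 10 percent slope facing west (270 degrees).  Applying the same function to the slope grid itself would give a rough second derivative.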


But what if the figure wasn't mapping elevation— rather the concentration of an environmental variable, such as lake temperatures or soil concentrations of lead?  For lake temperatures, the first derivative would map the rate of cooling.  The aspect map would indicate the direction of cooling throughout the lake.  For lead concentrations, the first derivative would map the rate of lead accumulation throughout the study area.  The second derivative (change in the rate of accumulation) would provide information about multiple sources of lead pollution or abrupt changes in seasonal wind patterns.  The aspect map of lead concentrations would indicate the direction of accumulation.  If the figure were a cost surface, the first derivative maps marginal cost; the aspect map indicates direction of minimal cost movement throughout the area.  If it were a travel-time surface, the first derivative maps speed, the second, acceleration, and the aspect map indicates the optimal movement through each parcel.


This quantitative treatment of maps will be the subject of a series of articles in GIS World.  We will investigate such topics as data structure implications, error assessment, measuring effective distance, establishing optimal paths and visual connectivity, spatial interpolation, and linking spatial and non-spatial data.  The foundation for these new analytic capabilities is the digital nature of GIS maps— a map is worth (and sometimes exceeds) a thousand numbers.



It Depends: implications of data structure

(GIS World, May 1989)  



The main purpose of a geographic information system is to process spatial information.  In doing so it must be capable of four things:


-       create digital abstractions of the landscape (encode),

-       efficiently handle these data (store),

-       develop new insights into the relationships of spatial variables (analyze),

-       and ultimately create 'human-compatible' summaries of these relationships (display).


The data structure used for storage has far-reaching implications in how we encode, analyze and display digital maps.  It has also fueled heated debate as to the 'universal truth' in data structure since the inception of GIS.  In truth, there are more similarities than differences in the various approaches.


All GIS are 'internally referenced' which means they have an automated linkage between the data (or thematic attribute) and the whereabouts (or positional attribute) of that data.  There are two basic approaches used in describing positional attributes.  One approach (vector) uses a collection of line segments to identify the boundaries of point, linear and areal features.  The alternative approach (raster) establishes an imaginary grid pattern over a study area, then stores values identifying the thematic attribute occurring within each grid space.


Although there are significant practical differences in these data structures, the primary theoretical difference is that the grid structure stores information on the interior of areal features, and implies boundaries; whereas, the line structure stores information about boundaries, and implies interiors.  This fundamental difference determines, for the most part, the types of applications that may be addressed by a particular GIS. 


It is important to note that both systems are actually grid-based; in practice, line-oriented systems simply use a very fine grid of 'digitizer' coordinates.  Point features, such as springs or wells on a water map, are stored the same for both systems: a single digitizer 'x,y' coordinate pair or a single 'column,row' cell identifier.  Similarly, line features, such as streams on a water map, are stored the same: a series of 'x,y' or 'column,row' identifiers.  If the same gridding resolution is used, there is no theoretical difference between the two data structures, and considering modern storage devices, only minimal practical differences in storage requirements.
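
The equivalence of 'x,y' and 'column,row' identifiers can be sketched as a simple conversion.  The coordinates, cell sizes and function name below are hypothetical, and the grid origin is assumed to sit at the lower-left corner of the study area.

```python
def to_cell(x, y, cell_size, x_min=0.0, y_min=0.0):
    """Return the (column, row) of the grid cell containing point x,y."""
    return (int((x - x_min) // cell_size), int((y - y_min) // cell_size))


# Hypothetical digitizer coordinates for a well and the vertices of a stream.
well_xy = (231.5, 874.2)
stream_xy = [(10.0, 40.0), (60.0, 95.0), (140.0, 130.0), (260.0, 180.0)]

# A coarse grid: the point and the line both become cell identifiers.
well_cell = to_cell(*well_xy, cell_size=100.0)
stream_cells = [to_cell(x, y, cell_size=100.0) for x, y in stream_xy]

# A very fine grid approaches the precision of the original coordinates.
well_fine = to_cell(*well_xy, cell_size=1.0)
print(well_cell, stream_cells, well_fine)
```

This sketch converts only the recorded vertices; a full conversion would also fill in the cells a stream crosses between vertices.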


Yet, it was storage considerations that fueled most of the early debate about the relative merits of the two data structures.  Demands of a few, or even one, megabyte of storage were considered a lot in the early 1970's.  To reduce storage, very coarse grids were used in early grid systems.  Under this practice, streams were no longer the familiar thin lines assumed a few feet in width, but represented as a string of cells of several acres each.  This, coupled with the heavy reliance on pen-plotter output, resulted in 'ugly, saw-toothed' map products when using grid systems.  Recognition of any redeeming qualities of this data form was lost to the unfamiliar character of the map product.


Consideration of areal features presents significant theoretical differences between the two data structures.  A lake on a water map might be described by its border, defined as a series of line segments, or by its interior, defined as a set of cells identifying open water.  This difference has important implications in the assumptions about mapped data.  In a line-based system, the lines are assumed to be 'real' divisions of geographic space into homogeneous units.  This assumption is reasonable for most lakes if you accept the premise that the shoreline remains constant.


However, if the body of water is a flood-control reservoir the shoreline could shift several hundred meters during a single year.  A better example of an ideal line feature is a property boundary.  Although these divisions are not physical, they are real and represent indisputable boundaries that, if you step one foot over the line, often jeopardize friendships and international treaties alike. 


However, consider the familiar contour map of elevation.  The successive contour lines form a series of long skinny polygons.  Within each of these polygons the elevation is assumed to be constant, forming a 'layer-cake' of flat terraces in 3-dimensional data space.  For a few places in the world, such as rice paddies in mountainous portions of Korea or the mesas of New Mexico, this may be an accurate portrayal.  Elsewhere, this aggregation of a continuous spatial gradient discards much of the information used in its derivation.

An even less clear example of a traditional line-based image is the familiar soil type map.  The careful use of a fine-tipped pen in characterizing the distribution of soils imparts artificial accuracy at best.  At worst, it severely limits the potential uses of soil information in a geographic information system. 


As with most resource and environmental data, a soil map is not 'certain'; as contrasted with the surveyed and legally filed property map.  Rather the distribution of soils is probabilistic— the lines form artificial boundaries presumed to be the abrupt transition from one soil type to another.  Throughout each of the soil polygons, the occurrence of the designated soil type is treated as equally likely.  Most soil map users reluctantly accept the 'inviolately accurate' assumption of this data structure, as the recourse is to dig soil pits everywhere within a study area.  It’s a lot easier to just go with the flow.


A more useful data structure for representing soils is gridded, with each grid location identified by its most probable soil, a statistic indicating how probable, the next most probable soil, its likelihood, and so on.  In this context, soils are characterized as a continuous statistical gradient— detailed data, rather than an aggregated, human-compatible image.  Such treatment of map information is a radical departure from the traditional cartographic image.  Such treatment highlights the revolution in spatial information handling brought about by the digital map.  From this new perspective, maps move from images 'describing' the location of features to mapped information quantifying a physical or abstract system in 'prescriptive' terms— from inked lines and colorful patterns to mapped variables affording numerical analysis of complex spatial interrelationships.
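
A minimal sketch of such a gridded soil structure follows.  The soil codes and probabilities are hypothetical; the point is only that each cell carries a ranked list of candidates rather than a single label.

```python
# Each cell holds (soil code, probability) pairs; all values are hypothetical.
soil_grid = [
    [[("A", 0.70), ("B", 0.20), ("C", 0.10)], [("A", 0.55), ("B", 0.45)]],
    [[("B", 0.60), ("A", 0.40)], [("C", 0.90), ("B", 0.10)]],
]


def ranked(cell):
    """Return a cell's candidates ordered from most to least probable."""
    return sorted(cell, key=lambda pair: pair[1], reverse=True)


# Two derived maps: the most probable soil, and how probable it is.
most_probable = [[ranked(cell)[0][0] for cell in row] for row in soil_grid]
confidence = [[ranked(cell)[0][1] for cell in row] for row in soil_grid]
print(most_probable)
print(confidence)
```

The first derived map resembles the traditional soil map; the second is the information the inked line discards.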


The data structure (lines or grids) plays an important part in map analysis.  Storage requirements of GIS's are massive.  A typical U.S. Soil Conservation Service map based on the U.S. Geological Survey 7.5-minute quadrangle contains about 1,200 soil polygons.  A complete digital data base containing all 54,000 quadrangle maps covering the lower 48 U.S. states would involve keeping track of nearly 65,000,000 polygons, each defined by numerous coordinates.  A similar grid system would require 1,000,000,000,000,000 bits of data for a detailed gridding resolution of 1.7 meters (Light, 1986).


With current technology, all those data could be stored on 4000 optical disks, a collection smaller than the phonograph record library at a typical radio station.  A similar data base for the entire continent of Africa at a hectare gridding resolution (a reasonable land planning cell size) could be stored on five optical disks.  A map stored in this data base could be accessed, analyzed to derive a new map, and that new map stored in a matter of a few seconds.  Though processing and storage requirements of GIS technology are significant, advances in computer technology are rapidly changing our views defining the 'practical' limits.
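
The storage figures quoted above can be checked with simple arithmetic.  The per-disk capacity computed here is only the quotient implied by the text's own numbers, not a specification of any actual device.

```python
# Figures taken from the text.
polygons_per_quad = 1_200
quads = 54_000
grid_bits = 10**15  # 1,000,000,000,000,000 bits
disks = 4_000

total_polygons = polygons_per_quad * quads      # 64,800,000
bits_per_disk = grid_bits // disks              # 250,000,000,000
gigabytes_per_disk = bits_per_disk / 8 / 10**9  # 31.25 (decimal gigabytes)
print(total_polygons, bits_per_disk, gigabytes_per_disk)
```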



Figure 1.  Comparison of two data structures and their applications.


The theoretical differences between the two data structures, 'line' and 'grid', are significant in considering the future of GIS technology.  The insets in Figure 1 show the overlay results for three simple geometric shapes: lines on the left, grids on the right.  As noted previously, the lines describe boundaries around areas assumed to be the same throughout their interior.  Grid structure, on the other hand, defines the interior of features as groupings of contiguous cells.  For lines, this consists of a few coordinate pairs stored for each shape, with the curved line of the circle having the most line segments at fourteen.


The grid structure uses a 25 by 25 matrix of numbers (625 total cells) to represent each of the three geometric shapes.  Even though a data compression technique was used for the gridded maps, the storage requirement for these simple shapes is significantly less for the line structure: 21 coordinate pairs versus nearly five hundred numbers for the three gridded maps.  In addition, the boundaries of the features are more accurately plotted.  Why would anyone ever use grids?  Well, it depends.
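
The text does not name the compression technique used for the gridded maps.  As one common possibility, the sketch below applies run-length encoding to a hypothetical 25 by 25 grid containing a rectangle; it will not reproduce the article's exact counts, but it shows why a compressed grid still needs more numbers than a handful of corner coordinates.

```python
def run_length_encode(row):
    """Collapse a row of cell values into [value, run length] pairs."""
    runs = []
    for value in row:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


SIZE = 25
# Hypothetical rectangle: rows 5-14 and columns 8-19 are coded 1, the rest 0.
grid = [
    [1 if 5 <= r < 15 and 8 <= c < 20 else 0 for c in range(SIZE)]
    for r in range(SIZE)
]

encoded = [run_length_encode(row) for row in grid]
numbers_stored = sum(2 * len(runs) for runs in encoded)
print(numbers_stored, "numbers instead of", SIZE * SIZE)
```

Here the encoded grid needs 90 numbers, against four corner coordinate pairs for the same rectangle in a line structure.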


Significant differences are apparent during analysis of these data.  In the line structure, seventeen new polygons were derived, comprised of 39 individual line segments.  A significant increase is noted in the storage requirement for the composite map over any of the original maps.  Consider the complexity of overlaying a typical land use map of several hundred polygons with a soil map of over a thousand— the result is more 'son and daughter' polygonal offspring than you would care to count (or most small computers would care to store). 


Even more significant are the computational demands involved in splitting and fusing the thousands of line segments forming the new boundaries of the composite map.  By contrast, a composite of the maps stored in grid structure simply involves matrix addition.  The storage requirement for the result is slightly more than that of any of the original maps, but can never exceed the maximum dimensionality of the grid.  In most advanced map analyses, the line structure is significantly less efficient in both computation and storage of derived data.  In addition, recent advances in computer hardware, such as array processors and fast access, high-resolution raster displays, utilize a grid structure.  To take advantage of this new technology, line systems must be converted to grids, adding an additional processing step.
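
Overlay by matrix addition can be sketched as follows.  The three small maps are hypothetical, and coding them 1, 2 and 4 is our choice so that every sum identifies a unique combination of features.

```python
def overlay(*maps):
    """Add co-registered grids cell by cell."""
    return [[sum(cells) for cells in zip(*rows)] for rows in zip(*maps)]


N = 4
top_half = [[1 if r < 2 else 0 for c in range(N)] for r in range(N)]
left_half = [[2 if c < 2 else 0 for c in range(N)] for r in range(N)]
diagonal = [[4 if r == c else 0 for c in range(N)] for r in range(N)]

composite = overlay(top_half, left_half, diagonal)
for row in composite:
    print(row)
```

A cell value of 7 means all three features coincide, 3 means the first two only, and so on.  The composite has the same dimensions as each input, however many maps are combined.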



GIS Technology Is Technical Oz

(GIS World, July/August 1989)  

You're hit with a tornado of new concepts, temporarily hallucinate and come back to yourself a short time later wondering what on earth all those crazy things meant (JKB)


As promised (or threatened) in the last issue of GIS World, this article continues to investigate the implications of data structure on map analysis.  Recall that first and foremost, maps in a GIS are digital data organized as large sets of numbers; not analog images comprised of inked lines, colors and shadings.  Data structure refers to how we organize these numbers— basically as a collection of line segments or as a set of grid cells.  Theoretical differences between these two structures arise for storage of polygonal features.  Line-based structures store information about polygon boundaries, and imply interiors.  Cell-based structures do just the opposite; implying boundaries while storing information on interiors.  So much for review.  What does this imply for map analysis?


In short, which of the two basic approaches is used significantly affects map analysis speed, accuracy and storage requirements.  It also defines the set of potential analytic 'tools' and their algorithms.  For example, consider the accompanying figure depicting three simple geometric shapes stored in typical formats of both structures.  As noted previously, the lines describe boundaries around areas assumed to be the same throughout their interior (left side of Figure 1).


Figure 1.  Comparison of overlay results using vector (left side) and raster (right side).


Cells, on the other hand, define the interior of features as groupings of contiguous cells (right side).  The series of numbers with both insets in the figure show example storage structures (stylized).  For lines, this consists of a few coordinate pairs stored for each shape, with the curved line of the circle having the most line segments at fourteen.  The grid structure uses a 25 by 25 matrix of numbers (625 numbers per map), shown as the storage arrays to the immediate right of each geometric shape.  The storage requirement of these features is obviously less for the line structure (tens of numbers versus hundreds).  The spatial precision of the boundaries is also obviously better for the line structure; the sawtooth effect in the grid structure is an unreal and undesirable artifact.  It's fair to say that the line structure frequently has an advantage in spatial precision and storage efficiency of base maps, that is, in inventory.


However, other differences are apparent during analysis of these data.  For example, the composite maps at the bottom of the figure are the results of simply overlaying the three features, one of the basic analytic functions.  In the line structure, seventeen new polygons are derived, composed of 39 individual line segments.  This is a significant increase in the storage requirement for the composite map as compared to any of the simple original maps.  But consider the realistic complexity of overlaying a land use map of several hundred polygons with a soil map of over a thousand: the result is more 'son and daughter' polygonal progeny than you would care to count (or most small computers would care to store).


On the other hand, the storage requirement for the grid structure can never exceed the maximum dimensionality of the grid, no matter how many input maps or their complexity.  Even more significant are the computational demands involved in splitting and fusing the potentially thousands of line segments forming the new boundaries of the derived map.  By contrast, the overlaying of the maps stored in grid structure simply involves direct storage access and matrix addition.  It's fair to say that the grid structure frequently has an advantage in computation and storage efficiency of derived maps, that is, in analysis.
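
To give a feel for the geometric work a line-based overlay involves, the sketch below tests every boundary segment of one polygon against every segment of another.  It is a naive illustration under our own assumptions (two hypothetical overlapping squares, parallel segments ignored); production systems use far more refined methods.

```python
def segment_intersection(p1, p2, p3, p4):
    """Return the point where segment p1-p2 crosses p3-p4, or None."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    denom = (x2 - x1) * (y4 - y3) - (y2 - y1) * (x4 - x3)
    if denom == 0:
        return None  # parallel or collinear; not handled in this sketch
    t = ((x3 - x1) * (y4 - y3) - (y3 - y1) * (x4 - x3)) / denom
    u = ((x3 - x1) * (y2 - y1) - (y3 - y1) * (x2 - x1)) / denom
    if 0 <= t <= 1 and 0 <= u <= 1:
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
    return None


def edges(polygon):
    """Yield consecutive vertex pairs, closing the ring at the end."""
    return list(zip(polygon, polygon[1:] + polygon[:1]))


square_a = [(0, 0), (4, 0), (4, 4), (0, 4)]
square_b = [(2, 2), (6, 2), (6, 6), (2, 6)]

tests = 0
crossings = []
for a1, a2 in edges(square_a):
    for b1, b2 in edges(square_b):
        tests += 1
        point = segment_intersection(a1, a2, b1, b2)
        if point is not None:
            crossings.append(point)
print(tests, "segment tests,", len(crossings), "crossings")
```

Even two squares require sixteen segment tests to find two crossing points, and the boundaries must still be split and reassembled into new polygons afterward.  The grid overlay needs none of this.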


It is also fair to say that the relative advantages and disadvantages of the two data structures have not escaped GIS technologists.  Database suppliers determine the best format for each variable (USGS uses vector DLG format for all 7.5 minute quadrangle information but elevation, which is in raster DEM format).  Most vendors provide conversion routines for transferring data between vector and raster.  Many provide 'schizophrenic' systems with both a vector and a raster processing side.  Some have developed specialized data structure offshoots, such as 'Rasterized lines, Quadtrees and TIN.'  In each instance careful consideration is given to the nature of the data, processing considerations and the intended use.  It depends.


Another concern is the characteristics of the data derived in map analysis.  In the case of line structure, each derived polygon is assumed to be accurately defined: a precise intersection of real boundaries surrounding a uniform geographic distribution of data.  This is true for overlaying a property map with a zip code map, but it is a limiting assumption for probabilistic resource data, such as soils and land cover, as well as gradient data, such as topographic relief and weather maps.  For example, recall the geographic search (overlay) for areas of Cohassett soil, moderate slope, and ponderosa pine forest cover described in the first article of this series.  A line-based system generates an 'image' of the intersections of the specified polygons.  Each derived polygon is assumed to locate the precisely defined combinations of the variables.  In addition, the likelihood of actual occurrence is assumed the same for all of the polygonal progeny, even small slivers formed by intersecting edges of the input polygons.


A grid-oriented system calculates the coincidence of variables at each cell location as if each were an individual 'polygon'.  Since these 'polygons' are organized as a consistent, uniform grid, the calculations simply involve storage retrieval and numeric evaluation— not geometric calculations for intersecting lines.  In addition, if an estimate of error is available for each variable at each cell, the value assigned as a function of these data can also indicate the most likely composition (coincidence) of the variables— 'there is an 80% chance that this hectare is Cohassett soil, moderately sloped and ponderosa pine covered.'  The result is a digital map of the derived variable, expressed as a geographic distribution, plus its likelihood of error (a sort of 'shadow' map of certainty of result).  This concept, termed 'error propagation' modeling, is admittedly an unfamiliar, and likely an uncomfortable one. 
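
A minimal sketch of this 'shadow' map idea follows.  The per-cell probabilities are hypothetical, and multiplying them assumes the errors in the three layers are independent, which is our simplifying assumption rather than something the text specifies.

```python
import math

# Hypothetical per-cell probabilities that each condition is truly present.
soil_ok = [[0.95, 0.60], [0.80, 0.90]]
slope_ok = [[0.90, 0.90], [0.70, 0.95]]
cover_ok = [[0.94, 0.85], [0.75, 0.99]]


def joint_certainty(*layers):
    """Multiply co-registered probability grids cell by cell."""
    return [[math.prod(cells) for cells in zip(*rows)] for rows in zip(*layers)]


certainty = joint_certainty(soil_ok, slope_ok, cover_ok)
for row in certainty:
    print([round(p, 2) for p in row])
```

The upper-left cell works out to roughly 0.80, matching the style of statement quoted above: about an 80% chance that the hectare holds all three conditions.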


It is but one of the gusts in the GIS whirlwind that is taking us beyond mapping.  Others include drastically modified techniques, such as weighted distance measurement (a sort of rubber ruler), and entirely new procedures, such as optimal path density analysis (identifying the Nth best route).  These new analytic concepts and constructs will be the focus of future articles.


