The Nature of Geographic Information

This chapter reviews the fundamentals of geographic data and information. The focus is on understanding the basic structure of geographic data, and how issues of accuracy, error, and quality are paramount to properly using GIS technology. The establishment of a robust database is the cornerstone of a successful GIS.

Maps and Spatial Information

The main method of identifying and representing the location of geographic features on the landscape is a map. A map is a graphic representation of where features are, explicitly and relative to one another. A map is composed of different geographic features represented as points, lines, and/or areas. Each feature is defined both by its location in space (with reference to a coordinate system), and by its characteristics (typically referred to as attributes). Quite simply, a map is a model of the real world.

The map legend is the key linking the attributes to the geographic features. Attributes, such as the species for a forest stand, are typically represented graphically by use of different symbology and/or color. For GIS, attributes need to be coded in a form in which they can be used for data analysis (Burrough, 1986). This implies loading the attribute data into a database system and linking it to the graphic features.

Maps are simply models of the real world. They represent snapshots of the land at a specific map scale. The map legend is the key identifying which features are represented on a map.

For geographic data, often referred to as spatial data, features are usually referenced in a coordinate system that models a location on the earth's surface. The coordinate system may be of a variety of types. For natural resource applications the most common are:

geographic coordinates such as latitude and longitude, e.g. 56°27'40" and 116°11'25". These are usually expressed in degrees, minutes, and seconds. Geographic coordinates can also be expressed in decimal degrees, e.g. 54.65°.
a map projection, e.g. Universal Transverse Mercator (UTM), where coordinates are measured in metres, e.g. 545,000.000 and 6,453,254.000, normally referenced to a central meridian. Eastings refer to X coordinates while Northings refer to Y coordinates.
a legal survey description, e.g. Meridian, Township, and Range as in the Alberta Township System, e.g. Township 075 Range 10 West of 4th Meridian.
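
The relationship between the degrees, minutes, seconds form and the decimal degree form can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the hemispheres (North and West) are an assumption, since the example coordinates above are given without them.

```python
def dms_to_decimal(degrees: int, minutes: int, seconds: float,
                   hemisphere: str = "N") -> float:
    """Convert degrees, minutes, seconds to signed decimal degrees."""
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in the range [0, 60)")
    value = abs(degrees) + minutes / 60 + seconds / 3600
    # By the usual convention, South latitudes and West longitudes are negative.
    return -value if hemisphere.upper() in ("S", "W") else value


# Assumed hemispheres: the source example does not state them.
latitude = dms_to_decimal(56, 27, 40, "N")
longitude = dms_to_decimal(116, 11, 25, "W")
print(round(latitude, 4), round(longitude, 4))
```

Each minute is one sixtieth of a degree and each second one sixtieth of a minute, which is all the conversion relies on.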

Geographic data is distinguished from attribute data in that it is referenced spatially by a coordinate system, i.e. it has a spatial extent. Natural resource applications commonly use a legal survey system, e.g. the Alberta Township System (ATS), which identifies a feature's location by Meridian, Township, and Range; or a projection such as the UTM coordinate system, which identifies features by an Easting coordinate (X) and a Northing coordinate (Y) in a particular UTM zone.

In the UTM projection the area of the earth between 84 degrees North and 80 degrees South latitude is divided into north-south columns 6 degrees of longitude wide called zones. These are numbered 1 to 60 eastward beginning at the 180th meridian. Within each zone the central meridian is given an Easting value of 500,000 metres. The equator is designated as having a Northing value of 0 for northern hemisphere coordinates. Coordinates are recorded in metres relative to the central meridian of a particular zone. Because every zone uses the same numbering, coordinate values are duplicated from zone to zone, and a coordinate pair is unambiguous only when its zone is stated. Accordingly, use of the UTM projection is only appropriate for certain spatial extents and scales of data. It is not appropriate to use this projection if your area of interest crosses UTM zone boundaries.

For example, the province of Alberta is located in UTM zones 11 and 12. The central meridian for zone 11 is 117° West longitude. The central meridian for zone 12 is 111° West longitude. The UTM coordinate system is the most widely used projection in the mapping industry and consequently is becoming a de facto standard for use with geographic information systems. This is particularly true for regional data in Canada. The State Plane coordinate system is widely used in the United States.
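
The zone numbering described above (6 degree columns, numbered eastward from the 180th meridian) can be expressed as simple arithmetic. The sketch below uses hypothetical function names and takes West longitudes as negative. It ignores the special zone boundaries used around Norway and Svalbard, which is an assumption made to keep the example short.

```python
import math


def utm_zone(longitude_deg: float) -> int:
    """Return the UTM zone number (1 to 60) for a longitude in degrees.

    West longitudes are negative. Regional exceptions are not handled.
    """
    if not -180.0 <= longitude_deg <= 180.0:
        raise ValueError("longitude must be between -180 and 180 degrees")
    if longitude_deg == 180.0:
        # 180 E is the same meridian as 180 W; keep it in the last zone.
        return 60
    return int(math.floor((longitude_deg + 180.0) / 6.0)) + 1


def central_meridian(zone: int) -> float:
    """Return the central meridian of a UTM zone in degrees (West negative)."""
    if not 1 <= zone <= 60:
        raise ValueError("UTM zones are numbered 1 to 60")
    return zone * 6.0 - 183.0


for zone in (11, 12):
    print(zone, central_meridian(zone))
```

With these formulas zones 11 and 12 have central meridians at -117 and -111 degrees, matching the Alberta example.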

Maps are the traditional method of storing and displaying geographic information.

A map portrays three kinds of information about geographic features:

Location and extent of the feature
Attributes (characteristics) of the feature
Relationship of the feature to other features.

Geography has often been described as the study of why what is where. This description is quite appropriate when considering the three kinds of information that are portrayed by the traditional map:

The location and extent of a feature is identified explicitly by reference to a coordinate system representing the earth's surface. This is where a feature is.
The attributes of a feature describe or characterize the feature. This is what the feature is.
The relationship of a feature to other features is implied from the location and attributes of all features. Relationships can be defined explicitly, e.g. roads connecting towns, regions adjacent to one another, or implicitly, e.g. close to, far from, similar to, etc. Implicit relationships are interpreted according to the knowledge that we have about the natural world. Relationships are described as how or why a feature is.

The geographic information system distinguishes between the spatial and attribute aspects of geographic features.

The identification of relationships between features, within a common theme or across different themes, is the primary function of a GIS.

Characterizing Geographic Features

All geographic features on the earth's surface can be characterized and defined as one of three basic feature types. These are points, lines, and areas.

Point data exists when a feature is associated with a single location in space. Examples of point features include a fire lookout tower, an oil well or gas activity site, and a weather station.
Linear data exists when a feature's location is described by a string of spatial coordinates. Examples of linear data include rivers, roads, pipelines, etc.
Areal data exists when a feature is described by a closed string of spatial coordinates. An area feature is commonly referred to as a polygon. Polygonal data is the most common type of data in natural resource applications. Examples of polygonal data include forest stands, soil classification areas, administrative boundaries, and climate zones. Most polygon data is considered to be homogeneous in nature and thus is consistent throughout.
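
The three feature types can be pictured as simple data structures. The sketch below is illustrative only: the class names, field names, and coordinate values are hypothetical and do not follow any particular GIS package.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coordinate = Tuple[float, float]  # (x, y), e.g. (Easting, Northing) in metres


@dataclass
class PointFeature:
    """A feature associated with a single location."""
    name: str
    location: Coordinate


@dataclass
class LineFeature:
    """A feature described by a string of coordinates."""
    name: str
    vertices: List[Coordinate]

    def __post_init__(self) -> None:
        if len(self.vertices) < 2:
            raise ValueError("a line needs at least two vertices")


@dataclass
class AreaFeature:
    """A feature described by a closed string of coordinates (a polygon)."""
    name: str
    ring: List[Coordinate]

    def __post_init__(self) -> None:
        # A closed ring repeats its first vertex as its last vertex.
        if len(self.ring) < 4 or self.ring[0] != self.ring[-1]:
            raise ValueError("a polygon ring must have at least four "
                             "vertices and must close on itself")


# Hypothetical features with made-up coordinates.
tower = PointFeature("fire lookout tower", (545000.0, 6453254.0))
road = LineFeature("road", [(545000.0, 6453254.0), (545400.0, 6453600.0)])
stand = AreaFeature("forest stand", [(0.0, 0.0), (100.0, 0.0),
                                     (100.0, 80.0), (0.0, 0.0)])
```

The only structural difference between a line and an area here is the closure rule: the polygon's coordinate string must end where it began.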

Every geographic phenomenon can in principle be represented by a point, a line, and/or an area.

(Adapted from Berry)

Commonly, an identifier accompanies all types of geographic features. This description or identifier is referred to as a label. Labels distinguish geographic features of the same type, e.g. forest stands, from one another. Labels can be in the form of a name, e.g. "Lake Louise", a description, e.g. "WELL" or a unique number, e.g. "123". Forest stand numbers are examples of polygon labels. Each label is unique and provides the mechanism for linking the feature to a set of descriptive characteristics, referred to as attribute data.
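
A minimal sketch of how a label can link a feature to its attribute data is shown below. It uses plain Python dictionaries keyed by label. The stand numbers, species, and ages are invented for illustration, and a real GIS would hold the attributes in a database table rather than in memory.

```python
# Geometry keyed by label (hypothetical forest stand numbers).
stand_geometry = {
    "123": [(0.0, 0.0), (100.0, 0.0), (100.0, 80.0), (0.0, 80.0), (0.0, 0.0)],
    "124": [(100.0, 0.0), (180.0, 0.0), (180.0, 80.0), (100.0, 80.0),
            (100.0, 0.0)],
}

# Attribute records keyed by the same labels (invented values).
stand_attributes = {
    "123": {"species": "white spruce", "age_years": 85},
    "124": {"species": "trembling aspen", "age_years": 40},
}


def attributes_for(label, attribute_table):
    """Look up the attribute record linked to a feature label."""
    if label not in attribute_table:
        raise KeyError(f"no attribute record for label {label!r}")
    return attribute_table[label]


def unlinked_labels(geometry, attribute_table):
    """Return labels that have geometry but no attribute record."""
    return sorted(set(geometry) - set(attribute_table))


print(attributes_for("123", stand_attributes))
print(unlinked_labels(stand_geometry, stand_attributes))
```

The link only works if each label is unique, which is why duplicate or missing labels are worth checking before any analysis.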

It is important to note that geographic features and the symbology used to represent them, e.g. point, line, or polygon, are dependent on the graphic scale (map scale) of the data. Some features can be represented by point symbology at a small scale, e.g. villages on a 1:1,000,000 map, and by areal symbology at a larger scale, e.g. villages on a 1:10,000 map. Accordingly, the accuracy of the feature's location is often fuzzier at a smaller scale than at a larger scale. The generalization of features is an inherent characteristic of data presented at a smaller scale.

Data can always be generalized to a smaller scale, but detail CANNOT be created!

Remember, as the scale of a map decreases, e.g. from 1:15,000 to 1:100,000, the relative size of the features decreases and the following may occur:

Some features may disappear, e.g. features such as ponds, hamlets, and lakes become indistinguishable and are eliminated;
Features change from areas to lines or to points, e.g. a village or town represented by a polygon at 1:15,000 may change to point symbology at a 1:100,000 scale;
Features change in shape, e.g. boundaries become less detailed and more generalized; and
Some features may appear, e.g. features such as climate zones may be indistinguishable at a large scale (1:15,000) but the full extent of the zone becomes evident at a smaller scale (1:1,000,000).
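
The effect of scale on symbology can be illustrated with a little arithmetic: the size of a feature on the map sheet is its ground size divided by the scale denominator. In the sketch below the function names, the 150 metre village width, and both size thresholds are hypothetical values chosen for illustration, not cartographic standards.

```python
def map_size_mm(ground_size_m: float, scale_denominator: int) -> float:
    """Size on the map sheet, in millimetres, of a ground distance in metres."""
    return ground_size_m * 1000.0 / scale_denominator


def symbol_for(ground_width_m: float, scale_denominator: int,
               min_area_mm: float = 2.0, min_point_mm: float = 0.2) -> str:
    """Pick a symbol type from the feature's drawn size (assumed thresholds)."""
    size = map_size_mm(ground_width_m, scale_denominator)
    if size >= min_area_mm:
        return "area"
    if size >= min_point_mm:
        return "point"
    return "omitted"


# A hypothetical village about 150 m across, drawn at three scales.
for denominator in (15_000, 100_000, 1_000_000):
    print(f"1:{denominator:,}", symbol_for(150.0, denominator))
```

With these assumed thresholds the same village is drawn as an area at 1:15,000, as a point at 1:100,000, and dropped at 1:1,000,000.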

Accordingly, the use of data from vastly different scales will result in many inconsistencies in the number of features and their type.

The use and comparison of geographic data from vastly different source scales is totally inappropriate and can lead to significant error in geographic data processing.

 

Data Accuracy and Quality

The quality of data sources for GIS processing is becoming an ever-increasing concern among GIS application specialists. With the influx of GIS software on the commercial market and the accelerating application of GIS technology to problem-solving and decision-making roles, the quality and reliability of GIS products are coming under closer scrutiny. Much concern has been raised as to the relative error that may be inherent in GIS processing methodologies. While research is ongoing, and no definitive standards have yet been adopted in the commercial GIS marketplace, several practical recommendations have been identified which help to locate possible error sources and define the quality of data. The following review of data quality focuses on three distinct components: data accuracy, quality, and error.

Accuracy

The fundamental issue with respect to data is accuracy. Accuracy is the closeness of results of observations to the true values or values accepted as being true. This implies that observations of most spatial phenomena are usually only considered to be estimates of the true value. The difference between observed and true (or accepted as being true) values indicates the accuracy of the observations.

Basically two types of accuracy exist. These are positional and attribute accuracy. Positional accuracy is the expected deviation in the geographic location of an object from its true ground position. This is what we commonly think of when the term accuracy is discussed. There are two components to positional accuracy. These are relative and absolute accuracy. Absolute accuracy concerns the accuracy of data elements with respect to a coordinate scheme, e.g. UTM. Relative accuracy concerns the positioning of map features relative to one another.
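
The distinction between absolute and relative accuracy can be made concrete with a small calculation. The sketch below summarizes absolute error as a root mean square error (RMSE) against reference coordinates, and relative error as the change in distance between two features. RMSE is one common summary, chosen here as an assumption; the function names and the coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Coordinate = Tuple[float, float]


def absolute_rmse(observed: Sequence[Coordinate],
                  reference: Sequence[Coordinate]) -> float:
    """Root mean square distance between mapped and reference positions."""
    if len(observed) != len(reference) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(ox - rx) ** 2 + (oy - ry) ** 2
               for (ox, oy), (rx, ry) in zip(observed, reference)]
    return math.sqrt(sum(squared) / len(squared))


def relative_error(observed_a: Coordinate, observed_b: Coordinate,
                   reference_a: Coordinate, reference_b: Coordinate) -> float:
    """Mapped separation of two features minus their true separation."""
    return (math.dist(observed_a, observed_b)
            - math.dist(reference_a, reference_b))


# Two features whose mapped positions are both shifted by (3 m, 4 m).
reference = [(0.0, 0.0), (100.0, 0.0)]
observed = [(3.0, 4.0), (103.0, 4.0)]

rmse = absolute_rmse(observed, reference)
rel = relative_error(observed[0], observed[1], reference[0], reference[1])
print(rmse, rel)
```

In this made-up case every feature is displaced by 5 metres in absolute terms, yet the features remain correctly positioned relative to one another.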

Often relative accuracy is of greater concern than absolute accuracy. For example, most GIS users can live with the fact that their survey township coordinates do not coincide exactly with the survey fabric; however, the absence of one or two parcels from a tax map can have immediate and costly consequences.

Attribute accuracy is as important as positional accuracy. It also reflects estimates of the truth. Interpreting and depicting boundaries and characteristics for forest stands or soil polygons can be exceedingly difficult and subjective. Most resource specialists will attest to this fact. Accordingly, the degree of homogeneity found within such mapped boundaries is not nearly as high in reality as it would appear to be on most maps.

Quality

Quality can simply be defined as the fitness for use of a specific data set. Data that is appropriate for use with one application may not be fit for use with another. It is fully dependent on the scale, accuracy, and extent of the data set, as well as the quality of other data sets to be used. The recent U.S. Spatial Data Transfer Standard (SDTS) identifies five components to data quality definitions. These are:

Lineage
Positional Accuracy
Attribute Accuracy
Logical Consistency
Completeness

Lineage

The lineage of data is concerned with historical and compilation aspects of the data such as the:

source of the data;
content of the data;
data capture specifications;
geographic coverage of the data;
compilation method of the data, e.g. digitizing versus scanned;
transformation methods applied to the data; and
use of any pertinent algorithms during compilation, e.g. linear simplification, feature generalization.

Positional Accuracy

The identification of positional accuracy is important. This includes consideration of inherent error (source error) and operational error (introduced error). A more detailed review is provided in the next section.

Attribute Accuracy

Consideration of the accuracy of attributes also helps to define the quality of the data. This quality component concerns the identification of the reliability, or level of purity (homogeneity), in a data set.

Logical Consistency

This component is concerned with determining the faithfulness of the data structure for a data set. This typically involves spatial data inconsistencies such as incorrect line intersections, duplicate lines or boundaries, or gaps in lines. These are referred to as spatial or topological errors.
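
A few of these consistency checks can be sketched in Python. The example below is deliberately simplified: it compares coordinates for exact equality, whereas a working system would normally allow a small tolerance, and the function names are hypothetical. The endpoint check only flags candidates, since a genuine dead end, or a line that ends part-way along another line, is reported as well.

```python
from collections import Counter


def ring_is_closed(ring):
    """A polygon ring should end on the vertex where it started."""
    return len(ring) >= 4 and ring[0] == ring[-1]


def duplicate_segments(lines):
    """Return segments that occur more than once, ignoring direction."""
    counts = Counter()
    for line in lines:
        for start, end in zip(line, line[1:]):
            # Sort the two ends so A-B and B-A count as the same segment.
            counts[tuple(sorted((start, end)))] += 1
    return [segment for segment, n in counts.items() if n > 1]


def dangling_endpoints(lines):
    """Return line ends that meet no other line end (candidate gaps)."""
    ends = Counter()
    for line in lines:
        ends[line[0]] += 1
        ends[line[-1]] += 1
    return [point for point, n in ends.items() if n == 1]


line_a = [(0.0, 0.0), (10.0, 0.0)]
line_b = [(10.0, 0.0), (0.0, 0.0)]   # same boundary digitized twice
line_c = [(10.0, 0.0), (10.0, 10.0)]

print(duplicate_segments([line_a, line_b, line_c]))
print(dangling_endpoints([line_a, line_c]))
```

Checks of this kind find structural problems only; they say nothing about whether a correctly closed polygon is in the right place.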

Completeness

The final quality component involves a statement about the completeness of the data set. This includes consideration of holes in the data, unclassified areas, and any compilation procedures that may have caused data to be eliminated.

The ease with which geographic data in a GIS can be used at any scale highlights the importance of detailed data quality information. Although a data set may not have a specific scale once it is loaded into the GIS database, it was produced with levels of accuracy and resolution that make it appropriate for use only at certain scales, and in combination with data of similar scales.

Error

Two sources of error, inherent and operational, contribute to the reduction in quality of the products that are generated by geographic information systems. Inherent error is the error present in source documents and data. Operational error is the amount of error produced through the data capture and manipulation functions of a GIS. Possible sources of operational errors include:

mislabelling of areas on thematic maps;
misplacement of horizontal (positional) boundaries;
human error in digitizing;
classification error;
GIS algorithm inaccuracies; and
human bias.

While error will always exist in any scientific process, the aim within GIS processing should be to identify existing error in data sources and minimize the amount of error added during processing. Because of cost constraints it is often more appropriate to manage error than attempt to eliminate it. There is a trade-off between reducing the level of error in a database and the cost to create and maintain the database.

An awareness of the error status of different data sets will allow the user to make a subjective statement on the quality and reliability of a product derived from GIS processing.

The validity of any decisions based on a GIS product is directly related to the quality and reliability rating of the product.

Depending upon the level of error inherent in the source data, and the error operationally produced through data capture and manipulation, GIS products may possess significant amounts of error.

One of the major problems currently existing within GIS is the aura of accuracy surrounding digital geographic data. Often hardcopy map sources include a map reliability rating or confidence rating in the map legend. This rating helps the user in determining the fitness for use for the map. However, rarely is this information encoded in the digital conversion process.

Often, because GIS data is in digital form and can be represented with high precision, it is considered to be totally accurate. In reality, a buffer exists around each feature which represents the range within which the feature's actual position lies. For example, data captured at the 1:20,000 scale commonly has a positional accuracy of ±20 metres. This means the actual location of features may vary 20 metres in either direction from the identified position of the feature on the map. Considering that the use of GIS commonly involves the integration of several data sets, usually at different scales and quality, one can easily see how errors can be propagated during processing.
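
The way positional uncertainty accumulates when data sets are combined can be sketched numerically. The example below assumes a drawing error of 1 millimetre at map scale, which reproduces the 20 metre figure quoted for 1:20,000 data but is an assumption rather than a standard. It then combines the tolerances of two layers in two ways: a worst-case sum, and a root-sum-of-squares figure that is only meaningful if the errors in the layers are independent. The function names and the choice of a 1:50,000 second layer are hypothetical.

```python
import math


def ground_tolerance_m(scale_denominator: int,
                       map_error_mm: float = 1.0) -> float:
    """Ground distance represented by an assumed error on the map sheet."""
    return scale_denominator * map_error_mm / 1000.0


def overlay_uncertainty_m(tolerances):
    """Return (worst-case sum, root-sum-of-squares) of layer tolerances."""
    worst_case = sum(tolerances)
    # Root-sum-of-squares assumes the layers' errors are independent.
    independent = math.sqrt(sum(t * t for t in tolerances))
    return worst_case, independent


layers = [ground_tolerance_m(20_000), ground_tolerance_m(50_000)]
worst_case, independent = overlay_uncertainty_m(layers)
print(layers, worst_case, round(independent, 1))
```

Either way, the combined uncertainty is larger than that of the better layer on its own, which is the point made in the paragraph above.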

Example of areas of uncertainty for overlaying data.

Several comments and guidelines on the recognition and assessment of error in GIS processing have been promoted in papers on the subject. These are summarized below:

There is a need for developing error statements for data contained within geographic information systems (Vitek et al., 1984);
The integration of data from different sources and in different original formats (e.g. points, lines, and areas), at different original scales, and possessing inherent errors can yield a product of questionable accuracy (Vitek et al., 1984);
The accuracy of a GIS-derived product is dependent on characteristics inherent in the source products, and on user requirements, such as scale of the desired output products and the method and resolution of data encoding (Marble and Peuquet, 1983);
Any GIS output product can only be as accurate as the least accurate data theme involved in the analysis (Newcomer and Szajgin, 1984);
Accuracy of the data decreases as spatial resolution becomes more coarse (Walsh et al., 1987); and
As the number of layers in an analysis increases, the number of possible opportunities for error increases (Newcomer and Szajgin, 1984).
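
The last few guidelines can be illustrated with simple probability arithmetic. The sketch below treats the accuracy of each layer as the probability that a location is correct on that layer, which is an interpretation made here for illustration. The upper bound follows the "least accurate theme" guideline above. The lower bound is a general probability inequality and the product rule assumes errors in the layers are independent; neither is taken from the cited papers, and the function names are hypothetical.

```python
import math


def overlay_accuracy_bounds(layer_accuracies):
    """Return (lower, upper) bounds on the accuracy of an overlay product."""
    if not layer_accuracies or any(not 0.0 <= p <= 1.0
                                   for p in layer_accuracies):
        raise ValueError("accuracies must be probabilities between 0 and 1")
    # The product cannot be better than its least accurate layer.
    upper = min(layer_accuracies)
    # At worst, the errors of the layers never coincide, so they add up.
    lower = max(0.0, 1.0 - sum(1.0 - p for p in layer_accuracies))
    return lower, upper


def overlay_accuracy_if_independent(layer_accuracies):
    """Accuracy of the overlay if layer errors are independent (assumption)."""
    return math.prod(layer_accuracies)


two_layers = [0.9, 0.8]
three_layers = [0.9, 0.8, 0.9]
print(overlay_accuracy_bounds(two_layers))
print(overlay_accuracy_bounds(three_layers))
print(overlay_accuracy_if_independent(two_layers))
```

Adding the third layer leaves the upper bound unchanged but lowers the lower bound, which is one way to see why more layers mean more opportunities for error.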