Fundamental Concepts

This chapter reviews the structural components and design of GIS data models. The focus is on reviewing spatial and attribute data models, and how data is encoded by the GIS software. This chapter describes GIS technical components and will be of most interest to technical staff and GIS operators.

What is a GIS?

A geographic information system (GIS) is a computer-based tool for mapping and analyzing geographic phenomena that exist, and events that occur, on Earth. GIS technology integrates common database operations such as query and statistical analysis with the unique visualization and geographic analysis benefits offered by maps. These abilities distinguish GIS from other information systems and make it valuable to a wide range of public and private enterprises for explaining events, predicting outcomes, and planning strategies. Map making and geographic analysis are not new, but a GIS performs these tasks faster and with more sophistication than do traditional manual methods.

Today, GIS is a multi-billion-dollar industry employing hundreds of thousands of people worldwide. GIS is taught in schools, colleges, and universities throughout the world. Professionals and domain specialists in every discipline are becoming increasingly aware of the advantages of using GIS technology for addressing their unique spatial problems.

We commonly think of a GIS as a single, well-defined, integrated computer system. However, this is not always the case. A GIS can be made up of a variety of software and hardware tools. The important factor is the level of integration of these tools to provide a smoothly operating, fully functional geographic data processing environment.

Overall, GIS should be viewed as a technology, not simply as a computer system.

In general, a GIS provides facilities for data capture, data management, data manipulation and analysis, and the presentation of results in both graphic and report form, with a particular emphasis upon preserving and utilizing inherent characteristics of spatial data.

The ability to incorporate spatial data, manage it, analyze it, and answer spatial questions is the distinctive characteristic of geographic information systems.

A geographic information system, commonly referred to as a GIS, is an integrated set of hardware and software tools used for the manipulation and management of digital spatial (geographic) and related attribute data.

GIS Subsystems

A GIS has four main functional subsystems. These are:

a data input subsystem;
a data storage and retrieval subsystem;
a data manipulation and analysis subsystem; and
a data output and display subsystem.

Data Input

A data input subsystem allows the user to capture, collect, and transform spatial and thematic data into digital form. The data inputs are usually derived from a combination of hard copy maps, aerial photographs, remotely sensed images, reports, survey documents, etc.

Data Storage and Retrieval

The data storage and retrieval subsystem organizes the data, spatial and attribute, in a form which permits it to be quickly retrieved by the user for analysis, and permits rapid and accurate updates to be made to the database. This component usually involves use of a database management system (DBMS) for maintaining attribute data. Spatial data is usually encoded and maintained in a proprietary file format.

Data Manipulation and Analysis

The data manipulation and analysis subsystem allows the user to define and execute spatial and attribute procedures to generate derived information. This subsystem is commonly thought of as the heart of a GIS, and usually distinguishes it from other database information systems and computer-aided drafting (CAD) systems.

Data Output

The data output subsystem allows the user to generate graphic displays, normally maps, and tabular reports representing derived information products.

The critical function for a GIS is, by design, the analysis of spatial data.

It is important to understand that the GIS is not a new invention. In fact, geographic information processing has a rich history in a variety of disciplines. In particular, natural resource specialists and environmental scientists have been actively processing geographic data and promoting their techniques since the 1960s.

Today's generic geographic information system is distinguished from the geo-processing of the past by the use of computer automation to integrate geographic data processing tools in a friendly and comprehensive environment.

The advent of sophisticated computer techniques has spread geo-processing methodologies across many disciplines, and provided data integration capabilities that were logistically impossible before.

Components of a GIS

An operational GIS also has a series of components that combine to make the system work. These components are critical to a successful GIS.

A working GIS integrates five key components:
hardware,
software,
data,
people,
methods.

Hardware

Hardware is the computer system on which a GIS operates. Today, GIS software runs on a wide range of hardware types, from centralized computer servers to desktop computers used in stand-alone or networked configurations.

Software

GIS software provides the functions and tools needed to store, analyze, and display geographic information. A review of the key GIS software subsystems is provided above.

Data

Perhaps the most important component of a GIS is the data. Geographic data and related tabular data can be collected in-house, compiled to custom specifications and requirements, or occasionally purchased from a commercial data provider. A GIS can integrate spatial data with other existing data resources, often stored in a corporate DBMS. The integration of spatial data (often proprietary to the GIS software), and tabular data stored in a DBMS is a key functionality afforded by GIS.

People

GIS technology is of limited value without the people who manage the system and develop plans for applying it to real world problems. GIS users range from technical specialists who design and maintain the system to those who use it to help them perform their everyday work. The identification of GIS specialists versus end users is often critical to the proper implementation of GIS technology.

Methods

A successful GIS operates according to a well-designed implementation plan and business rules, which are the models and operating practices unique to each organization.

As in all organizations dealing with sophisticated technology, new tools can only be used effectively if they are properly integrated into the entire business strategy and operation. To do this properly requires not only the necessary investments in hardware and software, but also in the retraining and/or hiring of personnel to utilize the new technology in the proper organizational context. Implementing a GIS without a proper organizational commitment will result in an unsuccessful system! Many of the issues concerned with organizational commitment are described in Implementation Issues and Strategies.

It is simply not sufficient for an organization to purchase a computer with some GIS software, hire some enthusiastic individuals and expect instant success.

GIS Data Models

A GIS stores information about the world as a collection of thematic layers that can be linked together by geography. This simple but extremely powerful and versatile concept has proven invaluable for solving many real-world problems from tracking delivery vehicles, to recording details of planning applications, to modeling global atmospheric circulation. The thematic layer approach allows us to organize the complexity of the real world into a simple representation to help facilitate our understanding of natural relationships.

GIS Data Types

The basic data types in a GIS reflect traditional data found on a map. Accordingly, GIS technology utilizes two basic types of data. These are:

Spatial data

describes the absolute and relative location of geographic features.

Attribute data

describes characteristics of the spatial features. These characteristics can be quantitative and/or qualitative in nature. Attribute data is often referred to as tabular data.

The coordinate location of a forestry stand would be spatial data, while the characteristics of that forestry stand, e.g. cover group, dominant species, crown closure, height, etc., would be attribute data. Other data types, in particular image and multimedia data, are becoming more prevalent with changing technology. Depending on the specific content of the data, such data may be considered either spatial, e.g. photographs, animation, movies, etc., or attribute, e.g. sound, descriptions, narrations, etc.

Spatial Data Models

Traditionally, spatial data has been stored and presented in the form of a map. Three basic types of spatial data models have evolved for storing geographic data digitally. These are referred to as:

Vector;
Raster; and
Image.

The following diagram reflects the two primary spatial data encoding techniques. These are vector and raster. Image data utilizes techniques very similar to raster data, but typically lacks the internal formats required for analysis and modeling of the data. Images reflect pictures or photographs of the landscape.

Vector Data Formats

All spatial data models are approaches for storing the spatial location of geographic features in a database. Vector storage implies the use of vectors (directional lines) to represent a geographic feature. Vector data is characterized by the use of sequential points or vertices to define a linear segment. Each vertex consists of an X coordinate and a Y coordinate.

Vector lines are often referred to as arcs and consist of a string of vertices terminated by a node. A node is defined as a vertex that starts or ends an arc segment. Point features are defined by one coordinate pair, a vertex. Polygonal features are defined by a set of closed coordinate pairs. In vector representation, the storage of the vertices for each feature is important, as well as the connectivity between features, e.g. the sharing of common vertices where features connect.
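The vertex-based encoding described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: coordinate pairs are plain tuples, and the names point, arc, polygon, and arc_length are invented for this example rather than taken from any GIS package.

```python
import math

# A point feature: a single (x, y) coordinate pair.
point = (2.0, 3.0)

# An arc: an ordered string of vertices; the first and last are its nodes.
arc = [(0.0, 0.0), (3.0, 4.0), (6.0, 4.0)]

# A polygon: a closed ring, so the last vertex repeats the first.
polygon = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (0.0, 0.0)]


def arc_length(vertices):
    """Sum the straight-line distance between consecutive vertices."""
    return sum(math.dist(a, b) for a, b in zip(vertices, vertices[1:]))


start_node, end_node = arc[0], arc[-1]
print(arc_length(arc))            # length of the arc
print(polygon[0] == polygon[-1])  # True when the ring is closed
```

Note that nothing in this encoding records which features share vertices; that connectivity is what the vector data models discussed next handle differently.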

Several different vector data models exist; however, only two are commonly used in GIS data storage.

The most popular method of retaining spatial relationships among features is to explicitly record adjacency information in what is known as the topologic data model. Topology is a mathematical concept that has its basis in the principles of feature adjacency and connectivity.

The topologic data structure is often referred to as an intelligent data structure because spatial relationships between geographic features are easily derived when using it. Primarily for this reason the topologic model is the dominant vector data structure currently used in GIS technology. Many of the complex data analysis functions cannot effectively be undertaken without a topologic vector data structure. Topology is reviewed in greater detail later on in the book.

The secondary vector data structure that is common among GIS software is the computer-aided drafting (CAD) data structure. This structure consists of listing elements, not features, defined by strings of vertices, to define geographic features, e.g. points, lines, or areas. There is considerable redundancy with this data model since the boundary segment between two polygons can be stored twice, once for each feature. The CAD structure emerged from the development of computer graphics systems without specific considerations of processing geographic features. Accordingly, since features, e.g. polygons, are self-contained and independent, questions about the adjacency of features can be difficult to answer. The CAD vector model lacks the definition of spatial relationships between features that is defined by the topologic data model.

(Adapted from Berry)

Raster Data Formats

Raster data models incorporate the use of a grid-cell data structure where the geographic area is divided into cells identified by row and column. This data structure is commonly called raster. While the term raster implies a regularly spaced grid, other tessellated data structures do exist in grid based GIS systems. In particular, the quadtree data structure has found some acceptance as an alternative raster data model.

The size of cells in a tessellated data structure is selected on the basis of the data accuracy and the resolution needed by the user. There is no explicit coding of geographic coordinates required since that is implicit in the layout of the cells. A raster data structure is in fact a matrix where any coordinate can be quickly calculated if the origin point is known, and the size of the grid cells is known. Since grid-cells can be handled as two-dimensional arrays in computer encoding, many analytical operations are easy to program. This makes tessellated data structures a popular choice for many GIS software packages. Topology is not a relevant concept with tessellated structures since adjacency and connectivity are implicit in the location of a particular cell in the data matrix.
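As a hedged sketch of how a coordinate can be calculated from the origin and cell size, the two functions below convert between matrix positions and map coordinates. The function names are hypothetical, and the code assumes the origin is the bottom-left corner of the grid and that row 0 is the top row of the matrix; other software may use different conventions.

```python
def cell_center(row, col, origin_x, origin_y, cell_size, n_rows):
    """Return the map coordinate of a cell's center.

    Assumes the origin is the bottom-left corner of the grid and that
    row 0 is the top row of the matrix.
    """
    x = origin_x + (col + 0.5) * cell_size
    y = origin_y + (n_rows - row - 0.5) * cell_size
    return x, y


def cell_of(x, y, origin_x, origin_y, cell_size, n_rows):
    """Return the (row, col) of the cell containing map coordinate (x, y)."""
    col = int((x - origin_x) // cell_size)
    row = n_rows - 1 - int((y - origin_y) // cell_size)
    return row, col


# A grid of 4 rows with 10-unit cells and its origin at (1000, 2000).
print(cell_center(0, 0, 1000.0, 2000.0, 10.0, 4))
print(cell_of(1025.0, 2005.0, 1000.0, 2000.0, 10.0, 4))
```

Only the origin, the cell size, and the grid dimensions need to be stored; every other coordinate follows from arithmetic on the row and column indices.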

Several tessellated data structures exist; however, only two are commonly used in GISs. The most popular cell structure is the regularly spaced matrix or raster structure. This data structure involves a division of spatial data into regularly spaced cells. Each cell is of the same shape and size. Squares are most commonly utilized.

Since geographic data is rarely distinguished by regularly spaced shapes, cells must be classified as to the most common attribute for the cell. The problem of determining the proper resolution for a particular data layer can be a concern. If one selects too coarse a cell size then data may be overly generalized. If one selects too fine a cell size then too many cells may be created, resulting in large data volumes, slower processing times, and a more cumbersome data set. As well, one can imply an accuracy greater than that of the original data capture process and this may result in some erroneous results during analysis.

As well, since most data is captured in a vector format, e.g. digitizing, data must be converted to the raster data structure. This is called vector-raster conversion. Most GIS software allows the user to define the raster grid (cell) size for vector-raster conversion. It is imperative that the original scale, e.g. accuracy, of the data be known prior to conversion. The accuracy of the data, often referred to as the resolution, should determine the cell size of the output raster map during conversion.
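Vector-raster conversion and the effect of cell size can be illustrated with a small sketch. The approach shown here, assigning a value to every cell whose center falls inside the polygon using a ray-casting test, is one simple technique chosen for illustration; it is not necessarily the method any particular GIS package uses, and the function names are hypothetical.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test; ring is a closed list of (x, y) vertices."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Does a horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(ring, value, origin_x, origin_y, cell_size, n_rows, n_cols, nodata=0):
    """Assign `value` to every cell whose center falls inside the polygon."""
    grid = []
    for row in range(n_rows):
        # Row 0 is the top row; the origin is the bottom-left corner.
        y = origin_y + (n_rows - row - 0.5) * cell_size
        grid.append([
            value if point_in_polygon(origin_x + (col + 0.5) * cell_size, y, ring)
            else nodata
            for col in range(n_cols)
        ])
    return grid


# A right triangle with a true area of 6 square units.
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (0.0, 0.0)]
coarse = rasterize(triangle, 1, 0.0, 0.0, 2.0, 2, 2)  # 2-unit cells
fine = rasterize(triangle, 1, 0.0, 0.0, 1.0, 4, 4)    # 1-unit cells

for name, grid, size in (("coarse", coarse, 2.0), ("fine", fine, 1.0)):
    cells = sum(map(sum, grid))
    print(name, "cells:", cells, "area:", cells * size * size)
```

With the coarse grid only one cell is captured, so the triangle is badly generalized; the finer grid follows the shape more closely at the cost of four times as many cells.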

Most raster based GIS software requires that the raster cell contain only a single discrete value. Accordingly, a data layer, e.g. forest inventory stands, may be broken down into a series of raster maps, each representing an attribute type, e.g. a species map, a height map, a density map, etc. These are often referred to as one attribute maps. This is in contrast to most conventional vector data models that maintain data as multiple attribute maps, e.g. forest inventory polygons linked to a database table containing all attributes as columns. This basic distinction of raster data storage provides the foundation for quantitative analysis techniques. This is often referred to as raster or map algebra. The use of raster data structures allows for sophisticated mathematical modelling processes, while vector based systems are often constrained by the capabilities and language of a relational DBMS.
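A minimal sketch of map algebra on one attribute maps follows. The two layers, their values, the thresholds, and the helper name local_op are all hypothetical; nested Python lists stand in for the raster matrices.

```python
# Two single-attribute raster layers covering the same 3 x 3 grid.
# All values are invented for illustration.
height = [
    [12, 15, 20],
    [10, 18, 22],
    [8, 9, 25],
]
density = [
    [0.4, 0.6, 0.9],
    [0.3, 0.7, 0.8],
    [0.2, 0.2, 0.9],
]


def local_op(func, *layers):
    """Apply func cell by cell to aligned layers and return a new layer."""
    return [
        [func(*cells) for cells in zip(*rows)]
        for rows in zip(*layers)
    ]


# Flag cells where height is at least 15 and density is at least 0.6.
mature = local_op(lambda h, d: int(h >= 15 and d >= 0.6), height, density)

for row in mature:
    print(row)
```

Because every layer shares the same cell layout, combining attributes reduces to arithmetic or logic on matching cells, with no geometric overlay step.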

(Adapted from Berry)

This difference is the major distinguishing factor between vector and raster based GIS software. It is also important to understand that the selection of a particular data structure can provide advantages during the analysis stage. For example, the vector data model does not handle continuous data, e.g. elevation, very well while the raster data model is more ideally suited for this type of analysis. Conversely, the raster structure does not handle linear data analysis, e.g. shortest path, very well while vector systems do. It is important for the user to understand that there are certain advantages and disadvantages to each data model.

The selection of a particular data model, vector or raster, is dependent on the source and type of data, as well as the intended use of the data. Certain analytical procedures require raster data while others are better suited to vector data.

Image Data

Image data is most often used to represent graphic or pictorial data. The term image inherently reflects a graphic representation, and in the GIS world, differs significantly from raster data. Most often, image data is used to store remotely sensed imagery, e.g. satellite scenes or orthophotos, or ancillary graphics such as photographs, scanned plan documents, etc. Image data is typically used in GIS systems as background display data (if the image has been rectified and georeferenced); or as a graphic attribute. Remote sensing software makes use of image data for image classification and processing. Typically, this data must be converted into a raster format (and perhaps vector) to be used analytically with the GIS.

Image data is typically stored in a variety of de facto industry standard proprietary formats. These often reflect the most popular image processing systems. Other graphic image formats, such as TIFF, GIF, PCX, etc., are used to store ancillary image data. Most GIS software will read such formats and allow you to display this data.

Image data is most often used for remotely sensed imagery such as satellite imagery or digital orthophotos.

Vector and Raster - Advantages and Disadvantages

There are several advantages and disadvantages for using either the vector or raster data model to store spatial data. These are summarized below.

Vector Data

Advantages:
Data can be represented at its original resolution and form without generalization.
Graphic output is usually more aesthetically pleasing (traditional cartographic representation).
Since most data, e.g. hard copy maps, is in vector form, no data conversion is required.
Accurate geographic location of data is maintained.
Allows for efficient encoding of topology, and as a result more efficient operations that require topological information, e.g. proximity, network analysis.

Disadvantages:
The location of each vertex needs to be stored explicitly.
For effective analysis, vector data must be converted into a topological structure. This is often processing intensive and usually requires extensive data cleaning. As well, topology is static, and any updating or editing of the vector data requires re-building of the topology.
Algorithms for manipulative and analysis functions are complex and may be processing intensive. Often, this inherently limits the functionality for large data sets, e.g. a large number of features.
Continuous data, such as elevation data, is not effectively represented in vector form. Usually substantial data generalization or interpolation is required for these data layers.
Spatial analysis and filtering within polygons is impossible.

Raster Data

Advantages:
The geographic location of each cell is implied by its position in the cell matrix. Accordingly, other than an origin point, e.g. bottom left corner, no geographic coordinates are stored.
Due to the nature of the data storage technique, data analysis is usually easy to program and quick to perform.
The inherent nature of raster maps, e.g. one attribute maps, is ideally suited for mathematical modeling and quantitative analysis.
Discrete data, e.g. forestry stands, is accommodated equally well as continuous data, e.g. elevation data, and facilitates the integrating of the two data types.
Grid-cell systems are very compatible with raster-based output devices, e.g. electrostatic plotters, graphic terminals.

Disadvantages:
The cell size determines the resolution at which the data is represented.
It is especially difficult to adequately represent linear features depending on the cell resolution. Accordingly, network linkages are difficult to establish.
Processing of associated attribute data may be cumbersome if large amounts of data exist. Raster maps inherently reflect only one attribute or characteristic for an area.
Since most input data is in vector form, data must undergo vector-to-raster conversion. Besides increased processing requirements, this may introduce data integrity concerns due to generalization and choice of inappropriate cell size.
Most output maps from grid-cell systems do not conform to high-quality cartographic needs.

It is often difficult to compare or rate GIS software that use different data models. Some personal computer (PC) packages utilize vector structures for data input, editing, and display but convert to raster structures for any analysis. Other more comprehensive GIS offerings provide both integrated raster and vector analysis techniques. They allow users to select the data structure appropriate for the analysis requirements. Integrated raster and vector processing capabilities are most desirable and provide the greatest flexibility for data manipulation and analysis.

Attribute Data Models

A separate data model is used to store and maintain attribute data for GIS software. These data models may exist internally within the GIS software, or may be reflected in an external commercial database management system (DBMS). A variety of different data models exist for the storage and management of attribute data. The most common are:

Tabular
Hierarchical
Network
Relational
Object Oriented

The tabular model is the manner in which most early GIS software packages stored their attribute data. The next three models are those most commonly implemented in database management systems (DBMS). The object-oriented model is newer but rapidly gaining in popularity for some applications. A brief review of each model is provided.

Tabular Model

The simple tabular model stores attribute data as sequential data files with fixed formats (or comma delimited for ASCII data), for the location of attribute values in a predefined record structure. This type of data model is outdated in the GIS arena. It lacks any method of checking data integrity, as well as being inefficient with respect to data storage, e.g. limited indexing capability for attributes or records, etc.

Hierarchical Model

The hierarchical database organizes data in a tree structure. Data is structured downward in a hierarchy of tables. Any level in the hierarchy can have unlimited children, but any child can have only one parent. Hierarchical DBMS have not gained any noticeable acceptance for use within GIS. They are oriented for data sets that are very stable, where primary relationships among the data change infrequently or never at all. Also, the limitation on the number of parents that an element may have is not always conducive to actual geographic phenomena.

Network Model

The network database organizes data in a network or plex structure. Any column in a plex structure can be linked to any other. Like a tree structure, a plex structure can be described in terms of parents and children. This model allows for children to have more than one parent.

Network DBMS have not found much more acceptance in GIS than the hierarchical DBMS. They have the same flexibility limitations as hierarchical databases; however, the more powerful structure for representing data relationships allows a more realistic modelling of geographic phenomena. Network databases also tend to become overly complex too easily, and it is easy to lose control and understanding of the relationships between elements.

Relational Model

The relational database organizes data in tables. Each table is identified by a unique table name, and is organized by rows and columns. Each column within a table also has a unique name. Columns store the values for a specific attribute, e.g. cover group, tree height. Rows represent one record in the table. In a GIS each row is usually linked to a separate spatial feature, e.g. a forestry stand. Accordingly, each row would be comprised of several columns, each column containing a specific value for that geographic feature. The following figure presents a sample table for forest inventory features. This table has 4 rows and 5 columns. The forest stand number would be the label for the spatial feature as well as the primary key for the database table. This serves as the linkage between the spatial definition of the feature and the attribute data for the feature.

UNIQUE STAND NUMBER   DOMINANT COVER GROUP   AVG. TREE HEIGHT   STAND SITE INDEX   STAND AGE
001                   DEC                    3                  G                  100
002                   DEC-CON                4                  M                  80
003                   DEC-CON                4                  M                  60
004                   CON                    4                  G                  120

Data is often stored in several tables. Tables can be joined or referenced to each other by common columns (relational fields). Usually the common column is an identification number for a selected geographic feature, e.g. a forestry stand polygon number. This identification number acts as the primary key for the table. The ability to join tables through use of a common column is the essence of the relational model. Such relational joins are usually ad hoc in nature and form the basis for querying in a relational GIS product. Unlike the other previously discussed database types, relationships are implicit in the character of the data as opposed to explicit characteristics of the database set up.
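A relational join of this kind can be sketched with Python's built-in sqlite3 module. The stands table holds the sample forest inventory values given above; the column names and the second table, harvest_plan, with its values are invented purely to demonstrate a join on the common stand number column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stands (
        stand_no    TEXT PRIMARY KEY,
        cover_group TEXT,
        avg_height  INTEGER,
        site_index  TEXT,
        stand_age   INTEGER
    );
    INSERT INTO stands VALUES
        ('001', 'DEC',     3, 'G', 100),
        ('002', 'DEC-CON', 4, 'M',  80),
        ('003', 'DEC-CON', 4, 'M',  60),
        ('004', 'CON',     4, 'G', 120);

    -- Hypothetical second table, invented here to demonstrate a join.
    CREATE TABLE harvest_plan (
        stand_no     TEXT REFERENCES stands(stand_no),
        harvest_year INTEGER
    );
    INSERT INTO harvest_plan VALUES ('001', 2030), ('004', 2028);
""")

# Ad hoc join on the common column; no link was predefined between the tables.
rows = conn.execute("""
    SELECT s.stand_no, s.cover_group, s.stand_age, h.harvest_year
    FROM stands AS s
    JOIN harvest_plan AS h ON h.stand_no = s.stand_no
    ORDER BY s.stand_no
""").fetchall()
print(rows)
```

The query names only the tables and the shared column; how the rows are physically stored or indexed is left to the DBMS.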

The relational database model is the most widely accepted for managing the attributes of geographic data.

There are many different designs of DBMSs, but in GIS the relational design has been the most useful. In the relational design, data are stored conceptually as a collection of tables. Common fields in different tables are used to link them together. This surprisingly simple design has been so widely used primarily because of its flexibility and very wide deployment in applications both within and without GIS.

In fact, most GIS software provides an internal relational data model, as well as support for commercial off-the-shelf (COTS) relational DBMSs. COTS DBMSs are referred to as external DBMSs. This approach supports both users with small data sets, where an internal data model is sufficient, and customers with larger data sets who utilize a DBMS for other corporate data storage requirements. With an external DBMS the GIS software can simply connect to the database, and the user can make use of the inherent capabilities of the DBMS. External DBMSs tend to have much more extensive querying and data integrity capabilities than the internal relational model of the GIS. The emergence and use of the external DBMS is a trend that has resulted in the proliferation of GIS technology into more traditional data processing environments.

The relational DBMS is attractive because of its:

simplicity in organization and data modelling;
flexibility - data can be manipulated in an ad hoc manner by joining tables;
efficiency of storage - by the proper design of data tables redundant data can be minimized; and
non-procedural nature - queries on a relational database do not need to take into account the internal organization of the data.

The relational DBMS has emerged as the dominant commercial data management tool in GIS implementation and application.

The following diagram illustrates the basic linkage between vector spatial data (topologic model) and attributes maintained in a relational database file.

(from Berry)

Object-Oriented Model

The object-oriented database model manages data through objects. An object is a collection of data elements and operations that together are considered a single entity. The object-oriented database is a relatively new model. This approach has the attraction that querying is very natural, as features can be bundled together with attributes at the database administrator's discretion. To date, only a few GIS packages are promoting the use of this attribute data model. However, initial impressions indicate that this approach may hold many operational benefits with respect to geographic data processing. Fulfilment of this promise with a commercial GIS product remains to be seen.
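The idea of an object as data elements plus operations can be illustrated with a small Python class. This is a sketch of the concept only, not of any object-oriented DBMS; the class name, fields, and methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ForestStand:
    """One object bundling geometry, attributes, and operations."""
    stand_no: str
    cover_group: str
    stand_age: int
    boundary: list  # closed ring of (x, y) vertices

    def bounding_box(self):
        """Return (min_x, min_y, max_x, max_y) of the boundary."""
        xs = [x for x, _ in self.boundary]
        ys = [y for _, y in self.boundary]
        return min(xs), min(ys), max(xs), max(ys)

    def is_older_than(self, years):
        """Attribute query expressed as a method of the feature itself."""
        return self.stand_age > years


stand = ForestStand("001", "DEC", 100,
                    [(0, 0), (40, 0), (40, 30), (0, 30), (0, 0)])
print(stand.bounding_box())
print(stand.is_older_than(80))
```

Here the spatial definition and the attributes travel together, so a query is asked of the feature directly rather than assembled from separate spatial and attribute stores.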

Spatial Data Relationships

The nature of spatial data relationships is important to understand within the context of GIS. In particular, the relationship between geographic features is a complex problem that we are far from understanding in its entirety. This is of concern since the primary role of GIS is the manipulation and analysis of large quantities of spatial data. To date, the accepted theoretical solution is to topologically structure spatial data.

It is believed that a topologic data model best reflects the geography of the real world and provides an effective mathematical foundation for encoding spatial relationships, providing a data model for manipulating and analyzing vector based data.

Most GIS software segregates spatial and attribute data into separate data management systems. Most frequently, the topological or raster structure is used to store the spatial data, while the relational database structure is used to store the attribute data. Data from both structures are linked together for use through unique identification numbers, e.g. feature labels and DBMS primary keys. This coupling of spatial features with an attribute record is usually maintained by an internal number assigned by the GIS software. A label is required so the user can load the appropriate attribute record for a given geographic feature. Most often a single attribute record is automatically created by the GIS software once a clean topological structure is properly generated. This attribute record normally contains the internal number for the feature, the user's label identifier, the area of the feature, and the perimeter of the feature. Linear features have the length of the feature defined instead of the area.
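The coupling described above can be sketched as follows, assuming plain dictionaries stand in for the spatial store and the attribute table. The field names, the sample polygon, and the helper functions are hypothetical; area is computed with the shoelace formula, which is one standard way to obtain the area of a simple polygon from its vertices.

```python
import math


def ring_area(ring):
    """Shoelace formula; ring is a closed list of (x, y) vertices."""
    twice = sum(x1 * y2 - x2 * y1
                for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return abs(twice) / 2


def ring_perimeter(ring):
    """Sum of the edge lengths around the closed ring."""
    return sum(math.dist(a, b) for a, b in zip(ring, ring[1:]))


# Spatial side: geometry keyed by an internal number assigned by the software.
geometry = {1: [(0, 0), (40, 0), (40, 30), (0, 30), (0, 0)]}
# User-assigned labels for the same features.
labels = {1: "001"}

# Default attribute record generated for each polygon feature.
default_records = [
    {"internal_id": fid, "label": labels[fid],
     "area": ring_area(ring), "perimeter": ring_perimeter(ring)}
    for fid, ring in geometry.items()
]

# Attribute side: a table keyed by the user's label (the DBMS primary key).
attributes = {"001": {"cover_group": "DEC", "stand_age": 100}}

# The label links the default record to the user's attribute record.
record = default_records[0]
linked = {**record, **attributes[record["label"]]}
print(linked)
```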

Topology

The topologic model is often confusing to initial users of GIS. Topology is a mathematical approach that allows us to structure data based on the principles of feature adjacency and feature connectivity. It is in fact the mathematical method used to define spatial relationships. Without a topologic data structure in a vector based GIS most data manipulation and analysis functions would not be practical or feasible.

The most common topological data structure is the arc/node data model. This model contains two basic entities, the arc and the node. The arc is a series of points, joined by straight line segments, that start and end at a node. The node is an intersection point where two or more arcs meet. Nodes also occur at the end of a dangling arc, i.e. an arc that does not connect to another arc, such as a dead end street. Isolated nodes, not connected to arcs, represent point features. A polygon feature is comprised of a closed chain of arcs.

In GIS software the topological definition is commonly stored in a proprietary format. However, most software offerings record the topological definition in three tables. These tables are analogous to relational tables. The three tables represent the different types of features, e.g. point, line, area. A fourth table containing the coordinates is also utilized. The node table stores information about the node and the arcs that are connected to it. The arc table contains topological information about the arcs. This includes the start and end node, and the polygon to the left and right that the arc is an element of. The polygon table defines the arcs that make up each polygon. While arc, node, and polygon terminology is used by most GIS vendors, some also introduce terms such as edges and faces to define arcs and polygons. This is merely the use of different words to define topological definitions. Do not be confused by this.
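The tables described above can be sketched for a tiny data set of two adjacent unit squares, A and B, that share one arc. The identifiers, the tuple layout, and the choice to derive the node and polygon tables from the arc table are assumptions made for this illustration; real GIS software stores these tables in its own formats.

```python
from collections import defaultdict

# Coordinate table: the vertices of each arc (the only place x, y values live).
arc_coords = {
    "a1": [(1, 1), (0, 1), (0, 0), (1, 0)],  # outer boundary of A
    "a2": [(1, 0), (1, 1)],                  # boundary shared by A and B
    "a3": [(1, 0), (2, 0), (2, 1), (1, 1)],  # outer boundary of B
}

# Arc table: (from_node, to_node, left_polygon, right_polygon).
# None stands for the area outside all polygons.
arc_table = {
    "a1": ("N2", "N1", "A", None),
    "a2": ("N1", "N2", "A", "B"),
    "a3": ("N1", "N2", "B", None),
}

node_table = defaultdict(list)     # node -> arcs that meet there
polygon_table = defaultdict(list)  # polygon -> arcs that bound it
for arc_id, (from_node, to_node, left, right) in arc_table.items():
    node_table[from_node].append(arc_id)
    node_table[to_node].append(arc_id)
    for poly in (left, right):
        if poly is not None:
            polygon_table[poly].append(arc_id)

print(dict(node_table))
print(dict(polygon_table))
```

The shared boundary a2 is stored once and referenced by both polygons, in contrast to a structure where each polygon carries its own copy of the boundary.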

Since most input data does not exist in a topological data structure, topology must be built with the GIS software. Depending on the data set this can be a CPU-intensive and time-consuming procedure. This building process involves the creation of the topological tables and the definition of the arc, node, and polygon entities. To properly define the topology there are specific requirements with respect to graphic elements, e.g. no duplicate lines, no gaps in arcs that define polygon features, etc. These requirements are reviewed in the Data Editing section of the book.

The topological model is utilized because it effectively models the relationship of spatial entities. Accordingly, it is well suited for operations such as contiguity and connectivity analyses. Contiguity involves the evaluation of feature adjacency, e.g. features that touch one another, and proximity, e.g. features that are near one another. The primary advantage of the topological model is that spatial analysis can be done without using the coordinate data. Many operations can be done largely, if not entirely, by using the topological definition alone. This is a significant advantage over the CAD or spaghetti vector data structure that requires the derivation of spatial relationships from the coordinate data before analysis can be undertaken.
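As a hedged illustration of analysis without coordinate data, the sketch below answers adjacency and connectivity questions from an arc table alone. The table describes two hypothetical polygons, A and B, that share one arc; the tuple layout and function names are assumptions made for this example.

```python
# Arc table only: (from_node, to_node, left_polygon, right_polygon).
# None stands for the area outside all polygons. No coordinates are stored.
arc_table = {
    "a1": ("N2", "N1", "A", None),
    "a2": ("N1", "N2", "A", "B"),
    "a3": ("N1", "N2", "B", None),
}


def adjacent_polygons(arcs):
    """Return the set of polygon pairs that share at least one arc."""
    pairs = set()
    for _from, _to, left, right in arcs.values():
        if left is not None and right is not None and left != right:
            pairs.add(frozenset((left, right)))
    return pairs


def connected_arcs(arcs, arc_id):
    """Return the other arcs that share a node with arc_id."""
    nodes = set(arcs[arc_id][:2])
    return sorted(
        other for other, (f, t, _l, _r) in arcs.items()
        if other != arc_id and (f in nodes or t in nodes)
    )


print(adjacent_polygons(arc_table))
print(connected_arcs(arc_table, "a2"))
```

Both answers come from table lookups; with a CAD or spaghetti structure the same questions would require comparing boundary coordinates.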

The major disadvantage of the topological data model is its static nature. It can be a time consuming process to properly define the topology depending on the size and complexity of the data set. For example, 2,000 forest stand polygons will require considerably longer to build the topology than 2,000 municipal lot boundaries. This is due to the inherent complexity of the features, e.g. lots tend to be rectangular while forest stands are often long and sinuous. This can be a consideration when evaluating the topological building capabilities of GIS software. The static nature of the topological model also implies that every time some editing has occurred, e.g. forest stand boundaries are changed to reflect harvesting or burns, the topology must be rebuilt. The integrity of the topological structure and the DBMS tables containing the attribute data can be a concern here. This is often referred to as referential integrity. While topology is the mechanism to ensure integrity with spatial data, referential integrity is the concept of ensuring integrity for both linked topological data and attribute data.