This chapter describes the theoretical and practical sides of relational database management systems (RDBMSs), including how to design a relational database system or a set of relational tables to achieve optimum performance with Visual Basic 4.0
database applications. This book distinguishes between a client/server RDBMS such as Microsoft SQL Server 6.0, a desktop RDBMS such as Access 2.0/95, and a set of relational tables such as a database created from a collection of dBASE table and index
files. The database design principles you learn in this chapter apply equally to each of these database systems. Indexing methods that improve the speed of searching for a specific record in a table and increase the performance when two or more related
tables are joined also are covered in this chapter.
This book is written for Visual Basic programmers who want to extend their programming skills to 32-bit Windows database applications as well as for database developers who are making the transition from the character-based world of DOS to the
graphical environment of Windows. Visual Basic 4.0 is an ideal tool for use with client/server RDBMSs (also called SQL databases), so a substantial part of this book is devoted to creating front-ends for displaying and updating data in networked
client/server environments. This chapter is intended for readers with limited database design experience. If you're an accomplished relational database developer, you might want to skim this chapter for now and then proceed to Chapter 5, "Learning Structured
Query Language."
The history of digital computers is inextricably tied to the concurrent development of database methodology. It is probably a safe estimate that at least 80 percent of worldwide computer resources and programming activities are devoted to database
applications.
The first military and commercial applications for tabulating machines and digital computers were devoted to retrieving, calculating, and reporting on data stored in the form of punched cards. For example, a deck of cards containing the names and
addresses of customers constituted a database table. Another deck of punched cards, containing information on the invoices issued to customers for a given period, represented another table. Using a collating device, you could shuffle the two decks so that
a new deck was created wherein each customer card was followed by the cards detailing the orders for that customer. You could then use the cards to print a report in which the customer's card provided a report subheading and all of the orders for the
customer were printed under the subheading. Cards from other decks could be collated into the two-level deck to create more detailed reports. Early tabulating machines included the capability to create customer subtotals and order grand totals. Figure 4.1
shows the effect of collating two decks of cards to print a report of invoices.
Figure 4.1. Collating punched cards for customers and invoices to print a report.
The obvious problem with the punched-card collation technique was that every time you wanted a different report, you had to separate (decollate) the cards back into their original decks and then manually run a different set of collation processes.
Replacing tabulating machines with computers equipped with nine-track magnetic tape drives solved many of the problems associated with moving decks of cards back and forth between collators. You transferred the data from a sorted deck of cards to a
magnetic tape, then mounted the tapes you needed on tape drives and let the computer combine (merge) the data on the "table" tapes onto a new tape whose data was identical to that of a collated deck of punched cards. Then you could print a report
from the data on the newly recorded tape.
Punched-card decks and magnetic tapes are sequential devices. Finding a particular record requires that you search from the first card of the deck or the first record of a tape and read each card or record until you find a match (or determine that no
matching record exists). When high-capacity random-access data storage devices (such as disk drives) became available, searching for a particular record became much faster, even if you had to read each record in the table. To speed the process, sorting and
indexing methods were developed to minimize the number of records the computer had to read until matching data was found. The seminal work in this field was Volume 3 of Stanford University Professor Donald E. Knuth's Art of Computer Programming
series, Sorting and Searching, published by Addison-Wesley in 1973, which remains in print today (ISBN 0-201-03803-X).
Advances in computer technology after the advent of the random-access disk drive have occurred primarily in the form of architectural, rather than conceptual, changes to both hardware and software. The pace of improvement in the operating speed and the
rate of cost reduction of computer hardware has far out-distanced the rate of progress in software engineering, especially in database design and programming methodology. You can make substantial improvement in the performance of an ill-conceived and
poorly implemented database design simply by acquiring a faster computer. The price of a new computer is usually much less than the cost of re-engineering the organization's legacy database structure. Ultimately, however, poor database designs and
implementations result in a severe case of organizational inefficiency. One of the purposes of this chapter is to provide a sufficient background in database design to make sure that the database structures you create do not fall into this category.
The sections that follow discuss elementary database terminology in the object-oriented terms to which you were introduced in Chapter 2, "Understanding the Jet 3.0 Data Access Object." Also, the different types of computer database
structures used today are described and compared.
Whatever the data storage and retrieval mechanism used, a database is a collection of one or more table objects. A table is a database object that consists of a collection of rows (records) that have an identical collection
of properties. The values associated with the properties of a table appear in columns (fields). Row-column (spreadsheet) terminology is most commonly used in databases that employ SQL statements to manipulate table data, whereas desktop
databases commonly use record-field. This book uses the terms record and field when referring to persistent database objects (TableDef objects), and row and column when referring to virtual tables (Recordset objects) created
from TableDef and QueryDef objects. The distinction between the two forms of terminology, however, is not very significant. Figure 4.2 illustrates the generalized structure of a database table.
Figure 4.2. The generalized structure of a database table.
The object hierarchy of the Microsoft Jet 1.x database engine supplied with Visual Basic 3.0 and Access 1.x included a Table object, a member of the Tables collection. The Table object and Tables collections do not appear in the Microsoft DAO 3.0 Object Library or the Microsoft DAO 2.5/3.0 Compatibility Library. Visual Basic 4.0 supports operations on Table objects such as the OpenTable method for backward compatibility with Jet 1.x code.
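In Visual Basic 4.0, you access a table directly by opening a table-type Recordset object. The fragment below is a minimal sketch, assuming a hypothetical Sales.mdb database that contains a Customers table; the commented lines show the obsolete Jet 1.x syntax that the compatibility layer still accepts.

    Dim dbSales As Database
    Dim rsCust As Recordset

    ' Open the database in the default Workspace
    Set dbSales = DBEngine.Workspaces(0).OpenDatabase("C:\Data\Sales.mdb")

    ' DAO 3.0 style: a table-type Recordset replaces the Table object
    Set rsCust = dbSales.OpenRecordset("Customers", dbOpenTable)

    ' Jet 1.x style, supported only for backward compatibility:
    ' Dim tblCust As Table
    ' Set tblCust = dbSales.OpenTable("Customers")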
The following list describes the most important properties of database table objects. These property descriptions and rules apply to tables of conventional databases that use only fundamental data types: character strings and numerical values.
- Each table in a database has a unique name.
- Each field (column) of a table has a unique name and contains values of a single data type.
- Each data cell contains a single, atomic value; repeating groups of values are not permitted.
- No two records (rows) of a table should be identical.
- The order of the records and of the fields of a table is not significant.
Online help for Visual Basic 4.0 defines a base table as "a table in a Microsoft Jet database." This is a Jet-centric definition of base table, not the definition commonly used in RDBMS circles.
A database consists of one or more tables; if more than one table is included in a database, the entities described in the tables must be related by at least one attribute class (field) that is common to two of the tables. The formal statistical name
for a database is heterogeneous universe; a recently coined alternative term is universe of discourse. This book adheres to the term database. Object-oriented databases (OODBs) do not conform strictly to the atomicity rules
for attributes. Similarly, the OLE Object field data type of Jet databases is not atomic because the data in the fields of the OLE Object field data type contains both the object's data (or only the presentation for a linked object) and a reference to the
application that created the object. The object type of the content of Jet OLE Object fields may vary from record to record.
A query is a method by which you obtain access to a subset of records from one or more tables that have attribute values satisfying some criteria. There are a variety of ways to process queries against databases. Processing queries against
databases with the Jet database engine is the subject of the next chapter, "Learning Structured Query Language." You also can use queries to modify the data in tables; this type of query is described in Chapter 7, "Running Crosstab and
Action Queries."
The simplest database form consists of one table with records having enough columns to contain all of the data you need to describe the entity class. The term flat-file is derived from the fact that the database itself is
two-dimensional: the number of table fields determines the database's width, and the quantity of table records specifies its height. There are no related tables in the database, so the concept of data depth, the third dimension, does not apply. Any
database that consists of a single flat table is, by definition, a flat-file database. (Relational databases, for example, require that tables be flat.)
Flat-file databases are suitable for simple telephone and mailing lists. The Windows 3.x Cardfile is an example of a simple flat-file database. Ranges of cells, which are designated as "databases" by spreadsheet applications, also are
flat-files. A mailing-list database, as an example, has designated fields for names, addresses, and telephone numbers. Data files used in Microsoft Word's print merge operations constitute flat-file databases.
You run into problems with flat-file databases when you attempt to expand the use of a mailing-list database to include sales contacts, for example. If you develop more than one sales contact at a firm, there are only two ways to add the data for the
new contact:
- Add a complete new record for the firm, duplicating the firm's name, address, and other data in the fields for the new contact
- Add a new set of fields (Contact2 and Phone2, for example) to every record in the table to hold the data for the additional contact
Neither of these choices is attractive because both choices are inefficient. Both methods can waste a considerable amount of disk space, depending on the database file structure you use. Adding extra records duplicates data, and adding new fields
results in many records that have no values (nulls) for multiple contact and telephone number fields. Adding new fields causes trouble when you want to print reports. It is especially difficult to format printed reports that have repeating groups.
Regardless of the deficiencies of flat-file databases, many of the early mainframe computers only offered flat-file database structures. All spreadsheet applications offer "database" cell ranges that you can sort by a variety of methods.
Although spreadsheet "databases" appear to be flat, this is seldom truly the case. One of the particular problems with spreadsheet databases is that the spreadsheet data model naturally leads to inconsistencies in attribute values and repeating
groups. Time-series data contained in worksheets is a classic example of a repeating group. The section, "Organizing Entity Classes," (which appears later in this chapter) shows you how to deal with inconsistent entity classes that occur in
worksheet "databases," and the "Normalizing Table Data" section describes how to eliminate repeating groups.
The inability of flat-file databases to efficiently deal with data that involves repeating groups led to the development of a variety of different database structures (called models) for mainframe computers. The first standardized and
widely accepted model for mainframe databases was the network model developed by the Conference on Data Systems Languages (CODASYL), which also developed Common Business-Oriented Language (COBOL) to write applications that manipulate the data in
CODASYL network databases. Although the CODASYL database model has its drawbacks, an extraordinary number of mainframe CODASYL databases remain in use today. There are billions of lines of COBOL code in use every day in North America.
CODASYL databases substitute the term record type for table, but the characteristics of a CODASYL record type are fundamentally no different from the properties of a table. CODASYL record types contain pointers to records of other record
types. A pointer is a value that specifies the location of a record in a file or in memory. For example, a customer record contains a pointer to an invoice for the customer, which in turn contains a pointer to another invoice record for the customer, and
so on. The general term used to describe pointer-based record types is linked list; the pointers link the records into an organized structure called a network. Network databases offer excellent performance when you are seeking a set of
records that pertain to a specific object because the relations between records (pointers) are a permanent part of the database. However, the speed of network databases degrades when you want to browse the database for records that match specific criteria,
such as all customers in California who purchased more than $5,000 worth of product "A" in August 1993.
The problem with CODASYL databases is that database applications (primarily COBOL programs) need to update the data values and the pointers of records that have been added, deleted, or edited. The need to sequentially update both data and pointers adds a great deal of complexity to transaction-processing applications for CODASYL databases.
IBM developed the hierarchical model for its IMS mainframe database product line, which uses the DL/1 language. The hierarchical model deals with repeating groups by using a data structure that resembles an upside-down tree: Data in primary
records constitute the branches and data in repeating groups are the leaves. The advantage of the hierarchical model is that the methods required to find related records are simpler than the techniques needed by the network model. As with the CODASYL
model, there are a large number of hierarchical databases running on mainframe computers today.
The relational database model revolutionized the database world and enabled PCs to replace expensive minicomputers and mainframes for many database applications. The relational database model was developed in 1970 by Dr. E. F. Codd of IBM's San
Jose Research Laboratories. The primary advantage of the relational model is that there is no need to mix pointers and data in tables. Instead, records are linked by relations between attribute values. A relation consists of a linkage between
records in two tables that have identical attribute values. Figure 4.3 illustrates relations between attribute values of relational tables that constitute part of a sales database.
Figure 4.3. Relationships between tables in a sales database.
Because relational tables do not contain pointers, the data in relational tables is independent of the methods used by the database management system to manipulate the records. A relational database management system is an executable application that
can store data in and retrieve data from sets of related tables in a database. The RDBMS creates transitory virtual pointers to records of relational tables in memory. Virtual pointers appear as they are needed to relate (join) tables and are
disposed of when the relation is no longer required by a database application. The "joins" between tables are shown as dashed lines in Figure 4.3. Joins are created between primary key fields and foreign key fields of relational tables. The
primary and foreign key fields of the tables of Figure 4.3 are listed in Table 4.1.
Table 4.1. Primary-key and foreign-key fields of the tables shown in Figure 4.3.

Table          Primary Key      Foreign Key
-------------  ---------------  -----------
Customers      Cust#            None
Invoices       Inv#             Cust#
Invoice Items  Inv# and Prod#   Inv#
Relational databases require duplicate data among tables but don't permit duplication of data within tables. You must duplicate the values of the primary key of one table as the foreign key of dependent tables. A dependent table
is a table that requires a relationship with another table to identify its entities fully. Dependent tables often are called secondary or related tables. Thus, the Invoices table is dependent on the Customers table to supply the real-world
name and address of the customer represented by values in the Cust# field. Similarly, the Invoice Items table is dependent on the Invoices table to identify the real-world object (in this case an invoice) to which records are related.
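In Jet SQL, the joins drawn as dashed lines in Figure 4.3 are expressed with an INNER JOIN clause on the primary-key and foreign-key fields of Table 4.1. The fragment below is a hedged sketch; the Sales.mdb path and the CustName field are assumptions, and the brackets are required because the key field names contain the # symbol.

    Dim dbSales As Database
    Dim rsJoin As Recordset
    Dim strSQL As String

    Set dbSales = DBEngine.Workspaces(0).OpenDatabase("C:\Data\Sales.mdb")

    ' Relate customers to their invoices on the Cust# key values
    strSQL = "SELECT Customers.[Cust#], Customers.CustName, Invoices.[Inv#] " & _
             "FROM Customers INNER JOIN Invoices " & _
             "ON Customers.[Cust#] = Invoices.[Cust#]"
    Set rsJoin = dbSales.OpenRecordset(strSQL, dbOpenSnapshot)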
There are three types of relations defined by the relational database model, each of which is described in the following list:
- One-to-one relations, in which one record in a table corresponds to exactly one record in another table
- One-to-many relations, in which one record in the primary table corresponds to zero or more records in the related table, such as the relation between the Customers and Invoices tables of Figure 4.3; one-to-many relations are the most common type
- Many-to-many relations, in which many records of one table correspond to many records of another table; relational database managers implement many-to-many relations as a pair of one-to-many relations through an intermediary table
One situation in which a one-to-one relationship is useful is for an employees table in which the employees' names, addresses, and telephone numbers need to be available to many database users; however, information about salaries, benefits, and other personal information should be restricted on a need-to-know basis. Databases, such as Jet, don't provide column-level permissions, so you create a one-to-one relationship between the table that contains the nonconfidential data and the one that contains confidential information. Then, you grant read-only permission to everyone (the users group) for the nonconfidential table, and only grant permission to a limited number of users for the confidential table.
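The following Jet SQL DDL statements are a minimal sketch of such a one-to-one table pair; the Personnel.mdb database and the table and field names are hypothetical. The two tables share the EmpID primary key, and you assign the differing permissions to each table separately with DAO security objects or Access.

    Dim dbPers As Database

    Set dbPers = DBEngine.Workspaces(0).OpenDatabase("C:\Data\Personnel.mdb")

    ' Nonconfidential data, readable by the users group
    dbPers.Execute "CREATE TABLE EmployeesPublic " & _
        "(EmpID LONG CONSTRAINT pkPub PRIMARY KEY, " & _
        "EmpName TEXT(50), Phone TEXT(20))", dbFailOnError

    ' Confidential data in a one-to-one relation with EmployeesPublic
    dbPers.Execute "CREATE TABLE EmployeesPrivate " & _
        "(EmpID LONG, Salary CURRENCY, " & _
        "CONSTRAINT pkPriv PRIMARY KEY (EmpID), " & _
        "CONSTRAINT fkPriv FOREIGN KEY (EmpID) " & _
        "REFERENCES EmployeesPublic (EmpID))", dbFailOnError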
The proper definition of the relations between entity classes and the correct designation of primary and foreign keys constitute the foundation of effective relational database design methods. The relational database model is built on formal
mathematical concepts embedded in relational algebra. Fortunately, you don't need to be a mathematician to design a relational database structure. A set of five rules, discussed in a forthcoming section of this chapter, defines the process of creating
tables that conform to the relational model.
The preceding description of the relational database model made the important point that the properties (such as the data) of a relational table object are independent of the methods used to manipulate the data. This means that you can use any
relational database management application to process the data contained in a set of relational tables. For example, you can export the data in the tables of an IBM DB2 mainframe database as a set of text files that preserve the tables' structure. You can
then import the text files into tables created by another database management system. Alternatively, you can use Jet, an ODBC driver for DB2, and a network gateway to the DB2 database to access the data directly. The independence of data and implementation
in relational databases also enables you to attach tables from one database type to another. You can join the attached tables to the native tables in your Jet database without going through the export-import exercise. Thus, you can design a relational
database that can be implemented with any relational database manager.
Relational database managers differ in the types of data you can store in tables and in how you name the fields of tables. Many RDBMSs, such as SQL Server, include the long varbinary field data type, which enables you to store image data in tables; others, including the most commonly used versions of IBM's DB2, do not support long varbinary fields or their equivalent. You can embed spaces and other punctuation symbols in Jet table and field names, but you can't in most other RDBMSs' tables. If you are designing a database that may be ported from the original RDBMS to another relational database implementation, make sure you use only the fundamental field data types and conform to the table- and field-naming conventions of the least versatile of the RDBMSs.
There are, to be sure, substantial differences in how relational database systems are implemented. These differences are often overlooked by persons new to database management or persons converting from a mainframe database system to a desktop database
manager. The following sections discuss how mainframe, mini-computer, and client/server databases differ from traditional desktop database managers.
Full-featured client/server relational database management systems separate the database management application (server or back-end) from the individual (client) applications that display, print, and update the information in the database.
Client/server RDBMSs, such as Microsoft SQL Server 6.0, run as a process on the server computer. The majority of client/server systems in use today run under one or more flavors of the UNIX operating system, but Windows NT 3.5+ is rapidly gaining ground on
UNIX as an application server operating system. The client/server RDBMS is responsible for the following activities:
- Processing queries received from client applications and returning the result sets to the clients
- Maintaining the security of the database by restricting each user's access to authorized tables and operations
- Enforcing referential and domain integrity
- Managing transactions and handling concurrency by locking the records or pages being updated
- Backing up and restoring the database
Separate database applications (front-ends) are responsible for creating the query statements sent to the database management system and for processing the rows of data returned by the query. Front-ends, as mentioned in Chapter 1, "Positioning
Visual Basic 4.0 in the Database Front-End Market," handle all of the data formatting, display, and report-printing chores. One of the primary advantages of using an SQL RDBMS is that the features in the preceding list, such as security and integrity,
are implemented by the RDBMS itself. Therefore, the code to implement these features doesn't need to be added to each different front-end application. Chapter 20, "Creating Front-Ends for Client/Server Databases," describes the features of
client/server RDBMSs in greater detail.
The stored procedures of client/server databases used to execute predefined queries and maintain database integrity use SQL plus proprietary SQL language extensions, such as the Transact-SQL dialect used by the Microsoft SQL Server, Sybase SQL Server, and
Sybase System 10+ products. SQL is a set-oriented, not a procedural, programming language; therefore, dialects of SQL are not well suited to writing programs for validating data in accordance with complex business rules. An example of a complex business rule is this:
"The current credit limit of a customer is equal to the customer's maximum credit limit, less uncontested open invoices and orders in process, unless the customer has outstanding, uncontested invoices beyond terms plus 10 days, or if the total amount
of contested invoices exceeds 50 percent of the total amount of open invoices. If a pending order exceeds the customer's calculated credit limit, or any customer payment is behind terms plus 10 days, approval must be obtained from the credit manager prior
to accepting the order." Such a test is impossible to program as a conventional, linear SQL stored procedure, because the last condition requires the credit manager to intervene in the process.
Three-tier client/server architecture adds a processing layer between the front-end client and the back-end server. The processing layer, often called a Line-Of-Business object (LOBject), processes requests from client applications, tests the
requests for conformance with programmed business rules, and sends conforming requests to the back-end RDBMS, which updates the affected tables. Visual Basic 4.0 lets you create OLE 2.1 Remote Automation Objects (RAOs) that reside on an application
server, which need not be the server that runs the RDBMS. Each client application using the LOBject creates its own instance of the RAO. Figure 4.4 illustrates the architecture of a three-tier client/server application that uses Microsoft Mail 3.5 to
process credit approvals (or rejections) for the scenario described in the preceding paragraph. Chapter 23, "Creating and Managing Remote Automation Objects," describes how to design and code RAOs for three-tier client/server applications.
Figure 4.4. A three-tier client/server database system for implementing a credit management LOBject.
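From the client's perspective, using the LOBject is ordinary OLE Automation. The fragment below is a sketch only; the CreditLOB.CreditManager programmatic ID and the ApproveOrder method are hypothetical names, and Remote Automation must be configured so that CreateObject instantiates the object on the application server rather than locally.

    Dim objCredit As Object
    Dim blnApproved As Boolean

    ' Each client creates its own instance of the remote LOBject
    Set objCredit = CreateObject("CreditLOB.CreditManager")

    ' The LOBject applies the business rules, consulting the RDBMS and,
    ' if necessary, routing the request to the credit manager
    blnApproved = objCredit.ApproveOrder("BONAP", 5000@)

    Set objCredit = Nothing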
Traditional desktop database managers, such as dBASE and Paradox, combine their database management features with the interpreter or compiler that executes the application's source code. The early versions of these products enabled you to create quite
sophisticated database applications that would run on PCs with only 256K of RAM. The constraints of available RAM in the early days of the PC required that the database management portion of the desktop DBM's code include only those features that were
absolutely necessary to make the product operable. Therefore, products of this type, which also include early versions of FoxPro and Clipper for DOS, do not truly qualify as full-fledged relational database management systems and are more properly termed
database file managers. You implement the "relational" and the "management" features through the application source code you write.
The components of the dBASE and Paradox DBMs that manipulate the data contained in individual tables do not provide all of the built-in features of true relational database management systems listed in the preceding section. (The exception is the
desktop products' capability to create a file that contains one table.) You need to write application code to enforce referential and domain integrity, and a one-DOS-file-per-table system does not lend itself to establishing secure databases.
The commercial success of dBASE (especially dBASE III+) and Paradox for DOS created a user base of perhaps six million. (Borland claims there are four million dBASE users worldwide; about the same number of copies of Microsoft Access had been sold when this
edition was written.) Therefore, dBASE and Paradox product upgrades need to be backward-compatible with tens of millions of .DBF and .DB files and billions of lines of dBASE and PAL code. New features, such as file- and record-locking for multiuser
applications and the capability to use SQL to create queries, are add-ins (or tack-ons) to the original DBMS. Therefore, both dBASE and Paradox are currently losing market share to relatively low-cost client/server RDBMSs (such as Microsoft SQL Server), to
Microsoft Access (a hybrid of the desktop DBMS and the full-featured RDBMS), and to Visual Basic.
Microsoft Access is a cross between a conventional desktop DBMS and a complete, full-featured RDBMS. Access uses a single database file that includes all of the tables native to the database. Access's Jet database engine enforces referential integrity
for native tables at the database level, so you don't need to write Access Basic (Access 1+) or VBA (Access 95) code to do this. Jet enforces domain integrity at the field and table levels when you alter the value of a constrained field.
Jet databases include system tables that catalog the objects in the database, and concurrency issues are handled by the database drivers.
Access lets you break the back-end/front-end barrier that separates RDBMSs from desktop DBMSs. Application code and objects, such as forms and reports, can be included in the same database file as the tables. Microsoft used Access's capability to
include both front-end and back-end components in a single .mdb file as a strong selling point. It soon became apparent to Access developers that separating application and database objects into individual .mdb files was a better design. You create a Jet
database that contains only tables and attach these tables to an Access .mdb file that provides the front-end functionality. User names and passwords are stored in a separate .mdw workgroup (formerly .mda) library file; this is necessary because a Jet .mdb
file contains only one database. Here again, you can put sets of unrelated tables in a single .mdb file. Jet's proprietary flavor of SQL is the native method of manipulating data in tables; it's not an afterthought. The Jet DLLs that implement the database
functionality are independent of the MSACCESS.EXE file that includes the code you use to create forms and reports. Jet databases are about as close as you are likely to get to an RDBMS in a low-cost, mass-distributed software product.
Using Jet .mdb database files with Visual Basic front-ends approximates the capabilities and performance of client/server RDBMSs at a substantially lower cost for both hardware and software. If you are currently using one-file-per-table DBMSs, consider
attaching the tables to a Jet database during the transition stage while your new 32-bit Windows front-ends and your existing applications need to simultaneously access the tables. Once the transition to Visual Basic front-ends is completed, you can import
the data to a Jet database and take full advantage of the additional features .mdb files offer. If you outgrow the Jet database structure, it's a quick and easy port to SQL Server 6.0 and the Microsoft SQL Server ODBC driver.
You no longer need a copy of Microsoft Access to take advantage of Access's built-in enforcement of referential integrity and the security features of Jet databases. Visual Basic 4.0 and the Jet 3.0 Data Access Object now provide programmatic implementation of referential integrity and security. The Jet database engine and the 32-bit Jet ODBC 2.0 driver now support the SQL FOREIGN KEY and REFERENCES reserved words to define relationships during table creation; however, they don't implement the SQL-92 CHECK reserved word, which enables you to enforce domain integrity with ranges or lists of values that constrain attribute values. If you're seriously into database development with Jet 3.0 databases, you should purchase a copy of Microsoft Access for Windows 95 (called Access 95 in this book). You will quickly recoup your investment because Access lets you establish relationships in a graphic Relationships window, supplies a simple method for adding field-level and table-level constraints, provides a graphic Query-by-Design window, and generates Jet SQL statements for you.
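The fragment below sketches the FOREIGN KEY and REFERENCES syntax for the sales tables of Figure 4.3, using hypothetical field lists; after the second statement executes, Jet itself rejects Invoices records whose Cust# value has no match in Customers.

    Dim dbSales As Database

    Set dbSales = DBEngine.Workspaces(0).OpenDatabase("C:\Data\Sales.mdb")

    dbSales.Execute "CREATE TABLE Customers " & _
        "([Cust#] LONG CONSTRAINT pkCust PRIMARY KEY, " & _
        "CustName TEXT(50), Address TEXT(80))", dbFailOnError

    ' REFERENCES defines the relationship; Jet enforces the
    ' referential integrity without application code
    dbSales.Execute "CREATE TABLE Invoices " & _
        "([Inv#] LONG CONSTRAINT pkInv PRIMARY KEY, " & _
        "[Cust#] LONG, Amount CURRENCY, " & _
        "CONSTRAINT fkCust FOREIGN KEY ([Cust#]) " & _
        "REFERENCES Customers ([Cust#]))", dbFailOnError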
The first step in designing a relational (or any other) database is to determine what objects need to be represented by database entities and what properties of each of these objects require inclusion as attribute classes. The process of identifying
the tables required in the database and the fields that each table needs is called data modeling. There are two approaches that you can take during the process of data modeling:
- Application-oriented (bottom-up) design, in which you define only the tables and fields needed to support a specific application
- Subject-oriented (top-down) design, in which you model the objects of the organization as a whole and then create the applications that use the resulting tables
Although application-oriented design can enable you to quickly create an ad hoc database structure and the applications to accomplish a specific goal, bottom-up design is seldom a satisfactory long-term solution to an organization's information
needs. It is common to find several application-oriented databases within an organization that have duplicate data, such as independent customer lists. When the firm acquires a new customer, each of the customer tables needs to be updated. This is an
inefficient and error-prone process.
Subject-oriented database design is a far more satisfactory method. You might want to divide the design process into department-level or workgroup-related databases, such as those in the following list:
- A sales and marketing database with customer, prospect, and order tables
- An accounting database with invoice, payment, and general-ledger tables
- A personnel database with employee and payroll tables
There is no fixed set of rules to determine which shared tables should be located in what database. Often, these decisions are arbitrary or are based on political, rather than logical, reasoning. Department-level databases are especially suited for
multiuser Jet databases running on peer-to-peer networks with 30 or fewer users. Each department can have its own part-time database administrator (DBA) who handles backing up the database, granting and revoking the rights of users to share individual
tables in the database, and periodically compacting the database to regain the space occupied by deleted records.
Diagramming relations between tables can aid in visualizing database design. Entity-relation (E-R) diagrams, also called entity-attribute-relation (EAR) diagrams, are one of the most widely used methods of depicting the relations between
database tables. The E-R diagramming method was introduced by Peter Chen in 1976. An E-R diagram consists of rectangles that represent the entity classes (tables). Ellipses above the table rectangles show the attribute class (field) involved in the
relation. Pairs of table rectangles and field ellipses are connected by parallelograms to represent the relation between the fields. Figure 4.5 illustrates an E-R diagram for the Customers and Invoices tables of the database described in Figure 4.3 and
Table 4.1. The 1 and m adjacent to the table rectangles indicate a one-to-many relationship between the two tables.
Figure 4.5. An E-R diagram of the one-to-many relation between the Customers and Invoices tables.
E-R diagrams describe relations by predicates. One of the definitions of predicate is "a term designating a property or relation." If you remember parsing sentences in grammar school English classes, you'll observe that in the sentence
"Customers are sent invoices," "Customers" is the subject, "are sent" is the predicate, and "invoices" is the predicate object of a complete sentence. E-R diagrams are capable of describing virtually any type of allowable relation between two tables by
adding additional symbols to the basic diagram shown in Figure 4.5. A very large number of E-R diagrams are required to define relationships between the numerous entities in enterprise-wide databases.
Designing databases to accommodate the information requirements of an entire firm is a major undertaking. Therefore, computer-aided software engineering (CASE) tools often are used for the design of complex database systems. CASE tools for database
design usually include the following capabilities:
- Graphical creation and editing of diagrams that define the entities, attributes, and relations of the database
- A central repository and data dictionary that document the database design
- Automatic generation of the database schema, tables, and indexes for a variety of RDBMSs
- Reverse-engineering of existing databases into diagrams and repository entries
- Printing of diagrams and design documentation
Mainframe database developers have a variety of CASE tools from which to choose, and there are several CASE tools that serve the client/server market, such as Popkin Software's Data Architect. Developers using desktop DBMSs have not been so fortunate;
no commercial CASE tools with the basic features in the preceding list were available for xBase and Paradox developers at the time this book was written.
Database modeling tools are a simplified version of CASE tools for information system development. Modeling tools omit the business modeling aspects of CASE tools, but implement at least the majority of the features described in the preceding list. An
example of a database modeling tool designed specifically for Jet databases is Asymetrix, Inc.'s InfoModeler. InfoModeler is a full-fledged database design system that you can use to create Jet databases from structured English statements that define
entity and attribute classes. InfoModeler employs a unique database-diagramming method, called Object-Role Modeling (ORM), derived from the Natural-language Information Analysis Method (NIAM) and Binary Relationship Modeling (BRM) database design
methodologies. A Formal ORM (FORM) diagram for the classic student-instructor-course database model, included as a tutorial example with InfoModeler, appears in Figure 4.6. Dr. Terry Halpin of the University of Queensland, who devised ORM, has written a
500-page college-level textbook, Conceptual Schema & Relational Database Design, Second Edition (Prentice Hall Australia, 1995, ISBN 0-13-355702-2) that explains ORM in detail.
Figure 4.6. A Formal Object-Role Modeling diagram created in InfoModeler.
InfoModeler uses a language called Formal Object-Role Modeling Language (FORML) to describe entity classes, attribute classes, and relations between entities. The semantics of FORML are similar to the subject/predicate statements of E-R diagrams. In
E-R diagramming, you draw the pictorial elements and then add the callouts that describe the elements. Using FORML, you start with a structured English sentence that states a fact about an object. Figure 4.7 shows InfoModeler's Object Browser dialog with a
list of FORML facts that apply to the Building object (table).
Figure 4.7. The Object Browser dialog of InfoModeler.
After you've entered as many facts as you can collect about each object that is to be represented in your database, you run InfoModeler's fact compiler to create the database design. The fact compiler adds table and field definitions to the repository
database and then builds the FORM diagram. FORM diagrams are more compact than E-R diagrams, so it is practical to fully depict all fields of tables and relations between tables in the database schema. An even more compact schema is provided by
InfoModeler's Table Browser window, as shown in Figure 4.8.
Figure 4.8. The Table Browser window of InfoModeler.
The Table Browser window shown in Figure 4.8 displays eight of the 12 tables that constitute InfoModeler's tutorial database. You can hide or minimize the individual windows that display the details of a table to reduce display clutter. Relations
between table fields are depicted by lines that terminate in "+" symbols, which represent the "one" side, and ">" symbols, which represent the "many" side of a one-to-many relation.
Experienced database designers might conclude that a database modeling tool such as InfoModeler is "overkill" for relatively simple database design projects. If you've designed a number of complex databases, you can probably create a
relational table structure by rote. Although entering FORML sentences to describe entity and attribute classes might seem tedious, the benefits of FORM far outweigh the cost of the time needed to write object descriptions. You get an object repository,
schema, table diagram, FORM diagram, and data dictionary, plus InfoModeler creates the Jet database tables for you. An additional benefit is that FORM diagrams are comprehensible to clients who are not database experts. Therefore, your client can validate
a database design before you commit to creating your database applications. This avoids the sinking feeling that pervades database developers when they hear their clients utter those dreaded words, "Uh, we forgot to tell you about..."
InfoModeler also includes a reverse-engineering feature that lets you create a FORM diagram, stored in InfoModeler's repository, from an existing Jet database. One of the major benefits of using InfoModeler is the ability to track
changes to the database schema, in a manner similar to that of Visual SourceSafe.
If you use a modeling tool to create your database structure, the modeling tool automatically creates tables that comply with the basic rules of relational database design. Database developers, however, often are faced with the task of importing or
using existing data that is not in a suitable format for a relational database. It is quite common for database developers to be presented with the task of transforming data contained in spreadsheet "databases" into tables of a relational
database. Another scenario is the conversion of a poorly designed, dysfunctional database, or a CODASYL network database that contains repeating groups, into a proper relational structure. (COBOL permits the use of the GroupName OCCURS
n TIMES clause to create repeating groups in network databases.)
The sections that follow describe the methods you use to transform nonrelational data to fully relational form.
In the "Flat-File Databases" section of this chapter, I mention that the worksheet data model often contains inconsistent entities in rows. The stock prices example, shown in Figure 4.9, is a sample Excel worksheet whose structure violates
every rule applicable to relational database tables except attribute atomicity. STOCKS is a worksheet that lists the New York Stock Exchange's (NYSE) closing, high, and low prices for shares and the sales volume of 25 stocks for a five-day period. Rows
contain different entities and columns B through F are repeating groups. The Stocks5.xls workbook in Excel 5.0/7.0 format, which is included on the accompanying CD-ROM, is used in the examples that follow. You'll find Stocks5.xls in the
\Ddg_vb4\32_bit\Chaptr04 folder.
Figure 4.9. A worksheet whose structure is the antithesis of proper database table design.
You need to separate the entity classes according to the object each entity class represents. The four entity classes of the STOCKS worksheet of the Stocks5.xls workbook are the closing price, the highest transaction price, the lowest transaction
price, and the trading volume of a particular stock on a given day. To separate the entity classes you need to add a column so that the stock is identified by its abbreviation in each row. You can identify the data entities by their classes (Close,
High, Low, and Volume) plus the abbreviation for the stock, which is added to the new column with a simple recorded Excel VBA macro. Then, you sort the data on the Entity and Key columns. The result of this process appears, as shown for the Stocks1
worksheet, in Figure 4.10.
The date columns represent mixed entity types (three prices in dollars and the volume in shares), but each entity is now identified by its type. Therefore, you can divide the entities into separate tables at any point in the transformation process.
Figure 4.10. The STOCKS worksheet with entities sorted by entity class.
Now you have a table that contains entities with consistent attribute values, because you moved the inconsistent stock name abbreviation to its own attribute class, Key, and replaced the stock abbreviation in the Entity column A with a value consistent
with the Entity attribute class, Close. However, the repeating-groups problem remains.
Manual worksheet methods of manipulating tabular data are used in this chapter because worksheet techniques, such as selecting, cutting, and pasting groups of cells, represent the easiest and fastest way to change the structure of tabular data. If you need to transform a large amount of worksheet data into relational tables, you should use VBA OLE Automation methods or create general-purpose VBA macros in the worksheet application to automate the transformation process. Examples of Visual Basic 4.0 applications that use OLE Automation methods for data transformation appear in Chapter 14, "Integrating Database Front-Ends with OLE 2.1."
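The fragment below is a minimal OLE Automation sketch of one such transformation step, copying the stock abbreviation into a new Key column; it assumes Excel is installed, that Stocks5.xls resides in the folder named earlier, and that the abbreviation is in column A and column G is empty, so adjust the cell addresses to the actual worksheet layout.

    Dim objXL As Object      ' Excel.Application
    Dim objSheet As Object   ' the STOCKS worksheet
    Dim lngRow As Long

    Set objXL = CreateObject("Excel.Application")
    objXL.Workbooks.Open "C:\Ddg_vb4\32_bit\Chaptr04\Stocks5.xls"
    Set objSheet = objXL.Workbooks("Stocks5.xls").Worksheets("STOCKS")

    ' Copy the stock abbreviation into the new Key column (column G)
    For lngRow = 2 To 101
        objSheet.Cells(lngRow, 7).Value = objSheet.Cells(lngRow, 1).Value
    Next lngRow

    objXL.Workbooks("Stocks5.xls").Save
    objXL.Quit
    Set objXL = Nothing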
The process of transforming existing data into relational form is called normalization. Normalization of data is based on the assumption that you have organized your data into a tabular structure wherein the tables contain only a single entity
class. Here are the objectives of data normalization:
- To eliminate duplicated data from tables
- To prevent the update, insertion, and deletion anomalies that duplicated data causes
- To minimize the storage space the data occupies
The sections that follow describe the five steps that constitute full normalization of relational tables. In most cases, you can halt the normalization process at the third normal form. Many developers bypass the fourth and fifth normal forms because
these normalization rules appear arcane and inapplicable to everyday database design.
The first normal form requires that tables be flat and contain no repeating groups. A data cell of a flat table may contain only one atomic (indivisible) data value. If your imported data contains multiple data items in a single field, you need to add
one or more new fields to contain each data item; then, move the multiple data items into the new field.
The Northwind.mdb sample database included with Access 95 includes a table, Customers, whose Address field contains data that violates the first normal form because some cells contain a two-line address. Figure 4.11 shows the Customers table of
Northwind.mdb in datasheet mode. The multi-line addresses for Hungry Coyote Import Store and Island Trading violate the atomicity rule. Therefore, you need another field, such as Location, to contain the second line of two-line entries in the Address
field. For parcel delivery services, such as Federal Express, you need the physical address in the Location field for firms that use postal boxes to receive their mail.
If you are an xBase or Paradox developer, you might think that adding a field to contain the physical location portion of the address causes unwarranted expansion of the size of the Customers table. This is not the case with Jet tables, because Jet databases use variable-length fields for the Text field data type. If an entry is missing in the Location field for a customer, the field contains only the Null value in databases that support Null values. Size is an issue for fixed-width xBase and Paradox table fields, because you must provide enough width to accommodate both lines of the address, whether one or two fields are used to contain the address data. An alternative to adding a Location field is to create a separate table for location data that has an optional one-to-one relation with the Customers table. This is a less efficient approach than accepting Null or blank values in records that do not have Location values.
Eliminating repeating groups is often a tougher process when you are transforming worksheet data. Four of the five columns of stock price data shown in the preceding Figure 4.10 are repeating groups. The "quick and dirty" method of
eliminating repeating groups is a series of copy, cut, paste, and fill operations. You add another column to specify the date for each entry. Then, you cut and paste the cells for the four repeating groups into the column renamed from the beginning date of
the series to PriceVolume. The final step is to sort the data on the Entity, Key, and Date fields. A portion of the resulting worksheet (Stocks2) appears in Figure 4.12. Instead of 101 rows for the 25 stocks, you now have 501 rows and what appears to be a
large amount of duplicated data. However, your data is now in first normal form.
Figure 4.12. The STOCKS worksheet transformed to first normal form in the Stocks2 worksheet.
The second normal form requires that all data in nonkey fields of a table be fully dependent on the primary key and on each element (field) of the primary key when the primary key is a composite primary key. "Fully dependent on" is a synonym
for "uniquely identified by." It is clear from examining the data shown in Figure 4.12 that the only nonkey column of Stocks2 is PriceVolume. The Entity, Key, and Date fields are members of the composite primary key. The sorting process used in
the preceding section proves this point.
There is a controversy among database designers as to whether or not objects that have a common attribute class, such as price, should be combined into a single table with an identifier to indicate the type of price, such as List, Distributor, or OEM
for products, or in this case, Close, High, and Low transaction price for the day. This process is called subclassing an entity. There's no argument, however, that the volume data deserves its own worksheet, at least for now, so you cut the volume
data from Stocks2 and paste it to a new worksheet called Volume. You can delete the Entity column from the Volume sheet, because the name of the sheet now specifies the entity class.
The data in the Volume sheet, with field names added, is shown in Figure 4.13. Each entity now is uniquely identified by the two-part composite primary key comprising the Key and Date fields. You can import the data from the Volume sheet into a table
from any application that supports importing Excel 5.0/7.0 tables contained in a workbook (Access 95, for example). Both the Stocks2 and Volume worksheets contain data in second normal form.
Figure 4.13. The Volume worksheet in second normal form.
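As an alternative to importing, Jet 3.0's installable ISAM driver for Excel lets you open the worksheet directly as a table. This is a hedged sketch; it assumes the Excel ISAM is installed and that the Volume sheet carries the field names in its first row (HDR=YES).

    Dim dbXL As Database
    Dim rsVolume As Recordset

    ' Open the workbook read-only through the Excel 5.0 ISAM
    Set dbXL = DBEngine.Workspaces(0).OpenDatabase( _
        "C:\Ddg_vb4\32_bit\Chaptr04\Stocks5.xls", False, True, _
        "Excel 5.0;HDR=YES;")

    ' A worksheet is addressed as a table by its name plus a $ suffix
    Set rsVolume = dbXL.OpenRecordset("SELECT * FROM [Volume$]")
    Debug.Print rsVolume.Fields("Key").Value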
The third normal form requires that all nonkey fields of a table be dependent on the table's primary key and independent of one another. Thus, the data in a table must be normalized to second normal form to ensure dependency on the primary key. The
issue here is the dependencies of nonkey fields. A field is dependent on another field if a change in the value of one nonkey field forces a change in the value of another nonkey field.
At this point in the normalization process, you have the following three choices for how to design the table(s) to hold the stock price data:
- Leave the data in a single table in which the Entity field subclasses each value as a Close, High, Low, or Volume entry
- Decompose the data into four separate tables, one each for the Close, High, and Low prices and the trading Volume, related one-to-one on the Key and Date fields
- Combine the data into a single table with one record per stock per day and separate High, Low, Close, and Volume fields
The decision among your available choices for a table structure that meets the third normal form is a judgment call based on the meaning of the term independent. Are stock prices and trading volumes truly independent of one another? Are the
closing, high, and low prices dependent upon the vagaries of the stock market and the whims of traders, and thus independent of one another? These questions mix the concepts of dependence and causality. Although it is likely that a higher opening price
will result in a higher closing price, the causality is exogenous to the data itself. Exogenous data is data that is determined by factors beyond the control of any of the users of the database. The values of the data in the table are determined by
data published by the NYSE after the exchange closes for the day. Therefore, the values of each of the attribute classes are independent of one another, and you can choose any of the three methods to structure your stock prices table(s).
After you have determined that your data structure meets the third normal form, the most important consideration is to avoid over-normalization of your data. Over-normalization is the result of applying too strict an interpretation of dependency
at the third normal stage. Creating separate tables for High, Low, and Close prices, as well as share-trading Volume, is overkill. You need to join three tables in a one-to-one relationship to display the four data values for a stock. This will be a very
slow process unless you create indexes on the primary key of each table. You have four tables, so you need four indexes. Even after indexing, the performance of your stock prices database will not be as fast as a table that contains all of the values.
Plus, the four indexes will be larger than a single index on a table that contains fields for all four attributes.
The rule for third normal form should have two corollary rules:
- Don't decompose a table into two or more tables that must be rejoined in one-to-one relationships to display or report the attribute values of an entity
- When subclassed entities share the same primary key, combine the entities into a single table with one field for each subclass
Therefore, the answer to the question of which structure is the best (which was posed in the preceding section) is answered by the suggested corollary rules for the third normal form. Create a new Stocks3 worksheet with fields for the High, Low, and
Close prices, as well as for the trading Volume. Then paste the appropriate cells to Stocks3 and add field names. Figure 4.14 shows the result of this process.
Figure 4.14. The stock prices data in third normal form.
A more elegant method of transforming worksheet data with repeating data groups to third normal form is to use Excel's new Pivot feature, introduced with Excel 5.0, to perform some of the transformation operations for you. (The Pivot feature is related to the TRANSFORM and PIVOT statements of Jet SQL, which are discussed in Chapter 7.)
The fourth normal form requires that independent data entities not be stored in the same table when many-to-many relations exist between these entities. If many-to-many relations exist between data entities, the entities are not truly independent;
therefore, such tables usually fail the third normal form test. The fourth normal form requires that you create a relation table that contains any data entities that have many-to-many relations with other tables. The stock prices data does not contain data
in a many-to-many relation, so this data cannot be used to demonstrate decomposition of tables to the fourth normal form.
The fifth normal form requires that you be able to exactly reconstruct the original table from the new table(s) into which the original table was decomposed or transformed. Applying the fifth normal form to your resulting table is a good test to make
sure you did not lose data in the process of decomposition or transformation. The Stocks3 worksheet contains every piece of data contained in the original STOCKS worksheet; therefore, with enough cutting, pasting, and sorting, you could restore it. It is
often a tedious process to prove compliance with the fifth normal form. Fortunately, compliance with the fifth normal form rule does not require that you be able to use ANSI SQL statements to reconstruct the original table.
The primary purpose of adding indexes to tables is to increase the speed of searches for specific data values. If you want to display all persons named Smith in the LastName field of a table, creating an index on the LastName field results in a
substantial improvement in the search performance. Without an index, the database manager must start at the beginning of the table and then test every record for the occurrence of "Smith" in the LastName field. If you create an index on the
LastName field of the table, the searching operation uses the index, not the table itself, to find the first record where LastName = 'Smith'.
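Creating such an index takes a single Jet SQL DDL statement, shown in the sketch below with a hypothetical Personnel.mdb database and Employees table.

    Dim dbPers As Database

    Set dbPers = DBEngine.Workspaces(0).OpenDatabase("C:\Data\Personnel.mdb")

    ' Searches on LastName now read the index instead of every record
    dbPers.Execute "CREATE INDEX idxLastName ON Employees (LastName)", _
        dbFailOnError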
Joining two tables by the primary key field(s) of one table and the foreign key field(s) of another table is a special case of searching for records. When you join two tables, the search criterion becomes Table2.ForeignKey = Table1.PrimaryKey. The
database manager must match every foreign key value with a primary key value. Without an index on both the primary key field(s) and the foreign key field(s), joining large tables can take a very long time.
The sections that follow describe the indexing methods in common use with today's desktop DBMSs and client/server RDBMSs, the structure of database tables and indexes, and how to choose the fields of tables to index so that you achieve optimum
application performance.
An index, in simplified terms, consists of a table of pointers to records or groups of records. The index entries that contain the pointer values, usually with an unsigned long integer data type, are organized in a binary hierarchy to reduce the number of tests
required to find a record that matches the search criteria. The three levels of the hierarchy traditionally are called the root, branch, and leaf levels. (Here again, the analogy to an inverted tree is used.) However, the number of levels in the
branch hierarchy actually depends on the number of records in the indexed table. The root leads to one of two branches, and each branch leads to another branch until you reach the leaf level, which is indivisible. The leaf level of the index contains the
pointers to the individual records or, in the case of Jet and most client/server databases, the pages that contain the records. The branches contain pointers to other branches in the index or to the leaves.
The exact method of indexing field values varies with the database manager you use; dBASE (.NDX and .MDX files), FoxPro (.IDX and .CDX), Clipper (.NTX), Paradox (.PX), Btrieve, and Jet indexes vary in structure. (Btrieve and Jet do not store indexes in
separate files, so no file classifications are given for these two databases.) Regardless of the indexing method, indexing techniques reduce the number of records that must be searched to find the first record matching the search criteria. The most
efficient indexes are those that find the first matching record with the fewest tests (passes) of the value of the indexed field.
Traditional desktop DBMs store fixed-width records in individual files and store indexes on the fields of the file in one or more index files. FoxPro 2+ and dBASE IV enable you to store multiple indexes for a single table in a single .CDX or .MDX
file, respectively. The table files used by these database managers have fixed-length records, so you can identify a record by its offset (its distance, in bytes) from the beginning of the data in the file, immediately after the header portion of the file.
Therefore, pointers to records in these files consist of offset values.
Jet and the majority of client/server databases store indexes as special structures (not tables) within the database file. These database types support variable-length fields for Text (varchar), Memo (long varchar), OLE Object and Binary field data
types (varbinary and long varbinary). To prevent the tables from becoming full of holes when you delete records or extend the length of a variable-length field, Jet and SQL Server databases use pages to store data, rather than records. Jet and SQL Server
pages are 2K in length, corresponding to the standard size of a cluster on a fixed disk of moderate size formatted by Windows 95 or DOS. (As the size of fixed disk partitions grows, so does the size of the clusters. As an example, the cluster size of a 1G
partition using DOS's and Windows 95's FAT file system is 32K.) Therefore, if you increase the number of characters in a text field, the worst case condition is that the RDBMS must move 2K of data to make room for the data. If there is not enough empty space
(called slack) in the page to hold the lengthened data, the RDBMS creates a new page and moves the data in the record to the new page. Figure 4.15 illustrates the structure of the 2K data pages employed by Jet and SQL Server databases with about 1.7K of
variable-length records and roughly 350 bytes of slack.
Figure 4.15. A page in a Jet or SQL Server database with variable-length records.
The only drawback to the page-locking methodology is that you lock an entire page when updating a record in the page. If the record you are updating is very small, the lock can affect a number of records that you are not editing. If you use the optimistic locking technique offered by Jet and SQL Server, the lock is likely to be transparent to other database users, especially if your front-end application is very fast. dBASE and Paradox tables use record-locking, which affects only the single record being edited. Hopefully, future versions of Jet and Microsoft SQL Server will support individual record-locking.
Btrieve table files include data pages and index pages. Data pages consist of separate fixed-length and variable-length pages. The size of all pages within a Btrieve file must be equal. You can specify the size of the pages in the range of 512 to
4,096 bytes when you create the table. Choosing the correct page size for the type of data and the average length of records in the table can have a profound effect on the performance of Btrieve databases. Btrieve table files are covered in
more detail in Chapter 12, "Extending Your Use of the Data Access Object."
The most common method of indexing tables is the balanced-tree (B-tree) method, which traces its ancestry to the balanced binary tree proposed by Russian mathematicians G. M. Adelson-Velskii and E. M. Landis in 1962 and was introduced in its modern form
by R. Bayer and E. M. McCreight in 1972. Prior to balanced-tree methods, editing, inserting, and deleting indexed fields of records caused index trees to become lopsided, increasing the number of passes required to find the record or page with a matching
value. The balanced B-tree method reorganizes the tree to ensure that each branch connects either to other branches or to a leaf. Therefore, the B-tree index may need to be reorganized each time you add or delete a record. B-tree indexes speed
decision-support queries at the expense of transaction-processing performance. In a balanced binary tree, the length of a search path to a leaf is never more than 145 percent of the optimum path.
There is a truism in the database industry regarding the indexing of the fields of tables: Index only the fields you need to index in order to enhance the performance of your database front-ends, and don't index any other fields. The more indexes you
add to a table, the longer it takes to update entries that affect the value(s) of indexed field(s) and to add a new record, which requires updating all indexes. The problem here is knowing which fields improve application performance. The first step is to
determine what your options are. The following list describes how the database types supported by the Jet database engine handle the indexing of primary key fields:
- Jet tables: designating a primary key automatically creates a unique index, named PrimaryKey, on the key field(s)
- dBASE and FoxPro tables: there is no formal primary key; you create an index on the field(s) that serve as the key and must enforce uniqueness in your application code
- Paradox tables: the primary key field(s) must be the first field(s) of the table, and the primary key index is stored in the .PX file
- Client/server tables attached through ODBC: most RDBMSs create a unique index to enforce a PRIMARY KEY constraint declared when the table is defined
After you've determined whether you need to create an index on the primary key field(s), you need to consider what other fields to index. The following list provides some suggestions that apply to all database types:
- Index the foreign key fields you use to join tables; unindexed foreign keys are a common cause of poor query performance (see the sketch following this list)
- Index fields that frequently appear in query criteria or that you regularly use to sort records
- Don't index fields that have only a few distinct values, such as yes/no fields; such indexes seldom improve performance
- Don't index fields that change frequently unless searches on the fields justify the added update overhead; every index must be updated when you add or change records
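As an example of the first suggestion, the sketch below adds a foreign-key index to the Invoices table of the hypothetical Sales.mdb database with DAO Index objects; Jet SQL's CREATE INDEX statement accomplishes the same thing.

    Dim dbSales As Database
    Dim tdfInv As TableDef
    Dim idxCust As Index

    Set dbSales = DBEngine.Workspaces(0).OpenDatabase("C:\Data\Sales.mdb")
    Set tdfInv = dbSales.TableDefs("Invoices")

    ' Index the Cust# foreign key to speed joins against Customers
    Set idxCust = tdfInv.CreateIndex("idxCustNo")
    idxCust.Fields.Append idxCust.CreateField("Cust#")
    tdfInv.Indexes.Append idxCust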
The Jet database engine uses query optimization techniques to choose which indexes to use when your application processes an SQL query. Query optimization is discussed in Chapter 2.
If you follow the rules in the preceding list, you are not likely to go too far wrong in choosing the fields of your tables to index. If you are using Jet tables and have a copy of Microsoft Access, comparing the performance of queries with and without
foreign key indexes is a simple process. Jet indexes, as a whole, are much smaller than dBASE and FoxPro 2+ indexes. Therefore, you can relegate disk space issues to a lower priority when determining how many indexes to add to a Jet table. Index size
comparisons between Jet and dBASE tables appear in Chapter 13, "Designing Online Transaction-Processing Systems."
You need at least several hundred records to test the effect of indexing foreign key fields on your application's performance. The more records, the better the test. If you or your client observe that the performance of your Visual Basic 4.0 front-end deteriorates as the number of records in the database increases, you may have failed to add an index to an important foreign key. You won't need to change your application's code to utilize the new indexes, except where you use Seek operations on Recordset objects of the Table type, which you can safely replace with Find... methods on Recordset objects of the Dynaset and Snapshot types. The Jet database engine's query optimizer automatically uses the new index when the new index aids a query's performance.
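The difference between the two approaches is sketched below for the hypothetical Customers table; Seek requires a table-type Recordset and an explicit index, whereas FindFirst leaves index selection to the query optimizer.

    Dim dbSales As Database
    Dim rsTable As Recordset
    Dim rsDyna As Recordset

    Set dbSales = DBEngine.Workspaces(0).OpenDatabase("C:\Data\Sales.mdb")

    ' Seek: table-type Recordset with the controlling index named
    Set rsTable = dbSales.OpenRecordset("Customers", dbOpenTable)
    rsTable.Index = "PrimaryKey"
    rsTable.Seek "=", 1234
    If Not rsTable.NoMatch Then Debug.Print "Found by Seek"

    ' FindFirst: dynaset-type Recordset; Jet chooses the index
    Set rsDyna = dbSales.OpenRecordset("Customers", dbOpenDynaset)
    rsDyna.FindFirst "[Cust#] = 1234"
    If Not rsDyna.NoMatch Then Debug.Print "Found by FindFirst"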
This chapter introduced you to the methodology for designing efficient relational database structures, including modeling tools for Jet databases and how to normalize tables that you create from existing data sources. Entire books are devoted to
database design techniques, and at least one full-size text covers only Object-Role Modeling design methods. If you're interested in the theoretical side of relational databases, including relational algebra, check out a copy of E. F. Codd's or Chris
Date's books on the subject at your local library. Methods of indexing tables created by desktop DBMs and client/server RDBMSs received only cursory treatment here. Donald E. Knuth's Sorting and Searching, mentioned early in the chapter,
provides an excellent introduction to indexing methods, despite the age of the title. The suggestions at the end of this chapter on how to optimize the indexing of tables are useful with any type of database.
The next chapter introduces you to Structured Query Language (SQL), the third language you need to understand to fully utilize this book and Visual Basic 4.0 as a database front-end generator. If you are an Access developer, you're probably used to
cutting and pasting SQL statements from the query design mode's SQL window into the RecordSource property of combo boxes or into your Access Basic code. Visual Basic 4.0 doesn't have a query design mode or an SQL window, so you really do need to learn
SQL, specifically the Jet dialect of ANSI SQL, to create commercial database applications with Visual Basic 4.0.