Classifying Database Systems
Modeling Data
- Database Diagrams
- Using Modeling Tools for Database Design
Rules for Relational Database Design
- Organizing Entity Classes
- Normalizing Table Data
Indexing Tables for Performance and Domain Integrity
- Table Indexing Methods
  - Records and Data Pages
  - Balanced B-Tree Indexes
- Choosing Fields to Index
Summary

- 4 -
Optimizing the Design of Relational Databases

This chapter describes the theoretical and practical sides of relational database management systems (RDBMSs), including how to design a relational database system or a set of relational tables to achieve optimum performance with Visual Basic 4.0 database applications. This book distinguishes between a client/server RDBMS such as Microsoft SQL Server 6.0, a desktop RDBMS such as Access 2.0/95, and a set of relational tables such as a database created from a collection of dBASE table and index files. The database design principals you learn in this chapter apply equally to each of these database systems. Indexing methods that improve the speed of searching for a specific record in a table and increase the performance when two or more related tables are joined also are covered in this chapter.

This book is written for Visual Basic programmers who want to extend their programming skills to 32-bit Windows database applications as well as for database developers who are making the transition from the character-based world of DOS to the graphical environment of Windows. Visual Basic 4.0 is an ideal tool for use with client/server RDBMSs (also called SQL databases), so a substantial part of this book is devoted to creating front-ends for displaying and updating data in networked client/server environments. This chapter is intended for readers with limited database design experience. If you're an accomplished relational database developer, you might want to skim this chapter for now and then proceed to 5, "Learning Structured Query Language."

Classifying Database Systems

The history of digital computers is inexorably tied to the concurrent development of database methodology. It is probably a safe estimate that at least 80 percent of worldwide computer resources and programming activities are devoted to database applications.

The first military and commercial applications for tabulating machines and digital computers were devoted to retrieving, calculating, and reporting on data stored in the form of punched cards. For example, a deck of cards containing the names and addresses of customers constituted a database table. Another deck of punched cards, containing information on the invoices issued to customers for a given period, represented another table. Using a collating device, you could shuffle the two decks so that a new deck was created wherein each customer card was followed by the cards detailing the orders for that customer. You could then use the cards to print a report in which the customer's card provided a report subheading and all of the orders for the customer were printed under the subheading. Cards from other decks could be collated into the two-level deck to create more detailed reports. Early tabulating machines included the capability to create customer subtotals and order grand totals. Figure 4.1 shows the effect of collating two decks of cards to print a report of invoices.

Figure 4.1. Collating punched cards for customers and invoices to print a report.

The obvious problem with the punched-card collation technique was that every time you wanted a different report, you had to separate (decollate) the cards back into their original decks and then manually run a different set of collation processes. Replacing tabulating machines with computers equipped with nine-track magnetic tape drives solved many of the problems associated with moving decks of cards back and forth between collators. You transferred the data from a sorted deck of cards to a magnetic tape, then mounted the tapes you needed on tape drives and let the computer combine (merge) the data on the "table" tapes onto a new tape whose data was identical to that of a collated deck of punched cards. Then you could print a report from the data on the newly recorded tape.

Punched-card decks and magnetic tapes are sequential devices. Finding a particular record requires that you search from the first card of the deck or first record of a tape and read each card or record until you find a match (or determine that no matching record exists.) When high-capacity random-access data storage devices (such as disk drives) became available, searching for a particular record became much faster, even if you had to read each record in the table. To speed the process, sorting and indexing methods were developed to minimize the number of records the computer had to read until matching data was found. The seminal work in this field was Volume 3 of Stanford University Professor Donald E. Knuth's Art of Computer Programming series, Sorting and Searching, published by Addison-Wesley in 1973, which remains in print today (ISBN 0-201-03803-X).

Advances in computer technology after the advent of the random-access disk drive have occurred primarily in the form of architectural, rather than conceptual, changes to both hardware and software. The pace of improvement in the operating speed and the rate of cost reduction of computer hardware has far out-distanced the rate of progress in software engineering, especially in database design and programming methodology. You can make substantial improvement in the performance of an ill-conceived and poorly implemented database design simply by acquiring a faster computer. The price of a new computer is usually much less than the cost of re-engineering the organization's legacy database structure. Ultimately, however, poor database designs and implementations result in a severe case of organizational inefficiency. One of the purposes of this chapter is to provide a sufficient background in database design to make sure that the database structures you create do not fall into this category.

The sections that follow discuss elementary database terminology in the object-oriented language to which you were introduced in Chapter 2, "Understanding the Jet 3.0 Data Access Object." Also, the different types of computer database structures used today are described and compared.

Database Terminology

Whatever the data storage and retrieval mechanism used, a database is a collection of one or more table objects. A table is a database object that consists of a collection of rows (records) that have an identical collection of properties. The values associated with the properties of a table appear in columns (fields). Row-column (spreadsheet) terminology is most commonly used in databases that employ SQL statements to manipulate table data; whereas desktop databases commonly use record-field. This book uses the terms record and field when referring to persistent database objects (TableDef objects), and row and column when referring to virtual tables (Recordset objects) created from TableDef and QueryDef objects. The distinction between the two forms of terminology, however, is not very significant. Figure 4.2 illustrates the generalized structure of a database table.

Figure 4.2. The generalized structure of a database table.

The object hierarchy of the Microsoft Jet 1.x database engine supplied with Visual Basic 3.0 and Access 1.x included a Table object, a member of the Tables collection. The Table object and Tables collections do not appear in the Microsoft DAO 3.0 Object Library or the Microsoft DAO 2.5/3.0 Compatibility Library. Visual Basic 4.0 supports operations on Table objects such as the OpenTable method for backward compatibility with Jet 1.x code.

The following list describes the most important properties of database table objects. These property descriptions and rules apply to tables of conventional databases that use only fundamental data types—character strings and numerical values.

A record is a representation of a real-world object such as a person, a firm, an invoice, or one side of a transaction involving money. A record of a table is the equivalent of one punched card in a deck. In formal database terminology, a row or record is an entity. Synonyms for entities include data entities, data objects, data instances, or instances. Tables are the collection (set) of all entities of a single entity class. Statisticians call a table a homogeneous universe.
A field describes one of the characteristics of the objects represented by records. A field corresponds to a column of a spreadsheet.
The intersection between a row and a column is called an attribute, which represents the value of a significant property of a real-world object. Attributes are also called cells and data cells—terms derived from spreadsheet applications. All the attributes contained in a single column of a table are called an attribute class.
The fundamental rule of all table objects is that each field is devoted to one and only one property. (This rule is implied by the terms attribute and attribute class.) Attribute values are said to be atomic, used here as a synonym for indivisible. Each field is assigned a field name that is unique within the table. A Name field that contains entries such as "Dr. John R. Jones, Jr." is not atomic; the field actually consists of five attributes: title, first name, middle initial, last name, and suffix. You can sacrifice the atomicity of fields to a limited degree without incurring serious problems. It is a common practice to combine first name and middle initial, and sometimes the suffix, in a single field.
It is desirable, but not essential, for each record in a table to have a set of attributes by which you can uniquely distinguish one record in the table from any other record. This set is called the entity identifier or identifier. In some cases, such as tables that contain the line items of invoices, records need not have an unique identifier. It is a good database-design practice, however, to provide such an identifier, even if you have to add an item or line number attribute class to establish the uniqueness.
The field(s) that include the identifier attributes are called the primary key or primary key field(s) of the table. By definition, the set of attribute values that make up the primary key must be unique for each record. Records in related tables are joined by values of the primary key in one table and by equal values in the primary key or the foreign key(s) of the other table. A foreign key is one or more attributes of a joined table that do not constitute the primary key of the table.
Tables that contain records identifying objects that are inherently unique, such as human beings, are called primary or base tables. Each of the records of a primary table must have one or more attributes that uniquely identify the entity. Theoretically, all U.S. citizens, other than infants, have a unique Social Security number used for identification purposes. Thus, an employee table in the U.S. should be able to use a single attribute—a Social Security number—to serve as an entity identifier. A duplicate Social Security number indicates either a data entry error, a counterfeit Social Security card, or an error on the part of the Social Security Administration.
Tables are logical constructs—that is, tables need not be stored on a disk drive in a tabular format. Traditional desktop database managers, such as dBASE and Paradox, have a file structure that duplicates the appearance of the table. However, most mainframe and client/server database management systems store many tables (and even more than one database) within a single database file. Microsoft Access 1.0 was the first widely-used desktop RDBMS to store all the tables that constitute a single database in one Jet .mdb file. Thus, Jet .mdb files are called containers of tables. There is no easily discernible relationship between the physical and logical structures of tables of mainframe, client/server, and Jet database types.
The concept of a record number was introduced to the world of PC databases by dBASE II. The record number, returned by xBase's RECNO() function, is an artificial construct that refers to the relative physical position (called the offset) of a record in a table file. The record numbers change when you physically reorder (sort) the table file. Record number is not an attribute of a table, unless you create a field and add record number values (often called a record ID field). Record numbers that appear in the Access 95 equivalent of the Data control (navigation buttons) are generated by Access, not by the Jet database engine.

Online help for Visual Basic 4.0 defines a base table as "a table in a Microsoft Jet database." This is a Jet-centric definition of base table, not the definition commonly used in RDBMS circles.

A database consists of one or more tables; if more than one table is included in a database, the entities described in the tables must be related by at least one attribute class (field) that is common to two of the tables. The formal statistical name for a database is heterogeneous universe—a recently coined alternative term is universe of discourse. This book adheres to the term database. Object-oriented databases (OODBs) do not conform strictly to the atomicity rules for attributes. Similarly, the OLE Object field data type of Jet databases is not atomic because the data in the fields of the OLE Object field data type contains both the object's data (or only the presentation for a linked object) and a reference to the application that created the object. The object type of the content of Jet OLE Object fields may vary from record to record.

A query is a method by which you obtain access to a subset of records from one or more tables that have attribute values satisfying some criteria. There are a variety of ways to process queries against databases. Processing queries against databases with the Jet database engine is the subject of the next chapter, "Learning Structured Query Language." You also can use queries to modify the data in tables; this type of query is described in Chapter 7, "Running Crosstab and Action Queries."

Flat-File Databases

The simplest database form consists of one table with records having enough columns to contain all of the data you need to describe the entity class. The term flat-file is derived from the fact that the database itself is two-dimensional—the number of table fields determine the database's width, and the quantity of table records specify its height. There are no related tables in the database, so the concept of data depth, the third dimension, does not apply. Any database that contains only one table is, by definition, a flat-file database if the database requires that the tables be flat. (Relational databases, for example, require flat tables.)

Flat-file databases are suitable for simple telephone and mailing lists. The Windows 3.x Cardfile is an example of a simple flat-file database. Ranges of cells, which are designated as "databases" by spreadsheet applications, also are flat-files. A mailing-list database, as an example, has designated fields for names, addresses, and telephone numbers. Data files used in Microsoft Word's print merge operations constitute flat-file databases.

You run into problems with flat-file databases when you attempt to expand the use of a mailing-list database to include sales contacts, for example. If you develop more than one sales contact at a firm, there are only two ways to add the data for the new contact:

Add a new record with duplicate data in all fields except the contact and, perhaps, the telephone number field.
Add new fields so that you can have more than one contact name and telephone number field per record. In this case, you must add enough contact field pairs to accommodate the maximum number of contacts you expect to add for a single firm. The added fields are called repeating groups.

Neither of these choices is attractive because both choices are inefficient. Both methods can waste a considerable amount of disk space, depending on the database file structure you use. Adding extra records duplicates data, and adding new fields results in many records that have no values (nulls) for multiple contact and telephone number fields. Adding new fields causes trouble when you want to print reports. It is especially difficult to format printed reports that have repeating groups.

Regardless of the deficiencies of flat-file databases, many of the early mainframe computers only offered flat-file database structures. All spreadsheet applications offer "database" cell ranges that you can sort by a variety of methods. Although spreadsheet "databases" appear to be flat, this is seldom truly the case. One of the particular problems with spreadsheet databases is that the spreadsheet data model naturally leads to inconsistencies in attribute values and repeating groups. Time-series data contained in worksheets is a classic example of a repeating group. The section, "Organizing Entity Classes," (which appears later in this chapter) shows you how to deal with inconsistent entity classes that occur in worksheet "databases," and the "Normalizing Table Data" section describes how to eliminate repeating groups.

The Network and Hierarchical Database Models

The inability of flat-file databases to efficiently deal with data that involved repeating groups of data led to the development of a variety of different database structures (called models) for mainframe computers. The first standardized and widely accepted model for mainframe databases was the network model developed by the Committee for Data System Languages (CODASYL), which also developed Common Business-Oriented Language (COBOL) to write applications that manipulate the data in CODASYL network databases. Although the CODASYL database model has its drawbacks, an extraordinary number of mainframe CODASYL databases remain in use today. There are billions of lines of COBOL code in use everyday in North America.

CODASYL databases substitute the term record type for table, but the characteristics of a CODASYL record type are fundamentally no different from the properties of a table. CODASYL record types contain pointers to records of other record types. A pointer is a value that specifies the location of a record in a file or in memory. For example, a customer record contains a pointer to an invoice for the customer, which in turn contains a pointer to another invoice record for the customer, and so on. The general term used to describe pointer-based record types is linked list; the pointers link the records into an organized structure called a network. Network databases offer excellent performance when you are seeking a set of records that pertain to a specific object because the relations between records (pointers) are a permanent part of the database. However, the speed of network databases degrades when you want to browse the database for records that match specific criteria, such as all customers in California who purchased more than $5,000 worth of product "A" in August 1993.

The problem with CODASYL databases is that database applications (primarily COBOL programs) need to update the data values and the pointers of records that have been added, deleted, or edited. The need to sequentially update both data and pointers adds a great deal of complexity to transaction-processing applications for CODASYL databases.

IBM developed the hierarchical model for its IMS mainframe database product line, which uses the DL/1 language. The hierarchical model deals with repeating groups by using a data structure that resembles an upside-down tree: Data in primary records constitute the branches and data in repeating groups are the leaves. The advantage of the hierarchical model is that the methods required to find related records are simpler than the techniques needed by the network model. As with the CODASYL model, there are a large number of hierarchical databases running on mainframe computers today.

The Relational Database Model

The relational database model revolutionized the database world and enabled PCs to replace expensive minicomputers and mainframes for many database applications. The relational database model was developed in 1970 by Dr. E. F. Codd of IBM's San Jose Research Laboratories. The primary advantage of the relational model is that there is no need to mix pointers and data in tables. Instead, records are linked by relations between attribute values. A relation consists of a linkage between records in two tables that have identical attribute values. Figure 4.3 illustrates relations between attribute values of relational tables that constitute part of a sales database.

Figure 4.3. Relationships between tables in a sales database.

Because relational tables do not contain pointers, the data in relational tables is independent of the methods used by the database management system to manipulate the records. A relational database management system is an executable application that can store data in and retrieve data from sets of related tables in a database. The RDBMS creates transitory virtual pointers to records of relational tables in memory. Virtual pointers appear as they are needed to relate (join) tables and are disposed of when the relation is no longer required by a database application. The "joins" between tables are shown as dashed lines in Figure 4.3. Joins are created between primary key fields and foreign key fields of relational tables. The primary and foreign key fields of the tables of Figure 4.3 are listed in Table 4.1.

Table 4.1. The primary and foreign keys of the tables depicted in Figure 4.3.

Table	Primary Key	Foreign Key
Customers	Cust#	None
Invoices	Inv#	Cust#
Invoice Items	Inv# and Prod#	Inv#

Relational databases require duplicate data among tables but don't permit duplication of data within tables. You must duplicate the values of the primary key of one table as the foreign key of dependent tables. A dependent table is a table that requires a relationship with another table to identify its entities fully. Dependent tables often are called secondary or related tables. Thus, the Invoices table is dependent on the Customers table to supply the real-world name and address of the customer represented by values in the Cust# field. Similarly, the Invoice Items table is dependent on the Invoices table to identify the real-world object (in this case an invoice) to which records are related.

There are three types of relations defined by the relational database models, each of which is described in the following list:

One-to-one relations require that one and only one record in a dependent table can relate to a record in a primary table. One-to-one relations are relatively uncommon in relational databases.
One-to-many relations enable more than one dependent table to relate to a record in a primary table. The term many-to-one is also used to describe one-to-many relations. One-to-many relations constitute the relational database model's answer to the repeating-groups problem. Repeating groups are converted to individual records in the table on the "many" side of the relation. One-to-many relations are the most commonly found relations.
Many-to-many relations are not true relations, because many-to-many relations between two tables require an intervening table, called a relation table, to hold the values of the foreign keys. (Relational-database theory only defines relations between two tables.) If Figure 4.3 had included a Products table to describe the products represented by the Prod# field of the Invoice Items table, then the Invoice Items table would serve as a relation table between the Invoices and Products tables. Some relation tables only include foreign key fields.

One situation in which a one-to-one relationship is useful is for an employees table in which the employees' names, addresses, and telephone numbers need to be available to many database users; however, information about salaries, benefits, and other personal information should be restricted on a need-to-know basis. Databases, such as Jet, don't provide column-level permissions, so you create a one-to-one relationship between the table that contains the nonconfidential data and the one that contains confidential information. Then, you grant read-only permission to everyone (the users group) for the nonconfidential table, and only grant permission to a limited number of users for the confidential table.

The proper definition of the relations between entity classes and the correct designation of primary and foreign keys constitute the foundation of effective relational database design methods. The relational database model is built on formal mathematical concepts embedded in relational algebra. Fortunately, you don't need to be a mathematician to design a relational database structure. A set of five rules, discussed in a forthcoming section of this chapter, define the process of creating tables that conform to the relational model.

Types of Relational Database Managers

The preceding description of the relational database model made the important point that the properties of (such as the data in) a relational table object are independent of the methods used to manipulate the data. This means that you can use any relational database management application to process the data contained in a set of relational tables. For example, you can export the data in the tables of an IBM DB2 mainframe database as a set of text files that preserve the tables' structure. You can then import the text files into tables created by another database management system. Alternatively, you can use Jet, an ODBC driver for DB2, and a network gateway to the DB2 database to access the data directly. The independence of data and implementation in relational databases also enables you to attach tables from one database type to another. You can join the attached tables to the native tables in your Jet database without going through the export-import exercise. Thus, you can design a relational database that can be implemented with any relational database manager.

Relational database managers differ in the types of data you can store in tables and in how you name the fields of tables. Many RDBMS, such as SQL Server, include the long varbinary field data type, which enables you to store image data in tables; others, including the most commonly used versions of IBM's DB2, do not support long varbinary fields or their equivalent. You can embed spaces and other punctuation symbols in Jet table and field names, but you can't in most other RDBMS tables. If you are designing a database that may be ported from the original RDMBS to another relational database implementation, make sure you use only the fundamental field data types and conform to the table- and field-naming conventions of the least versatile of the RDBMSs.

There are, to be sure, substantial differences in how relational database systems are implemented. These differences are often overlooked by persons new to database management or persons converting from a mainframe database system to a desktop database manager. The following sections discuss how mainframe, mini-computer, and client/server databases differ from traditional desktop database managers.

Relational SQL Database Management Systems

Full-featured client/server relational database management systems separate the database management application (server or back-end) from the individual (client) applications that display, print, and update the information in the database. Client/server RDBMSs, such as Microsoft SQL Server 6.0, run as a process on the server computer. The majority of client/server systems in use today run under one or more flavors of the UNIX operating system, but Windows NT 3.5+ rapidly is gaining ground on UNIX as an application server operating system. The client/server RDBMS is responsible for the following activities:

Creating new databases and one or more files to contain the databases. (Several databases may reside in a single fixed disk file.)
Implementing database security to prevent unauthorized persons from gaining access to the database and the information it contains.
Maintaining a catalog of the objects in the database, including information on the owner (creator) of the database and the tables it contains.
Generating a log of all modifications made to the database so that the database can be reconstructed from a prior backup copy combined with the information contained in the log (in the event of a hardware failure).
Preserving referential integrity, maintaining consistency, and enforcing domain integrity rules to prevent corruption of the data contained in the tables. Most client/server RDBMSs use preprogrammed triggers that create an error when an application attempts to execute a query that violates the rules.
Managing concurrency issues so that multiple users can access the data without encountering significant delays in displaying or updating data.
Interpreting queries transmitted to the database by user applications and returning or updating records that correspond to the criteria embedded in the query statement. Virtually all client/server RDBMSs use statements written in SQL to process queries—thus the generic name, SQL RDBMS.
Executing stored procedures, which are precompiled queries that you execute by name in an SQL statement. Stored procedures speed the execution of commonly used queries by eliminating the necessity for the server to optimize and compile the query. The Remote Data Object (RDO) included with the Enterprise Edition of Visual Basic 4.0 is designed expressly for executing server stored procedures.

Separate database applications (front-ends) are responsible for creating the query statements sent to the database management system and for processing the rows of data returned by the query. Front-ends, as mentioned in Chapter 1, "Positioning Visual Basic 4.0 in the Database Front-End Market," handle all of the data formatting, display, and report-printing chores. One of the primary advantages of using an SQL RDBMS is that the features in the preceding list, such as security and integrity, are implemented by the RDBMS itself. Therefore, the code to implement these features doesn't need be added to each different front-end application. Chapter 20, "Creating Front-Ends for Client/Server Databases," describes the features of client/server RDBMSs in greater detail.

Three-Tier Client/Server Architecture and LOBjects

The stored procedures of client/server databases used to execute predefined queries and maintain database integrity use SQL plus proprietary SQL language extensions such as Transact-SQL, used by Microsoft and Sybase SQL Server products, and Sybase System 10+. SQL is a set-oriented, not a procedural, programming language; therefore, dialects of SQL are not well suited to writing programs for validating data in accordance with complex business rules. An example of a complex business rule is this: "The current credit limit of a customer is equal to the customer's maximum credit limit, less uncontested open invoices and orders in process, unless the customer has outstanding, uncontested invoices beyond terms plus 10 days, or if the total amount of contested invoices exceeds 50 percent of the total amount of open invoices. If a pending order exceeds the customer's calculated credit limit, or any customer payment is behind terms plus 10 days, approval must be obtained from the credit manager prior to accepting the order." Such a test is impossible to program as a conventional, linear SQL stored procedure, because the last condition requires the credit manager to intervene in the process.

Three-tier client/server architecture adds a processing layer between the front-end client and the back-end server. The processing layer, often called a Line-Of-Business object (LOBject), processes requests from client applications, tests the requests for conformance with programmed business rules, and sends conforming requests to the back-end RDBMS, which updates the affected tables. Visual Basic 4.0 lets you create OLE 2.1 Remote Automation Objects (RAOs) that reside on an application server, which need not be the server that runs the RDBMS. Each client application using the LOBject creates its own instance of the RAO. Figure 4.4 illustrates the architecture of a three-tier client/server application that uses Microsoft Mail 3.5 to process credit approvals (or rejections) for the scenario described in the preceding paragraph. Chapter 23, "Creating and Managing Remote Automation Objects," describes how to design and code RAOs for three-tier client/server applications.

Figure 4.4. A three-tier client/server database system for implementing a credit management LOBject.

Traditional Desktop Relational Database Managers

Traditional desktop database managers, such as dBASE and Paradox, combine their database management features with the interpreter or compiler that executes the application's source code. The early versions of these products enabled you to create quite sophisticated database applications that would run on PCs with only 256K of RAM. The constraints of available RAM in the early days of the PC required that the database management portion of the desktop DBM's code include only those features that were absolutely necessary to make the product operable. Therefore, products of this type, which also include early versions of FoxPro and Clipper for DOS, do not truly qualify as full-fledged relational database management systems and are more properly termed database file managers. You implement the "relational" and the "management" features through the application source code you write.

The components of the dBASE and Paradox DBMs that manipulate the data contained in individual tables do not provide all of the built-in features of true relational database management systems listed in the preceding section. (The exception is the desktop products' capability to create a file that contains one table.) You need to write application code to enforce referential and domain integrity, and a one-DOS-file-per-table system does not lend itself to the ease of establishing secure databases.

The commercial success of dBASE (especially dBASE III+) and Paradox for DOS created a user base of perhaps six million. (Borland claims there are four million dBASE users worldwide; about the same number of copies of Microsoft Access sold when this edition was written.) Therefore, dBASE and Paradox product upgrades need to be backward-compatible with tens of millions of .DBF and .DB files and billions of lines of dBASE and PAL code. New features, such as file- and record-locking for multiuser applications and the capability to use SQL to create queries, are add-ins (or tack-ons) to the original DBMS. Therefore, both dBASE and Paradox are currently losing market share to relatively low-cost client/server RDBMSs, such as Microsoft SQL Server, Microsoft Access, a hybrid of the desktop DBMS and the full-featured RDBMS, and Visual Basic.

Microsoft Access: A Hybrid RDBMS

Microsoft Access is a cross between a conventional desktop DBMS and a complete, full-featured RDBMS. Access uses a single database file that includes all of the tables native to the database. Access's Jet database engine enforces referential integrity for native tables at the database level, so you don't need to write Access Basic (Access 1+) or VBA (Access 95) code to do this. Jet enforces domain integrity at the field and table levels when you alter the value of a constrained field. Jet databases include system tables that catalog the objects in the database, and concurrency issues are handled by the database drivers.

Access lets you break the back-end/front-end barrier that separates RDBMSs from desktop DBMSs. Application code and objects, such as forms and reports, can be included in the same database file as the tables. Microsoft used Access's capability to include both front-end and back-end components in a single .mdb file as strong selling point. It soon became apparent to Access developers that separating application and database objects into individual .mdb files was a better design. You create a Jet database that contains only tables and attach these tables to an Access .mdb file that provides the front-end functionality. User names and passwords are stored in a separate .mdw workgroup (formerly .mda) library file; this is necessary because a Jet .mdb file contains only one database. Here again, you can put sets of unrelated tables in a single .mdb file. Jet's proprietary flavor of SQL is the native method of manipulating data in tables, it's not an afterthought. The Jet DLLs that implement the database functionality are independent from the MSACCESS.EXE file that includes the code you use to create forms and reports. Jet databases are about as close as you are likely to get to a RDBMS in a low-cost, mass-distributed software product.

Using Jet .mdb database files with Visual Basic front-ends approximates the capabilities and performance of client/server RDBMSs at a substantially lower cost for both hardware and software. If you are currently using one-file-per-table DBMSs, consider attaching the tables to a Jet database during the transition stage while your new 32-bit Windows front-ends and your existing applications need to simultaneously access the tables. Once the transition to Visual Basic front-ends is completed, you can import the data to a Jet database and take full advantage of the additional features .mdb files offer. If you outgrow the Jet database structure, it's a quick and easy port to SQL Server 6.0 and the Microsoft SQL Server ODBC driver.

You no longer need a copy of Microsoft Access to take advantage of Access's built-in enforcement of referential integrity and the security features of Jet databases. Visual Basic 4.0 and the Jet 3.0 Data Access Object now provide programmatic implementation of referential integrity and security. The Jet database engine and the 32-bit Jet ODBC 2.0 driver now support the SQL FOREIGN KEY and REFERENCES reserved words to define relationships during table creation; however, don't implement the SQL-92 CHECK reserved word, which enables you to enforce domain integrity with ranges or lists of values that constrain attribute values. If you're seriously into database development with Jet 3.0 databases, you should purchase a copy of Microsoft Access for Windows 95 (called Access 95 in this book). You will quickly recoup your investment because Access lets you establish relationships in a graphic Relationships window, supplies a simple method for adding field-level and table-level constraints, provides a graphic Query-by-Design window, and generates Jet SQL statements for you.

Modeling Data

The first step in designing a relational (or any other) database is to determine what objects need to be represented by database entities and what properties of each of these objects require inclusion as attribute classes. The process of identifying the tables required in the database and the fields that each table needs is called data modeling. There are two approaches that you can take during the process of data modeling:

Application-oriented design techniques start with a description of the type of application(s) required by the potential users of the database. From the description of the application, you design a database that provides the necessary data. This is called the bottom-up approach, because applications are ordinarily at the bottom of the database hierarchy.
Subject-oriented design methodology begins by defining the objects that relate to the subject matter of the database as a whole. This approach is called top-down database design. The content of the database determines what information front-end applications can present to the user.

Although application-oriented design can enable you to quickly create an ad hoc database structure and the applications to accomplish a specific goal, bottom-up design is seldom a satisfactory long-term solution to an organization's information needs. It is common to find several application-oriented databases within an organization that have duplicate data, such as independent customer lists. When the firm acquires a new customer, each of the customer tables needs to be updated. This is an inefficient and error-prone process.

Subject-oriented database design is a far more satisfactory method. You might want to divide the design process into department-level or workgroup-related databases, such as those in the following list:

A sales database that has tables based on customer, order and line item, invoice and line item, and product entity classes.
A production database with tables for parts, suppliers, bills of material, and cost accounting information. The product and invoice tables of the sales department's database would be attached to the production database.
A personnel database with tables for employees, payroll data, benefits, training, and other subjects relating to human resources management. The production and sales databases would attach to the employees table—production for the purposes of cost accounting purposes, and sales for commissions.
An accounting database, with tables comprising the general ledger and subsidiary ledgers, would attach to the majority of the tables in the other databases to obtain access to current finance-related information. Accounting databases often are broken down into individual orders, accounts receivable, accounts payable, and general ledger databases.

There is no fixed set of rules to determine which shared tables should be located in what database. Often, these decisions are arbitrary or are based on political, rather than logical, reasoning. Department-level databases are especially suited for multiuser Jet databases running on peer-to-peer networks with 30 or fewer users. Each department can have its own part-time database administrator (DBA) who handles backing up the database, granting and revoking the rights of users to share individual tables in the database, and periodically compacting the database to regain the space occupied by deleted records.

Database Diagrams

Diagramming relations between tables can aid in visualizing database design. Entity-relation (E-R) diagrams, also called entity-attribute-relation (EAR) diagrams, are one of the most widely used methods of depicting the relations between database tables. The E-R diagramming method was introduced by Peter Chen in 1976. An E-R diagram consists of rectangles that represent the entity classes (tables). Ellipses above the table rectangles show the attribute class (field) involved in the relation. Pairs of table rectangles and field ellipses are connected by parallelograms to represent the relation between the fields. Figure 4.5 illustrates an E-R diagram for the Customers and Invoices tables of the database described in Figure 4.3 and Table 4.1. The 1 and m adjacent to the table rectangles indicate a one-to-many relationship between the two tables.

Figure 4.5. An entity-relationship diagram showing the relationship between the Customers and Invoices tables.

E-R diagrams describe relations by predicates. One of the definitions of predicate is "a term designating a property or relation." If you remember parsing sentences in grammar school English classes, you'll observe that "Customers" is the subject, "are sent" is the predicate, and "invoices" is the predicate object of a complete sentence. E-R diagrams are capable of describing virtually any type of allowable relation between two tables by adding additional symbols to the basic diagram shown in Figure 4.5. A very large number of E-R diagrams are required to define relationships between the numerous entities in enterprise-wide databases.

Using Modeling Tools for Database Design

Designing databases to accommodate the information requirements of an entire firm is a major undertaking. Therefore, computer-aided software engineering (CASE) tools often are used for the design of complex database systems. CASE tools for database design usually include the following capabilities:

Business model generation—The first step in the use of a high-end CASE tool is to create an operational model of an enterprise, which consists of defining virtually every activity involved in operating the organization. Accurately modeling the operations of a large firm as it relates to information requirements is an extraordinarily difficult and time-consuming process.
Schema development—A database schema is a diagram that describes the entire information system pictorially, usually with less detail than that offered by E-R diagrams. The schema for a large information system with a multiplicity of databases can cover an entire wall of a large office or conference room.
Relation diagramming—Some CASE tools support several methods of diagramming relations between tables. Most, but not all, CASE tools support E-R diagrams, as well as other pictorial methods, such as Bachman diagrams.
Data dictionary development—A data dictionary is a full description of each table in the database and each field of every table. Other properties of tables and fields, such as primary keys, foreign keys, indexes, field data types, and constraints on field values, are included. Creating data dictionaries is one of the subjects of Chapter 25, "Documenting Your Database Front-Ends."
Repository creation—A repository is a database that is part of the CASE tool. The repository contains all the details of the structure and composition of the database. Data in the repository is used to create schema, relation diagrams, and data dictionaries. The repository is also responsible for maintaining version control when you change the database's design. Microsoft and Texas Instruments have joined forces to develop an object-oriented repository for 32-bit Windows database development.
Database generation—After you've designed the database, the CASE tool creates the SQL Data Definition Language (DDL) statements necessary to create the database and its tables. You then send the statements to the RDBMS, which builds the database for you.
Data flow diagramming—Some database CASE tools include the capability to create data flow diagrams that describe how data is added to tables and how tables are updated. Data flow diagrams, however, are application-related, not database design–related. Therefore, data flow diagramming capability is a prerequisite for qualification as a CASE database tool.

Mainframe database developers have a variety of CASE tools from which to choose, and there are several CASE tools that serve the client/server market, such as Popkin Software's Data Architect. Developers using desktop DBMSs have not been so fortunate; no commercial CASE tools with the basic features in the preceding list were available for xBase and Paradox developers at the time this book was written.

Database modeling tools are a simplified version of CASE tools for information system development. Modeling tools omit the business modeling aspects of CASE tools, but implement at least the majority of the features described in the preceding list. An example of a database modeling tool designed specifically for Jet databases is Asymetrix, Inc.'s InfoModeler. InfoModeler is a full-fledged database design system that you can use to create Jet databases from structured English statements that define entity and attribute classes. InfoModeler employs a unique database-diagramming method, called Object-Role Modeling (ORM), derived from the Natural-language Information Analysis Method (NIAM) and Binary Relationship Modeling (BRM) database design methodologies. A Formal ORM (FORM) diagram for the classic student-instructor-course database model, included as a tutorial example with InfoModeler, appears in Figure 4.6. Dr. Terry Halpin of the University of Queensland, who devised ORM, has written a 500-page college-level textbook, Conceptual Schema & Relational Database Design, Second Edition (Prentice Hall Australia, 1995, ISBN 0-13-355702-2) that explains ORM in detail.

Figure 4.6. A Formal Object-Role Modeling diagram created in InfoModeler.

InfoModeler uses a language called Formal Object-Role Modeling Language (FORML) to describe entity classes, attribute classes, and relations between entities. The semantics of FORML are similar to the subject/predicate statements of E-R diagrams. In E-R diagramming, you draw the pictorial elements and then add the callouts that describe the elements. Using FORML, you start with a structured English sentence that states a fact about an object. Figure 4.7 shows InfoModeler's Object Browser dialog with a list of FORML facts that apply to the Building object (table).

Figure 4.7. The Object Browser dialog of InfoModeler.

After you've entered as many facts as you can collect about each object that is to be represented in your database, you run InfoModeler's fact compiler to create the database design. The fact compiler adds table and field definitions to the repository database and then builds the FORM diagram. FORM diagrams are more compact than E-R diagrams, so it is practical to fully depict all fields of tables and relations between tables in the database schema. An even more compact schema is provided by InfoModeler's Table Browser window, as shown in Figure 4.8.

Figure 4.8. The Table Browser window of InfoModeler.

The Table Browser window shown in Figure 4.8 displays eight of the 12 tables that constitute InfoModeler's tutorial database. You can hide or minimize the individual windows that display the details of a table to reduce display clutter. Relations between table fields are depicted by lines that terminate in "+" symbols, which represent the "one" side, and ">" symbols, which represent the "many" side of a one-to-many relation.

Experienced database designers might conclude that a database modeling tool such as InfoModeler is "overkill" for relatively simple database design projects. If you've designed a number of complex databases, you can probably create a relational table structure by rote. Although entering FORML sentences to describe entity and attribute classes might seem tedious, the benefits of FORM far outweigh the cost of the time needed to write object descriptions. You get an object repository, schema, table diagram, FORM diagram, and data dictionary, plus InfoModeler creates the Jet database tables for you. An additional benefit is that FORM diagrams are comprehensible to clients who are not database experts. Therefore, your client can validate a database design before you commit to creating your database applications. This avoids the sinking feeling that pervades database developers when they hear their clients utter those dreaded words, "Uh, we forgot to tell you about. . .." InfoModeler also includes a reverse-engineering feature that lets you "reverse engineer" your Jet databases to create a FORM diagram stored in InfoModeler's repository. One of the major benefits of using InfoModeler is your ability to track changes to database schema, in a manner similar to that of Visual SourceSafe.

Rules for Relational Database Design

If you use a modeling tool to create your database structure, the modeling tool automatically creates tables that comply with the basic rules of relational database design. Database developers, however, often are faced with the task of importing or using existing data that is not in a suitable format for a relational database. It is quite common for database developers to be presented with the task of transforming data contained in spreadsheet "databases" into tables of a relational database. Another scenario is the conversion of a poorly designed, dysfunctional database, or a CODASYL network database that contains repeating groups, into a proper relational structure. (COBOL permits the use of the GroupName OCCURS Several TIMES statement to create repeating groups in network databases.)

The sections that follow describe the methods you use to transform nonrelational data to fully relational form.

Organizing Entity Classes

In the "Flat-File Databases" section of this chapter, I mention that the worksheet data model often contains inconsistent entities in rows. The stock prices example, shown in Figure 4.9, is a sample Excel worksheet whose structure violates every rule applicable to relational database tables except attribute atomicity. STOCKS is a worksheet that lists the New York Stock Exchange's (NYSE) closing, high, and low prices for shares and the sales volume of 25 stocks for a five-day period. Rows contain different entities and columns B through F are repeating groups. The Stocks5.xls workbook in Excel 5.0/7.0 format, which is included on the accompanying CD-ROM, is used in the examples that follow. You'll find Stocks5.xls in the \Ddg_vb4\32_bit\Chaptr04 folder.

Figure 4.9. A worksheet whose structure is the antithesis of proper database table design.

You need to separate the entity classes according to the object each entity class represents. The four entity classes of the STOCKS worksheet of the Stocks5.xls workbook are the closing price, the highest transaction price, the lowest transaction price, and the trading volume of a particular stock on a given day. To separate the entity classes you need to add a column so that the stock is identified by its abbreviation in each row. You can identify the data entities by their classes—Close, High, Low, and Volume—plus the abbreviation for the stock, which is added to the new column with a simple recorded Excel VBA macro. Then, you sort the data with the Entity and Key columns. The result of this process appears, as shown for the Stocks1 worksheet, in Figure 4.10.

The dates column represents a mixed entity type (three prices in dollars and the volume in shares), but each entity is now identified by its type. Therefore, you can divide the entities into separate tables at any point in the transformation process.

Figure 4.10. The STOCKS worksheet with entities sorted by entity class.

Now you have a table that contains entities with consistent attribute values, because you moved the inconsistent stock name abbreviation to its own attribute class, Key, and replaced the stock abbreviation in the Entity column A to a value consistent with the Entity attribute class, Close. However, the repeating-groups problem remains.

Manual worksheet methods of manipulating tabular data are used in this chapter because worksheet techniques, such as selecting, cutting, and pasting groups of cells, represent the easiest and fastest way to change the structure of tabular data. If you need to transform a large amount of worksheet data into relational tables, you should use VBA OLE Automation methods or create general-purpose VBA macros in the worksheet application to automate the transformation process. Examples of Visual Basic 4.0 applications that use OLE Automation methods for data transformation appear in Chapter 14, "Integrating Database Front-Ends with OLE 2.1."

Normalizing Table Data

The process of transforming existing data into relational form is called normalization. Normalization of data is based on the assumption that you have organized your data into a tabular structure wherein the tables contain only a single entity class. Here are the objectives of data normalization:

To eliminate duplicated information contained in tables
To accommodate future changes to the structure of tables
To minimize the impact of changes to database structure on the front-end applications that process the data

The sections that follow describe the five steps that constitute full normalization of relational tables. In most cases, you can halt the normalization process at the third normal form. Many developers bypass the fourth and fifth normal forms because these normalization rules appear arcane and inapplicable to everyday database design.

The First Normal Form

The first normal form requires that tables be flat and contain no repeating groups. A data cell of a flat table may contain only one atomic (indivisible) data value. If your imported data contains multiple data items in a single field, you need to add one or more new fields to contain each data item; then, move the multiple data items into the new field.

The Northwind.mdb sample database included with Access 95 includes a table, Customers, whose Address field contains data that violates the first normal form because some cells contain a two-line address. Figure 4.11 shows the Customers table of Northwind.mdb in datasheet mode. The multi-line addresses for Hungry Coyote Import Store and Island Trading violate the atomicity rule. Therefore, you need another field, such as Location, to contain the second line of two-line entries in the Address field. For parcel delivery services, such as Federal Express, you need the physical address in the Location field for firms that use postal boxes to receive their mail.

Figure 4.11. First normal form violations in the Customers table of Access 95's Northwind.mdb sample database.

If you are an xBase or Paradox developer, you might think that adding a field to contain the physical location portion of the address causes unwarranted expansion of the size of the Customers table. This is not the case with Jet tables, because Jet databases use variable-length fields for the Text field data type. If an entry is missing in the Location field for a customer, the field contains only the Null value in databases that support Null values. The size is an issue applicable to fixed-width xBase and Paradox table fields, because you must provide enough width to accommodate both lines of the address, whether one or two fields are used to contain the address data. Jet tables use variable-length fields for Text entries. An alternative to adding a Location field is to create a separate table for location data that has an optional one-to-one relation with the Customers table. This is a less efficient process than accepting Null or blank values in records that do not have Location values.

Eliminating repeating groups is often a tougher process when you are transforming worksheet data. Four of the five columns of stock price data shown in the preceding Figure 4.10 are repeating groups. The "quick and dirty" method of eliminating repeating groups is a series of copy, cut, paste and fill operations. You add another column to specify the date for each entry. Then, you cut and paste the cells for the four repeating groups into the column renamed from the beginning date of the series to PriceVolume. The final step is to sort the data on the Entity, Key, and Date fields. A portion of the resulting worksheet (Stocks2) appears in Figure 4.12. Instead of 101 rows for the 25 stocks, you now have 501 rows and what appears to be a large amount of duplicated data. However, your data is now in first normal form.

Figure 4.12. The STOCKS worksheet transformed to first normal form in the Stocks2 worksheet.

The Second Normal Form

The second normal form requires that all data in nonkey fields of a table be fully dependent on the primary key and on each element (field) of the primary key when the primary key is a composite primary key. "Fully dependent on" is a synonym for "uniquely identified by." It is clear from examining the data shown in Figure 4.12 that the only nonkey column of Stocks2 is PriceVolume. The Entity, Key, and Date fields are members of the composite primary key. The sorting process used in the preceding section proves this point.

There is a controversy among database designers as to whether or not objects that have a common attribute class, such as price, should be combined into a single table with an identifier to indicate the type of price, such as List, Distributor, or OEM for products, or in this case, Close, High, and Low transaction price for the day. This process is called subclassing an entity. There's no argument, however, that the volume data deserves its own worksheet, at least for now, so you cut the volume data from Stocks2 and paste it to a new worksheet called Volume. You can delete the Entity column from the Volume sheet, because the name of the sheet now specifies the entity class.

The data in the volume sheet, with field names added, is shown in Figure 4.13. Each entity now is uniquely identified by the two-part composite primary key comprising the Key and Date fields. You can import the data from the Volume sheet into a table from any application that supports importing Excel 5.0/7.0 tables contained in a workbook—Access 95, for example. Both the Stocks2 and Volume worksheets contain data in second normal form.

Figure 4.13. The Volume worksheet in second normal form.

The Third Normal Form

The third normal form requires that all nonkey fields of a table be dependent on the table's primary key and independent of one another. Thus, the data in a table must be normalized to second normal form to ensure dependency on the primary key. The issue here is the dependencies of nonkey fields. A field is dependent on another field if a change in the value of one nonkey field forces a change in the value of another nonkey field.

At this point in the normalization process, you have the following three choices for how to design the table(s) to hold the stock price data:

Leave the data remaining in the Stocks2 worksheet as is and create one Prices table, using the Entity column to subclass the price entity. This method requires a three-field composite key.
Create a Prices table with three columns, High, Low, and Close, using Key and Date as the composite primary key. You could even add the volume data to the Volume field and then have only one record per stock-day.
Create three separate prices tables, High, Low, and Close, and use Key and Date as the composite primary key. In this case, you don't subclass the entities.

The decision on your available choices for a table structure that meets the third normal form is a judgment call based on the meaning of the term independent. Are stock prices and trading volumes truly independent of one another? Are the opening, high, and low prices dependent upon the vagaries of the stock market and the whims of traders, and thus independent of one another? These questions mix the concepts of dependence and causality. Although it is likely that a higher opening price will result in a higher closing price, the causality is exogenous to the data itself. Exogenous data is data that is determined by factors beyond the control of any of the users of the database. The values of the data in the table are determined by data published by the NYSE after the exchange closes for the day. Therefore, the values of each of the attribute classes are independent of one another, and you can choose any of the three methods to structure your stock prices table(s).

Over-Normalizing Data and Performance Considerations

After you have determined that your data structure meets the third normal form, the most important consideration is to avoid over-normalization of your data. Over-normalization is the result of applying too strict an interpretation of dependency at the third normal stage. Creating separate tables for High, Low, and Close prices, as well as share-trading Volume, is overkill. You need to join three tables in a one-to-one relationship to display the four data values for a stock. This will be a very slow process unless you create indexes on the primary key of each table. You have four tables, so you need four indexes. Even after indexing, the performance of your stock prices database will not be as fast as a table that contains all of the values. Plus, the four indexes will be larger than a single index on a table that contains fields for all four attributes.

The rule for third normal form should have two corollary rules:

Combine all entities of an object class that can be uniquely identified by the primary (composite) key and whose nonkey values either are independent of one another or are exogenous to the database and all related databases into a single table, unless the combination violates the fourth normal form. Combining entities into a single table is called integrating data.
Decompose data into tables that require one-to-one relationships only when the relationship is optional or when you need to apply security measures to nonkey values, and your RDBMS does support column-level permissions. Decomposing data means breaking a table into two or more tables without destroying the meaningfulness of the data.

Therefore, the answer to the question of which structure is the best (which was posed in the preceding section) is answered by the suggested corollary rules for the third normal form. Create a new Stocks3 worksheet with fields for the High, Low, and Close prices, as well as for the trading Volume. Then paste the appropriate cells to Stocks3 and add field names. Figure 4.14 shows the result of this process.

Figure 4.14. The stock prices data in third normal form.

A more elegant method of transforming worksheet data with repeating data groups to third normal form is to use Excel's new Pivot feature, introduced with Excel 5.0, to perform some of the transformation operations for you. (The Pivot feature is related to the TRANSFORM and PIVOT statements of Jet SQL, which are discussed in Chapter 7.)

The Fourth Normal Form

The fourth normal form requires that independent data entities not be stored in the same table when many-to-many relations exist between these entities. If many-to-many relations exist between data entities, the entities are not truly independent; therefore, such tables usually fail the third normal form test. The fourth normal form requires that you create a relation table that contains any data entities that have many-to-many relations with other tables. The stock prices data does not contain data in a many-to-many relation, so this data cannot be used to demonstrate decomposition of tables to the fourth normal form.

The Fifth Normal Form

The fifth normal form requires that you be able to exactly reconstruct the original table from the new table(s) into which the original table was decomposed or transformed. Applying the fifth normal form to your resulting table is a good test to make sure you did not lose data in the process of decomposition or transformation. The Stocks3 worksheet contains every piece of data contained in the original STOCKS worksheet; therefore, with enough cutting, pasting, and sorting, you could restore it. It is often a tedious process to prove compliance with the fifth normal form. Fortunately, compliance with the fifth normal form rule does not require that you be able to use ANSI SQL statements to reconstruct the original table.

Indexing Tables for Performance and Domain Integrity

The primary purpose of adding indexes to tables is to increase the speed of searches for specific data values. If you want to display all persons named Smith in the LastName field of a table, creating an index on the LastName field results in a substantial improvement in the search performance. Without an index, the database manager must start at the beginning of the table and then test every record for the occurrence of "Smith" in the LastName field. If you create an index on the LastName field of the table, the searching operation uses the index, not the table itself, to find the first record where LastName = 'Smith'.

Joining two tables by the primary key field(s) of one table and the foreign key field(s) of another table is a special case of searching for records. When you join two tables, the search criterion becomes Table2.ForeignKey = Table1.PrimaryKey. The index must match every foreign key value with a primary key value. Without an index on both the primary key field(s) and the foreign key field(s), joining large tables can take a very long time.

The sections that follow describe the indexing methods in common use with today's desktop DBMSs and client/server RDBMSs, the structure of database tables and indexes, and how to choose the fields of tables to index so that you achieve optimum application performance.

Table Indexing Methods

An index, in simplified terms, consists of a table of pointers to records or groups of records. The records that contain pointer values, usually with an unsigned long integer data type, are organized in a binary hierarchy to reduce the number of tests required to find a record that matches the search criteria. Indexes traditionally refer to the three levels of the hierarchy as the root, branch, and leaf level. (Here again, the analogy to an inverted tree is used.) However, the number of levels in the branch hierarchy actually depends on the number of records in the indexed table. The root leads to one of two branches, and each branch leads to another branch until you reach the leaf level, which is indivisible. The leaf level of the index contains the pointers to the individual records or, in the case of Jet and most client/server databases, the pages that contain the records. The branches contain pointers to other branches in the index or to the leaves.

The exact method of indexing field values varies with the database manager you use; dBASE (.NDX and .MDX files), FoxPro (.IDX and .CDX), Clipper (.NTX), Paradox (.PX), Btrieve, and Jet indexes vary in structure. (Btrieve and Jet do not store indexes in separate files, so no file classifications are given for these two databases.) Regardless of the indexing method, indexing techniques reduce the number of records that must be searched to find the first record matching the search criteria. The most efficient indexes are those that find the first matching record with the fewest number of tests (passes) of the value of the indexed field.

Records and Data Pages

Traditional desktop DBMs store fixed-width records in individual files and store indexes on the fields of the file in one or more index files. FoxPro 2+ and dBASE IV enable you to store multiple indexes for a single table in a single .CDX and .MDX file, respectively. The table files used by these database managers have fixed-length records, so you can identify a record by its offset (its distance, in bytes) from the beginning of the data in the file, immediately after the header portion of the file. Therefore, pointers to records in these files consist of offset values.

Jet and the majority of client/server databases store indexes as special structures (not tables) within the database file. These database types support variable-length fields for Text (varchar), Memo (long varchar), OLE Object and Binary field data types (varbinary and long varbinary). To prevent the tables from becoming full of holes when you delete records or extend the length of a variable-length field, Jet and SQL Server databases use pages to store data, rather than records. Jet and SQL Server pages are 2K in length, corresponding to the standard size of a cluster on a fixed disk of moderate size formatted by Windows 95 or DOS. (As the size of fixed disk partitions grow, so does the size of the clusters. As an example, the cluster size of a 1G file using DOS's and Windows 95's FAT file system is 32K.) Therefore, if you increase the number of characters in a text field, the worst case condition is that the RDBMS must move 2K of data to make room for the data. If there is not enough empty space (called slack) in the page to hold the lengthened data, the RDBMS creates a new page and moves the data in the record to the new page. Figure 4.15 illustrates the structure of the 2K data pages employed by Jet and SQL Server databases with about 1.7K of variable-length records and roughly 350 bytes of slack.

Figure 4.15. A page in a Jet or SQL Server database with variable-length records.

The only drawback to the page-locking methodology is that you lock an entire page when updating a record in the page. If the record you are updating is very small, the lock can affect a number of records that you are not editing. If you use the optimistic locking technique offered by Jet and SQL Server, the lock is likely to be transparent to other database users, especially if your front-end application is very fast. dBASE and Paradox tables use record-locking, which affects only the single record being edited. Hopefully, future versions of Jet and Microsoft SQL server will support individual record-locking.

Btrieve table files include data pages and index pages. Data pages consist of separate fixed-length and variable-length pages. The size of all pages within a Btrieve file must be equal. You can specify the size of the pages in the range of from 512 to 4,096 bytes when you create the table. Choosing the correct page size for the type of data and the average length of records in the table can have a profound effect on the performance of Btrieve databases. A discussion of Btrieve table files is covered in more detail in Chapter 12, "Extending Your Use of the Data Access Object."

Balanced B-Tree Indexes

The most common method of indexing tables is the balanced binary tree (B-tree) method originally proposed by Russian mathematicians G. M. Adleson-Velski and E. M. Landis in 1963. Prior to the B-tree method, editing, inserting, and deleting indexed fields of records caused the index trees to become lopsided, increasing the number of passes required to find the record or page with a matching value. The balanced B-tree method reorganizes the tree to ensure that each branch connects either to two other branches or to a leaf. Therefore, the B-tree index needs to be reorganized each time you add or delete a record. B-tree indexes speed decision-support queries at the expense of transaction-processing performance. In a B-tree index structure, the length of a search path to a leaf is never more than 145 percent of the optimum path.

Choosing Fields to Index

There is a truism in the database industry regarding the indexing of the fields of tables: Index only the fields you need to index in order to enhance the performance of your database front-ends, and don't index any other fields. The more indexes you add to a table, the longer it takes to update entries that affect the value(s) of indexed field(s) and to add a new record, which requires updating all indexes. The problem here is knowing which fields improve application performance. The first step is to determine what your options are. The following list discusses how the database types, supported by the Jet database engine, handle the indexing of primary key fields:

Jet tables, for which you specify primary key field(s) in a TableDef object, have no-duplicates indexes that are automatically created by the Jet database engine on these field(s). Most client/server databases that you connect with the ODBC API also have no-duplicates indexes on primary key field(s), although the indexes usually aren't created automatically. A no-duplicates index prevents the addition of a record with duplicate values in the primary key field(s). You cannot remove the primary key index or specify "duplicates OK" without deleting the primary key designation.
Specify a clustered index for each table of client/server databases that support clustered indexes. (Microsoft and Sybase SQL Server offer clustered indexes.) Clustered indexes reorganize the pages of the database in the order of the clustered index. Chapter 20 discusses how to choose the field on which to add a clustered index.
Paradox tables also require no-duplicates indexes on the primary key field(s). If FILENAME.PX index file is not in the same directory as the corresponding FILENAME.DB file, your application can read the file but not update it. (If FILENAME.PX is missing, the Jet database engine creates a non-updatable Recordset object of the Snapshot type, rather than of the Dynaset type, from the table data.)
dBASE tables created by xBase applications do not support designating field(s) as primary key field(s), and you cannot create the equivalent of a no-duplicates index on dBASE tables. Even after you execute the xBase SET UNIQUE ON statement, xBase applications enable you to append as many records as you want with duplicate values in the indexed fields you use in lieu of primary key fields. (The index ignores records with duplicate indexed field values.) Your application needs to test for duplicate values and prevent the addition of records that would cause duplicated indexed field values. (You need the index so you can quickly find whether a duplicate value exists.) You can update data in dBASE table files with the Jet database engine, because no primary key can be specified.
Btrieve files support key fields, but they do not designate primary key field(s). Btrieve does not provide a no-duplicates option. (You can create a no-duplicates key field by using Btrieve's auto-increment extended key field type to insert a consecutive long integer value to the records you add.) Unlike dBASE, Btrieve adds 8 bytes of data to the field to indicate the chronological order of the addition of fields with duplicate key values. Your application needs to test for existing key values to prevent duplication.

After you've determined whether you need to create a (primary) key field(s) index, you need to consider what other fields to index. The following list provides some suggestions that apply to all database types:

Use short codes to identify entities that do not have properties designed to identify the entity, such as part numbers for products. Creating indexes on long text fields, such as product names, unnecessarily increases the size of the index table, slows performance, and wastes disk space.
Indexes based on numeric values usually have better performance than indexes on character fields. Using auto-incrementing fields—Jet 3.0's AutoIncrement (formerly Counter) field data type, for example—as a primary key field sometimes is feasible when you import existing data.
Index the foreign key field(s) that participate in joins with other tables.
Index the foreign key fields that your client will search most often.
Do not create a separate index for the indexed fields of the composite primary key. Almost all database management systems enable searches on partial key matches, so such an index would duplicate the existing primary key index.
Prior to Jet 2.0, it was necessary to use Table objects, not Dynaset or Snapshot objects, and the Seek method if you wanted to take advantage of indexes other than the primary key index in your Visual Basic 4.0 applications. When you use the Find. . . methods on Jet 2+ tables or queries, Jet will use the appropriate index, if such an index exists. Jet 2+ indexes use Rushmore technology (adopted from FoxPro), which provides a major improvement in indexed searches. 32-bit Jet 3.0 provides a major boost in the speed of indexing and indexed searches compared to Jet 2.0 and 2.5.
Avoid using the Like "*Criteria" statements in Jet SQL and LIKE '%Criteria' statements in ANSI SQL. Queries that contain these statements cannot use indexes.
Don't try to create indexes on fields of the long data types—long varchar (Jet Memo fields) or long varbinary (Jet OLE Object fields). Neither the Jet database engine or the client/server RDBMSs are capable of creating indexes on these field data types.

The Jet database engine uses query optimization techniques to choose which indexes to use when your application processes an SQL query. Query optimization is discussed in Chapter 2.

If you follow the rules in the preceding list, you are not likely to go too far wrong in choosing the fields of your tables to index. If you are using Jet tables and have a copy of Microsoft Access, comparing the performance of queries with and without foreign key indexes is a simple process. Jet indexes, as a whole, are much smaller than dBASE and FoxPro 2+ indexes. Therefore, you can relegate disk space issues to a lower priority when determining how many indexes to add to an Jet table. Index size comparisons between Jet and dBASE tables appear in Chapter 13, "Designing Online Transaction-Processing Systems."

You need at least several hundred records to test the effect of indexing foreign key fields on your application's performance. The more records, the better the test. If you or your client observe that the performance of your Visual Basic 4.0 front-end deteriorates as the number of records in the database increases, you may have failed to add an index to an important foreign key. You won't need to change your application's code to utilize the new indexes, except where you use Seek operations on Recordset object of the Table type, which you can now safely replace with Find. . . methods on Recordset objects of the Dynaset and Snapshot types. The Jet database engine's query optimizer automatically uses the new index when the new index aids a query's performance.

Summary

This chapter introduced you to the methodology for designing efficient relational database structures, including modeling tools for Jet databases and how to normalize tables that you create from existing data sources. Entire books are devoted to database design techniques, and at least one full-size text covers only Object-Role Modeling design methods. If you're interested in the theoretical side of relational databases, including relational algebra, check out a copy of E. F. Codd's or Chris Date's books on the subject at your local library. Methods of indexing tables created by desktop DBMs and client/server RDBMSs received only cursory treatment here. Donald E. Knuth's Searching and Sorting, mentioned early in the chapter, provides an excellent introduction to indexing methods, despite the age of the title. The suggestions at the end of this chapter on how to optimize the indexing of tables are useful with any type of database.

The next chapter introduces you to Structured Query Language (SQL), the third language you need to understand to fully utilize this book and Visual Basic 4.0 as a database front-end generator. If you are an Access developer, you're probably used to cutting and pasting SQL statements from the query design mode's SQL window into the RecordSource property of combo boxes or into your Access Basic code. Visual Basic 4.0 doesn't have a query design mode or an SQL window, so you really do need to learn SQL—specifically the Jet dialect of ANSI SQL—to create commercial database applications with Visual Basic 4.0.

- 4 - Optimizing the Design of Relational Databases