Teach Yourself Database Programming
with Visual C++ 6 in 21 days


Day 7
      Database Design



A good database design pays big dividends. A database that is properly designed is easy to write code for, provides good performance, and furnishes useful information that you had not anticipated needing at the time the database was first designed.

Today you will learn

Designing a good database is harder than it looks. It is also more important than many developers realize. In other words, designing a database takes time and effort, but it is time and effort well spent.

Database Normalization

Even though designing a database can be difficult, a few rules of thumb simplify the process. In addition, there are some sound scientific principles for designing relational databases. These scientific principles of database design are called the normal forms of relational databases.

Normal forms are database design rules that specify levels of conformance to the relational model. There are six levels of conformance, beginning at the first normal form (1NF), progressing through the fifth (5NF), and concluding with the highest level of conformance, which is the domain/key normal form (DKNF).

Database normalization is the process of designing the tables in a database so that they conform to the normal forms of the relational database model.

The normal forms are essentially a measure of how well the tables in your database conform to the relational model. The normal forms are nested. If a table in your database conforms to the 3NF, it automatically conforms to the 2NF and the 1NF as well.

Building a database that conforms to the normal forms of the relational model takes effort, but it is a worthwhile effort. If your database conforms to the relational model, everyone who uses your database (now and in the future) will be assured of having access to the data in a way that makes the database open and therefore valuable.

Rules of Thumb for Relational Database Design

Before delving into the normal forms, I will explain the rules of thumb that simplify the process of database design. There are three of them, and I call them Robison's Rules of Database Design, or R2D2 for short.

TIP
R2D2 #1 (the first rule): The number of records in your database should mirror the number of objects in real life.

If one instance of an object exists in real life (IRL), one and only one record (one row in a table) should exist in your database. If exactly twenty instances exist IRL, exactly twenty records should exist in your database.

For example, if your database stores information about customers, the database should have one, and only one, record for each customer. The database should not hold more than one customer per record, nor should it split a customer across multiple records.

R2D2 #1 is fundamental to designing a relational database and will make your application more valuable because it helps ensure that your database is open and accessible. I worked on a commercial application that, to optimize performance, stored more than one IRL object per record in the database. The result was a marginal gain in the performance of the database. However, because of its lack of conformance to the relational model, the database could not be accessed outside the application. This limitation proved to be a detriment to the application's commercial success.

TIP
R2D2 #2: The fields in each record should represent the attributes of the objects in real life.

If you use R2D2 #1 and make each record represent an object IRL, you can use R2D2 #2 to figure out what fields those records should contain. You can also deduce what the data types of those attributes should be.

TIP
R2D2 #3: The relationships between objects in real life should be mirrored in the relationships between records in the database.

If a customer can place more than one order, this is a one-to-many relationship. The database must be built to store more than one order record for each customer record, thus mirroring the one-to-many relationship between customers and orders IRL. The other types of relationships are one to one and many to many.

The relationships between records in various tables in a relational database are called the entity relationships.

There are tools and techniques for implementing one-to-one, one-to-many, and many-to-many relationships in relational database. You will learn about these tools and techniques later today.

Normal Forms of the Relational Database Model

In addition to Robison's Rules for Database Design are the normal forms of the rela-tional model. Based on sound scientific principles and ensure that your database will be accessible and valuable, now and in the future.

The First Normal Form

The first normal form (1NF) requires that, in a given table, the data type of each field must not change from record to record. In C++ parlance, a database table must be like an array of structures; the data structure does not vary from element to element in the array. A database table must not be like an array of unions, in which the data structure can vary from element to element.

Each column must have a unique name. Each field in each record must contain a single value, meaning it must describe a single attribute and cannot contain a compound data type that holds more than one attribute.

There can be no repeating fields. A record cannot contain any repeating data, such as multiple fields in the record that contain the same type of attribute. (This would be a one-to-many relationship and should be represented in two tables.)

Each record must be unique; there can be no duplicate records in the table. Creating a primary key for the table (such as a Social Security number for people, a part number for products, and so on) usually ensures the uniqueness of records in a table. The primary key cannot contain a NULL in any records.

Sometimes it's necessary to create a composite key, which is made up of two or more fields in the record.

A composite key is a key that consists of two or more fields in a database table.

For example, you might have a table that records the dates that products were shipped. You might specify the part number field and the ship date field as the primary key. The fields are separate fields, but the combination of the two fields constitutes a composite key.

In a nutshell, 1NF requires that your tables be simple two-dimensional tables with no repeating fields and with the fields containing no compound data types.

The Second Normal Form

The second normal form (2NF) requires that all the fields in the database contain data that depends on the entire primary key. If a table uses a single field as its key and is in 1NF, it is automatically in 2NF.

If you were to apply 2NF to a table with a composite key of the part number field and the ship date field, you couldn't have any fields in the table that apply only to the ship date field or only to the part number field. For instance, in this table you wouldn't want a field for the total number of all products shipped that day, such as the Quantity field shown in Figure 7.1. In this example, three products were actually shipped on 11/16/98 (one each of the three 8-tracks).

Figure 7.1 : A table showing the quantity of all products shipped daily.

The Quantity field contains the total of all products shipped that day. Supposedly, you could select any record that has a ship date you are looking for and use the Quantity field from that record to discover the total number of products shipped that day. However, the Quantity field violates the 2NF. The Quantity field applies only to the ship date field, not to the part number field. This results in duplicate data (multiple records with 3 for the quantity for 11/16/98).

All the non-key fields in the record must apply to the unique combination of ship date and part number. Therefore, you could have a field in the table that contains the total number of each particular product that was shipped that day. This field would depend on both the part number and the ship date, as the Quantity field now does in Figure 7.2.

Figure 7.2 : A table showing the quantity of each product shipped daily.

The Quantity field shown in Figure 7.2 applies to both the part number field and the ship date field. The part number field and the ship date field make up the entire key, so the table is in 2NF. One way to discover the total number of products shipped on a given day is to use an aggregate function to add up the quantity field of all the records for that date. You will learn more about aggregate functions in the next few days.

The Third Normal Form

The third normal form (3NF) requires that there be no transitive dependencies, in which one field depends on another field, which in turn depends on another field. When a table violates 3NF, lack of records in one table can result in loss of information.

For example, look at the following table with the PartNo as the key:

PartNoDescriptionArtist Gender
0000018-TrackTapeHendrix Male

The Gender field depends on a field (Artist) rather than the PartNo key.

Technically, if you have the zip code, city, and state in an address record, it probably isn't in 3NF, because an argument can be made that the city and state depend on the zip code.

Lack of normalization does not necessarily ruin the design, however. The higher levels of normalization (4NF, 5NF, and DKNF) prevent any loss of information. As you progress to higher levels of normalization, you end up creating more and more specialized tables. However, conforming to the higher levels of normalization can have a negative effect on performance because of the increasing number of tables and the complexity of the SQL joins you have to write.

You should design your tables to conform to the highest normal form as is practical. Violating the normal forms should be the exception rather than the rule in your database designs. The optimum database design is often slightly denormalized (but only slightly).

SQL Data Definition Language

Open your ADOMFC1 project, select the Data View, and open a Query window on the Customers table. You will see that the Customers table does conform to R2D2 #1; there is one customer per record. Does it conform to the normal forms? Well, almost. There are two address fields. Strictly speaking, this is a violation of 1NF because the records have repeating fields. However, you could make the case that the two address fields are not repeating data but are two distinct elements that make up a street address. With this possible exception, the Customers table conforms to the 1NF, 2NF, and 3NF.

Open the Products table. You will see that it conforms to the R2D2s and the 3NF as well.

Open the Orders table. You will see that it does not conform to R2D2 #1.

Figure 7.3 : The Orders table.

This table is supposed to store orders, but as you can see in Figure 7.3, there are multiple records in this table for single orders IRL. Notice that there are two records for order number 2 and three records for order number 4.

The primary key for the Orders table is the RecordNumber field. This field has no real relevance to an order; a record number is not an attribute of an order. This Orders table does not conform to the normal forms.

To bring the database into conformance with the relational model, you need to create an Orders table that contains orders as single records.

You need to identify the single attributes of an order. These single attributes would include an order number, an order date, the payment method, and the customer number for the customer who placed the order. You might want to store the shipping address as well. You might be able to obtain the shipping address from the customer's address field(s) in the Customers table. However, the address to which each order was shipped is actually an attribute of the order and could be different than the customer's address.

Multiple products can be purchased in a single order. That means the products for the orders need to be moved to a separate table.

The new table that contains products for orders could be called the ProductsPurchased table. Each product purchased would have a single record in this table. The attributes of each product purchased would be the product number, the order number under which this product was purchased, the price that the product sold for, the quantity of the product purchased, and the shipping charge for that product.

You could make a case for not including the price in the ProductsPurchased table. You might be able to obtain the price by using the product number and looking up the price in the Products table. However, the price in the ProductsPurchased table could change. This would cause the price the product sold for on past orders to be lost. Therefore, it's best to treat the price that the product sold for as an attribute of the products purchased and make it a field in the ProductsPurchased table.

The question of which table should contain the shipping charge field depends on how the company assesses shipping charges. If a shipping charge is dependent on each product shipped, the shipping charge should be a field in the ProductsPurchased table. If the shipping charge is a flat fee for each order, it should be a field in the Orders table. In the sample application, the shipping charge is assessed for each product purchased. Therefore it is a field in the ProductsPurchased table.

One place the shipping charge should not be stored is in your application source code. You might assume that the shipping charge is a fee that's always added to each product. You might hard-code the shipping charge into your application source code and not store it in the database. That would be a bad idea because the shipping charge might change. The logic to add the shipping charges to the price of the order should be written into your application source code. However, the amount of the shipping charge should be stored in the database because it could change over time.

TIP
Keep the business formulas separate from the business variables. In your application source code, place the formulas you use to make calculations. Place the variables for those formulas in your database.

Perhaps a more precise name for the ProductsPurchased table would be the OrderLineItems table because each product purchased IRL doesn't necessarily have its own single record in this table (as specified in R2D2 #1). If a customer purchased three of a particular item in one order, only one record would be in the table, and that record would contain a quantity of three. However, I prefer to call the table ProductsPurchased because that name denotes that you can use it to obtain information on what products were purchased and when.

Click the SQL button with the Orders table open. Change the SQL statement so that it looks like the code in Listing 7.1.


Listing 7.1.  The CREATE TABLE Statement for the ProductsPurchased Table

 1:  CREATE TABLE ProductsPurchased(OrderNumber INTEGER,
 2:  PartNumber varchar(10), Price CURRENCY,
 3:  Quantity INTEGER, ShippingAndHandling CURRENCY)

Data Definition Language (DDL) consists of those SQL statements that create or alter the structure of a database. This structure consists of database tables, indexes, constraints, and so on.

The structure of the entire database is called the schema of the database.

The DDL code in Listing 7.1 is a statement that will create a table called ProductsPurchased. Notice that the fields are listed in the CREATE TABLE statement, followed by their data type. Microsoft Access supports the CURRENCY data type. Other databases might not have this data type but will have other type(s) that can store decimal numbers such as monetary values. Consult your database documentation for information on the specific data types that it supports.

Run this statement against your database by clicking the Run (!) button. Click the minus sign by Tables in the Data View to contract the list of tables. Click the plus sign to expand the list of tables, and you will see your new table in the Data View.

Now you need to move the data from the Orders table into your new ProductsPurchased table. Close the Orders table and open it again so that you get a SELECT statement for the Orders table. Modify the SELECT statement so that it looks like Listing 7.2.


Listing 7.2.  The INSERT INTO Statement for the ProductsPurchased Table

 1:  INSERT INTO productspurchased
 2:      (ordernumber, partnumber, price, quantity,
 3:      shippingandhandling)
 4:  SELECT ordernumber, partnumber, price, 1,
 5:      shippingandhandling
 6:  FROM Orders

Execute the statement in Listing 7.2. It should insert the six records from the Orders table into the ProductsPurchased table. Open the ProductsPurchased table to make sure.

Now you need to normalize the Orders table. SQL makes it easy to add new columns to a table. However, most relational databases do not enable you to delete columns. The surest course is to create a new table. You could call the new table NewOrders. Enter the DDL code shown in Listing 7.3 into a Query window and execute it to build the NewOrders table.


Listing 7.3.  The CREATE TABLE Statement for the NewOrders Table

 1:  CREATE TABLE NewOrders(OrderNumber INTEGER,
 2:  OrderDate DATETIME, CustomerNumber INTEGER,
 3:  PaymentMethod VARCHAR(50))

Your next task is to move the appropriate records from the Orders table into the NewOrders table. You can do this by using an INSERT INTO, and SELECT, statement, as shown in Listing 7.4.


Listing 7.4.  The INSERT INTO Statement for the NewOrders Table

 1: INSERT INTO NewOrders
 2:      (ordernumber, orderdate, customernumber,
 3:      paymentmethod)
 4:  SELECT DISTINCT 
 5:      ordernumber, orderdate, customernumber,
 6:      paymentmethod
 7:  FROM orders

Line 4 of Listing 7.4 uses the DISTINCT modifier with the SELECT statement. This causes only unique records to be returned by the SELECT statement. In other words, no duplicate records will be returned by the SELECT statement. Without the DISTINCT modifier, all the records in the Orders table (six records) would be inserted into the NewOrders table. Using the DISTINCT modifier, only three records are inserted, which is the actual number of orders IRL. Figure 7.4 shows the records inserted into the NewOrders table.

Figure 7.4 : The NewOrders table.

You can see in Figure 7.4 that there were actually three orders IRL: order numbers 1, 2, and 4. The NewOrders table conforms to R2D2 #1 and to the 3NF.

Now that you have split the data from the Orders table into two normalized tables, you can get rid of the Orders table. This is done with the following DROP TABLE statement:

DROP TABLE Orders

Open a Query window and execute this statement to delete the Orders table. In the Data View, contract and expand the list of tables or right-click the data source and select the Refresh menu to see that the Orders table is now just a memory.

Using Constraints and Indexes in a Relational Database

Relational databases have built-in mechanisms to ensure the integrity of the data in the database. One of these mechanisms is called a constraint.

Constraints are rules for valid data that the database enforces for you.

You can place different kinds of constraints on the database. For instance, you can place a primary key constraint on a field to enforce the primary key. The constraint makes sure that the data in the primary key field(s) is unique. In other words, it prevents duplicate records in the table by not allowing new records to have the same data in the key field as other records.

You need to specify primary keys in the new tables you added to the database. Add a primary key to the NewOrders table by issuing the SQL statement shown in Listing 7.5.


Listing 7.5.  The Primary Key Constraint for the NewOrders Table

 1:  ALTER TABLE NewOrders
 2:  ADD CONSTRAINT OrderNumberIndex
 3:  PRIMARY KEY (ordernumber)

Line 1 in Listing 7.5 uses the ALTER TABLE statement and specifies the NewOrders table. Line 2 tells the database to add a constraint called OrderNumberIndex. Line 3 specifies this is a primary key constraint on the OrderNumber field. This makes the OrderNumber field the primary key in the NewOrders table. The constraint will enforce the uniqueness of the OrderNumber field.

Add a primary key to the ProductsPurchased table by issuing the SQL statement shown in Listing 7.6.


Listing 7.6.  The Primary Key Constraint for the ProductsPurchased Table

 1:  ALTER TABLE ProductsPurchased
 2:  ADD CONSTRAINT ProductsPurchasedIndex
 3:  PRIMARY KEY (ordernumber, partnumber)

The code in Listing 7.6 adds a primary key constraint to the ProductsPurchased table. The primary key is a composite key consisting of the OrderNumber field and the PartNumber field. The constraint will prevent duplicate OrderNumber/PartNumber combinations among the records in the table.

Relational databases use indexes to optimize the performance of data access operations. If you merely create tables and do not use indexes, the database will be forced to perform table scans. The database will start at the beginning of the table and sequentially look at every record until it finds the record(s) it needs. If, on the other hand, you create indexes for your tables, the database can look up the value it is searching for in the index and move directly to the appropriate record(s).

The primary key is indexed. When you specify a primary key on a table, the database creates an index for the table using the primary key.

If you will frequently use other fields in queries, such as in the WHERE clause or the ORDER BY clause of SELECT statements, you will probably want to create indexes for those fields as well. You can create as many indexes as you need for each table (within practical limits). The following is the syntax for creating an index:

CREATE INDEX myIndex ON myTable (myField)

You should use indexes only where they are needed. They will reduce insert, update, and delete performance because every time you change an indexed field in a record, the database has to update the index as well.

Tools and Techniques for Managing Relationships in a Relational Database

You learned earlier today that you should carefully identify the one-to-one, one-to-many, and many-to-many relationships in your database designs (see R2D2 #3).

To model a one-to-one relationship in your database, use primary keys and foreign keys as you learned in Day 2, "Tools for Database Development in Visual C++ Developer Studio," and Day 3, "Retrieving Data Through Structured Query Language (SQL)." For every instance of the primary key in one table, you will have no more than one instance of the foreign key in the foreign table.

To model a one-to-many relationship, use primary keys and foreign keys as you learned in Day 2 and Day 3. For every instance of the primary key in one table, you can have any number of instances of the foreign key in the foreign table.

Modeling many-to-many relationships requires that you create a third table. The two tables that you want to relate will contain their primary keys (as you would expect). The third table, called the link table, will contain the foreign keys from both primary tables. This is best understood through an example.

You will recall that the design of the original Orders table contained the product number and the customer number as foreign keys (see Figure 7.5).

The original Orders table was a link table that facilitated a many-to-many relationship between customers and products. You could perform a join between these three tables and find out which customers bought which products. This is an excellent example of a many-to-many relationship because a single customer could buy many products and many customers could buy a single product.

Figure 7.5 is a simple entity relationship diagram (ER diagram) that shows the relationship between these three tables. The Customers table and the Products table contain the primary keys. The Orders table contains the foreign keys, so it is the link table.

You can see that lines run between the primary and foreign key fields. A 1 is next to the primary keys and an infinity sign next to the foreign key fields. This is due to the one-to-many relationship between the primary keys in the Customers and Products tables and the foreign keys in the link (Orders) table. When the one-to-many relationships are combined in the link table, it produces a many-to-many relationship between the Customers and Products tables.

Figure 7.5 : Many-to-many relationships.

Using Constraints to Enforce Relationships

You can place constraints on the database that enforce the relationships. These constraints prevent a user from deleting a record whose primary key constitutes a foreign key in another table.

Referential integrity constraints are constraints that ensure that the data in one table is consistent with data in other tables in the database.

A referential integrity constraint will prevent you from deleting a product from the Products table that is listed in the ProductsPurchased table. You will recall that you encountered a constraint like this in Day 6, "Harnessing the Power of Relational Database Servers," when you tried to delete all the 8-track products from the Products table. This is because that delete operation would have left orphaned records in the Orders table.

To create a referential integrity constraint, you can use the ALTER TABLE statement with the ADD CONSTRAINT clause, as shown in Listing 7.7.


Listing 7.7.  The Foreign Key Constraint Between the ProductsPurchased and Products Tables

 1:  ALTER TABLE ProductsPurchased
 2:  ADD CONSTRAINT fk_partnumber
 3:  FOREIGN KEY (PartNumber)
 4:  REFERENCES Products (PartNumber)

Listing 7.7 creates a constraint to enforce the referential integrity between the ProductsPurchased table and the PartNumber table. Line 2 in Listing 7.7 names the constraint (in case you want to drop it later). Line 3 specifies that the PartNumber field in the ProductsPurchased table is a foreign key. Line 4 tells the database where the primary key is that matches this foreign key.

Listing 7.8 creates a constraint to enforce the referential integrity between the ProductsPurchased table and the NewOrders table.


Listing 7.8.  The Foreign Key Constraint Between the ProductsPurchased and NewOrders Tables

 1:  ALTER TABLE ProductsPurchased
 2:  ADD CONSTRAINT fk_ordernumber
 3:  FOREIGN KEY (OrderNumber)
 4:  REFERENCES NewOrders (OrderNumber)

Listing 7.9 creates a constraint to enforce the referential integrity between the NewOrders table and the Customers table.


Listing 7.9.  The Foreign Key Constraint Between the NewOrders and Customers Tables

 1:  ALTER TABLE NewOrders
 2:  ADD CONSTRAINT fk_custnumber
 3:  FOREIGN KEY (CustomerNumber)
 4:  REFERENCES Customers (CustNumber)

Open a Query window in Visual Studio and issue these SQL statements to add the constraints. With these constraints in place, users of the database will not be able to make modifications to the data that would cause the data in one table to be out of sync with the data in the other tables.

Summary

Designing your database to conform to the relational model is important and can be difficult. The process of designing a relational database is made easier by using the intuition-based R2D2s and the science-based normal forms. The process of normalizing your database typically involves separating tables in your database into more specialized tables.

Use SQL Data Definition Language (DDL) to build the schema of your database, which includes tables, indexes, and constraints. Indexes enable better query performance. Constraints help ensure the integrity of the data inside the database.

Q&A

Q
If I know my database stores data in 2KB pages, wouldn't it make sense to structure my database tables so that each record is 2KB, also?
A
That approach would certainly optimize your database for speed. However, it would be impossible to make such a database conform to the relational model. Your database might be marginally faster, but it would be incompatible with all the other database software and data access tools in the Universe. Your database would be a closed, proprietary system with no value outside your application. This ultimately would lessen the value of your application.
Q
Is there a typical level of conformance to the relational model?
A
No. However, if your database conforms to the 3NF, you can be well assured of its usability and its compatibility with relational data access tools.
Q
Do all relational database systems support the same DDL statements?
A
Support for DDL statements varies among relational database vendors. Check your database software documentation for specifics.
Q
Is it necessary to add constraints to my database?
A
Primary key constraints are necessary for a relational database to function reliably. Other constraints might not be necessary but are a great help to you in maintaining your database's value and usefulness. Rather than look for ways to avoid constraints, you should look for places to use constraints wherever possible. They will protect the integrity and validity of the information stored in your database.

Workshop

The Workshop quiz questions test your understanding of today's material. (The answers appear in Appendix F, "Answers.") The exercises encourage you to apply the information you learned today to real-life situations.

Quiz

  1. What is the highest normal form in the relational database model?
  2. What are entity relationships?
  3. How can you guarantee that a table conforms to the second normal form?
  4. What is the proper term for the structure (the tables, indexes, constraints, and so on) of a relational database?
  5. What does a referential integrity constraint do?

Exercises

  1. Write a SELECT statement that shows all the products purchased on each order. Hint: The SELECT statement should perform a join between the NewOrders, ProductsPurchased, and Products tables.
  2. Write a SELECT statement showing the products purchased by each customer.

© Copyright, Sams Publishing. All rights reserved.