In the previous article, we created a spreadsheet acting as a database. This poorly designed database invited insert, update and delete anomalies. We also looked at the problems that null values can cause. In this article, we will attempt to remedy some of the issues we have with this poor excuse of a database.

The official name for fixing the design is called Database Normalization. I find most of the explanations of normalization confusing and sometimes conflicting, so I will just talk you through the process without getting technical at all.

We will start with the following version of the database:

Row #	Name	Department	Job Title
1	John Smith	IT	Programmer
2	Peter King	Information Technology	Network Administrator
3	Jane Peters	IT	DBA
4	Sue Jenkins	I.T.	Database Administrator
5	J. Smith	IT	Programmer
6	Peters, Jane	InfoTech	D.B.A.
7	Jenkins, S	IT	Databse Administrator
8	Bob Hansen	IT

First Iteration - Cleanup

During this iteration, we removed the duplicates, Jane Peters & Sue Jenkins, leaving us with the following version of the database:

Name	Department	Job Title
John Smith	IT	Programmer
Peter King	Information Technology	Network Administrator
Jane Peters	IT	DBA
Sue Jenkins	I.T.	Database Administrator
J. Smith	IT	Programmer
Bob Hansen	IT

Second Iteration - Identification

In the previous article, we determined that John Smith and J. Smith were, in fact, two different people. So we need a way to distinguish between the two, especially if J. Smith also turns out to be a John. The way we differentiate between rows is to ensure that each row in our database is unique. We accomplish this by giving each row a unique key. We call this a Primary Key.

So, by adding a primary key to each row, we end up with the following version of our database:

ID	Name	Department	Job Title
SMI001	John Smith	IT	Programmer
KIN001	Peter King	Information Technology	Network Administrator
PET001	Jane Peters	IT	DBA
JEN001	Sue Jenkins	I.T.	Database Administrator
SMI002	J. Smith	IT	Programmer
HAN003	Bob Hansen	IT

Naming Conventions

If you look at the ID column carefully, you will see some naming convention happening here. We use the first three characters of the employee's last name and append 001 to the first and 002 to the second employee with the same three last name characters. For example, my last name is Slabbert, and my family and I will be SLA001, SLA002, SLA003 & SLA004. And James Sladowski will be SLA005, and so on.

So is this naming convention a good one, and does it serve any purpose other than uniquely identifying a row in our database? Sometimes they serve a secondary goal. For example, during my days as a pharmacist, we used a dispensing application that used naming conventions for identifying items. For example, paracetamol came in tablets, capsules, and liquids and was identified with PAR100, PAR200, and PAR300, respectively.

However, these naming conventions have several shortcomings. What about a drug called paraldehyde? Paraldehyde is administered via intramuscular injection. Will it be identified as PAR400 in the dispensing application? What if there is a paracetamol injection, which by the way, there isn't? We cannot allow two items to have the same primary key.

The first lesson to learn when assigning IDs to rows is to be careful to give it any secondary purpose. The rule of thumb is that an ID means nothing more than uniquely defining a row. That way, there is no confusion.

Replacing our current naming convention of IDs with simple numbers solves that problem and renders the following version of the database:

ID	Name	Department	Job Title
1	John Smith	IT	Programmer
2	Peter King	Information Technology	Network Administrator
3	Jane Peters	IT	DBA
4	Sue Jenkins	I.T.	Database Administrator
5	J. Smith	IT	Programmer
6	Bob Hansen	IT

Third Iteration - Atomicity

The rule is a simple one. Each column value must be atomic. Thus, each column contains a single value, not a set of values. Have a look at the Name column:

ID	Name
1	John Smith
2	Peter King
3	Jane Peters
4	Sue Jenkins
5	J. Smith
6	Bob Hansen

Here we have three values: Last Name, First Name, and Initials. Thus our Name column is not Atomic. Let’s fix it:

ID	Last Name	First Name	Initials	Department	Job Title
1	Smith	John	J	IT	Programmer
2	King	Peter	P	Information Technology	Network Administrator
3	Peters	Jane	JS	IT	DBA
4	Jenkins	Sue	S	I.T.	Database Administrator
5	Smith	Judy	J	IT	Programmer
6	Hansen	Bob	BJ	IT

Immediately, we have a much better database design that dictates how the employee's name gets captured. Replacing a single Name column with three columns allows for much flexibility. For example, we can sort the data by Last Name:

ID	Last Name	First Name	Initials	Department	Job Title
6	Hansen	Bob	BJ	IT
3	Peters	Jane	JS	IT	DBA
1	Smith	John	J	IT	Programmer
5	Smith	Judy	J	IT	Programmer
2	King	Peter	P	Information Technology	Network Administrator
4	Jenkins	Sue	S	I.T.	Database Administrator

We can sort the data by First Name:

ID	Last Name	First Name	Initials	Department	Job Title
6	Hansen	Bob	BJ	IT
3	Peters	Jane	JS	IT	DBA
1	Smith	John	J	IT	Programmer
5	Smith	Judy	J	IT	Programmer
2	King	Peter	P	Information Technology	Network Administrator
4	Jenkins	Sue	S	I.T.	Database Administrator

We can filter the data by the Last Name Smith:

ID	Last Name	First Name	Initials	Department	Job Title
1	Smith	John	J	IT	Programmer
5	Smith	Judy	J	IT	Programmer

We can do automated emails, for example, that address John Smith with “Dear John” and computerized pay slips that address the same person as “J Smith.” I think you get the picture.

Fourth Iteration - Divide and Conquer

Now we need to address the Departments and Job Titles. We do not want all this duplication of data, so we create separate tables for these two columns, each with their unique IDs.

So we have a Departments table:

ID	Name
1	Finance
2	HR
3	Sales
4	IT

And we have a Job Titles table:

ID	Name
1	Programmer
2	Network Administrator
3	Database Administrator
4	Representative

Again, as you can see, the ID columns use numeric values. Sometimes, databases use things such as postal codes or zip codes and social security numbers for ID numbers. But, not all countries use postal codes in the same way. If I am not mistaken social security numbers are specific to Americans. Also, what if a country decided to change the postal code system because they ran out of numbers? Yip, that happens. The world ran out of IPv4 addresses hence the new IPv6 system.

By the way, I do not say that all Primary Keys should be numerical values. I say that the value should not have any business meaning besides indicating a unique row. For example, you can use an alpha-numeric system such as ABC123, ABC124, ..., ABC999, ABD000. Just do not give it a business meaning as well.

An insurance system I built in the '90s assigned policy numbers that consisted of 4 parts, like so: 123/345/12345/0694 which meant:

123 is the product
345 is the broker code
12345 is the policy number
0694 is the inception month and year (June 1994)

Guess that happened when a new broker with code 1000 joined. Guess what happened when policy no 100000 needed to be loaded. And guess who had to go and fix the mess after they had been warned for years that it is a bad idea to give primary keys business meaning?

So, finally, we replace the Department column with DepartmentID and the JobTitle column with JobTitleID:

ID	LastName	FirstName	Initials	DepartmentID	JobTitleID
1	Smith	John	J	4	1
2	King	Peter	P	4	2
3	Peters	Jane	JS	4	3
4	Jenkins	Sue	S	4	3
5	Smith	Judy	J	4	1
6	Hansen	Bob	BJ	3	4

Now we learn that there are three other departments apart from IT. We also understand that Bob is a Sales Representative, not IT, as the database initially indicated.

We call these new ID columns Foreign Keys. A foreign key in one table refers to the primary key of another table. Remember this! For example, the DepartmentID column is a foreign key in the Employee table and refers to an entry that matches the primary key in the Department table.

Last Integration - Make it Pretty

Lastly, we want to make things easy for the users of this database, and after a little effort we came up with the following version of our Employee database:

We now have a nice-looking front with dropdown options for the Department and Job Title columns:

Also, say for example, that we decided to change IT to InfoTech (which is a terrible idea by the way):

ID	Name
1	Finance
2	HR
3	Sales
4	InfoTech

This single change will result in our whole database to be updated immediately and automatically:

ID	LastName	FirstName	Initials	Department	JobTitle
1	Smith	John	J	InfoTech	Programmer
2	King	Peter	P	InfoTech	Network Administrator
3	Peters	Jane	JS	InfoTech	Database Administrator
4	Jenkins	Sue	S	InfoTech	Database Administrator
5	Smith	Judy	J	InfoTech	Programmer
6	Hansen	Bob	BJ	Sales	Representative

Conclusion

Not too bad for a day’s work, is it? Although still a spreadsheet, we have a much better organized and unambiguous database. All rows are unique, all columns are atomic, and lookup tables are used where possible.

Now it is time to grow up, move away from the mighty spreadsheet, and play with the adults. So, see you next time when we start learning about real databases!

Database Design Gone Better

Database Design Fundamentals