Database Design Gone Better

Database Design Gone Better

Database Design Fundamentals

In the previous article, we created a spreadsheet acting as a database. This poorly designed database invited insert, update and delete anomalies. We also looked at the problems that null values can cause. In this article, we will attempt to remedy some of the issues we have with this poor excuse of a database.

The official name for fixing the design is called Database Normalization. I find most of the explanations of normalization confusing and sometimes conflicting, so I will just talk you through the process without getting technical at all.

We will start with the following version of the database:

Row #NameDepartmentJob Title
1John SmithITProgrammer
2Peter KingInformation TechnologyNetwork Administrator
3Jane PetersITDBA
4Sue JenkinsI.T.Database Administrator
5J. SmithITProgrammer
6Peters, JaneInfoTechD.B.A.
7Jenkins, SITDatabse Administrator
8Bob HansenIT

First Iteration - Cleanup

During this iteration, we removed the duplicates, Jane Peters & Sue Jenkins, leaving us with the following version of the database:

NameDepartmentJob Title
John SmithITProgrammer
Peter KingInformation TechnologyNetwork Administrator
Jane PetersITDBA
Sue JenkinsI.T.Database Administrator
J. SmithITProgrammer
Bob HansenIT

Second Iteration - Identification

In the previous article, we determined that John Smith and J. Smith were, in fact, two different people. So we need a way to distinguish between the two, especially if J. Smith also turns out to be a John. The way we differentiate between rows is to ensure that each row in our database is unique. We accomplish this by giving each row a unique key. We call this a Primary Key.

So, by adding a primary key to each row, we end up with the following version of our database:

IDNameDepartmentJob Title
SMI001John SmithITProgrammer
KIN001Peter KingInformation TechnologyNetwork Administrator
PET001Jane PetersITDBA
JEN001Sue JenkinsI.T.Database Administrator
SMI002J. SmithITProgrammer
HAN003Bob HansenIT

Naming Conventions

If you look at the ID column carefully, you will see some naming convention happening here. We use the first three characters of the employee's last name and append 001 to the first and 002 to the second employee with the same three last name characters. For example, my last name is Slabbert, and my family and I will be SLA001, SLA002, SLA003 & SLA004. And James Sladowski will be SLA005, and so on.

So is this naming convention a good one, and does it serve any purpose other than uniquely identifying a row in our database? Sometimes they serve a secondary goal. For example, during my days as a pharmacist, we used a dispensing application that used naming conventions for identifying items. For example, paracetamol came in tablets, capsules, and liquids and was identified with PAR100, PAR200, and PAR300, respectively.

However, these naming conventions have several shortcomings. What about a drug called paraldehyde? Paraldehyde is administered via intramuscular injection. Will it be identified as PAR400 in the dispensing application? What if there is a paracetamol injection, which by the way, there isn't? We cannot allow two items to have the same primary key.

The first lesson to learn when assigning IDs to rows is to be careful to give it any secondary purpose. The rule of thumb is that an ID means nothing more than uniquely defining a row. That way, there is no confusion.

Replacing our current naming convention of IDs with simple numbers solves that problem and renders the following version of the database:

IDNameDepartmentJob Title
1John SmithITProgrammer
2Peter KingInformation TechnologyNetwork Administrator
3Jane PetersITDBA
4Sue JenkinsI.T.Database Administrator
5J. SmithITProgrammer
6Bob HansenIT

Third Iteration - Atomicity

The rule is a simple one. Each column value must be atomic. Thus, each column contains a single value, not a set of values. Have a look at the Name column:

IDName
1John Smith
2Peter King
3Jane Peters
4Sue Jenkins
5J. Smith
6Bob Hansen

Here we have three values: Last Name, First Name, and Initials. Thus our Name column is not Atomic. Let’s fix it:

IDLast NameFirst NameInitialsDepartmentJob Title
1SmithJohnJITProgrammer
2KingPeterPInformation TechnologyNetwork Administrator
3PetersJaneJSITDBA
4JenkinsSueSI.T.Database Administrator
5SmithJudyJITProgrammer
6HansenBobBJIT

Immediately, we have a much better database design that dictates how the employee's name gets captured. Replacing a single Name column with three columns allows for much flexibility. For example, we can sort the data by Last Name:

IDLast NameFirst NameInitialsDepartmentJob Title
6HansenBobBJIT
3PetersJaneJSITDBA
1SmithJohnJITProgrammer
5SmithJudyJITProgrammer
2KingPeterPInformation TechnologyNetwork Administrator
4JenkinsSueSI.T.Database Administrator

We can sort the data by First Name:

IDLast NameFirst NameInitialsDepartmentJob Title
6HansenBobBJIT
3PetersJaneJSITDBA
1SmithJohnJITProgrammer
5SmithJudyJITProgrammer
2KingPeterPInformation TechnologyNetwork Administrator
4JenkinsSueSI.T.Database Administrator

We can filter the data by the Last Name Smith:

IDLast NameFirst NameInitialsDepartmentJob Title
1SmithJohnJITProgrammer
5SmithJudyJITProgrammer

We can do automated emails, for example, that address John Smith with “Dear John” and computerized pay slips that address the same person as “J Smith.” I think you get the picture.

Fourth Iteration - Divide and Conquer

Now we need to address the Departments and Job Titles. We do not want all this duplication of data, so we create separate tables for these two columns, each with their unique IDs.

So we have a Departments table:

IDName
1Finance
2HR
3Sales
4IT

And we have a Job Titles table:

IDName
1Programmer
2Network Administrator
3Database Administrator
4Representative

Again, as you can see, the ID columns use numeric values. Sometimes, databases use things such as postal codes or zip codes and social security numbers for ID numbers. But, not all countries use postal codes in the same way. If I am not mistaken social security numbers are specific to Americans. Also, what if a country decided to change the postal code system because they ran out of numbers? Yip, that happens. The world ran out of IPv4 addresses hence the new IPv6 system.

By the way, I do not say that all Primary Keys should be numerical values. I say that the value should not have any business meaning besides indicating a unique row. For example, you can use an alpha-numeric system such as ABC123, ABC124, ..., ABC999, ABD000. Just do not give it a business meaning as well.

An insurance system I built in the '90s assigned policy numbers that consisted of 4 parts, like so: 123/345/12345/0694 which meant:

  • 123 is the product

  • 345 is the broker code

  • 12345 is the policy number

  • 0694 is the inception month and year (June 1994)

Guess that happened when a new broker with code 1000 joined. Guess what happened when policy no 100000 needed to be loaded. And guess who had to go and fix the mess after they had been warned for years that it is a bad idea to give primary keys business meaning?

So, finally, we replace the Department column with DepartmentID and the JobTitle column with JobTitleID:

IDLastNameFirstNameInitialsDepartmentIDJobTitleID
1SmithJohnJ41
2KingPeterP42
3PetersJaneJS43
4JenkinsSueS43
5SmithJudyJ41
6HansenBobBJ34

Now we learn that there are three other departments apart from IT. We also understand that Bob is a Sales Representative, not IT, as the database initially indicated.

We call these new ID columns Foreign Keys. A foreign key in one table refers to the primary key of another table. Remember this! For example, the DepartmentID column is a foreign key in the Employee table and refers to an entry that matches the primary key in the Department table.

Last Integration - Make it Pretty

Lastly, we want to make things easy for the users of this database, and after a little effort we came up with the following version of our Employee database:

We now have a nice-looking front with dropdown options for the Department and Job Title columns:

Also, say for example, that we decided to change IT to InfoTech (which is a terrible idea by the way):

IDName
1Finance
2HR
3Sales
4InfoTech

This single change will result in our whole database to be updated immediately and automatically:

IDLastNameFirstNameInitialsDepartmentJobTitle
1SmithJohnJInfoTechProgrammer
2KingPeterPInfoTechNetwork Administrator
3PetersJaneJSInfoTechDatabase Administrator
4JenkinsSueSInfoTechDatabase Administrator
5SmithJudyJInfoTechProgrammer
6HansenBobBJSalesRepresentative

Conclusion

Not too bad for a day’s work, is it? Although still a spreadsheet, we have a much better organized and unambiguous database. All rows are unique, all columns are atomic, and lookup tables are used where possible.

Now it is time to grow up, move away from the mighty spreadsheet, and play with the adults. So, see you next time when we start learning about real databases!