It is important that you understand the contents of this article. Together with the next article, it forms the knowledge you need to do good database designs. So, please make sure you understand the things discussed here before you proceed.

Data Redundancy

Database design aims to minimize redundancy without losing data. Redundancy is where the same piece of data exists in more than one place in the database. This cause all sort of data quality issues.

Data redundancy is caused by anomalies. An anomaly is something that deviates from an expected standard. I know it does not say much, but by the end of the article, you will understand what an anomaly is.

We can drastically improve the quality of our data by preventing these anomalies. In this article we are going to look at three types of anomalies:

insert anomalies,
delete anomalies, and
update anomalies.

We will also look at Null values.

The Mighty Spreadsheet

The spreadsheet is still the number one option when developing simple but powerful databases. We will follow suit and look at an Employee database built with Google Sheets or Microsoft Excel. After six months of hard work, HR (human resources) ended up with the example below:

Row #	Name	Department	Job Title
1	John Smith	IT	Programmer
2	Peter King	Information Technology	Network Administrator
3	Jane Peters	IT	DBA
4	Sue Jenkins	I.T.	Database Administrator
5	J. Smith	IT	Programmer
6	Peters, Jane	InfoTech	D.B.A.
7	Jenkins, S	IT	Databse Administrator

We have been called to assist in making this database better (whatever better means at this point). So let us work through this database, row by row.

Insert Anomalies

After a quick look, we realized that the database was a total mess! Let's see why.

Row 1

Row #	Name	Department	Job Title
1	John Smith	IT	Programmer

This is the first entry, so everything seems fine for now.

Row 2

Row #	Name	Department	Job Title
2	Peter King	Information Technology	Network Administrator

Why does the department say Information Technology and not IT like in row 1? And if Information Technology is the preferred term, why does Row 1 still say IT? This is an example of an Insert Anomaly.

Row 3

Row #	Name	Department	Job Title
3	Jane Peters	IT	DBA

Everything seems fine on the condition that IT and not Information Technology, as in row 2, is the standard.

Row 4

Row #	Name	Department	Job Title
4	Sue Jenkins	I.T.	Database Administrator

In row 4, we have I.T. instead of IT or Information Technology. This is again an example of an Insert Anomaly. But, hang on, DBA in row 3 is short for Database Administrator in row 4. Another Insert Anomaly.

Row 5

Row #	Name	Department	Job Title
5	J. Smith	IT	Programmer

Is J. Smith in row 5 and John Smith in row 1 by any chance the same person entered twice? I mean, both work in IT and both are programmers. Another Insert Anomaly maybe. However, never assume it is a duplicate entry. Confirm with HR!

Row 6

Row #	Name	Department	Job Title
6	Peters, Jane	InfoTech	D.B.A.

By now, everything seems to fall apart. Row 6 and row 3 probably refer to the same person. Also, now IT is called InfoTech, and DBA is D.B.A.

Row 7

Row #	Name	Department	Job Title
7	Jenkins, S	IT	Databse Administrator

Rows 4 and 7 refer to the same person. Also, the Job Title has a spelling error.

Update Anomalies

Data change over time. For example, after our initial assessment, HR made some corrections. Also, people move to other departments and the database needs to be updated accordingly. We have once again been called in to assess the database. This is what we saw:

Row #	Name	Department	Job Title
1	John Smith	IT	Programmer
2	Peter King	IT	Network Administrator
3	Jane Peters	IT	Database Administrator
4	Sue Jenkins	I.T.	Database Administrator
5	J. Smith	IT	Programmer
6	Peters, Jane	IT	Database Administrator
7	Jenkins, S	IT	Database Administrator

The company decided to standardize the term IT, which by the way is an important improvement. Always standardize terms! HR also updated rows 2 and 6 used to read Information Technology and InfoTech respectively. However, they missed row 4, which still reads I.T. This is an example of an update anomaly. Our brain knows that IT and I.T. is most probably the same thing. The computer does not know that though. So, this is going to become a problem at some point in the future and needs fixing.

Jane Peters and Sue Jenkins were entered twice, so HR deleted rows 3 and 7, leaving us with Sue Jenkins (row 3) and Peters, Jane (row 5).

Row #	Name	Department	Job Title
1	John Smith	IT	Programmer
2	Peter King	IT	Network Administrator
3	Sue Jenkins	IT	Database Administrator
4	J. Smith	IT	Programmer
5	Peters, Jane	IT	Database Administrator

We now have one record for each employee, the Department is standardized on IT, and there are no typos in the Job Title anymore. Things have certainly improved.

However, there is inconsistency with the naming conventions of the Name column. For example, in rows 1, 2 and 3, the order is First Name, Last Name. In row 5, the order is Last Name, First Name. In row 4, the order is Initials and Last Name.

Also, after confirming with HR, it turned out that John Smith in row 1 and J. Smith in row 4 refer to different people. As they both work in IT and are both Programmers, we need to be able to distinguish between them. We will fix this problem in the next article.

Delete Anomalies

One day Peter King decided he had had it and resigned. HR removed him from the database and ended up with the example below:

Row #	Name	Department	Job Title
1	John Smith	IT	Programmer
2	Sue Jenkins	IT	Database Administrator
3	J. Smith	IT	Programmer
4	Peters, Jane	IT	Database Administrator

This leaves the company with a potential problem though. According to the database, Peter King never worked for the company. Also, looking at the database, there is no indication anymore that the company had a Network Administrator. Again, we will fix this problem in a future article.

I will give you a hint for now. Never delete data!

Null Values

Mystery Employee

Bob Hansen joined our company, and after HR updated the database, we ended up with the example below:

Row #	Name	Department	Job Title
1	John Smith	IT	Programmer
2	Sue Jenkins	IT	Database Administrator
3	J. Smith	IT	Programmer
4	Peters, Jane	IT	Database Administrator
5	Bob Hansen	IT

So what exactly does Bob do in IT? We don’t know because HR left the Job Title blank. We call this absence of a value a Null Value.

The word null has a particular meaning in programming and databases. It means there is nothing, absolutely nothing! A good example is when you have an active short-term insurance policy. There are two important dates, the Inception Date and the Cancellation Date. Because the policy is active, the Inception Date will have a value, and, because the policy is still active, the Cancellation Date will not have a value. Thus the Cancelation Date will be null. Once the policy is canceled, the Cancellation Date will be updated and will not be null anymore.

Null is Not Zero!

Let's look at another thing that sometimes confuses newcomers to null. Here we have a little database used by a local grocer to sell fruits:

Fruit	Price
Apples	$2.50
Pears	$0.00
Bananas

Apples cost $2.50. Pears are free. However, we have not determined what bananas cost, as the price is not zero. It is null. There is a big difference between zero and null in the programming world. As we progress, the concept of null will become clear.

Conclusion

It is clear from this short discussion that data can become close to useless if you do not follow a proper database design philosophy. We looked at some things that can (and will) go wrong. In the next post, we will improve the design of this database. See you there!

Database Design Gone Bad

Database Design Fundametals