Database structuring: the normalization
In the previous article we treated the important concept of the primary key. Today we are going to see in theory what are the main measures (which are called normal forms) to be implemented in the database to make them more efficient and maintainable and to ensure data consistency. Instead in the last article of this series we’ll see how to apply these rules.
Before we begin let’s see a definition of normalization:
Normalization is a process of elimination of redundancy and the risk of inconsistency from the database. There are several levels of normalization ( normal forms) that certify the quality of the database schema. This process is based on a simple criterion: if a report has more than one independent concepts, the relationship breaks down in smaller ones, one for each concept. This type of process is not always applicable to all the tables, since in some cases this can cause a loss of information. (Source: Wikipedia)
What do you mean for redundancy?
Redundancy is a sworn enemy of the databases and we must avoid it like the plague. Imagine if the table that contains the articles of this blog was like this:
Now, when Maurizio has written 1000 articles (he’s nearly there), there will be 2000 useless data in this database.
It’s clear that the data of the author (name, e-mail address, etc…) will feature in the table authors. To put in relationship an article to its author, all one needs to do is post in the table articles the primary key that leads to the author.
The first normal form (1NF)
A database is defined in first normal form when:
- Each column refers to atomic data
This means that each column must contain non dividable values. Therefore, a name_and _surname column would not permit the table to be in first normal form. The consequences of this error are clear. I couldn’t, for example, do a specific research by surname or by name. Or I couldn’t put the data in order by name or surname.
The non atomicity of data can have more complicated forms, imagine for example a database containing the list of students and the courses they enrolled to.
I only used the student’s name, but consider that the table would contain all the personal data. Now, this table is not in first normal form. The courses the student has enrolled to aren’t shown in an atomic form. To avoid this problem, anyone would be attempted to do like shown in the image below:
But this wouldn’t be a good solution. Every new course introduced oblige us to add a new column to the table with disastrous effects on the maintainability. Moreover this is an example of fat table, this term describes the tendency (wrong) to want to create one single table that contains data that don’t have a logical connection.
Instead logic tells us that personal data go in one table, while the list of the courses goes in another one. To describe which courses a student enrolled to, we’re going to create another table that will put in relationship students with courses. We’ll see how in the next article.
The second normal form (2NF)
A database is defined in second normal form when:
- it is in first normal form;
- the non key fields depend on the whole primary key and not on part of it.
On a part of it? Yes, the second normal form is applied to tables with compound keys, so we can say that non key field cannot depend on a part of the fields that form the primary key (we can implicitly consider a table with a single primary key as in second normal form because it is impossible that it violates the principles).
Let’s do an example. We take in consideration a sport event where the athletes do three competitions. We can give the following description in our database:
In this table we can use the primary key formed of the code_athlete/number_competition.
But it’s not in second normal form.
In fact the non key field name_ athlete depends on a part of the primary key code_athlete and not on the whole key.
To put this table in second normal form we need to break it in two tables.
The third normal form (3NF)
A database is defined in third normal form when:
- it’s in second normal form (and therefore implicitly even in first);
- the non key fields do not depend on other non key fields.
It’s much more simpler than it seems. Practically it means that if a data can be calculated/ deduced/rebuilt from another data, you don’t need to put it in the table. Let’s do an example of a database of product orders:
The field price depends on the fields unit_price and quantity and it’s calculable multiplying these last two fields. Therefore the field price is unnecessary and must be eliminated. Another simple example could be the database of an e-commerce. In our store we want to indicate the prices in euro and in dollars for all our products. Now, it would be wrong a table like this:
The price in dollars depends in fact on the price in euro (or vice versa). Therefore a column gets eliminated. This will make the database more efficient and rational. Otherwise we need to modify periodically all the values to update the change.
We can calculate the price in dollars at the extraction of the data, perhaps using the API of an on-line quotation website, having this way the change always updated.
At this point you have at your disposal theoretic basis with which you can create a database correctly set up. In the next article we’ll see in practice how to establish logic relations between the tables in order to obtain the best efficiency of the database in terms of reliability, maintainability and rationality.
L'immagine principale dell'articolo è stata fornita da @Fotolia