What is database normalization?

I was chatting with some programmer friends when I heard the term "database normalization". Although I had heard the term a few times before, I could not figure out what it was about.

A few questions came up:

  • What does this term mean?

  • Can any type of database (be it SQL or NoSQL) use this kind of normalization?

Another thing: when searching on the internet, I found both the terms "data normalization" and "database normalization". Is there any difference between the two? Is one of them the correct term, or can both be used?

I don't want to feel embarrassed anymore when I hear this term in a meeting and don't understand it :)

Author: Victor Stafusa, 2016-09-06

4 answers

What is normalization?

Database normalization is the process of transforming the structure of a database in order to eliminate redundancies and to eliminate insert, update, and delete anomalies.

By performing the normalization process, the data stored in the database is organized in a better way and, in most cases, also takes up less physical space. However, the normalization process also increases the number of tables, and in many cases it can be a difficult task to accomplish. In addition, databases normalized beyond what is required may suffer from poor performance and/or excessive complexity.

Anomalies

The main purpose of the normalization process is to eliminate insert, update, and delete anomalies. An anomaly occurs when there is no way to record a certain piece of information without some other information also having to be recorded along with it. For example, imagine that you have a table funcionário with the columns codigo, nome and projeto, where projeto is the name of the project to which the employee is allocated. And then you have the following data:

codigo | nome   | projeto
1      | Pedro  | sales
2      | Maria  | sales
3      | Carlos | customer registration

And then a new project arises: issuing invoices. How do you register this new project? The answer is that you cannot, because to do that you would have to have some employee in that project - that is, we have an insertion anomaly.

If, in the previous example, the employee Carlos left the company and we removed him from the table, the information about the customer registration project would be lost. This is an unwanted side effect - the deletion anomaly. If, instead, he were simply moved to the new invoice project, we would also lose the information about the existence of the customer registration project - this is the update anomaly.

The problem that gives rise to these anomalies is the fact that the project information is kept entirely inside the employee table, which is not its place. If we have two related tables (1-to-N) - employees and projects - the anomalies disappear, as the sketch below shows.
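
To make the split concrete, here is a minimal SQL sketch (table names follow the example; column types and the projeto_id foreign key are my own illustrative choices):

```sql
-- Hypothetical DDL for the 1-to-N split: each project is recorded once,
-- and each employee references (at most) one project.
CREATE TABLE projeto (
    codigo  INTEGER      PRIMARY KEY,
    nome    VARCHAR(100) NOT NULL
);

CREATE TABLE funcionario (
    codigo      INTEGER      PRIMARY KEY,
    nome        VARCHAR(100) NOT NULL,
    projeto_id  INTEGER      REFERENCES projeto (codigo)  -- NULL if not yet allocated
);

-- The new "invoice issuing" project can now be inserted on its own,
-- with no employee attached to it yet:
INSERT INTO projeto (codigo, nome) VALUES (3, 'issuing invoices');
```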

Anomalies are also related to the concept of referential integrity.

First normal form (1NF)

The first rule to eliminate anomalies is:

There should be no multi-valued columns.

What is a multi-valued column? It is a column in which you can store more than one value per record. For example, imagine that your customer table has the columns codigo, nome and telefones, filled in as follows:

codigo | nome    | telefones
1      | Paul    | 99567-4289, 3605-4900
2      | Maria   | 97223-0444
3      | Alan    | -
4      | Juliana | 2853-0266, 96610-5480, 2410-9941

Note that the telefones column is multi-valued. Inserting, updating, and deleting phone numbers in this scheme is tricky, and so is searching for a specific phone number. The solution is to split the table in two. Once all multi-valued columns have been eliminated, the database reaches a structural form called the first normal form, or simply 1NF.
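
A minimal SQL sketch of that split, assuming illustrative names and types for the new phone table:

```sql
-- Each customer row holds only single-valued columns...
CREATE TABLE cliente (
    codigo  INTEGER      PRIMARY KEY,
    nome    VARCHAR(100) NOT NULL
);

-- ...and each phone number becomes its own row, linked back to the customer.
CREATE TABLE telefone_cliente (
    cliente_id  INTEGER     NOT NULL REFERENCES cliente (codigo),
    numero      VARCHAR(20) NOT NULL,
    PRIMARY KEY (cliente_id, numero)
);

-- Finding a specific phone number is now a plain equality search:
SELECT c.nome
FROM cliente c
JOIN telefone_cliente t ON t.cliente_id = c.codigo
WHERE t.numero = '3605-4900';
```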

Candidate keys and superkeys

Before proceeding with the other normal forms, it is necessary to introduce the concepts of candidate keys and superkeys.

The primary key (or just PK) is the set of columns that uniquely identifies a tuple (it can be a single column, or two or more). It is important that the database designer knows how to identify which columns are most appropriate to be chosen as part of the primary key. Often a column with a sequential numeric code can serve as the primary key, but there are several situations where this is not the case.

Sometimes there is more than one set of columns that could be the primary key. Each of these sets is called a candidate key. For example, in a table Pessoa with the fields CPF, RG, Estado and Nome, both CPF by itself and RG together with Estado are candidate keys. Thus, you can get to Nome from CPF, but you can also get to Nome from RG and Estado.

Any set of columns that contains a candidate key as a subset is called a superkey.
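
In SQL, a candidate key that is not chosen as the primary key is typically declared as a UNIQUE constraint. A sketch of the Pessoa example (column types are illustrative assumptions):

```sql
CREATE TABLE pessoa (
    cpf     CHAR(11)     PRIMARY KEY,       -- candidate key chosen as the PK
    rg      VARCHAR(15)  NOT NULL,
    estado  CHAR(2)      NOT NULL,
    nome    VARCHAR(100) NOT NULL,
    UNIQUE (rg, estado)                     -- the other candidate key
);

-- Any column set containing {cpf} or {rg, estado} -- e.g. {cpf, nome} --
-- is a superkey: it still identifies the row uniquely.
```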

Second normal form (2NF)

The second normal form is the one that says:

All columns must have a functional dependency on the entirety of each candidate key.

In most cases " each candidate key "means" with the primary key", unless there is more than one candidate key.

Also, for a table to be in the second normal form, it must first be in the first normal form.

A column is functionally dependent on the primary key when, in the application domain, its value is determined by the primary key. A column has no functional dependency on the primary key when it is defined independently of the primary key, or when it is determined by some column other than the primary key.

A functional dependency is said to be on the entirety of the primary key when all the fields of the primary key are needed to establish the dependency. When the primary key is composite, it is possible to have only a partial dependency instead.

To give an example of 2NF, imagine that your company has sales representatives working with customers and you want to record which sales representatives work with which customers. In this table (let's call it representação), the columns nome_cli (customer name) and nome_repr (representative name) form the primary key, and we also have the columns end_repr for the representative's address, end_cli for the customer's address, and valor_contrato for the contract value. That is:

Representation:

nome_cli (PK) | nome_repr (PK) | end_repr      | end_cli       | valor_contrato
Louis         | Cleyton        | Street A, 888 | Street X, 123 | R$ 5.500,00
Paula         | Lara           | Street B, 999 | Street Y, 345 | R$ 9.900,00
Paula         | Cleyton        | Street A, 888 | Street Y, 345 | R$ 650,00

Note that the columns end_repr and end_cli depend on the primary key, but they do not depend on the whole primary key: each depends on only part of it. The solution in this case is to have a table of customers (with the customer's address), a table of sales representatives (with the representative's address), and keep in the representation table only the respective foreign keys together with the contract value. That is:

Customers:

nome_cli (PK) | end_cli
Louis         | Street X, 123
Paula         | Street Y, 345

Representatives:

nome_repr (PK) | end_repr
Cleyton        | Street A, 888
Lara           | Street B, 999

Representation:

nome_cli (PK) | nome_repr (PK) | valor_contrato
Louis         | Cleyton        | R$ 5.500,00
Paula         | Lara           | R$ 9.900,00
Paula         | Cleyton        | R$ 650,00
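
A sketch of that decomposition as SQL DDL, using the column names from the example (the types are illustrative assumptions):

```sql
CREATE TABLE cliente (
    nome_cli  VARCHAR(100) PRIMARY KEY,
    end_cli   VARCHAR(200) NOT NULL
);

CREATE TABLE representante (
    nome_repr  VARCHAR(100) PRIMARY KEY,
    end_repr   VARCHAR(200) NOT NULL
);

-- The relationship table keeps only the composite key and what depends
-- on the *whole* key: the contract value.
CREATE TABLE representacao (
    nome_cli        VARCHAR(100)  NOT NULL REFERENCES cliente (nome_cli),
    nome_repr       VARCHAR(100)  NOT NULL REFERENCES representante (nome_repr),
    valor_contrato  NUMERIC(10,2) NOT NULL,
    PRIMARY KEY (nome_cli, nome_repr)
);
```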

Third normal form (3NF)

The third normal form is the one that says:

All columns must have a functional dependency on the entirety of each candidate key and on nothing more than those candidate keys.

Again, in most cases "each candidate key" simply means "the primary key", unless there is more than one candidate key. If the only candidate key is the primary key, the rule reads like this:

All columns must have a functional dependency on the entirety of the primary key and on nothing else but the primary key.

Also, for a table to be in the third normal form, it must first be in the second normal form (and therefore also in the first). The part about depending on the entirety of each candidate key is already addressed by the second normal form, so the focus here is on depending on nothing but those keys.

For example, picture a table of cars with the columns placa (primary key), cor, nome_proprietário and endereço_proprietário:

placa (PK) | cor   | nome_proprietário | endereço_proprietário
ABX-1234   | blue  | José              | X Street, 123
NNU-5566   | green | Marcos            | Example Street, 5678
SGH-7210   | black | Maria             | Test Avenue, 7743
ERT-6902   | red   | José              | X Street, 123
BGH-5431   | black | José              | X Street, 123

Note the owner's address column - it is a violation of the third normal form. The owner's address is determined by who the owner is, not by the car's license plate. If José changes address and we update the address on only one of his cars, the database becomes inconsistent (update anomaly). If Maria buys a new car and we insert it with a different address, the database also becomes inconsistent (insertion anomaly). If Marcos sells his car, his address is forgotten (deletion anomaly). The solution, again, is to split into two tables:

Car:

placa (PK) | cor   | proprietário
ABX-1234   | blue  | 1
NNU-5566   | green | 2
SGH-7210   | black | 3
ERT-6902   | red   | 1
BGH-5431   | black | 1

Owner:

codigo (PK) | nome   | endereço
1           | José   | X Street, 123
2           | Marcos | Example Street, 5678
3           | Maria  | Test Avenue, 7743
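
A possible SQL sketch of this decomposition (the surrogate codigo key, the names and the types are illustrative choices):

```sql
CREATE TABLE proprietario (
    codigo    INTEGER      PRIMARY KEY,
    nome      VARCHAR(100) NOT NULL,
    endereco  VARCHAR(200) NOT NULL
);

CREATE TABLE carro (
    placa         CHAR(8)     PRIMARY KEY,
    cor           VARCHAR(30) NOT NULL,
    proprietario  INTEGER     NOT NULL REFERENCES proprietario (codigo)
);

-- Changing José's address is now a single UPDATE on a single row,
-- with no chance of leaving his three cars inconsistent:
UPDATE proprietario SET endereco = 'Y Street, 456' WHERE codigo = 1;
```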

Boyce-Codd normal form (BCNF)

There is a normal form that is slightly stronger than the third normal form but is not a requirement for reaching the fourth (or even the sixth): the Boyce-Codd normal form (BCNF), also sometimes called 3.5NF or the Boyce-Codd-Heath normal form.

It is quite rare to find tables that are in the third normal form but not in the Boyce-Codd normal form. In addition, there are (unusual) cases where this normal form is impossible to achieve (unlike the first through sixth normal forms, which can always be achieved).

The difference appears when there is more than one candidate key and they overlap. The idea is that the fields of the candidate keys should always determine the other fields, and never be determined by them. Therefore, in the Boyce-Codd normal form it is not allowed to arrive at a candidate key from some other candidate key through functional dependencies.

Borrowing an example from a Stack Overflow answer, let's say that each pizza can have several different toppings, each of a different type, and you have two pizzas with these toppings:

pizza | topping    | topping type
1     | mozzarella | cheese
1     | pepperoni  | meat
1     | olives     | vegetable
2     | mozzarella | meat
2     | sausage    | cheese
2     | pepper     | vegetable

There are two candidate keys there: the pizza together with the topping type determines the topping used. The other candidate key is the pizza together with the topping used, which determines the type of that topping.

Note that whichever candidate key we look at, the remaining column depends on the entirety of that candidate key and on nothing more than the candidate key, so the third normal form has been reached.

Well, there is something wrong there: sausage is not a cheese, and mozzarella cannot be meat and cheese at the same time. Despite having reached the third normal form, we still have anomalies.

The solution is again to divide into two tables:

pizza (PK) | topping (PK)
1          | mozzarella
1          | pepperoni
1          | olives
2          | mozzarella
2          | sausage
2          | pepper

topping (PK) | topping type
mozzarella   | cheese
pepperoni    | meat
olives       | vegetable
sausage      | meat
pepper       | vegetable
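
A sketch of the decomposition in SQL; the identifiers (cobertura = topping, tipo = type) are invented for illustration:

```sql
-- Each topping has exactly one type, recorded in a single place.
CREATE TABLE cobertura (
    cobertura  VARCHAR(50) PRIMARY KEY,
    tipo       VARCHAR(30) NOT NULL
);

CREATE TABLE pizza_cobertura (
    pizza      INTEGER     NOT NULL,
    cobertura  VARCHAR(50) NOT NULL REFERENCES cobertura (cobertura),
    PRIMARY KEY (pizza, cobertura)
);

-- It is now impossible to record mozzarella as cheese on one pizza and as
-- meat on another: the type exists only once, in cobertura.tipo.
```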

Fourth normal form (4NF)

It is rare to find tables that are in the third normal form but not in the fourth. The fourth normal form concerns anomalies in the relationships between different columns of the primary key, and only applies to tables whose primary key is composed of three or more columns.

Tables representing ternary, quaternary, or N-ary many-to-many relationships are places where it is worth looking for possible violations of 4NF.

For a table to be in the fourth normal form, it must also first be in the third normal form. The Boyce-Codd normal form is not required.

It's kind of hard to explain, but imagine we have sales representatives working with clients, where multiple representatives can work with multiple clients, and we also have service contracts, where multiple representatives work on multiple contracts. And then we have the following table, where all the columns together form the primary key:

representative | contract | client
Geraldo        | 1        | clothing store
Geraldo        | 1        | hospital
Geraldo        | 2        | clothing store
Geraldo        | 2        | hospital
Marta          | 1        | toy store
Marta          | 1        | hospital
Marta          | 3        | toy store
Marta          | 3        | hospital
Louis          | 2        | clothing store
Louis          | 2        | toy store
Louis          | 4        | clothing store
Louis          | 4        | toy store

Note that we have many-to-many relationships between representatives and contracts and between representatives and customers, but we do not have one between customers and contracts! Here lies the violation of the normal form and the possibility of some kind of anomaly. The ideal is to split into two tables, one with the relationship between representatives and customers and the other with the relationship between representatives and contracts:

representative | contract
Geraldo        | 1
Geraldo        | 2
Marta          | 1
Marta          | 3
Louis          | 2
Louis          | 4

representative | client
Geraldo        | clothing store
Geraldo        | hospital
Marta          | toy store
Marta          | hospital
Louis          | clothing store
Louis          | toy store

Note that in the original table we have redundant information. For example, the information that Marta is on contract 1 appears twice. If we deleted the Marta-1-hospital tuple, we would have a deletion anomaly, even though the facts it records would still be present elsewhere: Marta works with the hospital (via contract 3) and Marta is on contract 1 (via the toy store). We also have an insertion anomaly in the case where a representative is allocated to a contract that does not yet have any customer.
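
A minimal SQL sketch of the two resulting tables (the Portuguese names and the types are illustrative):

```sql
CREATE TABLE representante_contrato (
    representante  VARCHAR(100) NOT NULL,
    contrato       INTEGER      NOT NULL,
    PRIMARY KEY (representante, contrato)
);

CREATE TABLE representante_cliente (
    representante  VARCHAR(100) NOT NULL,
    cliente        VARCHAR(100) NOT NULL,
    PRIMARY KEY (representante, cliente)
);

-- Allocating a representative to a contract that has no customers yet
-- is now a single insert, with no dummy customer needed:
INSERT INTO representante_contrato (representante, contrato) VALUES ('Geraldo', 5);
```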

Fifth normal form (5NF)

The fifth normal form is more restrictive than the fourth, and also applies to tables with 3 or more columns in the primary key. Again, in order to reach the fifth normal form, it is necessary to reach the fourth normal form first.

Let's imagine a case similar to the one used for the fourth normal form, but this time, in addition to the relationships between representatives and customers and between representatives and contracts, we also have a relationship between customers and contracts. Thus, we have this table:

representative | contract | client
Geraldo        | 1        | clothing store
Geraldo        | 1        | hospital
Geraldo        | 2        | clothing store
Marta          | 1        | hospital
Marta          | 3        | toy store
Marta          | 3        | hospital
Louis          | 2        | clothing store
Louis          | 4        | toy store

Note that this table is different from the one we used for the fourth normal form. There are some redundancies; for example, it records twice that Geraldo serves the clothing store and twice that contract 1 applies to the hospital. The technique we used to reach the fourth normal form does not apply here, since for any two columns taken from the key there is a relationship (i.e. we now have the contract-customer relationship as well). Thus, this table is already in the fourth normal form, but there is still room for normalization. By decomposing it into three different relationships, we arrive at the fifth normal form:

representative | contract
Geraldo        | 1
Geraldo        | 2
Marta          | 1
Marta          | 3
Louis          | 2
Louis          | 4

representative | client
Geraldo        | clothing store
Geraldo        | hospital
Marta          | toy store
Marta          | hospital
Louis          | clothing store
Louis          | toy store

contract | client
1        | clothing store
1        | hospital
2        | clothing store
3        | toy store
3        | hospital
4        | toy store
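
Reusing the two tables from the 4NF sketch above, here is a sketch of the third projection and of the join that reconstructs the original ternary table (names are still illustrative):

```sql
-- Third projection: which contracts apply to which customers.
CREATE TABLE contrato_cliente (
    contrato  INTEGER      NOT NULL,
    cliente   VARCHAR(100) NOT NULL,
    PRIMARY KEY (contrato, cliente)
);

-- The original ternary table is recovered by joining the three projections;
-- in 5NF, this join produces exactly the original rows, nothing more.
SELECT rc.representante, rc.contrato, cc.cliente
FROM representante_contrato rc
JOIN contrato_cliente       cc  ON cc.contrato = rc.contrato
JOIN representante_cliente  rcl ON rcl.representante = rc.representante
                               AND rcl.cliente       = cc.cliente;
```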

Sixth normal form (6NF)

For a table to reach the sixth normal form, it must first be in the fifth normal form. The sixth normal form says that:

Every table may have at most one column that is not part of the primary key.

Finding cases where the fifth normal form has been reached but the sixth has not is easy: any table with two or more columns outside the primary key violates the sixth normal form.

However, cases where applying the sixth normal form makes sense and brings some real benefit are very rare - in most cases applying it turns out to be madness, because it tends to produce an explosion in the number of tables while bringing little or no benefit. The sixth normal form was not defined to be something useful in practice; its purpose is more to be a theoretical limit that marks the end point beyond which no further normalization is possible.

To see what the process would look like (and also to question whether it brings any real benefit), imagine the table pessoa with the fields codigo (primary key), nome (NOT NULL), sexo (NOT NULL) and cor_favorita (NULLABLE), with these records:

codigo (PK) | nome    | sexo | cor_favorita
1           | Marcela | F    | green
2           | Rodolfo | M    | -
3           | Tiago   | M    | yellow

It would be decomposed like this:

Pessoa-nome:

codigo (PK) | nome
1           | Marcela
2           | Rodolfo
3           | Tiago

Pessoa-sexo:

codigo (PK) | sexo
1           | F
2           | M
3           | M

Pessoa-cor-favorita:

codigo (PK) | cor_favorita
1           | green
3           | yellow

Note that each column that was not part of the primary key ended up in a separate table. Also note that the pessoa-cor-favorita table has no record for code 2 (Rodolfo). This form of normalization eliminates the need for NULLABLE columns, and all resulting columns are NOT NULL. Columns that were null simply end up being omitted from the resulting records.
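
A sketch of what this 6NF decomposition could look like in SQL (names and types are illustrative), including the outer join needed to put the wide row back together:

```sql
-- Each non-key column gets its own table; NULLs disappear because a
-- missing value is simply a missing row.
CREATE TABLE pessoa_nome (
    codigo  INTEGER      PRIMARY KEY,
    nome    VARCHAR(100) NOT NULL
);

CREATE TABLE pessoa_sexo (
    codigo  INTEGER  PRIMARY KEY REFERENCES pessoa_nome (codigo),
    sexo    CHAR(1)  NOT NULL
);

CREATE TABLE pessoa_cor_favorita (
    codigo        INTEGER     PRIMARY KEY REFERENCES pessoa_nome (codigo),
    cor_favorita  VARCHAR(30) NOT NULL    -- Rodolfo simply has no row here
);

-- Reassembling the original wide row requires joins (an outer one for the
-- optional favorite color):
SELECT n.codigo, n.nome, s.sexo, c.cor_favorita
FROM pessoa_nome n
JOIN pessoa_sexo s              ON s.codigo = n.codigo
LEFT JOIN pessoa_cor_favorita c ON c.codigo = n.codigo;
```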

Domain-key normal form (DKNF)

The domain-key normal form is the one that dictates that:

All database integrity constraints must be imposed either by key constraints or by domain value constraints.

Key constraints means primary keys and foreign keys. The domain values are the set of valid values for each column.
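
In SQL terms, key constraints correspond to PRIMARY KEY, UNIQUE and FOREIGN KEY declarations, while domain constraints correspond to column types and CHECK constraints on valid values. A hypothetical sketch (the tables and the business rule below are invented for illustration):

```sql
CREATE TABLE cliente (
    codigo  INTEGER      PRIMARY KEY,
    nome    VARCHAR(100) NOT NULL
);

CREATE TABLE pedido (
    numero      INTEGER       PRIMARY KEY,                          -- key constraint (PK)
    cliente_id  INTEGER       NOT NULL REFERENCES cliente (codigo), -- key constraint (FK)
    situacao    VARCHAR(10)   NOT NULL
        CHECK (situacao IN ('open', 'paid', 'cancelled')),          -- domain constraint
    valor       NUMERIC(10,2) NOT NULL CHECK (valor >= 0)           -- domain constraint
);

-- A business rule such as "a customer may have at most 3 open orders" cannot
-- be expressed with keys or domains alone, which is why DKNF is rarely attainable.
```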

This normal form is quite strong, being stronger than the fifth normal form and the Boyce-Codd normal form combined. However, it is not stronger than the sixth, since it allows multiple non-primary-key columns in the same table; and the sixth is not stronger than it either, since there is no mechanism in 6NF ensuring that all integrity constraints are modeled as DKNF requires.

This normal form is nirvana, utopia, the ideal and perfect state of normalization. In practice, however, it is almost impossible to achieve, because any database with business rules of some complexity will probably have some consistency rule that cannot be modeled purely as key constraints or domain values. Even when unattainable, though, the effort to get close to it is worthwhile, since it eliminates several possibilities for anomalies.

Other normal forms

There is a published algorithm (Bernstein's algorithm) to normalize tables and bring them to the third normal form. However, later research found that this algorithm actually produces something a bit stricter than the third normal form requires, though still not satisfying the Boyce-Codd normal form or the fourth normal form. For this reason, the normal form achieved by this algorithm was called the elementary key normal form (EKNF).

Another existing normal form, stronger than the fourth normal form and the Boyce-Codd normal form together, but weaker than the fifth normal form and the Boyce-Codd normal form together, is the essential tuple normal form (ETNF). The authors who defined this normal form present it as an alternative to the fifth normal form which, although weaker, would be just as effective. ETNF is reached when BCNF is reached and at least one candidate key has only one field. The authors also refer to other works that define the superkey normal form (SKNF), the redundancy-free normal form (RFNF) and the key-complete normal form (KCNF). 5NF+BCNF is stronger than SKNF, which is stronger than RFNF, which is stronger than ETNF, which is stronger than 4NF+BCNF. It is also shown that KCNF is equivalent to RFNF, although they were defined by different people and in different ways.

Oh, and of course, we also have the unnormalized form, which is any form that does not even satisfy the first normal form.

How far to normalize?

In general, many people are satisfied with reaching the third normal form and do not care much about the higher normal forms, because:

  • They apply to rare cases: when the third normal form is reached, almost always Boyce-Codd, the fourth and the fifth are also reached by luck or accident. This is because it is difficult to have a case that is in 3NF but not in BCNF, and also because tables with three or more columns in the primary key that have dependency relationships among those columns are very rare.

  • Once the third normal form is reached (and probably, by luck or accident, Boyce-Codd, the fourth and the fifth as well), there are very few remaining possibilities for introducing anomalies. Seeking to achieve the sixth normal form is generally madness and brings no practical benefit. Seeking to reach the domain-key normal form is almost always impossible, although the attempt may still reveal some possibility of anomaly that can be eliminated.

  • Sometimes, to improve database performance or simplify its structure, the recommendation ends up being to denormalize some things. Many systems focused on data warehousing or business intelligence are normalized only up to the second normal form, often being derived from structures that were already in the third normal form and then underwent denormalization.

  • Incidentally, the sixth normal form is more useful when thinking about denormalization than about normalization. The idea would be to look for tables that have 1-to-1 relationships and unify them, thereby performing a denormalization. After all, when the data for a record of a particular application-domain concept is scattered across several tables with 1-to-1 relationships, it is an indication that perhaps they should be in the same table.

Where can I apply the concept of normalization?

The concepts of normalization and normal forms apply only to relational databases. Other kinds of non-relational data stores (for example, an application that saves data in files organized in folders) can also have concepts analogous to normalization, used to eliminate redundancies, improve structure, and reduce the possibility of anomalies. However, in that field the concepts of normalization are not as well defined as for relational databases, and each case will have its own particularities.

The case of NoSQL in particular is very interesting, because its idea is to model data that does not need strong consistency (what is called eventual consistency), and often the data comes from places where there is no strong, well-defined structure. For this reason, it does not make as much sense to talk about normalization in NoSQL databases as it does in relational SQL: the idea of NoSQL is to accept, tolerate, and know how to deal with possible anomalies, and to keep working even if they arise. NoSQL databases sacrifice consistency in exchange for scalability, and for this reason anomalies have to be tolerated. Despite this, there are still some concepts of normalization in NoSQL, but they are not as well defined or mature. See a bit about this here.

Data normalization and database normalization

To close the answer to this question, database normalization consists of normalizing the database structure, while data normalization corresponds to the normalization of the data already existing in the tables.

However, especially when restructuring databases that are already in production, the two concepts go together so tightly that it does not even make sense to talk about one without also talking about the other. Hence, in many practical situations, they end up being treated as if they were synonyms.

 83
Author: Victor Stafusa, 2020-12-24 18:24:29

Data normalization is a set of rules applied to relational database tables in order to maintain data consistency, avoid duplication/redundancy, and prevent problems with record removals or updates.

The normal forms are 1NF, 2NF, 3NF, BCNF, 4NF and 5NF. Tables are usually normalized up to the third normal form; the fourth and fifth normal forms deal with more specific problems.

 13
Author: rray, 2016-09-07 16:05:01

The question already has a very good answer, almost a book chapter giving all the details, but I missed something important in it: I realize that many people are confused about, or do not understand, why normalization is necessary. So here is a "summary" way to understand the subject without having to go through all the normal forms.

Improper denormalization

Especially in NoSQL times (particularly with the document model) I see normalization being attacked improperly, often by people who do not understand it (I will disregard the bad faith of trying to justify their preferred technology existing and being ubiquitous, which NoSQL is not and should not be).

I am also not saying that normalization, even when it is appropriate, should always be applied; just be clear that the moment you do not apply it, you take on the burden of dealing with that. A rule consciously ignored can be helpful. One of the problems with the document model is that it "requires" denormalization in "everything", so the rule becomes to disregard the rule, and doing the appropriate thing becomes the exception.

Canonicality

Problems that do not require normalization should not be normalized, but what should be normalized almost always should be for one basic reason: a piece of information needs to be canonical, that is, it should exist in only one place in your database. That does not mean there can be no repetition; this is a common mistake. I explain this in the question DRY is to avoid redundancies, right?, a concept misunderstood by most people, who think the problem is repetition. What goes for code goes for data: it is important to have a single authoritative reference for that knowledge.

Every time the same knowledge is replicated somewhere else, your code has to be responsible for managing that replication. Is that what you want for your code, having to deal with that burden? What happens if it (you) fails at that? Inconsistency. The same information can be modified in one place and not in the other, so the same information exists in two different states, and depending on which one is accessed, something inconsistent will be shown.

Note that there are cases where the same information needs to be frozen at a certain moment and becomes a different piece of information. When that occurs, the repetition does not matter; on the contrary, it must be repeated, because the repetition is circumstantial: at some point the values may well diverge, and they should, because they have come to represent different things.

Just because the data is the same in two places doesn't mean it's the same information, just as two people named José da Silva are not the same person, nor are two identical twin brothers the same person. Repetition is not the reason to normalize; canonicality is.

False premise

A common mistake I see out there, probably the fault of teachers and books that taught it wrong, is saying that normalization exists to reduce consumed space. This is just a pleasant and always desirable side effect, but it is not the reason for normalizing; the reason is the canonicality of the information.

Nor is it correct to say that space is not a problem nowadays. Taking up more space makes you access mass storage (HDD or SSD) more and keep less data in fast storage (RAM), which is harmful to performance.

Of course, normalizing can also bring performance difficulties in some situations, but you have to think carefully about whether it is worth trading the difficulty of maintaining the data and the risk of inconsistency for slightly better performance.

Interestingly, denormalization can help avoid read-time JOINs, but if used in excess it can create a kind of write-time JOIN, which is much worse, making it easier to create deadlocks and other harmful phenomena.

The evolution of hardware (NVRAM) will make the cost of doing a JOIN laughably small. No hardware evolution will make it easier to deal well with non-canonical information.

Correct normalization

Understand that I am not defending normalization done wrong. Today, all mainstream relational databases have ways to use flexible data models that avoid unnecessary normalization.

It's interesting to take a look at What is the difference between association, aggregation and composition in OOP?. Generally, composition cases should not be normalized; in a database this is easier and better to do because of the way the data is accessed (unless access uses fixed row sizes, which is rare nowadays). Aggregation is a clear case for normalization. Association would be too, but there are cases where it is possible to turn it into composition for database purposes, so you need to pay attention to this as an optimization (aggregation can too, but it is rarer).

One of the reasons I criticize microservices is that destroying normalization in the name of an architecture is almost always unnecessary. Instead of having one system with canonical information, you have several subsystems, each (possibly) with its own canonical information, but as a whole you violate DRY completely and it becomes a nightmare to take care of it.

Beware of following rules for their own sake; all of them can and should be broken when it is more important to do so. Knowing when to break them takes a lot of mastery of what you are doing. If you never break any rule, you are already doing it wrong by definition, and will only get it right by coincidence. Always remember:

A completely beat-up Fiat 147, still driving around the streets

Can you tolerate anomalies in your data? That is what NoSQL preaches.

With these things clearer in mind, reread Victor Stafusa's answer paying more attention to its details. Did you pay close attention to this sentence?

The main purpose of the normalization process is to eliminate insert, update, and delete anomalies.

In addition to normal forms

One point the accepted answer does not touch on is not-so-obvious normalization. The most common example I always cite: you have a customer and a supplier, but both are the same entity; if it changes address, you have to change it in two different places. Either there should be a single entity, treating only the customer or supplier relationship as something separate (normalized or not), or, if you do not want to normalize it properly (a "normal form" that does not exist in the books on the subject), you should at least ensure that changing the customer table also changes the supplier table. Virtually no system I know of does this right. Curiously, it is easier to do this in the document model, but because almost everyone who adopts that model still thinks relationally, they do not do it right either.
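
A sketch of the single-entity approach (every name below is invented for illustration): the shared data lives in one table, and being a customer or a supplier becomes a role attached to it.

```sql
-- The shared entity: the address lives in exactly one place.
CREATE TABLE entidade (
    codigo    INTEGER      PRIMARY KEY,
    nome      VARCHAR(100) NOT NULL,
    endereco  VARCHAR(200) NOT NULL
);

-- Being a customer or a supplier is just a role attached to the entity.
CREATE TABLE cliente (
    entidade_id     INTEGER PRIMARY KEY REFERENCES entidade (codigo),
    limite_credito  NUMERIC(10,2)
);

CREATE TABLE fornecedor (
    entidade_id         INTEGER PRIMARY KEY REFERENCES entidade (codigo),
    prazo_entrega_dias  INTEGER
);

-- Changing the address once updates it for both roles:
UPDATE entidade SET endereco = 'New Street, 42' WHERE codigo = 1;
```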

Learn to model and normalize

Modeling data is probably the most important skill a developer should have, and most developers, even experienced ones, are pretty bad at it (I still am, after 36 years doing it). With that in mind, start preparing.

You need a huge capacity for interpreting text - often a text that does not even exist yet - and mathematical understanding, especially of sets: you know, the stuff taught from 3rd grade onward (in my time), which almost nobody pays attention to and which many teachers do not even understand why they are teaching? If these things are missing, start over from the beginning; without a foundation, nothing stands.

Examples of normalization

I have already given several reviews about modeling and normalization. Reading those examples can help you gain experience; just do not take them as rules. Other people have given them too, but it is harder for me to find those.

I am impressed that this question had only about 4 thousand views as of the time I wrote this answer, which says a lot: with so many problematic applications out there, people are not worrying about the right things.

 13
Author: Maniero, 2019-09-25 14:39:51

Normalization is important and should be taken up to 3NF, but for performance reasons it may also be necessary to denormalize. Sometimes, after making the data model all pretty, you run the load test and the database craaaawls. It may be necessary to undo some of the normalization ;-) Common sense!

 -1
Author: Fausto Caveiro, 2019-06-04 17:34:30