When is it worthwhile to denormalize the database?

Is there a situation where keeping the database denormalized is more advantageous than keeping it normalized?

For example

Normalized

  Pessoas                  Cidades
| nome    | cidade_id |  | id | nome        | 
-----------------------  --------------------
| Fulano  |         1 |  |  1 | Nova Iorque |
| Ciclano |         1 |

Denormalized

  Pessoas
| nome    | cidade      |
-------------------------
| Fulano  | Nova Iorque |
| Ciclano | Nova Iorque |

Is there a point at which the second case becomes more appropriate than the first?
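A rough SQL sketch of the two layouts above (PostgreSQL-style DDL; the column types are only illustrative):

    -- Normalized: the city is stored once and referenced by id
    CREATE TABLE Cidades (
        id   serial PRIMARY KEY,
        nome text NOT NULL
    );

    CREATE TABLE Pessoas (
        nome      text NOT NULL,
        cidade_id integer NOT NULL REFERENCES Cidades (id)
    );

    -- Denormalized: the city name is repeated on every row
    CREATE TABLE Pessoas_denormalizada (
        nome   text NOT NULL,
        cidade text NOT NULL
    );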

Author: Maniero, 2014-06-21

3 answers

First, let's understand that denormalizing is not the same thing as not normalizing. Denormalization is a conscious process with a clear goal. A well-normalized database is often faster than a non-normalized one. Each normal form is stricter than the previous one. Does going from one normal form down to another count as denormalizing, or is it just reducing normalization? I don't know.

One of the biggest advantages of normalization is to facilitate data integrity by eliminating redundancy. Denormalization needs to be done very carefully.

Normalization applies well to the relational model. I will not go into the cases where this model does not apply well; after all, those would not really be cases of denormalization.

I see only one really compelling reason to denormalize a database.

Performance

Once you have measured that the physical design is hindering the performance, and consequently the scalability, of the database, it can be worthwhile to denormalize it judiciously. That does not mean the logical design should be denormalized.
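As an illustration of "measure first" (a minimal sketch assuming PostgreSQL and the question's tables), EXPLAIN ANALYZE shows whether the join is actually the bottleneck before any schema change:

    -- Check the real cost of the join before touching the schema
    EXPLAIN ANALYZE
    SELECT p.nome, c.nome AS cidade
    FROM Pessoas p
    JOIN Cidades c ON c.id = p.cidade_id
    WHERE c.nome = 'Nova Iorque';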

OLAP vs. OLTP

The most common approach is to do it on data, on parts of the database, or even on separate databases dedicated to OLAP: essentially query, analysis, simulation and projection operations, for BI applications, data warehousing, etc.
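A minimal sketch of that idea, using the question's tables: a flattened, query-only copy is built on the side (or in a separate reporting database), while the OLTP tables stay normalized. Names are illustrative:

    -- Illustrative analysis copy: flattened and pre-aggregated, used only for queries
    CREATE TABLE relatorio_pessoas_por_cidade AS
    SELECT c.nome AS cidade, count(*) AS total_pessoas
    FROM Pessoas p
    JOIN Cidades c ON c.id = p.cidade_id
    GROUP BY c.nome;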

Simplified, local databases, running on low-performance, low-capacity equipment that performs specific satellite tasks (a POS terminal, for example), can also benefit from the performance gain.

Note that in both cases the denormalization ends up happening mostly in an auxiliary database. And when it occurs in the same database, it is done in parts that are inherently used for queries. In typical OLTP activities, denormalization is not usually such an attractive option.

But is copying data for analysis really denormalization? Does it affect data integrity?

It cannot.

Does it create redundant data?

I have my doubts.

I did not find reliable canonical references that settle this. But something tells me that normalization was never intended to deal with OLAP activities. When it was conceived, all anyone thought about was OLTP. My experience (yours may vary) is that denormalization in OLTP tends to hurt performance more than help it.

Performance improvement techniques

There are actually not that many cases that really benefit from denormalization. It is one of those things done out of fashion, out of half-understanding, or by applying something in the wrong context. Most of the time there are other techniques to lessen the impact on the database before you have to resort to denormalization. There are cases where denormalizing solves one problem and causes another.

Some formal denormalization techniques:

An example on SO.
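One of the "other techniques" mentioned above, which lessens the impact without denormalizing the base tables, is a materialized view. A minimal sketch, assuming PostgreSQL and the question's tables:

    -- The joined, query-friendly shape lives in a materialized view,
    -- while the base tables stay normalized
    CREATE MATERIALIZED VIEW pessoas_com_cidade AS
    SELECT p.nome, c.nome AS cidade
    FROM Pessoas p
    JOIN Cidades c ON c.id = p.cidade_id;

    -- Re-run after the base tables change
    REFRESH MATERIALIZED VIEW pessoas_com_cidade;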

Simplification

Another reason, not a very good one but which may be valid and acceptable, is the simplification of the data.

There is a difference between building the system for Manuel's corner bakery and a big system for a big company. Allowing a free-form list of phone numbers in the first system's records may not be a problem. But managing all the contacts of a complex company leaves little room for registering the data any which way. Imagine a partner with multiple contacts that changes its phone number. You want to change this number, probably preserving the internal extensions of all contacts. Are you going to go around changing them one by one? And what if you need to search the contacts' phone numbers? It is better to have this data in a separate table.
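A minimal sketch of the separate-table approach (all names are illustrative): the number lives in its own table, so it can be searched directly and updated in one statement while the extensions are preserved:

    -- Illustrative contact tables: phone numbers get their own rows
    CREATE TABLE contatos (
        id       serial PRIMARY KEY,
        parceiro text NOT NULL,
        nome     text NOT NULL
    );

    CREATE TABLE telefones (
        id         serial PRIMARY KEY,
        contato_id integer NOT NULL REFERENCES contatos (id),
        numero     text NOT NULL,
        ramal      text               -- internal extension, kept per contact
    );

    -- The partner's main number changes: one statement, extensions untouched
    UPDATE telefones SET numero = '11 4002-8922' WHERE numero = '11 5555-0100';

    -- Searching contacts by phone number is straightforward
    SELECT c.nome
    FROM contatos c
    JOIN telefones t ON t.contato_id = c.id
    WHERE t.numero = '11 4002-8922';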

Some simplifications can bring more problems than solutions. Ensuring the integrity of denormalized data can be complicated and risky. It is not real simplification.

Convenience

Some say that denormalizing makes reporting easier. But I think this falls under the OLAP issue. Do you want the data in an easier shape for the report generator? Fine: generate an appropriate set of tables with the necessary data, possibly temporary ones, and make them available to the report. If you just want convenience for data access, create views or use another technique that makes things easier for you. Denormalization should not be used merely for convenience.
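A minimal sketch of the view approach, reusing the question's tables (the view name is illustrative): the report sees a flat shape while the stored data stays normalized:

    -- Convenience view: flat for the report, normalized underneath
    CREATE VIEW vw_pessoas_cidade AS
    SELECT p.nome, c.nome AS cidade
    FROM Pessoas p
    JOIN Cidades c ON c.id = p.cidade_id;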

Taking your example

Imagine that New York decides to call itself Old York. Can you fix it? I don't know. How many different cities named New York are there in the register? New York is not the identity of the information, it is a state of it. Let's say you can fix it. Wasn't this supposed to simplify things? Or was it just XGH?

If you really need to simplify, violating the first normal form is what will cause the least difficulty. You can use PostgreSQL's array type, or any technique that simulates this feature in other database systems. It is only complicated if the database uses a fixed row size; there can be a lot of waste. But I question whether it is really a simplification. Handling the exceptions takes work.
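A minimal sketch of that first-normal-form violation using PostgreSQL's array type (names are illustrative):

    -- Phones stored in the same row, deliberately violating 1NF for simplicity
    CREATE TABLE contatos_simples (
        id        serial PRIMARY KEY,
        nome      text NOT NULL,
        telefones text[]          -- e.g. ARRAY['11 5555-0100', '11 5555-0101']
    );

    -- Searching inside the array still works, but is clumsier than a separate table
    SELECT nome FROM contatos_simples WHERE '11 5555-0100' = ANY (telefones);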

But the maxim holds: if you really know what you are doing, everything is valid. The problem is that those who do not know also cannot assess whether they know.

What looks like denormalization but is not

There are cases where you might think you are denormalizing, but you are just tailoring the design to the domain.

A typical example.

An invoice usually has a reference to the business partner, and the invoice items usually have references to the linked products. This is normalized, and I doubt anyone would think it a good idea to denormalize the separation between the invoice header and its items.

If the address of a partner, a customer for example, changes after the invoice is issued, what happens to the address recorded on the invoice? And what if the price of a product sold on that invoice changes?

You have a beautiful inconsistency and are doing something illegal (at least in Brazil). Bringing customer or product data into the invoice is not denormalization. It ensures the correct archiving of the information. The data in the partner's general registration has one purpose; the partner's data at the time the invoice is issued has another. They are different things. They may coincide, but synchronizing them, trying to force consistency, causes the opposite: it makes them inconsistent.

The semantics of the partner data contained in the invoice determine that it is immutable after issuance, so you have to guarantee this. Copying the customer data to the invoice solves it (it is amazing how some systems, even well-known ones, neglect this). Copying the product data is not only needed for immutability. In fact, the data may take a different path. The price may be special for this operation, the tax may be different for some specific reason. And the description can also be customized (in a public bid, the invoice must come out with the description from the notice and not what the winning bidder has in their product registration), becoming a separate piece of information that merely used the product registration as a starting point. The moment you copy the data, it is decoupled from its origin. If the information is different, copying is not denormalizing. Just because you have the number "1" in two different places does not mean they represent the same thing.
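A minimal sketch of what copying at issuance can look like (all names are illustrative): the invoice item carries its own price and description, decoupled from the product registration:

    -- The item stores the price and description that were valid
    -- (or negotiated) at the moment of issuance
    CREATE TABLE produtos (
        id        serial PRIMARY KEY,
        descricao text NOT NULL,
        preco     numeric(12,2) NOT NULL
    );

    CREATE TABLE nota_itens (
        nota_id    integer NOT NULL,
        produto_id integer NOT NULL REFERENCES produtos (id),
        descricao  text NOT NULL,          -- copy, possibly customized, at issuance
        preco      numeric(12,2) NOT NULL, -- copy, possibly special, at issuance
        quantidade numeric(12,3) NOT NULL
    );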

Even then, I would suggest normalizing this information. To ensure the immutability of the partner's data, perhaps changes to the partner's registration should be handled with multiple versions. Then it is possible to refer directly to the version that was active at the time the invoice was issued. Creating multiple versions of the registration may look like denormalization. But if you need the repeated information, if it is part of the semantics of the mechanism, normalizing it is impossible; any attempt in that direction destroys the mechanism.

This mechanism has advantages even for solving other problems and is useful in a number of cases. But it also has drawbacks. So another, purer normalization solution for the purpose may be more feasible in other cases.

Have a table with only the partner data that is needed when the partner appears on an invoice. A new version of this data is only recorded if the registration has changed when a new invoice is issued for that partner. This avoids keeping copies of data that is never used on any invoice, and copies of data that does not change for years. This model seems so obvious to me that it is embarrassing that systems do not use some variation of it.
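A minimal sketch of that versioning idea (all names are illustrative): the invoice points at the partner-data version that was current when it was issued, and a new version is only written when something actually changed:

    -- Multiversion partner data: invoices reference a specific version
    CREATE TABLE parceiro_versoes (
        id          serial PRIMARY KEY,
        parceiro_id integer NOT NULL,
        endereco    text NOT NULL,
        valido_de   timestamp NOT NULL DEFAULT now()
    );

    CREATE TABLE notas (
        id                 serial PRIMARY KEY,
        parceiro_versao_id integer NOT NULL REFERENCES parceiro_versoes (id),
        emitida_em         timestamp NOT NULL DEFAULT now()
    );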

I have gone into too much detail about this. What matters is that the ideal is to change the normalization, not to denormalize, especially because the system's requirements involve different information.

Calculated fields

When you use accumulators or other calculated fields, you may or may not be creating new information. There are cases where the field really is redundant, so it fits normalization, and consequently a denormalization can be useful to improve performance (see the first section). In extreme cases there are calculations that can take hours.

But in many cases the field simply meets a need of its own. Example: leaving a monetary value to be converted to another currency only when you need it is a normalization that can cause you accounting trouble. This is a calculated field that generates new information; it cannot be normalized, so it cannot be denormalized.
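A minimal sketch of that point (names are illustrative): the converted amount and the rate used are recorded at the moment of the transaction, because they are new information, not something derivable later:

    -- The rate and converted value are captured at transaction time;
    -- recomputing later with another rate would be wrong for accounting
    CREATE TABLE lancamentos (
        id        serial PRIMARY KEY,
        valor_brl numeric(14,2) NOT NULL,
        taxa_usd  numeric(12,6) NOT NULL,  -- rate in effect at the time
        valor_usd numeric(14,2) NOT NULL   -- converted then, not on demand
    );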

We often think of denormalizing what was never normalized. That does not make sense.

Gambiarra (quick hacks)

Yeah, let's be frank: sometimes we need to resort to gambiarras (quick hacks) :) Not ideal, but necessary. This is a reason to denormalize.

Conclusion

I did not include more specific examples because I think it would turn into a list and become too broad.

Want a good reason not to denormalize? It is difficult to define well how to do it, as opposed to normalization, which is a relatively simple and well-known concept. Without a good command of SQL and the relational model, of measurement tools, and of the ability to interpret their results, it is highly likely that denormalization will make the situation worse.

On the other hand, clinging too much to theory on either side will bring more problems than solutions. I could say that doing X violates normal form Y and therefore is not allowed. But it may bring advantages without causing any problems in a specific case.

I know programmers and DBAs who will say the opposite of everything here. I myself do not always follow what I have said here.

It was nice to revisit the subject, because I discovered things I did not know. I just did not post them because they were out of scope.

 27
Author: Maniero, 2020-09-04 15:58:16

Yes, it exists. It is in a very specific situation: the creation of data marts for BI. It is what they call star schema versus snowflake schema.

I will give my view that, in the proper context of data marts for BI, the star schema is the right choice and the snowflake schema the wrong one. We denormalize to a star schema, and we do this denormalization consciously because there are techniques for it, in order to save on joins and thus gain speed: purely to gain speed and save CPU time when processing large amounts of data for the famous BI reports.

This denormalization pays off a lot in the context and for the uses I cited above.
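A minimal sketch of what a star schema can look like (all names are illustrative): one fact table surrounded by flat, denormalized dimensions, so BI queries need few joins:

    -- Flat dimensions, one fact table
    CREATE TABLE dim_cliente (
        cliente_key serial PRIMARY KEY,
        nome        text,
        cidade      text,            -- city kept flat instead of a separate table
        estado      text
    );

    CREATE TABLE dim_data (
        data_key integer PRIMARY KEY,  -- e.g. 20140621
        dia      integer,
        mes      integer,
        ano      integer
    );

    CREATE TABLE fato_vendas (
        cliente_key integer REFERENCES dim_cliente (cliente_key),
        data_key    integer REFERENCES dim_data (data_key),
        valor       numeric(14,2)
    );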

 2
Author: Flávio Granato, 2014-06-21 18:04:00

Another example, which I personally disagree with as a model, is a commercial ERP that denormalizes tables because the search feature in its screens does not allow searching columns of secondary tables. For example, in the financial title (receivables) tables the client's name is copied into the table, when for normalization the client table's key would suffice. The column is copied just so the titles can be searched by name.
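A minimal sketch of the pattern described, for illustration only (names are invented):

    -- The client's name is copied into the titles table so the screen search
    -- can filter by name without a join: a redundant, denormalized column
    CREATE TABLE titulos (
        id           serial PRIMARY KEY,
        cliente_id   integer NOT NULL,
        cliente_nome text NOT NULL,   -- redundant copy, kept only for the search
        valor        numeric(14,2) NOT NULL
    );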

 0
Author: Motta, 2014-06-21 18:23:35