How to organize unstructured data?

So, you need to organize unstructured data so that an Entity-Relationship (ER) model can be built from it; that is, you need to learn how to connect entities by a single criterion or by a combination of criteria.

Here is what I mean. Take an organization as an example: LLC "Orion" is an entity with a set of characteristics:

  1. unique properties, such as TIN 7710646874, OGRN 1067757813474
  2. non-unique properties, such as
  • Address: 108811, Moscow, kilometer Kievskoe shosse 22-th (p Moskovsky), dvld 6 building 1, floor / block 2/a105
  • Registration region: Moscow
  • Type of activity: Activity in the field of communication based on wireless technologies (61.20)
  3. other useful properties, such as CEO, phone number, email

This is where problem #1 appears: how to organize this data. On the one hand, all these properties can be put into a single entity, LLC "Orion", which is assigned all of them: TIN, OGRN, address, region, type of activity, CEO, phone number, email. On the other hand, some of these properties can be independent entities. The CEO, for example, has properties of his own (TIN, registration address, phone number). The question is which properties to split out into separate entities and which to leave as plain properties.
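The two options can be sketched side by side. This is only an illustration in Python: the class and field names are my own assumptions, and the CEO name, phone numbers and email are made-up placeholders, not real data.

```python
from dataclasses import dataclass
from typing import Optional


# Option A: everything is a flat property of the organization.
@dataclass
class FlatOrganization:
    tin: str
    ogrn: str
    region: str
    activity: str
    ceo_name: str
    phone: str
    email: str


# Option B: the CEO is promoted to an independent entity with its own properties.
@dataclass
class Person:
    name: str
    tin: Optional[str] = None
    registration_address: Optional[str] = None
    phone: Optional[str] = None


@dataclass
class Organization:
    tin: str
    ogrn: str
    region: str
    activity: str
    ceo: Person  # a reference to another entity, not a plain string
    phone: Optional[str] = None
    email: Optional[str] = None


# Placeholder values (hypothetical): name, phones and email are invented.
ceo = Person(name="<CEO name>", phone="+7 000 000-00-01")
orion = Organization(
    tin="7710646874",
    ogrn="1067757813474",
    region="Moscow",
    activity="61.20",
    ceo=ceo,
    phone="+7 000 000-00-01",
    email="info@example.org",
)
```

In option B the CEO can be linked to other organizations or matched against other sources, at the cost of a more complex model; in option A the CEO is just a string that cannot be linked to anything.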

The situation is complicated by the following: in the classic case of an organization, we know the approximate set of properties it has. For more complex entities, such as a person, the set of properties can be unlimited and not defined in advance. For example, we may know only the name, patronymic, year of birth, phone number, and that he is a Spartak fan.

Let's assume that the organization's phone number from the first example and the person's phone number match. Here is the result of our work: we identify a relationship between the entities "Organization" and "Person" on the basis of the phone number. - this is a perfect example of
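Linking by a shared phone number could look roughly like this. It is a minimal sketch, assuming records are plain dictionaries; the function names, record ids and phone numbers are hypothetical.

```python
from collections import defaultdict


def normalize_phone(raw: str) -> str:
    """Keep digits only so that differently formatted numbers compare equal."""
    return "".join(ch for ch in raw if ch.isdigit())


def link_by_phone(records):
    """Group records by normalized phone; keep only phones shared by 2+ records."""
    by_phone = defaultdict(list)
    for rec in records:
        phone = rec.get("phone")
        if phone:
            by_phone[normalize_phone(phone)].append((rec["kind"], rec["id"]))
    return {p: ids for p, ids in by_phone.items() if len(ids) > 1}


# Made-up sample data: the phone numbers are placeholders.
records = [
    {"kind": "organization", "id": "orion", "phone": "+7 (000) 000-00-01"},
    {"kind": "person", "id": "person-1", "phone": "70000000001"},
    {"kind": "person", "id": "person-2", "phone": "+7 000 000-00-02"},
]

links = link_by_phone(records)
for phone, linked in links.items():
    print(phone, "->", linked)
```

Normalizing before comparing matters here, because the same number rarely arrives in the same format from two different sources.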

Also, and most often, there are abstract entities, that is, sets of characteristics that do not allow the entity to be identified uniquely. What, for example, might be known about a person? His name is Anton, his year of birth, that he is a Zenit fan, and that he is a regular customer of the ABC of Taste on Leninsky Prospekt. So if we first select all Zenit fans from Moscow, and then from them take the customers of the ABC of Taste on Leninsky Prospekt named Anton, there is a high probability that we will find exactly this person and can upgrade the entity to a unique one with the necessary set of unique characteristics: full name, full date of birth, and the address where groceries are delivered to him. In the worst case we will have, say, 5 such "Antons" and can work through this data manually.

We must assume that we have an indefinite number of data sources with unstructured information. They need to be loaded into a database, and then the data has to be linked together.

As a result, we must answer the following questions:
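The narrowing-down by a combination of non-unique properties can be sketched as a simple filter. This assumes candidates are dictionaries with hypothetical keys (`name`, `club`, `city`, `store`); the sample rows are invented.

```python
def narrow(candidates, **known):
    """Return the candidates that match every known (non-unique) property."""
    return [c for c in candidates
            if all(c.get(key) == value for key, value in known.items())]


STORE = "ABC of Taste, Leninsky Prospekt"

# Made-up candidate pool collected from several sources.
people = [
    {"id": 1, "name": "Anton", "club": "Zenit", "city": "Moscow", "store": STORE},
    {"id": 2, "name": "Anton", "club": "Spartak", "city": "Moscow", "store": STORE},
    {"id": 3, "name": "Anton", "club": "Zenit", "city": "Moscow", "store": "other"},
    {"id": 4, "name": "Igor", "club": "Zenit", "city": "Moscow", "store": STORE},
]

matches = narrow(people, name="Anton", club="Zenit", city="Moscow", store=STORE)

if len(matches) == 1:
    print("unique match:", matches[0]["id"])
else:
    print(len(matches), "candidates left for manual review")
```

Each property alone is non-unique, but their intersection may leave one candidate, or a handful that can be reviewed by hand.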

  1. Can we reliably establish a relationship between entities?
  2. How do we define the "independence" of elements such as a phone number: is it an entity or a property?
  3. Can we complement (combine) the non-unique properties of one entity with the properties of another non-unique entity based on the totality of information (as in the example with "Anton")?
Author: Denis S, 2020-10-29

1 answer

All the problems you describe come from approaching the task from the wrong side. Data and the relationships within it do not exist in a vacuum; a particular model structure is determined by how the entities and their relationships will be used.

You can find many attributes and relationships both in the real world and in the "raw" data, but whether they need to be represented in the model is decided based on the problem being solved.

"There are abstract entities, that is, a set of characteristics that do not allow you to uniquely determine this entity as unique"

Since you have many data sources, and data from different sources may be incomplete, you will have to represent this at the model level, i.e. you will have entities of the type person-as-seen-by-source-N. You will need either an algorithm that merges everything or a UI where the user chooses what is taken from where. In either case there will be business rules defining how to handle this, and business rules need models to operate on.
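As a rough sketch of what that means: each source gets its own record, and the merge is an explicit rule. Here the rule is "the source listed earlier in the priority list wins", which is only one possible business rule chosen for illustration; the source names and attribute values are made up.

```python
from dataclasses import dataclass


@dataclass
class SourceRecord:
    """A person as seen by one particular source."""
    source: str
    attributes: dict


def merge(records, priority):
    """Merge per-source views of one person.

    For conflicting attributes, the source that comes first in `priority` wins.
    Also returns provenance: which source each merged value came from.
    """
    rank = {name: i for i, name in enumerate(priority)}
    merged, provenance = {}, {}
    for rec in sorted(records, key=lambda r: rank[r.source]):
        for key, value in rec.attributes.items():
            if key not in merged:
                merged[key] = value
                provenance[key] = rec.source
    return merged, provenance


# Hypothetical sources with one conflicting attribute ("city").
views = [
    SourceRecord("fan_club", {"name": "Anton", "club": "Zenit", "city": "Saint Petersburg"}),
    SourceRecord("loyalty_program", {"name": "Anton", "city": "Moscow"}),
]

person, origin = merge(views, priority=["loyalty_program", "fan_club"])
print(person)
print(origin)
```

Keeping the per-source records and the provenance makes it possible to change the rule later, or to hand conflicts to a user, without losing the original data.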

And all of this holds only on the assumption that there can be one correct model of a person in which everything can be represented. More often, different modules of the system will have their own view of what a person is (I will not retell everything here; read about bounded contexts in DDD). In that case, different models will exist not only to support merging the data but also while the data is used further. The models will then look like person-as-seen-by-the-purchasing-module and person-from-the-accounting-point-of-view.

Therefore, the answer to all the questions is the same: in general, i.e. abstracting from how the entities will be used, we cannot; in a specific situation, of course we can. This is done while analyzing the domain and the usage scenarios.
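A minimal sketch of two such context-specific models, assuming they share nothing but an identifier. The class names, fields and values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PurchasingCustomer:
    """A person as the purchasing module sees them."""
    person_id: str
    name: str
    delivery_address: str


@dataclass(frozen=True)
class AccountingCounterparty:
    """The same person from the accounting point of view."""
    person_id: str
    legal_name: str
    tin: str


# Placeholder values; only person_id ties the two views together.
customer = PurchasingCustomer("p-1", "Anton", "<delivery address>")
counterparty = AccountingCounterparty("p-1", "<full legal name>", "<TIN>")

same_person = customer.person_id == counterparty.person_id
print(same_person)
```

Neither class tries to be the complete model of a person; each carries only what its module needs.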

Author: Roman Konoval, 2020-10-29 19:23:29