Data has always been of major importance for businesses and organizations. In recent decades, however, a key opportunity has emerged: access to massive amounts of data, and the means to process it digitally.
Being able to understand habits, resources, and patterns is now a tipping point between success and failure in several industries. If you want to avoid the losses caused by unhealthy data and make the most of your investment in it, this article will be of great help.
The popularization of the internet (63.1% of the worldwide population uses it regularly) has allowed organizations to gather data as they never did before. But this possibility came along with new challenges: processing this data, understanding it, and making it useful requires a series of complex technical developments.
An important part of the process of making this data useful is verifying the quality and improving “unhealthy” data. But what does this mean?
Data quality (DQ) is the discipline of assuring the correctness of incoming data so that it is ready to enter a pipeline and, finally, be delivered to its user. Its impact on efficiency and effectiveness makes its importance impossible to overstate.
This process is typically divided into dimensions along which data's health can be tested. Let's see what the standard ones are and how they work.
Standard data quality dimensions: why are they important?
There are several standard dimensions within the discipline of data quality, and each of them covers one aspect in which data can be declared usable.
Essentially, data is a virtual representation of real objects that can be manipulated digitally. For these operations to work correctly, the representation of reality must be precise, at least in the aspects we care about. Therefore, they require high-quality data; otherwise, the link with the real objects is broken and the data loses its operational value.
Many people think only of accuracy when they hear the words data quality, but other aspects are equally important if you want your data to be usable. In this article, we focus on the five standard dimensions and comment on some emerging ones.
Accuracy

This first dimension is probably the most intuitive one, and it is understood as:
The closeness between a value V and a value V’, considered V as the correct representation of a real-world object.
There are two different kinds of accuracy dimensions we should take into account.
Syntactic accuracy is defined as the closeness of the value V’ to the elements of the particular domain D (for example, existing names or real numbers). In this dimension, we aren’t interested in the correspondence of V’ with the true value V; instead, we are trying to check if V’ is any of the correct values in our domain D, no matter which one.
To put it simply: “Paul” is a syntactically accurate value in a Name field. “Payl” or “France”, for instance, aren’t. This dimension aims to check if a field is “well-written”, in the sense of being a valid option within the demanded subject. It doesn’t matter here if the real name of the person for the input “Paul” was John.
The second type of accuracy is semantic (a.k.a. “correctness”), and its goal is checking the difference between V’ and the real V. For example, “Paul” is correct as long as the name of the person is actually Paul. For a person named John, “Paul” is syntactically correct, but semantically incorrect; and “Payl” would be both syntactically and semantically incorrect.
This second type of accuracy is naturally more difficult to verify, as we must obtain the real information from a second source.
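The distinction between the two kinds of accuracy can be sketched in code. This is an illustrative example, not part of any standard library: `NAME_DOMAIN` and the ground-truth value are invented for the demonstration.

```python
# Hypothetical sketch: syntactic vs. semantic accuracy checks.
# NAME_DOMAIN stands in for the domain D of valid values for the Name field.
NAME_DOMAIN = {"Paul", "John", "Mary", "Alice"}

def is_syntactically_accurate(value: str, domain: set) -> bool:
    """True if the value belongs to the field's domain, regardless of truth."""
    return value in domain

def is_semantically_accurate(value: str, true_value: str) -> bool:
    """True only if the value matches the real-world fact,
    which must be known from a second source."""
    return value == true_value

# The person's real name is John:
print(is_syntactically_accurate("Paul", NAME_DOMAIN))  # True  - a valid name
print(is_semantically_accurate("Paul", "John"))        # False - wrong person
print(is_syntactically_accurate("Payl", NAME_DOMAIN))  # False - not in domain
```

Note that the syntactic check only needs the domain itself, while the semantic check needs an external reference value, which is exactly why it is harder to perform.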
Completeness

This dimension is generally defined as the extent to which the supplied information covers the relevant real-world facts. In the case of a table of people that includes Name, Surname, and Email, a null Email field would be:
- Incomplete for someone who has an email address, which is unknown to us.
- Complete for someone who doesn’t have an email address.
- Unknown if we don’t know whether this person has an email address or not.
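The three cases above can be captured in a small sketch. The `has_email` flag (whether the person really has an address) is assumed to come from an external source; `None` means we don't know.

```python
# Illustrative sketch of the completeness cases for a nullable Email field.
def completeness_status(email, has_email):
    if email is not None:
        return "complete"    # a value is present
    if has_email is True:
        return "incomplete"  # an address exists but is unknown to us
    if has_email is False:
        return "complete"    # no address exists, so null is the correct value
    return "unknown"         # we can't tell whether an address exists

print(completeness_status(None, True))   # incomplete
print(completeness_status(None, False))  # complete
print(completeness_status(None, None))   # unknown
```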
In the current state of the discipline, web data represents a special case with its own problems and solutions.
Web data completeness
Web data has several particularities, among which its time dynamics stand out. Information systems on the web, unlike print-based media, are constantly being updated, transformed, and re-published.
For this reason, while the traditional completeness dimension is static, the web-based notion needs to include the concept of completability. This term refers to how long it takes for missing fields to be filled in, and how fast this task can be done.
Time-related dimensions

There are several time-related dimensions in DQ (timeliness, currency, and volatility), with subtle differences that must be taken into account. As mentioned above, time dynamics are one of the main aspects of web data, so these dimensions are very important for optimizing processes in that environment.
Timeliness may be the most important of the three, as it refers to the availability of particular data for a certain task: for example, knowing the available number of products before sending a package.
As it is directly linked to actions, timeliness is an unavoidable dimension for several industries; without it, efficiency and effectiveness are drastically reduced.
Currency

This dimension refers to how promptly information is updated. If you are working with regularly modified data (such as web traffic, for example), currency is a key dimension to cover.
Volatility

Volatility describes the frequency with which data varies over time. Stable data, such as birthdays, have a volatility of zero. Stock quotes, on the contrary, are highly volatile, as they change over short periods of time.
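As a sketch, currency and volatility can be combined into a simple freshness check: each field gets a maximum acceptable age derived from its volatility, and a value is considered current while it is younger than that budget. The `MAX_AGE` table, the field names, and the 15-minute budget for stock quotes are illustrative assumptions, not standard values.

```python
from datetime import datetime, timedelta

# Volatility expressed as a maximum acceptable age per field.
# None means volatility zero: the value never goes stale.
MAX_AGE = {
    "stock_quote": timedelta(minutes=15),  # highly volatile
    "birthday": None,                      # stable data
}

def is_current(field: str, last_updated: datetime, now: datetime) -> bool:
    """A value is current if its age fits within the field's volatility budget."""
    max_age = MAX_AGE[field]
    if max_age is None:
        return True
    return now - last_updated <= max_age

now = datetime(2024, 1, 1, 12, 0)
print(is_current("stock_quote", datetime(2024, 1, 1, 11, 0), now))  # False: 1h old
print(is_current("birthday", datetime(1990, 5, 4), now))            # True: never stale
```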
Consistency

The consistency dimension detects violations of semantic rules defined over a set of data items. It is similar to accuracy, but applies general rules that constrain which values are appropriate for a given field. For example, "the input for Age must be a number between 0 and 120", or "the input for Email must include an @".
These rules can be general (as in the examples above) or relational. Relational rules link different fields of the same tuple. An example of this would be: "if the marital status is married, age cannot be under 14".
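Both kinds of rules can be expressed as a simple validator over a record. The field names and thresholds mirror the examples in the text; the function itself is a minimal sketch, not a production rule engine.

```python
# Sketch of general and relational consistency rules over a single record.
def check_consistency(record: dict) -> list:
    violations = []
    # General rules: each constrains a single field.
    if not (0 <= record["age"] <= 120):
        violations.append("Age must be between 0 and 120")
    if "@" not in record["email"]:
        violations.append("Email must include an @")
    # Relational rule: links two fields of the same tuple.
    if record["marital_status"] == "married" and record["age"] < 14:
        violations.append("A married person cannot be under 14")
    return violations

print(check_consistency(
    {"age": 10, "email": "paul.example.com", "marital_status": "married"}
))
# -> both the email rule and the relational rule are violated
```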
Up to this point, we have covered the standard dimensions of DQ. However, these aren't the only ones: many other dimensions apply in more specific industries, or are becoming more popular as measurement methods emerge.
Let's look at two of the most popular examples.
Accessibility

Accessibility has been one of the most demanded dimensions in recent years, beyond the standard ones. It measures whether different users can access the data depending on specific conditions:
- Physical ability
The World Wide Web Consortium (W3C) provides guidelines for making data accessible.
Trustability

Another increasingly important dimension is trustability. As the volume of available data grows, its sources must be verified more rigorously for the data to be usable.
Priorities and measurement
Each dimension has a different priority for each industry or need, and this affects the trade-offs between them. Most commonly, the time-related dimensions must be balanced against the others: the more timeliness is prioritized, the less time remains for checking. Consistency, accuracy, and completeness are the dimensions most often confronted with timeliness; consistency versus completeness is another typical case where a hierarchy must be imposed.
What does this tell us?
Each project generates its own needs and its own standards for prioritizing and modeling data quality dimensions. For this reason, each team's goal is to develop the ideal measurement method for its project.
Usually, measurement methods are achieved through:
- Digital testing tools
- Open Source libraries
- In-house developed solutions
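As a taste of the in-house route, here is a minimal measurement sketch: a per-field completeness score over a small table. The sample records and field names are invented for illustration; real pipelines would compute many such metrics, one per dimension.

```python
# Minimal in-house DQ measurement: per-field completeness ratio.
records = [
    {"name": "Paul", "surname": "Smith", "email": "paul@example.com"},
    {"name": "John", "surname": "Doe",   "email": None},
    {"name": "Mary", "surname": None,    "email": "mary@example.com"},
]

def completeness_ratio(rows: list, field: str) -> float:
    """Fraction of rows in which `field` is populated."""
    filled = sum(1 for row in rows if row.get(field) is not None)
    return filled / len(rows)

for field in ("name", "surname", "email"):
    print(f"{field}: {completeness_ratio(records, field):.0%}")
# name scores 100%; surname and email each score 67%
```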
If you are looking for solutions to improve the quality of your data, don’t hesitate to ask. Our teams are specialized in developing customized solutions for each project’s needs.