Data Profiling

It pays to know exactly what state your data is in before cleansing and migrating it to a new PIM or MDM system.

Once the build phase of the PIM implementation project begins, it is crucial to take a close look at the current state of your existing data. This means discovering all data sources and analysing how far each one meets agreed thresholds of usability and quality. At Start with Data, we specialise in carrying out data profiling as part of your PIM implementation project.

What is data profiling?

Data profiling, as the name might suggest, is the process of gathering available data from an information source and compiling statistics or summarising information about that data. In terms of PIM and MDM, data profiling takes place during the discovery phase as a subset of the data migration project, itself a subset of a larger PIM implementation project.

Nowadays, data profiling uses software tools to discover and examine data quality issues such as duplication, incompleteness, inconsistency, and inaccuracy. The process is carried out by analysing data sources and gathering metadata that indicates the condition of the data, which allows the profiler (usually the data steward) to trace the origin of data-related errors. From such investigations, data profiling tools can provide statistical information, such as the degree of duplication or the ratios of attribute values, in graphical or tabular form. Essentially, data profiling sifts through all product data from existing sources to determine its quality.
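To make the kind of statistics mentioned above concrete, here is a minimal sketch in Python. The product records, the attribute names, and the helper functions are all hypothetical and chosen only for illustration. Commercial profiling tools report far more than this.

```python
from collections import Counter

# Hypothetical product records; None or "" stands for a missing value.
products = [
    {"sku": "A-100", "name": "Kettle", "colour": "Red"},
    {"sku": "A-101", "name": "Toaster", "colour": ""},
    {"sku": "A-100", "name": "Kettle", "colour": "Red"},  # repeated SKU
    {"sku": "A-102", "name": None, "colour": "red"},
]


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""


def profile_attribute(records, attribute):
    """Return simple quality statistics for one attribute."""
    values = [record.get(attribute) for record in records]
    present = [value for value in values if not is_missing(value)]
    counts = Counter(present)
    return {
        # Share of records in which the attribute is populated.
        "completeness": len(present) / len(values) if values else 0.0,
        # Number of different populated values.
        "distinct": len(counts),
        # Share of populated values that repeat an earlier value.
        "duplicate_ratio": (len(present) - len(counts)) / len(present) if present else 0.0,
        # Most frequent values, useful for spotting inconsistent spellings.
        "top_values": counts.most_common(3),
    }


report = {attr: profile_attribute(products, attr) for attr in ("sku", "name", "colour")}
for attr, stats in report.items():
    print(attr, stats)
```

In this sample, the colour attribute shows two distinct values ("Red" and "red") that a business user would probably regard as one, which is the kind of inconsistency a profile is meant to surface.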

Benefits of data profiling

The value of your data depends on how well you profile it. In many companies, only a small share of data meets minimum quality standards, and badly managed data can cost millions in lost time, money, and untapped revenue potential.

Data is an asset, but only to the extent that it is usable and adds value. That is why profiling is used to extract maximum value from data and generate a competitive advantage for product-centric companies.

While a data profiling application is running, it is constantly cleaning, updating, and analysing data to generate insights into the quality and credibility of the data in question. It can also support predictive decisions and help you act proactively when managing crises.

Manual data profiling

Although few companies still rely on it, traditional data profiling needed a skilled technician to query all data sources manually using Structured Query Language (SQL). Apart from being time-consuming, labour-intensive, and prone to human error, the main problem with manual profiling is the disconnect which often exists between the business user, who knows what the data should be, and the technician, who knows SQL but does not know what the data is for. In a nutshell, manual data profiling is neither advisable nor necessary, given the tools available on the market today.
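For a sense of what manual profiling involves, the sketch below runs two hand-written SQL queries from Python using the standard-library sqlite3 module. The products table, its columns, and the sample rows are assumptions made up for this example. A real project would repeat queries like these for every table and attribute in every source.

```python
import sqlite3

# Hypothetical source table held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT, name TEXT, brand TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("A-100", "Kettle", "Acme"),
        ("A-101", "Toaster", None),
        ("A-100", "Kettle", "Acme"),
        ("A-102", None, "acme"),
    ],
)

# COUNT(column) skips NULLs, so it gives the number of populated rows.
total, with_name, with_brand, distinct_skus = conn.execute(
    "SELECT COUNT(*), COUNT(name), COUNT(brand), COUNT(DISTINCT sku) FROM products"
).fetchone()

# SKUs that appear more than once.
duplicate_skus = conn.execute(
    "SELECT sku, COUNT(*) FROM products GROUP BY sku HAVING COUNT(*) > 1"
).fetchall()

print("rows:", total, "with name:", with_name, "with brand:", with_brand)
print("distinct SKUs:", distinct_skus, "duplicates:", duplicate_skus)
```

The queries report counts, but they cannot say whether "Acme" and "acme" are the same brand. That judgement needs the business user, which is the disconnect described above.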

Data profiling steps

There are generally three components of data profiling:

Structure discovery

Also known as structure analysis, this process determines whether data is correctly formatted and consistent by using basic statistics to extract information about validity.

Content discovery

This focuses on data quality. Data needs to be formatted, standardised, and integrated with existing data promptly and efficiently. For instance, a wrongly formatted customer address could result in a failed delivery and an inability to contact the customer in question.

Relationship discovery

This identifies the connections (relationships) among different sets of data.
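The three components above can be sketched as three small checks. This is an illustration under stated assumptions, not a description of any particular tool: the records, the category_id link, and the rule that an EAN is exactly 13 digits are all hypothetical, and the format rule does not verify the check digit.

```python
import re

# Hypothetical data sets: each product refers to a category by id.
products = [
    {"sku": "A-100", "ean": "5012345678900", "unit": "kg", "category_id": 1},
    {"sku": "A-101", "ean": "50123", "unit": "KG ", "category_id": 2},
    {"sku": "A-102", "ean": "5012345678917", "unit": "Kilogram", "category_id": 9},
]
categories = [{"id": 1, "name": "Kitchen"}, {"id": 2, "name": "Garden"}]

EAN_PATTERN = re.compile(r"\d{13}")  # assumed format rule: exactly 13 digits


def structure_discovery(records):
    """Structure: which SKUs have an EAN that fails the format rule?"""
    return [r["sku"] for r in records if not EAN_PATTERN.fullmatch(r["ean"])]


def content_discovery(records):
    """Content: which distinct spellings does the unit attribute use?"""
    return sorted({r["unit"] for r in records})


def relationship_discovery(records, parents):
    """Relationships: which SKUs point at a category that does not exist?"""
    known_ids = {parent["id"] for parent in parents}
    return [r["sku"] for r in records if r["category_id"] not in known_ids]


print("badly formatted EAN:", structure_discovery(products))
print("unit spellings:", content_discovery(products))
print("orphaned products:", relationship_discovery(products, categories))
```

Each check only reports what it finds. Deciding that "KG " and "Kilogram" should both become "kg" is a standardisation step that comes afterwards.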

The difference between data profiling and data cleansing

As we have seen, data profiling is a powerful way to analyse vast amounts of data to identify errors, missing information, and other anomalies affecting the quality of information. Profiling exposes underlying problems with your data that would otherwise stay hidden.

Data cleansing, on the other hand, is the step after profiling in a data migration project. You have identified the flaws you need to eliminate in your data (profiling) and can now take the necessary measures to cleanse your data of those impurities (cleansing).
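The split between the two steps can be shown in a few lines of Python. This is a minimal sketch with hypothetical records and rules: the profile function only reports problems, while the cleanse function acts on that report.

```python
def profile(records):
    """Report problems without changing the data."""
    issues = []
    seen_skus = set()
    for index, record in enumerate(records):
        if not record["name"].strip():
            issues.append((index, "missing name"))
        if record["sku"] in seen_skus:
            issues.append((index, "duplicate sku"))
        seen_skus.add(record["sku"])
    return issues


def cleanse(records, issues):
    """Return a new list with flagged duplicate rows dropped and names trimmed."""
    duplicate_rows = {index for index, problem in issues if problem == "duplicate sku"}
    return [
        {**record, "name": record["name"].strip()}
        for index, record in enumerate(records)
        if index not in duplicate_rows
    ]


# Hypothetical records from a legacy source.
records = [
    {"sku": "A-100", "name": " Kettle "},
    {"sku": "A-101", "name": ""},
    {"sku": "A-100", "name": "Kettle"},
]

issues = profile(records)           # what is wrong
cleaned = cleanse(records, issues)  # what we do about it

print(issues)
print(cleaned)
```

Note that the missing name on A-101 is reported but not fixed. Code cannot invent a product name, so that issue goes back to the business for a decision.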

The difference between data mining and data profiling

We already know that the reasons for data profiling are to discover information and to assess the quality of data in order to identify anomalies in a given data set. The end goal is to develop a knowledge bank of accurate information about your existing data, so that you are best prepared for subsequent cleansing, migration, and testing.

On the other hand, data mining as a process covers two areas. The first is predictive mining, which uses certain variables in a data set to estimate potential future values of other variables. The second is descriptive mining, which focuses on generating new insights from available data sets.

As such, the two activities have entirely different purposes. Profiling is preventative – it is primarily to ensure all data is as good as possible before migration. Mining is primarily a marketing task, where the aim is to gain actionable insights.

Data profiling tools

There are several options available to assist the complex process of data profiling. Generally, the more you pay, the wider the range of functionality included. Used well, these tools make profiling more efficient: they minimise resource use, widen the scope of discovery, and improve consistency across all data quality initiatives.

As the volume of cloud-based commercial data rises, effective data profiling is becoming increasingly critical. Data lakes that store massive amounts of data, together with the Internet of Things, make it possible to satisfy the enormous appetite for product data by collecting it from an increasing and changing variety of sources.

Data Profiling Software

Data profiling software has become increasingly automated in recent years, adding to its effectiveness as a tool.

Three examples of data profiling

As a component of a PIM implementation project, the value of data profiling can be seen in the following brief examples.

Data Warehousing / Business Intelligence (DW/BI) projects

These projects generally involve collecting and collating data from various systems for reporting and analysis. Data profiling helps by:

  • Identifying data quality problems in the source system which require attention
  • Pinpointing issues which can be rectified during ETL processing
  • Flagging major risks requiring a significant rethink of the project as a whole

Data conversion / Migration projects

Moving data from a legacy system to a new one carries inherent risks. Data profiling mitigates this project risk by:

  • Identifying data quality issues early, so that they can be handled in the code used to move data from the old to the new system
  • Identifying issues which could make a change to the target system necessary

Source System Data Quality Initiative

The aim of these initiatives is to evaluate and then enhance data quality in a given source system. Profiling is used to repair existing problems and to prevent their recurrence by using predictive insights. As such, the cost of profiling is offset by the ROI generated by:

  • Pinpointing and highlighting the system location which is suffering most from severe and/or multiple data quality problems
  • Isolating problems emerging from bad user input or faulty system interfaces
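As an illustration of the two points above, profiling results can be tallied by where each problem came from. The issue log below is hypothetical, including the source names and the "interface" and "user input" channel labels. It simply shows how a count per source and per entry channel highlights where to focus first.

```python
from collections import Counter

# Hypothetical issue log: (source system, entry channel, problem found).
issues = [
    ("ERP", "interface", "missing weight"),
    ("ERP", "interface", "invalid unit"),
    ("Webshop", "user input", "duplicate sku"),
    ("ERP", "user input", "missing name"),
    ("Spreadsheet", "user input", "invalid unit"),
]

# Tally problems by the system they were found in and by how they got there.
by_source = Counter(source for source, _, _ in issues)
by_channel = Counter(channel for _, channel, _ in issues)

# The source with the highest number of recorded problems.
worst_source, worst_count = by_source.most_common(1)[0]

print("problems per source:", dict(by_source))
print("problems per channel:", dict(by_channel))
print("worst source:", worst_source, worst_count)
```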

Data profiling services

As part of our services, product data profiling is a key element of the data migration strategy we create and develop. Before mapping, cleansing, and transforming product data, we carry out a thorough profiling process. With these interconnected steps, we ensure your new data model and go-live PIM solution have the highest quality data to deploy.

Get in touch with us to have a more in-depth conversation about your data profiling needs.

Find out more

If you would like to find out more about how product data management, PIM and MDM can create value for your business, we’d love to hear from you – Ben Adams, CEO Start with Data
