Data cleaning techniques for product data
If there is one key element of a PIM or MDM implementation project, it is data quality. Without a high quality threshold in place from the outset, the project will be plagued by the compounding problems that dirty data creates. Preventative measures at the discovery phase of the project include data profiling, which identifies a range of potential problems. Once substandard data has been identified, the process of cleaning the data (also known as cleansing) can take place.
Why is data cleaning necessary?
If your business aspires to be data-driven, recognise that you are sitting on a very valuable asset. However, data is a double-edged sword: whereas clean, high-quality data enhances business performance and enables you to achieve your goals, dirty data can cause serious damage to your reputation, efficiency, finances, operations, and competitiveness.
Data cleaning techniques
Removing unwanted observations
This involves deleting duplicate, redundant, or irrelevant values from your dataset.
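As a minimal sketch, de-duplication can be as simple as keeping the first record seen for each unique key. The dict-based records and the `sku` field below are illustrative, not a reference to any particular PIM schema:

```python
# De-duplicate product records by SKU, keeping the first occurrence.
# Field names and values are illustrative.
records = [
    {"sku": "A100", "name": "Cordless Drill"},
    {"sku": "A100", "name": "Cordless Drill"},  # exact duplicate
    {"sku": "B200", "name": "Claw Hammer"},
]

seen = set()
deduped = []
for rec in records:
    if rec["sku"] not in seen:
        seen.add(rec["sku"])
        deduped.append(rec)

print(len(deduped))  # 2 unique products remain
```

In practice, deciding which copy of a duplicate to keep (the newest? the most complete?) is a business rule that a subject matter expert should confirm.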
Correcting structural errors
Errors can emerge when data is measured or transferred. These are known as structural errors. They may include typos in the product data (such as misspelt feature names, the same attribute appearing under different names, mislabelled classes, or inconsistent capitalisation).
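A hedged sketch of fixing two of these structural errors: inconsistent capitalisation, and the same attribute appearing under different names (here "color" vs "colour"). The alias table and field names are assumptions for illustration:

```python
# Map known attribute-name variants to one canonical name.
# Aliases and field names are illustrative.
ATTRIBUTE_ALIASES = {"color": "colour", "colour": "colour"}

raw = {"Color": " Red ", "WEIGHT_KG": "1.5"}

clean = {}
for key, value in raw.items():
    normalised = key.strip().lower()                       # fix capitalisation
    canonical = ATTRIBUTE_ALIASES.get(normalised, normalised)  # unify names
    clean[canonical] = value.strip() if isinstance(value, str) else value

print(clean)  # {'colour': 'Red', 'weight_kg': '1.5'}
```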
Managing unwanted outliers
An outlier is an observation which is an abnormal distance from other values in a random sample from a dataset. Overall, it is better not to remove outliers unless there is a legitimate reason to do so. Removing them may improve performance, but not always.
Handling missing data
Missing data can be a tricky issue, especially when using machine learning. Missing data should not simply be ignored or removed, as the absence itself may be an indication of something significant. Therefore, involvement of the relevant subject matter expert is essential.
As the saying goes: “‘missingness’ is almost always informative in itself, and you should tell your algorithm if a value was missing”.
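One simple way to "tell your algorithm" a value was missing is to record the fact in a separate flag before any imputation happens. A minimal sketch, with illustrative field names:

```python
# Preserve missingness as an explicit flag instead of silently
# dropping the record. Field names are illustrative.
records = [
    {"sku": "A100", "weight_kg": 1.5},
    {"sku": "B200", "weight_kg": None},  # weight was never supplied
]

for rec in records:
    rec["weight_kg_missing"] = rec["weight_kg"] is None

print(records[1]["weight_kg_missing"])  # True
```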
Imputing missing values
Imputing missing values means substituting an approximate value, calculated using a method such as linear regression or the median. Of course, this method has risks, as we cannot be certain that the imputed value is the genuine one.
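A sketch of the median variant, assuming a simple list of weights with gaps. The imputed figure is an estimate, not the genuine value:

```python
import statistics

# Impute missing weights with the median of the known values.
weights = [1.5, None, 2.0, 1.8, None]

known = [w for w in weights if w is not None]
median = statistics.median(known)  # median of 1.5, 2.0, 1.8 -> 1.8
imputed = [w if w is not None else median for w in weights]

print(imputed)  # [1.5, 1.8, 2.0, 1.8, 1.8]
```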
Converting data types
Data types should be uniform across datasets. There are several things to bear in mind when converting data types:
- Numeric values should be kept as numeric
- If a specific data value can’t be converted, it is best to enter ‘NA’, together with a warning that this particular value could be wrong
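The two rules above can be sketched as a small conversion helper: numeric strings become numbers, and anything that cannot be converted is recorded as 'NA' with a warning rather than a silent guess. The function name and sample data are illustrative:

```python
import warnings

def to_number(value):
    """Convert to float; record 'NA' and warn when conversion fails."""
    try:
        return float(value)
    except (TypeError, ValueError):
        warnings.warn(f"Could not convert {value!r}; recorded as NA")
        return "NA"

raw_prices = ["10.99", "12.50", "n/a"]
converted = [to_number(v) for v in raw_prices]

print(converted)  # [10.99, 12.5, 'NA']
```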
Data cleaning and data mining
Data mining is the process of extracting valuable insights to inform business decisions and strategy, while data cleaning is the process of removing bad data, organising the raw data, and making it fit for use. Essentially, data cleaning prepares data for mining, which is when the most valuable information can be drawn from the dataset.
Data cleaning steps in data mining are time-consuming, and this has historically created a dilemma for data specialists in terms of having sufficient staff or time to clean the data. However, without high-quality data, the insights drawn from it will certainly be severely compromised by inaccuracy, inconsistency, and other issues.
Data cleansing tools
There is a wide range of data cleaning tools on the market. Whether a given tool suits your requirements is a question of getting expert advice and doing your own research. The most commonly used data cleansing tools offer some or all of the following features:
- Auditing capability: having an overview of where and when changes were made to a record is essential for internal and external auditing and compliance
- Compatibility and integrations: a tool able to work with all data sources used by your business for operational activities
- Cloud vs on-premise: cloud-based cleaning tools offer more choice and affordability for businesses which have limited resources
- Metadata support: full and complete metadata helps provide full insight powered by valuable analytical data for data scientists and other business users
- Compatibility with different sources: where is your data being extracted from? Are there multiple sources? These questions impact on how long it takes to prepare for and run processes
- Batch processing capabilities: being able to schedule regular bulk data cleaning in advance can help in guaranteeing ongoing data quality
Data cleaning and machine learning
When it comes to cleaning data for machine learning, the use of artificial intelligence and machine learning tools in PIM and MDM systems has massive benefits. They allow us to organise product information across large volumes of product metadata, they can analyse data and match it against the statistical rules governing compliance and quality, and they can speed up a wide range of processes, saving time and money.
However, machine learning is not ‘intelligent’ as such. It can only function effectively if the ‘raw material’ with which it is working is of a high quality. All users of product data and metadata are entering prices, data assets, fact sheets and parameters for channel, legal and regulatory compliance. If any of these are incorrect, inaccurate, incomplete, or invalid, the machine learning tools are working on a false premise and problems will inevitably emerge. We cannot, therefore, skimp on data cleaning simply because we assume our algorithms will put it all right. The algorithms we use can be powerful, but without the right, relevant training data, a system will most likely fail to deliver optimal performance.
Selecting the right data cleansing tool can appear difficult, but if you do your research and take the advice of a trusted third-party expert like Start with Data, it ends up being a tremendously effective way of achieving high-quality data and ensuring your MDM solution hits the ground running after go-live.
Find out more
If you would like to find out more about how product data management, PIM and MDM can create value for your business, we’d love to hear from you – Ben Adams, CEO Start with Data