
Why product data clean-up takes longer than expected

In a nutshell, product data clean-up takes longer than you expect because the errors you can see (such as duplicates and missing fields) are only a small part of the overall problem. The real drag factors are:

  • The long tail
  • Inconsistent attributes
  • Legacy rules
  • Supplier variation
  • Exception handling

It’s hard to capture all of these in your initial time estimate. Overrunning clean-up work is an operational issue: it blocks trading, drains internal teams, and keeps bad data flowing into channels.

We’ve written this article to outline why timelines slip, what’s usually missed at the scoping stage, and how to structure your clean-up work so projects stop expanding mid-delivery.

Why time lags happen

Data failure isn’t just a matter of poor formatting or missing values. It’s accumulated inconsistency across your product records, source systems, and operating teams.

The operational consequence? Constant rework, where teams consume time fixing obvious fields but then find themselves pulled into attribute mapping, supplier clarification, approval delays, and further manual checks.

It doesn’t take a rocket scientist to imagine the commercial impact. Other projects get sidelined or run over time and budget, product onboarding remains slow and cumbersome, channel errors persist, and the benefits your business expected from the clean-up are delayed…yet again.

Most digital merchants scope clean-up projects around things that are easily (and quickly) countable: low-hanging fruit like missing dimensions, duplicate values, bad titles, and empty fields. These are real issues, granted, but rarely the full story. Once remedial work starts, your teams find that identical attributes are defined differently by category, that the same value exists in multiple formats, and that similar-looking records behave differently downstream.

A thousand mini-problems become one BIG problem

It’s the long tail of data complexity – not a single big problem, but thousands of little ones.

Look at these examples:

  • A “material” field might hold the following information: cotton, Cotton, 100% cotton, textile blend, supplier code text, and an obsolete internal label.
  • A product family uses one size model in the ERP, another in the PIM, and a third in marketplace templates.
  • Supplier spreadsheets appear to be complete, but certain key values fail validation because the accepted format isn’t documented.

None of these issues looks unmanageable in isolation. Aggregate them, however, and the hours needed to fix them soon consume weeks; the sketch below shows why even one of them is fiddlier than it looks.
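
To make that concrete, here is a minimal Python sketch of what cleaning just the “material” field above involves. The canonical values and synonym map are hypothetical, but the shape of the work (a lookup table plus a bucket for manual review) is typical:

```python
# A minimal sketch of normalising one free-text "material" field.
# The canonical values and synonym map are hypothetical examples,
# not a real taxonomy.

CANONICAL_MATERIALS = {"cotton", "textile blend"}

SYNONYMS = {
    "100% cotton": "cotton",
    "ctn": "cotton",       # hypothetical supplier code text
    "mat-old-7": None,     # obsolete internal label: park for human review
}

def normalise_material(raw):
    """Return a canonical material value, or None if the record
    needs manual adjudication."""
    value = raw.strip().lower()
    if value in CANONICAL_MATERIALS:
        return value
    # Unknown values also fall through to None, i.e. manual review.
    return SYNONYMS.get(value)

for raw in ["cotton", "Cotton", "100% cotton", "textile blend", "MAT-OLD-7"]:
    print(f"{raw!r:>16} -> {normalise_material(raw)!r}")
```

Multiply this by every free-text attribute in the catalogue and the long tail becomes visible.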

In fact, this is why the first category you tackle often takes far longer than planned. It becomes the place where your business discovers that its own rules are incomplete.

The burden of the past

Legacy decisions are a major cause of hidden problems. Your product data has passed through multiple systems, imports, and team members’ hands over the years. Along the way, fields were repurposed, validation rules bypassed, and attribute definitions changed without a controlled update to templates or enrichment workflows. The clean-up initiative exposes all of these anomalies at once, turning what looked like a straightforward correction task into a far more daunting reconstruction.

Who actually owns all this data?

Who knows? Ownership ambiguity makes things worse. Clean-up requires decisions, and an awful lot of businesses have no clear ownership trail upon which to base those decisions:

  • Merchandising might own commercial copy (but check that)
  • Compliance are responsible for regulated attributes (we think)
  • eCommerce owns category structure, right?
  • It’s operations who manage supplier onboarding

There’s no agreed, single approval gate for data quality thresholds, so every exception stalls. Teams end up spending more time arguing about who has the final say than applying the fix.

Suppliers, AI, and quality: More complicating factors

Factor in supplier data and timelines stretch further. Most businesses aren’t just cleaning one catalogue; they’re attempting to reconcile many versions of the truth. Each supplier template may carry different column structures, naming conventions, units, completeness thresholds, and interpretations of required fields. Unless attribute definitions, validation rules, and ingestion controls already exist, the only way to harmonise this chaotic input into a standardised model is by hand.
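
As an illustration, here is a sketch of what reconciling two suppliers’ templates into one canonical model can look like. The supplier names, column headings, and unit conversions are invented for the example:

```python
# A sketch of reconciling two hypothetical supplier templates into one
# canonical model. Supplier names, column headings, and unit conversions
# are invented for the example.

CANONICAL_FIELDS = {"sku", "weight_kg", "colour"}

# Per-supplier mapping: their column name -> (canonical field, converter)
SUPPLIER_MAPPINGS = {
    "supplier_a": {
        "SKU": ("sku", str),
        "Weight (kg)": ("weight_kg", float),
        "Colour": ("colour", str),
    },
    "supplier_b": {
        "item_no": ("sku", str),
        "weight_g": ("weight_kg", lambda grams: float(grams) / 1000),
        "color": ("colour", str),
    },
}

def harmonise(supplier, row):
    """Translate one supplier row into the canonical model, flagging
    anything the mapping doesn't cover."""
    mapping = SUPPLIER_MAPPINGS[supplier]
    out, unmapped = {}, []
    for column, value in row.items():
        if column in mapping:
            field, convert = mapping[column]
            out[field] = convert(value)
        else:
            unmapped.append(column)
    out["_missing"] = sorted(CANONICAL_FIELDS - set(out))
    out["_unmapped"] = unmapped
    return out

print(harmonise("supplier_b", {"item_no": "AB-1", "weight_g": "850", "finish": "matt"}))
# {'sku': 'AB-1', 'weight_kg': 0.85, '_missing': ['colour'], '_unmapped': ['finish']}
```

Multiply this across dozens of suppliers and the manual effort, and the case for documented mappings, becomes obvious.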

Nowadays, the obvious solution is automation, but beware: automation doesn’t remove this complexity at the start. Rules-based engines and AI can only accelerate repetitive corrections once the underlying data quality standard is stable. If your taxonomy is inconsistent, your attribute values ambiguous, or your exception logic undefined, automated correction simply applies existing errors more efficiently and rapidly. Teams then stop trusting the data, add manual workarounds and extra review, and the clean-up slows down again.

Another reason timelines slip is that quality thresholds rise during delivery. At the beginning, the goal is usually framed as “tidying up.” However, once key stakeholders see the first corrected records, expectations rise. They ask for:

  • Richer content
  • Better filter coverage
  • Cleaner digital assets
  • Stronger SEO fields
  • Improved marketplace acceptance
  • Fewer exceptions downstream

All reasonable expectations, right? But they expand the aim of the project from correction to the larger work of standardisation and enforcement. This is the point at which many clean-ups lose control of scope: the business still treats the effort as a finite clean-up, while the actual requirement has become an operational redesign.

The remedy: SSG – Get your data stable, standardised, and governed

1. Stabilise the highest-risk data

Prioritise the categories, attributes, and suppliers causing the most rework, feed failures, onboarding delays, or customer-facing errors (a simple way to do this is sketched after the list below). Then:

  • Freeze unnecessary variation
  • Remove duplicate fields
  • Isolate records requiring manual adjudication

This reduces the ‘background noise’ and gives your teams a usable working baseline.
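
As a starting point for that prioritisation, here is a short sketch of ranking attributes by how often they appear in downstream failures. The failure log is invented for illustration:

```python
# A sketch of the prioritisation behind step 1: rank attributes by how
# often they cause downstream failures, so clean-up starts where rework
# is highest. The failure log is invented for illustration.

from collections import Counter

failure_log = [
    ("material", "marketplace rejection"),
    ("material", "feed validation error"),
    ("size", "feed validation error"),
    ("material", "onboarding delay"),
    ("colour", "marketplace rejection"),
]

failures_by_attribute = Counter(attribute for attribute, _ in failure_log)
for attribute, count in failures_by_attribute.most_common():
    print(f"{attribute}: {count} failure(s)")
# material: 3 failure(s), size: 1 failure(s), colour: 1 failure(s)
```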

2. Standardise the model

Now you’re in a position to define:

  • Approved attribute definitions
  • Accepted values (such as units of measure)
  • Formatting rules
  • Supplier templates
  • Enrichment workflows
  • Approval gates

It is critically important to document what ‘good’ actually looks like; one lightweight way of doing so is sketched below. Additionally, make exception criteria explicit (which is where the long tail becomes manageable, because edge cases can be classified instead of repeatedly debated).
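
Here is what a machine-readable version of ‘good’ can look like, sketched as a simple data dictionary. The field names, bounds, accepted values, and pattern are illustrative assumptions, not a standard:

```python
# A hypothetical data dictionary: one explicit, machine-readable place
# that records what "good" means for each attribute.

ATTRIBUTE_DEFINITIONS = {
    "weight_kg": {
        "type": float,
        "unit": "kg",
        "min": 0.001,        # explicit bounds instead of tribal knowledge
        "required": True,
    },
    "colour": {
        "type": str,
        "accepted_values": {"black", "white", "red", "blue"},
        "required": True,
    },
    "ean": {
        "type": str,
        "pattern": r"^\d{13}$",   # a 13-digit EAN
        "required": False,
    },
}
```

Because the definitions live in one structure, the same source of truth can drive supplier templates, enrichment workflows, and the entry-gate validation in step 3.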

3. Enforce these standards at entry

Apply validation rules in the PIM, in your supplier submission process, and in internal workflows, so that bad data is rejected before it can re-enter core systems. For example:
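
A minimal sketch of such an entry gate, reusing the shape of the hypothetical definitions above; validate() returns the reasons a record should be rejected before it reaches core systems:

```python
# A sketch of a validation gate at the point of entry. The rules are
# hypothetical and mirror the data dictionary sketched earlier.

import re

DEFINITIONS = {
    "weight_kg": {"type": float, "min": 0.001, "required": True},
    "colour": {"type": str, "required": True,
               "accepted_values": {"black", "white", "red", "blue"}},
    "ean": {"type": str, "pattern": r"^\d{13}$", "required": False},
}

def validate(record):
    """Return the reasons this record should be rejected (empty if clean)."""
    errors = []
    for field, rules in DEFINITIONS.items():
        value = record.get(field)
        if value is None:
            if rules.get("required"):
                errors.append(f"{field}: missing required value")
            continue
        if not isinstance(value, rules["type"]):
            errors.append(f"{field}: expected {rules['type'].__name__}")
            continue
        if "min" in rules and value < rules["min"]:
            errors.append(f"{field}: below minimum {rules['min']}")
        if "accepted_values" in rules and value not in rules["accepted_values"]:
            errors.append(f"{field}: not an accepted value")
        if "pattern" in rules and not re.fullmatch(rules["pattern"], value):
            errors.append(f"{field}: bad format")
    return errors

print(validate({"weight_kg": 0.5, "colour": "teal"}))
# ['colour: not an accepted value']
```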

A clean-up project without enforcement simply guarantees the mess will accumulate again. Rules and gatekeepers turn what was a one-off project into part of your controlled operations.

What you gain

Once you can be confident that the core product data is reliable, the measurable outcomes follow:

  • Greatly reduced need for manual rework
  • Faster onboarding for new SKUs
  • Fewer channel rejection errors
  • Better search and filter performance (so, better CX)
  • More consistent marketplace submissions (leading to far fewer rejections)
  • A more robust foundation for implementing AI-driven enrichment and automation tasks

Final words – What next?

Many businesses, especially distributors, grossly underestimate the long tail’s impact on data management while overestimating how much they can fix with bulk rules alone. The issue isn’t a lack of effort; it’s misplaced effort caused by ungoverned complexity.

The next step is to assess the true scope of the issue before more time is lost. Reach out to us today at Start with Data and book a data assessment. We’ll put our expertise to use to identify hidden complexities, prioritise what needs fixing, and recommend how to approach it, then work with you to build a realistic remediation plan.