Automating 550 Retailer Product Feeds: What I Learned Building an Ecommerce Data Pipeline

A few years into my time at a large ecommerce platform serving the Gulf region, one of my main mandates was understanding how product data actually got onto the site. The answer, as I found out quickly, was "mostly by hand." A team of data entry staff would log into retailer portals, copy product information, map it to the platform's catalogue fields, and paste it in. For a platform serving millions of users with inventory from hundreds of retailers across the Middle East, this was not going to scale.

Over the following months, I built the automation pipeline that replaced most of that manual process. By the time it was running properly, we had reduced manual data entry by about 40% and were keeping product feeds from over 550 retailers continuously updated. This is what I learned — and what I'd do differently if I were starting it again.

The Problem Was Actually Three Problems

The first instinct is to see a data import problem as a single technical challenge: write code that reads data from source A and puts it into database B. But it's almost never that clean.

In this case, the actual problem was three separate problems stacked on top of each other. First, the data came in wildly different formats — CSV, XML, JSON, proprietary APIs, scraped HTML, and in a handful of cases, literal FTP folders where retailers would drop Excel files. Second, the data quality varied enormously. Some retailers had clean, consistent SKUs and properly structured product titles. Others had duplicates, missing required fields, inconsistent category names, and Arabic text mixed with English in the same spreadsheet without proper encoding. Third, the platform's own Magento catalogue had a specific structure — attribute sets, configurable products, custom options — that didn't map cleanly to most retailers' flat catalogue formats.

Any solution had to address all three. I've seen teams try to solve the first problem (format ingestion) and ignore the second (data quality), and the result is a very fast pipeline that floods your database with garbage.

The Architecture I Ended Up With

I built the pipeline in PHP with a Laravel-based job system, running on scheduled tasks via the Artisan scheduler. Each retailer had a dedicated "connector" class implementing a common interface. The connector's job was one thing only: fetch the raw data from wherever the retailer put it and return it as a normalised PHP array with a fixed schema.
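A minimal sketch of that contract might look like the following. The interface and class names here are illustrative, not the production code:

```php
<?php
// Hypothetical sketch of the connector contract described above.
// Names and the schema shape are illustrative.

interface RetailerConnector
{
    /**
     * Fetch the retailer's raw feed (CSV, XML, API, FTP drop, etc.)
     * and return rows normalised to a fixed schema.
     *
     * @return array<int, array{sku: string, title: string, price: float,
     *                          category: string, in_stock: bool}>
     */
    public function fetch(): array;
}

// Example: a CSV-based connector for one retailer whose feed uses
// columns named sku, title, price, dept, and qty.
final class CsvFeedConnector implements RetailerConnector
{
    public function __construct(private string $feedPath) {}

    public function fetch(): array
    {
        $rows = [];
        $handle = fopen($this->feedPath, 'r');
        $header = fgetcsv($handle);
        while (($line = fgetcsv($handle)) !== false) {
            $raw = array_combine($header, $line);
            $rows[] = [
                'sku'      => trim($raw['sku']),
                'title'    => trim($raw['title']),
                'price'    => (float) $raw['price'],
                'category' => trim($raw['dept']),
                'in_stock' => (int) $raw['qty'] > 0,
            ];
        }
        fclose($handle);
        return $rows;
    }
}
```

The payoff of the fixed return schema is that everything downstream of the connector is retailer-agnostic.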

That normalised array then went through a validation and transformation layer. This is where we enforced required fields (SKU, title, price, category, stock status) and applied the retailer-specific mapping rules. For example, Retailer A might call their category column "dept" and use codes like "EL-TV" for televisions, while Retailer B has a text column "category_name" with "Televisions & AV." The mapping layer translated both into the platform's internal category IDs.
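A sketch of that mapping step, with invented retailer codes and category IDs, could be as simple as a per-retailer lookup table with normalisation applied before the lookup:

```php
<?php
// Illustrative category-mapping step: per-retailer rules translate
// whatever the source feed calls a category into an internal category
// ID. The codes and IDs below are invented for the example.

final class CategoryMapper
{
    /** @param array<string, int> $rules map of source value => internal category ID */
    public function __construct(private array $rules) {}

    public function map(string $sourceValue): ?int
    {
        // Normalise before lookup so "EL-TV", " el-tv " etc. all match.
        return $this->rules[strtolower(trim($sourceValue))] ?? null;
    }
}

// Retailer A uses codes, Retailer B uses display names; both resolve
// to the same internal category ID.
$retailerA = new CategoryMapper(['el-tv' => 101]);
$retailerB = new CategoryMapper(['televisions & av' => 101]);
```

Returning null for an unknown value, rather than guessing, is what routes the record to the review queue instead of into the catalogue.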

Only after passing validation did records queue for import into Magento. Failed records went to a review queue with a reason code so the data team could investigate without stopping the entire feed.
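A validation step of this shape might return machine-readable reason codes per record; an empty result means the record can be queued for import. The codes here are illustrative:

```php
<?php
// Sketch of the validation step feeding the review queue. Each failed
// record carries reason codes so the data team can filter and triage.
// The specific codes are invented for this example.

function validateRecord(array $record): array
{
    $reasons = [];

    // Required text fields must be present and non-empty.
    foreach (['sku', 'title', 'category'] as $field) {
        if (empty($record[$field])) {
            $reasons[] = 'MISSING_' . strtoupper($field);
        }
    }

    // Price must be a positive number.
    if (!isset($record['price']) || !is_numeric($record['price']) || $record['price'] <= 0) {
        $reasons[] = 'INVALID_PRICE';
    }

    // Stock status must be explicitly set, even if false.
    if (!isset($record['in_stock'])) {
        $reasons[] = 'MISSING_STOCK_STATUS';
    }

    return $reasons; // empty array => safe to queue for import
}
```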

The Normalisation Problem (and Why It Takes Longer Than You Think)

The most time-consuming part of the project wasn't writing code — it was building the mapping rules for each retailer. Every connector took one to three days to develop, and we had over 550 to do. I couldn't do that alone, so I built a simple admin interface where the data team could draft mapping rules without touching code, and I would review and deploy them.

Arabic text handling was a consistent pain point. PHP's native string functions are byte-oriented and mishandle multibyte text, RTL or otherwise, and when you're building product titles that might contain both Arabic and English, you have to be deliberate about encoding at every step. We ended up standardising on UTF-8 throughout, with explicit charset declarations in every database connection; that sounds obvious, but it wasn't documented anywhere in the existing codebase.
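Two habits that make mixed Arabic and English data predictable, sketched below; the connection details are invented for illustration:

```php
<?php
// 1. Declare the charset explicitly on every database connection;
//    never rely on the server default. (Credentials are placeholders.)
function connect(): PDO
{
    return new PDO(
        'mysql:host=localhost;dbname=catalog;charset=utf8mb4',
        'app_user',
        'secret'
    );
}

// 2. Use the mbstring functions for any work on product text. PHP's
//    byte-oriented functions miscount multibyte characters, which
//    matters for truncation, display limits, and comparisons.
$title = 'تلفزيون Samsung 55';
$bytes = strlen($title);             // byte length: wrong for character limits
$chars = mb_strlen($title, 'UTF-8'); // character length: what you actually want
```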

Duplicate detection was the other significant challenge. Retailers often sell the same product — same brand, same model number — and both would appear in our database as separate listings if we weren't careful. I implemented a fuzzy matching step that compared normalised product names and EAN codes (when available) against existing catalogue entries before inserting new records. It wasn't perfect, but it reduced the duplicate insert rate from roughly 12% to under 2%.
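The shape of that check can be sketched as follows: an exact EAN match wins when both sides have one, otherwise normalised titles are fuzzy-compared. The threshold and normalisation rules here are illustrative, not the production values:

```php
<?php
// Sketch of the duplicate check described above. Threshold and
// normalisation are invented for the example.

function normaliseTitle(string $title): string
{
    // Lowercase and collapse whitespace so trivial variants match.
    $t = mb_strtolower(trim($title), 'UTF-8');
    return preg_replace('/\s+/u', ' ', $t);
}

/** @param array<int, array{title: string, ean: ?string}> $catalogue */
function isLikelyDuplicate(array $candidate, array $catalogue, float $threshold = 90.0): bool
{
    foreach ($catalogue as $existing) {
        // Exact EAN match is the strongest signal when available.
        if ($candidate['ean'] !== null && $candidate['ean'] === $existing['ean']) {
            return true;
        }
        similar_text(
            normaliseTitle($candidate['title']),
            normaliseTitle($existing['title']),
            $percent
        );
        if ($percent >= $threshold) {
            return true;
        }
    }
    return false;
}
```

In practice the fuzzy pass runs against a pre-filtered candidate set (same brand or category), not the whole catalogue, or it becomes the slowest step in the pipeline.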

What the 40% Reduction Actually Looked Like

I want to be specific about this because "40% reduction in manual work" sounds like a marketing claim. Here's what it actually meant in practice.

Before the pipeline, the data team spent roughly 6 hours per day on feed updates — pulling prices and stock levels from retailer portals, updating records in Magento, and chasing retailers for updated files. After the pipeline, that dropped to about 3.5 hours per day — primarily handling exceptions, reviewing failed records, and onboarding new retailers. The pipeline handled routine updates automatically every four hours for high-volume retailers and every 24 hours for smaller ones.
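In Laravel's Artisan scheduler, the two cadences above come down to a pair of entries like this fragment from a console kernel; the job class and tier names are invented:

```php
<?php
// Illustrative scheduler entries (app/Console/Kernel.php).
// SyncRetailerFeeds and the tier names are hypothetical.

use Illuminate\Console\Scheduling\Schedule;

protected function schedule(Schedule $schedule): void
{
    // High-volume retailers: refresh every four hours.
    $schedule->job(new SyncRetailerFeeds('high_volume'))
             ->everyFourHours()
             ->withoutOverlapping();

    // Smaller retailers: once a day is enough.
    $schedule->job(new SyncRetailerFeeds('standard'))
             ->daily()
             ->withoutOverlapping();
}
```

The `withoutOverlapping()` guard matters more than it looks: a slow retailer API can easily make one run overlap the next.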

The business impact was visible in catalogue freshness. Before automation, price updates from retailers could take two to three days to appear on the platform. After, they typically showed within four hours. For a platform where price competitiveness matters, that gap is significant.

The Mistakes I Made

I underestimated feed stability. I assumed retailers would maintain their feed formats once we'd integrated them. They don't. Retailers change their export formats, rename columns, restructure their category trees, and occasionally stop providing feeds entirely without warning. I needed to build monitoring into the pipeline from day one — alerts when a feed fails to update for more than X hours, when record counts drop significantly, or when a previously reliable field starts coming through empty.

I added that monitoring in week four, after a retailer's feed had silently failed for two days and we only noticed because their products started showing as out of stock. Add monitoring before you go live, not after.
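The three checks described above can be sketched as one health function over per-feed stats; the thresholds and field names here are illustrative:

```php
<?php
// Sketch of the feed-health checks: staleness, volume drop, and field
// decay. Thresholds are invented for the example.

function feedAlerts(array $feed): array
{
    $alerts = [];

    // 1. Staleness: the feed hasn't updated within its expected window.
    if (time() - $feed['last_success_at'] > $feed['max_age_seconds']) {
        $alerts[] = 'FEED_STALE';
    }

    // 2. Volume drop: record count fell sharply versus the last run.
    if ($feed['previous_count'] > 0
        && $feed['current_count'] < 0.5 * $feed['previous_count']) {
        $alerts[] = 'RECORD_COUNT_DROP';
    }

    // 3. Field decay: a previously reliable field is now mostly empty.
    if ($feed['empty_price_ratio'] > 0.2) {
        $alerts[] = 'FIELD_MOSTLY_EMPTY:price';
    }

    return $alerts;
}
```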

I also over-engineered the connector interface initially. My first design had connectors responsible for transformation logic as well as data fetching, which made them large, hard to test, and difficult to hand off to junior developers. Splitting the concern — connectors fetch, transformers transform — made everything cleaner and significantly easier to maintain.

What I'd Do If I Were Starting Over

Start with three connectors, not fifty. Pick the three messiest retailers — different formats, worst data quality — and build the full pipeline end to end for just those three. You'll discover the real architectural decisions before you've committed to anything. The patterns that emerge from your hardest cases will shape your architecture better than any amount of abstract planning.

Build the monitoring and exception dashboard on day one. Not as an afterthought. Your data team needs visibility into what's failing and why, and that visibility buys back hours every week.

Don't conflate ingestion speed with import speed. Fetching a retailer's feed is fast. The Magento import — creating configurable products, assigning attribute values, reindexing — is slow. Separate those into different job queues with different priorities, and consider whether you need all records to import synchronously or whether near-realtime (four to six hours) is perfectly acceptable for your use case.
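In Laravel terms, that separation is two lines at dispatch time plus worker configuration; the job class names here are hypothetical:

```php
// Illustrative split between a fast "fetch" queue and a slow "import"
// queue. Job class names are invented. Workers can then be scaled and
// prioritised independently, e.g.:
//   php artisan queue:work --queue=fetch
//   php artisan queue:work --queue=import

FetchFeedJob::dispatch($retailerId)->onQueue('fetch');
ImportToMagentoJob::dispatch($validatedBatch)->onQueue('import');
```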

If you're building something similar — retail data pipelines, catalogue automation, multi-source ETL in PHP — I'm happy to compare notes. Reach out here.

Syed Hamid Ali Shah — Senior Full Stack Developer

Syed Hamid Ali Shah is a Senior Full Stack Developer based in Karachi, Pakistan, with 10+ years of experience building enterprise ecommerce platforms and SaaS applications. He has worked with clients in the US, UK, Canada, and Middle East, delivering HIPAA/GDPR compliant solutions using Laravel, PHP, Magento, and modern JavaScript frameworks. He currently maintains platforms serving millions of users.
