Do you really need microservices for your ML workload?
In data and ML projects, monolithic pipelines can become slow, fragile, and hard to scale, which often pushes teams toward a microservices migration. Over the years, I've seen enough of these migrations to know that success depends more on reasoning and readiness than on the technology itself. Here are a few things I've noticed while working with teams on their ML pipelines.
Don't migrate for prestige: solve real problems first
One ML project I worked on had a monolithic pipeline running everything from data ingestion to model training in a single ECS service. The team was frustrated: deployments were slow, and when anything failed, the entire pipeline had to restart from scratch. They'd heard that microservices solve these problems: better isolation, independent deployments, and easier scaling. So they planned to split the pipeline into separate services for ingestion, feature engineering, training, and validation.
But when we profiled the pipeline, the real bottleneck turned out to be a single heavy feature computation step that took 45 minutes while everything else completed in under 10. We extracted only that step into a separate ECS service that reads its input from S3 and writes its results back to another bucket. Pipeline throughput improved by 60%, failures became isolated, and we avoided the complexity of managing four separate services.
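The extracted step can stay a very small program: a container that pulls its input from S3, does the heavy computation, and drops the result into another bucket for the rest of the pipeline to pick up. Here is a minimal sketch of that shape, assuming boto3 and pandas; the bucket names, object keys, and feature logic are placeholders, not the team's actual code.

```python
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

RAW_BUCKET = "ml-pipeline-raw"            # hypothetical input bucket
FEATURES_BUCKET = "ml-pipeline-features"  # hypothetical output bucket


def compute_heavy_features(df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the expensive feature computation step."""
    # Example only: a rolling aggregate per entity. The real logic is whatever
    # profiling showed to be the 45-minute step.
    df["rolling_mean"] = df.groupby("entity_id")["value"].transform(
        lambda s: s.rolling(window=7, min_periods=1).mean()
    )
    return df


def handle(key: str) -> None:
    """Read one raw object from S3, compute features, write the result back."""
    obj = s3.get_object(Bucket=RAW_BUCKET, Key=key)
    df = pd.read_parquet(io.BytesIO(obj["Body"].read()))

    features = compute_heavy_features(df)

    buf = io.BytesIO()
    features.to_parquet(buf, index=False)
    s3.put_object(Bucket=FEATURES_BUCKET, Key=key, Body=buf.getvalue())


if __name__ == "__main__":
    handle("daily/2024-01-01.parquet")  # hypothetical object key
```

Because the service only talks to S3 on both sides, the rest of the pipeline can stay as it was; the monolith simply consumes the output object instead of running the step itself.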
Let actual pain points drive decomposition, not trends. The right solution is often simpler, cheaper, and faster to implement than a full architectural overhaul.
Team size and readiness matter
A five-person team I worked with was dealing with a slow ML workflow: model retraining took hours, and any code change meant redeploying the entire pipeline. They wanted faster iteration cycles and the ability to update individual components independently. Microservices seemed like the answer. They split their workflow into separate ECS services: one for ingestion, one for feature prep, one for training, and one for serving. Each piece could now deploy independently and scale on its own.
The problem? The team had three data scientists, one ML engineer, and one backend developer, with no dedicated DevOps resources. Suddenly, they were managing four CI/CD pipelines instead of one, debugging IAM role permissions across services, and coordinating deployments when schema changes touched multiple services. What should have been a one-day model update turned into a week of infrastructure firefighting.
Microservices multiply operational overhead. Your architecture should match your team's size and skillset. If you're spending more time on infrastructure than on your actual product, you've over-engineered.
Beware of the distributed monolith
A team I worked with had split their ML pipeline into separate services: ingestion, feature engineering, validation, training, and serving. They wanted independent deployments and better isolation.
The problem was shared dependencies. All services used a common Python library for data processing and expected the same schema from S3. When the data science team wanted to add a new feature, they had to update the shared library, modify the schema, and coordinate deployments across services in a specific order. One schema change could take a week to roll out because services would fail if deployed out of sequence.
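To make the coupling concrete, here is an illustrative (not the team's actual) shape of the problem: every service imports the same shared helper and silently assumes the same column layout, so a schema change is only safe once all of them redeploy together.

```python
# shared_dataproc/io.py -- hypothetical shared library imported by every service
import pandas as pd

EXPECTED_COLUMNS = ["entity_id", "event_ts", "value"]  # one schema, five consumers


def load_batch(df: pd.DataFrame) -> pd.DataFrame:
    """Every service calls this and assumes exactly these columns exist."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"schema mismatch, missing columns: {missing}")
    return df[EXPECTED_COLUMNS]

# Adding a new feature means bumping EXPECTED_COLUMNS here, then redeploying
# ingestion, feature engineering, validation, training, and serving in the
# right order -- deploy out of sequence and something breaks.
```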
They had a microservices architecture with monolith coupling: all the operational complexity of distributed systems, none of the independence.
We restructured around clear ownership and contracts. Ingestion owns its raw data format. Feature engineering owns its output format and commits to backward compatibility: it can add fields but not break existing ones. Training depends on the contract, not the implementation. Services communicate readiness through SQS events instead of assuming each other's data structure.
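Here is roughly what that contract-plus-event pattern can look like in Python. It is a sketch under assumptions: pydantic for the schema contract and boto3 for SQS, with a made-up queue URL and field names. The point is that training consumes the published schema and the readiness event, never the producer's internals.

```python
import json
from typing import Optional

import boto3
from pydantic import BaseModel


class FeatureRecordV1(BaseModel):
    """Output contract owned by the feature engineering service.

    New optional fields may be added; existing fields are never removed or
    renamed, so downstream consumers keep working without coordinated deploys.
    """

    entity_id: str
    event_ts: str
    rolling_mean: float
    # Added later: optional with a default, so older consumers are unaffected.
    rolling_std: Optional[float] = None


def publish_batch_ready(batch_uri: str) -> None:
    """Tell downstream services (e.g. training) that a feature batch is ready."""
    sqs = boto3.client("sqs")
    sqs.send_message(
        QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/feature-batches",  # hypothetical
        MessageBody=json.dumps(
            {
                "event": "feature_batch_ready",
                "schema_version": "v1",
                "batch_uri": batch_uri,  # consumers read the data, not our internals
            }
        ),
    )
```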
True independence requires clear ownership and contracts between services. If every change needs coordinated deployments, you've built a distributed monolith.
Plan observability and ownership from day one
A team I worked with had split their pipeline into four separate services and felt good about the architecture. Then they hit their first production data quality issue: bad records were appearing in training data, but no one could tell which service introduced them.
The team spent three days debugging. They had basic CloudWatch logs, but each service logged differently: some used JSON, others plain text; some included timestamps, others didn't. Engineers were manually checking each service's logs, trying to trace records through the pipeline. They'd find a suspicious entry in one service, then have to guess the timestamp and search another service's logs. No correlation IDs, no consistent format, no clear way to follow a record's journey.
Worse, when they finally found the bug in the feature engineering service, no one was sure who owned it. The ML engineer who wrote it had moved to another project, and the code had been touched by three different people since.
Without consistent observability and clear ownership, debugging microservices becomes archaeological work. Set up structured logging and assign owners before you go to production, not after your first incident.
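A little shared convention goes a long way here. Below is a sketch of what that baseline could look like with Python's standard logging: one JSON object per log line, and a correlation ID generated once per batch and passed along with the data (for example inside the SQS message) so every service logs the same ID. The service name and field set are illustrative.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so every service shares one format."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "ts": self.formatTime(record),
                "level": record.levelname,
                "service": "feature-engineering",  # set per service
                "correlation_id": getattr(record, "correlation_id", None),
                "message": record.getMessage(),
            }
        )


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def process_batch(batch_uri: str) -> None:
    # One correlation ID per batch, carried through every service, so a
    # record's journey can be traced end to end with a single search.
    correlation_id = str(uuid.uuid4())
    logger.info("started batch %s", batch_uri, extra={"correlation_id": correlation_id})
    # ... ingestion / feature prep / validation steps log with the same ID ...
    logger.info("finished batch %s", batch_uri, extra={"correlation_id": correlation_id})
```

With something like this in place, "which service introduced the bad records" becomes a query over one log format instead of a three-day manual hunt.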
Thank you for reading, let's chat
Have you migrated an ML or data pipeline from monolith to microservices?
Have you ever run into a "distributed monolith"?
If you had to pick just one, would you optimize for observability or modularity first?
I love hearing from readers! Please feel free to drop comments, questions, and opinions below.



