🐾 Lambda is a Worker, not a Boss 🐾

Dec 08, 2025

🤓 Does your Lambda function call Service A, check the response, retry with delays if it fails, then call Service B, while writing the current state to DynamoDB after each step? Here’s what’s actually happening: you’ve built an orchestrator inside your Lambda. It looks simple at first — one function, one file, easy to understand. But you’re managing workflow state manually, coding retry logic from scratch, and paying for compute time during every error handling routine. There’s a better way.

The Trap: The Lambda orchestration anti-pattern

Developers frequently fall into this trap because writing sequential logic in code is fast and familiar. However, this approach turns your business flow into fragile “spaghetti code” that is hard to maintain and costly to run.

The three primary failure points of using Lambda for orchestration:

The 15-Minute Timeout Wall: Your Lambda execution is capped at 15 minutes. If your workflow involves waiting for an external API, a human approval, or a long-running process, the function will simply time out and fail the entire process. No graceful pause. No resume. Just failure.
Manual State Management: Lambda is stateless by design. To coordinate steps, you are forced to write and maintain complex custom logic that updates external state (e.g., DynamoDB) after every step. You’re essentially building your own orchestration engine from scratch — poorly.
Expensive Retries: You must code error handling and retry logic into your function. When a service fails, you pay for your code to run the error logic (including any time it spends sleeping between attempts), instead of letting the orchestration layer handle it for the price of a state transition. Worse, if your Lambda times out mid-retry, you’ve lost all context about where you were in the process.
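
To make the trap concrete, here is a minimal Python sketch of the kind of handler described above. Everything in it is hypothetical: `call_a`, `call_b`, and `save_state` stand in for Service A, Service B, and a DynamoDB write, and they are passed in as arguments only so the sketch runs without AWS.

```python
import time


def orchestrating_handler(event, call_a, call_b, save_state,
                          sleep=time.sleep, max_attempts=3):
    """Anti-pattern sketch: one handler that sequences, retries, and checkpoints.

    call_a, call_b, and save_state are hypothetical stand-ins for Service A,
    Service B, and a state write to an external store such as DynamoDB.
    """
    save_state(event["id"], "STARTED")

    # Hand-rolled retry loop with exponential backoff.
    for attempt in range(1, max_attempts + 1):
        try:
            result_a = call_a(event)
            break
        except Exception:
            if attempt == max_attempts:
                save_state(event["id"], "FAILED_A")
                raise
            # The function keeps running (and billing) for this whole sleep.
            sleep(2 ** attempt)

    save_state(event["id"], "A_DONE")
    result_b = call_b(result_a)
    save_state(event["id"], "B_DONE")
    return result_b
```

Note that if the function times out during one of those sleeps, the last saved state is still "STARTED", and nothing records which attempt was in flight.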

The Architect’s choice: Step Functions

AWS Step Functions is the dedicated, stateful orchestrator. It solves all of the above problems by defining your workflow declaratively using the Amazon States Language (ASL).

Year-Long Workflows: Standard workflows can run for up to one year. You pay for the compute (Lambda) only when a task is executing. Waiting for days or weeks for human approval is cost-effective, unlike an idle Lambda that would time out long before anyone even opens the approval request.
Built-in Resilience: Error retries, exponential backoff, complex branching, and fallbacks are defined declaratively in the state machine and rendered visually in the console. This logic is robust and versionable, and you don’t write any application code for it. When a downstream service fails, Step Functions automatically retries with configurable intervals, and no Lambda invocation is billed during the wait.
Clear Observability: The visual workflow console shows you the precise status of every step in a clear graph, so the point of failure is easy to spot. Instead of searching through scattered CloudWatch logs trying to reconstruct what happened, you see exactly which state failed, what input it received, and what error it threw.
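
As a rough illustration of what "declarative" means here, the sketch below builds a small ASL definition as a Python dictionary and prints it as JSON. The Lambda ARNs, state names, and retry numbers are placeholders chosen for the example, not values from a real account. The helper `retry_delays` is not part of any AWS API; it only spells out how `IntervalSeconds` and `BackoffRate` combine.

```python
import json

# Placeholder ARNs for illustration only.
CALL_SERVICE_A = "arn:aws:lambda:us-east-1:123456789012:function:call-service-a"
CALL_SERVICE_B = "arn:aws:lambda:us-east-1:123456789012:function:call-service-b"

definition = {
    "Comment": "Call Service A with retries, then Service B",
    "StartAt": "CallServiceA",
    "States": {
        "CallServiceA": {
            "Type": "Task",
            "Resource": CALL_SERVICE_A,
            # Retry up to 3 times, doubling the wait after each attempt.
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 2,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            # If retries are exhausted, route to a fallback state.
            "Catch": [{
                "ErrorEquals": ["States.ALL"],
                "Next": "HandleFailure",
            }],
            "Next": "CallServiceB",
        },
        "CallServiceB": {
            "Type": "Task",
            "Resource": CALL_SERVICE_B,
            "End": True,
        },
        "HandleFailure": {
            "Type": "Fail",
            "Error": "ServiceAFailed",
            "Cause": "Service A still failing after retries",
        },
    },
}


def retry_delays(retrier):
    """Seconds waited before each retry, per the retrier's backoff settings."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** i
        for i in range(retrier["MaxAttempts"])
    ]


if __name__ == "__main__":
    print(json.dumps(definition, indent=2))
```

With these settings the waits before the three retries are 2, 4, and 8 seconds, and they happen inside Step Functions rather than inside a running Lambda.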

The Solution: What to do about it

Map your current workflows: Identify Lambda functions that contain more than one business logic step. These are your refactoring candidates.
Start with the long-running ones: If you have any workflow that needs to wait more than 5 minutes for anything—an API, a human, a batch job—move it to Step Functions immediately.
Use the .waitForTaskToken pattern: For human approvals or external dependencies, leverage Step Functions’ callback pattern. Your workflow pauses for as long as it needs to (up to the one-year limit for Standard workflows) without any compute cost, then resumes exactly where it left off when the external system hands back the task token.
Let Step Functions handle the boring parts: Error handling, retries, timeouts, and branching logic should live in your state machine definition, not scattered across Lambda code.
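
The callback pattern above could look roughly like the following sketch. It assumes the approval request is delivered through an SQS queue; the queue URL, the `orderId` field, and the state names are invented for the example. `complete_approval` is a hypothetical helper: in a real deployment its `sfn_client` argument would be `boto3.client("stepfunctions")`, but it is passed in here so the sketch has no AWS dependency.

```python
import json

# A Task state that sends a message and then pauses until the token comes back.
# Queue URL and field names are placeholders for illustration.
wait_for_approval_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/approvals",
        "MessageBody": {
            "orderId.$": "$.orderId",
            # The context object supplies the token the approver must return.
            "taskToken.$": "$$.Task.Token",
        },
    },
    # Give up after 7 days instead of waiting for the full execution limit.
    "TimeoutSeconds": 7 * 24 * 60 * 60,
    "Next": "FulfilOrder",
}


def complete_approval(sfn_client, task_token, approved, approver):
    """Resume (or fail) the paused workflow once a decision is made.

    sfn_client is expected to behave like boto3.client("stepfunctions").
    """
    if approved:
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True, "approver": approver}),
        )
    else:
        sfn_client.send_task_failure(
            taskToken=task_token,
            error="ApprovalRejected",
            cause=f"Rejected by {approver}",
        )
```

Whatever system handles the approval (a small Lambda behind an API, for instance) only needs to keep the task token and call this helper when the decision arrives.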

Lambda is exceptional at executing discrete tasks. Step Functions is exceptional at coordinating them. Use each for what it does best, and your workflows become more reliable, maintainable, and cheaper to run.

Thank you for reading, let’s chat 💬

💬 What’s the longest workflow you’re currently running in a single Lambda function?
💬 Have you ever lost workflow state due to a Lambda timeout?
💬 What’s stopping you from adopting Step Functions for orchestration?

I love hearing from readers 🫶🏻 Please feel free to drop comments, questions, and opinions below👇🏻
