🐾 Lambda for ML inference: Why, How, and Limitations 🐾
🤓 Machine learning inference doesn't always require complex infrastructure. While AWS SageMaker is a popular choice, sometimes simpler solutions can be more effective. You're probably already using AWS Lambda for MLOps automation. But did you know it can also serve ML models?
Why might you want to use Lambda?
SageMaker Serverless Endpoints are great, but they come with limitations:
Timeout: Up to 5 minutes
Maximum concurrent invocations: 200 executions
Memory configurations between 1,024 MB and 6,144 MB
Must use SageMaker-supported frameworks/containers
Lambda is more flexible:
Timeout: Up to 15 minutes
Maximum concurrent invocations: up to 10,000 executions
Memory configurations between 128 MB and 10,240 MB
Framework flexibility: Run any ML library that fits within size limits
The cost factor is particularly interesting. For infrequent, quick inferences with smaller models, Lambda's pay-per-use model often works out cheaper than maintaining SageMaker endpoints.
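To make that concrete, here's a back-of-the-envelope sketch in Python. The per-GB-second and per-request rates are assumptions (roughly in line with published x86 Lambda pricing in us-east-1 at the time of writing), so treat the numbers as illustrative and check the current AWS pricing pages before relying on them.

```python
# Back-of-the-envelope Lambda inference cost estimate.
# The rates below are assumptions (roughly us-east-1, x86 Lambda pricing);
# check the current AWS pricing pages before relying on them.

LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667   # assumed compute rate
LAMBDA_PRICE_PER_REQUEST = 0.20 / 1_000_000  # assumed request rate


def monthly_lambda_cost(invocations_per_month: int,
                        avg_duration_seconds: float,
                        memory_gb: float) -> float:
    """Estimate the monthly Lambda bill for a given inference workload."""
    compute = (invocations_per_month * avg_duration_seconds
               * memory_gb * LAMBDA_PRICE_PER_GB_SECOND)
    requests = invocations_per_month * LAMBDA_PRICE_PER_REQUEST
    return compute + requests


# Example: 50,000 inferences a month, 300 ms each, 2 GB of memory.
print(f"${monthly_lambda_cost(50_000, 0.3, 2.0):.2f} per month")
```

At that volume the estimate stays well under a dollar a month, which is exactly the kind of infrequent, bursty workload where pay-per-use shines. For high, steady traffic, a provisioned endpoint usually wins.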
⚠️ Note that neither SageMaker Serverless Endpoints nor Lambda functions support GPUs.
How can you use Lambda for inference?
The key to successful Lambda-based inference is containerization. Here's why:
Container image functions bypass Lambda's standard package size limits (up to 10 GB per image, versus 250 MB unzipped for zip packages)
You can pre-load models during container initialization (see the sketch after this list)
Dependencies are easier to manage in a containerized environment
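As a minimal sketch, here's what a container-image Lambda handler might look like. It assumes a scikit-learn model serialized with joblib and baked into the image at /opt/ml/model.joblib; the path, file name, and event shape are illustrative, not a fixed convention.

```python
# app.py — minimal sketch of a container-image Lambda inference handler.
# Assumes a scikit-learn model serialized with joblib and copied into the
# image at /opt/ml/model.joblib (path and format are illustrative).
import json

import joblib

# Loading the model at module import time means it happens once per
# container initialization (i.e. on cold start), not on every invocation.
MODEL = joblib.load("/opt/ml/model.joblib")


def handler(event, context):
    """Lambda entry point: expects {"features": [[...], ...]} in the event."""
    features = event["features"]
    predictions = MODEL.predict(features)
    return {
        "statusCode": 200,
        "body": json.dumps({"predictions": predictions.tolist()}),
    }
```

The corresponding image would typically start from an AWS Lambda Python base image, install scikit-learn and joblib, copy the model file in, and point the image command at app.handler.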
Even small LLMs can run on Lambda, as demonstrated by projects like PicoLLM and Practical Guide for LLM on Lambda. These implementations show that with careful optimization, you can serve sophisticated models within Lambda's constraints.
What are the limitations you should know about?
Before jumping in, consider these crucial factors:
Response time requirements: Larger containers mean longer cold starts
Memory constraints: While Lambda offers up to 10GB RAM, it's still limited for larger models
No GPU support: CPU-only execution can impact performance
For latency-sensitive applications, you'll need strategies to manage cold starts, such as provisioned concurrency and keeping your container image as small as possible.
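If you go the provisioned concurrency route, it can be configured with boto3 against a published version or alias of the function. A minimal sketch, assuming a function named ml-inference with an alias called live (both names are placeholders):

```python
# Sketch: keep a couple of warm execution environments for the inference
# function via provisioned concurrency. Function and alias names below are
# placeholders; provisioned concurrency must target a version or alias.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.put_provisioned_concurrency_config(
    FunctionName="ml-inference",        # hypothetical function name
    Qualifier="live",                   # alias or published version
    ProvisionedConcurrentExecutions=2,  # number of pre-warmed environments
)
```

Keep in mind that provisioned concurrency is billed for as long as it's configured, so it chips away at the pay-per-use advantage; it's a deliberate trade-off for latency-sensitive paths.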
Thank you for reading, let’s chat 💬
💬 Have you tried using Lambda functions for ML inference?
💬 Have you encountered any other limitations with serverless endpoints?
💬 Which compute do you use for ML inference?
I love hearing from readers 🫶🏻 Please feel free to drop comments, questions, and opinions below👇🏻