🐾How to deploy Falcon 40B with SageMaker and Terraform🐾
❓Have you ever wondered how to deploy Falcon 40B or other LLMs to your AWS cloud using Terraform?
🤗 Thanks to Hugging Face, it's now easier: you can use their Docker image to do it. To deploy Falcon 40B or another LLM, you need to create the following resources:
1️⃣ aws_sagemaker_model
For this resource, you need the Hugging Face image URI and a model ID. To find the right image, you can either check the AWS list of available images or run the following Python command:
```python
from sagemaker.huggingface import get_huggingface_llm_image_uri

get_huggingface_llm_image_uri("huggingface", version="0.8.2")
```
To find the Hugging Face model ID, and to check which LLMs are supported, visit the Hugging Face blog post. Also, don't forget that you need to create an IAM execution role for your endpoint.
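A minimal sketch of such a role might look like this (the role name and the broad AmazonSageMakerFullAccess managed policy are illustrative choices; in production you would scope the permissions down):

```hcl
# Execution role that SageMaker assumes to run the endpoint.
resource "aws_iam_role" "role" {
  name = "sagemaker-endpoint-role" # illustrative name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}

# Broad managed policy for brevity; narrow this for real deployments.
resource "aws_iam_role_policy_attachment" "sagemaker_access" {
  role       = aws_iam_role.role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
}
```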
2️⃣ aws_sagemaker_endpoint_configuration
For this resource, you should choose the correct instance type; for Falcon 40B, for example, an ml.g5.12xlarge instance works well. It's also important to set a long enough health check timeout so the model has time to start up. If you make it too short, the deployment can fail.
3️⃣ aws_sagemaker_endpoint
The simplest resource to create: only a name and a reference to the endpoint configuration are required.
```hcl
resource "aws_sagemaker_endpoint" "endpoint" {
  name                 = "llm-endpoint"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.config.name
}

resource "aws_sagemaker_endpoint_configuration" "config" {
  name = "endpoint-config"

  production_variants {
    variant_name                                      = "AllTraffic"
    model_name                                        = aws_sagemaker_model.model.name
    initial_instance_count                            = 1
    instance_type                                     = "ml.g5.12xlarge"
    container_startup_health_check_timeout_in_seconds = 600
  }
}

resource "aws_sagemaker_model" "model" {
  name               = "model"
  execution_role_arn = aws_iam_role.role.arn

  primary_container {
    image = "763104351884.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.0-tgi0.8.2-gpu-py39-cu118-ubuntu20.04"

    environment = {
      HF_MODEL_ID                = "tiiuae/falcon-40b-instruct"
      SM_NUM_GPUS                = "4"
      HF_MODEL_TRUST_REMOTE_CODE = "true"
    }
  }
}
```
⚠️ Don’t forget to delete your endpoints after your experiments; leaving them running can get quite expensive. One hour of an ml.g5.12xlarge instance costs around $5.70, depending on the region.
If you like this post, you can share the APAWS newsletter with friends: