🐾How to deploy Falcon 40B with SageMaker and Terraform🐾
❓Have you ever wondered how to deploy Falcon 40B or other LLMs to your AWS cloud using Terraform?
🤗 Thanks to Hugging Face, it's now easier: you can use their Docker image to do it. To deploy Falcon 40B or another LLM, you need to create the following resources:
1️⃣ aws_sagemaker_model
For this resource, you need the Hugging Face image URI and a model ID. To find the right image, you can either check the AWS list of available images or run the following Python command:
```python
from sagemaker.huggingface import get_huggingface_llm_image_uri

get_huggingface_llm_image_uri("huggingface", version="0.8.2")
```
To find the Hugging Face model ID, and to check which LLMs are supported, visit the Hugging Face blog post. Also, don't forget that you need to create an IAM execution role for your endpoint.
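A minimal sketch of such a role might look like this (the role name and the broad AmazonSageMakerFullAccess managed policy are illustrative choices; in production you would scope the permissions down):

```hcl
# Execution role that SageMaker assumes to run the endpoint.
resource "aws_iam_role" "role" {
  name = "sagemaker-endpoint-role" # illustrative name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}

# Broad managed policy for brevity; narrow this for real deployments.
resource "aws_iam_role_policy_attachment" "sagemaker_access" {
  role       = aws_iam_role.role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
}
```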
2️⃣ aws_sagemaker_endpoint_configuration
For this resource, you should choose the correct instance type; for Falcon 40B, for example, an ml.g5.12xlarge instance works well. It's also important to set a long enough health check timeout so the model has time to start up. If you make it too short, the deployment can fail.
3️⃣ aws_sagemaker_endpoint
The simplest resource to create: only a name and a reference to the endpoint configuration are required.
```hcl
resource "aws_sagemaker_endpoint" "endpoint" {
  name                 = "llm-endpoint"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.config.name
}

resource "aws_sagemaker_endpoint_configuration" "config" {
  name = "endpoint-config"

  production_variants {
    variant_name                                      = "AllTraffic"
    model_name                                        = aws_sagemaker_model.model.name
    initial_instance_count                            = 1
    instance_type                                     = "ml.g5.12xlarge"
    container_startup_health_check_timeout_in_seconds = 600
  }
}

resource "aws_sagemaker_model" "model" {
  name               = "model"
  execution_role_arn = aws_iam_role.role.arn

  primary_container {
    image = "763104351884.dkr.ecr.eu-central-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.0-tgi0.8.2-gpu-py39-cu118-ubuntu20.04"

    environment = {
      HF_MODEL_ID                = "tiiuae/falcon-40b-instruct"
      SM_NUM_GPUS                = "4"
      HF_MODEL_TRUST_REMOTE_CODE = "true"
    }
  }
}
```
⚠️ Don’t forget to delete your endpoints after your experiments; leaving them running can get quite expensive. One hour of an ml.g5.12xlarge instance costs around $5.70, depending on the region.
If you like this post, you can share the APAWS newsletter with friends: