Operating Lambda: Application design — Scaling and concurrency
In this post, we discuss how Lambda scales and manages concurrency, and the different behaviors of on-demand scaling and Provisioned Concurrency.
Scaling and concurrency in Lambda
AWS Lambda provides scaling as part of the service: as traffic increases, Lambda increases the number of concurrent executions of our functions without the need for threading or any custom engineering in our code.
When a request arrives, Lambda creates an instance of the function and runs the handler method to process the event. After the execution finishes, this instance remains available for some period of time to serve subsequent events. If more requests arrive while the existing instances are busy, new instances are created to handle them.
Lambda supports an initial burst of cumulative concurrency of between 500 and 3000 instances, depending on the Region. After the initial burst, the function can scale by an additional 500 instances per minute until the account concurrency limit is reached. Any further requests fail with a throttling error.
Requests and concurrency
One Lambda instance can handle only one request at a time. However, for event sources such as SQS, multiple messages can be batched into a single Lambda invocation and processed together.
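As an illustration, a handler triggered by SQS processes an entire batch of records in one invocation, so one unit of concurrency covers many messages. This is a sketch: the `Records`/`body` shape follows the standard SQS event format, but the JSON fields inside the message body are hypothetical.

```python
import json

def handler(event, context):
    """Sketch of an SQS-triggered handler: one invocation (and therefore
    one unit of concurrency) processes a whole batch of messages."""
    processed = 0
    for record in event["Records"]:
        payload = json.loads(record["body"])  # hypothetical JSON message body
        # ... do the actual work with `payload` here ...
        processed += 1
    return {"batchSize": processed}
```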
Lambda scales easily for scattered synchronous requests, but if our service expects a burst of simultaneous requests, it is recommended to place a buffer such as an SQS queue in front of Lambda to prevent throttling.
Lambda supports two types of scaling: on-demand and Provisioned Concurrency.
On-demand scaling
Concurrent instances of Lambda are created based on the volume of requests received. For each new request, either an existing free execution environment is reused or a new environment is provisioned, which incurs a cold start.
The following example shows how Lambda handles a burst of on-demand traffic. The account concurrency limit is 10,000, the initial burst capacity is 3,000, and each request takes 15 seconds to process. The function receives 10,000 requests under the following conditions:
#1 All requests arrive at the same time: 3,000 requests are handled by new execution environments and the remaining 7,000 are throttled.
#2 Requests arrive evenly over 2 minutes (5,000 per minute): 3,000 requests are handled in the first minute and 2,000 are throttled. In the second minute, 500 new instances are created in addition to the initial 3,000, so 3,500 requests are handled and 1,500 are throttled.
#3 Requests arrive evenly over 3 minutes (roughly 3,333 per minute): In the first minute, 3,000 requests are processed and 333 are throttled. In the second minute, 500 new instances are created and the initial 3,000 are reused, handling all 3,333 requests. In the third minute, all requests are handled by warm instances.
#4 Requests arrive evenly over 4 minutes (2,500 per minute): In the first minute, 2,500 instances are created and handle all requests; they are reused in the subsequent minutes with no throttling.
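The four scenarios above can be reproduced with a small simulation. This is a simplified model, assuming a 3,000 initial burst, 500 additional instances per minute, a 10,000 account limit, and that each minute's requests arrive together at the start of the minute (so, with 15-second executions, every environment is free again before the next minute's batch).

```python
def simulate(arrivals_per_minute, burst=3000, growth=500, limit=10000):
    """Return a list of (handled, throttled) pairs, one per minute.

    Capacity in minute m is the initial burst plus 500 new instances for
    each elapsed minute, capped at the account concurrency limit; requests
    beyond that minute's capacity are throttled.
    """
    results = []
    for minute, arrivals in enumerate(arrivals_per_minute, start=1):
        capacity = min(burst + growth * (minute - 1), limit)
        handled = min(arrivals, capacity)
        results.append((handled, arrivals - handled))
    return results

# Scenario #1: simulate([10000])       -> [(3000, 7000)]
# Scenario #2: simulate([5000, 5000])  -> [(3000, 2000), (3500, 1500)]
```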
Provisioned Concurrency
For asynchronous workloads, the default scaling and concurrency limits provide a reasonable trade-off between throughput and configuration management overhead. However, for synchronous workloads where expected traffic exceeds the default burst capacity, or where consistent double-digit millisecond latency is required, Provisioned Concurrency is recommended.
When Provisioned Concurrency is configured, concurrent Lambda execution environments are prepared in advance of invocations. So, if the rate of requests to our service is greater than Lambda's default burst capacity, provisioned capacity is available to handle the load. Provisioned Concurrency also provides predictable low-latency responses, because the environments are pre-provisioned and there are no cold starts.
Provisioned Concurrency is a configuration setting and does not require any code change.
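As a sketch of that configuration using the AWS SDK for Python: the function name and version number below are hypothetical, and note that Provisioned Concurrency must target a published version or an alias, not $LATEST.

```python
import boto3

lambda_client = boto3.client("lambda")

# Hypothetical function name and version number.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="order-processor",
    Qualifier="1",  # a published version or alias, not $LATEST
    ProvisionedConcurrentExecutions=7000,
)
```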
For the same scenario as above, with a Provisioned Concurrency of 7,000, capacity is consumed as follows:
#1 All 10,000 requests arrive at the same time: 7,000 requests are handled by the provisioned environments with no cold start. The remaining 3,000 requests are handled by new on-demand execution environments, which incur cold starts.
#2 In the other cases, where the number of simultaneous requests stays at or below 7,000, all requests are handled by provisioned capacity with no cold starts.
Using service integrations and asynchronous processing
Synchronous integrations with Lambda can often be improved by making them asynchronous, especially when the service can tolerate some latency, the expected traffic pattern is unpredictable, and throttling of customer requests must be avoided.
Instead of invoking Lambda directly from API Gateway, a queue such as SQS can be introduced between them. API Gateway pushes requests into the queue, which offers nearly unlimited throughput. Lambda polls messages from the queue and processes them at its own rate, without being overwhelmed by API Gateway traffic and throttling requests. The Lambda function can record the status of each request in a DynamoDB table, which can be surfaced to the user via a separate API endpoint, an SNS notification, or a similar service.
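The effect of the buffer can be illustrated with a simple sketch, assuming a fixed per-minute drain rate for Lambda: requests arriving faster than Lambda can process them accumulate in the queue instead of failing with a throttling error.

```python
def backlog_per_minute(arrivals_per_minute, drain_rate):
    """Model a queue in front of Lambda: excess requests wait in the
    queue rather than being throttled. Returns the queue backlog at the
    end of each minute, given a fixed per-minute drain rate."""
    backlog = 0
    history = []
    for arrivals in arrivals_per_minute:
        backlog += arrivals
        backlog -= min(backlog, drain_rate)  # Lambda drains up to its rate
        history.append(backlog)
    return history

# A 10,000-request spike against a 3,000/minute drain rate clears in
# four minutes with no requests lost:
# backlog_per_minute([10000, 0, 0, 0], 3000) -> [7000, 4000, 1000, 0]
```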
All the Lambda functions in your account share the same pool of account-level concurrency. This couples the functions together: if one function consumes the entire limit, other functions fail. This can be prevented by setting a reserved concurrency limit on high-volume and critical functions, ensuring they always have the capacity they require.
Reserved concurrency is deducted from the concurrency limit of your account.
When you set a reserved limit on a function, it becomes that function's maximum concurrency, and any requests arriving in excess of the reserved capacity are throttled. This also allows us to cap the processing rate, especially when Lambda is invoked asynchronously (for example, by S3) or through internal pollers such as SQS or DynamoDB Streams.
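Reserved concurrency is likewise a configuration change rather than a code change; a sketch with the AWS SDK for Python (the function name is hypothetical):

```python
import boto3

lambda_client = boto3.client("lambda")

# Hypothetical function name; caps this function at 10 concurrent
# executions and deducts 10 from the account's unreserved pool.
lambda_client.put_function_concurrency(
    FunctionName="sqs-batch-processor",
    ReservedConcurrentExecutions=10,
)
```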
For example, consider a function with a reserved concurrency of 10 that processes batches of messages from an SQS queue with a batch size of 10. If 1,000 messages arrive in the queue, only 100 messages are processed at a time, limiting the processing rate. This prevents the function from exhausting the limits of downstream services, such as DynamoDB write capacity units, through excess concurrency.
This post discussed different approaches and designs for better architecting serverless services. Also, refer to this article to understand Lambda's limits and how to architect a service around them.