Elastic Load Balancing (ELB) lets you build load balancers for your application without managing the servers that do the load balancing. However, because it’s a managed service, conventional monitoring tools give you less visibility into it. That makes the monitoring tools AWS provides even more important.

By spreading traffic across more servers as you add them, ELB enables websites and online services to handle more user requests. An unhealthy ELB can slow your website down or take it offline. The right dashboards and metrics help you resolve problems more quickly, and a robust analytics engine makes alerts more intelligent.

A load balancer spreads the load among all your servers to keep capacity utilization even, taking into account the services each server provides, its overall health, and the level of demand. One of the main advantages of load balancing is fault tolerance for your website. The load balancer stops routing to any server that is unhealthy or has a critical error and sends requests to healthy servers instead. Because your application or website can absorb these failures and still provide a positive user experience, it becomes more reliable.

Key AWS Elastic Load Balancer Metrics:

Elastic Load Balancing publishes a variety of metrics to Amazon CloudWatch, and examining these metrics for outliers and threshold breaches is an excellent way to identify problems with your ELBs. Some metrics, however, are more accurate gauges of ELB health than others. This section covers the metrics to watch, including those relevant to Network Load Balancers, Application Load Balancers, and Classic Load Balancers.

You can now organize and visualize data for your ELB nodes thanks to Infraon’s support for ELB. Infraon supports the following ELB metrics:

Request Count:

Request Count measures the number of requests the ELB handles. A spike or dip in requests could indicate a problem with your load balancer’s clients. It could also mean that your ELB is returning a lot of errors and clients are retrying.

You’ll be looking for irregularities when monitoring Request Count. CloudWatch offers anomaly detection alarms, but a static threshold based on averages is a simple way to identify problems. To set one up:

  1. Examine a week’s worth of data for the metric and then calculate the average high.
  2. Create a 10% window above and below the average to get a decent idea of what appears healthy for your application.
  3. Create a CloudWatch alarm on the metric’s Sum aggregation that fires when Request Count stays beyond the threshold for more than five data points.
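
The first two steps can be sketched in a few lines of Python. This is only an illustration: the function name `healthy_band` and the sample daily peaks are hypothetical, and "average high" is interpreted here as the mean of each day's peak value.

```python
from statistics import mean


def healthy_band(daily_peaks, tolerance=0.10):
    """Return (lower, upper) bounds around the average daily peak.

    daily_peaks: one peak RequestCount value per day (assumed input shape).
    tolerance:   width of the window on each side, 0.10 = 10%.
    """
    if not daily_peaks:
        raise ValueError("need at least one daily peak")
    average_high = mean(daily_peaks)
    return average_high * (1 - tolerance), average_high * (1 + tolerance)


# Hypothetical peak RequestCount sums for each of the last seven days.
week_of_peaks = [1000, 1100, 900, 1000, 1050, 950, 1000]
lower, upper = healthy_band(week_of_peaks)
print(f"healthy RequestCount band: {lower:.0f} to {upper:.0f}")
```

The upper bound is the value you would use as the alarm threshold in step 3; the lower bound can back a second alarm if unexpected dips matter for your application.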

Latency and Application-Specific Target Response Time:

The latency and target response time metrics in CloudWatch track how long your servers take to respond through the load balancer. Ballooning response times are nearly always a sign that your application is having problems. These problems are especially important to catch because they cascade through services: your clients wait longer for the resources they request from the load balancer, and anything that depends on those clients slows down in turn.

Again, the best way to track this metric is to look for anomalies. To identify spikes in latency, use the technique outlined for Request Count, but use the Average aggregation this time when creating CloudWatch alarms.
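
As a rough sketch, the alarm settings described above could be assembled like this for a Classic Load Balancer. The helper name, the load balancer name, the 60-second period, and the 0.5-second threshold are all assumptions for illustration; the threshold should come from your own weekly baseline.

```python
def latency_alarm_params(load_balancer_name, threshold_seconds):
    """Build keyword arguments for a CloudWatch alarm on Classic ELB latency.

    For an Application Load Balancer you would instead use the
    "AWS/ApplicationELB" namespace, the "TargetResponseTime" metric,
    and the "LoadBalancer" dimension.
    """
    return {
        "AlarmName": f"{load_balancer_name}-high-latency",
        "Namespace": "AWS/ELB",
        "MetricName": "Latency",
        "Dimensions": [
            {"Name": "LoadBalancerName", "Value": load_balancer_name}
        ],
        "Statistic": "Average",      # Average, not Sum, for latency
        "Period": 60,                # seconds per data point (assumed)
        "EvaluationPeriods": 5,      # five data points, as for Request Count
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# "my-classic-elb" and 0.5 s are placeholder values.
params = latency_alarm_params("my-classic-elb", 0.5)
print(params["MetricName"], params["Statistic"], params["Threshold"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.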

Target Connection Error Count & Backend Connection Errors:

Whenever your ELB encounters a connection error with the hosts behind the load balancer, the backend connection error metric is incremented. If this keeps happening, your hosts may be overloaded and unable to accept connections, or traffic may be going to an unreachable port.

Occasional connection errors can happen, but you should know about it if this value is consistently nonzero. Create a CloudWatch alarm on the Sum aggregation that fires when the metric is nonzero for five consecutive data points.
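
To make the "five consecutive data points" rule concrete, here is a small sketch of the check. The function name and the sample series are hypothetical; CloudWatch performs this evaluation for you once the alarm exists.

```python
def nonzero_streak_reached(datapoints, required=5):
    """Return True if `required` consecutive data points are above zero."""
    streak = 0
    for value in datapoints:
        # A zero resets the streak, so isolated errors never trigger.
        streak = streak + 1 if value > 0 else 0
        if streak >= required:
            return True
    return False


# Hypothetical per-minute sums of backend connection errors.
occasional = [0, 2, 0, 0, 1, 0, 3, 0]
persistent = [0, 1, 4, 2, 6, 3, 0]

print(nonzero_streak_reached(occasional))
print(nonzero_streak_reached(persistent))
```

The first series contains errors but never five in a row, so it would not alarm; the second has a run of five nonzero points and would.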

Surge Queue Length:

In a Classic Load Balancer, the surge queue length is the total number of requests waiting to be routed to a healthy instance. The queue holds at most 1,024 requests, and requests beyond that limit are rejected. The surge queue length metric rises when your backend instances cannot handle the volume of incoming application requests. Several causes exist, but capacity limits on your backend instances rank high on the list. Scale your instances to add computing capacity and keep the surge queue as short as possible.

To keep track of this value, create a CloudWatch alarm on the Maximum aggregation that notifies you when surge queue length exceeds 768 for five successive data points. If your application is severely performance constrained, use a stricter alarm, either by lowering the threshold or by monitoring for nonzero values over a longer period, say 15 minutes.
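
The 768 threshold is 75% of the 1,024 limit, which leaves some headroom before requests start being rejected. The sketch below shows that arithmetic; the function name, the status labels, and the `alarm_fraction` parameter are illustrative assumptions.

```python
SURGE_QUEUE_LIMIT = 1024  # queue limit stated in the text above


def surge_queue_status(max_queue_length, alarm_fraction=0.75):
    """Classify a Maximum SurgeQueueLength data point.

    alarm_fraction is the share of the limit at which to alert;
    0.75 of 1,024 gives the 768 threshold used in the text.
    """
    threshold = int(SURGE_QUEUE_LIMIT * alarm_fraction)
    if max_queue_length >= SURGE_QUEUE_LIMIT:
        return "full"    # further requests would be rejected
    if max_queue_length > threshold:
        return "alarm"   # above the threshold but not yet full
    return "ok"


for sample in (0, 768, 769, 1024):
    print(sample, surge_queue_status(sample))
```

A performance-sensitive application could pass a smaller `alarm_fraction` to get the stricter alarm mentioned above.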

HTTP Code Target 4XX Count & HTTP Code Backend 4XX:

Both Classic Load Balancers and Application Load Balancers expose a metric for 4xx errors returned by the hosts behind the load balancer. If this metric suddenly increases, there is probably a problem with the requests clients are sending to your load balancer.

Because some 4xx errors are expected in any distributed system, use the same technique described for Request Count to identify real problems. Use the Sum aggregation again when you create the CloudWatch alarm.

Healthy and Unhealthy Host Count:

Your ELB periodically checks its EC2 instances with pings or test requests and classifies each one as in service or out of service; AWS calls this an ELB health check. The load balancer then routes traffic only to the healthy instances. Make sure your health check interval and response timeout are not so short that the ELB removes healthy instances from the pool and inflates the unhealthy host count. Tune both settings so that enough healthy backend EC2 instances stay in service.
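
As a back-of-the-envelope aid for tuning these settings, the sketch below estimates how long a failing instance keeps receiving traffic. It assumes an instance is marked out of service after a configured number of consecutive failed checks (the "unhealthy threshold", a setting not discussed above) and treats the result as an approximation, not an exact figure. The function name and sample values are hypothetical.

```python
def seconds_until_out_of_service(interval, timeout, unhealthy_threshold):
    """Approximate time for a failing instance to be marked out of service.

    interval:            seconds between health checks
    timeout:             seconds to wait for a health check response
    unhealthy_threshold: consecutive failures before removal (assumed setting)
    """
    if timeout >= interval:
        # A timeout as long as the interval leaves no gap between checks.
        raise ValueError("timeout should be shorter than the interval")
    return interval * unhealthy_threshold


# Placeholder settings: check every 30 s, 5 s timeout, 2 failures to remove.
print(seconds_until_out_of_service(interval=30, timeout=5, unhealthy_threshold=2))
```

Shorter intervals remove bad instances sooner, but they also give a briefly slow yet healthy instance less time to recover before it is pulled from the pool.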

Monitoring these metrics gives you a solid starting point for identifying and troubleshooting load balancer-related failures, regardless of whether you are an experienced AWS user or are just getting started.

Final Thought:

Regardless of which load balancer you use, AWS does a fantastic job routing traffic for an application and storing the related data. But it’s up to you how to proceed. You are responsible for gaining access to, gathering, and modifying this data, whether for analytical purposes, diagnostics, troubleshooting, or future development. 

Utilizing the multiple data points provided by user requests for your application can help you strategically monitor trends and stay on top of the game. Infraon Infinity can help you monitor over 100 different types of apps from a single console, including on-premises, virtualization, containers, and cloud, giving you complete visibility into your stack.

FAQs:

How do I check my ELB metrics?

You can view ELB metrics in the CloudWatch console, which lists the available metrics by namespace and dimension. If you are signed in to an account configured as a monitoring account in CloudWatch cross-account observability, you can also see metrics from the linked source accounts. Those metrics are shown together with the ID or label of the account they come from.
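
If you prefer to pull the same data programmatically, a query for the CloudWatch API might be assembled as in the sketch below. The helper name, the load balancer name, and the five-minute period are assumptions; the example targets a Classic Load Balancer, whose metrics live in the `AWS/ELB` namespace.

```python
from datetime import datetime, timedelta, timezone


def request_count_query(load_balancer_name, hours=1, now=None):
    """Build parameters for fetching RequestCount over the last `hours`."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ELB",
        "MetricName": "RequestCount",
        "Dimensions": [
            {"Name": "LoadBalancerName", "Value": load_balancer_name}
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,            # five-minute data points (assumed)
        "Statistics": ["Sum"],
    }


# "my-classic-elb" is a placeholder name.
query = request_count_query("my-classic-elb", hours=3)
print(query["StartTime"], "->", query["EndTime"])
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").get_metric_statistics(**query)`.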

How does Elastic load balancer scale?

Scaling adapts a load balancer’s performance to its workload by changing the number of load balancer nodes. Elastic Load Balancing does this automatically, adding or removing capacity as your traffic changes.
