Reliable AWS Serverless Monitoring

Alexey Horey
5 min read · Sep 29, 2023

Monitoring in the cloud has some challenges. One of them is trusting that your monitoring system works as expected. Below is a working concept that provides a satisfying solution to this need.

  • Introduction
  • Sample use case
  • Making AlertSystem reliable (TL;DR: jump here)
  • AlertSystem Safety
  • AlertSystem Implementation

Introduction

We monitor the cloud using the provided monitoring tools: CloudWatch Metrics, Alarms, SNS and Lambdas. A common fault-handling implementation looks like this:

CloudWatch metric -> triggers CloudWatch Alarm -> sends notification to SNS Topic -> starts Alert Lambda execution -> notifies Slack, Email, SMS, Opsgenie, Jira etc.

This chain of events looks impressive. However, consider that:

  • A chain is only as strong as its weakest link.
  • Alerts must stay silent unless something goes wrong.

We would like to be confident that:

  • The monitoring infrastructure as code is not broken: IAM Roles/Policies, ARNs, Security Groups, etc.
  • The Alert Lambda code is not broken: API tokens and SSL certificates have not expired, etc.

This article is about answering the question: Quis custodiet ipsos custodes [Who will watch the watchmen]?

Sample use case

1) Production_service writes logs to the CloudWatch Logs service.

2) A CloudWatch Logs filter watches the service logs for the “[ERROR]” string.

3) On a filter match, an alarm is triggered, which sends a notification to the AlertSystemSNSTopic.

4) The SNS Subscription starts the `AlertSystemLambda` execution, which receives the notification event.

5) AlertSystemLambda decides where and how to send the user-facing notification.
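
Steps 4 and 5 can be sketched as a small handler. This is only a minimal illustration, not the AlertSystem code: the function names are hypothetical, and it assumes the standard shape of an SNS-triggered Lambda event, where the CloudWatch alarm arrives as a JSON string in Records[].Sns.Message.

```python
import json


def extract_alarms(event):
    """Return the CloudWatch alarm payloads carried by an SNS-triggered Lambda event."""
    alarms = []
    for record in event.get("Records", []):
        # SNS delivers the alarm as a JSON string in Records[].Sns.Message.
        message = json.loads(record["Sns"]["Message"])
        alarms.append({
            "name": message.get("AlarmName"),
            "description": message.get("AlarmDescription"),
            "state": message.get("NewStateValue"),
            "reason": message.get("NewStateReason"),
        })
    return alarms


def lambda_handler(event, context):
    for alarm in extract_alarms(event):
        # Routing to Slack, email, etc. is deployment specific; print stands in for it.
        print(f"[{alarm['state']}] {alarm['name']}: {alarm['reason']}")
```

Keeping the parsing in its own function makes it easy to feed the handler a recorded or hand-written event during testing.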

CI/CD has two major flows:

1) Production_service deployment: GitHub->Jenkins->Docker Image->ECR->ECS.

2) AlertSystemLambda & infrastructure deployment: GitHub->Jenkins->Infrastructure as code->AWS Services.

Making AlertSystem reliable

In a nutshell: AlertSystemLambda will watch both production_service and itself!

Did you notice “System” in the AlertSystemLambda name? This is because we are going to enhance the Lambda to handle multiple kinds of messages.

  • A CloudWatch Alarm has the “alarm_description” field. AlertSystem uses this field to mark self-generated Alarms with the “alert_system_monitoring” tag.
  • Simplicity: Handling messages with “alert_system_monitoring” is the core of the whole system. AlertSystemLambda has very strict and clear logic for dispatching and routing these messages.
  • Isolation: Handling “alert_system_monitoring” is independent of the production_service monitoring logic. There are many different production_services: SQS, Lambda, CloudWatch Logs, OpenSearch, even Zabbix and Grafana send their alerts to AlertSystem. However, the “alert_system_monitoring” tag is handled the same way in all of the AlertSystem deployments.
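
The dispatching idea in the points above can be sketched roughly as follows. The function names and the two handler callbacks are hypothetical, and the sketch assumes the tag appears as a plain substring of the AlarmDescription field in the alarm message; the real AlertSystem may encode it differently.

```python
SELF_MONITORING_TAG = "alert_system_monitoring"


def is_self_monitoring(alarm_message):
    """True when the alarm was created by the AlertSystem to watch itself."""
    # AlarmDescription may be missing or None in a malformed message.
    description = alarm_message.get("AlarmDescription") or ""
    return SELF_MONITORING_TAG in description


def dispatch(alarm_message, handle_self_monitoring, handle_service_alarm):
    """Route self-monitoring alarms separately from production_service alarms."""
    if is_self_monitoring(alarm_message):
        # This path stays identical in every deployment.
        return handle_self_monitoring(alarm_message)
    # This path is where the per-service logic lives.
    return handle_service_alarm(alarm_message)
```

The self-monitoring branch never touches the per-service handler, so a bug in service-specific logic cannot silence alerts about the AlertSystem itself.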

The following components provide the AlertSystem with “self-monitoring” capabilities:

1) The CloudWatch metric “Lambda Errors” watches the AlertSystemLambda. If there is an error, an alarm sends an “alert_system_monitoring” message to AlertSystemLambda. Yes, to itself!

2) The CloudWatch metric “Lambda Duration” watches the AlertSystemLambda. If the duration exceeds 3 minutes, an alarm sends an “alert_system_monitoring” message to AlertSystemLambda. Yes, to itself!

3) A CloudWatch Logs filter metric watches the AlertSystemLambdaLogGroup. If it finds the (Python logger generated) “[ERROR]” string, an alarm sends an “alert_system_monitoring” message to AlertSystemLambda. Yes, to itself!

4) A CloudWatch Logs filter metric watches the AlertSystemLambdaLogGroup. If it finds the (AWS Lambda runtime generated) “Task timed out after” string, an alarm sends an “alert_system_monitoring” message to AlertSystemLambda. Yes, to itself!
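
The four components above could be declared as plain data before being provisioned. The sketch below only builds keyword arguments in the shape that boto3's logs.put_metric_filter and cloudwatch.put_metric_alarm accept. The function name, resource naming scheme, custom namespace, period and statistics are assumptions made for illustration, not values taken from the AlertSystem code.

```python
SELF_MONITORING_TAG = "alert_system_monitoring"
CUSTOM_NAMESPACE = "AlertSystem"  # hypothetical namespace for log-filter metrics


def build_self_monitoring(function_name, log_group, topic_arn):
    """Build kwargs for logs.put_metric_filter and cloudwatch.put_metric_alarm."""

    def alarm(name, namespace, metric, statistic, threshold, dimensions=None):
        params = {
            "AlarmName": f"{function_name}-{name}",
            "AlarmDescription": SELF_MONITORING_TAG,
            "Namespace": namespace,
            "MetricName": metric,
            "Statistic": statistic,
            "Period": 60,
            "EvaluationPeriods": 1,
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanThreshold",
            "TreatMissingData": "notBreaching",
            "AlarmActions": [topic_arn],  # the alarm notifies the Lambda's own topic
        }
        if dimensions:
            params["Dimensions"] = dimensions
        return params

    function_dimension = [{"Name": "FunctionName", "Value": function_name}]
    alarms = [
        alarm("errors", "AWS/Lambda", "Errors", "Sum", 0, function_dimension),
        # Duration is reported in milliseconds: 3 minutes = 180000 ms.
        alarm("duration", "AWS/Lambda", "Duration", "Maximum", 180_000, function_dimension),
    ]
    filters = []
    for name, text in (("log-error", "[ERROR]"), ("timeout", "Task timed out after")):
        metric = f"{function_name}-{name}"
        filters.append({
            "logGroupName": log_group,
            "filterName": metric,
            "filterPattern": f'"{text}"',  # double quotes match the literal phrase
            "metricTransformations": [{
                "metricName": metric,
                "metricNamespace": CUSTOM_NAMESPACE,
                "metricValue": "1",
            }],
        })
        alarms.append(alarm(name, CUSTOM_NAMESPACE, metric, "Sum", 0))
    return {"metric_filters": filters, "alarms": alarms}
```

Each entry in "metric_filters" can then be passed as keyword arguments to the CloudWatch Logs client's put_metric_filter, and each entry in "alarms" to the CloudWatch client's put_metric_alarm.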

  • You can see the visualization -> here.

AlertSystem Safety

We can break anything, but the last thing we want to break is the AlertSystem…

  1. Rule of thumb: if the AlertSystemLambda is triggered, it MUST(!) send a notification. Any kind of notification, to any notification channel, but at least one MUST(!) reach human eyes. In the worst case, it sends a traceback to the Infrastructure Team.
  2. Simulating misconfiguration: The last step of the AlertSystem CI/CD flow is synthetic testing of the self-monitoring mechanism. A malformed event is sent to the AlertSystemLambda. This ensures the code can “digest” even broken text or no text at all. I send an empty SNS notification; the alert system then fails with an exception and triggers two alarms: one for the CloudWatch Logs “[ERROR]” metric and another for the CloudWatch Lambda Errors metric. You do not deploy the AlertSystem too often, so receiving two extra notifications once in a while is a fair deal. I also trigger all 4 self-monitoring alerts at the earliest point possible. For user-managed CloudWatch metrics (log filters), I insert new metric data simulating a failure. For AWS-managed metrics (Lambda service metrics), we are not allowed to put metric data, so I change the CloudWatch Alarm state.
  3. Simulating valid failure: Each production_service monitored by AlertSystem now has a new last step in its infrastructure-as-code CI/CD process: simulate a fault situation by triggering an AlertSystem-monitored event. In our sample use case above, it would be a log line “[ERROR] some text here” in the last stream of the production_service log group. As in (2), you do not change the production_service infrastructure too often. But when you do, you want to make sure it was not broken.
  4. Order enforcement: The production_service CI/CD has an “expiration date” on its infrastructure. The production_service infrastructure-as-code CI/CD must run at least once every 30 days. This limits the impact of the “human factor”, that is, manual breaking changes. I implemented it using an “infrastructure_last_update_time” tag indicating the latest infrastructure deployment date. After 30 days, all builds for this service fail with a “User friendly exception” until the developer runs the infrastructure-as-code CI/CD for this service. In my case it’s a “[V] Provision_Infrastructure” flag, the last build parameter, turned off by default.
  5. Impact limitation: Regardless of the above protections, I wanted to eliminate the possibility of one production_service infrastructure deployment harming another service’s monitoring. This is impossible in a shared-resources architecture, so the decision was to make AlertSystem part of each production_service. Yes, each service has its own AlertSystemSNSTopic, AlertSystemLambda, CloudWatch metrics and Alarms. Otherwise, each infrastructure deployment would force me to trigger all services’ alarms.
  6. Safety switch: This step is not as crucial as the others, but it is easy to implement and nice to have. Dynamic analysis of the AlertSystemLambda code is the final step before publishing the new Lambda package to AWS. Once the Lambda package (Zip) is created, you can extract and execute the code locally. This makes sure the scripts are not broken, since code dependencies can be generated per environment/service. In my case it’s the production_service error handling logic; each service has its own separate logic: SQS alert handling differs from Jenkins jobs’ alert handling. Another example: in the Dev environment you don’t want to send an Opsgenie event or create a Jira ticket.
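
Rules 1 and 4 above lend themselves to a short sketch. Everything here is illustrative: the function names and callbacks are hypothetical, and the expiry check assumes the “infrastructure_last_update_time” tag has already been parsed into a timezone-aware datetime, since the tag's actual format is not shown in this article.

```python
import traceback
from datetime import datetime, timedelta, timezone

MAX_INFRASTRUCTURE_AGE = timedelta(days=30)


def handle_with_fallback(event, handle, notify_infrastructure_team):
    """Run the normal handler; if it fails, still deliver a traceback to a human."""
    try:
        return handle(event)
    except Exception:
        # Rule 1: a triggered Lambda must notify someone, even when its own logic breaks.
        notify_infrastructure_team(traceback.format_exc())
        # Re-raise so the Lambda Errors metric still records the failure.
        raise


def infrastructure_expired(last_update_time, now=None):
    """Rule 4: True when the last infrastructure deployment is older than 30 days."""
    now = now or datetime.now(timezone.utc)
    return now - last_update_time > MAX_INFRASTRUCTURE_AGE
```

A build script could call infrastructure_expired with the parsed tag value and abort with a readable message when it returns True.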

AlertSystem Implementation

Unfortunately for the reader, there is no ready CloudFormation or Terraform configuration, because I implement the real IaC per se, using Python and the cloud vendors’ SDKs. But you may find it interesting to go through my code here:

https://github.com/AlexeyBeley/horey/blob/main/alert_system/horey/alert_system/alert_system.py

The “provision” function is the AlertSystem CI/CD entry point.

- Looks complicated, does it work?

- Yes, it works like a charm. I didn’t write this article until I had seen the system functioning for a year. It has been deployed in more than 50 services, monitoring production and management, and sending build status notifications.
