Monitor Everything
With so many moving pieces, it's crucial to monitor what's happening under the hood. This means gathering telemetry, in the form of metrics and logs, from your services and the underlying infrastructure. This data must be shipped somewhere so you can build dashboards and raise alerts that escalate to the appropriate personnel. Depending on your business needs, you may also need to monitor for security and compliance against various technical benchmarks like PCI DSS, CIS, ISO 27001, and others.
1 Set up Telemetry
Choose between Datadog and AWS-managed Prometheus and Grafana with Loki for gathering your telemetry. Datadog offers the most mature implementation, while AWS-managed Grafana and Prometheus provide a lower-cost alternative, with various trade-offs, that is a good fit for many organizations.
Datadog is our most comprehensive observability solution, offering a monitoring-as-code approach using YAML configuration fully managed with Terraform. This includes Datadog monitors, custom RBAC roles, synthetic tests, child organizations, and other resources.
We show how to define reusable Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for consistent implementation, helping to reduce alert fatigue by focusing on critical business-specific metrics and leveraging Datadog's advanced capabilities. We then integrate this with OpsGenie for incident management.
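To make the SLI/SLO relationship concrete, here is a minimal Python sketch of the underlying arithmetic. It is not the Datadog or Terraform configuration itself; the `Slo` class and the `sli` and `error_budget_remaining` helpers are hypothetical names chosen for illustration, and the sketch assumes a simple request-based SLI (good events divided by total events).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical SLO: a target ratio of good events over some window."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of events must be good


def sli(good_events: int, total_events: int) -> float:
    """The SLI is the measured ratio of good events to all events."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as fully compliant
    return good_events / total_events


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target (or zero traffic) leaves no room for failures.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad
```

With a 99.9% target and 1,000,000 requests, 1,000 requests may fail before the objective is missed, so 500 failures leave roughly half the budget. Alerting on how quickly this budget is consumed, rather than on every individual error, is one common way to reduce alert fatigue.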
Amazon Managed Grafana is a fully managed service by AWS in collaboration with Grafana Labs. Although it's significantly less expensive than Datadog, it is also more barebones in comparison.
- Managed Grafana allows you to query, visualize, and set alerts for your metrics, logs, and traces through a centralized dashboard where you can add multiple data sources.
- AWS Managed Prometheus together with promtail collects and queries metrics from your containerized applications.
- Deploy Loki for efficient log collection from containerized applications (for EKS users).
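As a rough illustration of the log path, the sketch below builds the JSON body that a log shipper sends to Loki's push endpoint (`/loki/api/v1/push`): a list of streams, each with a label set and `[timestamp, line]` pairs, where the timestamp is a string of Unix nanoseconds. The function name `build_loki_push_payload` and the example labels are hypothetical. In practice an agent running on the cluster does this for you, so treat it as a way to understand the data shape rather than something you need to write.

```python
import json
import time
from typing import Dict, List, Optional


def build_loki_push_payload(
    labels: Dict[str, str],
    lines: List[str],
    now_ns: Optional[int] = None,
) -> dict:
    """Build a Loki push request body for one stream of log lines."""
    # Loki expects the timestamp as a string of nanoseconds since the Unix epoch.
    timestamp = str(now_ns if now_ns is not None else time.time_ns())
    return {
        "streams": [
            {
                "stream": dict(labels),  # the label set identifies the stream
                "values": [[timestamp, line] for line in lines],
            }
        ]
    }


# Hypothetical labels for a containerized application.
payload = build_loki_push_payload(
    {"namespace": "default", "app": "api"},
    ["server started", "listening on :8080"],
)
body = json.dumps(payload)  # this string would be POSTed as application/json
```

Labels are what Loki indexes, so keeping the label set small and low in cardinality (namespace, app, container) is what keeps log collection efficient.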
2 Manage Incidents
With monitoring in place and alerts being emitted, it’s crucial to define what qualifies as an incident and escalate it to the appropriate people for action. We support OpsGenie, which integrates natively with Datadog.
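One way to think about "what qualifies as an incident" is as an explicit, testable rule. The Python sketch below is illustrative only and does not call OpsGenie or Datadog; the severity names, the routing table, and the `Alert`, `is_incident`, and `escalation_target` names are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical severity scale and routing table; adjust to your own definitions.
INCIDENT_SEVERITIES = {"critical", "high"}
ESCALATION_ROUTES = {
    "payments": "payments-oncall",
    "platform": "platform-oncall",
}
DEFAULT_ROUTE = "sre-oncall"


@dataclass(frozen=True)
class Alert:
    service: str
    severity: str  # "critical", "high", "warning", or "info"
    customer_facing: bool


def is_incident(alert: Alert) -> bool:
    """An alert counts as an incident only if it is severe and affects customers."""
    return alert.severity in INCIDENT_SEVERITIES and alert.customer_facing


def escalation_target(alert: Alert) -> Optional[str]:
    """Return the on-call team to page, or None when the alert is not an incident."""
    if not is_incident(alert):
        return None
    return ESCALATION_ROUTES.get(alert.service, DEFAULT_ROUTE)
```

Writing the rule down this way keeps the definition reviewable: a warning-level alert on an internal service returns `None` and stays on a dashboard, while a critical customer-facing alert is routed to a named on-call team.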
3 Monitor for Security & Compliance
Monitoring for security and compliance is essential for organizations subject to industry regulations like HIPAA or for e-commerce companies aiming for PCI compliance. Our reference architecture includes comprehensive support for AWS's suite of security-oriented services, including:
- Security Hub: Centralized security view
- GuardDuty: Threat detection service
- Inspector: Automated security assessments
- Macie: Data security and privacy
- AWS Config: Resource configuration tracking
- IAM Access Analyzer: Policy monitoring and validation
- Shield: DDoS protection
- Audit Manager: Continuous audit and compliance
- CloudTrail: User activity and API usage
- WAF: Web application firewall