Job Title: SRE Lead
Location: Atlanta, GA (Day 1 hybrid)
Onsite SRE Lead (10 yrs experience)
Core Skillset
Client Consulting:
- Work with the team to define the SRE maturity model, observability strategy, and AWS reliability roadmap, and to identify gaps.
- Translate business SLAs into SLIs, SLOs, and error budgets.
Architecture & Design:
- Lead and implement AWS serverless reliability architecture (multi-region, failover, self-healing).
- Define observability blueprints (logs, metrics, traces, UX telemetry).
- Define cost-optimized data observability and resiliency solutions.
Reliability & Resilience:
- Design and implement fault-tolerant, highly available AWS architectures.
- Experience with DynamoDB global tables, RDS failovers, and capacity planning.
- Apply SRE principles: SLIs, SLOs, SLAs, error budgets, and toil reduction.
- Drive chaos engineering, disaster recovery, and capacity planning exercises.
Observability & Monitoring:
- Experience implementing end-to-end observability (logs, metrics, traces, events).
- Build cost-optimized unified dashboards and custom metrics using Dynatrace and CloudWatch.
- Experience implementing data observability and resiliency solutions.
- Automate alerts, anomaly detection, and incident response workflows.
Automation & Infrastructure:
- Develop automation and custom tooling using Python and Node.js.
- Build infrastructure as code using AWS CDK and CloudFormation.
- Implement self-healing and auto-remediation solutions with AWS serverless services.
Operations & Incident Management:
- Implement AI/ML-driven automation.
- Collaborate with developers for shift-left observability and performance optimization.
- Guide and lead the adoption of automation, proactive observability, and self-healing systems.