Responsible for how code is deployed, configured, and monitored, as well as the availability, latency, change management, emergency response, and capacity management of services already in or going to production.
Design, code, test, and deliver software to automate manual operational work and to develop self-service, auto-detection, and auto-healing capabilities
Develop software for reliability and scale, minimizing the need for later refactoring or changes
Define, monitor and defend SLOs
Deploy closed-loop remediation (continuous testing and remediation) to fix problems in pre-production before software is released to production.
Build custom tooling from scratch to meet specific needs in the incident management workflow.
Resolve complex incidents across public cloud, private cloud, third-party, and on-premises technology.
Leverage Chaos Engineering to find and prevent future problems and to confirm that fixes from past incidents function as intended.
Focus on end-user experiences and partner with development teams to implement changes to increase uptime and performance based on empirical evidence.
Troubleshoot priority incidents, facilitate blameless post-incident evaluations and ensure permanent closure of incidents
Identify application patterns and analytics in support of better service level objectives
Design performance tests, identify bottlenecks, optimization opportunities, and capacity demands, and present solutions for continuous improvement
Design best-in-class monitoring frameworks to accomplish end-to-end flow monitoring and noiseless alerting
Design automated software and product upgrades, change management and release management solutions
Skills/Qualifications
Bachelor’s degree or equivalent experience in a software engineering discipline
2-3 years of SRE or System Engineering experience.
Expert in designing, coding, testing, and delivering software in at least one technology stack, e.g., Java, Python, C++, Go, etc.
Deep knowledge of Internet protocols and web services technologies e.g., HTTP, DNS, TCP/UDP, SOAP, JSON, Apache, Tomcat and REST
Experience working with containers e.g., Docker, Kubernetes, Cloud Foundry, etc.
Experience in working with automation tools e.g., Ansible, Puppet, Selenium etc.
In-depth OS experience, e.g., RHEL, Ubuntu, Windows Server, with strong debugging, troubleshooting, and problem-solving skills
Experience with testing and build automation in a continuous integration/continuous delivery (CI/CD) pipeline, e.g., Travis CI, Maven, Gradle, Groovy, Git, Terraform, Jenkins, etc.
Experience deploying and managing services on modern platforms e.g., AWS, GCP, Azure.
Strong experience in using industry standard monitoring tools e.g., AppDynamics, Dynatrace, APICA, Splunk, ELK, FluentD, Prometheus, Kibana, Elasticsearch, Grafana, Nagios, Datadog, New Relic, etc.
Advanced understanding of the application monitoring stack (logs, events, metrics, and alerts) and the ability to visualize and set up end-to-end observability
Certification in one or more cloud technologies, e.g., AWS, Azure, GCP, or Red Hat, is a big plus.