
Cloud Destinations LLC
Job Title: SRE
Location: Bay Area (local candidates are preferred and will be prioritized). Must be willing to work a hybrid schedule.
Duration: 6+ months
On-prem infrastructure management
Manage Nvidia’s on-prem infrastructure. Maintain the uptime, reliability, and readiness of the on-prem engineering cloud, which is spread across multiple data centers.
Guard SLAs
Guard service level agreements (SLAs) for critical engineering services. Implement monitoring, alerting, and incident response procedures to ensure adherence to defined performance targets. Perform root cause analysis and post-mortems for any incident that breaches a threshold.
Observability
Set up and manage monitoring and logging tools such as Prometheus, Grafana, or the ELK Stack to oversee system health and performance. Maintain KPI pipelines using Jenkins, Python and ELK.
Improve monitoring systems by adding custom alerts based on business needs.
Automation & Optimization
Assist with capacity planning, optimization, and efforts to improve resource utilization.
Day-to-Day Support
Support user-reported issues. Monitor alerts and take necessary action.
Actively participate in war rooms for critical issues.
Collaboration & Documentation
Create and maintain documentation for operational procedures, configurations, and troubleshooting guides.
Tech stack
Bare-metal data center machine management tools such as IPMI, Redfish, and KVM.
Automation using Jenkins, Python, Go, Bash.
Infrastructure tools like Kubernetes, MySQL, Prometheus, Grafana and ELK.
Familiarity with Nvidia hardware such as GPUs and Tegra is a plus.
To apply for this job, email your details to dineshkumarr@clouddestinations.com