Job Title: SRE Consultant
Location: Santa Clara, CA (Onsite 5 days a week)
Terms: Long Term Contract
Client: Nvidia
Requirements/Skills:
• On-prem infrastructure management
o Manage Nvidia’s on-prem infrastructure. Maintain uptime, reliability, and readiness of the on-prem engineering cloud spread across multiple data centers.
• Guard SLAs
o Guard service level agreements (SLAs) for critical engineering services. Implement monitoring, alerting, and incident response procedures to ensure adherence to defined performance targets. Perform root cause analysis and post-mortems of incidents for any threshold breaches.
• Observability
o Set up and manage monitoring and logging tools such as Prometheus, Grafana, or the ELK Stack to oversee system health and performance. Maintain KPI pipelines using Jenkins, Python, and ELK.
o Improve monitoring systems by adding custom alerts based on business needs.
• Automation & Optimization
o Help with capacity planning, optimization, and better utilization efforts.
• Day-to-Day Support
o Support user-reported issues. Monitor alerts and take necessary action.
o Actively participate in war rooms for critical issues.
• Collaboration & Documentation
o Create and maintain documentation for operational procedures, configurations, and troubleshooting guides.
• Tech stack
o Bare-metal data center machine management tools like IPMI, Redfish, KVM, etc.
o Automation using Jenkins, Python, Go, Bash.
o Infrastructure tools like Kubernetes, MySQL, Prometheus, Grafana, and ELK.
o Any familiarity with Nvidia hardware like GPUs and Tegras is a plus.
Pratik Kumar | Senior Talent Acquisition Specialist
Amaze Systems Inc
USA: 8951 Cypress Waters Blvd, Suite 160, Dallas, TX 75019
Canada: 55 York Street, Suite 401, Toronto, ON M5J 1R7
