Hi,
Hope you’re doing well! I’m reaching out about an SRE Consultant opportunity with RelantoAI. We’re looking for someone with 10+ years of experience in bare-metal management.
Role: SRE Consultant
Location: Santa Clara, CA (3 Days Hybrid)
Experience: 8+ Years
Job Type: Contract
- On-prem infrastructure management:
Manage Nvidia’s on-prem infrastructure.
Maintain uptime, reliability, and readiness of the on-prem engineering cloud spread across multiple data centers.
- Guard SLAs:
Guard service level agreements (SLAs) for critical engineering services.
Implement monitoring, alerting, and incident response procedures to ensure adherence to defined performance targets.
Perform root cause analysis and post-mortems for incidents involving threshold breaches.
- Observability:
Set up and manage monitoring and logging tools such as Prometheus, Grafana, or the ELK Stack to oversee system health and performance.
Maintain KPI pipelines using Jenkins, Python, and ELK.
Improve monitoring systems by adding custom alerts based on business needs.
- Automation & Optimization:
Assist with capacity planning, optimization, and efforts to improve utilization.
- Day-to-Day Support:
Support user-reported issues.
Monitor alerts and take necessary action.
Actively participate in war rooms for critical issues.
- Collaboration & Documentation:
Create and maintain documentation for operational procedures, configurations, and troubleshooting guides.
Tech stack:
- Bare-metal data center machine management tools such as IPMI, Redfish, and KVM.
- Automation using Jenkins, Python, Go, and Bash.
- Infrastructure tools such as Kubernetes, MySQL, Prometheus, Grafana, and ELK.
- Familiarity with hardware such as GPUs and Tegras.
Thanks & Regards,
