Title: Site Reliability Engineering
Location: Plano, TX /
Software Engineer – Site Reliability Engineering
Required Skills:
• Excellent debugging and troubleshooting skills.
• Expert in performance monitoring and capacity management of large systems using various tools.
• Expert in at least one technology stack (Java/J2EE/Python), with experience designing, coding, testing, and delivering software.
• Expert in at least one relational database (SQL Server, Oracle, DB2, etc.).
• Hands-on experience with cloud technologies (Cloud Foundry, Kubernetes, AWS).
• Hands-on experience with big data services (Hadoop, HDFS, Hive, YARN, HBase, Kafka, ZooKeeper).
• Working knowledge of Groovy, batch scripting, PowerShell or shell scripting.
• Experience developing, deploying, and debugging distributed systems in a Linux/Hadoop environment.
• Experience with monitoring tools such as AppD, Splunk, ELK, Geneos.
• Ability to analyze SLI metrics and performance data, and to interpret and correlate them with SLOs and SLAs.
• Experience with deployment automation, CI/CD, DevOps, Jenkins, Git, Bitbucket.
• Experience with cloud/container environments, big data, and analytical tools (Tableau, Alteryx).
• Expert practitioner in one or more technology domains; may be a cross-domain expert able to solve complex, mission-critical problems within a business or across the firm.
• Working knowledge of infrastructure components like routers, load balancers and networks.
• Comfortable working in Agile mode and proficient in continuous integration and continuous delivery.
• Solid understanding of microservice design methodologies.
• Solid analytical and problem-solving skills.
• A proven team lead with excellent communication skills.
• Attention to detail and time-management skills.
• Endlessly curious about applications and application stability.
Additional Pointers:
• Core technologies: Unix and SQL are essential. Knowledge of big data concepts, Hadoop, Alteryx, and similar tools would be an added advantage.
• Cloud experience and awareness are mandatory from a create, configure, and monitor perspective.
• As an SRE, the candidate must have worked with monitoring tools such as Datadog and Splunk. The work covers testing, creating environments, supporting production, deploying code, CI/CD, incident management, and troubleshooting.
• The role includes a weekend rotation; candidates who cover a weekend rotation receive a weekday comp off.