Site Reliability Engineer 4/5, Netflix Technology Services
- Design, implement, and maintain scalable and reliable infrastructure to support our services.
- Collaborate with engineering and product teams to integrate reliability considerations into the entire software development lifecycle.
- Develop and implement automation tools for monitoring, deployment, and incident response to ensure efficient and reliable operations.
- Conduct or participate in capacity planning, performance analysis, and system tuning to optimize system reliability.
- Participate in on-call rotations and contribute to incident response, diagnosis, and resolution.
- Implement and improve monitoring and alerting systems to proactively identify and address potential issues.
- Implement and maintain robust disaster recovery and business continuity plans.
- Continuously evaluate and recommend improvements to enhance system reliability and performance.
- Proactively identify sources of instability in distributed systems and analyze how complex systems fail from a reliability and resilience perspective.
- Engage with product teams to diagnose operational surprises and drive improvements.
- Implement and maintain a robust incident response framework, including blame-aware incident reviews to learn from operational surprises.
SKILLS AND EXPERIENCE
- 3+ years of experience as a Site Reliability Engineer or in a similar role
- Experience with complex sociotechnical systems and their successful operations at scale
- Experience with incident management and response
- Experience with cloud platforms such as AWS, microservices architectures, and enterprise software such as Slack and GSuite
- Excellent communication and collaboration skills and a continuous improvement mindset
- Proven ability to cultivate relationships through influence
- Proven ability to troubleshoot complex issues and implement effective solutions
- Familiarity with Human Factors Engineering
- Ability to grow your own expertise and to influence and educate others