The purpose of the Site Reliability Engineer (SRE) role is to ensure the stability, scalability, and performance of production systems while driving improvements in overall system reliability and operational efficiency. By bridging the gap between development and operations, the SRE role focuses on creating a resilient infrastructure through automation, monitoring, and proactive incident management.
The SRE is responsible for designing and implementing tools and processes that enhance the reliability of applications, reduce downtime, and optimize system performance. They work to establish best practices for high availability, incident response, and continuous improvement, ensuring seamless user experiences and aligning system operations with business objectives. The SRE plays a critical role in both preventing and rapidly resolving issues, contributing to a stable, scalable, and reliable technology ecosystem.
Tasks
- Design, implement, and maintain highly available infrastructure, focusing on failover strategies, redundancy, and scalability.
- Develop and maintain Infrastructure as Code (IaC) scripts using tools like Terraform, Ansible, or CloudFormation.
- Set up and manage monitoring and alerting systems to proactively detect issues (using tools like Prometheus, Grafana, or Datadog).
- Automate repetitive tasks, deployments, and infrastructure provisioning to improve efficiency and reduce human error.
- Conduct performance tuning and optimizations across infrastructure, applications, and databases to improve responsiveness and reduce latency.
- Work closely with security teams to ensure compliance with regulatory standards, address vulnerabilities promptly, and implement security best practices across infrastructure and applications to protect systems and data.
- Collaborate with development teams to optimize applications and integrate reliability into the software development lifecycle.
- Partner with DevOps to improve CI/CD pipelines, streamline releases, and enhance build and deployment automation.
- Advocate for Site Reliability Engineering principles and educate teams on reliability best practices, monitoring, and error handling.
- Implement and track SLAs, SLOs, and error budgets, continuously assessing and improving reliability.
Requirements
- Infrastructure as Code (IaC): Proficiency with IaC tools such as Terraform, Ansible, CloudFormation, or similar for automating infrastructure provisioning.
- Cloud Platforms: Strong experience with cloud providers (Azure) and services such as Kubernetes (EKS/GKE/AKS).
- Monitoring and Alerting: Hands-on experience with monitoring and alerting tools (Prometheus, Grafana, Datadog, New Relic, or similar).
- Scripting and Automation: Proficiency in scripting languages like Python, Bash, or PowerShell for automation and tooling.
- CI/CD and DevOps: Familiarity with CI/CD pipelines and tools (Azure DevOps, Bamboo, or Octopus), and experience implementing continuous delivery and deployment practices.
- Incident Management: Experience with troubleshooting, root cause analysis, and leading incident response efforts.
- Strong performance optimization skills.
- Ability to analyze complex systems.
- Understanding of security practices.
Benefits
- Competitive salary commensurate with skills and experience
- Performance-based bonus dependent on achievement of set targets and personal performance
- Consultancy contract (B2B) offering paid time off