About the Role
The Site Reliability Engineering (SRE) team at ESO is responsible for ensuring the reliability, scalability, and performance of our production systems. We operate at the intersection of engineering and operations, with a strong focus on automation, observability, and continuous improvement.
As a Site Reliability Engineer, you will work hands-on with cloud-native systems, supporting production and pre-production environments to maintain system health, improve resiliency, and optimize performance. You’ll partner closely with engineering, infrastructure, and database teams to troubleshoot complex issues, enhance automation, and ensure our services meet reliability and availability expectations.
This role is ideal for an engineer who enjoys solving challenging problems, digging into application and database behavior, and continuously improving how systems operate in a fast-paced, high-impact environment.
What You’ll Do
- Support and maintain production and non-production cloud environments (Azure/AWS).
- Troubleshoot complex, distributed, cloud-based applications to identify root causes and implement durable fixes.
- Monitor system health, performance, and reliability using observability tools (e.g., New Relic, ELK, and Zabbix).
- Investigate application and database performance issues, including writing and optimizing SQL queries.
- Participate in incident response, debugging, and post-incident reviews focused on continuous improvement.
- Contribute to CI/CD pipelines (e.g., Azure DevOps) to improve automation, reliability, and deployment processes.
- Write and maintain automation scripts (PowerShell, Bash, Python, or similar) to streamline operational workflows.
- Collaborate with developers to understand code behavior and support troubleshooting efforts in C#/.NET-based systems.
- Help improve reliability standards, documentation, and operational best practices.
What We’re Looking For
- Hands-on experience working in a cloud environment (Microsoft Azure strongly preferred).
- Experience supporting and troubleshooting complex, cloud-native applications in production environments.
- Strong understanding of relational databases and solid experience writing and troubleshooting SQL queries.
- Ability to read and understand application code (preferably C#/.NET) to support debugging and issue resolution.
- Experience working with at least one CI/CD platform (e.g., Azure DevOps).
- Familiarity with monitoring and observability tools (e.g., New Relic) and core concepts such as logs, metrics, and traces.
- Experience with scripting/automation (PowerShell preferred).
- Strong analytical and problem-solving skills with attention to detail.
- Clear written and verbal communication skills.
Who You Are
- Passionate about reliability engineering and operational excellence.
- Curious and eager to learn, actively seeking feedback and continuously growing your technical skill set.
- Coachable and adaptable, able to thrive in a fast-paced and evolving environment.
- Comfortable navigating ambiguity and taking ownership of problems through to resolution.
- A collaborative team player who values accountability and continuous improvement.
Nice to Have
- Experience working with Linux-based systems.
- Experience working with Kubernetes and containerized systems.
- Exposure to infrastructure-as-code tools (e.g., Terraform).
- Familiarity with Git-based version control workflows.
At ESO, reliability is core to our customer experience. As part of the SRE team, your work will directly impact system stability, product performance, and the quality of service delivered to healthcare professionals and the communities they serve.