Job Title: Site Reliability Engineer (Observability)
Location: Hybrid
Contract Type: 6 months
About the Role: We need a Site Reliability Engineer (SRE) to build and maintain our observability systems. You'll ensure our core services remain reliable, scalable, and high-performing.
Key Responsibilities:
- Deploy and manage observability tools using a Prometheus-like metrics store and Grafana Enterprise.
- Automate monitoring, alerting, and incident response.
- Build Grafana dashboards for system insights.
- Apply Infrastructure as Code (IaC) principles.
- Develop tooling in Golang (preferred) or Python.
- Advocate for SRE principles like SLOs, SLIs, and error budgets.
- Integrate monitoring with incident management workflows.
Required Skills:
- SRE principles and reliability engineering expertise.
- Solid familiarity with Linux.
- Strong experience building and deploying containers with Podman or Docker.
- Golang (preferred) or Python for automation and API integration.
- Experience with Grafana, VictoriaMetrics, and PromQL.
- Experience deploying and managing centralized logging solutions.
- Strong Infrastructure as Code (IaC) knowledge.
Desirable Skills:
- OpenTelemetry experience.
- Terraform, Ansible, or CI/CD knowledge.
- Familiarity with VictoriaMetrics.
- Background in datacentre and compute hardware services.
- Experience configuring and deploying AWS infrastructure.
- Familiarity with Kubernetes and cloud-native systems.
- Incident response automation expertise.
Why Join Us?
- Work on a greenfield observability project with full automation.
- Collaborate with top engineers in a high-ownership culture.