Job Title: Site Reliability Engineer (Observability)

Location: Hybrid

Contract Type: 6 months

About the Role: We need a Site Reliability Engineer (SRE) to build and maintain observability systems. You'll ensure our core services remain reliable, scalable, and high-performing.

Key Responsibilities:

Deploy and manage observability tools using a Prometheus-like metrics store and Grafana Enterprise.
Automate monitoring, alerting, and incident response.
Build Grafana dashboards for system insights.
Apply Infrastructure as Code (IaC) principles.
Develop tooling in Golang (preferred) or Python.
Advocate for SRE principles like SLOs, SLIs, and error budgets.
Integrate monitoring with incident management workflows.

Required Skills:

SRE principles and reliability engineering expertise.
Solid familiarity with Linux.
Strong experience building and deploying containers using Podman or Docker.
Golang (preferred) or Python for automation and API integration.
Experience with Grafana, VictoriaMetrics, and PromQL.
Experience deploying and managing centralized logging solutions.
Strong Infrastructure as Code (IaC) knowledge.

Desirable Skills:

OpenTelemetry experience.
Terraform, Ansible, or CI/CD knowledge.
Familiarity with VictoriaMetrics.
Background in datacentre and compute hardware services.
AWS infrastructure configuration and deployment.
Familiarity with Kubernetes and cloud-native systems.
Incident response automation expertise.

Why Join Us?

Work on a greenfield observability project with full automation.
Collaborate with top engineers in a high-ownership culture.
