SmarterDx

Staff Site Reliability Engineer

At a Glance

Location
United States
Work Regime
Remote
Experience
10+ years
Posted
2026-03-12T20:02:28-04:00

Key Requirements

Required Skills

  • AWS
  • Kubernetes
  • PostgreSQL
  • Terraform

Domain Knowledge

  • Engineering
  • Regulatory

Benefits & Perks

Time Off

Unlimited PTO & 10 Holidays

Requirements

10+ years of software and software reliability engineering experience, with significant time spent operating and scaling distributed systems in production environments.

3+ years of hands-on experience running cloud-native infrastructure in AWS, including deep familiarity with containers, Kubernetes, monitoring, and alerting in live production systems.

Proven experience defining and managing SLIs/SLOs, leading incident response, and driving postmortems and systemic reliability improvements.

Strong expertise with Terraform and infrastructure-as-code practices for managing production infrastructure safely and reproducibly.

Deep experience with Kubernetes architecture and operations, including workload reliability, cluster scaling, networking, and failure modes.

Experience working in security-conscious, compliance-oriented environments where reliability and data protection are first-class concerns.

Compensation & Benefits

$230K to $250K base salary

Benefits

Medical, Dental & Vision: Comprehensive plans with leading insurance providers, covering 75% of your premiums, depending on the plan.

Paid Parental Leave

Responsibilities

Define and evolve reliability standards for the SmarterDx platform, including SLIs, SLOs, and error budgets that align engineering work with customer impact.

Implement a “reliability” platform using Terraform and infrastructure-as-code best practices.

Enhance observability systems (metrics, logs, traces, alerting) to provide actionable insights and reduce mean time to detect (MTTD) and mean time to resolve (MTTR).

Lead incident response, drive blameless postmortems, and implement systemic improvements to prevent recurrence.

Reduce operational toil through automation, self-healing systems, and improved deployment and rollback mechanisms.

Provide production support for the SmarterDx platform, applying SRE principles to ensure availability, performance, and data durability.