archer56

Senior Site Reliability Engineer (SRE)


At a Glance

Location: San Jose, California, United States
Experience: 3+ years
Posted: 2026-02-27T11:59:26-05:00

Key Requirements

Required Skills

AWS, Bash, CI/CD, DevOps, Docker, Jenkins, Kafka, Kubernetes, Python

Domain Knowledge

Engineering
Regulatory

Requirements

3+ years of experience in Site Reliability Engineering, DevOps, or a similar role with a strong focus on operational excellence.

Deep expertise in Amazon EKS, including cluster provisioning, management, and troubleshooting.

Extensive experience with observability tools and practices, including Prometheus, Grafana, ELK stack, or similar.

Responsibilities

Implement and maintain the infrastructure and pipeline required for an internal LLM-powered chat service, potentially leveraging platforms like OpenRouter or similar alternatives.

Implement and maintain highly available, scalable, and secure cloud-native infrastructure on Amazon Elastic Kubernetes Service (EKS).

Develop and implement comprehensive observability strategies, including monitoring, logging, and alerting, to ensure the health and performance of our systems.

Architect and optimize data pipelines to ensure efficient and reliable data flow across various platforms.

Drive the continuous improvement of our CI/CD pipelines, promoting best practices for automated testing, deployment, and release management.

Champion cloud-first strategies, leveraging the full capabilities of cloud platforms for infrastructure, services, and operations.