archer56

Senior Site Reliability Engineer (SRE)

At a Glance

Location
San Jose, California, United States
Experience
3+ years
Posted
2026-02-27T11:59:26-05:00

Key Requirements

Required Skills

AWS, Bash, CI/CD, DevOps, Docker, Jenkins, Kafka, Kubernetes, Python

Domain Knowledge

  • Engineering
  • Regulatory

Requirements

3+ years of experience in Site Reliability Engineering, DevOps, or a similar role with a strong focus on operational excellence.

Deep expertise in Amazon EKS, including cluster provisioning, management, and troubleshooting.

Extensive experience with observability tools and practices, including Prometheus, Grafana, the ELK stack, or similar.

Responsibilities

Implement and maintain the infrastructure and pipelines required for an internal LLM-powered chat service, potentially leveraging platforms such as OpenRouter.

Implement and maintain highly available, scalable, and secure cloud-native infrastructure on Amazon Elastic Kubernetes Service (EKS).

Develop and implement comprehensive observability strategies, including monitoring, logging, and alerting, to ensure the health and performance of our systems.

Architect and optimize data pipelines to ensure efficient and reliable data flow across various platforms.

Drive the continuous improvement of our CI/CD pipelines, promoting best practices for automated testing, deployment, and release management.

Champion cloud-first strategies, leveraging the full capabilities of cloud platforms for infrastructure, services, and operations.