archer56
Senior Site Reliability Engineer (SRE)
At a Glance
- Location: San Jose, California, United States
- Experience: 3+ years
- Posted: 2026-02-27T11:59:26-05:00
Key Requirements
Required Skills
Domain Knowledge
- Engineering
- Regulatory
Requirements
3+ years of experience in Site Reliability Engineering, DevOps, or a similar role with a strong focus on operational excellence.
Deep expertise in Amazon EKS, including cluster provisioning, management, and troubleshooting.
Extensive experience with observability tools and practices, including Prometheus, Grafana, ELK stack, or similar.
Responsibilities
Implement and maintain the infrastructure and pipeline required for an internal LLM-powered chat service, potentially leveraging platforms such as OpenRouter.
Implement and maintain highly available, scalable, and secure cloud-native infrastructure on Amazon Elastic Kubernetes Service (EKS).
Develop and implement comprehensive observability strategies, including monitoring, logging, and alerting, to ensure the health and performance of our systems.
Architect and optimize data pipelines to ensure efficient and reliable data flow across various platforms.
Drive the continuous improvement of our CI/CD pipelines, promoting best practices for automated testing, deployment, and release management.
Champion cloud-first strategies, leveraging the full capabilities of cloud platforms for infrastructure, services, and operations.