Snowflake

Senior Infrastructure Engineer, Observe by Snowflake

At a Glance

Location
Menlo Park, CA, United States
Employment
Full-time
Experience
5+ years
Compensation
USD 200,000 to 287,500 per year
Department
Snowflake
Posted
2026-03-10

Key Requirements

Required Skills

  • AWS
  • Azure
  • CI/CD
  • DevOps
  • GCP
  • Kubernetes
  • Python
  • Snowflake
  • Terraform

Domain Knowledge

  • Automation
  • Cloud
  • Engineering

Requirements

5+ years of experience in Infrastructure Engineering, Site Reliability Engineering (SRE), DevOps, or related roles.

Demonstrated experience designing and operating production systems at scale, with deep ownership of reliability and operational excellence.

Strong experience with container orchestration platforms such as Kubernetes or Nomad, including architectural decision-making and operational tuning.

Hands-on experience managing cloud infrastructure using Infrastructure-as-Code tools such as Terraform, Ansible, or similar, with a focus on scalable system design.

Strong programming skills in Go, Python, or similar languages, with a track record of building automation and infrastructure systems.

Experience driving cross-team technical initiatives and influencing infrastructure best practices.

Responsibilities

Lead the design, build, and operation of scalable, secure cloud infrastructure in AWS supporting a high-scale observability platform.

Drive architectural improvements that enhance reliability, performance, scalability, and operational visibility across development and production environments.

Own and evolve CI/CD pipelines, developer tooling, and platform automation to improve productivity and deployment safety at scale.

Proactively identify reliability, performance, and security risks, and lead efforts to mitigate them.

Design and implement infrastructure patterns that ensure high availability, fault tolerance, and operational resilience.

Play a key role in incident response, root cause analysis, and post-incident improvements, driving systemic reliability enhancements.