scoutai
AI Cloud Infrastructure Engineer - Fury Team
At a Glance
- Location: Sunnyvale, California, United States
- Experience: 3+ years
- Posted: 2026-02-26T22:24:53-05:00
Key Requirements
Domain Knowledge
- Engineering
- Legal
Requirements
- 3+ years of experience in ML infrastructure, MLOps, or large-scale data systems
- Proven experience with distributed training (PyTorch DDP, DeepSpeed, Ray, or similar) and workflow orchestration (Kubernetes, Airflow, or equivalent)
- Strong proficiency in Python and cloud-native infrastructure (AWS, GCP, or Azure)
- Deep understanding of data engineering (ETL pipelines, object storage, data versioning, metadata management)
- Familiarity with containerization and deployment (Docker, Kubernetes) and monitoring systems (Prometheus, Grafana)
- Experience optimizing GPU cluster utilization, scaling training jobs, and profiling model performance
Compensation & Benefits
- Competitive base salary and bonus
- Meaningful equity
- Premium medical, dental, and vision plans with $0 paycheck contribution
- Competitive PTO and company holiday calendar
- Catered lunch daily and fully stocked kitchen
- EV charging
Responsibilities
- Design and implement data pipelines for ingesting, transforming, and storing petabytes of multimodal data from Fury’s robotic and operator systems
- Develop internal tooling for dataset exploration, curation, versioning, and quality monitoring over time
- Build and maintain distributed training infrastructure (cloud and on-prem) for large-scale multimodal and foundation model training
- Implement job orchestration workflows for launching, tracking, and debugging large-scale model runs
- Identify and remediate bottlenecks in compute, memory, storage, and network performance to optimize throughput and cost efficiency
- Collaborate with AI, autonomy, and systems teams to ensure data and training infrastructure supports real-time and mission-critical use cases