In A Nutshell
Location: Hybrid (Redwood City, CA, USA)
Salary: $178,000-$267,000
Job Type: Full-time
Experience Level: Mid-level
Deadline to apply: July 10, 2025
Lead cross-functional programs that accelerate the effectiveness of CZI’s AI/ML and Data Infrastructure teams.
Responsibilities
- Lead AI/ML infrastructure programs: Drive execution of technical initiatives across GPU scheduling, platform enablement, observability, or workload orchestration.
- Lead access and lifecycle workflows: Own the end-to-end experience for users accessing shared infrastructure resources—including onboarding, offboarding, documentation, and support processes. Serve as the primary point of contact for researchers and internal teams navigating compute access, and collaborate with platform teams to ensure smooth transitions, including long-term model or data migration when needed.
- Coordinate infrastructure access requests: Manage intake and operational workflows for machine learning infrastructure access, including triage, tracking, and communication. Ensure alignment across engineering, research, and platform teams, and help evolve the process as usage scales and needs become more complex.
- Drive documentation systems: Own the structure, accuracy, and governance of internal documentation, onboarding guides, runbooks, and infrastructure wikis.
- Enhance visibility: Maintain and improve AI system dashboards and reporting systems for onboarding timelines, RFA volume, and infrastructure program milestones.
Skillset
- 7+ years of experience in technical program management or infrastructure-focused operations in complex engineering environments.
- Proven ability to manage large-scale technical programs across multiple stakeholders and teams.
- High-level understanding of machine learning workflows and model training pipelines, with the ability to translate infrastructure needs between research and engineering teams.
- Strong organizational skills and experience leading cross-functional programs with tight timelines and multiple stakeholders.
- Excellent written and verbal communication skills, including the ability to align stakeholders at multiple levels.
- A passion for building efficient, secure, and inclusive systems to support cutting-edge science and research.
- Familiarity with on-prem/HPC and/or multi-cloud GPU infrastructure, orchestration tools, and platforms such as Slurm, Run:AI, MLflow, W&B, or similar systems is a huge plus.