HPC Operations Engineering Manager

Microsoft
$165,600.00 - $296,400.00 / yr
United States, California, Mountain View
7000 State Highway 161
Jun 08, 2026
Overview
Microsoft AI is seeking an experienced High Performance Computing Operations Engineering Manager to join our infrastructure team on the MAI Superintelligence Team. In this role, you'll lead a team of Site Reliability Engineers who blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. You'll work closely with ML researchers, data engineers, and product developers to design and operate the platforms that power training, fine-tuning, and serving generative AI models.

Microsoft Superintelligence Team
Microsoft Superintelligence Team's mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

This role is part of Microsoft AI's Superintelligence Team. The MAIST is a startup-like team inside Microsoft AI, created to push the boundaries of AI toward Humanist Superintelligence - ultra-capable systems that remain controllable, safety-aligned, and anchored to human values. Our mission is to create AI that amplifies human potential while ensuring humanity remains firmly in control. We aim to deliver breakthroughs that benefit society - advancing science, education, and global well-being. We're also fortunate to partner with incredible product teams, giving our models the chance to reach billions of users and create immense positive impact.

If you're a brilliant, highly ambitious, and low-ego individual, you'll fit right in - come and join us as we work on our next generation of models!

By applying to this Mountain View, CA position, you are required to be local to the San Francisco area and in office 4 days a week.
Responsibilities
Team leadership: Lead a team of experienced SREs to ensure uptime, resiliency, and fault tolerance of AI model training and inference systems.
Observability: Design and help maintain monitoring, alerting, and logging systems to provide real-time visibility into model serving pipelines and infrastructure.
Automation & Tooling: Lead the building of automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU+GPU environments.
Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements.
Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments.
Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows.

Qualifications

Required Qualifications
Bachelor's Degree in Computer Science or related technical field AND 8+ years of technical engineering experience with Site Reliability Engineering, DevOps, or Infrastructure Engineering Leadership roles AND 8+ years of experience with Kubernetes, Docker, and container orchestration, AND 6+ years of experience with programming/scripting skills, including but not limited to Python, Go, or Bash, OR equivalent experience.

Preferred Qualifications
Master's Degree in Computer Science or related technical field AND 12+ years of technical engineering experience AND 10+ years of experience with Kubernetes, Docker, and container orchestration, AND 10+ years of experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code, OR equivalent experience.
6+ years of people management experience.
8+ years of experience with monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.).
Knowledge of CI/CD pipelines for inference and ML model deployment.
Solid knowledge of distributed systems, networking, and storage.
Experience running large-scale GPU clusters for ML/AI workloads (preferred).
Familiarity with ML training/inference pipelines.
Experience with high-performance computing (HPC) and workload schedulers (Kubernetes operators).
Background in capacity planning & cost optimization for GPU-heavy environments.

Software Engineering IC6 - The typical base pay range for this role across the U.S. is USD $165,600.00 - $296,400.00 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $220,800.00 - $331,200.00 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.