As a Principal Engineer on the infrastructure fleet management team, your responsibilities include: Architect, design, and develop core AI infrastructure services written in Go, Rust, Python, C++, and C# and deployed on large-scale Kubernetes clusters to support pre-training and post-training of state-of-the-art LLMs, SLMs, and multimodal and code-specific models. Design, build, and manage compute, storage, and networking subsystems on large-scale GPU clusters to support LLM training, customization, a…
Responsibilities
Enhance systems and applications to deliver high stability, low latency, strong security, and maintainability in large-scale, complex training environments in Azure and in partner clouds. Provide operational support, technical leadership, and vision while contributing to the deployment, monitoring, and continuous improvement of engineering systems and practices. Support development and troubleshooting from the front line, resolving complex issues that impact large-scale services. Provide vision, expertise, and technical leadership to other team members.
Required Qualifications
Master's degree in computer science or a related field. 10+ years designing, developing, and shipping high-quality software. 4+ years of experience with distributed systems and cloud-based infrastructure. 2+ years of experience with DevOps practices (CI/CD, automated testing, deployment, etc.). Passionate and self-motivated. Strong ability in self-learning, entering new domains, and managing through uncertainty in an innovative team environment. 10+ years of software development experience in C#, C++, Python, or similar languages. 6+ years of experience with containerization tools (e.g., Docker, Kubernetes). Knowledge of and hands-on experience with production ML systems, large-scale training infrastructure, NCCL, and CUDA libraries and tools. This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Original Posting
This role is sourced from Microsoft. Apply on the Microsoft careers page.