Responsibilities
- Design and optimize end-to-end ML/LLM inference workflows across online low-latency serving, near-real-time inference, and large-scale batch inference scenarios.
- Build scalable serving and execution systems for large-scale models, including scheduling, batching, routing, admission control, and resource-aware execution.
- Improve inference performance and efficiency across compute, memory, storage, network, and concurrency dimensions, with a strong focus on latency, throughput, reliability, and cost.
- Develop and apply modern serving techniques such as continuous or dynamic batching, prefix caching, KV-cache optimization, request shaping, tail-latency reduction, and runtime-level performance tuning.
- Optimize systems for key generative inference metrics such as time to first token, inter-token latency, throughput, accelerator utilization, and cost per request (see the measurement sketch after this list).
- Work on runtime and serving optimizations for modern inference stacks such as vLLM, TensorRT-LLM, SGLang, Triton, ONNX Runtime, and PyTorch-based serving systems.
- Partner with applied scientists to productionize new models and inference patterns, including agentic workflows with tool use, structured outputs, and long-context workloads, and evaluate quality-latency-cost tradeoffs in real production scenarios.
- Design and improve scheduling and resource management for heterogeneous and multi-tenant inference workloads, including GPU-aware placement, admission control, burst handling, and workload isolation.
- Build strong observability and diagnostics for inference services, including bottleneck analysis, performance regression detection, and end-to-end latency and cost measurement.
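As a concrete illustration of the latency metrics named above, here is a minimal sketch of measuring time to first token and inter-token latency against an OpenAI-compatible streaming completions endpoint, such as one served by vLLM. The URL, model name, and prompt are illustrative assumptions, not part of this posting.

```python
# Minimal sketch: measure time to first token (TTFT) and mean
# inter-token latency (ITL) from a streaming completions endpoint.
# The URL, model name, and prompt are illustrative assumptions.
import json
import time

import requests

URL = "http://localhost:8000/v1/completions"  # assumed vLLM-style server

def measure_latency(prompt: str, model: str = "my-model") -> None:
    payload = {"model": model, "prompt": prompt,
               "max_tokens": 64, "stream": True}
    start = time.perf_counter()
    token_times: list[float] = []
    with requests.post(URL, json=payload, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # Server-sent events arrive as lines of the form "data: {...}".
            if not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            if chunk["choices"][0].get("text"):
                token_times.append(time.perf_counter())
    if token_times:
        ttft = token_times[0] - start
        itl = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
        print(f"TTFT: {ttft * 1000:.1f} ms, "
              f"mean ITL: {itl * 1000:.1f} ms, "
              f"chunks: {len(token_times)}")

if __name__ == "__main__":
    measure_latency("Explain continuous batching in one sentence.")
```

In practice these per-request numbers would be collected across concurrency levels, since techniques like continuous batching typically trade some inter-token latency for aggregate throughput.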
Required Qualifications
- Bachelor's or Master's degree in Computer Science, Mathematics, Software Engineering, Computer Engineering, or related technical field, and 5+ years of related experience in machine learning systems, distributed systems, inference infrastructure, or software engineering, OR Doctorate in Computer Science, Mathematics, Software Engineering, Computer Engineering, or related technical field, and 2+ years of related experience.
- Strong programming skills in Python, C++, or C#.
- Experience with large-scale ML/LLM inference serving in production, including ML systems (MLSys) work for model deployment, serving, or runtime optimization.
- Experience building or optimizing systems for online inference, batch inference, or near-real-time inference.
- Experience with one or more modern inference stacks or runtimes such as vLLM, TensorRT-LLM, SGLang, Triton, ONNX Runtime, DeepSpeed, or PyTorch inference tooling.
- Experience with modern LLM inference and serving techniques, including areas such as KV-cache management, prefix caching, speculative decoding, quantization, prefill/decode disaggregation, or MoE inference optimization (a toy prefix-cache sketch follows this list).
- Experience with production-scale model serving platforms and distributed inference systems, including multi-node or multi-tenant deployments, resource-aware scheduling, and optimization across heterogeneous workloads.
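Since prefix caching and KV-cache management appear in both sections, here is a toy sketch of the core idea under stated assumptions: fixed-size KV-cache blocks keyed by a hash of the entire token prefix, so cached attention state is reused only when every preceding token matches. The PrefixCache class, BLOCK_SIZE, and opaque handle values are illustrative assumptions, not any specific engine's implementation.

```python
# Toy sketch of prefix caching for KV-cache reuse. Blocks are keyed by
# a hash of the ENTIRE prefix up to the end of that block, because KV
# state is only valid when every preceding token matches. BLOCK_SIZE
# and the opaque handles are illustrative assumptions.
from __future__ import annotations

import hashlib

BLOCK_SIZE = 16  # tokens per KV-cache block (assumed)

class PrefixCache:
    def __init__(self) -> None:
        self._blocks: dict[str, object] = {}  # prefix hash -> KV block handle

    @staticmethod
    def _block_hash(tokens: list[int], end: int) -> str:
        return hashlib.sha256(str(tokens[:end]).encode()).hexdigest()

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens are covered by cached blocks."""
        covered = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            if self._block_hash(tokens, end) not in self._blocks:
                break
            covered = end
        return covered

    def insert(self, tokens: list[int], handles: list[object]) -> None:
        """Register one handle per full block of a finished prefill."""
        for i, end in enumerate(range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE)):
            self._blocks[self._block_hash(tokens, end)] = handles[i]
```

A scheduler can call match_prefix before prefill to skip recomputing attention for the covered tokens; production engines layer eviction, reference counting, and copy-on-write on top of this basic idea.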
Original Posting
This role is sourced from Microsoft. Apply on the Microsoft careers page.