Responsibilities
Implement large-scale model training, especially with LLMs, SLMs, multimodal, or code-specific models.
Develop robust evaluation frameworks to assess model performance, conduct systematic benchmarking, and address identified weaknesses while ensuring compliance with customer standards.
Write efficient, production-quality code and debug complex distributed systems.
Build and maintain internal tools to streamline training and evaluation workflows and automate repetitive tasks within secure development environments.
Required Qualifications
Master's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
5+ years of professional experience, including 2+ years with Python and ML frameworks such as PyTorch or TensorFlow.
Hands-on experience with training or fine-tuning LLMs or multimodal models.
Familiarity with production ML systems and concepts such as model serving, caching, batching, and monitoring.
Understanding of distributed systems and cloud-based infrastructure.
Experience with containerization tools (e.g., Docker, Kubernetes).
Exposure to MLOps or DevOps practices (CI/CD, automated testing, deployment).
Interest in generative AI and open-source model ecosystems.
Ability to work in a fast-paced, collaborative environment with a growth mindset.
Strong communication and documentation skills.
Original Posting
This role is sourced from Microsoft. Apply on the Microsoft careers page.