Resiliency, Reliability & Operational Excellence
Apply a reliability‑first mindset, designing and validating highly available, fault‑tolerant systems through proactive testing, failure simulations, chaos engineering, and resilience reviews.
Monitoring, Observability & Intelligent Operations
Lead adoption of modern monitoring and observability practices, including distributed tracing, metrics, logs, and end‑to‑end service health visibility across complex, distributed systems.
AI‑Enabled Architect…
Responsibilities
Support for Mission Critical is seeking deep technical architects aligned to SDC customers who are undergoing, or anticipating, hyper‑growth and increasing operational complexity.
Serve as a senior technical leader, driving vision for customers and internal teams; pilot new operating models, AI‑enabled capabilities, and data‑driven practices; scale proven architectures and patterns; and mentor others to elevate technical depth across the organization.
Guide customers in defining and achieving SLOs, SLIs, and error budgets, with clear accountability and measurable outcomes.
Drive continuous improvement by going beyond traditional root‑cause analysis to understand systemic, architectural, and organizational contributors to incidents.
Correlate telemetry, customer signals, and platform events to produce actionable insights, risk identification, and proactive recommendations.
Promote automation and AI‑assisted approaches for incident detection, triage, and remediation, reducing MTTR and escalation frequency.
Required Qualifications
7+ years of experience in cloud/infrastructure technologies, information technology (IT) consulting/support, systems administration, network operations, software development/support, technology solutions, practice development, architecture, and/or consulting, OR equivalent experience.
Deep proficiency in cloud, software, ISV, or consulting ecosystems.
Strong technical depth, including level‑500 expertise in at least one Azure domain, with broad familiarity across the Azure platform.
Proven ability to design, operate, and troubleshoot complex, highly available, mission‑critical systems and to lead customer escalations effectively.
Demonstrated experience with monitoring, observability, and reliability engineering practices in large‑scale distributed systems.
Software development experience, including AI‑enabled solutions, and a strong understanding of DevOps and CI/CD practices.
Advanced degree and/or certifications such as PMP, SRE, or equivalent are a plus.
Experience launching products, platforms, or support offers at enterprise scale.
This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Original Posting
This role is sourced from Microsoft. Apply on the Microsoft careers page.