Operational Excellence For Software Engineers
Published 9/2025
MP4 | Video: h264, 1920x1080 | Audio: AAC, 44.1 KHz
Language: English | Size: 2.15 GB | Duration: 4h 51m
Master the mindset, tools, and strategies behind building reliable, scalable, and cost-efficient systems.
What you'll learn
Define a framework for continuous improvement and apply it across operational workflows
Establish measurable operational targets and track SLAs using relevant KPIs
Identify common availability issues and implement mitigation strategies
Minimize the impact and duration of operational incidents
Optimize system performance using proven engineering techniques and design patterns
Forecast capacity and manage budget to ensure cost-effective operations
Maintain reliable delivery pipelines through structured update and deployment practices
Design dashboards that provide actionable insights and enhance operational visibility
Requirements
Basic Software Engineering Experience, although we won't be coding
Basic Understanding of DevOps concepts, like: CI/CD Pipeline, Testing, Deployment
Basic Cloud Knowledge, like: Auto Scaling, API Gateway
Familiarity with Production Systems: Exposure to live systems, deployments, and incident handling (even at a junior level) is important for context.
Description
Operational Excellence is the backbone of resilient, scalable, and cost-effective software systems. This course is designed for software engineers, DevOps professionals, and technical leaders who want to elevate their operational mindset and take full ownership of system health. Through a structured, hands-on approach, learners will explore the methodology of continuous improvement, define and track operational targets like SLAs, and learn to identify and mitigate availability issues before they escalate.
Students will gain practical skills to minimize the impact of incidents using deployment strategies, rollback mechanisms, and regional isolation techniques. The course dives deep into performance optimization, covering advanced engineering practices such as caching, parallelism, and request hedging. Budget and cost management are treated as first-class concerns, with strategies for forecasting demand, planning capacity, and reducing operational expenses.
In addition, learners will master delivery pipeline hygiene and update methodologies, ensuring reliable deployments and long-term system stability. The course also teaches how to design dashboards that transform observability into actionable insight, empowering teams to monitor, respond, and improve with confidence.
Whether you're scaling infrastructure, responding to outages, or refining deployment workflows, this course will help you build systems that are not only reliable and performant, but also aligned with business goals and engineering best practices.
Overview
Section 1: Course Overview
Lecture 1 Introduction - What is Operational Excellence
Lecture 2 What exactly are we trying to improve?
Lecture 3 OE Importance & what would you gain from this Course
Lecture 4 Course Topics
Lecture 5 DevOps Concepts
Section 2: Continuous Improvement Methodology
Lecture 6 Building a Mechanism for Improvement
Lecture 7 Learning from your own mistakes
Lecture 8 Applying past mistakes to future-proof operations
Lecture 9 Improvement Flywheel
Section 3: Operation Targets & Execution Tracking
Lecture 10 SLA - Service Level Agreement
Lecture 11 Availability
Lecture 12 Latency
Lecture 13 Additional external operational targets (Throughput, Freshness, Support etc.)
Lecture 14 Internal operational targets (Cost, KTLO, Tickets)
Lecture 15 Monitoring - Tracking execution
Lecture 16 Sharing operational performance publicly
Section 4: Availability Problems & Mitigations
Lecture 17 External Dependencies
Lecture 18 Mitigation - Dependencies Redundancy
Lecture 19 Mitigation - Asynchronous implementation
Lecture 20 Mitigation - Retries
Lecture 21 Unpredicted Demand
Lecture 22 Bugs
Lecture 23 Mitigation - Code Reviews and Tests
Lecture 24 Unpredicted Failures - Problem and Mitigations
Lecture 25 Performance Issues - Problem and Mitigations
Lecture 26 Gamedays: Real-World Performance Testing
Lecture 27 Breaking API Contract
Lecture 28 Neglected Operations
Lecture 29 Manual Operations Mistake
Lecture 30 Mitigation - Change Management
Section 5: Minimizing Incidents Impact
Lecture 31 Minimizing Blast Radius
Lecture 32 Minimizing Incident Duration & Auto Rollback
Lecture 33 Identifying there is a Problem
Lecture 34 Finding the Cause - Runbooks
Lecture 35 Finding the Cause - Correlations
Lecture 36 Finding the Cause - Logs
Lecture 37 Finding the Cause - Debugging
Lecture 38 The Art of Investigation
Lecture 39 Implementing a Solution
Lecture 40 War Room
Lecture 41 OE Flywheel - COE (Correction of Error)
Section 6: Performance Optimization
Lecture 42 Why is Performance Optimization important?
Lecture 43 Code Optimization
Lecture 44 Caching Overview
Lecture 45 Caching Types
Lecture 46 Prefetching and Lazy Loading
Lecture 47 Precomputation, Parallelism and Sharding
Lecture 48 Improving Tail Latency (Request Hedging)
Lecture 49 Scaling
Section 7: Budget and Cost Management
Lecture 50 Measuring Demand
Lecture 51 Scaling frequency: On-Premises vs. Cloud
Lecture 52 Forecasting Demand
Lecture 53 Capacity Planning
Lecture 54 Cost Savings
Lecture 55 Monitoring your Cost
Section 8: Software Delivery
Lecture 56 Dependencies Packages and Libraries Update
Lecture 57 OS Patching
Lecture 58 Pipelines Hygiene and Velocity
Lecture 59 Test/Prod environment Similarity
Section 9: Operation Dashboard
Lecture 60 Dashboard Structure and Design Principles
Lecture 61 Dashboard Sections
Section 10: Conclusion
Lecture 62 Wrapping Up
Who this course is for
Software Development Engineers (SDEs) who want to take ownership of system reliability, performance, and cost, and level up their operational thinking
DevOps Engineers looking to expand their impact beyond tooling into strategic operational practices
Site Reliability Engineers (SREs) aiming to strengthen their approach to incident response and system resilience
Architects and Software Engineering Managers (SDMs) seeking a structured framework for improving system health and delivery velocity