Azure SRE Agent – configure an alert for remediation
A technical guide on configuring Azure Monitor alerts for automated remediation using the Azure SRE Agent in preview.
A technical guide on configuring Azure Monitor alerts for automated remediation using the Azure SRE Agent in preview.
Analysis of Cloudflare's major outage that took down sites like ChatGPT and X, and their transparent postmortem report.
Explores why on-call rotations are often painful, framing them as a sociotechnical problem requiring both technical fixes and organizational changes.
An analysis of Twitter's most severe cache-related incidents from 2012-2022, exploring patterns and knowledge loss.
Explains the importance of automated alerts in IT operations, detailing a cycle for identifying symptoms, creating triggers, and improving incident response.
A tutorial on building a no-code incident management system using Azure Logic Apps to monitor Twitter sentiment and create PagerDuty alerts.
A critique of traditional 'war room' monitoring centers, arguing they are ineffective and harmful compared to automated observability and developer ownership.
Discusses why software releases cause incidents and offers strategies to make deployments safer and less stressful for engineering teams.
A guide on conducting effective post-incident retrospectives for software teams, focusing on process and improvement.
A developer's guide to using Azure Service Health to monitor and receive alerts for active service incidents affecting Azure subscriptions.
A summary and analysis of Google's SRE book, contrasting traditional sysadmin roles with Google's Site Reliability Engineering approach.
A developer explains why avoiding major code changes on Fridays prevents weekend production bugs and burnout, advocating for refactoring and R&D instead.