Notes on OpenAI Kubernetes outage
Analysis of OpenAI's Kubernetes outage, focusing on API server overload and DNS service discovery issues in large-scale clusters.
A technical guide on creating Azure Action Groups for notifications using Terraform and PowerShell code examples.
A guide to automating Azure monitoring and alert setup using PowerShell within Infrastructure as Code (IaC) deployments.
A guide to creating an Azure Service Health dashboard using Azure Resource Graph Explorer, including KQL queries and a shared workbook.
Analyzes the rising costs and diminishing value of traditional observability tools, exploring the 'cost multiplier' effect of using multiple overlapping telemetry systems.
A guide to migrating from Classic Application Insights to the new Workspace-based model, covering the process, data merging, and alert reconfiguration.
A guide on copying specific elements like queries, metrics, or groups between Azure Workbooks using the Advanced Editor and JSON.
A tutorial on setting up a comprehensive Kubernetes monitoring stack using Prometheus, Grafana, and the Robusta platform.
A developer's monthly digest covering books on Go, TypeScript, and Prometheus, plus articles on AI, work culture, and teaching observability.
A guide to implementing OpenTelemetry for monitoring and observability in an Angular application using the browser SDK.
A case study on implementing a custom microservice (Chronos) to measure end-to-end latency in a microservice architecture.
A guide to designing a state-of-the-art, multi-account security logging and monitoring platform in Google Cloud Platform (GCP).
Explores challenges and solutions for setting up Azure alerts at scale, focusing on Log Analytics and host platform metrics for IaaS VMs.
A guide to setting up low-cost website monitoring for Azure Static WebApps using Application Insights URL ping tests and alerts.
Learn how to implement and use the Python logging module to monitor events and analyze application performance.
Explores using eG Enterprise for comprehensive monitoring and performance insights in Azure Virtual Desktop environments.
A critique of traditional metrics for observability, arguing they are limited for debugging unknown issues but still valuable for system health monitoring.
Part 4 of a Kubernetes for Developers series, focusing on setting up monitoring with kube-prometheus-stack, Prometheus, and Grafana.
An independent web performance consultant explains the value they bring to organizations by focusing teams, sharing cross-client best practices, and driving measurable improvements.
A guide to setting up a free monitoring stack for Django applications, covering uptime, error reporting, logs, and performance.