Improve observability for Windows EC2 instances with the CloudWatch Agent
Learn how to extend monitoring for Windows EC2 instances by deploying and configuring the Amazon CloudWatch Agent using AWS CDK.
Five practical strategies to reduce Datadog logging costs by optimizing ingestion, indexing, and retention, and by using metrics.
Advice on when and why to form a computer performance engineering team, based on the author's experience at Netflix and Intel.
A technical guide on instrumenting AI agentic applications using Arize Phoenix and litellm for observability and trace grouping.
A developer's technical walkthrough of instrumenting LLM tracing for litellm using Braintrust and Langfuse, detailing setup and challenges.
An engineer shares his 3-year experience of working remotely from Australia for a US firm, detailing the challenges of extreme timezone differences.
A guide to help large organizations decide whether to use Datadog's multi-org feature, covering key factors like company structure, data correlation, and cost.
Introducing Logfire, Pydantic's new observability tool for Python, with easy integration for OpenAI LLM calls, FastAPI, and logging.
How eBPF technology can prevent system crashes like the massive July 2024 Windows outage caused by a faulty kernel driver update.
A list of essential Linux tools to pre-install for diagnosing performance issues and outages, including package names.
Blog post about the new eBPF documentary, which tells the story of how the revolutionary Linux kernel technology was developed and accepted.
A guide to setting up distributed tracing for C# applications using Grafana and Tempo, including infrastructure configuration and integration.
A guide to setting up centralized logging for C# applications using Grafana and Loki, including infrastructure setup and code integration.
Explains why eBPF observability tools, designed for low overhead, are not suitable for security monitoring due to evasion risks.
Brendan Gregg's SREcon22 APAC keynote on the future of computing performance, covering new developments and predictions.
A case study on implementing a custom microservice (Chronos) to measure end-to-end latency in a microservice architecture.
A discussion of common pitfalls in measuring tail latency metrics in distributed systems, using examples from Twitter's infrastructure.
Explains how to get high value from distributed tracing with less effort, using a real-world implementation from Twitter as a case study.