ICDSA2025

Ms. Bhulakshmi Makkena

Data-Centric SRE: Driving Cloud Reliability Through Observability Metrics

Abstract:

Modern digital systems produce vast volumes of telemetry data, yet incident diagnosis often remains reactive and manual. This paper explores how integrating data science into observability transforms fragmented logs, metrics, and traces into structured, actionable insights. By treating observability as a core discipline, Site Reliability Engineering (SRE) teams can proactively detect system stress, performance anomalies, and configuration issues through statistical modeling and machine learning. Key metrics, such as latency patterns, error rates, and MTTR, are analyzed using time-series modeling and anomaly detection, enabling early signal identification and faster resolution. Real-world applications demonstrate improved coordination, reduced incident frequency, and predictive detection capabilities. Looking forward, observability powered by intelligent analytics will drive adaptive, self-correcting systems that enhance operational efficiency and reliability across engineering, operations, and leadership.
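
As an illustration of the kind of anomaly detection the abstract refers to, the following is a minimal sketch of a trailing-window z-score check on a latency series. The abstract does not specify a method; the function name, window size, threshold, and sample data are all hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` samples.

    Illustrative only: the window and threshold are assumed values.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale to compare against; skip it.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latencies in milliseconds, with one spike at index 7.
latency_ms = [102, 98, 101, 99, 100, 103, 97, 250, 101, 99]
print(rolling_zscore_anomalies(latency_ms))  # -> [7]
```

A trailing window is used so that each point is judged only against earlier data, which is what an online alerting check would have available. Note that the spike inflates the window's variance for the samples that follow it, so a production detector would likely need a more robust baseline.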

Profile:

BhuLakshmi Makkena is a Lead Site Reliability Engineering (SRE) professional with a strong foundation in cybersecurity principles and cloud infrastructure resilience. With a background in maintaining highly available, secure, and scalable systems, she brings a unique perspective to defending complex environments against evolving cyber threats. Her expertise spans incident response, infrastructure hardening, observability, and automation, with a particular focus on integrating security into every layer of system reliability. Passionate about bridging the gap between operations and security, BhuLakshmi actively works to embed cyber-resilience into modern DevOps pipelines.