Mr. Jeevan Kumar Goud Bandharapu
From Predictive Reliability to Responsible Autonomy: Designing Sustainable Self-Healing in Production AI Systems
Abstract:
As AI systems scale, reliability and sustainability must advance together. This talk introduces Responsible Autonomy—a framework for self-healing that optimizes availability, cost, and carbon within explicit safety constraints. Unlike traditional AIOps, it integrates sustainability objectives, explainable decision-making, and policy guardrails into a unified control architecture. A KPI-windowed reliability framework merges discrete status checks with learned health scores to forecast risk and trigger autonomous remediation such as selective pod restarts or right-sized scaling instead of cluster-wide over-provisioning.
Validated over six months across more than twenty production microservices processing 5–10 TB of logs daily, the approach reduced mean time to recovery (MTTR) by 75 percent—from 87 to 22 minutes—and cut energy waste by 48 percent (approximately 18,500 kWh monthly, about 8 tons of CO₂ avoided) while maintaining 99.1 percent service-level objective (SLO) attainment. A layered, explainable interface ensures that every autonomous action produces an auditable, natural-language justification—building transparency, trust, and continuous improvement.
Building on advances in predictive reliability and sustainable computing, the session distills seven design principles for trustworthy self-healing (calibrated confidence, monotonic safety, explainability-by-design, human-in-the-loop governance, energy-proportional responses, continuous adaptation, and fail-safe defaults) and outlines a phased integration roadmap beginning with shadow mode to earn operator confidence. The next era of AI will not merely think faster; it will recover smarter and consume wiser.
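As a rough illustration of the KPI-windowed idea described in the abstract, the sketch below blends discrete status checks with a learned health score over one window, then maps the resulting risk to a targeted action that is only recommended while in shadow mode. It is not the speaker's implementation: the names WindowSample, window_risk, and choose_action, the equal weighting, and both thresholds are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class WindowSample:
    status_ok: bool       # discrete status check result for this interval
    health_score: float   # learned health score in [0, 1]; 1.0 means healthy (assumed scale)


def window_risk(samples: Sequence[WindowSample], status_weight: float = 0.5) -> float:
    """Blend the failure rate of status checks with mean 'unhealthiness' over one window."""
    if not samples:
        raise ValueError("window must contain at least one sample")
    n = len(samples)
    failure_rate = sum(not s.status_ok for s in samples) / n
    mean_unhealth = sum(1.0 - s.health_score for s in samples) / n
    return status_weight * failure_rate + (1.0 - status_weight) * mean_unhealth


def choose_action(risk: float, shadow_mode: bool = True,
                  restart_threshold: float = 0.6,
                  scale_threshold: float = 0.3) -> str:
    """Map risk to a targeted remediation; thresholds are hypothetical."""
    if risk >= restart_threshold:
        action = "restart_pod"
    elif risk >= scale_threshold:
        action = "right_size_scale"
    else:
        action = "none"
    # In shadow mode the system only recommends; an operator decides whether to act.
    if shadow_mode and action != "none":
        return f"recommend:{action}"
    return action


if __name__ == "__main__":
    window = [
        WindowSample(True, 0.9),
        WindowSample(False, 0.2),
        WindowSample(False, 0.1),
        WindowSample(True, 0.8),
    ]
    risk = window_risk(window)
    print(round(risk, 3), choose_action(risk))
```

Defaulting shadow_mode to True reflects the fail-safe-defaults principle: the sketch takes no autonomous action unless an operator explicitly enables it.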
Keywords:
AI Reliability · Responsible Autonomy · Self-Healing Systems · Sustainable Computing · Energy-Aware Operations · AIOps · Explainable Automation · Cloud-Native Infrastructure
Profile:
Jeevan Kumar Goud Bandharapu is an accomplished Data Scientist specializing in Agentic AI, AIOps, and real-time observability, with over 7 years of experience in designing and implementing AI-driven, scalable data solutions. Currently with Evernorth Health Services, he develops predictive reliability frameworks, ML-based anomaly detection, and data engineering pipelines that enhance operational resilience and digital experience. Skilled in Splunk, Databricks (Python, PySpark), and Alteryx, Jeevan integrates deep technical expertise with a strong business understanding to transform raw data into actionable insights. He holds patents in machine learning for automated anomaly detection and certifications in Six Sigma Green Belt, Lean, and Kanban, reflecting his commitment to continuous improvement and operational excellence. Passionate about data storytelling and intelligent automation, Jeevan excels at building systems that not only inform but act—driving measurable impact across enterprise environments.