Goutham Yenuganti
Generative Intelligence Driven Self-Healing Systems: A Multi-Agent Framework for Autonomous Reliability in Distributed Cloud Environments
Abstract:
As distributed cloud-native systems grow in scale and architectural complexity, traditional reactive monitoring approaches are no longer sufficient to ensure reliability, resilience, and operational efficiency. This session presents a machine-learning-driven framework for autonomous self-healing enterprise applications that integrates structured observability with fine-tuned generative intelligence models.
The framework introduces a three-layer architecture consisting of high-fidelity telemetry collection, intelligent anomaly detection and reasoning, and controlled automated execution. The observability layer adopts OpenTelemetry best practices and correlation ID strategies to enable cross-service traceability, real-time analytics, and business context mapping. A tiered storage model (hot storage with sub-second queries, warm storage with second-level queries, and cold storage with minute-level queries) supports both real-time anomaly detection and long-term trend modeling.
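As a rough illustration of the tiered storage idea, the sketch below routes a query to a tier based on the age of the telemetry it touches. The retention windows (one hour for hot, seven days for warm) and the name `select_tier` are assumptions made for this example; the abstract does not specify them.

```python
from datetime import timedelta

# Hypothetical retention boundaries; the abstract does not state actual values.
HOT_WINDOW = timedelta(hours=1)
WARM_WINDOW = timedelta(days=7)


def select_tier(age: timedelta) -> str:
    """Return the storage tier that would serve telemetry of the given age."""
    if age < timedelta(0):
        raise ValueError("age must be non-negative")
    if age <= HOT_WINDOW:
        return "hot"    # sub-second queries, used for real-time detection
    if age <= WARM_WINDOW:
        return "warm"   # second-level queries
    return "cold"       # minute-level queries, used for long-term trends


if __name__ == "__main__":
    for age in (timedelta(minutes=5), timedelta(days=2), timedelta(days=90)):
        print(age, "->", select_tier(age))
```

Routing by age keeps the fast tier small, while older data remains available for trend modeling at a higher query latency.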
The intelligence layer combines statistical baselining, machine-learning-based anomaly detection, and large language models fine-tuned on domain-specific incident data. Research demonstrates significant variability in large language model performance across operational contexts, highlighting the necessity of domain adaptation and rigorous evaluation strategies. A multi-agent architecture consisting of diagnostic, remediation, validation, and ensemble coordination agents enables structured reasoning, risk assessment, and consensus validation before execution.
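One way to picture consensus validation before execution is a coordinator that approves a remediation only when every agent agrees and the highest reported risk stays under a limit. This is a minimal sketch under assumptions: `AgentVerdict`, `ensemble_decision`, the unanimity rule, and the 0.3 risk limit are all hypothetical and are not taken from the framework itself.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AgentVerdict:
    agent: str      # e.g. "diagnostic", "remediation", "validation"
    approve: bool   # whether this agent endorses the proposed action
    risk: float     # assumed scale: 0.0 (safe) to 1.0 (dangerous)


def ensemble_decision(verdicts: Sequence[AgentVerdict], max_risk: float = 0.3) -> bool:
    """Approve only if all agents approve and the worst risk is within max_risk."""
    if not verdicts:
        return False  # no evidence means no autonomous action
    unanimous = all(v.approve for v in verdicts)
    worst_risk = max(v.risk for v in verdicts)
    return unanimous and worst_risk <= max_risk


if __name__ == "__main__":
    verdicts = [
        AgentVerdict("diagnostic", True, 0.10),
        AgentVerdict("remediation", True, 0.20),
        AgentVerdict("validation", True, 0.15),
    ]
    print(ensemble_decision(verdicts))
```

Using the worst-case risk rather than the average is a deliberately conservative choice here: a single agent flagging high risk is enough to block execution.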
Hierarchical anomaly detection methods grounded in comparative machine learning research improve detection robustness while minimizing false positives. The execution layer incorporates formal safety engineering mechanisms, including circuit breakers, canary deployments, runtime monitoring, and automated rollback triggers to ensure controlled autonomy.
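The interplay of circuit breakers and automated rollback can be sketched as follows. The class `RemediationCircuitBreaker`, its threshold of three consecutive failures, and the convention that an action returns `True` when the post-change health check passes are assumptions for illustration only, not the framework's actual design.

```python
from typing import Callable


class RemediationCircuitBreaker:
    """Stops automated remediation after repeated failed attempts (hypothetical sketch)."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        # An open breaker blocks further automated actions.
        return self.consecutive_failures >= self.failure_threshold

    def run(self, action: Callable[[], bool], rollback: Callable[[], None]) -> str:
        if self.is_open:
            return "blocked"
        try:
            healthy = action()  # assumed to return True if the system is healthy afterwards
        except Exception:
            healthy = False
        if healthy:
            self.consecutive_failures = 0
            return "applied"
        rollback()  # automated rollback trigger
        self.consecutive_failures += 1
        return "rolled_back"

    def reset(self) -> None:
        # Intended to be called by a human operator after investigation.
        self.consecutive_failures = 0
```

A caller would wrap each proposed fix in `run`, so that a remediation that keeps failing is rolled back and then blocked until an operator calls `reset`.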
Aligned with conference themes in artificial intelligence, machine learning algorithms, distributed computing, and cloud infrastructure, this talk bridges advanced machine learning research with practical distributed systems engineering to demonstrate how generative intelligence can enable safe, explainable, and autonomous system remediation.
Profile:
Goutham Yenuganti is a resourceful and accomplished Software Development Engineer with over twelve years of experience spanning Java/J2EE, Node.js, and Python. Throughout his career, he has specialized in designing and building large-scale, highly available systems and AI-powered solutions that drive measurable innovation across leading global enterprises.
In his current role as Lead Member of Technical Staff at Salesforce in Bellevue, which he has held since January 2025, Goutham leads the design and development of AI-powered IT Service Management and Employee Experience solutions. His work is central to Salesforce’s Customer 360 and Agentforce platforms, delivering proactive and intelligent service experiences to major enterprise customers such as Walmart, AT&T, IBM, and Accenture. He has spearheaded the development of AI-powered ITSM services embedded in Slack, enabling conversational ticket resolution and proactive service delivery at scale. He has also architected scalable backend services for IT Asset Management, IT Compliance, and Procurement, contributing to feature delivery across Salesforce’s Field Service Cloud, Revenue Cloud, and Manufacturing Cloud. His work in integrating conversational AI through Agentforce has significantly accelerated ticket resolution and increased automation of repetitive workflows through advanced prompt engineering and system optimization.
Prior to Salesforce, Goutham served as a Software Development Engineer at Amazon Advertising in Seattle (2019–2025), where he designed and scaled an advertising rendering service capable of handling one million transactions per second with sub-20-millisecond latency, contributing to substantial revenue growth and operational efficiency. He introduced modern frameworks, optimized infrastructure for cost efficiency, and re-architected core services to improve scalability and maintainability within the AWS ecosystem.
Earlier in his career, he worked as an SDE II at Amazon Appstore (Chennai), a Developer at SAP Ariba (Bangalore), and held engineering roles at DXC Technology and on enterprise regulatory platforms in Chennai, contributing to ERP modules, authentication systems, analytics libraries, and compliance solutions across global markets.
Goutham holds a Bachelor of Technology in Electronics and Communication Engineering from Jawaharlal Nehru Technological University, Anantapur. He is an Oracle Certified Associate (Java SE 8) and has completed advanced coursework in algorithms, Java multithreading, and Linux web server configuration. Fluent in English, Telugu, Hindi, and Tamil, he continues to focus on advancing scalable, AI-driven enterprise systems.