Mr. Dinesh Eswararaj
A Hybrid Framework for Scalable Data Quality: PySpark + AI-Powered Validation in Microsoft Fabric
Abstract:
Enterprises adopting lakehouse architectures often rely on rule-based data quality checks that are hard to scale and brittle across heterogeneous sources. In this talk, I present a hybrid framework that pairs PySpark for high-throughput syntactic validation with LLM-assisted semantic checks to catch context-driven errors that rules miss (e.g., unit mismatches, narrative-field anomalies, sentiment drift). Implemented on Microsoft Fabric/Databricks with Delta Lake and Change Data Feed, the approach supports incremental validation, automatic exception routing, and human-in-the-loop triage. I’ll share reference patterns for Bronze→Silver→Gold pipelines, guidance on choosing partition and Z-ORDER strategies, and practices for using prompt libraries safely. Real-world results show reduced defect escape rates, faster triage, and meaningful cost control through targeted AI invocation: expensive LLM calls are reserved for the narrow slice of records that rules alone cannot adjudicate. The session provides a practical blueprint (architecture diagrams, sample PySpark/SQL snippets, and CI/CD hooks) for introducing AI-augmented data quality without disrupting existing pipelines.
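To make the hybrid pattern concrete, the sketch below shows one way the two stages could be wired up in PySpark: an incremental read from the Delta Change Data Feed, cheap syntactic rules evaluated at Spark scale, and a bounded hand-off to an LLM-backed semantic check. Everything specific here is a hypothetical illustration rather than the framework's actual code: the bronze.vendor_feed and dq.quarantine tables, the vin/price_usd/free_text_notes columns, the starting version, and the llm_semantic_check stub are all assumptions.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Incremental read via Delta Change Data Feed (CDF must be enabled on the
    # source table); the table name and starting version are hypothetical.
    changes = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 42)  # would come from a checkpoint in practice
        .table("bronze.vendor_feed")
        .filter(F.col("_change_type").isin("insert", "update_postimage"))
    )

    # Stage 1: high-throughput syntactic rules in PySpark (formats, ranges, nulls).
    checked = changes.withColumn(
        "rule_ok",
        F.col("vin").rlike("^[A-HJ-NPR-Z0-9]{17}$")    # illustrative format rule
        & F.col("price_usd").between(0, 500_000)       # illustrative range rule
        & F.col("free_text_notes").isNotNull(),
    )
    valid = checked.filter("rule_ok")
    exceptions = checked.filter("NOT rule_ok")

    # Stage 2: targeted AI invocation -- only a bounded sample of rows that
    # passed the rules but carry narrative fields goes to a semantic check.
    def llm_semantic_check(text: str) -> bool:
        # Placeholder for a prompt-library call that would flag unit mismatches,
        # narrative anomalies, or sentiment drift; stubbed so the sketch runs.
        return "N/A" not in text

    flagged = [
        row for row in valid.select("vin", "free_text_notes").limit(1000).collect()
        if not llm_semantic_check(row["free_text_notes"])
    ]

    # Exception routing: rule failures land in a quarantine table for
    # human-in-the-loop triage; LLM flags would be appended the same way.
    exceptions.write.format("delta").mode("append").saveAsTable("dq.quarantine")

The bounded limit() before the semantic stage is what keeps AI spend targeted: the Spark rules do the bulk filtering cheaply, and only ambiguous, narrative-bearing rows pay for a model call.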
Profile:
Dinesh Eswararaj is a Lead Data Engineer / Data Architect specializing in Azure, Microsoft Fabric, and Databricks. He has led large-scale data-platform modernizations in automotive and enterprise settings, designing medallion lakehouses, metadata-driven ingestion frameworks, and cost-optimized pipelines across ADLS, Delta Lake, and Unity Catalog. His work includes a configurable Data Service Automation Framework that accelerated vendor-feed onboarding and delivered significant cost savings. Dinesh is an IEEE Senior Member and a frequent reviewer and industry judge. He has published multiple articles on cloud data engineering, data quality, and AI-assisted validation, and regularly mentors teams on CI/CD, governance, and performance tuning for lakehouse architectures.