Nirav Rana
ML Framework for Enterprise Data Enrichment: Scaling to Massive Datasets
Abstract:
Enterprise organizations struggle with entity resolution across fragmented data sources: determining when records refer to the same real-world entity despite variations in representation. Traditional deterministic matching fails at scale because exact comparisons cannot recognize that "International Business Machines," "IBM Corp.," and "IBM" represent the same company.
This presentation describes a production machine learning framework that processes hundreds of millions of records, combining gradient-boosted decision trees with probabilistic matching techniques. Built on AWS-native services (S3, Glue, Lambda, Step Functions), the system achieves measurable improvements in data quality through confidence scoring that quantifies how certain each match is, rather than emitting only a binary match or non-match prediction.
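As an illustration of confidence scoring that goes beyond a binary prediction, the sketch below uses a Fellegi-Sunter-style probabilistic score. It is not the framework's actual model: the field names, the m/u probabilities, the prior, and the decision thresholds are all hypothetical values chosen for the example. A trained classifier such as a gradient-boosted model could supply the probability instead.

```python
import math

# Hypothetical per-field parameters (assumptions, not values from the talk):
# m = P(field agrees | records match), u = P(field agrees | records do not match)
FIELD_PARAMS = {
    "name": (0.95, 0.02),
    "city": (0.90, 0.10),
    "industry": (0.85, 0.20),
}
PRIOR_MATCH = 0.01  # assumed share of candidate pairs that are true matches


def match_probability(agreements, params=FIELD_PARAMS, prior=PRIOR_MATCH):
    """Turn per-field agreement flags into a match probability.

    Each field adds a log-likelihood ratio to the prior log-odds,
    assuming the fields are conditionally independent.
    """
    log_odds = math.log(prior / (1 - prior))
    for field, agrees in agreements.items():
        m, u = params[field]
        if agrees:
            log_odds += math.log(m / u)
        else:
            log_odds += math.log((1 - m) / (1 - u))
    return 1 / (1 + math.exp(-log_odds))


def decide(prob, accept=0.9, reject=0.1):
    """Map a probability to one of three outcomes instead of a yes/no."""
    if prob >= accept:
        return "match"
    if prob <= reject:
        return "non-match"
    return "review"


# Example: name and city agree, industry differs
score = match_probability({"name": True, "city": True, "industry": False})
outcome = decide(score)
```

The middle band is the point of the exercise: pairs whose score falls between the two thresholds can be routed to human review rather than being forced into an automatic decision.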
The framework handles diverse organizational types, from Fortune 500 enterprises with rich digital footprints to early-stage startups with minimal public information, through specialized matching strategies and feature engineering. Text-based features capture naming variations, geographic features measure spatial relationships, firmographic attributes distinguish entities by size and industry, and technographic signals track technology adoption patterns.
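The four feature families above can be sketched for a single candidate pair as follows. This is a minimal illustration using only the Python standard library; the record fields (`name`, `lat`, `lon`, `employees`, `industry`, `tech`), the suffix list, and the specific similarity measures are assumptions for the example, not the framework's actual feature set.

```python
import math
import re
from difflib import SequenceMatcher

# Assumed list of legal suffixes to ignore when comparing names
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc", "company"}


def normalize_name(name):
    """Lowercase, strip punctuation, and drop legal suffixes; returns tokens."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return [t for t in tokens if t not in LEGAL_SUFFIXES]


def acronym(tokens):
    """First letter of each token, e.g. ['international', 'business'] -> 'ib'."""
    return "".join(t[0] for t in tokens)


def jaccard(a, b):
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pair_features(a, b):
    """Build text, geographic, firmographic, and technographic features."""
    ta, tb = normalize_name(a["name"]), normalize_name(b["name"])
    sa, sb = " ".join(ta), " ".join(tb)
    is_acronym = bool(sa and sb) and (sa == acronym(tb) or sb == acronym(ta))
    return {
        # Text-based: fuzzy similarity plus an acronym check
        "name_similarity": SequenceMatcher(None, sa, sb).ratio(),
        "acronym_match": float(is_acronym),
        # Geographic: distance between the two recorded locations
        "distance_km": haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]),
        # Firmographic: relative size and industry agreement
        "employee_ratio": min(a["employees"], b["employees"])
        / max(a["employees"], b["employees"], 1),
        "same_industry": float(a["industry"] == b["industry"]),
        # Technographic: overlap of detected technologies
        "tech_overlap": jaccard(a["tech"], b["tech"]),
    }


# Hypothetical records for illustration only
rec_a = {"name": "IBM Corp.", "lat": 41.1, "lon": -73.7,
         "employees": 1000, "industry": "technology", "tech": ["aws", "react"]}
rec_b = {"name": "International Business Machines", "lat": 41.1, "lon": -73.7,
         "employees": 1000, "industry": "technology", "tech": ["aws"]}
features = pair_features(rec_a, rec_b)
```

For this pair the fuzzy name similarity is low, but the acronym feature fires, which is the kind of signal an exact-match rule never sees. A downstream model would consume the whole feature dictionary rather than any single value.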
Production deployment required addressing operational challenges at scale: batch processing of complete datasets alongside incremental updates for ongoing changes, comprehensive monitoring and alerting, robust error handling and recovery mechanisms, and cost optimization strategies for cloud resources.
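One common way to support both full batch runs and incremental updates is to fingerprint each record and reprocess only those whose fingerprint has changed. The sketch below shows that idea under stated assumptions: the function names and the in-memory dictionary standing in for a fingerprint store are hypothetical, and the talk does not say that the framework detects changes this way.

```python
import hashlib
import json


def fingerprint(record):
    """Stable SHA-256 hash of a record's content (key order does not matter)."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def select_changed(records, known_fingerprints):
    """Return IDs of new or modified records and the updated fingerprint store.

    records: dict mapping record ID -> record dict
    known_fingerprints: dict mapping record ID -> fingerprint from the last run
    """
    changed = []
    updated = dict(known_fingerprints)  # do not mutate the caller's store
    for record_id, record in records.items():
        fp = fingerprint(record)
        if updated.get(record_id) != fp:
            changed.append(record_id)
            updated[record_id] = fp
    return changed, updated


# First run with an empty store behaves like a full batch
records = {"a1": {"name": "IBM Corp.", "city": "Armonk"},
           "b2": {"name": "Example LLC", "city": "Austin"}}
first_run, store = select_changed(records, {})

# Later run: only the record that actually changed is selected
records["b2"]["city"] = "Dallas"
second_run, store = select_changed(records, store)
```

With an empty store every record is selected, so the same code path serves the initial full load and the ongoing incremental runs.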
The business impact extends across functions: sales teams qualify prospects more efficiently with automated firmographic enrichment, marketing organizations achieve higher campaign response rates through precise segmentation, analytics teams build insights on reliable data foundations, and customer success teams provide better service with complete organizational context.
Key lessons include the critical importance of high-quality training data created through systematic annotation by subject matter experts, the need for domain expertise beyond technical implementation, and the value of starting with well-defined use cases before expanding scope. The framework demonstrates that sophisticated ML techniques combined with cloud-native architecture transform enterprise data enrichment from an intractable challenge into a source of competitive advantage.
Profile:
Nirav Pravinsinh Rana is an accomplished full-stack engineer with over 11 years of experience building high-performance, cloud-native applications. Currently a Software Development Engineer 3 at Amazon Web Services, he specializes in designing scalable solutions for sales and marketing organizations, including sophisticated forecasting systems and machine learning-powered data matching infrastructure that processes millions of records efficiently.
His technical expertise spans the complete technology stack, from large-scale ETL pipelines processing over 300 million records using AWS Glue to high-performance backends with DynamoDB and Elasticsearch. He excels at developing optimized GraphQL APIs and interactive React applications for complex data visualization.
Before AWS, Nirav spent five years at SAP America as an IT Consultant, where he automated over 70% of business scenarios and improved backend performance by 60%. He began his career at Deloitte Consulting, successfully managing ERP implementations across various industries.
Nirav holds an MS in Information Systems from the University of Cincinnati and a BE in Information Technology from Sardar Vallabhbhai Patel Institute of Technology. Known for his collaborative approach and commitment to excellence, he continues driving innovation in cloud-native solutions while championing DevOps best practices and mentoring teams.