How Infinitive Boosted ETL Performance and Cost Efficiency for a Non-Profit by Migrating to AWS with Databricks

Infinitive recently partnered with an educational non-profit to undertake a significant data migration and transformation project. The initiative’s goal was to transition from an on-premises Oracle system to a more scalable AWS environment, focusing on optimizing data pipelines and transforming data in accordance with complex business rules.

Challenges Faced by the Educational Non-Profit

The client encountered several obstacles in their data management processes:
  1. ETL Performance: The newly designed ETL processes on AWS Redshift were not meeting desired SLAs due to the scale of data being scanned.
  2. Cost Efficiency: Keeping the system running on AWS Redshift cost more than anticipated because of the large volume of data being scanned.
  3. Scalability: The infrastructure needed to scale effectively to handle increasing data volumes without further performance degradation.
  4. Codebase Maintenance: The client required a solution that allowed the use of their existing SQL stored procedures without extensive rewrites.

Infinitive’s Strategic Solution with Databricks

To address these challenges, Infinitive leveraged the capabilities of Databricks. A Proof of Concept (PoC) was executed with the following steps:

  1. Databricks Workspace Setup: An isolated Databricks workspace was established to conduct tests without affecting the existing architecture.
  2. Synthetic Data Testing: Synthetic data sets at twice the client’s data volumes were generated in Databricks to test ETL workloads at a realistic scale.
  3. SQL Code Portability: SQL stored procedures from Redshift were transferred to Databricks with the help of Databricks SQL magic for optimal performance.
  4. Benchmarking Performance: ETL execution times and resource usage were compared between Redshift and Databricks to evaluate performance gains.
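
The synthetic-data and benchmarking steps above can be sketched in a few lines of plain Python. This is a minimal illustration only, not the harness Infinitive used: the function names (`make_synthetic_rows`, `median_runtime`, `speedup`), the row layout, and the jitter range are all hypothetical, and in a real PoC the `workload` callable would submit the same ETL job to each platform rather than run in-process.

```python
import random
import statistics
import time


def make_synthetic_rows(sample_rows, scale=2, seed=0):
    """Return scale * len(sample_rows) rows by resampling a sample with jitter.

    Hypothetical helper: each row is assumed to be a dict with an "amount" field.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(scale * len(sample_rows)):
        base = rng.choice(sample_rows)
        # Perturb the numeric field so resampled rows are not exact copies.
        rows.append({"id": len(rows), "amount": base["amount"] * rng.uniform(0.9, 1.1)})
    return rows


def median_runtime(workload, rows, repeats=5):
    """Run workload(rows) several times and return the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload(rows)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

Using the median of several runs rather than a single timing reduces the influence of warm-up effects and one-off outliers, which matters when two platforms are compared on the same workload.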

Comparative Analysis—Databricks vs. Redshift

The comparison revealed several advantages of Databricks:

  • Photon Engine: Databricks’ Photon engine significantly accelerated query performance.
  • SQL Magic: This feature allowed for the seamless transition of Redshift SQL code with minimal adjustments.
  • Resource Utilization: Databricks demonstrated superior cost-effectiveness by requiring fewer resources for the same workload.

Outcomes of the Databricks Implementation

The PoC highlighted remarkable improvements:

  1. ETL Speed: Databricks showcased a marked improvement in ETL task execution, running processes 15 to over 100 times faster than Redshift.
  2. Resource Efficiency: With Databricks, the client achieved the desired outcomes using one-sixth of the compute power, on instances that were 20x more cost-effective than those used for Redshift.
  3. Code Reusability: The ability to reuse existing SQL stored procedures represented significant labor cost savings.
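
To see how runtime, cluster size, and instance price combine into a per-run cost, a simple model is sketched below. It assumes cost scales linearly with runtime, node count, and an hourly rate per node, which is a simplification of real platform pricing. The function names are hypothetical, and any inputs passed to them are placeholders rather than figures from the client's PoC.

```python
def cost_per_run(runtime_hours, node_count, hourly_rate_per_node):
    """Simplified linear cost model: hours x nodes x hourly rate per node."""
    return runtime_hours * node_count * hourly_rate_per_node


def relative_savings(baseline_cost, candidate_cost):
    """Fraction of the baseline cost saved by the candidate (0.25 means 25% cheaper)."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return 1 - candidate_cost / baseline_cost
```

Under this model, a shorter runtime, a smaller cluster, and a cheaper instance type each reduce the per-run cost multiplicatively, so gains on several of these factors compound.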

 

Conclusion: The Infinitive team successfully demonstrated that Databricks is a high-performance, cost-effective, and scalable solution, aligning with the client’s growing data management needs. The PoC confirmed the potential for substantial cost savings over the next five years, validating the strategic decision to migrate to Databricks.

Are you ready to get more value out of your data?