The Databricks Journey: From Idea to Industry Leader

As Databricks continues to lead in data and AI, and as our clients increasingly rely on it, it is natural to wonder how it all started. Did it begin because of a Netflix competition? Perhaps an old Commodore 64? The answer may be both, neither, or somewhere in between.

For those unfamiliar with it, Databricks provides a unified analytics platform designed to simplify working with big data and artificial intelligence. The platform allows businesses to prepare and clean data, perform advanced analytics, and build and deploy machine learning models.

Co-founder and CEO Ali Ghodsi began his coding journey at the age of 6 on a used Commodore 64 gifted by his parents. Remarkably, by the age of 8, he was already coding games on the Commodore to entertain himself. Ali was born in Iran and raised in Sweden, and he eventually moved to the United States to further his academic career at the University of California, Berkeley.

Before diving into Databricks and Ali Ghodsi’s journey, it is important to highlight how the Netflix Prize competition impacted data and AI. From 2006 to 2009, Netflix ran a competition that offered a $1 million prize to any contestant who could predict consumer ratings 10 percent more accurately than Netflix’s proprietary Cinematch algorithm, with accuracy measured by root mean squared error (RMSE). Essentially, it was a competition to better predict how much individuals would enjoy movies based on their past ratings. It played a significant role in highlighting the importance of scalable, efficient algorithms for big data analytics and machine learning. (Fun Fact: The algorithm developed by the winning team was never put into production by Netflix due to the engineering costs.)
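To make the "10 percent more accurately" target concrete, here is a minimal Python sketch of how an improvement in RMSE can be computed. The ratings below are made up for illustration, and the function names `rmse` and `improvement_pct` are our own; this is not Netflix's or any contestant's code.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def improvement_pct(baseline_rmse, candidate_rmse):
    """Percentage reduction in RMSE relative to the baseline."""
    return 100.0 * (baseline_rmse - candidate_rmse) / baseline_rmse


# Made-up ratings on a 1-5 star scale (illustrative assumption, not real data).
actual = [4, 3, 5, 2, 4]
baseline = [3.0, 3.0, 3.0, 3.0, 3.0]   # a naive "always predict 3 stars" model
candidate = [3.5, 3.0, 4.0, 2.5, 3.5]  # a hypothetical improved model

baseline_rmse = rmse(baseline, actual)
candidate_rmse = rmse(candidate, actual)
gain = improvement_pct(baseline_rmse, candidate_rmse)

print(f"baseline RMSE:  {baseline_rmse:.3f}")
print(f"candidate RMSE: {candidate_rmse:.3f}")
print(f"improvement:    {gain:.1f}%")
print("meets a 10% target" if gain >= 10.0 else "falls short of a 10% target")
```

A lower RMSE means predictions sit closer to the ratings people actually gave, so the prize threshold amounted to cutting the baseline's error by a tenth.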

During his time at UC Berkeley, Ali Ghodsi crossed paths with Matei Zaharia, the creator of Apache Spark and current CTO and co-founder of Databricks. The two were both part of AMPLab, and although you will hear different stories about whether they participated in the Netflix Prize competition or not, I choose to go with Matei Zaharia’s story. According to Matei, he engaged in conversations with a colleague, Lester Mackey, who was participating in the Netflix Prize competition. Upon learning that Lester needed distributed systems to compete effectively, Matei started to develop Spark. His goal was to empower individuals like Lester to create distributed machine learning applications.

Spark provided a more efficient and general-purpose cluster-computing framework than existing technologies like MapReduce. Its in-memory computing capabilities made it particularly well-suited for iterative algorithms, which are common in machine learning tasks and which pass over the same data many times. Apache Spark serves as a unified analytics engine for large-scale data processing, providing an interface for programming entire clusters with implicit data parallelism and fault tolerance. With that, the foundation was laid for the creation of Databricks.
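The benefit of in-memory computing for iterative algorithms can be illustrated with a toy example. The sketch below is plain Python and does not use the Spark API; `CountingSource` and `fit_mean` are hypothetical names invented here. It simply counts how often a dataset is read from "storage" when an iterative algorithm reloads it on every pass versus loading it once and keeping it in memory.

```python
class CountingSource:
    """Stand-in for slow storage that counts how often the data is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def fit_mean(get_data, steps=50, learning_rate=0.1):
    """Gradient descent on squared error; the estimate moves toward the mean."""
    estimate = 0.0
    for _ in range(steps):
        data = get_data()  # each iteration needs the full dataset
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= learning_rate * gradient
    return estimate


values = [2.0, 4.0, 6.0, 8.0]

# Approach 1: go back to storage on every iteration.
reload_source = CountingSource(values)
reload_result = fit_mean(reload_source.load)

# Approach 2: load once, keep the data in memory, and reuse it.
cached_source = CountingSource(values)
cached_data = cached_source.load()
cached_result = fit_mean(lambda: cached_data)

print(f"reloading each pass: {reload_source.reads} reads, estimate {reload_result:.3f}")
print(f"cached in memory:    {cached_source.reads} read, estimate {cached_result:.3f}")
```

Both approaches reach the same answer, but the cached version touches storage once instead of once per iteration. On a real cluster, where each read may involve disk and network access, that difference is what makes keeping working data in memory attractive for machine learning workloads.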

Databricks was founded in 2013 and enhances Apache Spark’s core functionality with a cloud-based platform. The platform features a collaborative workspace, integrated workflows, and a suite of advanced tools for data processing and analytics. These offerings simplify deploying Spark at scale, managing clusters, and collaborating on projects. Ultimately, by providing a managed Spark service, Databricks allows users to focus on their data and analytics without worrying about the underlying infrastructure.

Infinitive leverages Databricks for its comprehensive support of the Data Lakehouse concept, which brings together transactional support, schema enforcement, governance, and BI support. The decoupling of storage from compute, along with the adoption of an open storage format, ensures flexibility and efficiency. Databricks also shines with its support for diverse data types and workloads, and its end-to-end streaming capabilities cater to real-time needs. Additionally, Unity Catalog adds a layer of unified governance, empowering organizations to manage data, AI models, notebooks, dashboards, and files across clouds and platforms within the Databricks Data Intelligence Platform.


Author: Brandon Peretin