Everything you wanted to know about Big Data processing (but were too afraid to ask)

Let me start with a simple observation: You Probably Don't Have Big Data. Modern data processing tools, including our very own MATLAB and Simulink, are powerful enough to handle hefty workloads. Today's computers have evolved to be massively powerful and can scale (sometimes trivially) to meet most data processing needs. Simultaneously, mature cloud technologies bring with them the promise of practically unlimited compute, storage, and networking capabilities.

This evolution has led to an explosion of "Big Data": datasets are now measured in petabytes or more, and event streams are ingested at rates of hundreds of billions of data points a day.

As far back as 2015, LinkedIn's Kafka deployment processed over 175 terabytes (more than 800 billion messages) daily. At peak load, that translated to 2.75 gigabytes of data per second, balanced across thousands of brokers on more than 60 clusters.

Impressive as that may sound, in our line of work, i.e., accelerating the pace of discovery in engineering, we run across systems that will clock up that kind of traffic in a hurry. The optical slip ring on a typical CAT scanner achieves speeds in excess of 10 Gbps, and some of these machines record tremendous amounts of data in a matter of minutes. From a different domain, LIDAR point clouds acquired across large fleets of vehicles, or the sampling rates of high-speed acquisition systems, pose significant Big Data challenges of their own. It would be fair to say that engineering data presents the same acute challenges and has been doing so for quite a while.

The happy marriage of the techniques developed for handling Google-sized workloads with the engineering data problems of today is the rationale for this blog post. To fully understand why we are where we are, let us turn the clock back and start at the beginning.

In the beginning ...

I could start with monoids, the Bird-Meertens formalism, the parallelization contract, StarLisp's *map and reduce!! implementations, and other esoteric topics when tracing the roots of modern Big Data processing. These ideas date back to the 1980s, and sometimes even earlier. The debate over who did it first is a religious topic and often contested.
[Image: a pyramid. Caption: "Nice! Can you move it 50 ft that way? I'm sure our Data and ETL engineers will love it."]
For this post, however, I am going to adopt a more pragmatic view of modern Big Data processing and its roots. So, please sit back and enjoy the story, as re-told from my perspective. And no, in case you were curious, I am not going to cover the four V's of Big Data (Volume, Velocity, Variety, and Veracity); that topic has been beaten to death elsewhere on the internet.

In the not-so-distant past ...

In the early 2000s, when I joined MathWorks, the tech industry was an exciting place to be. It still is. At the time, my work with leading automotive companies was where I touched my first petabyte-scale database: an ASAM ODS data fusion server containing test data for post-processing and evaluation. All of this was set against the backdrop of early internet giants on the west coast jockeying for search dominance.

At the time, Yahoo offered a hugely popular free email service, and within a few years Google introduced the now ubiquitous Gmail service. With that offering, the internet search giant gave the internet-connected public a treat. Rather than the 4-10 MB capacities commonly available for free, Gmail offered a staggeringly large 1 GB mailbox at launch, for free. Some people actually thought it was a hoax, while many engineers (including myself) had only one thought... how are they doing it?

2002 - Apache Nutch

The Apache Nutch project was in the process of building a search engine system that could index 1 billion pages. After a lot of research, two of its key researchers (Doug Cutting and Mike Cafarella) concluded that such a system would cost around half a million dollars in hardware, along with a monthly running cost of approximately $30,000, which was both extremely expensive and infeasible. This becomes relevant to our story shortly.

2003 - The Google File System

The “how” of the Google approach became clear with the publication of the whitepaper detailing the Google File System (GFS) by Sanjay Ghemawat et al. From a high level, the system looked like this:
Their design described how they achieved consistency, reliability, and high availability of a storage medium distributed across clusters. What is available as modern cloud storage today, e.g., Amazon's S3, Azure Blob Storage, Google Cloud Storage, the Hadoop Distributed File System (HDFS), etc., consists of implementations of similar concepts. The real implication of the work was that the cost of storage plummeted. Esoteric and expensive enterprise storage systems had competition in the form of systems running at massive scale on cheap, commodity hardware (mentally, think of consumer-grade computers).

2004 - MapReduce: Simplified Data Processing on Large Clusters

The Google system scanned all of the data in Gmail, and on the heels of the “how we store stuff” paper came a key whitepaper describing “how we analyze stuff”. Their paper, MapReduce: Simplified Data Processing on Large Clusters by Jeff Dean and Sanjay Ghemawat, shed light on how those massive amounts of data were being processed. While one can argue that the concepts behind map-reduce pre-date the whitepaper, it is usually accepted that this represented the first accessible implementation for practitioners: a functional programming API designed for ease of use. In operation, this looked like:
It became clear that data, once split or chunked, could be replicated, stored, and distributed across a cluster. The data could then be analyzed in a distributed way without moving it, and the results of the analysis could be mapped and reduced before being returned to the user. In other words, when the data is too big to move, you move the program to the data.

This approach was disruptive to the way algorithms are built. As the simplest example, if I ran a mean() operation over every chunk or slice of data distributed across a cluster and simply averaged the per-chunk results, I would not, in general, get the right answer.

To correctly compute a mean, I would need to return, for each chunk, the sum of the samples and the number of samples, i.e., a tuple. This step would be the map() operation. Subsequently, I could reduce() these tuples by adding up all the partial sums and then dividing the result by the total number of samples across the chunks.

In short, the structure and staging of my math operation changes at scale.
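
By way of illustration, here is a minimal sketch of that staged computation using MATLAB's mapreduce over a datastore. The file airlinesmall.csv, the variable ArrDelay, and the helper names meanMapper/meanReducer are just stand-ins for whatever chunked data and functions you happen to have; the pattern is what matters.

    % Point a datastore at the (potentially huge) data; it is read in chunks.
    ds = tabularTextDatastore('airlinesmall.csv', ...
        'SelectedVariableNames', 'ArrDelay', 'TreatAsMissing', 'NA');

    result = mapreduce(ds, @meanMapper, @meanReducer);
    readall(result)   % a single key-value pair holding the overall mean

    function meanMapper(data, ~, intermKVStore)
        % map(): emit a [partial sum, sample count] tuple for this chunk
        x = data.ArrDelay(~isnan(data.ArrDelay));
        add(intermKVStore, 'SumAndCount', [sum(x), numel(x)]);
    end

    function meanReducer(~, intermValIter, outKVStore)
        % reduce(): add up the partial sums and counts, then divide
        total = 0;
        count = 0;
        while hasnext(intermValIter)
            v = getnext(intermValIter);
            total = total + v(1);
            count = count + v(2);
        end
        add(outKVStore, 'Mean', total / count);
    end

The same mapper/reducer pair can run unchanged against a local datastore or, with the appropriate parallel products, against data sitting on a cluster; only the execution environment configured through mapreducer changes.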

2006 - Apache community support and adoption by Yahoo

The ideas in the whitepapers above were used by the engineers and researchers working on Nutch, who spun the work out into an independent subproject they called Hadoop. Doug Cutting joined Yahoo, and by 2007 Yahoo reported running Hadoop on a 1,000-node cluster. By 2008, Hadoop had become a top-level Apache open-source project, and at around the same time Google reported that its MapReduce implementation had sorted 1 TB in 68 seconds. By 2009, the team at Yahoo broke that record by doing it in 62 seconds.

The size of the clusters grew, and the cost of storing data plummeted. The Big Data movement grew nearly exponentially and continues to grow to this day, fueled by advances in AI, IoT systems, and the cloud.

2009 - Spark

MapReduce was not without its challenges. It forced a particular linear dataflow on distributed programs, and much of the map and reduce activity relied on disk storage. At the University of California, Berkeley, Matei Zaharia initially developed Spark as a fault-tolerant, in-memory implementation with architectural roots in the resilient distributed dataset (RDD). The management of workflows as directed acyclic graphs (DAGs), along with a robust API for transformations and operations on RDDs, resulted in reduced latencies, sometimes orders of magnitude lower than comparable MapReduce implementations. The code was open sourced in 2010 and later donated to the Apache Software Foundation.

2013 - Databricks

The company, founded by Ali Ghodsi, Andy Konwinski, Arsalan Tavakoli-Shiraji, Ion Stoica, Matei Zaharia, Patrick Wendell, and Reynold Xin, grew out of the AMPLab at Berkeley and offered a managed, cloud-native data platform based on Spark.

2014 - Spark becomes a top-level Apache project

In 2014, Spark became a top-level Apache project. The core contributors to the project at Databricks set a new record by sorting 100 TB of data (1 trillion records) in 23 minutes, breaking the previous record set by Hadoop MapReduce. Their system ran 3x faster with 10x fewer machines.

That was nearly 10 years ago. Since then, there have been steady advances in the performance of these libraries. A stream of innovations, such as Delta Lake (launched in 2020 to boost query performance), MLflow (for model lifecycle management), and Photon (a vectorized C++ engine that maintains a Spark-compatible API), continues to evolve the capabilities of these Big Data systems.

2014 - Present

Building on the foundation of Spark, innovation continues with innumerable optimizations in workflow, security, and performance. The ecosystem of Big Data tools, features, and vendor plugins matures with a focus on interoperability and access from a wide array of clients. Governance features such as Unity Catalog appear as the industry converges around standards. The unification of the underlying storage formats brings the large top-level projects closer together in terms of performance and feature parity.

How it all works

Enough of the history lesson; let us take a high-level peek at how these systems work. I can try to summarize it in layperson's terms with a few core principles.
  1. Move the compute to the data - if the data is too big to move, it is easier to move the program to where the data lives.
  2. Speed is achieved by being smart about what work to do - When optimized down to the storage format, many of these systems are fast because they avoid doing work they do not need to do. For example, does a program really need to read every row of a dataset looking for a maximum if the metadata for a chunk of rows already tells you that the answer cannot be inside that chunk?
  3. Keep it in memory when possible - The RDD, and subsequently the Dataset and DataFrame APIs, provide higher-level abstractions for performing distributed computing tasks efficiently, largely in memory.
  4. Try not to do the work until absolutely necessary - The DAG and lazy evaluation make it possible to optimize an entire chain of operations before any work is actually done (the sketch below shows what this looks like in MATLAB).
There is a story behind each of these principles. I could describe what you can do with, say, a Parquet or ORC file that cannot be done with HDF or other binary files, but that is a topic best left to a different blog post.
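
To tie principle 4 back to MATLAB, tall arrays are a convenient way to see lazy evaluation in action: operations on a tall array merely build up a plan of work (much like a DAG), and nothing is read or computed until gather is called, at which point MATLAB evaluates the requested results in as few passes over the data as it can. As before, airlinesmall.csv and its ArrDelay/DepDelay variables are just stand-ins for your own data.

    % A sketch of deferred (lazy) evaluation with tall arrays.
    ds = tabularTextDatastore('airlinesmall.csv', ...
        'SelectedVariableNames', {'ArrDelay', 'DepDelay'}, 'TreatAsMissing', 'NA');
    tt = tall(ds);                       % an unevaluated, chunk-backed tall table

    % These lines only record the operations to perform; no data is read yet.
    delayGap   = tt.ArrDelay - tt.DepDelay;
    meanGap    = mean(delayGap, 'omitnan');
    worstDelay = max(tt.ArrDelay);

    % gather() triggers the deferred work, evaluating both results together.
    [meanGapValue, worstDelayValue] = gather(meanGap, worstDelay);

Because evaluation is deferred until gather, the underlying engine gets to see the whole chain of operations before it touches the data, which is precisely the kind of optimization principle 4 describes.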

The hidden requirements

As one starts doing real work with these systems, a whole slew of other technical and non-technical challenges emerges.
  1. How does one enable access to the data securely and effectively?
  2. How are the insights from the analytics shared with the rest of the organization so that they lead to actionable information?
  3. What is the lineage of the data (where it came from and how has it been processed and enriched)?
  4. What is the provenance of the analytics models (which version, what data was used to train it, how does it perform, etc.)?
The actual data processing is a subset of a larger picture in the context of data and model management. In this space, there is a need for security and governance, and often these requirements can be traced back to a maturity model. For example, are all users centralizing the tracking of their experiments so that they can compare results effectively? All of this leads to a desire for a holistic, unified approach to data analytics.

Hold on, why are you discussing this?

I wanted to address how a typical user of MATLAB and Simulink can leverage these developments when handling Big Data problems. With the proliferation of the cloud, the explosion in data sizes, and the availability and maturity of data platforms like Databricks, these advancements enable our users to tackle bigger problems, faster. My desire is to let our users combine the domain-specific expertise that comes with the wide array of our toolboxes with approaches that scale.

The Ah-ha! moment for me was realizing that the same tools used to tackle web indexing problems at scale can be used to tackle some of the gnarlier engineering problems. After all, data is just data. Sometimes even the math is similar, and, well, as we all know, the Math... Works.

In an upcoming series of posts, we will hear from Michael Browne, Anders Sollander, and Andy Thé on the topic of processing Big Data. The next post will dive into the details of the many workflows enabled by integrations with our products.

P.S. The view of "why we are where we are" in this post is my perspective. Perhaps a bit romanticized, but the narrative should be mostly accurate. How did I do? Please do consider leaving a comment below, as hearing from you, the reader, usually makes my day.