A blog about SQL and Big Data: Big data Lambda Architecture

What is Lambda λ Architecture?

The lambda architecture, first proposed by Nathan Marz, is a framework for processing big data that aims for fault tolerance, speed and efficiency. It achieves this through three layers:

Batch Layer
This layer aims for perfect accuracy by processing all available data when generating views. It can fix any errors by recomputing over the complete data set and then updating the existing views.
Speed Layer
This layer processes data streams in real time. Its views may not be as accurate or complete as the ones eventually produced by the batch layer, but they are available almost immediately after data is received.
Serving Layer
Output from the batch and speed layers is stored in the serving layer, which responds to ad-hoc queries by returning precomputed views or building views from the processed data.
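
To make the division of labour concrete, here is a minimal sketch of the three layers in plain Python. The page-view events, the function names and the in-memory lists are all assumptions made for illustration; a real deployment would use a distributed store and a stream processor rather than anything this small.

```python
from collections import Counter

# Hypothetical page-view events as (timestamp, url) pairs.
master_dataset = [(1, "/home"), (2, "/about"), (3, "/home")]  # already seen by the batch layer
recent_events = [(4, "/home"), (5, "/pricing")]               # arrived after the last batch run


def batch_view(events):
    """Batch layer: recompute the view from scratch over all stored data."""
    return Counter(url for _, url in events)


def speed_view(events):
    """Speed layer: count only the events the batch layer has not processed yet."""
    view = Counter()
    for _, url in events:
        view[url] += 1  # incremental update per incoming event
    return view


def query(url, batch, realtime):
    """Serving layer: answer a query by combining the batch and real-time views."""
    return batch[url] + realtime[url]


batch = batch_view(master_dataset)
realtime = speed_view(recent_events)
home_views = query("/home", batch, realtime)
```

The query for "/home" combines two views counted by the batch layer with one counted by the speed layer, so the caller sees an up-to-date total without waiting for the next batch run.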

The Lambda Architecture (sourced from http://lambda-architecture.net/)

Key points of λ architecture

1. Simultaneous processing by batch and speed layers

The batch layer always recomputes the entire data set, whilst the speed layer processes only the latest data. The speed layer's real-time views are intended to be transient: as soon as the data has propagated through the batch and serving layers, the corresponding results in the real-time views can be discarded.


Batch vs Speed layer (sourced from http://bit.ly/18zXJEa)
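
The hand-off described in point 1 can be sketched roughly as follows. The RealtimeView class, the run_batch function and the timestamp "watermark" are hypothetical names of my own, not part of any particular framework; the only point is that real-time results are dropped once a batch run has covered the same data.

```python
from collections import Counter


class RealtimeView:
    """Speed-layer view that keeps timestamps so covered entries can be expired."""

    def __init__(self):
        self.events = []  # (timestamp, key) pairs not yet covered by a batch view

    def add(self, timestamp, key):
        self.events.append((timestamp, key))

    def discard_up_to(self, batch_watermark):
        """Drop everything the batch layer has now absorbed."""
        self.events = [(t, k) for t, k in self.events if t > batch_watermark]

    def counts(self):
        return Counter(k for _, k in self.events)


def run_batch(all_events):
    """Recompute the batch view over the full data set; return it with its watermark."""
    view = Counter(k for _, k in all_events)
    watermark = max((t for t, _ in all_events), default=0)
    return view, watermark


log = [(1, "a"), (2, "b")]
realtime = RealtimeView()
batch, watermark = run_batch(log)           # batch view covers timestamps 1 and 2

for event in [(3, "a"), (4, "a")]:          # new events arrive between batch runs
    log.append(event)
    realtime.add(*event)

total_before = batch["a"] + realtime.counts()["a"]

batch, watermark = run_batch(log)           # the next batch run covers timestamps 1 to 4
realtime.discard_up_to(watermark)           # the real-time results are now redundant
total_after = batch["a"] + realtime.counts()["a"]
```

The total for key "a" is the same before and after the second batch run; what changes is which layer is responsible for it.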

2. Store unchanged data for batch reprocessing

The lambda architecture proposes storing the original data unchanged, so that as your code evolves and you need new output from the same input, you can regenerate it. Jay Kreps (@jaykreps) put it eloquently in the article Questioning the Lambda Architecture:

"Reprocessing is one of the key challenges of stream processing but is very often ignored....This is a completely obvious but often ignored requirement. Code will always change. So, if you have code that derives output data from an input stream, whenever the code changes, you will need to recompute your output to see the effect of the change.
Why does code change? It might change because your application evolves and you want to compute new output fields that you didn’t previously need. Or it might change because you found a bug and need to fix it. Regardless, when it does, you need to regenerate your output. I have found that many people who attempt to build real-time data processing systems don’t put much thought into this problem and end-up with a system that simply cannot evolve quickly because it has no convenient way to handle reprocessing. The Lambda Architecture deserves a lot of credit for highlighting this problem."
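
As a small illustration of why keeping the raw input matters, the sketch below derives a view from stored events, then regenerates it after the derivation code changes. The event fields and function names are invented for the example.

```python
# Raw events are stored exactly as received and never modified (hypothetical data).
raw_events = [
    {"user": "ann", "amount": 30},
    {"user": "bob", "amount": 20},
    {"user": "ann", "amount": 50},
]


def total_per_user(events):
    """Version 1 of the derivation: total spend per user."""
    out = {}
    for e in events:
        out[e["user"]] = out.get(e["user"], 0) + e["amount"]
    return out


def total_and_count_per_user(events):
    """Version 2: the application now also needs an order count per user."""
    out = {}
    for e in events:
        row = out.setdefault(e["user"], {"total": 0, "orders": 0})
        row["total"] += e["amount"]
        row["orders"] += 1
    return out


view_v1 = total_per_user(raw_events)
# The code changed, so regenerate the output from the same untouched input.
view_v2 = total_and_count_per_user(raw_events)
```

Had only view_v1 been kept, the order count could not be recovered from it; because the raw events are still there, the new field comes from simply rerunning the new code.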

Implementing Lambda architecture

Open Source

Batch layer (Apache Hadoop)
Serving layer (Cloudera Impala)
Speed layer (Apache Storm, Apache HBase)

Microsoft/Open source

I've looked high and low for case studies and found very few references to implementing the lambda architecture on the Microsoft stack. Here's something I found on the Microsoft Virtual Academy site that lays out the Microsoft technology stack by lambda layer. The linked PowerPoint presentations also include a few case studies of companies that have implemented a big data solution under the lambda architecture using Microsoft and open source tools.