Big Data Frameworks for Data Science

By | February 10, 2021
Big Data Frameworks

‘Big’ data often refers to huge data in volume and keeps growing by the day. For example, the number of Facebook users keeps growing every day, and each user’s data also grows as they browse through Facebook. Such data can be structured as well as unstructured. The data is big in size and big in complexity and speed, i.e., fast and complex. Big data is thus identified by the 3V’s, i.e., Volume, Variety, and Velocity.

Big data helps us analyze data and perform various operations on it to optimize cost and time. When we use this big data with robust frameworks, it becomes easier to find the exact problem or issue in real-time, make dynamic offers to users, detecting real-time fraud, etc. Frameworks have the following advantages:

  • Provides a structure and common reference to organizations to explore the full potential of the big data
  • Improves the approach to collecting, storing, managing, and sharing data; use data effectively for business purposes.
  • Perform advanced analytics to get better insights and make intelligent data-driven decisions.
  • Taps data from various sources and capture different types of data to find the best insights
  • Faster and affordable, can reuse common pieces of code, and has great community support.
  • Facilitates advanced analytics through visualization, predictive analysis

What does the Big Data Framework include – the big data framework structure

The big data framework consists of 6 important elements:

  1. Big data strategy: The strategy includes analyzing the most likely areas to return more business value and improve sales. Out of the big chunk of data, only the relevant data can be picked up if the strategy is clearly defined, making the analysis easier.
  2. Big data architecture: There are many architectures to store and process huge datasets. The architecture to be chosen depends on the project and business needs. It includes considering the technical capabilities of a framework for the storage and processing of big data.
  3. Big Data Functions: Functions involve assigning roles and responsibilities to an organization’s resources so that the best result can be obtained. It covers the non-technical aspects of Big Data.
  4. Big Data Algorithms: Working with statistics and algorithms forms the core of big data analysis, processing, and automation of tasks. Algorithms cover the technical aspects of handling big data, like getting insights and knowledge from data.
  5. Big Data Processes: Following processes give a structure to the project and makes it easy to track goals on a day to day basis. The process helps an organization focus on the business while following the best practices and measures.
  6. (Surprisingly) AI or Artificial Intelligence: Since AI can learn from Big Data, it is the next logical step of a big data framework. The framework brings business benefits by taking a functional view of AI.

Top 10 Big Data Frameworks

There are many big data frameworks available in the market, out of which the below are most popular and yield the best results for your business. These are among the top big data frameworks of 2020. It all started with Apache Hadoop, which revolutionized how big data is stored and processed. Hadoop remains to be popular amongst projects. Let us see some popular big data frameworks:

1. Hadoop

Hadoop is a Java-based open-source framework that provides batch processing and data storage services. It has a big architecture consisting of many layers, like HDFS and YARN, for data processing. Storage happens across various hardware machines arranged as clusters. Hadoop provides a distributed environment with the following main components:

  • HDFS (Hadoop Distributed File System), the hardware layer, stores data in the Hadoop cluster, including replication and storage activities across all the data clusters.
  • YARN (Yet Another Resource Negotiator) is responsible for job scheduling and resource management,
  • MapReduce is the software layer that works as the batch processing engine and processes huge data in a cluster.

Hadoop is fast and can store petabytes of data. The performance is better as the data storage space increases. HDFS is used in many companies like Alibaba, Amazon, eBay, Facebook, etc., to store data and be integrated with many popular big data analytics frameworks, making it a popular choice.

Pros: Cost-effective, reliable, compatible with most popular big data technologies, high scalability, multiple language support, fault tolerance, and good failure handling mechanism.

Cons: not suitable for real-time processing, has many processing overheads as it does not perform in-memory computations, and not very secure.

2. Apache Spark

Spark is a batch processing framework with enhanced data streaming processing. It facilitates in-memory computations making it superfast. Spark integrates with Hadoop and can act as a stand-alone cluster tool. It is used by companies like Amazon, Hitachi solutions, Baidu, Nokia, etc. Spark supports four languages: Python, R, Java, and Scala and has 5 main components:

  • HDFS and HBase, which form the first layer of storage systems,
  • YARN manages the resources
  • Core engine that performs task management, memory management and defines RDD (Resilient Distributed Datasets) API, responsible for distributing data across the nodes for parallel processing.
  • Utilities containing Spark SQL to execute SQL queries for stream processing, GraphX to process graph data, and MLLib for machine learning algorithms
  • API for integration with programming languages like Java, Python, etc.

Pros: Extremely fast parallel processing, scalable, fault tolerance, integration with Hadoop, support for advanced analytics and AI implementations, a smaller number of I/O operations to disk

Cons: set up and implementation takes time and is a bit complex, supports only a few languages

Learn Data Science with Spark.

3. MapReduce

MapReduce is a big data search engine and part of the Hadoop framework. When it was developed, it was just an algorithm to process huge volumes of data parallelly. Now, it is more than just that and works in three stages:

  • Map: This stage handles the pre-processing and filtration of data
  • Shuffle: Shuffles (sorts) the data as per the output key, generated by the map function.
  • Reduce: reduces the data based on the function set by the user and produces the final output.

Although many new technologies have come, MapReduce is popular and much used because it is resilient, stable, fast, scalable, and based on a simple model. Further, it is secure and fault-tolerant for failures like crashes, omissions, etc.

Pros: Handles data-intensive applications well, simple to learn, flexible, best suited for batch processing

Cons: Requires a large amount of memory, requires a pipeline of multiple jobs, real-time processing is not possible,

4. Apache Hive

Apache Hive

Facebook designed the Apache Hive as an ETL and data warehousing tool. It is built on top of the HDFS platform of the Hadoop ecosystem. It consists of three components, namely clients, services and storage, and computing. Hive has its own declarative language for querying, namely HiveQL, which is highly suitable for data-intensive jobs. Companies like JP Morgan, Facebook, Accenture, PayPal use Hive. Hive engine converts queries and requests into MapReduce task chains using the following:

  • Parser: takes in the SQL request and parses and sorts them
  • Optimizer: optimizes the sorted requests
  • Executor: sends the optimized tasks to the MapReduce framework

Pros: Runs queries very fast, even joins can be written and run quickly and easily, multiple users can query data (using HiveQL), easy to learn

Cons: Data has to be structured for processing, not suitable for processing online transactions (OLTP) but suitable only for online analytical processing (OLAP), HiveQL doesn’t support updates and deletes

5. Flink

FlinkFlink is an open-source single-stream processing engine and is based on the Kappa architecture. It has one processor which treats the input as a stream, and the streaming engine in real-time processes the data. Batch processing is a special case of streaming. Flink architecture has the following components:

  • Client: Takes the program, builds a job dataflow graph, and passes it to the job manager. The client is also responsible for retrieving job results.
  • Job Manager: Creates the execution graph based on the dataflow graph received from the client. Assigns and supervises the jobs to task managers in the cluster.
  • Task manager: Executes tasks assigned by the JobManager. Multiple Task managers perform their specified tasks parallelly.
  • Program: The code that is run on the Flink cluster

Flink APIs are available for Java, Python, Scala. It also provides utility methods for common operations, event processing, machine learning, and graph processing. Flink processes data at the blink of an eye. It is highly scalable and scales thousands of nodes of a cluster.

Pros: High-speed processing, easy to learn and use APIs, better testing capabilities, unified programming, work on file systems other than HDFS.

Cons: APIs are still raw and can be enhanced, memory management can be an issue for longer running pipelines, limited fault tolerance compared to competitors

6. Samza

SamzaThrough Samza, you can build stateful applications that can process real-time data from various sources. It was built to solve the batch processing latency (large turn-around time) problem. Some of the input sources are Kafka, HDFS, Kinesis, Eventhubs, etc. The unique feature of Samza is that it is horizontally scalable. It also has rich APIs like Streams DSL, Samza SQL, or Apache Beam APIs. You can process both batch and streaming data using the same code (write once, run anywhere!). LinkedIn created and used Samza architecture consisting of the following components:

  • Streaming layer: provides partitioned streams that are durable and can be replicated
  • Execution layer: schedules and coordinates tasks across machines
  • Processing layer: processes and applies transformations to the input stream

The streaming layer (Kafka, Hadoop, etc.) and the execution layer (YARN, Mesos) are pluggable components.

Pros: Makes full use of the Kafka architecture for fault tolerance, state storage, and buffering, more reliable as there is better isolation between tasks (as Samza uses separate JVM for each stream processor)

Cons: supports only JVM languages, use of separate JVM can be a memory overhead, doesn’t support low latency, depends on Hadoop cluster for resource negotiation

7. Storm

Apache Storm

Storm works with a huge real-time data flow. The storm was built to handle low latency and is highly scalable. It can recover faster after a downtime. The storm was Twitter’s first big data framework, after which it has also been adopted by giants like Yahoo, Yelp, Alibaba, and more. Storm supports Java, Python, Ruby, and Fancy. Storm architecture is based on master-slave and consists of 2 nodes: Master (Nimbus) and worker.

  • Master node: allocates tasks and monitors machine/cluster failures.
  • Worker node: Also called supervisor nodes, worker nodes are responsible for task completion.

The storm is platform-independent and fault-tolerant. Although it is said to be stateless, Storm does store its state using Apache ZooKeeper. It has an advanced topology, namely, Trident topology, that maintains state.

Pros: Open-source, flexible, performance is always high as resources can be linearly added under high load (scalable), high-speed real-time stream processing, guaranteed data processing

Cons: Complex implementation, debugging is not easy, not so easy to learn

8. Impala

In C++ and Java, Impala is an open-source massive parallel processing query engine that processes enormous volumes of data in a single Hadoop cluster. Just like Hive has its own query language, Impala has one too! It has low latency and high performance. It gives a near RDBMS experience in terms of performance and usability. Impala is like the best of both worlds: the performance and support of SQL like query language and Hadoop’s flexibility and scalability. Impala is based on daemon processes that monitor query execution, making it faster (than Hive). Impala supports in-memory data processing. Impala is decoupled from its storage engine and has three components:

  • Impala daemon (impalad): Runs on all the nodes where Impala is installed. Once a query is received and accepted, Impalad reads and writes it to data files and distributes the queries to the nodes in that cluster. The results are then received by the coordinating node that initially took the query.
  • Impala statestore: Checks the health of each Impala daemon and updates the same to the other daemons.
  • Impala metastore & metadata: Metastore is a centralized database where table and column definitions and information are stored. Impala nodes cache metadata locally so that they can be retrieved faster.

Pros: If you know SQL, you can quickly work with Impala, uses the Parquet file format, which is optimized for large-scale queries like in a real-time use case, uses EDA and data discovery to make data loading and reorganizing faster, no data movement as processing occurs where data resides

Cons: no support for indexing, triggers, and transactions, tables have to be always refreshed whenever new data is added to HDFS, Impala can only read text files and not custom binary files

9. Presto

prestodbPresto is an open-source distributed SQL tool suited for smaller datasets (Tb). IT provides fast analytics and supports non-relational sources like HDFS, Cassandra, MongoDB, etc., and relational sources like MSSQL, Redshift, MySQL, etc. Presto has a memory-based architecture where query execution runs in parallel, and results are often obtained in seconds. Facebook, Airbnb, Netflix, and Nasdaq, and many more giants use Presto as their query engine. Presto runs on Hadoop and uses a similar architecture to that of Massively Parallel Processing, having,

  • Coordinator nodes: users submit their queries to this node, which then uses a custom query and the engine to distribute and schedule queries across the worker nodes
  • Worker nodes: Execute the assigned queries parallel, thus saving time

Pros: User-friendly, query execution time is extremely low, minimal query degradation when there is high workload, easy to add images and links

Cons: Reliability issues in terms of results

10. Hbase

Hbase can store humongous amounts of data and process and access it randomly. It is built on top of the Hadoop file system and is linearly scalable, and uses distributed column-oriented database architecture. It also provides data replication across clusters and automatic fail-over support. Hbase also has a Java API for clients.

Tables are split into regions taken care of by the region servers. Regions are further vertically split into stores. Stores are saved as files in the Hadoop file system. There are three main components in Hbase:

  • MasterServer: maintains the state of the cluster, handles load balancing, responsible for the creation of tables and columns
  • RegionServer: handles data-related operations, determines the size of each region, and handles read and write requests for all the regions under it
  • Client library: Provides methods for the client for communication.

Pros: Uses hash tables internally to provide random access, and stores data in indexed files thus enabling faster lookup, no fixed schema thus flexible, auto-sharding, provides row-level atomicity, can be integrated with Hive (if one has to work with SQL)

Cons: No support for transactions, if the master fails, the cluster goes down (single failure point), no built-in permissions or authentications, joins and normalization process are difficult, more hardware requirements making it a bit costly.


There are many other frameworks that we have not covered in this article but need a mention: Heron, Kudu, Open Refine, Kaggle, Cloudera, Pentaho. Each framework has been developed with some unique features and purpose, and we cannot say that one big data framework fits all the projects. Each project has different requirements and hence needs the framework best suited for that particular project. For example, if your project needs batch-processing, Spark is a great choice. For data-intensive jobs, Hive is much suitable and is easier to learn too. For real-time streaming, Storm or Flink are both great choices.

People are also reading:

Leave a Reply

Your email address will not be published. Required fields are marked *