What is Apache Spark? The big data platform that crushed Hadoop (2024)

Fast, flexible, and developer-friendly, Apache Spark is the leading platform for large-scale SQL, batch processing, stream processing, and machine learning.

By Ian Pointer

InfoWorld

Table of Contents
  • Apache Spark defined
  • What is Spark in big data
  • Spark RDD
  • Spark SQL
  • Spark MLlib and MLflow
  • Structured Streaming
  • Delta Lake
  • Pandas API on Spark
  • Running Apache Spark
  • Databricks Lakehouse Platform


Apache Spark defined

Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. These two qualities are key to the worlds of big data and machine learning, which require the marshalling of massive computing power to crunch through large data stores. Spark also takes some of the programming burdens of these tasks off the shoulders of developers with an easy-to-use API that abstracts away much of the grunt work of distributed computing and big data processing.

What is Spark in big data

When people talk about “big data,” they generally refer to the rapid growth of data of all kinds—structured data in database tables, unstructured data in business documents and emails, and semi-structured data in system log files and web pages. Whereas analytics in years past focused on structured data and revolved around the data warehouse, analytics today gleans insights from all kinds of data, and revolves around the data lake. Apache Spark was purpose-built for this new paradigm.

From its humble beginnings in the AMPLab at U.C. Berkeley in 2009, Apache Spark has become one of the key big data distributed processing frameworks in the world. Spark can be deployed in a variety of ways, provides native bindings for the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing. You’ll find it used by banks, telecommunications companies, games companies, governments, and all of the major tech giants such as Apple, IBM, Meta, and Microsoft.

Spark RDD

At the heart of Apache Spark is the concept of the Resilient Distributed Dataset (RDD), a programming abstraction that represents an immutable collection of objects that can be split across a computing cluster. Operations on the RDDs can also be split across the cluster and executed in a parallel batch process, leading to fast and scalable parallel processing. Apache Spark turns the user’s data processing commands into a Directed Acyclic Graph, or DAG. The DAG is Apache Spark’s scheduling layer; it determines what tasks are executed on what nodes and in what sequence.

RDDs can be created from simple text files, SQL databases, NoSQL stores (such as Cassandra and MongoDB), Amazon S3 buckets, and much more besides. Much of the Spark Core API is built on this RDD concept, enabling traditional map and reduce functionality, but also providing built-in support for joining data sets, filtering, sampling, and aggregation.
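For illustration, here is a minimal PySpark sketch of that RDD workflow; the numbers and lambda functions are arbitrary, and the transformations stay lazy until the final action triggers parallel execution across the cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-example").getOrCreate()
sc = spark.sparkContext

# Build an RDD from an in-memory collection; map and filter are lazy transformations
nums = sc.parallelize(range(1, 101))
even_squares = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# The reduce action triggers execution and returns a result to the driver
total = even_squares.reduce(lambda a, b: a + b)
print(total)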

Spark runs in a distributed fashion by combining a driver core process that splits a Spark application into tasks and distributes them among many executor processes that do the work. These executors can be scaled up and down as required for the application’s needs.

Spark SQL

Spark SQL has become more and more important to the Apache Spark project. It is the interface most commonly used by today’s developers when creating applications. Spark SQL is focused on the processing of structured data, using a dataframe approach borrowed from R and Python (in Pandas). But as the name suggests, Spark SQL also provides a SQL2003-compliant interface for querying data, bringing the power of Apache Spark to analysts as well as developers.

Alongside standard SQL support, Spark SQL provides a standard interface for reading from and writing to other datastores including JSON, HDFS, Apache Hive, JDBC, Apache ORC, and Apache Parquet, all of which are supported out of the box. Other popular data stores—Apache Cassandra, MongoDB, Apache HBase, and many others—can be used by pulling in separate connectors from the Spark Packages ecosystem. Spark SQL allows user-defined functions (UDFs) to be transparently used in SQL queries.
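As a rough sketch (the file paths and the shout function are hypothetical), reading one built-in format, writing another, and registering a Python UDF might look like this in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("sql-example").getOrCreate()

# Built-in connectors: read JSON, write Parquet (paths are placeholders)
citiesDF = spark.read.json("cities.json")
citiesDF.write.mode("overwrite").parquet("cities.parquet")

# Register a Python UDF; it can then be used inside SQL expressions
spark.udf.register("shout", lambda s: s.upper() if s else None, StringType())
citiesDF.selectExpr("shout(name) AS name", "pop").show()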

Selecting some columns from a dataframe is as simple as this line of code:

citiesDF.select("name", "pop")

Using the SQL interface, we register the dataframe as a temporary table, after which we can issue SQL queries against it:

citiesDF.createOrReplaceTempView("cities")
spark.sql("SELECT name, pop FROM cities")

Behind the scenes, Apache Spark uses a query optimizer called Catalyst that examines data and queries in order to produce an efficient query plan for data locality and computation that will perform the required calculations across the cluster. Since Apache Spark 2.x, the Spark SQL interface of dataframes and datasets (essentially a typed dataframe that can be checked at compile time for correctness and take advantage of further memory and compute optimizations at run time) has been the recommended approach for development. The RDD interface is still available, but recommended only if your needs cannot be addressed within the Spark SQL paradigm (such as when you must work at a lower level to wring every last drop of performance out of the system).
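You can inspect the plan Catalyst generates for a query with explain(). A minimal sketch, assuming the citiesDF dataframe from the earlier examples:

# Show the physical plan Catalyst produced (use mode="extended" for the logical plans too)
citiesDF.select("name", "pop").where("pop > 1000000").explain(mode="formatted")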

Spark MLlib and MLflow

Apache Spark also bundles libraries for applying machine learning and graph analysis techniques to data at scale. MLlib includes a framework for creating machine learning pipelines, allowing for easy implementation of feature extraction, selections, and transformations on any structured dataset. MLlib comes with distributed implementations of clustering and classification algorithms such as k-means clustering and random forests that can be swapped in and out of custom pipelines with ease. Models can be trained by data scientists in Apache Spark using R or Python, saved using MLlib, and then imported into a Java-based or Scala-based pipeline for production use.
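A minimal PySpark sketch of such a pipeline, assuming a dataframe with hypothetical numeric pop and area columns, chains a feature transformer and a k-means estimator and then persists the fitted model:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Assemble raw columns into a feature vector, then cluster with k-means
assembler = VectorAssembler(inputCols=["pop", "area"], outputCol="features")
kmeans = KMeans(k=3, featuresCol="features", predictionCol="cluster")
pipeline = Pipeline(stages=[assembler, kmeans])

model = pipeline.fit(citiesDF)                   # fit on the hypothetical citiesDF
model.write().overwrite().save("cities_kmeans")  # save the fitted pipeline for reuse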

An open source platform for managing the machine learning life cycle, MLflow is not technically part of the Apache Spark project, but it is likewise a product of Databricks and others in the Apache Spark community. The community has been working on integrating MLflow with Apache Spark to provide MLOps features like experiment tracking, model registries, packaging, and UDFs that can be easily imported for inference at Apache Spark scale and with traditional SQL statements.
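As a sketch of that integration (the run name, parameter, and metric values are placeholders), MLflow's tracking API can record a fitted Spark ML model and its metadata from the same session:

import mlflow
import mlflow.spark

# Record parameters, metrics, and the fitted Spark ML pipeline model in the tracking server
with mlflow.start_run(run_name="kmeans-cities"):
    mlflow.log_param("k", 3)
    mlflow.log_metric("silhouette", 0.71)   # placeholder value for illustration
    mlflow.spark.log_model(model, "model")  # the pipeline model fitted above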

Structured Streaming

Structured Streaming is a high-level API that allows developers to create infinite streaming dataframes and datasets. As of Spark 3.0, Structured Streaming is the recommended way of handling streaming data within Apache Spark, superseding the earlier Spark Streaming approach. Spark Streaming (now marked as a legacy component) was full of difficult pain points for developers, especially when dealing with event-time aggregations and late delivery of messages.

All queries on structured streams go through the Catalyst query optimizer, and they can even be run in an interactive manner, allowing users to perform SQL queries against live streaming data. Support for late messages is provided by watermarking messages and three supported types of windowing: tumbling windows, sliding windows, and variable-length session windows.
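A minimal PySpark sketch of watermarking with a tumbling window, using the built-in rate source (which emits timestamp and value columns) so the example is self-contained:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("watermark-example").getOrCreate()
events = spark.readStream.format("rate").load()

# Accept events up to 10 minutes late, counted in 5-minute tumbling windows
counts = (events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "5 minutes"))
    .count())

query = counts.writeStream.outputMode("update").format("console").start()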

In Spark 3.1 and later, you can treat streams as tables, and tables as streams. The ability to combine multiple streams with a wide range of SQL-like stream-to-stream joins creates powerful possibilities for ingestion and transformation. Here’s a simple example of creating a table from a streaming source:

val df = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 20)
  .load()

df.writeStream
  .option("checkpointLocation", "checkpointPath")
  .toTable("streamingTable")

spark.read.table("streamingTable").show()

Structured Streaming, by default, uses a micro-batching scheme of handling streaming data. But in Spark 2.3, the Apache Spark team added a low-latency Continuous Processing mode to Structured Streaming, allowing it to handle responses with impressive latencies as low as 1ms and making it much more competitive with rivals such as Apache Flink and Apache Beam. Continuous Processing restricts you to map-like and selection operations, and while it supports SQL queries against streams, it does not currently support SQL aggregations. In addition, although Spark 2.3 arrived in 2018, as of Spark 3.3.2 in March 2023, Continuous Processing is still marked as experimental.
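Opting into Continuous Processing comes down to the trigger you choose when starting the query. A sketch in PySpark (the one-second value is the checkpoint interval, not the latency):

# Continuous Processing: specify a checkpoint interval instead of micro-batches
query = (spark.readStream.format("rate").load()
    .writeStream
    .format("console")
    .trigger(continuous="1 second")
    .start())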

Structured Streaming is the future of streaming applications with the Apache Spark platform, so if you’re building a new streaming application, you should use Structured Streaming. The legacy Spark Streaming APIs will continue to be supported, but the project recommends porting over to Structured Streaming, as the new method makes writing and maintaining streaming code a lot more bearable.

Delta Lake

Like MLflow, Delta Lake is technically a separate project from Apache Spark. Over the past couple of years, however, Delta Lake has become an integral part of the Spark ecosystem, forming the core of what Databricks calls the Lakehouse Architecture. Delta Lake augments cloud-based data lakes with ACID transactions, unified querying semantics for batch and stream processing, and schema enforcement, effectively eliminating the need for a separate data warehouse for BI users. Full audit history and scalability to handle exabytes of data are also part of the package.

And using the Delta Lake format (built on top of Parquet files) within Apache Spark is as simple as using the delta format:

df = spark.readStream.format("rate").load()

stream = (df.writeStream
    .format("delta")
    .option("checkpointLocation", "checkpointPath")
    .start("deltaTable"))
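Reading the resulting Delta table back works the same way, whether as a batch dataframe or as a new stream (this sketch assumes the delta-spark package is available to your Spark session):

# Batch read and streaming read of the same Delta table
batch_df = spark.read.format("delta").load("deltaTable")
stream_df = spark.readStream.format("delta").load("deltaTable")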

Pandas API on Spark

The industry standard for data manipulation and analysis in Python is the Pandas library. With Apache Spark 3.2, a new API was provided that allows a large proportion of the Pandas API to be used transparently with Spark. Now data scientists can simply replace their imports with import pyspark.pandas as pd and be somewhat confident that their code will continue to work, while also taking advantage of Apache Spark's multi-node execution. At the moment, around 80% of the Pandas API is covered, with a target of 90% coverage in upcoming releases.
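A short sketch, assuming a hypothetical cities.csv with pop, area, and country columns, and using the alias suggested above:

import pyspark.pandas as pd  # the Pandas-on-Spark API (also commonly aliased as ps)

# Familiar Pandas-style calls, executed by Spark across the cluster
psdf = pd.read_csv("cities.csv")
psdf["density"] = psdf["pop"] / psdf["area"]
print(psdf.groupby("country")["density"].mean().head())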

Running Apache Spark

At a fundamental level, an Apache Spark application consists of two main components: a driver, which converts the user’s code into multiple tasks that can be distributed across worker nodes, and executors, which run on those worker nodes and execute the tasks assigned to them. Some form of cluster manager is necessary to mediate between the two.

Out of the box, Apache Spark can run in a stand-alone cluster mode that simply requires the Apache Spark framework and a Java Virtual Machine on each node in your cluster. However, it’s more likely you’ll want to take advantage of a more robust resource management or cluster management system to take care of allocating workers on demand for you.
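As a rough sketch (the master URL and sizing values are placeholders), pointing a PySpark application at a standalone cluster and sizing its resources looks like this:

from pyspark.sql import SparkSession

# Connect the driver to a standalone master, cap total cores, and size each executor
spark = (SparkSession.builder
    .appName("standalone-example")
    .master("spark://master-host:7077")
    .config("spark.cores.max", "8")
    .config("spark.executor.memory", "4g")
    .getOrCreate())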

In the enterprise, this historically meant running on Hadoop YARN (YARN is how the Cloudera and Hortonworks distributions run Spark jobs), but as Hadoop has become less entrenched, more and more companies have turned toward deploying Apache Spark on Kubernetes. This has been reflected in the Apache Spark 3.x releases, which improve the integration with Kubernetes, including the ability to define pod templates for drivers and executors and to use custom schedulers such as Volcano.

If you seek a managed solution, then Apache Spark offerings can be found on all of the big three clouds: Amazon EMR, Azure HDInsight, and Google Cloud Dataproc.

Databricks Lakehouse Platform

Databricks, the company that employs the creators of Apache Spark, has taken a different approach than many other companies founded on the open source products of the Big Data era. For many years, Databricks has offered a comprehensive managed cloud service that offers Apache Spark clusters, streaming support, integrated web-based notebook development, and proprietary optimized I/O performance over a standard Apache Spark distribution. This mixture of managed and professional services has turned Databricks into a behemoth in the Big Data arena, with a valuation estimated at $38 billion in 2021. The Databricks Lakehouse Platform is now available on all three major cloud providers and is becoming the de facto way that most people interact with Apache Spark.

Apache Spark tutorials

Ready to dive in and learn Apache Spark? We recommend starting with the Databricks learning portal, which will provide a good introduction to the framework, although it will be slightly biased towards the Databricks Platform. For diving deeper, we’d suggest the Spark Workshop, which is a thorough tour of Apache Spark’s features through a Scala lens. Some excellent books are available too. Spark: The Definitive Guide is a wonderful introduction written by two maintainers of Apache Spark. And High Performance Spark is an essential guide to processing data with Apache Spark at massive scales in a performant way. Happy learning!

Ian Pointer is a senior big data and deep learning architect, working with Apache Spark and PyTorch. He has more than 15 years of development and operations experience.

Copyright © 2024 IDG Communications, Inc.

FAQs

What is Apache Spark?

Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast analytic queries against data of any size.

What is Apache Spark vs Hadoop?

Both Hadoop and Spark allow you to process big data in different ways. Apache Hadoop was created to delegate data processing to several servers instead of running the workload on a single machine. Meanwhile, Apache Spark is a newer data processing system that overcomes key limitations of Hadoop.

How does Spark work with Hadoop?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat.

What is Apache Hadoop in big data?

Apache Hadoop is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.

Will Apache Spark replace Hadoop?

Hadoop excels over Apache Spark in some business applications, but when processing speed and ease of use are taken into account, Apache Spark has its own advantages that make it unique. The most important thing to note is that neither of the two can replace the other.

What is Apache Spark in simple words?

Spark is an Apache project advertised as “lightning fast cluster computing”. It has a thriving open-source community and is the most active Apache project at the moment. Spark provides a faster and more general data processing platform.

What is replacing Hadoop?

BigQuery is a powerful alternative to Hadoop because it seamlessly integrates with MapReduce. Google continuously adds features and upgrades BigQuery to provide users with an exceptional data analysis experience. They have made it easy to import custom datasets and use them with services like Google Analytics.

Can we use Hadoop and Spark together?

Hadoop and Apache Spark can be used together as part of a big data processing and analytics solution. In fact, many organizations use both Hadoop and Spark in their big data infrastructure to take advantage of the strengths of each platform.

How do Hadoop and Spark compare for big data?

  • Performance: Spark is faster because it uses random access memory (RAM) instead of reading and writing intermediate data to disk, whereas Hadoop stores data across multiple sources and processes it in batches via MapReduce.
  • Cost: Hadoop runs at a lower cost since it relies on any disk storage type for data processing.

Why is Hadoop called Apache?

Inspired by Google's MapReduce, a programming model that divides an application into small fractions to run on different nodes, Doug Cutting and Mike Cafarella created Hadoop while working on the Apache Nutch project, which began in 2002, and Hadoop went on to become a top-level Apache Software Foundation project. According to a New York Times article, Doug named Hadoop after his son's toy elephant.

Is Apache Hadoop still used?

Is Hadoop still in demand? Hadoop remains applicable in specific cases, especially for big data processing and analytics tasks. Nevertheless, the big data technology landscape has advanced, with newer frameworks such as Apache Spark gaining favor due to improved performance and user-friendly features.

Can Spark work without Hadoop?

Spark doesn't need a Hadoop cluster to work. HDFS is just one of the file systems that Spark supports; it can read and process data from other file systems as well. Spark does not have a storage layer of its own, so for distributed computing it relies on a distributed storage system such as HDFS or Cassandra.

Can we run Apache Spark without Hadoop?

Spark can be used without Hadoop, although Hadoop and Spark are often used together in big data processing clusters.

Why not to use Apache Spark?

The most important point is that Spark adds overhead to the processing of your workload. If you use Spark, you increase network and CPU load and also add a relevant memory overhead to the actual workload. This is because Spark needs to compute the distribution of your workload.

Is Hadoop or Spark better?

Performance. Spark has been found to run 100 times faster in-memory, and 10 times faster on disk. It's also been used to sort 100 TB of data 3 times faster than Hadoop MapReduce on one-tenth of the machines. Spark has particularly been found to be faster on machine learning applications, such as Naive Bayes and k-means ...

Does Apache Spark need Hadoop?

As per the Spark documentation, Spark can run without Hadoop. You may run it in standalone mode without any resource manager. But if you want to run in a multi-node setup, you need a resource manager like YARN or Mesos and a distributed file system like HDFS or S3. So yes, Spark can run without Hadoop.

Is Apache Spark a language or tool?

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

What are the advantages and disadvantages of Spark vs Hadoop?

Spark is a good choice if you're working with machine learning algorithms or large-scale data. If you're working with giant data sets and want to store and process them, Hadoop is a better option. Hadoop is more cost-effective and easily scalable than Spark.
