
Apache Spark

Product
Developers: Apache Software Foundation (ASF)
Last Release Date: 2020/05/14
Technology: Application development tools


Apache Spark is a framework for building distributed processing of unstructured and semi-structured data; it is part of the Hadoop project ecosystem[1].

2020: Release of Apache Spark 3.0

On May 14, 2020, NVIDIA announced a collaboration with the open-source developer community to bring GPU acceleration to Apache Spark 3.0, an engine for analyzing and processing large volumes of data.

According to the developer, with the release of Spark 3.0, data and machine learning specialists will also be able to apply GPU acceleration to ETL processing (extract, transform, load), which is often carried out with SQL database operations.
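As an illustration, below is a minimal sketch of the kind of ETL job this refers to, written with Spark's DataFrame API in Scala. The input path, schema and column names are hypothetical; the point is only that the job consists of ordinary SQL/DataFrame operations, which is the class of work NVIDIA says can be GPU-accelerated.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("etl-sketch").getOrCreate()

// Extract: read raw events (hypothetical path and layout)
val raw = spark.read
  .option("header", "true")
  .csv("hdfs:///data/raw/events")

// Transform: filter, cast and aggregate with DataFrame/SQL operations
val cleaned = raw
  .filter(col("amount").isNotNull)
  .withColumn("amount", col("amount").cast("double"))
  .groupBy("customer_id")
  .agg(sum("amount").as("total_amount"))

// Load: write the result back as Parquet
cleaned.write.mode("overwrite").parquet("hdfs:///data/curated/customer_totals")
```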

Apache Spark 3.0

As noted by NVIDIA, AI model training can be carried out on the same Spark cluster, without moving workloads to separate infrastructure. This makes high-speed analysis possible at every stage of the data analytics pipeline, accelerating tens to thousands of terabytes of data, from the data lake to model training. Moreover, there is no need to change the existing code of Spark applications running on-premises or in the cloud.

"Data analysis is the biggest computing task facing the companies and researchers. Native GPU acceleration for all pipeline of Spark 3.0 — from ETL before training and an inferens — provides performance and scale necessary for consolidation of potential of Big Data and force of AI",

noted Manuvir Das, the head of Enterprise Computing in NVIDIA

As a strategic AI partner of NVIDIA, Adobe was among the first to gain access to Spark 3.0 on Databricks. Initial tests showed a 7-fold performance gain and a 90% cost reduction thanks to GPU-accelerated data analysis, used for product development in Adobe Experience Cloud and for capabilities that support digital businesses.

According to the developer, the performance gains in Spark 3.0 improve model accuracy by allowing models to be trained on larger data sets and retrained more often. This makes it possible to process terabytes of new data every day, which is very important for specialists maintaining recommender systems and analyzing new research data. In addition, accelerated processing means fewer hardware resources are needed to obtain results, so costs are considerably reduced.

NVIDIA and Databricks are jointly optimizing Spark with RAPIDS software for Databricks, providing GPU acceleration for data processing and machine learning on Databricks in healthcare, finance, retail and many other industries, NVIDIA emphasized.

NVIDIA provides the open-source RAPIDS Accelerator for Apache Spark to help specialists increase the performance of their pipelines. The accelerator intercepts operations previously executed on the CPU and offloads them to the GPU for the following tasks (a configuration sketch follows the list):

  • accelerating ETL pipelines in Spark by improving the performance of Spark SQL and DataFrame operations, without requiring code changes;
  • accelerating data preparation and model training on the same infrastructure, without a separate cluster for machine and deep learning;
  • accelerating data transfer between the nodes of a distributed Spark cluster. These libraries use the open Unified Communication X (UCX) framework of the UCF Consortium and minimize latency by moving data directly into GPU memory.
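The following is a hedged configuration sketch: assuming the RAPIDS Accelerator and cuDF jars are on the executor classpath, a Spark 3.0 session might be configured to use the plugin as shown below. The option names follow the RAPIDS Accelerator documentation of that period and may differ between versions; the application code itself stays unchanged.

```scala
import org.apache.spark.sql.SparkSession

// Same ETL code as before; only the configuration differs.
// Plugin class and option names are taken from the RAPIDS Accelerator
// documentation of that period (assumption; check current docs).
val spark = SparkSession.builder()
  .appName("etl-on-gpu")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   // load the RAPIDS SQL plugin
  .config("spark.rapids.sql.enabled", "true")              // run SQL/DataFrame plans on the GPU
  .config("spark.executor.resource.gpu.amount", "1")       // one GPU per executor
  .getOrCreate()
```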

As of May 2020, a preview version of Spark 3.0 is already available from the Apache Software Foundation. Access to the major release is expected to open in the coming months.

2016: Description of Apache Spark

Unlike the classic engine at the core of Hadoop, which implements the two-stage MapReduce model with disk-based storage, Apache Spark uses specialized primitives for iterative in-memory processing. As a result it can be significantly faster for certain classes of tasks; in particular, the ability to repeatedly access user data already loaded into memory makes the library attractive for machine learning algorithms.
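A minimal sketch of such iterative in-memory processing in Scala: the data set is cached once and then reused across the iterations of a simple gradient-descent loop. The file path and the algorithm are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical one-column numeric data set, kept in memory between passes
val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.toDouble)
  .cache()                      // keep the parsed data in RAM

// Iterative refinement: each pass reuses the cached data instead of
// re-reading it from disk, which is where Spark gains over MapReduce
var center = 0.0
val rate = 0.1
for (_ <- 1 to 20) {
  // gradient of the mean squared error with respect to `center`
  val grad = points.map(x => center - x).mean()
  center -= rate * grad
}
println(s"center = $center")
```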

Architecture of Apache Spark (2015)

As of April 2016, the project provides programming interfaces for Java, Scala, Python, and R. It is written mostly in Scala and consists of a core and several extensions (a short Spark SQL usage sketch follows the list):

  • Spark SQL (allows executing SQL queries over data),
  • Spark Streaming (an add-on for processing streaming data),
  • Spark MLlib (a set of machine learning libraries),
  • GraphX (intended for distributed graph processing).
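As an illustration of the Spark SQL extension, here is a short Scala sketch; the data source and table name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-sketch").getOrCreate()

// Hypothetical orders data set; Spark SQL lets it be queried with plain SQL
val orders = spark.read.json("hdfs:///data/orders.json")
orders.createOrReplaceTempView("orders")

val topCustomers = spark.sql(
  """SELECT customer_id, SUM(price) AS total
    |FROM orders
    |GROUP BY customer_id
    |ORDER BY total DESC
    |LIMIT 10""".stripMargin)

topCustomers.show()
```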

Spark can run in a Hadoop cluster under YARN, as well as without the Hadoop core components, and supports several distributed storage systems: HDFS, OpenStack Swift, the NoSQL DBMS Cassandra, and Amazon S3.
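A hedged sketch of how these storage backends are typically addressed from Scala is shown below: the backend is selected by the URI scheme, and each one assumes the corresponding connector is available on the classpath. Paths, bucket and keyspace names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("storage-sketch").getOrCreate()

// The storage system is chosen by the URI scheme; the matching connector
// (e.g. hadoop-aws for s3a://, hadoop-openstack for swift://) must be on
// the classpath (assumption about the deployment).
val fromHdfs  = spark.read.parquet("hdfs:///warehouse/events")           // HDFS
val fromS3    = spark.read.parquet("s3a://my-bucket/events")             // Amazon S3
val fromSwift = spark.read.parquet("swift://container.provider/events")  // OpenStack Swift

// Cassandra is accessed through the Spark Cassandra Connector
val fromCassandra = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "shop", "table" -> "events"))
  .load()
```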

Notes

See Also