Arenadata ADB-Spark Connector

Product
Developers: Arenadata (Arenadata Software)
Release date: 2021/07/22
Technology: SOA

2021: ADB-Spark Connector Release

On July 22, 2021, Arenadata announced the release of ADB-Spark Connector, a tool for exchanging data between Arenadata DB (an analytical MPP DBMS based on Greenplum) and Apache Spark (a distributed data processing framework that is part of the Hadoop ecosystem).

The connector is written in Scala (2.11.x and 2.12.x) using Twitter Finagle and ScalikeJDBC. It is built around an HTTP server that implements the gpfdist protocol. Compared with other existing ways of exchanging data with ADB, this approach offers parallel writing to Greenplum segments without involving the master, flexible partitioning when reading data from Greenplum into Spark, no need to install the gpfdist utility on each Spark node, and other advantages.
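As a rough illustration of what partitioned reading from Greenplum can look like, the sketch below builds one query per segment so that each Spark task could pull a disjoint slice of a table. It is only a minimal sketch under stated assumptions: the helper `segment_partition_queries`, the table name and the column names are hypothetical, and the article does not say which partitioning schemes the connector actually implements. The only Greenplum-specific detail relied on is the `gp_segment_id` system column, which identifies the segment that stores each row.

```python
from typing import List


def segment_partition_queries(table: str, num_segments: int, columns: List[str]) -> List[str]:
    """Build one SELECT per Greenplum segment (hypothetical helper, not the connector's API).

    Each query filters on the gp_segment_id system column, so the result
    sets are disjoint and together cover the whole table.
    """
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    col_list = ", ".join(columns) if columns else "*"
    return [
        f"SELECT {col_list} FROM {table} WHERE gp_segment_id = {seg}"
        for seg in range(num_segments)
    ]


# Example: a table spread over 4 primary segments (assumed cluster size).
queries = segment_partition_queries("public.sales", 4, ["id", "amount"])
for query in queries:
    print(query)
```

In a layout like this, the number of Spark partitions equals the number of segments. Other schemes, for example splitting by ranges of a column, would trade that simplicity for finer control over partition sizes.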

The gpfdist protocol is implemented on the Finagle framework, which performed better than the initially selected Akka HTTP when handling many simultaneous sessions from ADB segments.

The main functions of ADB-Spark Connector are:

  • reading data from Greenplum into Spark with support for several partitioning methods;
  • writing data from Spark to Greenplum using several write modes: Append, Overwrite, ErrorIfExists;
  • support for push-down operators;
  • extracting additional metadata from Greenplum, including statistics and data distribution schemes;
  • automatic generation of data schemas;
  • optimized execution of the count aggregate function.
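The three write modes named in the list above follow Spark's standard save-mode semantics. The following minimal sketch models those semantics against an in-memory dictionary of tables. The names `SaveMode`, `TableExistsError` and `write_rows` are hypothetical and exist only for illustration. How the connector translates each mode into Greenplum statements is not described in the source.

```python
from enum import Enum
from typing import Dict, List, Tuple

Row = Tuple


class SaveMode(Enum):
    APPEND = "append"
    OVERWRITE = "overwrite"
    ERROR_IF_EXISTS = "errorifexists"


class TableExistsError(Exception):
    """Raised when ErrorIfExists is used against an existing table."""


def write_rows(tables: Dict[str, List[Row]], name: str, rows: List[Row], mode: SaveMode) -> None:
    """Apply one write to the in-memory 'database' according to the save mode."""
    exists = name in tables
    if mode is SaveMode.ERROR_IF_EXISTS and exists:
        raise TableExistsError(f"table {name!r} already exists")
    if mode is SaveMode.OVERWRITE or not exists:
        # Overwrite replaces any previous contents; a missing table is created.
        tables[name] = list(rows)
    else:
        # Append to a table that already exists.
        tables[name].extend(rows)


db: Dict[str, List[Row]] = {}
write_rows(db, "events", [(1, "a")], SaveMode.ERROR_IF_EXISTS)  # creates the table
write_rows(db, "events", [(2, "b")], SaveMode.APPEND)           # keeps existing rows
write_rows(db, "events", [(3, "c")], SaveMode.OVERWRITE)        # replaces everything
```

Spark's default mode is ErrorIfExists, which guards against overwriting or silently extending a table that already holds data.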

{{quote 'author = Dmitry Pluzhnikov, director of the Arenadata system architecture department.|The solution we developed will be useful for customers who combine Arenadata Hadoop and Arenadata DB when building their enterprise storage. ADB-Spark Connector provides fast bi-directional communication between them, allowing you to read and write data as efficiently as possible.}}

Compared with its closest analogue on the market, the Spark-Greenplum connector from Pivotal, ADB-Spark Connector offers more flexible partitioning (five methods instead of two) and supports more data types (including intervals and arrays). It also provides additional functionality, including support for batch mode in Spark, statistics collection for building query plans with Catalyst, and execution of arbitrary SQL queries through the ADB master.

As of July 2021, ADB-Spark Connector supports Spark 2.3.x and 2.4.x. Development plans include adding Spark 3.x support and implementing streaming functionality.