Developers: ITSumma (Total I&T)
Initial release date: 2023/08/30
Last release date: 2024/03/20
Technology: Big Data
Main article: Big Data
2024: Compatible with Apache Spark version 3.0 and higher
ITSumma introduced an updated version of the connector on March 20, 2024. Until then, the solution worked only with Apache Spark 2.0; the connector is now compatible with Apache Spark 3.0 and higher.
Spark-greenplum-connector is designed to replace the connector built into Apache Spark. With it, data engineers can increase the speed of reading from and writing to the database and quickly scale the number of connected and processed sources.
Compared to the previous version of the solution, performance increased 10-20 times, from 1 Mbps to 10-20 Mbps, ITSumma noted. According to the developers, the gain came from a zero-copy approach: the connector no longer copies the internal caches of the binary row representation.
"General optimization was carried out for the connector, reducing the delay between batches and micro-batches in Spark. The 10-20x speedup came mainly from changing how the buffer is handled: instead of copying it, the connector now passes a pointer to it. Technical solutions of this kind made it possible to significantly increase performance," said Alexey Ponamorevsky, lead developer of the Spark-Greenplum-Connector project.
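The difference between copying a buffer and passing a reference to it can be illustrated with a minimal Python sketch. It is not the connector's code (the connector runs on the JVM inside Spark); the buffer contents and variable names are assumptions made up for the illustration. A `bytes` slice duplicates the data, while a `memoryview` slice only points into the original buffer.

```python
# Illustrative only: a generic copy vs. zero-copy comparison,
# not the connector's actual implementation.

# A mutable buffer standing in for a block of binary row data (hypothetical).
payload = bytearray(b"0123456789")

# Copying: slicing and converting to bytes allocates a new object
# and duplicates the selected bytes.
copied = bytes(payload[2:6])

# Zero-copy: a memoryview slice is a lightweight reference into `payload`;
# no bytes are duplicated.
viewed = memoryview(payload)[2:6]

# Mutating the original buffer shows which of the two shares memory with it.
payload[2] = ord("X")

print(copied)         # the independent copy is unaffected by the mutation
print(bytes(viewed))  # the view reflects the mutation
```

The copy costs time and memory proportional to the slice size on every transfer, while the view costs a constant amount regardless of size, which is why avoiding copies tends to pay off most on large buffers.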
Based on the connector, you can build ETL solutions and analyze the quality of data. It is highly flexible in configuration and has all the functionality necessary for integration into big data platforms.
The connector is applicable wherever large amounts of data need to be streamed, particularly in industries with telemetry or a constant stream of events: finance, e-commerce, telecom, media, manufacturing and industry, advertising, transport and logistics, and others.
2023: Development of an open source plugin for Apache Spark
On August 30, 2023, ITSumma announced the development of an open source plugin for Apache Spark that significantly speeds up data processing through parallel read and write operations.
Spark-greenplum-connector is a multifunctional plugin for big data processing and analysis platforms. By using it instead of the connector built into Apache Spark, data engineers can increase the speed of reading from and writing to the Greenplum database tenfold and quickly scale the number of connected and processed sources.
With the connector, engineers can configure structured streaming based on micro-batching. This keeps the required data up to date and brings processing close to real time.
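The idea of micro-batching can be sketched in plain Python, without Spark. This is a simplified illustration under stated assumptions: the `micro_batches` helper is hypothetical, and it groups events by count, whereas Spark Structured Streaming typically forms micro-batches on a trigger interval.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(events: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a (possibly unbounded) event stream into small batches.

    Hypothetical helper for illustration; not part of Spark or the connector.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(events)
    while True:
        # Take up to `batch_size` events without reading the whole stream.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


# Seven events processed in batches of three; the last batch is partial.
batches = list(micro_batches(range(7), 3))
print(batches)
```

Each batch can be written to the sink as soon as it is formed, so the latency of the pipeline is bounded by the batch size rather than by the size of the whole data set.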
Spark-greenplum-connector has a number of additional features. For example, an anonymous block or a PL/pgSQL function can be used as a source or sink for read and write operations, which makes it possible to delegate part of the data processing to the database side.
Based on it, you can build ETL solutions and analyze data in memory. It offers a high data transfer rate and flexible configuration, and also:
- automatically generates data schemas;
- splits computations into independent parallel streams;
- supports operator pushdown.
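The last two points can be illustrated together with a generic sketch: a key range is split into independent chunks, and each chunk becomes a query whose filter is evaluated by the database rather than by Spark. This is not the connector's actual mechanism, which the source does not describe; the table name `events`, the column `id`, and both helper functions are assumptions made up for the example.

```python
from typing import List, Tuple


def split_range(lower: int, upper: int, parts: int) -> List[Tuple[int, int]]:
    """Split the half-open range [lower, upper) into contiguous chunks."""
    if parts < 1 or upper <= lower:
        raise ValueError("need parts >= 1 and upper > lower")
    base, extra = divmod(upper - lower, parts)
    bounds = []
    start = lower
    for i in range(parts):
        # The first `extra` chunks get one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more parts than keys: skip empty chunks
        bounds.append((start, start + size))
        start += size
    return bounds


def partition_queries(table: str, column: str, lower: int, upper: int,
                      parts: int) -> List[str]:
    """Build one query per chunk; the WHERE clause is pushed to the database."""
    return [
        f"SELECT * FROM {table} WHERE {column} >= {lo} AND {column} < {hi}"
        for lo, hi in split_range(lower, upper, parts)
    ]


# Hypothetical table and column names, used only for illustration.
queries = partition_queries("events", "id", 0, 10, 3)
for query in queries:
    print(query)
```

Because the chunks do not overlap, each query can run in its own task without coordination, and filtering on the database side means only the matching rows travel over the network.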
"Solutions like this, which enable big data processing systems to operate, are important for import substitution. Given that importance, we decided to make our connector publicly available," said Timur Hasanov, CTO of ITSumma.