Apache Parquet is an efficient columnar storage format designed for the Apache Hadoop ecosystem. A defining feature of Parquet is that it organizes data by column rather than by row, which lets queries over large amounts of data read only the columns they need and thus run considerably faster. Some key features of Parquet, a few of which are illustrated in the sketches below, are:
- Compression. The format supports many compression algorithms, such as Snappy, Gzip, and LZO. This reduces storage requirements and minimizes the amount of data that has to be read from disk when executing queries;
- Metadata. Parquet files store metadata and per-column statistics such as minimum and maximum values, data types, and encoding information;
- Predicate pushdown. This feature allows query engines to pass filters down to the storage layer, letting Parquet skip reading irrelevant data entirely during query execution;
- Data types. Parquet supports primitive data types (e.g. integers, floating-point numbers, strings) and can efficiently handle complex data structures (e.g. arrays, maps, structs), including deeply nested ones;
- Portability. Parquet files are portable across many platforms and languages, making it easier to exchange data between different systems.
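As an illustration of the columnar layout and the compression support listed above, here is a minimal sketch of writing a small Parquet file with Snappy compression through the parquet-avro module. The schema, file name, and field names are illustrative assumptions, not anything prescribed by the format.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteExample {
    public static void main(String[] args) throws Exception {
        // A simple two-column schema, declared through Avro.
        Schema schema = SchemaBuilder.record("Event").fields()
                .requiredInt("id")
                .requiredString("name")
                .endRecord();

        // Write a few records with Snappy compression, one of the
        // codecs mentioned in the feature list above.
        try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path("events.parquet"))
                             .withSchema(schema)
                             .withCompressionCodec(CompressionCodecName.SNAPPY)
                             .build()) {
            for (int i = 0; i < 3; i++) {
                GenericRecord record = new GenericData.Record(schema);
                record.put("id", i);
                record.put("name", "event-" + i);
                writer.write(record);
            }
        }
    }
}
```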
To work with data in this format from Java, the Apache Parquet Java library is used.
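A companion sketch shows predicate pushdown from the feature list in action: a filter built with the library's filter2 API is handed to the reader, which can then skip row groups whose statistics rule the predicate out. The column name and filter value are, again, assumptions for illustration.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.hadoop.ParquetReader;

import static org.apache.parquet.filter2.predicate.FilterApi.eq;
import static org.apache.parquet.filter2.predicate.FilterApi.intColumn;

public class ParquetReadExample {
    public static void main(String[] args) throws Exception {
        // The filter is passed to the reader and evaluated against
        // row-group statistics, so irrelevant data can be skipped
        // before it is ever read from disk.
        try (ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(new Path("events.parquet"))
                             .withFilter(FilterCompat.get(eq(intColumn("id"), 1)))
                             .build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}
```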
History
2025: Vulnerability discovered in popular Java library allows big data systems to be hijacked
The NCCC issued warning VULN-20250409.52[1] about a critical vulnerability discovered in early April in the open source project Apache Parquet Java. The development community has released fixes, which the center recommends installing promptly "after assessing all associated risks."
The Apache Parquet Java library is designed for working with Apache Parquet files from Java. Parquet is a format for efficient data storage and processing that supports high-performance compression, encoding, and bulk processing of complex structures. A file can, however, be crafted so that the Java library incorrectly deserializes untrusted data when processing it. This allows external attackers to execute arbitrary code by means of a specially prepared file uploaded to the affected system.
The vulnerability was discovered in version 1.15.0 by Keyi Li, a security researcher at Amazon; all versions of the library since 1.8 are affected. The flaw has been fixed in version 1.15.1. The severity of the vulnerability is rated 10 out of 10 on the CVSS scale, since it allows remote code execution without any user interaction.
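Where an application team wants a quick runtime self-check against these version ranges, the following minimal sketch reads the library version from the jar manifest and compares it with the affected range. It assumes the Parquet jar on the classpath carries a standard Implementation-Version manifest entry; the class and the version-parsing logic are illustrative, not an official tool.

```java
import org.apache.parquet.hadoop.ParquetReader;

public class ParquetVersionCheck {
    public static void main(String[] args) {
        // Version recorded in the jar manifest; may be null if the jar
        // (or a shaded repackaging) lacks an Implementation-Version entry.
        String version = ParquetReader.class.getPackage().getImplementationVersion();
        if (version == null) {
            System.out.println("Parquet version unknown; verify the dependency manually.");
        } else if (isAffected(version)) {
            System.out.println("Parquet " + version + " is affected; upgrade to 1.15.1 or later.");
        } else {
            System.out.println("Parquet " + version + " is not in the affected range.");
        }
    }

    // Affected range reported above: 1.8 up to and including 1.15.0.
    static boolean isAffected(String version) {
        String[] parts = version.split("[.-]");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        int patch = parts.length > 2 && parts[2].matches("\\d+")
                ? Integer.parseInt(parts[2]) : 0;
        if (major != 1) return false;
        if (minor < 8 || minor > 15) return false; // affected range starts at 1.8, fixed line is 1.15.x
        if (minor < 15) return true;               // 1.8 .. 1.14.x: affected
        return patch < 1;                          // 1.15.0 affected, 1.15.1+ fixed
    }
}
```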
"The Apache Parquet format is in considerable demand in Russia, especially in areas where large volumes of data are routinely processed," Alexey Krasnov, lead developer at MD Audit, commented on the situation for TAdviser. "It is valued for its high efficiency and its compatibility with popular analytics tools such as Hadoop and Spark. Attackers can exploit this flaw to execute arbitrary code on the attacked system. The consequences can be extremely unpleasant, from losing control of the system to leaks of confidential information or malware infection. And it is exploited quite simply: by sending specially crafted data."
However, not all companies use tools such as Hadoop and Spark, so overall the share of corporate information systems that use this library may be quite low.
"The Apache Parquet Java library is not very common in Russia," said Daniil Chernov, head of Solar appScreener. "It is used only in analytical systems; the main users are fintech services, telecom operators and, to a lesser extent, retail. The key protection in this case is timely updating of dependencies and control over the software supply chain using code analysis platforms. We also recommend running SCA tools in automatic mode to check the security of the libraries developers use in their software."
BI.Zone also confirms this picture of the library's prevalence. However, the big data systems that are susceptible to this vulnerability usually contain a great deal of valuable information.
"According to BI.Zone TDR data, the vulnerability is not widespread, and vulnerable hosts are rare in real infrastructures," Pavel Blinnikov, head of the vulnerability research team at BI.Zone, shared with TAdviser readers. "The vulnerability affects big data frameworks: Spark, Flink, Hadoop. In addition, any application that supports the Apache Parquet file format is vulnerable. A massive attack using this type of vulnerability is unlikely, since the vulnerable functionality is most often available only after authorization on the server. However, if an attacker has gained such access through another vulnerability or an incorrect server configuration, such an attack becomes possible."
In practice, this means the vulnerability can be used in targeted attacks on big data systems, so every company with such solutions in its infrastructure should check both that all components are up to date and that its defenses are effective enough to notice exploitation of the vulnerability in time and respond to it.
"The vulnerability affects many versions of the library, starting with 1.8, and the fixed version 1.15.1 was released only in April," Sergei Smirnov, head of the DevSecOps cluster of the Sphere platform, reminded TAdviser readers. "However, to mount a massive attack, attackers would need information about specific products and configurations, which greatly complicates the task. Still, if such data or insider information is available, large-scale incidents are possible. To protect against exploitation of the vulnerability, it is first necessary to ensure control and monitoring of the versions in use; this task can be automated with mass analysis tools. Second, specialized solutions should be used that automatically analyze the component composition of built product images and identify unsafe elements."
In practice, this means introducing full-fledged DevSecOps pipelines that automatically check all dependencies and detect vulnerable components as soon as they appear in them.
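As a sketch of what such an automated pipeline check might look like, the hypothetical CI step below scans a build's lib/ directory and flags Parquet jars older than the fixed 1.15.1 release. The directory layout and the artifact-version.jar naming convention are assumptions about a typical Java build, not a specific tool's behavior.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DependencyAudit {
    // Matches e.g. parquet-avro-1.15.0.jar and captures major.minor.patch.
    private static final Pattern PARQUET_JAR =
            Pattern.compile("parquet-[a-z-]+-(\\d+)\\.(\\d+)\\.(\\d+).*\\.jar");

    public static void main(String[] args) throws IOException {
        Path libDir = Paths.get(args.length > 0 ? args[0] : "lib");
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                Matcher m = PARQUET_JAR.matcher(jar.getFileName().toString());
                if (m.matches() && isBeforeFix(m)) {
                    System.out.println("VULNERABLE: " + jar + " (upgrade to 1.15.1 or later)");
                }
            }
        }
    }

    // True for any version below the fixed 1.15.1 release.
    private static boolean isBeforeFix(Matcher m) {
        int major = Integer.parseInt(m.group(1));
        int minor = Integer.parseInt(m.group(2));
        int patch = Integer.parseInt(m.group(3));
        if (major != 1) return major < 1;
        if (minor != 15) return minor < 15;
        return patch < 1;
    }
}
```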
"Protection against vulnerabilities in open source libraries requires an integrated approach at every stage of the software development life cycle," Sergei Matusevich, director of AI and web technology development at Artezio, recommended to TAdviser readers. "First, you need to implement a rigorous dependency management process that includes documenting all libraries used, their versions, and licenses. Second, it is critical to use automated vulnerability scanning tools; there are both commercial and open solutions for this task, for example OWASP Dependency-Check, Snyk, and Sonatype Nexus IQ Server. Third, I strongly recommend establishing a process for prompt response to vulnerabilities: track security bulletins, including those from the NCCC, and have a clear action plan for when critical vulnerabilities are found in the components you use."