Customers: Uber
Contractors: Uber
Product: Apache Hadoop
Second product: Apache Spark
Project date: 2014/01 - 2018/09
In October 2018, Uber published an article on its website describing how its IT infrastructure is organized to process huge volumes of data with minimal delay while millions of trips are made through the service every day.
Until 2014, the amount of data stored by Uber was small (only a few terabytes), which allowed the company to get by with a small number of traditional OLTP databases. There was no global view of the data, but access to it was fast enough because each database was queried directly (see the illustration below).
Since 2014, Uber has been working on a Big Data solution to ensure the reliability, scalability, and usability of its data. In 2018 the company focused on the speed and efficiency of the platform, which is called Hadoop Upserts and Incremental (Hudi). It is built on the open Apache Spark framework, which is designed for distributed processing of unstructured and semi-structured data. Before Hudi was created, the Big Data platform went through the stages of development captured in the images below.
Hudi allows users to incrementally pull only the changed data, which significantly improves query efficiency. It scales horizontally and can be used from any Spark job. Finally, Uber notes, its main advantage is that it relies only on the HDFS distributed file system.
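As an illustration, a minimal PySpark sketch of such an incremental pull is given below. It assumes a Spark session with the Hudi bundle on the classpath; the HDFS path and checkpoint timestamp are hypothetical placeholders, and the option keys follow Hudi's documented DataSource read API (they may differ slightly between Hudi versions).

```python
from pyspark.sql import SparkSession

# Sketch of an incremental pull from a Hudi table in a Spark job.
# Assumes the Hudi Spark bundle is available; path and timestamp are placeholders.
spark = SparkSession.builder.appName("hudi-incremental-pull").getOrCreate()

begin_time = "20181001000000"  # commit instant of the last checkpoint (yyyyMMddHHmmss)

trips_changed = (
    spark.read.format("hudi")
    # ask for the incremental view instead of a full snapshot
    .option("hoodie.datasource.query.type", "incremental")
    # return only records committed after this instant
    .option("hoodie.datasource.read.begin.instanttime", begin_time)
    .load("hdfs:///data/uber/trips")
)

# downstream logic sees only the rows changed since the checkpoint
trips_changed.show()
```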
Why was Hudi created?
Uber studied the contents of its databases, the access patterns to them, and user-specific requirements in order to identify bottlenecks. As a result, the company arrived at four main limitations that had to be overcome with Hudi.
Limited scalability in HDFS
Many companies that use HDFS to scale their Big Data infrastructure face this problem. Storing a huge number of small files can significantly degrade performance, because a weak point of HDFS is the NameNode, a metadata service that runs as a single instance. This becomes a serious problem when the data size exceeds 50-100 PB.
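As a rough illustration of why small files hurt, the back-of-the-envelope sketch below estimates NameNode heap usage; the ~150 bytes per namespace object is a commonly quoted approximation, not an exact figure.

```python
# The NameNode keeps every file and block object in heap memory;
# ~150 bytes per object is a commonly used approximation.
BYTES_PER_OBJECT = 150

def namenode_heap_gb(num_files: int, blocks_per_file: float) -> float:
    objects = num_files * (1 + blocks_per_file)  # file entries + block entries
    return objects * BYTES_PER_OBJECT / 1024**3

# 1 billion tiny files (one block each) vs. the same data consolidated into larger files
print(namenode_heap_gb(1_000_000_000, 1))  # ~280 GB of heap just for metadata
print(namenode_heap_gb(10_000_000, 8))     # ~13 GB after consolidation
```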
The need for fast data delivery into Hadoop
Since Uber operates in real time, the company needs to provide its services using the most up-to-date data. It was important for the service to deliver data much faster, because a 24-hour data latency (the time needed to extract and write a batch of data) was too long for many use cases.
Lack of direct support for updates and deletions of existing data
Uber used data snapshots, which meant ingesting a fresh copy of the source data every 24 hours. Since the service needs to work with up-to-date data, it required a solution that supported updating and deleting already written data. However, the company's Big Data is stored in HDFS in the Parquet format, so it was not possible to implement such updates directly.
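Hudi removes this limitation by supporting upserts on top of Parquet files in HDFS. Below is a minimal, hedged PySpark sketch of such an upsert; the table name, record key, fields, and path are hypothetical, and the option keys follow Hudi's DataSource write API.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

# Hypothetical batch containing both brand-new trips and updates to old ones.
updates = spark.createDataFrame(
    [("trip-1", "2018-10-01 12:00:00", 23.5),
     ("trip-2", "2018-10-01 12:05:00", 9.8)],
    ["trip_id", "updated_at", "fare"],
)

(updates.write.format("hudi")
    .option("hoodie.table.name", "trips")
    # upsert: update existing records by key, insert the rest
    .option("hoodie.datasource.write.operation", "upsert")
    # key used to locate the existing row to overwrite
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    # when two updates share a key, keep the one with the latest value here
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .mode("append")
    .save("hdfs:///data/uber/trips"))
```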
Fast ETL and modeling processes
ETL is one of the basic processes of data-warehouse management: it covers extracting data from external sources, transforming and cleaning it so that it meets the requirements of the business model, and loading it into the warehouse. In Uber's system, the ETL and modeling processes were also snapshot-based, so the platform had to rebuild derived tables on every run. Fast, incremental ETL is needed to further reduce data latency.
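A hedged sketch of what an incremental ETL step could look like when both reads and writes go through Hudi (the paths, checkpoint handling, and aggregation are all hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hudi-incremental-etl").getOrCreate()

last_checkpoint = "20181001000000"  # hypothetical: persisted by the previous run

# 1. Pull only the source rows that changed since the last run.
changed = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_checkpoint)
    .load("hdfs:///data/uber/trips"))

# 2. Recompute derived rows only for the affected keys (hypothetical model).
derived = changed.groupBy("city_id").agg(F.sum("fare").alias("gross_fare"))

# 3. Upsert the recomputed rows instead of rebuilding the whole derived table.
(derived.write.format("hudi")
    .option("hoodie.table.name", "city_revenue")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "city_id")
    .option("hoodie.datasource.write.precombine.field", "gross_fare")
    .mode("append")
    .save("hdfs:///data/uber/city_revenue"))
```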
The diagram below shows how the third generation of Uber's Big Data platform works after the implementation of Hudi. It was reportedly built in 2017 with long-term operation in mind and processes about 100 PB of data.
Regardless of whether the updated data consists of new records in recent partitions or changes to older data, Hudi lets a consumer pass the timestamp of its last checkpoint and receive all records that have been updated since then. This extraction happens without running an expensive query that scans the entire source table.
Using this library, Uber moved away from snapshot-based data ingestion to an incremental model. As a result, data latency was reduced from 24 hours to less than one hour.
At the same time, the company continues to improve the system. In particular, it aims to improve the quality of the data in its warehouse and to reduce the latency of raw data in Hadoop to five minutes, and to ten minutes for modeled tables.
Uber plans to add an additional view mode to Hudi, consisting of a read-optimized view of the data as well as a new view that exposes data with a delay of only a few minutes. For this it will use open-source technology that the company calls Merge-On-Read, or Hudi 2.0.
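In current Hudi terms, a Merge-On-Read table can be queried either through a read-optimized view (compacted columnar files only) or through a snapshot view that also merges the latest delta logs on the fly. The PySpark sketch below illustrates the difference; the table path is a placeholder and the query-type values follow Hudi's documentation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-mor-views").getOrCreate()
path = "hdfs:///data/uber/trips_mor"  # hypothetical Merge-On-Read table

# Read-optimized view: reads only compacted base files; cheapest queries,
# but the data is only as fresh as the last compaction.
read_optimized = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(path))

# Snapshot view: merges base files with the latest delta logs on the fly,
# exposing data that is only minutes old at a higher query cost.
near_real_time = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load(path))
```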
Uber is also moving away from dedicated hardware for its services and switching to containerization. It is consolidating all resource-allocation mechanisms in and around the Hadoop ecosystem in order to bridge the gap between Hadoop and the company's services that are not related to data.
The next version of Hudi will let Uber generate much larger Parquet files (over 1 GB, compared with 128 MB as of October 2018) by default within a few minutes for all data sources. Thanks to this, as well as a reliable, source-agnostic data ingestion platform, Hudi will scale with the business, Uber says.[1]
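For context, the target Parquet file size is a per-write Hudi storage setting; a hedged sketch of raising it is shown below (the key names follow Hudi's storage configs, and the exact defaults may differ between versions).

```python
# Hypothetical write options asking Hudi to target ~1 GB Parquet base files.
hudi_file_sizing = {
    # upper bound for a base file produced by a write
    "hoodie.parquet.max.file.size": str(1024 * 1024 * 1024),
    # files smaller than this are candidates to receive new inserts on later writes
    "hoodie.parquet.small.file.limit": str(128 * 1024 * 1024),
}
```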