Developer: Databricks
Technology: BI, DBMS
General manager: Ion Stoica
The release of Apache Spark in May 2014 was one of the most significant advances in the "Big Data" arena. Spark is an open-source in-memory processing engine[1] that outperforms the Hadoop platform in data-analytics performance[2].
Databricks was founded by several of Spark's developers, and in June 2014 the company introduced Databricks Cloud, a platform built on this technology. The hosted platform, then in beta testing, simplifies the deployment and provisioning of Spark and ships with a set of built-in applications for data collection and analysis. An organization can, for example, use Databricks Cloud to quickly process and analyze data stored in Amazon S3.
Databricks is built on Apache Spark, but the company has substantially modified the framework, giving it a higher level of API abstraction and faster in-memory processing, so that it not only complements "traditional" Hadoop but can also act as its replacement. Databricks' offering, Delta Lake, is a fully managed open-source variant of Spark that runs in the cloud and ships with several proprietary components[3].
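The "higher level of abstraction" described above can be illustrated with a plain-Python sketch (this is not the Spark API; the `TinyDataset` class and its methods are invented for illustration): transformations are chained declaratively and kept as deferred steps over in-memory data, and nothing runs until a result is actually requested, loosely mirroring how Spark defers computation until an action is called.

```python
# Illustrative sketch, NOT the real Spark API: chainable, lazily
# evaluated transformations over in-memory data.

class TinyDataset:
    def __init__(self, data):
        self._data = data   # source data, held in memory
        self._steps = []    # deferred transformations, not yet executed

    def map(self, fn):
        # Record the transformation; do not run it yet.
        self._steps.append(("map", fn))
        return self

    def filter(self, pred):
        self._steps.append(("filter", pred))
        return self

    def collect(self):
        # The "action": apply all deferred steps in one in-memory pass.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

result = (TinyDataset(range(10))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .collect())
print(result)  # [0, 4, 16, 36, 64]
```

Deferring execution this way lets an engine see the whole pipeline before running it, which is one reason such an API can be both simpler to use and faster than hand-written Hadoop MapReduce jobs.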
Delta Lake is a purely cloud-based project used by a number of large customers worldwide. According to Matei Zaharia, one of the founders of Apache Spark and CTO of Databricks, customers migrate from Spark to the Databricks platform for various reasons, but the move is often dictated by business requirements that increasingly center on working with cloud services. Customers' desire to connect data lakes residing both in the cloud and in on-premises storage became the company's motivation to create a solution for ensuring their reliability.
"Today almost every company has a data lake. They try to extract information from it, but its value and reliability often raise doubts. Delta Lake fixes these problems, as the interest in this solution from hundreds of enterprises attests. Given that Delta Lake is open source, developers will be able to freely build reliable data lakes," said Ali Ghodsi, co-founder and CEO of Databricks.
He also explained what "Delta data lakes" are and which file systems and data formats they support. "Delta Lake sits on top of your DWH (but does not replace it) and offers a transactional storage service layer both over HDFS and over Azure Blob objects held in cloud storage such as S3. Users can download Delta Lake and combine it with HDFS on premises. They can also read data from any storage system that supports Apache Spark data sources and write it to Parquet, the storage format that Delta Lake understands," Ghodsi said.
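The core idea of a transactional layer over immutable data files can be sketched in a few lines of plain Python (this is a toy model, not the real Delta Lake implementation or API; all names here are invented): an append-only log of numbered commit entries records which data files make up the table, so a reader replaying the log always sees a consistent snapshot, regardless of which storage system holds the files.

```python
import json
import os
import tempfile

# Toy model of a transactional log over immutable data files
# (illustrative only; not the real Delta Lake protocol or API).

class TinyTableLog:
    def __init__(self, table_dir):
        self.log_dir = os.path.join(table_dir, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _next_version(self):
        return len(os.listdir(self.log_dir))

    def commit(self, added_files, removed_files=()):
        # Each commit becomes a new, numbered log entry; data files
        # themselves are never modified, only added or logically removed.
        version = self._next_version()
        entry = {"add": list(added_files), "remove": list(removed_files)}
        with open(os.path.join(self.log_dir, f"{version:08d}.json"), "w") as f:
            json.dump(entry, f)
        return version

    def snapshot(self):
        # Replay the log in commit order to compute the set of live files.
        live = set()
        for name in sorted(os.listdir(self.log_dir)):
            with open(os.path.join(self.log_dir, name)) as f:
                entry = json.load(f)
            live -= set(entry["remove"])
            live |= set(entry["add"])
        return live

if __name__ == "__main__":
    log = TinyTableLog(tempfile.mkdtemp())
    log.commit(["part-000.parquet"])
    log.commit(["part-001.parquet"], removed_files=["part-000.parquet"])
    print(sorted(log.snapshot()))  # ['part-001.parquet']
```

Because the data files are immutable and state changes happen only through the log, the same mechanism works whether the files live in HDFS, S3, or Azure Blob storage, which is the portability Ghodsi describes.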
Databricks chose Apache Parquet because this column-oriented storage format was originally created for the Hadoop ecosystem and does not depend on the choice of data-processing environment. Delta Lake acts as a layer on top of the supported storage formats.
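Why a column-oriented format suits analytics can be shown with a small sketch (illustrative layouts only; this is not Parquet's actual on-disk encoding, which adds row groups, compression, and statistics): the same records are stored row-wise and column-wise, and an aggregate over one column touches only that column's contiguous values in the columnar layout.

```python
# Illustrative comparison of row-oriented vs column-oriented layout
# (not Parquet's actual encoding).

rows = [
    {"user": "a", "country": "DE", "amount": 10},
    {"user": "b", "country": "FR", "amount": 25},
    {"user": "c", "country": "DE", "amount": 5},
]

# Row-oriented: whole records stored together, as in a typical OLTP DBMS.
row_store = [tuple(r.values()) for r in rows]

# Column-oriented: one contiguous sequence per column, as in Parquet.
column_store = {key: [r[key] for r in rows] for key in rows[0]}

# Summing "amount" reads a single contiguous column here...
total = sum(column_store["amount"])

# ...but must touch every full record in the row layout.
total_rowwise = sum(record[2] for record in row_store)

print(total, total_rowwise)  # 40 40
```

Storing each column contiguously also lets a format compress and skip data per column, which is why columnar files are the usual substrate for analytical query engines.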
Notes
- ↑ [http://www.crn.ru/news/detail.php?ID=93943 The Big Data Market: ten best products of this year]
- ↑ [http://www.crn.ru/news/detail.php?ID=93943 The Big Data Market: ten best products of this year]
- ↑ Open-source extensions of the Databricks platform will help turn data swamps into data lakes