
DataLake (Data Lake)

DataLake (literally, a "lake of data") is a term for any large collection of data. In practice, it is a repository that stores a huge amount of raw data in its original format until it is needed. The goal of creating data lakes is to give highly qualified analysts the opportunity to study raw, non-aggregated data and apply various analytical techniques to it.

Although data lakes are still a relatively new phenomenon, they have recently gained recognition among IT departments because data is increasingly becoming the foundation of modern business. Lakes are seen as a remedy for data sprawl and data silos. They evolved out of data warehouses, which were meant to help IT departments build organized repositories of strategically important datasets for making key business decisions. This data can be used for a wide variety of tasks, from analytics and a better understanding of customer needs to the use of artificial intelligence for real-time decision-making[1].

Data lakes represent a further evolution of data warehouses. Many warehouse projects failed: they proved too expensive, took too long, and achieved only a few of their stated goals. Data changes and grows so rapidly that the need for immediate benefit has become even more pressing. No one can afford to spend months or years analyzing and modeling data for the business; by the time data in a warehouse becomes available for use, business needs have already changed.

Data marts, like warehouses, were created for data intended for specific purposes or with specific properties (for example, marketing department data). They gained popularity because the intended use of the data is clearer and results can be delivered faster. However, they silo data, which makes marts less useful for companies that have huge amounts of data and need many employees to use it for many different purposes.

In this regard, data lakes were developed to speed up data processing and make it easier to use data for needs that have not yet been identified. The advent of clouds providing cheap processing power and almost unlimited storage capacity has made it possible to create data lakes.

2021

Four ways to keep the enterprise data lake up-to-date and efficient

On most companies' balance sheets, data remains an intangible asset whose benefits are not fully exploited and whose value is often impossible to determine. According to the IDC Rethink Data report, prepared jointly with Seagate, only 32% of the data available to enterprises is put to useful use. As Seagate reported on June 28, 2021, more than 1,500 respondents around the world were surveyed as part of the study. The results showed that the remaining 68% of data remains untouched and unused. The main reasons are inefficient data management, the growth and fragmentation of data, and the inability to secure it at the necessary level.

To maximize the value of data, many companies have introduced cloud-based data lakes: centralized storage platforms for all types of data. Such a platform provides elastic storage capacity and the flexibility to adjust I/O throughput. It covers a variety of data sources and supports several types of compute and analytics engines.

However, data lake projects carry significant risks: if timely measures are not taken, the lake can turn into a "data swamp", a repository in which potentially valuable information simply sits on media without being used. A huge, practically stagnant swamp forms, in which data "sinks to the bottom" and, being inaccessible to end users, turns into a useless resource.

To ensure that the lake does not become a swamp and remains a constant source of up-to-date analytical information, IT directors and data architects are advised to follow the four principles described below.

1. Clearly formulate the business problem to be solved

With a clearly formulated problem, it is relatively easy to find the data that needs to be collected and to choose the best machine learning methods for extracting analytical information from it. Investing in storage infrastructure improves the results of almost any business initiative, and accordingly there is a need to quantify, that is, to measure, the benefits of such investments.

For example, in marketing, the analytics engine of a data lake helps run advertising campaigns with targeted selection of channels and of the pool of potential customers. A data lake can be used to collect, store and analyze information throughout the entire management cycle.

In industry, data lakes are used to increase output by optimizing production parameters with specialized artificial-intelligence algorithms and deep learning models.

For such solutions to work as efficiently as possible, it is important that fresh data constantly flows into the data lake. Only then can the relevant software systems extract the necessary information from it.

2. Capture and store as much information as possible

Organizations need to be able to capture the data they need, identify it, store it at the appropriate tiers, and present it in a convenient way to decision makers. Data activation, that is, its useful application, begins with the capture step.

The growing use of the Internet of Things and 5G networks has led to avalanche-like data growth, and enterprises cannot keep up with capturing the full volume of data available to them. Nevertheless, companies are mastering methods to capture and store as much information as possible in order to fully exploit its potential both now and in the future. If data is not saved, its potential value is wasted.

When data lakes first appeared, searching for the necessary information in them was the job of specialists. Modern data lakes support the standard SQL query language, so even ordinary users, for whom the result is what matters most, can work with them. To help them explore data and find patterns, artificial intelligence and machine learning tools are used. Thanks to progress in this area, near-real-time analytical systems, as well as advanced analytics and visualization tools, are actively developing.
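
As a minimal illustration of SQL access to raw lake files (not taken from the source), the sketch below assumes the lake holds order events as Parquet files under a hypothetical path and uses DuckDB as one example of an engine that queries files in place without loading them into a warehouse first.

```python
# Sketch: standard SQL directly over raw files in a data lake.
# DuckDB is one possible engine; the path and column names are hypothetical.
import duckdb

result = duckdb.sql("""
    SELECT customer_id,
           COUNT(*)         AS orders,
           SUM(order_total) AS revenue
    FROM read_parquet('datalake/raw/orders/*.parquet')   -- hypothetical location
    GROUP BY customer_id
    ORDER BY revenue DESC
    LIMIT 10
""").df()   # return the result as a pandas DataFrame

print(result)
```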

The landscape of the data lakes market is rapidly evolving, and, as of 2021, the ability to identify and extract valuable data from relevant solutions has become a priority.

By using cloud storage services with highly efficient management mechanisms, companies can migrate the information generated daily by their operations to a scalable data architecture. Modular storage solutions allow data to be aggregated, stored, moved, and activated at the edge and in the cloud.

3. Carry out a periodic data inventory

Data lakes need audits and updates. Enterprise data stored in a cloud data lake must be checked periodically; otherwise the lake becomes "cloudy" and harder to use. It will become much more difficult for data scientists to find the required patterns, or the opportunity will be lost entirely.

It is expected that the development of cloud storage services and the introduction of artificial intelligence and automation software will help improve the management of huge data lakes, since such systems handle the "sifting" of large amounts of information effectively. The best option is to choose a dataset and a suitable machine learning algorithm to process it and then, if the results are good, apply the same solution to other data arrays. For example, fraud in banks is detected by systems based on artificial intelligence tools: such systems are first trained to recognize fraudulent transactions and then, using neural networks, operate on indicators such as transaction frequency, transaction volume and the type of retailer.
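
The sketch below is purely illustrative of the fraud-detection idea mentioned above: a classifier trained on the kinds of features the text names (transaction frequency, amount, merchant type). The data here is synthetic; a real system would read labelled transaction history from the lake, and the feature names and model choice are assumptions.

```python
# Illustrative sketch: a simple fraud classifier over frequency, amount and
# merchant-type features. Synthetic data; not the bank systems described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.poisson(3, n),            # transactions per day (frequency)
    rng.exponential(80.0, n),     # transaction amount (volume)
    rng.integers(0, 10, n),       # merchant category code (type of retailer)
])
y = (rng.random(n) < 0.02).astype(int)   # ~2% fraudulent labels (synthetic)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Score held-out transactions, as new data arriving in the lake would be scored.
print("held-out accuracy:", model.score(X_test, y_test))
```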

Out-of-date information can be moved to another repository for long-term retention, because older data may become valuable again.
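
If the lake sits on an object store such as Amazon S3 (an assumption, as are the bucket and prefix names), moving stale data to cheaper long-term storage can be automated with a lifecycle rule, roughly as sketched here.

```python
# Sketch: automatically tier old raw data to archival storage after a year.
# Assumes an S3-based lake; bucket, prefix and the 365-day threshold are examples.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-datalake",
    LifecycleConfiguration={"Rules": [{
        "ID": "archive-old-raw-data",
        "Filter": {"Prefix": "raw/"},
        "Status": "Enabled",
        "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
    }]},
)
```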

4. Implement DataOps

IDC analysts define DataOps (data operations) as a discipline that fosters communication between data creators and data consumers. DataOps is becoming an important factor in the effectiveness of a data management strategy. In addition to DataOps, such a strategy should include orchestrating data at endpoints and in the core, designing the data architecture, and ensuring data security. The goal of data management is to provide a unified view of all data, both at rest and in motion, and to enable users to access it for maximum benefit.

Modern enterprises create huge amounts of information, and according to the forecast in the Rethink Data report, its volume will grow by 42% annually from 2020 to 2022.

The survey found that enterprises often move data between endpoints, the edge, and the cloud. Of the more than a thousand companies that participated, more than half move data between different storage locations daily, weekly, or monthly, and the amount of information transferred at one time averages 140 TB. The faster a company moves such an array from the edge to the cloud, the faster it can analyze it for valuable information.

Due to the rapid pace of digitalization, further accelerated by the pandemic, many organizations began to collect even more data, and all of it needs to be managed.

Building efficient data lakes and keeping them up to date lays the foundation for effective long-term corporate data management strategies and, accordingly, for successful use of digital infrastructure and the implementation of various business initiatives.

Data Lake vs Data Warehouse

The data lake is conceived as the main place to which an organization's data flows. It is the repository for all data, where it is stored in raw or partially processed form. In some cases, metadata tags are added to the data to make it easier to find individual items. Access to data in the lake is assumed to be performed by data specialists, as well as by specialists who establish handoff points for downstream data transmission. Speaking of a downstream data flow in the context of a data lake is apt, because a data lake, like a real lake, accumulates data from all sources, which can be numerous, diverse and raw[2].

From the lake, downstream data flows into the data warehouse, which implies something more processed, packaged and ready for use. And while the lake stores data in formats that are difficult or impossible for the vast majority of employees to read (unstructured, semi-structured), the data warehouse consists of structured databases that are accessible to applications and employees. Data delivered in the form of marts or hubs is even more convenient for use by the company's internal departments.

Thus, the data lake contains large amounts of data in its original form. Unlike queries to a data warehouse or mart, queries to a lake require a schema-on-read approach: all data is accepted and stored as-is, and its structure is determined only at read time, for a specific task.
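
A small sketch of what schema-on-read can look like in practice (not from the source): raw JSON files landed in the lake with no schema enforced on write, and a structure declared only at read time for one specific task. PySpark is used as an example engine; the path, field names and types are assumptions.

```python
# Sketch: schema-on-read with PySpark. The schema is applied when reading,
# not when the raw files were written into the lake. Names/paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

order_schema = StructType([
    StructField("order_id",    StringType()),
    StructField("customer_id", StringType()),
    StructField("amount",      DoubleType()),
    StructField("created_at",  TimestampType()),
])

orders = (spark.read
          .schema(order_schema)                 # structure declared at read time
          .json("s3a://example-datalake/raw/orders/"))  # hypothetical location

orders.groupBy("customer_id").sum("amount").show(10)
```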

Data lake: data types and access methods

Data sources for a data lake include all the data of an organization or one of its departments: structured relational database data, semi-structured data (CSV and log files, XML, JSON, etc.), unstructured data (e-mails, documents, PDF files, etc.), and binary data (images, audio and video). In terms of storage protocol, this means the lake needs to hold data that originated in file, block, and object stores.

Object storage is the usual protocol for the data lake itself. Keep in mind that it provides access not to the data itself but to metadata headers that describe the data; such a header can be attached to anything from a database to a photograph. Detailed data queries can happen anywhere, but not in the data lake itself.

Object storage is very well suited to storing large amounts of data in unstructured form. You cannot work with it as you would with a database in block storage, but it lets you store many types of objects in one large flat structure and know what is there.
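
To make the "flat structure plus metadata" idea concrete, here is a hedged sketch that browses an object store through keys and metadata headers rather than through the object contents. It assumes an S3-compatible store and a hypothetical bucket and prefix.

```python
# Sketch: browsing a flat object store by keys and metadata headers.
# Assumes S3 via boto3; bucket and prefix names are hypothetical.
import boto3

s3 = boto3.client("s3")
bucket = "example-datalake"

for obj in s3.list_objects_v2(Bucket=bucket, Prefix="raw/").get("Contents", []):
    head = s3.head_object(Bucket=bucket, Key=obj["Key"])
    # We learn what the object is (size, type, user metadata) without reading it.
    print(obj["Key"], obj["Size"], head.get("ContentType"), head.get("Metadata"))
```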

Object storage usually does not guarantee high performance, but that is acceptable for a data lake: queries against it are harder to create and process than queries against a relational database in a data warehouse. This is not a problem, because most queries at the data lake stage will concern building downstream data warehouses that are better suited to detailed queries.

Data lake: on-premises vs. in the cloud

All the usual arguments about on-premises versus cloud solutions apply to data lake operations. When deploying a data lake on-premises, you must account for floor space and power requirements, design, equipment purchases, software, management, staff skills, and running costs.

The advantage of outsourcing the data lake to the cloud is that infrastructure capex is converted into opex in the form of payments to the cloud service provider. However, as the amount of data sent to and from the cloud grows, costs may rise and additional charges may apply.

This calls for a careful analysis of the advantages and disadvantages of each storage model. Compliance and connectivity issues that go beyond the storage architecture and the data lake itself must also be taken into account. Of course, you can also operate in hybrid mode, expanding into the cloud when necessary.

On-premises products

A data lake typically requires large storage capacity; an enterprise-scale lake will certainly be large. In the middle of the last decade, storage manufacturers released their first trial products for working with data lakes. EMC, for example, launched the Federation Business Data Lake line in 2015, which combined EMC storage with VMware and Pivotal products for big data. After testing the waters, Dell EMC in 2017 aimed its Elastic Data Platform at deploying data lakes, and it also extended the scope of its Isilon scale-out network-attached storage (NAS) to data lakes.

Since its rebranding, Hitachi Vantara has placed more emphasis on analytics, big data and the Internet of Things. It offers data lake management capabilities based on the Hitachi Content Platform combined with the Lumada IoT platform and the Pentaho data integration environment. The Pentaho Data Integration and Analytics platform targets big data and provides remote access to reports and analytics; after accessing the data, the user can process it and use it anywhere. Pentaho supports Hadoop, Spark, NoSQL, and analytical databases. Lumada uses Pentaho software for orchestration, visualization and data analytics.

IBM also supplies storage arrays and storage for data lakes, acts as a consultant, and collaborates with Cloudera, a provider of a data management platform designed to orchestrate and analyze large amounts of data.

NetApp does not go as deep into the data lake storage segment, but still positions its ONTAP-based arrays as storage for big data, Hadoop and Splunk, for example.

HPE likewise has no active product line for deploying data lakes, except that they can be deployed through its pay-per-use GreenLake portfolio.

It is worth noting that data lakes can be built on any vendor's equipment, and commodity white-box hardware can also be a suitable choice.

Cloud Capabilities

Some large storage vendors have tried to offer data lake applications, but this has proved too complex, with many offshoots, and is better suited to consulting or bespoke implementations. Cloud service providers, meanwhile, have taken a different path, and the top three all offer data lake services.

AWS offers a console through which customers can search and browse available datasets. They can then tag, search, share, transform, analyze, and manage specific subsets of data within a company or with external users. The solution is built on the AWS S3 object store and uses various AWS services around it, including AWS Lambda microservices, Amazon Elasticsearch search, Cognito user authentication, AWS Glue for data transformation, and Amazon Athena for analytics.
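
As a rough sketch of how the analytics piece fits together (not from the source): once Glue has catalogued objects in the S3-based lake, Athena can run standard SQL over them. The database name, table, query and output location below are hypothetical.

```python
# Sketch: submitting an SQL query to Amazon Athena over an S3-based lake.
# Glue database, table, and the S3 results location are hypothetical examples.
import boto3

athena = boto3.client("athena")

run = athena.start_query_execution(
    QueryString="SELECT channel, COUNT(*) AS events FROM clickstream GROUP BY channel",
    QueryExecutionContext={"Database": "datalake_catalog"},
    ResultConfiguration={"OutputLocation": "s3://example-datalake/athena-results/"},
)
print("query execution id:", run["QueryExecutionId"])
```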

The Azure offering is similar: it allows massively parallel programs for transforming and processing petabytes of data to be run in U-SQL, R, Python and .NET. Microsoft also has Azure HDInsight, a managed analytics service based on open-source frameworks such as Hadoop, Apache Spark, Apache Hive, LLAP, Apache Kafka, Apache Storm and R.

Google Cloud Platform looks a little less like a one-stop shop for deploying a data lake. GCP obviously allows you to create data lakes - Google has previously said that Twitter uses it for this - but its solution probably requires more consulting services than the standard offerings of its competitors.

2019: Recommendations for Deploying Data Lakes

Since data lakes are still a fairly new phenomenon, the market has not fully adapted to them. The pioneers will therefore benefit most, and they are likely to use data lakes together with artificial intelligence to run everyday operations. Many IT departments are looking for the solution that best fits their company. Best-practice recommendations for deploying data lakes are presented below.

1. Follow a strategy when placing data in the lake

The main reason to place data in a lake is to use it for specific purposes. Although lakes should in theory serve many purposes that are yet to be defined, it is better to start with at least some idea of how the data will be used. Consider what benefit storing the data in a lake can bring. As with any other IT initiative, it is important first to align the deployment with a specific strategy that defines not only the IT goals but also the long-term goals of the company as a whole.

Ask whether the lake will help manage the company's data. Storing data for future use becomes too expensive once it stretches over several years: if the company does not expect to use the data for a specific purpose in the near future, storing it is a waste of money.

2. Store data at maximum detail and tag it

Storing data at maximum detail lets you assemble, aggregate, and manipulate it for a variety of purposes, so do not aggregate or summarize data before placing it in the lake. Since the value of a data lake does not appear until the company uses the data, it is better to place data in the lake only after tagging and cataloguing it. When required, IT will then be able to sift through the repository and pick out assets. Tagging, which is necessary for reporting, also facilitates analytics, and machine learning and AI can help sift through data and create tags.
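
A minimal sketch of tagging on ingest, assuming an S3-based lake (bucket, key and tag names are hypothetical): descriptive tags are attached to a raw object as it lands, so the catalogue and downstream analytics can find it later without reading the data itself.

```python
# Sketch: attaching descriptive tags to a raw object in an S3-based lake.
# Bucket, object key and tag values are hypothetical examples.
import boto3

s3 = boto3.client("s3")
s3.put_object_tagging(
    Bucket="example-datalake",
    Key="raw/marketing/2021/06/clicks-0001.json",
    Tagging={"TagSet": [
        {"Key": "source",      "Value": "web-clickstream"},
        {"Key": "department",  "Value": "marketing"},
        {"Key": "granularity", "Value": "event-level"},
        {"Key": "pii",         "Value": "false"},
    ]},
)
```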

In addition, companies can use analytics, machine learning and AI to increase their overall competitiveness; one tool enables the use of the others.

3. Have a data destruction plan

Companies too often accumulate large amounts of data without any plan for disposing of unneeded assets. The absence of such a plan can keep a company from meeting regulators' requirements to destroy information after a certain period; the GDPR, for example, contains such a requirement for data on EU citizens.

Combining a destruction plan with a data lake helps determine what should be destroyed and when. It also helps when companies are required to track the location of customer data. A single repository reduces costs and saves time.
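
One way such a destruction plan can be enforced, assuming an S3-based lake (bucket, prefix and retention period below are example values), is an expiration lifecycle rule that deletes objects after a fixed retention period.

```python
# Sketch: enforcing a retention/destruction rule on an S3-based lake.
# Bucket, prefix and the 3-year retention period are hypothetical examples.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-datalake",
    LifecycleConfiguration={"Rules": [{
        "ID": "destroy-customer-data-after-3-years",
        "Filter": {"Prefix": "raw/customers/"},
        "Status": "Enabled",
        "Expiration": {"Days": 1095},   # objects are deleted after this period
    }]},
)
```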

Preparing for the future

Companies are accumulating more and more data, so there will be a growing need to store it and use it for strategic purposes. Data lakes are a great way to uncover the business value of data. When choosing a solution, first determine how you think the organization will use the data, and only then how to store it. For example, as storage prices have fallen, creating data lakes in the cloud has become very attractive. If using the cloud fits your company's goals, you should find a provider that meets your specific infrastructure needs. How will the cloud service provider, or your own DevOps department, build a process around the data lake so that data can be loaded and retrieved as needed?

Since getting the most out of a data lake will certainly require a lot of computation, consider which analytical processing steps can be automated. Experienced professionals will also be needed to build the infrastructure that hosts the data lake, load data into it, and transform data for use. Establishing a regular, open exchange of information between IT and business managers can be the first step toward any IT transformation, including the creation of data lakes.

See also

Notes