Data Lake (DataLake)
DataLake (literally, a "data lake") is a term for a repository that stores a huge amount of raw data in its original format until it is needed. The goal of creating data lakes is to give highly qualified analysts the opportunity to study raw, non-aggregated data and apply a variety of analytical techniques to it.
Although data lakes are still a relatively new phenomenon, they have gained recognition in IT departments because data is increasingly becoming the basis of modern business. Lakes are seen as a way to reduce data sprawl and isolation. They grew out of data warehouses, which were supposed to help IT departments build organized repositories of strategically important datasets for making key business decisions. This data can be used to solve a wide variety of tasks, from analytics and a better understanding of customer needs to the use of artificial intelligence for real-time decision-making[1].
Data lakes represent the further evolution of data warehouses. Many warehouse projects failed: they turned out to be too expensive, took too long and achieved only a few of the goals set. Data is changing and growing so rapidly that the need to derive immediate benefit from it has become even more urgent. No one can afford to spend months or years analyzing and modeling data for the business; by the time data in a warehouse becomes available for use, business needs have already changed.
Data marts, like warehouses, were created for data intended for specific purposes or with specific properties (for example, marketing department data). They gained popularity because the intended use of the data is clearer and results can be delivered faster. However, they silo data, which makes marts less useful for companies that hold huge amounts of data and need it to be used in many ways by many employees.
Data lakes were therefore developed to speed up data processing and make it easier to use data for needs that have not yet been identified. The advent of clouds providing cheap processing power and almost unlimited storage capacity has made it possible to create data lakes.
2021
Four ways to keep the enterprise data lake up-to-date and efficient
On the balance sheets of most companies, data remains an intangible asset whose benefits are not fully exploited and whose value is often impossible to determine. According to the IDC Rethink Data report, prepared jointly with Seagate, only 32% of the data available to enterprises is put to useful use. As Seagate reported on June 28, 2021, more than 1,500 respondents around the world were interviewed for the study. The results showed that the remaining 68% of data remains untouched and unused. The main reasons are inefficient data management, data growth and fragmentation, and the inability to secure the data at the necessary level.
To maximize the value of data, many companies have introduced cloud data lakes - centralized storage platforms for all types of data. Such a platform provides elastic storage capacity and the flexibility to adjust I/O speed, covers a variety of data sources and supports several types of compute and analytics engines.
However, data lake projects carry significant risks: if timely measures are not taken, the lake can turn into a "data swamp" - a repository in which potentially valuable information simply sits on media without being used. A huge, practically stagnant swamp forms, in which data "sinks to the bottom" and, being inaccessible to end users, becomes a useless resource.
To ensure that the lake does not become a swamp and remains a constant source of up-to-date analytical information, IT directors and data architects are advised to follow the four principles described below.
1. Clearly formulate the business problem to be solved
With a clear formulation of the problem, it is relatively easy to find the data that needs to be collected and to choose the machine learning methods best suited to extracting analytical information from that data. Investing in storage infrastructure improves the results of almost any business initiative, so there is a need to quantify - to measure - the benefits of such investments.
For example, in marketing, the analytics engine of a data lake helps run advertising campaigns with a targeted selection of channels and potential customers. A data lake can be used to collect, store and analyze information throughout the entire management cycle.
In industry, data lakes are used to increase output by optimizing production parameters with special artificial intelligence algorithms and deep learning models.
For such solutions to work as efficiently as possible, it is important that fresh data is constantly supplied to the data lake. Only then will the relevant software systems be able to extract the necessary information from the data.
2. Capture and store as much information as possible
Organizations need to be able to capture the data they need, identify it, store it in appropriate tiers, and provide it in a convenient form to decision makers. Activation of data, that is, its useful application, begins with the capture procedure.
The growing use of the Internet of Things and 5G networks has led to an avalanche of data, and enterprises cannot keep up with capturing the entire available volume. Nevertheless, companies are mastering methods of capturing and storing as much information as possible in order to fully exploit its potential both now and in the future. If data is not saved, its potential value is wasted.
When data lakes first appeared, only specialists of the corresponding profile could search them for the necessary information. Modern data lakes support the standard SQL query language, so even ordinary users, who care primarily about the result, can work with them. To help them explore data and find patterns, artificial intelligence and machine learning tools are used. Thanks to progress in this area, near real-time analytical systems, as well as advanced analytics and visualization tools, are actively developing.
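As an illustration of SQL access to raw lake files, the following minimal sketch uses DuckDB to run a standard SQL query directly over Parquet files; the file path and column names are hypothetical and not taken from the original text.

```python
import duckdb

# In-process SQL engine; no data warehouse is required to query raw files.
con = duckdb.connect()

# Query Parquet files sitting in the lake directly with standard SQL.
# The path and columns (event_date, customer_id) are illustrative only.
rows = con.execute("""
    SELECT event_date, COUNT(DISTINCT customer_id) AS daily_customers
    FROM read_parquet('datalake/events/*.parquet')
    GROUP BY event_date
    ORDER BY event_date
""").fetchall()

for event_date, daily_customers in rows:
    print(event_date, daily_customers)
```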
The data lake market landscape is evolving rapidly, and as of 2021 the ability to identify and extract valuable data from such solutions has become a priority.
By using cloud storage services with highly efficient management mechanisms, companies can migrate the information generated daily in their operations to a scalable data architecture. Modular storage solutions allow data to be aggregated, stored, moved and activated both at the edge and in the cloud.
3. Carry out periodic data inventories
Data lakes need audits and updates. Enterprise data stored in a cloud data lake must be checked periodically, otherwise the lake will become "murky" and harder to use: it will become much more difficult for data scientists to find the required patterns, or the opportunity will be lost altogether.
It is expected that the development of cloud storage services and the introduction of artificial intelligence and automation software will help improve the management of huge data lakes, since such systems handle the "sifting" of large amounts of information effectively. The best approach is to choose a dataset and an appropriate machine learning algorithm to process it and then, if the results are good, apply the same solution to other data arrays. For example, banks detect fraud using systems built on artificial intelligence tools. Such systems are first trained to recognize fraudulent transactions and then, using neural networks, operate on indicators such as transaction frequency, transaction volume and the type of retail organization.
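As a minimal sketch of the approach described above, the example below trains a classifier on a few hypothetical transaction features (frequency, amount, merchant type); the synthetic data, feature names and choice of scikit-learn's RandomForestClassifier are assumptions for illustration, not part of the original text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical features per transaction: [transactions per day, amount, merchant type code]
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 2 * X[:, 1] > 2.5).astype(int)  # synthetic "fraud" label for the sketch

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train on labeled historical transactions, then score held-out ones.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
```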
Out-of-date information can be moved to another repository for long-term retention, because older data may become valuable again.
4. Implement DataOps
IDC analysts define DataOps as a discipline that fosters communication between data creators and data consumers. DataOps is becoming an important factor in the effectiveness of a data management strategy. In addition to DataOps, such a strategy should include orchestrating data at endpoints and in the core, designing the data architecture and ensuring data security. The goal of data management is to provide a unified view of all data, both at rest and in motion, and to enable users to access it for maximum benefit.
Modern enterprises create huge amounts of information: according to the forecast in the Rethink Data report, its volume will grow by 42% annually from 2020 to 2022.
The survey found that enterprises often move data between endpoints, the edge and the cloud. Of the more than a thousand companies that took part, more than half move data between different storage locations daily, weekly or monthly, and the amount of information transferred at one time averages 140 TB. The faster a company moves such an array from the edge to the cloud, the faster it can analyze it for valuable information.
Due to the rapid pace of digitalization, further accelerated by the pandemic, many organizations began to collect even more data, all of which needs to be managed.
Building efficient data lakes and keeping them up to date lays the foundation for effective long-term corporate data management strategies and, accordingly, for successfully applying the digital infrastructure and implementing various business initiatives.
Data Lake vs Data Warehouse
The data lake is conceived as the main place to which an organization's data flows. It is the repository for all data, stored in raw or partially processed form. In some cases metadata tags are added to make it easier to find individual items. Access to the data in the lake is assumed to be performed by data processing specialists, as well as by specialists who set up points of contact for downstream data transmission. Speaking of a downstream data flow in the context of a data lake is appropriate because a data lake, like a real lake, accumulates data from all sources, and those sources can be numerous, diverse and raw[2].
From the lake, data flows downstream into the data warehouse, which implies something more processed, packaged and ready for use. Whereas the lake stores data in formats that are hard for the vast majority of employees to read, or not readable at all (unstructured, semi-structured), the data warehouse consists of structured databases that are available to applications and employees. Data provided in the form of data marts or hubs is even more convenient for use by a company's internal departments.
Thus, the data lake contains large amounts of data in its original form. Unlike queries against a data warehouse or a data mart, queries against a lake require a schema-on-read approach: all the data is accepted and stored, and its structure is decided only at read time for a specific task.
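A minimal sketch of schema-on-read, assuming PySpark: raw JSON files stay in the lake as-is, and a schema is imposed only when a particular task reads them. The path and field names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The schema is defined by the reader for this task,
# not enforced when data lands in the lake.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("created_at", TimestampType()),
])

# Raw JSON files keep their original form; structure is applied only at read time.
orders = spark.read.schema(schema).json("datalake/raw/orders/")
orders.groupBy("order_id").sum("amount").show()
```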
Data lake: data types and access methods
Data sources for a data lake include all the data of an organization or one of its departments: structured data from relational databases, semi-structured data (CSV, log files, etc.), XML and JSON data, unstructured data (e-mail, documents, PDF files, etc.), and binary data (images, audio and video). In terms of storage protocol, this means that the lake must be able to hold data originating in file, block and object stores.
Object storage is the usual protocol for the data lake itself. Keep in mind that it gives access not to the data itself but to metadata headers; these describe the data and can be attached to anything from a database to a photo. Detailed queries against the data can take place anywhere, but not in the data lake.
Object storage is very well suited to storing large amounts of unstructured data. You cannot work with it as with a database in block storage, but it lets you keep objects of several types in one large flat structure and know what is there.
Object storage usually does not guarantee high performance, but that is acceptable for a data lake: queries against it are harder to create and process than queries against a relational database in a data warehouse. This is not a problem, because most queries at the data-lake stage will concern the formation of lower-level data warehouses that are better suited to detailed queries.
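The following sketch, assuming AWS S3 via boto3 with a hypothetical bucket and prefix, illustrates the flat structure and metadata headers described above: objects are listed under a key prefix and their user metadata is read without touching the payload.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket name

# Object storage is flat: "directories" are just key prefixes.
listing = s3.list_objects_v2(Bucket=BUCKET, Prefix="raw/clickstream/")
for obj in listing.get("Contents", []):
    # head_object returns only headers and user metadata, not the object body.
    head = s3.head_object(Bucket=BUCKET, Key=obj["Key"])
    print(obj["Key"], obj["Size"], head.get("Metadata", {}))
```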
Data lake: on-premises vs. in the cloud
All the usual arguments about on-premises versus cloud solutions apply to data lake operations. When deploying a data lake on-premises, you have to account for floor space and power requirements, design, equipment purchase, software, management, staff skills and running costs.
The advantage of outsourcing the data lake to the cloud is that infrastructure capex is converted into opex in the form of payments to the cloud service provider. However, as the amount of data sent to and from the cloud grows, costs may rise and additional charges may apply.
This calls for a careful analysis of the advantages and disadvantages of each storage model, one that also takes into account compliance and connectivity issues going beyond just the storage architecture and the data lake. Of course, you can also work in hybrid mode, expanding into the cloud when necessary.
On-premises products
A data lake typically requires large storage capacity; an enterprise-scale data lake must certainly be large. In the middle of the last decade, storage manufacturers released their first trial products for working with data lakes. EMC, for example, launched the Federation Business Data Lake line in 2015, which combined EMC storage with VMware and Pivotal products for big data. After testing the ground, in 2017 Dell EMC aimed its Elastic Data Platform at deploying data lakes. It also extended the scope of its Isilon scale-out network-attached storage (NAS) to data lakes.
Since its rebranding, Hitachi Vantara has placed more emphasis on analytics, big data and the Internet of Things. It offers data lake management capabilities based on the Hitachi Content Platform combined with the Lumada IoT platform and Pentaho data integration environments. The Pentaho Data Integration and Analytics platform targets big data and provides remote access to reports and analytics; after accessing the data, users can process it and use it anywhere. Pentaho supports Hadoop, Spark, NoSQL and analytical databases. Lumada uses Pentaho software for orchestration, visualization and data analytics.
IBM is also a provider of storage arrays and storage for data lakes, acts as a consultant, and collaborates with Cloudera, a provider of a data management platform designed to orchestrate and analyze large amounts of data.
NetApp does not go as deep into the data lake storage segment, but it does position its ONTAP-based arrays as storage for big data, Hadoop and Splunk, for example.
HPE likewise has no dedicated products for deploying data lakes, except that they can be deployed through its pay-per-use GreenLake portfolio.
It is worth noting that data lakes can be built on any vendor's equipment, and commodity white-box hardware can also be a suitable choice.
Cloud Capabilities
Some large storage vendors have tried to offer applications for data lakes, but this has proved too difficult, with too many offshoots, and is better suited to consulting or bespoke implementations. Cloud service providers, meanwhile, have taken a different path, and the top three all offer services in the data lake space.
AWS offers a console through which customers can search and view available datasets. They can then tag, search, share, transform, analyze and manage specific subsets of data within the company or with external users. The solution is based on the AWS S3 object store and uses various AWS services for its operation, including AWS Lambda microservices, Amazon Elasticsearch search, Cognito user authentication, AWS Glue for data transformation and Amazon Athena for analytics.
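As a rough illustration of the Athena piece of this stack, the sketch below submits a SQL query against data in S3 through boto3; the database, table and S3 output location are hypothetical.

```python
import time
import boto3

athena = boto3.client("athena")

# Submit a SQL query over data stored in S3; results land in a (hypothetical) output bucket.
run = athena.start_query_execution(
    QueryString="SELECT channel, COUNT(*) AS hits FROM clickstream GROUP BY channel",
    QueryExecutionContext={"Database": "example_lake_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

# Poll until the query finishes, then fetch the result rows.
query_id = run["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```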
The Azure offering is similar and makes it possible to run massively parallel programs for transforming and processing petabytes of data in U-SQL, R, Python and .NET. Microsoft also has Azure HDInsight, a managed analytics service based on open-source frameworks such as Hadoop, Apache Spark, Apache Hive, LLAP, Apache Kafka and Apache Storm.
Google Cloud Platform looks a little less like a universal store for deploying a data lake. GCP obviously allows data lakes to be created - Google has previously said that Twitter uses it for this - but its solution probably requires consulting services more than the standard offerings of its competitors.
2019: Recommendations for Deploying Data Lakes
Since data lakes are still a fairly new phenomenon, the market has not fully adapted to them, so the pioneers will benefit the most - most likely those who use lakes in combination with artificial intelligence in their everyday operations. Many IT departments are looking for the solution best suited to their company. Best-practice recommendations for deploying data lakes are presented below.
1. Follow a strategy when placing data in the lake
The main reason for placing data in a lake is to use it for specific purposes. While a lake should in theory serve many purposes that have yet to be defined, it is better to start when something is already known about how the data will be used. Consider what benefit the data lake can bring beyond simply storing the data. As with any other IT initiative, it is important to first align the deployment with a specific strategy that defines not only the IT goals but also the long-term goals of the company as a whole.
Ask whether the lake will help manage the company's data. Storing data for future use becomes very expensive when it stretches to several years; if the company does not expect to use the data for a specific purpose in the near future, storing it is a waste of money.
2. Store data at maximum granularity and tag it
Storing data at maximum granularity makes it possible to assemble, aggregate and manipulate it for a variety of purposes, so do not aggregate or summarize data before placing it in the lake. Since the value of the data lake does not appear until the company uses the data, it is better to place data in the lake after tagging and cataloguing it. When required, IT will then be able to sift through the repository and allocate assets. Tagging, which is necessary for reporting, also facilitates analytics, and machine learning and AI can help sift through data and create tags.
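As a small sketch of tagging at ingest time, assuming S3 object tags via boto3 (the bucket, key and tag names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Tag an object as it lands in the lake so it can later be found, catalogued and governed.
s3.put_object_tagging(
    Bucket="example-data-lake",
    Key="raw/crm/contacts_2021-06-28.json",
    Tagging={"TagSet": [
        {"Key": "source", "Value": "crm"},
        {"Key": "contains_pii", "Value": "true"},
        {"Key": "retention_class", "Value": "3y"},
    ]},
)
```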
In addition, companies can use analytics, machine learning and AI to increase their overall competitiveness: one tool enables the use of the others.
3. Have a data destruction plan
Companies too often accumulate large amounts of data without a plan for getting rid of unnecessary assets. The absence of such a plan can prevent them from meeting regulatory requirements to destroy information after a certain period has elapsed; the GDPR, for example, contains such a requirement for data on EU citizens.
Combining a destruction plan with a data lake helps determine what should be destroyed and when. It is also a solution when companies are required to track the location of customer data. A single repository reduces costs and saves time.
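One way to automate such a plan, sketched below under the assumption that the lake sits in S3, is a lifecycle rule that expires objects under a prefix after a fixed retention period; the bucket, prefix and retention value are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# A lifecycle rule that deletes customer data under a prefix after three years,
# turning the retention/destruction plan into an automatic policy.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={"Rules": [{
        "ID": "expire-customer-data-after-3-years",
        "Filter": {"Prefix": "raw/crm/"},
        "Status": "Enabled",
        "Expiration": {"Days": 1095},
    }]},
)
```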
Preparing for the future
Companies are accumulating more and more data, so there will be a growing need to store it and use it for strategic purposes. Data lakes are a great way to uncover the business value of data. When choosing a solution, first determine how you think the organization will use the data, and then how to store it. For example, as storage prices have fallen, creating data lakes in the cloud has become very attractive. If using the cloud meets your company's goals, find a provider that meets your particular infrastructure needs. How will the cloud service provider, or your own DevOps department, build a process around the data lake so that data can be loaded and retrieved as needed?
Since getting the most out of a data lake will certainly take a lot of computation, consider which analytical processing steps can be automated. Experienced professionals will also be needed to build the infrastructure that stores the data lake, to load data into it and to transform the data for use. Establishing a regular, open exchange of information between IT and business managers can be the first step towards any IT transformation, including the creation of data lakes.
See also
- Data Mining
- Big Data
- Big Data Global Market
- Big Data in Russia
- Big Data: First Totals
- Big Data in E-Commerce
- Big Data at Sberbank
- Machine intelligence
- Cognitive computing
- Data Science
- DataLake (Data Lake)
- BigData
- Neural networks
Notes
- ↑ At the current pace of qualitative change and data growth, the need to benefit from data becomes even more urgent. John Gray, Chief Technologist of Infiniti Consulting Group, part of InterVision, one of the leading providers of strategic services, shared advice on creating data lakes on the InformationWeek portal.
- ↑ Data Lake Storage: cloud vs. on-premises