DLP - Data Loss/Leak Prevention Technologies
DLP (Data Loss/Leak Prevention) technologies are technologies for preventing leaks of confidential information out of an information system, as well as the technical devices (software or hardware) that implement such prevention. The DLP Leak Prevention Solutions catalog is available on TAdviser.
Network channels (for example, e-mail or ICQ), local channels (external USB drives) and stored data (databases) can all become leakage channels that carry information out of the company's information system. The loss of media (flash memory, laptop) can be singled out as a separate case. A system can be classified as DLP if it meets the following criteria: multichannel coverage (monitoring of several possible data leakage channels); unified management (common management tools across all monitoring channels); active protection (enforcement of the security policy); and accounting for both content and context.
The competitive advantage of most systems is the analysis module. Manufacturers emphasize this module so strongly that they often name their products after it, for example, "DLP solution based on labels." As a result, users often choose solutions not for performance, scalability or other criteria traditional for the corporate information security market, but precisely for the type of document analysis used.
Since each method has its own advantages and disadvantages, using only one method of document analysis makes the solution technologically dependent on it. Most manufacturers use several methods, although one of them is usually the "flagship." This article is an attempt to classify the methods used in document analysis. Their strengths and weaknesses are evaluated from experience in the practical application of several types of products. The article deliberately does not consider specific products, because the user's main task when choosing them is to filter out marketing slogans like "we will protect everything from everything" and "unique patented technology" and to understand what he will be left with when the sellers leave.
Container analysis
This method analyzes the properties of the file or other container (archive, cryptodisk, etc.) in which the information is located. The colloquial name of such methods is "solutions on labels," which reflects their essence quite fully. Each container carries a label that uniquely identifies the type of content inside it. These methods require virtually no computational resources to analyze the information being moved, since the label fully describes the user's rights to move the content along any route. In simplified form, the algorithm reads: "there is a label - we prohibit, there is no label - we skip."
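The rule quoted above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the `Container` class, the `CONFIDENTIAL` label value and the `allow_transfer` function are hypothetical names introduced here.

```python
from dataclasses import dataclass, field

# Hypothetical label value; real products define their own label schemes.
CONFIDENTIAL = "confidential"


@dataclass
class Container:
    """A file, archive or other container together with its labels."""
    name: str
    labels: set[str] = field(default_factory=set)


def allow_transfer(container: Container) -> bool:
    """There is a label: prohibit. There is no label: let it through."""
    return CONFIDENTIAL not in container.labels


report = Container("balance.xlsx", {CONFIDENTIAL})
memo = Container("lunch_menu.txt")
print(allow_transfer(report), allow_transfer(memo))
```

Note that the check never opens the content: an unlabeled container passes regardless of what is inside.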
The advantages of this approach are obvious: the speed of analysis and the complete absence of false positives (errors of the second kind, when the system mistakenly treats an open document as confidential). Some sources call such methods "deterministic."
The disadvantages are also obvious: the system cares only about labeled information, so if no label is attached, the content is not protected. It is necessary to develop a procedure for labeling new and incoming documents, as well as a system that prevents information from being transferred from a labeled container to an unlabeled one through clipboard operations, file operations, copying from temporary files, etc.
The weakness of such systems also shows in the organization of labeling. If labels are assigned by the author of the document, a malicious author can simply leave unlabeled the information he intends to steal. Even without malice, negligence or carelessness will show sooner or later. If labeling is assigned to a designated employee, for example an information security officer or system administrator, he will not always be able to distinguish confidential content from open content, since he does not know all of the company's processes thoroughly. For example, the "white" balance sheet should be posted on the company's website, while the "gray" or "black" one must not leave the information system. But only the chief accountant, i.e. one of the authors, can tell one from the other.
Labels are usually subdivided into attribute, format, and external labels. As the names suggest, the first are placed in file attributes, the second in fields of the file itself, and the third are attached to the file (associated with it) by external programs.
Figure: Container structures in information security
Low performance requirements for interceptors are sometimes also counted among the advantages of label-based solutions, because the interceptors only check labels, i.e. act like turnstiles in the metro: "there is a ticket - pass." However, miracles do not happen: in this case the computing load is simply shifted to the workstations.
Label-based solutions, whatever their type, belong in the protection of document repositories. When a company has a document repository that is replenished quite rarely and in which the category and confidentiality level of each document are precisely known, labels are the easiest way to protect it. Labeling of documents entering the repository can be arranged through an organizational procedure. For example, before sending a document to the repository, the person responsible for its operation can ask the author and a relevant specialist what confidentiality level the document should be given. This task is solved especially well with format labels: each incoming document is saved in a protected format and then issued at an employee's request, with that employee named as authorized to read it. Modern solutions allow access rights to be granted for a limited time, and once the key expires the document simply ceases to be readable. It is according to this scheme that, for example, documentation for public procurement tenders is issued in the United States: the procurement management system generates a document that only the bidders listed in it can read, without being able to change or copy its contents. The access key is valid only until the deadline for submitting tender documents, after which the document can no longer be read.
Companies also use label-based solutions to organize document management in closed network segments in which intellectual property and state secrets circulate. Probably, under the requirements of the Federal Law "On Personal Data," document management in the human resources departments of large companies will now be organized in the same way.
Content analysis
Unlike the technologies described earlier, those described in this section are completely indifferent to the container in which the content is stored. Their task is to extract meaningful content from the container, or intercept it in transmission over a communication channel, and analyze the information for the presence of prohibited content.
The main technologies in defining prohibited content in containers are signature control, hash-based control, and linguistic methods.
Signatures
The simplest control method is to search the data stream for a certain sequence of characters. A forbidden sequence of characters is sometimes called a "stop word," but more generally it need not be a word: it can be an arbitrary set of characters, for example the same label. Strictly speaking, not every implementation of this method counts as content analysis. For example, most UTM devices search for prohibited signatures in the data stream "as is," without extracting text from the container. And if the system is configured for only one word, its result is a 100% match, i.e. the method can be classed as deterministic.
More often, however, the search for a certain sequence of characters is applied to extracted text. In the vast majority of cases, signature systems are configured to search for several words and to take the frequency of terms into account, so we will still class this approach as content analysis.
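A minimal sketch of signature control over already extracted text, assuming a hypothetical list of stop words and a hypothetical hit threshold. The function names are illustrative, and real systems differ in how they count and weigh terms.

```python
def stop_word_hits(text: str, stop_words: list[str]) -> dict[str, int]:
    """Count case-insensitive occurrences of each stop word in the text."""
    lowered = text.lower()
    return {word: lowered.count(word.lower()) for word in stop_words}


def is_suspicious(text: str, stop_words: list[str], threshold: int = 1) -> bool:
    """Flag the text when the total number of hits reaches the threshold."""
    return sum(stop_word_hits(text, stop_words).values()) >= threshold


message = "Top secret: the secret plan is attached."
print(stop_word_hits(message, ["secret", "budget"]))
print(is_suspicious(message, ["secret", "budget"], threshold=2))
```

The search is a plain substring match, so it knows nothing about the language of the text, which is both the strength and the weakness of the method.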
The advantages of this method include independence from the language and the ease of extending the dictionary of prohibited terms: to search the data stream for a word in Pashto, you do not have to speak the language, you just need to know how the word is written. It is equally easy to add, for example, transliterated Russian text or "Olbanian" internet slang, which matters when analyzing SMS texts, ICQ messages or blog posts.
The disadvantages become apparent with languages other than English. Unfortunately, most manufacturers of text analysis systems work for the American market, and English is very "signature-friendly": word forms are most often built with prepositions, without changing the word itself. In Russian, everything is much more complicated. Take, for example, the word "secret," dear to the heart of every information security officer. In English, it is the noun "secret," the adjective "secret," and the verb "classify." In Russian, several dozen different words can be formed from the root "secret." So where an information security employee in an English-speaking organization only has to enter one word, in a Russian-speaking organization he has to enter a few dozen words and then reproduce them in six different encodings.
In addition, such methods are not resistant to primitive encoding. Almost all of them give in to a favorite trick of novice spammers: replacing characters with similar-looking ones. The author has repeatedly demonstrated to security officers an elementary technique for passing confidential text through signature filters. Take a text containing, for example, the phrase "top secret" and a mail interceptor configured for this phrase. If the text is opened in MS Word, a two-second operation (Ctrl + F, "find 'o' (Russian layout)," "replace with 'o' (English layout)," "replace all," "send the document") makes the document completely invisible to the filter. It is all the more disappointing that such a replacement is carried out with the standard tools of MS Word or any other text editor, i.e. it is available to a user who has no local administrator rights and cannot run encryption programs.
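The substitution trick is easy to reproduce. The sketch below is an illustration under assumed names (`disguise`, `signature_filter_hits`): it swaps the Latin letter "o" for the visually identical Cyrillic letter (U+043E) in an English phrase, which mirrors the Russian-to-Latin replacement described above.

```python
CYRILLIC_O = "\u043e"  # Cyrillic small letter o, looks like Latin "o"


def disguise(text: str) -> str:
    """Replace every Latin 'o' with its Cyrillic look-alike."""
    return text.replace("o", CYRILLIC_O)


def signature_filter_hits(text: str, signature: str) -> bool:
    """A naive signature filter: exact substring match."""
    return signature in text


original = "This report is top secret."
disguised = disguise(original)

print(signature_filter_hits(original, "top secret"))   # the filter fires
print(signature_filter_hits(disguised, "top secret"))  # the filter is silent
```

The two strings look the same on screen and have the same length, but they no longer share the byte sequence the filter is looking for.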
Most often, signature control of flows is included in the functionality of UTM devices, i.e. solutions that clean traffic of viruses, spam, intrusions and any other threats that are detected by signatures. Since this feature comes "free," users often believe it is enough. Such solutions really do protect against accidental leaks, i.e. cases where the sender has not altered the outgoing text in order to bypass the filter, but they are powerless against malicious users.
Masks
An extension of stop-word signature search is the search by masks. This is a search for content that cannot be specified exactly in the stop-word database but whose element or structure can be described. Such information includes any codes that identify a person or enterprise: TIN, account numbers, document numbers, etc. They cannot be found using signatures.
It makes no sense to set the number of one specific bank card as the search object: the goal is to find any credit card number, however it is written, with spaces or without. This is not just a wish but a requirement of the PCI DSS standard: unencrypted payment card numbers are prohibited from being sent by e-mail, so the user's duty is to find such numbers in e-mail and drop the prohibited messages.
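A hedged sketch of such a mask in Python regular expressions. The pattern, the function names and the added Luhn checksum step are assumptions of this example rather than anything prescribed by PCI DSS or a particular DLP product; the checksum is included only to cut down on random digit runs being reported as card numbers.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_MASK = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum used by payment card numbers."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        digit = int(char)
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return normalized candidates that match the mask and pass Luhn."""
    found = []
    for match in CARD_MASK.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found


# 4111 1111 1111 1111 is a widely used test number, not a real card.
print(find_card_numbers("Please charge 4111 1111 1111 1111 today"))
print(find_card_numbers("Please charge 4111-1111-1111-1111 today"))
```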
A mask can, for example, define as a stop word the name of a confidential or secret order whose number starts from zero. Such a mask takes into account not only an arbitrary number but also any letter case and even the substitution of Latin letters for Russian ones. Masks are usually written in standard regular expression ("REGEXP") notation, although various DLP systems may have their own, more flexible notations. The situation is even worse with phone numbers. This information is classified as personal data, and it can be written in a dozen ways, using various combinations of spaces, different types of brackets, plus and minus signs, etc. Here a single mask will hardly do. For example, anti-spam systems, which have to solve a similar problem, use several dozen masks at once to detect a telephone number.
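The idea of applying several masks at once can be sketched as follows. The three patterns and the sample number formats are hypothetical and cover only a few spellings; a production rule set would be considerably larger.

```python
import re

# Each mask covers one hypothetical way of writing an 11-digit number.
PHONE_MASKS = [
    # +7 (495) 123-45-67, 8(495)1234567
    re.compile(r"\+?\d{1,3}[ -]?\(\d{3}\)[ -]?\d{3}[ -]?\d{2}[ -]?\d{2}"),
    # 8 495 123 45 67, 8-495-123-45-67
    re.compile(r"\+?\d{1,3}[ -]\d{3}[ -]\d{3}[ -]\d{2}[ -]\d{2}"),
    # +74951234567
    re.compile(r"\+\d{11}\b"),
]


def contains_phone(text: str) -> bool:
    """True if at least one of the masks matches somewhere in the text."""
    return any(mask.search(text) for mask in PHONE_MASKS)


for sample in ["+7 (495) 123-45-67", "8-495-123-45-67", "+74951234567", "invoice 12345"]:
    print(sample, contains_phone(sample))
```

Every new spelling that appears in real traffic means another mask, which is why the number of masks grows into the dozens.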
Many of the codes that arise in the activities of companies and their employees are protected by law as trade secrets, bank secrets, personal data and other legally protected information, so the ability to detect them in traffic is a prerequisite for any solution.
Hash function
Various types of hash functions computed over samples of confidential documents were at one time considered a new word in the leak protection market, although the technology itself has existed since the 1970s. In the West, this method is sometimes called "digital fingerprints," or "shingles" in scientific slang.
The essence of all these methods is the same, although the specific algorithms of each manufacturer may differ significantly. Some algorithms are even patented, which confirms the uniqueness of the implementation. The general scenario is as follows. A database of samples of confidential documents is collected. A "fingerprint" is taken from each of them: the significant content is extracted from the document and reduced to some normal form, for example (but not necessarily) text, and then hashes are computed of the whole content and of its parts, for example paragraphs, sentences, five-word sequences, etc.; the level of detail depends on the specific implementation. These fingerprints are stored in a special database.
The intercepted document is similarly cleared of service information and brought to normal form, then shingles are taken from it using the same algorithm. The resulting fingerprints are looked up in the database of fingerprints of confidential documents, and if they are found there, the document is considered confidential. Since this method is used to find direct quotes from a sample document, the technology is sometimes called "anti-plagiarism."
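The scenario above can be sketched with five-word shingles. This is a simplified illustration under stated assumptions, not a vendor algorithm: the normalization (lowercasing and splitting into words), the use of SHA-256 and the function names are choices made for this example.

```python
import hashlib
import re


def normalize(text: str) -> list[str]:
    """Bring the text to a normal form: lowercase words without punctuation."""
    return re.findall(r"\w+", text.lower())


def shingles(text: str, size: int = 5) -> set[str]:
    """Hash every run of `size` consecutive words."""
    words = normalize(text)
    if not words:
        return set()
    if len(words) < size:
        windows = [words]
    else:
        windows = [words[i:i + size] for i in range(len(words) - size + 1)]
    return {hashlib.sha256(" ".join(w).encode("utf-8")).hexdigest() for w in windows}


def overlap(intercepted: str, fingerprint_db: set[str], size: int = 5) -> float:
    """Share of the intercepted document's shingles found in the database."""
    found = shingles(intercepted, size)
    return len(found & fingerprint_db) / len(found) if found else 0.0


sample = "The quarterly revenue forecast must not leave the finance department before approval"
fingerprint_db = shingles(sample)

quote = "FYI: the quarterly revenue forecast must not leave the finance department, see you"
print(overlap(quote, fingerprint_db) > 0)
print(overlap("Completely unrelated lunch menu for Friday afternoon today", fingerprint_db))
```

A real system would compare the overlap against a configurable threshold instead of treating any match as a leak.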
Most of the advantages of this method are at the same time its disadvantages. First of all, there is the requirement to use sample documents. On the one hand, the user does not need to worry about stop words, significant terms and other information that is entirely outside a security officer's specialty. On the other hand, "no sample - no protection," which creates the same problems with new and incoming documents as label-based technologies do. A very important advantage of this technology is that it works with arbitrary character sequences. From this follows, first of all, independence from the language of the text, whether hieroglyphs or Pashto. A further major consequence of this property is the ability to fingerprint non-text information: databases, drawings, media files. These are the technologies that Hollywood studios and the world's recording studios use to protect media content in their digital storage.
Unfortunately, low-level hash functions are not resistant to the primitive encoding shown in the signature example. They easily cope with changes in word order, rearranged paragraphs and other "plagiarist" tricks, but replacing letters throughout the document, for example, destroys the hash pattern, and such a document becomes invisible to the interceptor.
Using this method alone makes working with forms difficult. For example, a blank loan application form is a freely distributable document, while a completed one is confidential, since it contains personal data. If a fingerprint is simply taken from the blank form, an intercepted completed document will contain all the information from the blank form, i.e. the fingerprints will largely match. Thus, the system will either let confidential information through or prevent the free distribution of blank forms.
Despite these shortcomings, the method is widespread, especially in businesses that cannot afford qualified employees and operate on the principle "put all confidential information into this folder and sleep soundly." In this sense, requiring specific documents for protection is somewhat similar to label-based solutions, except that the "labels" are stored separately from the samples and survive changes of file format, copying of part of the file, etc. However, a large business with hundreds of thousands of documents in circulation is often simply unable to provide samples of confidential documents, since the company's business processes do not require this. The only thing that every enterprise has (or, more honestly, should have) is the "List of information constituting a trade secret." Making samples from it is a non-trivial task.
The ease of adding samples to the base of controlled content often plays a cruel joke on users. It leads to a gradual growth of the fingerprint base, which significantly affects system performance: the more samples there are, the more comparisons each intercepted message requires. Each fingerprint occupies from 5 to 20% of the size of the original, so the base grows along with the set of samples. Users note a sharp drop in performance once the base begins to exceed the amount of RAM on the filtering server. The problem is usually solved by regularly auditing the sample documents and removing outdated or duplicate samples, i.e. what users save on implementation they lose on operation.
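As a rough sizing sketch based only on the 5 to 20% figure above: the function names, the 100 GB corpus and the RAM sizes are hypothetical, and real fingerprint bases depend on the algorithm and the level of detail.

```python
def fingerprint_base_range(corpus_gb: float) -> tuple[float, float]:
    """Estimate the fingerprint base size in GB as 5% to 20% of the samples."""
    return corpus_gb * 0.05, corpus_gb * 0.20


def may_exceed_ram(corpus_gb: float, ram_gb: float) -> bool:
    """True if the upper estimate no longer fits into the server's RAM."""
    _, high = fingerprint_base_range(corpus_gb)
    return high > ram_gb


low, high = fingerprint_base_range(100)  # 100 GB of sample documents
print(f"{low:.0f} to {high:.0f} GB")
print(may_exceed_ram(100, ram_gb=16), may_exceed_ram(100, ram_gb=32))
```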
Linguistic methods
The most common method of analysis today is linguistic analysis of text. It is so popular that it is often colloquially called "content filtering," i.e. it bears the name of the entire class of content analysis methods. Strictly speaking, hash analysis, signature analysis and mask analysis are all "content filtering" too, i.e. traffic filtering based on content analysis.
As the name implies, the method works only with texts. It will not protect a database consisting only of numbers and dates, let alone blueprints, drawings or a collection of favorite songs. But with texts this method works wonders.
Linguistics as a science consists of many disciplines, from morphology to semantics, so linguistic methods of analysis also differ from one another. There are methods that use only stop words entered at the root level, with the system itself building the complete dictionary; there are methods based on assigning weights to the terms found in the text. Linguistic methods also have their own statistics-based fingerprints: for example, a document is taken, its fifty most frequent words are determined, and then the 10 most frequent of them are selected in each paragraph. Such a "dictionary" is an almost unique characteristic of the text and makes it possible to find meaningful quotes in "clones."
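The statistical fingerprint described above can be sketched as follows. The interpretation is an assumption of this example: the per-paragraph words are chosen from among the document's most frequent words, paragraphs are separated by blank lines, and the function names are hypothetical. Real products also handle morphology, which this sketch ignores.

```python
import re
from collections import Counter


def words(text: str) -> list[str]:
    """Lowercase alphabetic words; digits and punctuation are dropped."""
    return re.findall(r"[^\W\d_]+", text.lower())


def text_imprint(document: str, doc_top: int = 50, para_top: int = 10) -> dict:
    """Top words of the document, then the most frequent of them per paragraph."""
    top_words = [w for w, _ in Counter(words(document)).most_common(doc_top)]
    top_set = set(top_words)
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    per_paragraph = [
        [w for w, _ in Counter(x for x in words(p) if x in top_set).most_common(para_top)]
        for p in paragraphs
    ]
    return {"document": top_words, "paragraphs": per_paragraph}


doc = "secret secret secret plan budget\n\nbudget budget review secret"
print(text_imprint(doc, doc_top=2, para_top=1))
```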
Analysis of all the intricacies of linguistic analysis is not within the scope of this article, so let's focus on the advantages and disadvantages.
The advantage of the method is complete insensitivity to the number of documents, i.e. a scalability that is rare in corporate information security. The content filtering base (a set of key dictionary classes and rules) does not change in size when new documents or processes appear in the company.
In addition, users note that this method resembles "stop words" in that, if a document is held, it is immediately clear why. If a fingerprint-based system reports that some document is similar to another, the security officer has to compare the two documents himself, whereas with linguistic analysis he receives content that is already marked up. Linguistic systems, along with signature filtering, are so common because they can start working immediately after installation without changing anything in the company. There is no need to bother with labeling and fingerprinting, take inventory of documents or do other work that is outside a security officer's specialty.
The flaws are equally obvious, and the first is the dependence on language. Within a single country whose language the manufacturer supports, this is not a drawback. But for global companies that, in addition to a single corporate language (for example, English), have many documents in the local languages of each country, it is a clear drawback.
Another drawback is the high percentage of false positives (errors of the second kind), which takes linguistic expertise to reduce by fine-tuning the filtering base. Standard industry bases typically give 80-85% filtering accuracy, which means that one in every five to six messages is intercepted in error. Tuning the base to an acceptable 95-97% accuracy usually involves a specially trained linguist. And although two days of free time and a high school graduate's command of the language are enough to learn how to adjust the filtering base, there is no one to do this work except the security officer, who usually considers it non-core. Bringing in an outsider is always risky, since he will have to work with confidential information. The usual way out is to buy an additional module, a self-learning "autolinguist" that is "fed" false positives and automatically adapts the standard industry base.
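The arithmetic behind these accuracy figures is simple enough to check. The sketch assumes, as the text does, that every error is a message intercepted by mistake; the function name is hypothetical.

```python
def messages_per_false_interception(accuracy: float) -> float:
    """With the given accuracy, one message in this many is intercepted in error."""
    return 1.0 / (1.0 - accuracy)


for accuracy in (0.80, 0.85, 0.95, 0.97):
    rate = messages_per_false_interception(accuracy)
    print(f"{accuracy:.0%} accuracy: about one in {rate:.1f} messages")
```

At 80-85% this gives roughly one message in five to seven, and at 95-97% roughly one in twenty to thirty-three, which is why the tuned base is so much easier to live with.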
Linguistic methods are chosen when the goal is to minimize interference with the business, or when the information protection service lacks the administrative authority to change existing processes for creating and storing documents. They work always and everywhere, albeit with the shortcomings mentioned.
Popular accidental leak channels - mobile media
InfoWatch analysts believe that mobile media (laptops, flash drives, mobile communicators, etc.) remain the most popular channel for accidental leaks, since users of such devices often neglect data encryption tools.
Another common cause of accidental leaks is paper media. Paper is harder to control than electronic media: once a sheet leaves the printer, it can only be tracked "manually." Many leak protection tools (they cannot be called full-fledged DLP systems) do not control the printer output channel, so confidential data easily leaves the organization.
This problem is solved by multifunctional DLP systems that block the printing of prohibited information and check that the mailing address matches the addressee.
Protection against leaks is further complicated by the growing popularity of mobile devices, for which there are no corresponding DLP clients yet. It is also very difficult to detect a leak when cryptography or steganography is used. To bypass a particular filter, an insider can always turn to "best practices" on the Internet. In other words, DLP tools protect rather poorly against an organized, deliberate leak.
The effectiveness of DLP tools can be hampered by their obvious flaws: modern leak protection solutions cannot control and block all available information channels. DLP systems monitor corporate mail, web usage, instant messaging, external media, document printing, and hard drive content. But Skype is still not controlled by DLP systems. Only Trend Micro has announced that it can control the operation of this communication program. The rest of the developers promise that the corresponding functionality will appear in the next version of their security software.
But while Skype promises to open its protocols to DLP developers, other solutions, such as Microsoft Collaboration Tools, remain closed to third-party programmers. How can the transmission of information over such a channel be controlled? Meanwhile, it is becoming common practice for specialists to be brought together remotely into teams to work on a joint project and to disband once it is complete.
In the first half of 2010, the main sources of confidential information leaks were still commercial (73.8%) and state (16%) organizations. About 8% of leaks came from educational institutions. Most of the leaked confidential information was personal data (almost 90% of all information leaks).
The United States and the United Kingdom are traditionally the world leaders in leaks (Canada, Russia and Germany, with significantly lower figures, round out the top five countries by number of leaks). This is associated with the legislation of these countries, which requires all incidents of confidential data leakage to be reported. InfoWatch analysts predict a reduction in the share of accidental leaks and an increase in the share of intentional leaks next year.
Implementation difficulties
In addition to the obvious difficulties of implementing DLP, it is also difficult to choose the right solution, since different DLP vendors take their own approaches to organizing protection. Some have patented algorithms for analyzing content by keywords, while others offer digital fingerprinting. Under these conditions, how do you choose the best product? Which is more effective? These questions are very hard to answer, since there are very few implementations of DLP systems today, and even fewer real-world practices of their use that could be relied on. The projects that have been implemented showed that more than half of the work and budget goes to consulting, which usually meets great skepticism from management. In addition, the enterprise's existing business processes usually have to be rebuilt to meet DLP requirements, and companies struggle to do so.
How does DLP implementation help you meet current regulatory requirements? In the West, the introduction of DLP systems is driven by laws, standards, industry requirements and other regulations. According to experts, the clear legislative requirements that exist abroad, together with methodological guidelines for meeting them, are the real engine of the DLP market, since introducing special solutions removes claims from regulators. In Russia, the situation in this area is completely different, and the implementation of DLP systems does not help comply with the law.
Some incentive for the implementation and use of DLP in the corporate environment may be the need to protect the commercial secrets of companies and comply with the requirements of the federal law "On Commercial Secrets."
Almost every enterprise has adopted documents such as the "Regulation on Commercial Secrets" and the "List of Information Constituting Commercial Secrets," and their requirements should be fulfilled. There is an opinion that the Law "On Commercial Secrets" (98-FZ) does not work; nevertheless, company leaders are well aware that protecting their trade secrets is important and necessary. Moreover, this awareness is much higher than the understanding of the importance of the Law "On Personal Data" (152-FZ), and it is much easier to explain to any manager the need to introduce confidential document management than to talk about the protection of personal data.
What prevents DLP from being used in trade secret protection automation processes? According to the Civil Code of the Russian Federation, in order to introduce a regime for the protection of commercial secrets, it is only necessary that the information has some value and is included in the corresponding list. In this case, the owner of such information is obliged by law to take measures to protect confidential information.
At the same time, it is obvious that DLP cannot resolve all issues. In particular, it cannot close off third-party access to confidential information. But there are other technologies for that, and many modern DLP solutions can integrate with them. Building such a technological chain can then result in a working system for protecting trade secrets. Such a system will be easier for the business to understand, and it is the business that will be able to act as the customer of the leak protection system.
The need for a DLP class system has been called into question. You can prevent theft of secrets by more effective methods
Data loss prevention (DLP) systems, in fact, are mainly designed to control data leakage and investigate incidents, and not to prevent the leakage itself. And how to prevent the theft of secrets? Read about this in a separate TAdviser article.
Russia and the West
According to analysts, Russia has a different attitude towards security and a different level of maturity among companies supplying DLP solutions. The Russian market is focused on security specialists and highly specialized problems. The people responsible for preventing data breaches do not always understand which data has value. Russia takes a "militaristic" approach to organizing security systems: a solid perimeter with firewalls, and every effort made to prevent entry.
But what if a company employee has access to more information than his duties require? By contrast, the approach that has formed in the West over the past 10-15 years pays more attention to the value of information. Resources are directed to where valuable information is located, not to all information indiscriminately. Perhaps this is the biggest cultural difference between the West and Russia. However, analysts say, the situation is changing. Information is beginning to be perceived as a business asset, though this evolution will take some time.
There is no comprehensive solution
Not a single manufacturer has yet developed 100% protection against leaks. Some experts formulate the problems of using DLP products roughly as follows: making effective use of the leak-fighting experience embodied in DLP systems requires understanding that a significant part of the work of protecting against leaks must be done on the customer's side, since no one knows the customer's information flows better than the customer himself.
Others believe that it is impossible to prevent information leakage at all: since the information has value for someone, it will be obtained sooner or later. Software tools can make obtaining it a more expensive and time-consuming process, which can significantly reduce the benefit of possessing the information and its relevance. This means that the efficiency of DLP systems should be monitored.
2019: DLP systems sales reach $1.65 billion
The global market for systems to protect companies from data breaches reached $1.647 billion in 2019. Analysts at ResearchAndMarkets predict that it will grow at an average annual rate of 21.03% and reach $6.265 billion by 2026.
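The forecast figures can be cross-checked with the compound growth formula. The sketch assumes annual compounding over the seven years from 2019 to 2026; the function name is hypothetical.

```python
def project(value: float, annual_rate: float, years: int) -> float:
    """Compound a starting value at a fixed annual growth rate."""
    return value * (1.0 + annual_rate) ** years


# $1.647 billion in 2019, growing 21.03% a year until 2026.
projected_2026 = project(1.647, 0.2103, 2026 - 2019)
print(f"{projected_2026:.3f} billion dollars")
```

Under these assumptions the result is consistent with the quoted forecast of about $6.265 billion.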
The market for enterprise data loss prevention systems is largely driven by the growing demand for optimized solutions, as well as a sharp increase in cybersecurity threats to enterprises. The rise in data breaches, along with other factors such as DLP provision as a service, DLP functionality extending to the cloud, and improved protection against attacks targeting data theft are the main drivers of the DLP systems market.
In addition, growing volumes of both structured and unstructured data and digital assets, together with the growing need for data security services among data-centric organizations and enterprises, have driven the growth of the enterprise data loss prevention systems market. Many Fortune 500 companies have been investing in the DLP systems market for several years.
DLP systems are mainly used in industries such as healthcare, industrial communications and technology, as well as in government. In addition, with the rise of cyber threats, data loss prevention solution providers are targeting service companies working with end users from a wide range of industries.
One of the main challenges of effectively preventing data loss in companies is the high cost of implementing DLP systems. Some companies may also need professional service support from the supplier, which could cost larger businesses thousands of dollars. In addition, additional utilities or integration may be required from third-party manufacturers or from the supplier itself, which are sometimes sold as separate modules or devices, which also increases the total cost of operating DLP systems.[1]