Lately, we have been working with new and interesting requirements involving data lake security. It is an apparently simple concept, but break it into its two sub-concepts and you will quickly notice plenty of complexity and detail within those three words.
On the one hand, you may be new to the idea of a “data lake” (or “data hub”). Even if you are not new to it, your definition may differ from that of others. On the other hand, security. Well, I suppose you have many things in mind by now that can be associated with security, all of which could turn your lake water muddy…
Because data lake security is such a complex concept, in this post I will:
- Provide a little background on each of its component concepts.
- With that foundation, navigate the different waters of the lake: the main areas to secure.
- Finally, dive a little deeper into document-level security, an area in which we have a lot of experience, and review how that familiar concept expands with yet another variant definition, one better suited to the lake.
What is Data Lake Security?
As promised, we’ll broadly define the sub-concepts so that we can approach the waters of our lake safely, with a similar understanding. For our purpose, let’s define the “data lake” (or “data hub”) as:
“A repository of enterprise-wide raw data that, combined with big data technologies and search engines, can deliver impactful benefits. Data lakes bring together data from separate sources and make it easily searchable, maximizing discovery, analytics, and reporting capabilities for end users.”
Perhaps you go further and qualify it with the word “enterprise” in front? In that case, your enterprise data lake is private: only those within the organization have access to it.
You can read about the data lake and its architecture in my previous in-depth blog titled A Data Lake Architecture with Hadoop and Open Source Search Engines.
So, let’s move into the security part of the lake. This is, at the same time, simpler and harder to define than data lake. It is simpler because we all have an understanding of security. Yet, it is harder because we are all aware that this understanding depends heavily on the context in which it is used.
For example, our view into data lake security on a recent project started from the perspective of what data was available to end users via the search application’s user interface. Soon after going over the secure search requirements, we had to consider how to secure the content within the lake so that users with direct access to it wouldn’t have rights to delete or update the data. No overfishing or contaminating the waters, please!
Beyond these, there are considerations about platform administration, the machines where different components of the system run, how and where the data is stored, and who can execute what and where, among others that we’ll cover in a bit.
To summarize, data lake security means ensuring that access to the lake, to specific components of the system, or to specific portions of the data is granted only to those who should have it, according to the security rules defined for the data lake system.
Navigating the Lake Waters: Four Areas to Secure
A natural or man-made lake has different areas; some are shallow while others are deep. The plants and animals within it vary depending on its size, depth, and location. Gates, fences, and perhaps even natural barriers protect access to the lake. There may be boats, visitors, and keepers of the lake. Likewise, the data lake has different components, which can be grouped into four main areas with respect to data lake security.
1. Platform Access and Privileges
The platform provides the components to store data, execute jobs, manage the system and the repositories, and so on. The security mechanisms vary by type and even by component. Let’s assume your data lake uses Hadoop as a platform. Here are some examples of the kinds of security applied to components at the platform level:
- Machine access and user roles – Which accounts have access to a particular machine, and which roles are those users associated with?
- HDFS and the file system on the different machines used – Both files stored in HDFS and those used by the programs running as part of your data lake system should have restricted access. POSIX-like permissions and Access Control Lists (ACLs) manage file and folder privileges, in a form like <rwxrwxrwx owner group>.
- HBase, Impala, or similar – You may store metadata or even files in a NoSQL repository as an alternative or complement to the content stored in HDFS. Namespaces and account access controls (as in a traditional RDBMS) protect these data stores.
- Job execution – Permissions to execute MapReduce, YARN, or similar applications.
- Administration utilities – Permissions to access the management utilities of the platform’s components.
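To make the file-system point concrete, the POSIX-style check that HDFS applies to files and folders can be sketched as follows. This is an illustrative Python model, not the actual Hadoop implementation; `FileEntry` and `is_allowed` are hypothetical names:

```python
from collections import namedtuple

# mode is a 9-character string such as "rwxr-x---":
# three bits each for owner, group, and other.
FileEntry = namedtuple("FileEntry", ["owner", "group", "mode"])

def is_allowed(entry, user, user_groups, action):
    """Check whether user may perform action ('r', 'w', or 'x') on entry."""
    offset = {"r": 0, "w": 1, "x": 2}[action]
    if user == entry.owner:
        bits = entry.mode[0:3]          # owner bits
    elif entry.group in user_groups:
        bits = entry.mode[3:6]          # group bits
    else:
        bits = entry.mode[6:9]          # other bits
    return bits[offset] == action

# A data file writable by its ETL owner, readable by the analytics group.
part = FileEntry(owner="etl", group="analytics", mode="rwxr-x---")

print(is_allowed(part, "etl", [], "w"))             # owner may write: True
print(is_allowed(part, "ana", ["analytics"], "r"))  # group may read: True
print(is_allowed(part, "joe", ["sales"], "r"))      # others denied: False
```

Real HDFS adds extended ACL entries on top of this base model, but the owner/group/other evaluation order is the same idea.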
2. Network Isolation
Prevent undesired access to your environment by protecting your data lake property. You already have such protection for on-premises (or hosted) information technology. Additionally, cloud technology is now considered secure enough to host enterprise applications outside the traditional boundaries of an on-premises (or hosted) network. Virtual private networks in the cloud, along with firewalls and other mechanisms, enable you to implement network isolation in a cloud-based solution.
3. Data Protection
Since data lakes store content retrieved from other sources, you may need to think of ways to protect the data as may already be done in the original content sources. Data encryption is one well-known way to protect content. For strong data protection, encryption should be applied at the storage level (data at rest) and while data travels over the network (data in transit).
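As an illustration on a Hadoop-based lake (exact property names vary by distribution and version), wire encryption can be enabled with settings like these:

```xml
<!-- core-site.xml: encrypt RPC traffic between daemons and clients -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypt the HDFS block data transfer protocol -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
```

For data at rest, HDFS offers transparent encryption zones, created with a command along the lines of `hdfs crypto -createZone -keyName <key> -path <dir>`, so that files written under that path are encrypted without application changes.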
4. Document-Level Security (or Secure Search)
There are plenty of resources from experts in the areas mentioned above. Since we specialize in search engines, we’ll dive just a little deeper into this area. You can already find a lot of detail on this subject in our Document-Level Security for Enterprise Search blog series and other valuable information in Industry-Wide Standards for Document-level Security in Enterprise Search. I won’t repeat what we’ve presented in those blogs; instead, I will describe some of our recent experiences with document-level security specifically in a data lake implementation.
Data lakes may be intended to break the barriers that silos create by giving users of the lake access to the centralized content in it. Still, some applications, or rather some data, require document-level access restrictions: users must only see documents to which they have been granted read permission. Where are those access controls defined? Well, there may be different business rules depending on the content in the lake, so let’s briefly describe some of the possible scenarios.
Scenario 1: Replicate Access Control from the Content Source
This requires enforcing the same content access permissions as the original source of the data. At first, it may seem that we are just replicating a silo by preserving the security of the content source in the lake. Keep in mind that the power of the lake resides in adding value, not simply in storing the content from multiple sources, by:
– Enabling users to efficiently retrieve or discover documents through search results or analytics on content from different sources in a central location
– Enriching the content as it is added to the lake, such as doing normalization, cleansing, or entity extraction not available or possible at the source
– Creating a read-only copy of data from a source that is no longer actively accessible, such as an old system that has been decommissioned after its data was replicated to the lake (stopping short of becoming the system of record, perhaps)
Scenario 2: Replicate Access Control from the Content Source… Only for Some of the Data
A variant of the first scenario consists of enforcing the source’s content access permissions only for a subset of that content: the most sensitive or confidential portions of it. A more relaxed set of permissions is applied to the rest of the content from that source. For example, any user with access to the original system might be allowed to see all of the non-sensitive content from that source in the lake, regardless of his or her permissions in the source.
Scenario 3: Grant Access to All Users from Parts of the Organization
Portions of the data lake may be available to all users from a Business Unit, Division, Department, or similar. The source system may have a limited capacity that prevents all potential users from accessing content in that source, whether it is the number of licenses (seats) available for that system; platform or software capacity limitations of an old software system; or other reasons. Your data lake may break the silo barriers and overcome the limitations at the sources to allow access to content for all users from the same part of the organization, such as Manufacturing, Research & Development, Finance, other areas, or a combination of them.
Scenario 4: Grant Access to All Users of the Data Lake
If no other specific permissions are defined, the minimum document-level security is to make content available only to users authorized to work with the applications powered by the data lake.
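To make the four scenarios concrete, here is a minimal Python sketch of how each one could compute a document’s effective ACL in the lake. The field names (`source_acl`, `sensitive`, `business_unit`) and group names are assumptions for illustration, not part of any specific product:

```python
def effective_acl(doc, scenario):
    """Return the access control list to store for doc in the lake."""
    if scenario == 1:
        # Scenario 1: replicate the source's permissions as-is.
        return doc["source_acl"]
    if scenario == 2:
        # Scenario 2: replicate only for sensitive content; otherwise
        # open the document to anyone with access to the source system.
        if doc.get("sensitive"):
            return doc["source_acl"]
        return {"groups": ["source-system-users"]}
    if scenario == 3:
        # Scenario 3: open to everyone in one part of the organization.
        return {"groups": [doc["business_unit"]]}
    if scenario == 4:
        # Scenario 4: any authorized user of the data lake.
        return {"groups": ["data-lake-users"]}
    raise ValueError("unknown scenario: %r" % scenario)

doc = {"source_acl": {"users": ["alice"]},
       "sensitive": False,
       "business_unit": "finance"}
print(effective_acl(doc, 2))  # {'groups': ['source-system-users']}
```

In practice the scenario would be chosen per source (or per content type) by the ingestion pipeline, rather than passed in per call.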
Document-Level Security for Data Lakes: Additional Considerations
A data lake should be intended to be a live, adapting platform for many years to come. The few access control scenarios described above, or your specific circumstances today, may evolve, and new ones may be added. It is therefore better to prepare for those changes with at least the following two lessons learned from our prior implementations.
1. Always Store Content Permissions in the Data Lake for All Documents
Even if your current requirements do not include replicating the access controls at the content sources, retrieve those permissions along with the documents and store them in the data lake. Remember that the data lake is a repository of enterprise-wide raw data. Document-level permissions are part of that raw data.
This will enable you to implement document-level security more efficiently using those access controls if a future application requirement calls for it. By storing the permissions along with the other metadata and free-text content, you won’t have to go back to the source later to retrieve those permissions for all content. This is particularly important for content sources that are too slow, too big, or at risk of becoming inaccessible before you have a chance to get more data out of them (think third-party hosted systems, or licensed systems that block access upon license termination).
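As a minimal, engine-agnostic sketch of how stored permissions enable secure search later (the `allow_tokens` field and the token format are assumptions for illustration): each document is indexed with the users and groups allowed to read it, and query results are trimmed to documents whose tokens intersect the caller’s identity tokens.

```python
# Hypothetical in-memory "index"; a real deployment would use a search
# engine field and a query-time filter instead of a Python list.
index = [
    {"id": "doc1", "text": "quarterly report", "allow_tokens": {"group:finance"}},
    {"id": "doc2", "text": "press release",    "allow_tokens": {"group:all-users"}},
]

def secure_search(query, user_tokens):
    """Return ids of matching documents the caller is allowed to read."""
    return [
        d["id"] for d in index
        if query in d["text"] and d["allow_tokens"] & user_tokens
    ]

print(secure_search("report", {"user:alice", "group:finance"}))   # ['doc1']
print(secure_search("report", {"user:bob", "group:all-users"}))   # []
```

The key point is that the `allow_tokens` values are already in the lake, so wiring up this filter later does not require re-crawling the sources.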
2. Updating Content Permissions Stored in the Data Lake
Your storage mechanism, or the way you handle permissions at the source, may limit how efficiently you can keep permissions up to date in the data lake. Consider the following atomic changes in the source, each of which affects multiple documents in that repository:
- Modifying a folder’s (or other containers’) permissions that are inherited by the files (or records) and subfolders within that higher-level container
- Adding or removing users from a security group or role
- Changing the permissions granted to users, groups, or roles
- Other non-document specific access control changes
What the cases above have in common is that multiple documents’ access controls must be updated as a result of that one change. Your update mechanism must be capable of updating the stored permissions associated with all affected documents in the data lake, as well as in client applications that enforce document-level security, such as a search engine.
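One way to find every document affected by such an atomic change is to maintain a reverse index from groups (or containers) to document ids. A minimal sketch with hypothetical data:

```python
# Reverse index: which stored documents reference each security group.
# In practice this could be a query against the lake's metadata store.
docs_by_group = {
    "finance": {"doc1", "doc7", "doc9"},
    "hr":      {"doc2"},
}

def docs_affected_by_group_change(group):
    """Ids whose stored ACLs must be re-synced after the group's
    membership changes at the content source."""
    return docs_by_group.get(group, set())

# Removing a user from "finance" at the source means these stored
# documents (and their search-engine copies) need their ACLs refreshed.
print(sorted(docs_affected_by_group_change("finance")))  # ['doc1', 'doc7', 'doc9']
```

The same idea applies to folder-permission changes: index documents by their parent container so one change maps to the full set of affected ids.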
Chances are that limitations in your data lake implementation will force you to periodically update all stored documents in the data lake to ensure that each of them has access control lists matching those at the content source. If so, make sure your process and tools are prepared to do so while your content sources are still available!
Back to the Lake Shore
Writing about these experiences makes me think about how we humans like to conceptualize the world we live in. According to a paper from the Association for Academic Psychiatry (AAP), “conceptualization is the act of thinking through and seeing beyond existing ideas to discover higher order ideas from within one’s own mind.” Imagine trying to implement a complex system like a secure data lake while having to think of all the details of everything involved. Even the simplest task would be pretty difficult that way. By using concepts, our minds can manage all that information without making our heads explode.
So, rely on your computational thinking capacity to deal with data lake security. An approach that has worked well for our customers is breaking down data lake security challenges into smaller, more manageable pieces like those described above. You can then rely on your data lake implementation partner or internal experts in each area to ensure your data lake is secure and well-maintained all around in order to power and scale your search and analytics applications.
– Carlos