Guidance for Building a Security Data Lake

November 23, 2021 | By IANS Faculty

A security data lake can offer a powerful way to analyze large volumes of information more cost-effectively than today’s SIEM technology allows. While we don’t see SIEM being displaced in the short term, a security data lake can be a valuable addition to your security program, helping you perform advanced analysis and threat analytics at scale. This piece explains how data lakes differ from SIEMs and offers tips for planning, building and securing a security data lake.

Data Lakes vs. SIEM 

Understanding the differences between a SIEM solution and a data lake will help you determine if implementing a security data lake is right for your organization. SIEM solutions were developed to solve the problem of log aggregation across the enterprise. They allow for a centralized “single pane of glass” approach to security information and event logging, and they support detecting, monitoring and responding to threats across a wide range of systems. The primary driver for SIEM remains threat detection and incident response.

The SIEM market is considered relatively mature at this point. However, both the cost of deploying a SIEM and its narrow focus on security events have led some firms to seek alternatives. The growth of data and log sources in most enterprises (which increases overall SIEM management costs) is leading some firms to question the long-term viability of SIEM solutions. 

Data lakes address these SIEM shortcomings by offering a low-cost big data solution, often built on open-source tools. Key data lake upsides include:

  • No strict data structures required: Data lakes can hold large amounts of information independent of data structure. This differs from data warehousing technology, which requires a rigid, structured approach to data.
  • Fewer data source restrictions: Unlike a SIEM solution, a data lake can accept data from almost any source. This means that in addition to basic log files, you could include open-source intelligence (OSINT), threat feeds and other relevant information that might help in analysis. 
  • Better analytics: Machine learning (ML) and artificial intelligence (AI) tools can perform more advanced analytics on data lake contents than is possible with today’s SIEM technology.
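
As a rough illustration of the first two points, the sketch below appends records of different shapes to a single newline-delimited JSON file and reads them back without a predefined schema. The `ingest` and `read_all` helpers, the `source` tag and the sample records are hypothetical and exist only for this example; a real data lake would typically write to cloud object storage rather than a local file.

```python
import json
import tempfile
from pathlib import Path


def ingest(lake_file: Path, source: str, record: dict) -> None:
    """Append one record as a JSON line, tagging it with the source it came from."""
    entry = {**record, "source": source}
    with lake_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def read_all(lake_file: Path) -> list:
    """Read every record back; no schema is enforced on read."""
    with lake_file.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical sample records with deliberately different shapes.
lake_file = Path(tempfile.mkdtemp()) / "lake.jsonl"
ingest(lake_file, "firewall", {"ts": "2021-11-23T10:00:00+00:00", "src_ip": "10.0.0.5", "action": "deny"})
ingest(lake_file, "threat_feed", {"indicator": "198.51.100.7", "type": "ip", "confidence": 80})
ingest(lake_file, "osint", {"title": "Example advisory", "tags": ["phishing"]})

records = read_all(lake_file)
sources = sorted({r["source"] for r in records})
print(f"{len(records)} records from sources: {', '.join(sources)}")
```

Because nothing is validated on write, the burden of making sense of the fields shifts to analysis time, which is one reason the skills and governance points later in this piece matter.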

Security Data Lake: Getting Started 

Choosing the technology stack for a data lake is largely a matter of preference, though the existing skills on your team may influence the decision. Although it isn’t a requirement, we suggest estimating how much data you might store so you can project cloud storage costs.
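
A back-of-the-envelope storage estimate can be sketched as follows. The function name, the compression ratio and the per-GB price are all placeholder assumptions, not vendor figures; substitute your provider’s published rates. The sketch covers storage at rest only and ignores request, query and data transfer charges.

```python
def estimate_monthly_storage_cost(
    daily_ingest_gb: float,
    retention_days: int,
    price_per_gb_month: float,
    compression_ratio: float = 1.0,
) -> float:
    """Estimate the steady-state monthly cost of data at rest.

    Stored volume = daily ingest x retention period / compression ratio.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    stored_gb = daily_ingest_gb * retention_days / compression_ratio
    return stored_gb * price_per_gb_month


# Placeholder inputs: 50 GB/day, one year of retention, 4:1 compression,
# and an assumed price of 0.02 per GB-month.
cost = estimate_monthly_storage_cost(50, 365, price_per_gb_month=0.02, compression_ratio=4.0)
print(f"Estimated steady-state storage cost: {cost:.2f} per month")
```

Running the same calculation for a few retention periods is a quick way to see how strongly retention policy drives the total.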

Before building a security data lake, consider the following: 

  • Required use cases: SIEM features like alerting and ticketing are more difficult to implement in a data lake. Most use cases for security data lakes are more analytics-based and focus on investigating incidents, threat hunting, uncovering fraud, data visualization and dashboard creation. 
  • Employee skills: Who will help build and maintain the data lake? While the somewhat free-form structure of a security data lake is easy to work with initially, the analysis and reporting tools require some skill and configuration before they can be of value. Employee skills should be a key consideration when picking a data analysis platform. Without the proper skills and data governance processes, a data lake implementation can quickly become a data management challenge. Data lakes work best when someone has at least a nominal data science background and the know-how to focus on the details of data lake implementation. Maintaining data source updates and fixing broken data connections are ongoing maintenance considerations, but if you want to use the data lake in everyday security operations, you might also want someone who can handle performance tuning.
  • Data sources: Before adding data to a data lake, be sure to have some idea of what the end state might look like. If you plan to integrate SIEM data now or in the future, some parsing and processing may be required. Real-time feeds might also require configuration. Planning what data you want to include will help ensure that you get the full picture when it’s time for data analysis. For example, think about your most critical data elements, including privileged access information, systems accessed and timestamps. 

When building a security data lake, it’s best to spend a large chunk of time planning, start small and scale out from there. Make sure you have the right skills to not only build but manage a data lake on an ongoing basis. This will help you gain the most value from the endeavor and reduce the risk of a failed implementation.
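
To make the data sources point more concrete, here is a minimal sketch that maps events from two differently shaped sources onto the critical elements mentioned above: who acted, whether the access was privileged, which system was accessed and when. The `FIELD_MAPS` table, the source names, the field names and the sample events are assumptions made up for illustration, not a standard schema.

```python
from datetime import datetime, timezone

# Hypothetical per-source mappings from common field names to raw field names.
FIELD_MAPS = {
    "vpn": {"user": "username", "system": "gateway", "timestamp": "login_time"},
    "pam": {"user": "account", "system": "target_host", "timestamp": "ts"},
}

# Assumption: these sources only ever record privileged sessions.
PRIVILEGED_SOURCES = {"pam"}


def normalize(source: str, raw: dict) -> dict:
    """Map a raw event onto common fields, keeping the original for reference."""
    mapping = FIELD_MAPS[source]
    ts = datetime.fromisoformat(raw[mapping["timestamp"]])
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assume UTC when no offset is given
    return {
        "source": source,
        "user": raw[mapping["user"]].lower(),
        "system": raw[mapping["system"]],
        "timestamp": ts.astimezone(timezone.utc).isoformat(),
        "privileged": source in PRIVILEGED_SOURCES,
        "raw": raw,
    }


events = [
    normalize("vpn", {"username": "Alice", "gateway": "vpn-01", "login_time": "2021-11-23T08:00:00+00:00"}),
    normalize("pam", {"account": "alice", "target_host": "db-prod-1", "ts": "2021-11-23T08:05:00"}),
]

# Example analysis question: which systems did each user reach with privileged access?
privileged_systems = {}
for event in events:
    if event["privileged"]:
        privileged_systems.setdefault(event["user"], set()).add(event["system"])
```

Deciding on these common fields during planning, rather than at query time, is what lets the two sources be joined on `user` and ordered by `timestamp` later.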

Protecting the Security Data Lake 

Data lake architecture involves ingesting, storing and analyzing a lot of data. Given the potential sensitivity of security information, the security controls of the data lake deserve careful consideration. Key areas to focus on include: 

  • Identity and access management (IAM): Cloud services have matured quite a bit from a security perspective, but security can be compromised when identities are not managed well. Consider integrating your cloud-based data lake with your current identity provider, such as Active Directory (AD), to keep this process streamlined.
  • Data integrity: A modern data lake architecture should be able to track changes and support simultaneous users. A file integrity monitoring (FIM) service can help ensure the integrity of files and protect against file manipulation and damage. For less stringent integrity requirements, strong IAM and authorization controls might be sufficient. 
  • Encryption: Many cloud storage security exposures result from insufficient attention to encryption and key management. Encrypt data at rest and in transit, and decide up front who controls and rotates the keys.
  • Vulnerability management: No matter which cloud provider or analytics tools you choose, make sure your data lake infrastructure has the latest security patches. Employ a regular patching cadence and periodic security vulnerability scanning. 

With the right planning and attention to security controls, cloud-based data lake solutions can provide a scalable and secure environment for analyzing large volumes of data.
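
The core idea behind the file integrity monitoring mentioned under data integrity can be sketched with a hash baseline: record a SHA-256 digest for each file, then compare a later snapshot against it. This is a simplified illustration, not a substitute for a FIM product; the helper names and sample files are hypothetical, and it assumes the data sits on a local filesystem rather than in object storage.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root: Path) -> dict:
    """Return {relative path: SHA-256 digest} for every file under root."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff(baseline: dict, current: dict) -> dict:
    """Report files added, removed or modified since the baseline."""
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "modified": sorted(k for k in baseline.keys() & current.keys() if baseline[k] != current[k]),
    }


# Demonstration on a throwaway directory with made-up file names.
root = Path(tempfile.mkdtemp())
(root / "auth.log").write_text("login ok\n")
(root / "dns.log").write_text("query example.com\n")
baseline = snapshot(root)

(root / "auth.log").write_text("login ok (edited)\n")  # simulate tampering
(root / "new.log").write_text("unexpected file\n")
(root / "dns.log").unlink()

changes = diff(baseline, snapshot(root))
print(changes)
```

For the comparison to mean anything, the baseline has to be stored somewhere that accounts with write access to the data lake cannot modify.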

Security Data Lake Tips for Success  

A data lake is not a replacement for a SIEM solution. However, creating a security data lake can make a lot of sense if you’re looking to augment existing SIEM technology, offload some log data and perform levels of analysis that are simply not possible on today’s SIEM platforms. To be successful with building your security data lake, consider: 

  • Starting small but planning ahead: Identify some use cases and have an idea of what information you’re hoping to get out of the effort. Test the tools before settling on an analytics platform, and make sure you have the skills on staff to build and maintain the data lake.
  • Protecting the data lake: Security event and log information can contain sensitive information. Consider all the same security controls you would include in any cloud deployment, including encryption, access management and system patching. 

Security data lakes hold the promise of adding efficiencies to data-driven incident investigations and more advanced threat hunting tactics. They represent a natural evolution of the SIEM solution, but for now they are not a one-to-one replacement.

Although reasonable efforts will be made to ensure the completeness and accuracy of the information contained in our blog posts, no liability can be accepted by IANS or our Faculty members for the results of any actions taken by individuals or firms in connection with such information, opinions, or advice. 
