How to Evaluate the Viability of a Data Lake for Security

November 18, 2021 | By IANS Faculty
Most organizations aren’t currently implementing security data lakes, primarily due to cost and skills shortages. Data lakes do have some compelling use cases, though, if the integration with the SIEM and other solutions can be managed. These integrations, however, can be hit-or-miss, and any organization looking to invest in a security data lake should be aware of that potential issue. This piece explains the pros and cons of implementing a security data lake, along with tips for making it work. 

What is a Data Lake? 

It's important to note that a data lake is not a SIEM. A data lake is built to offer a central store of event data, along with a processing engine. A data lake has four goals: 

  • Provide a process and mechanism to collect all data. 
  • Process and enrich data in one location. 
  • Store data only once. 
  • Access data using a standard interface. 
The data lake can offer a central location for log storage and analysis, but the data lake will need to be compatible with data analysis and reporting solutions (like SIEM) to be truly effective. Data lakes should be able to ingest almost any type of data (events, raw logs, etc.), and triage/scrub the data and provide it to other solutions like SIEM. 
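The four goals above can be illustrated with a toy example. This is a minimal sketch only: `MiniLake`, its method names and the content-hash deduplication are assumptions made for illustration, not how any particular data lake product works, and a real lake would sit on distributed storage rather than a Python dictionary.

```python
import hashlib
import json
from datetime import datetime, timezone


class MiniLake:
    """Toy in-memory stand-in for a data lake (hypothetical, illustrative only)."""

    def __init__(self):
        self._events = {}  # content hash -> enriched event

    def ingest(self, raw, source):
        """Collect a raw log line (str) or structured event (dict)."""
        payload = raw if isinstance(raw, dict) else {"message": raw}
        # Hash the source and content so a re-sent record maps to the same key.
        key = hashlib.sha256(
            json.dumps([source, payload], sort_keys=True).encode()
        ).hexdigest()
        if key not in self._events:  # store data only once
            self._events[key] = self._enrich(payload, source)
        return key

    def _enrich(self, payload, source):
        """Process and enrich in one location: tag origin and ingest time."""
        event = dict(payload)
        event["source"] = source
        event["ingested_at"] = datetime.now(timezone.utc).isoformat()
        return event

    def query(self, **filters):
        """Single access interface: exact-match filters on event fields."""
        return [
            event
            for event in self._events.values()
            if all(event.get(field) == value for field, value in filters.items())
        ]
```

Ingesting the same syslog line twice returns the same key and leaves one stored event, while a SIEM or other consumer would read everything through `query` instead of keeping its own copy.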

Data lakes can provide data for the following scenarios: 

  • Basic queries (searching) 
  • Relationship mapping (data types/strings that match a pattern) 
  • Data mining (large scale processing for trends/patterns) 
  • Raw data access 
  • Real-time statistics 
Data lakes do not often have security analytics and correlation in place natively and will need to publish data to a SIEM or another specialized platform to accomplish this. 
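Three of the scenarios listed above (basic queries, relationship mapping and statistics) can be sketched over a handful of events. The sample events, field names and helper functions here are all hypothetical; a production lake would run equivalent logic in its own query or processing engine.

```python
import re
from collections import Counter

# Hypothetical sample events; field names are assumptions for illustration.
EVENTS = [
    {"host": "web01", "user": "alice",
     "message": "Failed password for alice from 10.0.0.5"},
    {"host": "web01", "user": "bob",
     "message": "Accepted password for bob from 10.0.0.7"},
    {"host": "db01", "user": "alice",
     "message": "Failed password for alice from 10.0.0.5"},
]

# Loose IPv4 pattern, good enough for a sketch (it does not validate octets).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def basic_query(events, **filters):
    """Basic search: keep events whose fields equal the given values."""
    return [e for e in events
            if all(e.get(k) == v for k, v in filters.items())]


def map_relationships(events, pattern=IPV4):
    """Relationship mapping: link each matching string to the hosts it hit."""
    seen = {}
    for event in events:
        for match in pattern.findall(event["message"]):
            seen.setdefault(match, set()).add(event["host"])
    return seen


def count_by(events, field):
    """Simple statistics: number of events per value of a field."""
    return Counter(event[field] for event in events)
```

In this sample, `map_relationships(EVENTS)` ties the address `10.0.0.5` to both `web01` and `db01`, which is the kind of cross-source link a single log silo would not show.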

Security Data Lake Use Cases 

  • Security event management: Log data and other events are being produced in enormous quantities, and security teams must recognize specific indicators quickly, see patterns of events occurring, and spot events happening in cloud environments. Data lakes provide massive event data processing technology to build more intelligent detection and alerting tactics with compatible SIEM platforms. 
  • Endpoint/network behavior modeling: Network flow modeling is a great use case for data lake ingestion and analysis with large-scale processing. There are massive quantities of traffic between systems that could and should be developed into “normal” baselines for monitoring. Endpoint events and alerts can also be processed at massive scale to discover patterns of unusual behavior. 
  • Fraud detection: For financial services firms and insurers, fraud detection requires an enormous number of inputs and data types, and many intensive types of processing. Text mining, database searches, social network analysis and anomaly detection are coupled with predictive models at scale, and data lakes could potentially help with this enormously. Once in place, this could be extended to things like fraudulent use of cloud services, such as a Microsoft Office 365-based phishing attack from a hijacked account. 
  • Malware detection: Data lake event processing of data and file attributes could likely help with detection of ransomware and other malware variants today, particularly those without known signatures. Leading endpoint detection and response (EDR) companies like Carbon Black and CrowdStrike are leveraging cloud data processing for this, but there could be a case to be made for in-house sandbox processing engines using data analytics within a data lake, as well. 
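The idea of developing "normal" baselines for network traffic can be sketched with basic statistics. This is a simplified stand-in, assuming per-host byte counts per interval and a z-score threshold; the function names, data shape and threshold of 3.0 are all illustrative choices, and real behavior modeling would use far richer features.

```python
from statistics import mean, pstdev


def build_baseline(history):
    """Summarize past traffic per host as (average, standard deviation).

    history: dict mapping host -> list of byte counts per interval.
    """
    return {host: (mean(values), pstdev(values))
            for host, values in history.items()}


def unusual_hosts(baseline, current, threshold=3.0):
    """Return hosts whose current volume deviates strongly from baseline."""
    flagged = []
    for host, observed in current.items():
        if host not in baseline:
            flagged.append(host)  # no baseline yet: surface for review
            continue
        average, spread = baseline[host]
        if spread == 0:
            # Perfectly flat history: any change at all is a deviation.
            if observed != average:
                flagged.append(host)
        elif abs(observed - average) / spread > threshold:
            flagged.append(host)
    return flagged
```

A host that normally moves around 500 bytes per interval and suddenly moves 5,000 would be flagged, while one that drifts from 100 to 102 would not.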

Data Lake Integrations 

Data lake integration can be simple or very challenging, depending on the solutions involved. Issues to watch for include: 

  • Real-time data feeds: Some security products and platforms may require access to your real-time data feeds to do their own processing, rendering a data lake somewhat superfluous in some cases. Jobs like scoring and behavioral models, for example, often require access to a feed in real time. 
  • SIEM tuning: Most major SIEMs can be integrated with a security data lake, but parsing and processing requires a significant amount of tuning in most scenarios. 
  • Native connectors: EDR and network detection and response (NDR) tools can be less likely to have native connectors to data lakes, although some common forms of import/export (JSON, for example) are growing in popularity. 
  • MSSP/MDR integration: Integration with managed security service providers (MSSPs) and managed detection and response (MDR) providers can be hit-or-miss. We suggest having detailed discussions with any providers you’re looking to work with, because a data lake is not a type of platform that MSSPs and MDR providers commonly manage or work with natively. 
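Where a tool has no native connector but can export JSON, a small normalization step is often the glue. The sketch below assumes a newline-delimited JSON export; the vendor field names (`ts`, `hostname`, `proc`) and the target schema are hypothetical and would need to match whatever the actual tool and lake use.

```python
import json

# Hypothetical mapping from a vendor's export fields to the lake's schema.
FIELD_MAP = {
    "ts": "timestamp",
    "hostname": "host",
    "proc": "process_name",
}


def normalize_export(json_lines, field_map=FIELD_MAP):
    """Convert a newline-delimited JSON export into lake-schema records."""
    records = []
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            raw = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than fail the whole batch
        if not isinstance(raw, dict):
            continue  # only JSON objects are treated as events
        # Rename known fields; pass unknown fields through unchanged.
        records.append({field_map.get(key, key): value
                        for key, value in raw.items()})
    return records
```

Silently skipping malformed lines keeps the example short; in practice those lines would more likely be counted or routed to a dead-letter location so parsing problems stay visible.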

Security Data Lake Technologies and Skills 

Many technologies in the space today could serve as the basis for a security data lake. These include, but are not limited to, Hadoop, Apache MapReduce, YARN, Spark, Hive, Elasticsearch, Apache Storm and the ELK stack. 

When investigating any of these technologies, be aware of the skills needed to build and maintain the technology stack over time, because these skills can be rare and expensive to acquire. 

Security Data Lake Success Tips 

While not all organizations currently have security data lakes in place, they can help teams manage the ever-growing storage, processing and analysis needs of modern security organizations. 

However, security data lakes aren’t for everyone. To be successful, we recommend security teams ensure: 

  • Current security tools will work with the data lake, and not require their own separate real-time data feeds. 
  • Current SIEMs can integrate easily, without an onerous amount of tuning. 
  • Current skillsets can manage the lake, both now and into the future. 

Threat intelligence analysis is one more use case to weigh. Threat intelligence data provides perspective on things like attacker sources, indicators of compromise (IoCs), behavioral trends related to cloud account usage and attacks against various types of cloud services, etc. Threat intelligence feeds can be aggregated into a data lake, analyzed at scale using machine learning techniques, and processed for likelihood/predictability models. 
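The threat intelligence aggregation described above can be sketched in its simplest form as an indicator lookup. This swaps in plain set matching for the machine learning analysis the text mentions; the feed names, the `src_ip` field and both functions are hypothetical, and the addresses come from reserved documentation ranges.

```python
from collections import defaultdict


def aggregate_feeds(feeds):
    """Merge several feeds into one index: indicator -> feeds listing it.

    feeds: dict mapping feed name -> iterable of indicator strings.
    """
    index = defaultdict(set)
    for name, indicators in feeds.items():
        for indicator in indicators:
            index[indicator.strip().lower()].add(name)
    return dict(index)


def match_events(events, index, field="src_ip"):
    """Return (event, feed names) pairs for events that hit a known IoC."""
    hits = []
    for event in events:
        value = str(event.get(field, "")).lower()
        if value in index:
            hits.append((event, sorted(index[value])))
    return hits
```

The number of independent feeds listing an indicator is a crude confidence signal that a likelihood model could take as one of its inputs.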
Although reasonable efforts will be made to ensure the completeness and accuracy of the information contained in our blog posts, no liability can be accepted by IANS or our Faculty members for the results of any actions taken by individuals or firms in connection with such information, opinions, or advice. 
