Citisoft Blog

Implementing Data Lakes: 3 Steps to Avoid Creating a Swamp

Written by Chris Guild | May 21, 2019

Kicking off a new data lake implementation and looking for the best place to start? Well, as the adage goes, prior preparation prevents poor performance. There are several factors asset managers should consider early on in an implementation to avoid turning the new lake into a one-way repository of useless data. It is essential to consider lake organization, data lifecycles, and usability to help ensure the lake is a strategic asset.

Before jumping into your data lake, it is important to establish a common definition. I have found the best way to describe a data lake is as a data construct in its most natural state, one that accepts data from source systems in an untransformed format. The primary benefit of this approach, compared to a data warehouse or enterprise data management (EDM) tool, is that it helps overcome the cost and storage problems that may exist within these other tools. To realize the full benefits of a data lake, consider the following points before launching your implementation.

Lake Organization

One of the first misconceptions business users have about a data lake is that it is an unstructured object. With an unstructured approach, a firm can quickly lose the ability to analyze data and derive meaningful insights to help drive the business.

Instead, consider that the data lake can be organized into smaller components called ponds. An easy way to wrap your head around this structure is to view ponds as spaces within the data lake that help begin organizing data. It is essential to recognize that a firm can organize its data in many ways. For example, the business can structure its lake by data types like analog, application, and textual.

Another core question a firm should answer before creating a lake is how data will move between ponds. As a best practice, a raw data pond can be put in place as a staging area for data that has not yet been classified or conditioned in the lake. Data should transfer to an archive pond only when it reaches the end of its useful life in the other ponds. As a result, a lake should include raw and archival data ponds to support the data lifecycle.
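To make the pond structure concrete, here is a minimal Python sketch of a lake organized into raw, analog, application, textual, and archive ponds, with a simple rule table for how data may move between them. The names (Pond, ALLOWED_MOVES, can_move, move) and the specific rules are illustrative assumptions, not part of any particular data lake product; your firm's rules may well differ.

```python
from enum import Enum


class Pond(Enum):
    """Ponds named in the article; the enum itself is a hypothetical construct."""
    RAW = "raw"
    ANALOG = "analog"
    APPLICATION = "application"
    TEXTUAL = "textual"
    ARCHIVE = "archive"


# Assumed movement rules: raw data is classified into a typed pond,
# and typed ponds feed the archive pond at the end of the data's useful life.
ALLOWED_MOVES = {
    Pond.RAW: {Pond.ANALOG, Pond.APPLICATION, Pond.TEXTUAL},
    Pond.ANALOG: {Pond.ARCHIVE},
    Pond.APPLICATION: {Pond.ARCHIVE},
    Pond.TEXTUAL: {Pond.ARCHIVE},
    Pond.ARCHIVE: set(),  # nothing leaves the archive in this sketch
}


def can_move(source: Pond, target: Pond) -> bool:
    """Return True if the rule table permits moving data from source to target."""
    return target in ALLOWED_MOVES[source]


def move(dataset: dict, target: Pond) -> dict:
    """Return a copy of the dataset record placed in the target pond, or raise."""
    source = dataset["pond"]
    if not can_move(source, target):
        raise ValueError(f"Move from {source.value} to {target.value} is not allowed")
    return {**dataset, "pond": target}


# Example: a newly landed file starts in the raw pond and is classified as textual.
landed = {"name": "broker_emails", "pond": Pond.RAW}
classified = move(landed, Pond.TEXTUAL)
```

Writing the rules down as a table, even informally, forces the question of which moves are allowed before any data is loaded.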

Data Lifecycle

Data, like so many other aspects of the financial services domain, goes through a lifecycle. The firm should consider core lifecycle questions like how data should move from pond to pond, the type of activity allowed in each pond, and the kind of conditioning that can occur in each pond.

In the case of data transformation in the lake, your firm should carefully consider how data will move through the conditioning process. A data lake is not simply a repository for data, but also a system that focuses on transforming raw data into information that is used in analytical processes. Organizing the lake into ponds allows the firm to apply different conditioning tasks in each pond, which is why most firms should start by creating a plan for their data ponds. It is also critical to note that data has a shelf life for a business. Up to a point, the data is useful to the firm. But eventually that usefulness runs its course, at which point data can move into an archival pond.
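The shelf-life idea can be sketched in a few lines of Python. The example below flags datasets whose last use is older than a per-pond shelf life so they can be considered for the archival pond. The shelf-life values, field names, and function names are hypothetical assumptions chosen for illustration; actual retention periods depend on your firm's business and regulatory requirements.

```python
from datetime import date, timedelta

# Assumed shelf lives per pond (illustrative values only).
SHELF_LIFE = {
    "analog": timedelta(days=365),
    "application": timedelta(days=7 * 365),
    "textual": timedelta(days=3 * 365),
}


def is_expired(pond: str, last_used: date, today: date) -> bool:
    """Return True if the data has gone unused for longer than its pond's shelf life."""
    return today - last_used > SHELF_LIFE[pond]


def archive_candidates(datasets: list, today: date) -> list:
    """Return the names of datasets that have outlived their shelf life."""
    return [
        d["name"]
        for d in datasets
        if is_expired(d["pond"], d["last_used"], today)
    ]


# Example records (made-up names and dates).
datasets = [
    {"name": "trade_blotter_2009", "pond": "application", "last_used": date(2010, 1, 1)},
    {"name": "research_notes", "pond": "textual", "last_used": date(2018, 6, 1)},
]
candidates = archive_candidates(datasets, today=date(2019, 5, 21))
```

In this sketch, only the first dataset is returned as an archive candidate, since it has gone unused for longer than the assumed seven-year shelf life of its pond.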

Lake Usability

A data lake that is created without assessing how the data will be used limits its strategic potential. The lake can quickly become a one-way path where data is loaded but cannot be extracted to add value to the firm. There are several steps a firm can take to mitigate this risk:

  • Establish Context: Certain data types can lack context if viewed in a vacuum. It is essential to ensure data, like text, is given proper context to prevent ambiguity.
  • Create Metadata: Generate a map of the data that resides in the lake. This will allow data users to better understand and utilize the conditioned data in the lake.
  • Develop a Metaprocess: Produce and tag data processes in the lake. This step answers questions like when data was created, how much was generated in the lake, and who produced the data.

These steps also allow the team to document how the data in the lake can be integrated, which limits the impact of data viewed in silos and helps ensure data will interact across ponds.
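As a rough illustration of the three steps above, the Python sketch below records context, metadata, and metaprocess details for each dataset in a small in-memory catalog. The class names, field names, and sample values are all hypothetical; in practice a firm would likely rely on a dedicated catalog tool, but the information captured would be similar.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CatalogEntry:
    name: str
    pond: str
    context: str            # Establish Context: what the data means
    columns: dict           # Create Metadata: column name -> description
    created_at: datetime    # Metaprocess: when the data was created
    produced_by: str        # Metaprocess: who produced the data
    record_count: int       # Metaprocess: how much data was generated


class Catalog:
    """A minimal in-memory map of the data residing in the lake."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def find_by_column(self, column: str) -> list:
        """Return names of datasets that contain the given column, sorted."""
        return sorted(e.name for e in self._entries.values() if column in e.columns)

    def provenance(self, name: str) -> str:
        """Summarize when, how much, and by whom a dataset was produced."""
        e = self._entries[name]
        return f"{e.record_count} records produced by {e.produced_by} on {e.created_at:%Y-%m-%d}"


# Example entries (made-up names and values).
catalog = Catalog()
catalog.register(CatalogEntry(
    name="positions_daily", pond="application",
    context="End-of-day holdings per portfolio",
    columns={"portfolio_id": "Internal portfolio code", "security_id": "Security identifier"},
    created_at=datetime(2019, 5, 20), produced_by="accounting_feed", record_count=12500,
))
catalog.register(CatalogEntry(
    name="trade_confirms", pond="textual",
    context="Broker confirmation text, tagged by trade",
    columns={"security_id": "Security identifier", "body": "Confirmation text"},
    created_at=datetime(2019, 5, 21), produced_by="ops_mailbox", record_count=340,
))

shared = catalog.find_by_column("security_id")
summary = catalog.provenance("trade_confirms")
```

Looking up a shared column such as security_id across ponds is one simple way to see where data in the lake can be integrated rather than viewed in silos.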

Remember: the first step to preventing your lake from turning into a swamp is to plan the lake’s structure, internal processes, and uses. These key points will also help asset managers ensure that the lake does not become a one-way data repository. With a few planning steps up front, you can build a data lake that is a strategic asset to your firm.