Generally speaking, I view data as a shared responsibility between the technology and operations teams (though we have a lot more to say about holistic data governance on the blog). For a global asset manager, this is a 24/5 operation spanning many teams in different locations. To maximize operational efficiency, asset managers need not only to embrace a culture of data but also to understand and optimize their data processes. In my experience, this is easier said than done, a lesson I learned the hard way working with an extremely data-centric asset manager.
This $250B asset manager had truly embraced a culture of data. If the most recent analytics and holdings were not available when expected, they were quick to raise red flags at the highest levels. When it came time for a major front-office conversion, there were concerns that the existing data warehouse/data mart would not be able to support the business reporting systems in this data-centric environment.
The firm decided to purchase an additional data warehouse solution with a STAR schema built on a different database language. The architecture was set up to move or replicate the data six times. The process was as follows:
1. The data was loaded into a data warehouse.
2. The data was transformed from the warehouse into the data mart.
3. The data was extracted from the data mart into a different format by a proprietary tool to prepare for the STAR schema.
4. The data was loaded into the STAR schema while users querying the data were “flipped” to the disaster recovery environment.
5. Users were flipped back to the production STAR schema, and a table copy job ran to update the disaster recovery environment.
6. The data was extracted into the client reporting systems.
As you can imagine, pushing the same data through this many environments took several hours, with technology and operations staff hand-holding each step. If a product, security, analytic, or holding update missed the cut-off before the replication process started, it had to wait for the next run. Data consumers would usually wait until the next day to see the updates they’d expected.
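The compounding delay of a multi-hop pipeline like this can be illustrated with a small sketch. The step durations and run times below are hypothetical (the actual process ran on proprietary tooling), but the cut-off logic mirrors what the data consumers experienced: miss the batch start, and your update rides the next run end to end.

```python
from datetime import datetime, timedelta

# Hypothetical durations for each of the six replication hops described above.
STEPS = [
    ("load into data warehouse",       timedelta(hours=1)),
    ("transform into data mart",       timedelta(hours=1)),
    ("extract for STAR schema",        timedelta(minutes=45)),
    ("load STAR schema (users on DR)", timedelta(hours=1)),
    ("table copy back to DR",          timedelta(minutes=45)),
    ("extract to client reporting",    timedelta(minutes=30)),
]

def availability(update_time, run_starts):
    """Return when an update becomes visible to data consumers.

    An update that misses a run's cut-off waits for the next scheduled
    run, then incurs the full end-to-end pipeline duration.
    """
    pipeline = sum((d for _, d in STEPS), timedelta())
    for start in sorted(run_starts):
        if update_time <= start:  # made the cut-off for this run
            return start + pipeline
    return None  # missed every run today: not visible until tomorrow

# Two hypothetical daily runs, 6:00 and 18:00.
runs = [datetime(2024, 1, 8, 6, 0), datetime(2024, 1, 8, 18, 0)]
early = availability(datetime(2024, 1, 8, 5, 30), runs)   # makes 6:00 run
late = availability(datetime(2024, 1, 8, 18, 15), runs)   # misses both
```

Even the update that makes the morning cut-off is invisible for the full five-hour pipeline; the one that misses the evening run gets nothing until the next day, which is exactly the behavior the firm's data consumers learned to live with.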
Instead of identifying and fixing the root cause (the perceived slowness of the existing data warehouse), the solution was to replicate the data into another data warehouse with a different schema and language. At first the new schema worked well, but after a couple of years it was outperformed by the original data warehouse. The problem was only deferred (or, you could say, “replicated”).
This solution was well-intentioned and hindsight is 20/20, but this is an example where tackling the underlying issue would have been prudent. I offer a few suggestions to improve database performance before undertaking a costly and inefficient workaround:
Archiving/deletion: Establish a corporate policy on data retention. Historical data should be archived, and archived data can be restored if necessary. Delete data that is redundant or no longer required.
New hardware: Purchase new hardware that matches your vendors’ recommendations. The life span of a server is three to five years, and servers keep getting faster and smaller; a hardware refresh can drastically improve performance.
Performance optimization: Database analysts and developers at an asset manager can work together to performance-tune the database. This is a tricky and time-consuming task that often requires an outside specialist, but it pays huge dividends.
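The first suggestion, a retention policy, reduces to a simple rule once the windows are agreed upon. This sketch uses made-up retention windows (an actual policy would come from the firm's data-governance standards) to partition records by age into what stays online, what gets archived, and what gets purged:

```python
from datetime import date, timedelta

# Hypothetical retention windows; real values would be set by corporate policy.
ACTIVE_WINDOW = timedelta(days=365 * 2)   # keep two years queryable online
ARCHIVE_WINDOW = timedelta(days=365 * 7)  # retain up to seven years offline

def classify(record_date, today):
    """Classify a record as 'active', 'archive', or 'delete' by age."""
    age = today - record_date
    if age <= ACTIVE_WINDOW:
        return "active"
    if age <= ARCHIVE_WINDOW:
        return "archive"  # moved offline, restorable if needed
    return "delete"       # past retention: purge

# Example: a recent holding stays online, an old one is archived.
today = date(2024, 1, 8)
recent = classify(date(2023, 6, 1), today)
old = classify(date(2019, 1, 1), today)
```

Keeping the active window small is what actually relieves the database: the warehouse only carries the data that users query daily, while everything else sits in cheaper archive storage with a documented path back.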
If the asset manager from my anecdote had followed these tips, it would have saved significant cost over the long run. In cases like this, it’s not enough to champion data stewardship if inefficiencies are simply being covered up by new processes. If you’re taking a strategic view of your organization’s data processes, resist the urge to over-engineer a solution to a problem that can be solved by optimizing performance. In my experience, the simplest solution is often the right one.