Offloading not-so-hot data from your data warehouse without losing value

Have you dreamt of gaining valuable insights from all the data you’ve collected over many years of business without adding an unbearable burden on your data warehouse?

There are many good reasons to limit the amount of data in a data warehouse: costs, storage capacity, and backup and restore limitations, to name only a few. Business users and IT personnel have long separated hot data from warm and cold data. The terms are an analogy for how frequently the data has been queried by business users over the last couple of months:

  • Hot data: data accessed today and during the last week
  • Warm data: data accessed during the last month
  • Cold data: data accessed during the last year (or even longer)
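
To make the categorization concrete, here is a minimal Python sketch that assigns a temperature to a record based on its last access time. The `temperature` function and the exact thresholds (a week, a month, beyond) are illustrative assumptions following the rough definitions above, not part of any product:

```python
from datetime import datetime, timedelta

def temperature(last_accessed, now=None):
    """Classify data by the recency of its last access.

    Thresholds are illustrative: hot within a week, warm within a
    month, cold anything older (up to a year or even longer).
    """
    now = now or datetime.now()
    age = now - last_accessed
    if age <= timedelta(days=7):
        return "hot"
    if age <= timedelta(days=30):
        return "warm"
    return "cold"

now = datetime(2015, 6, 1)
print(temperature(datetime(2015, 5, 30), now))  # hot
print(temperature(datetime(2015, 5, 10), now))  # warm
print(temperature(datetime(2014, 1, 15), now))  # cold
```

In practice the hard part is not this classification rule but obtaining reliable last-access timestamps in the first place, which is exactly the monitoring effort described below.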
The effort required for this categorization is quite high, since you can only act on what has been measured for a long enough period of time (monitoring databases and business intelligence platforms, searching in logs and more). The data supply chain can be long, complex and can potentially consist of many different systems, which need to be included in this consideration. Very often, clients come up with custom-programmed dashboards to show the top 100 queries accessing the data stored in the data warehouse. But who knows if a specific query that needs data from two years ago is less important than another that needs only current data?

Technology can help conquer this time-consuming, cost-intensive and tedious task using (even more) cost-intensive features like data temperature, which is available in some relational databases and allows the administrator to place warm and cold data on cheaper storage (such as slower disks or even tape).

Best practice

A valid best practice for many years was to archive cold data on tape (or other slow and cheap storage media). The most obvious disadvantage is the unavailability of the archived data when immediate access is required; the data has to be restored before a user can actually query it.

This restricts the business from potentially valuable insights. Who knows whether you will need the data sooner or later? Why should you give up these insights just because of technology and costs?

There is a solution available today that addresses this challenge without changing the relational database hosting your existing data warehouse.

InfoSphere BigInsights for Apache Hadoop as an active archive

With Apache Hadoop, a data warehouse archive becomes nothing less than an extension of the original data warehouse. Hadoop adds storage and processing capacity at a much lower total cost of ownership (TCO) than a data warehouse. The cold data can be extracted and off-loaded from the data warehouse into the Hadoop cluster, residing in Hive tables, with virtually no restriction on volume since Hadoop storage is cheap.
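
The off-loading job itself is conceptually simple: select the cold rows, copy them into the archive, then delete them at the source. Here is a minimal Python sketch of that shape, using sqlite3 as a stand-in for both the warehouse and the Hive archive; the table name, cutoff date and sample rows are invented for illustration (a real job would bulk-export into a Hive table on the cluster):

```python
import sqlite3

# Illustrative cutoff: everything before this date counts as cold.
CUTOFF = "2014-01-01"

# Stand-in "warehouse" with one cold and one hot row.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (id INTEGER, sale_date TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, "2013-06-01", 100.0),   # cold: before the cutoff
    (2, "2015-05-20", 250.0),   # hot: recent
])

# Stand-in "archive" (in reality, a Hive table on the Hadoop cluster).
archive = sqlite3.connect(":memory:")
archive.execute("CREATE TABLE sales (id INTEGER, sale_date TEXT, amount REAL)")

# 1) Select the cold rows, 2) copy them to the archive, 3) delete at the source.
cold_rows = warehouse.execute(
    "SELECT id, sale_date, amount FROM sales WHERE sale_date < ?", (CUTOFF,)
).fetchall()
archive.executemany("INSERT INTO sales VALUES (?, ?, ?)", cold_rows)
warehouse.execute("DELETE FROM sales WHERE sale_date < ?", (CUTOFF,))

print(len(cold_rows))  # 1 row off-loaded
```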

Structured Query Language (SQL) to access the data

The big advantage over an archive residing on tape is that you can still access the data on Hadoop with standard ANSI SQL. The business user requesting data from years ago will not notice a difference, apart from a slightly longer access time, and no change to the business application needs to be implemented.
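
Because the archive speaks standard ANSI SQL, the very same query text can be sent to either system. A small sketch of that idea, again simulating both backends with sqlite3 (the query and table are made up for illustration):

```python
import sqlite3

# The same ANSI SQL statement runs unchanged against the warehouse and
# against the archive, so the consuming application does not change.
QUERY = "SELECT COUNT(*) FROM sales WHERE amount > 50"

for name in ("warehouse", "archive"):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
    db.execute("INSERT INTO sales VALUES (1, 99.0)")
    count, = db.execute(QUERY).fetchone()
    print(name, count)
```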

Does this sound good? Well, if you are open to reconsidering your database selection, I have another secret hint for you. Why not use an analytic data warehouse appliance without any compromise on performance and storage capacity?

IBM PureData System for Analytics: the analytic data warehouse appliance

Instead of trying to find the best trade-off for the challenges discussed above, you could consider a technology that dramatically reduces the effort and cost of your data warehouse platform. IBM PureData System for Analytics, powered by Netezza technology, is the most advanced data warehouse appliance on the market, engineered to simplify analytics with plenty of storage and processing power: up to 1.5 petabytes of user data, more than 6 terabytes of RAM, and hundreds of high-performance CPUs and Field Programmable Gate Arrays (FPGAs), crunching all your business data in seconds.

PureData System for Analytics does not distinguish between hot, warm and cold data since it can hold and process all of it in a fraction of the time conventional databases need.

And (just in case) if you still need more processing and storage for your data, the integration between PureData System for Analytics and Hadoop is easy and mature. With both technologies you are well prepared for the age of big data.

With the right approach and platform, today’s big data requirements can be mastered. Why not start today? Connect with me on Twitter to continue the discussion.