Hadoop Warehouse
Your Next Data Warehouse should be on Hadoop
by Sridhar Kolinjavadi on September 16th, 2015

Get your act together, the time is now

Enterprise Data Warehouses (EDWs) and Operational Data Stores have been the mainstay of many an organization seeking an enterprise view of its data.

Different organizations have had varying levels of success getting there. A whole lot of design, ETL, and more than a little back and forth (in the style of two steps forward, if lucky, and one back) goes into development.

Behind it all, more often than not, there are committees of Data Architects from different parts of the business, each presenting their view of which data from their line of business belongs in the EDW, and each with varying demands for what the EDW should provide in return. Once that is settled, there comes a time to agree on how the data from the sources gets to the EDW, how often, and how accuracy will be maintained in the EDW.

The EDW architect is the anchor around which these disjoint and competing interests revolve, and it's a rare organization (and most probably a small one) where a coherent vision for the EDW can be drawn, executed and accomplished.

EDWs are, more often than not, White Elephants. While contending to become the single source of an enterprise view of data, they almost always fail in that goal and end up being just another data store that downstream analytical stores turn to when left with no other option. It takes a rare case, or an unchanging business model, for the EDW as envisioned to fulfill its purpose by the time it's implemented.

Add to that the difficulty of adapting the EDW when an upstream system changes or is replaced, and it is a big ticking timer before the difficulties of maintaining, paying for and sufficiently resourcing the EDW outweigh its logically glorious yet, in the light of day, limited benefits.

Hadoop is the first serious technology option we have had that allows a loosely coupled EDW to be built. Perhaps the time for increasing the success rate of EDWs is right now.

Let's look at how an EDW built on Hadoop differs from one built on an RDBMS or other similar database technology. Let me warn you that I don't intend to highlight the Hadoop toolset, the reliability of HDFS, or the ways you can get data out. These are all topics for another day, and quite honestly it's possible a few new tools have crept in since I started writing this piece.

Hadoop allows you to consume just about anything you have to give it. This lets us increase the number of sources that can be correlated within the EDW, now and in the future, which in effect makes the E in EDW ring truer and more representative than ever before.
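
To make the idea concrete, here is a minimal sketch in plain Python (not Hadoop itself) of keeping two differently shaped sources in their native formats and correlating them only at read time. The sample data, the field names and the `correlate` helper are all hypothetical, chosen purely for illustration.

```python
import csv
import io
import json

# Hypothetical raw sources, kept in their native formats (assumed sample data).
ORDERS_CSV = "customer_id,amount\nc1,120.50\nc2,75.00\nc1,30.00\n"
CLICKS_JSONL = (
    '{"customer_id": "c1", "page": "/pricing"}\n'
    '{"customer_id": "c3", "page": "/home"}\n'
)


def read_csv(text):
    """Parse CSV text into a list of dicts; structure is applied only at read time."""
    return list(csv.DictReader(io.StringIO(text)))


def read_jsonl(text):
    """Parse newline-delimited JSON into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def correlate(orders, clicks, key="customer_id"):
    """Group records from both sources under their shared key."""
    combined = {}
    for source_name, rows in (("orders", orders), ("clicks", clicks)):
        for row in rows:
            entry = combined.setdefault(row[key], {"orders": [], "clicks": []})
            entry[source_name].append(row)
    return combined


view = correlate(read_csv(ORDERS_CSV), read_jsonl(CLICKS_JSONL))
```

Neither source had to be reshaped to fit the other before landing. A customer seen in only one source still shows up in the combined view, with an empty list for the source that has nothing on them.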

HCatalog allows us to progressively manage the metadata layer on these varying sets of data, which would not have lived together in an RDBMS. This tool, or something like it, is going to become more and more important to Hadoop: the linchpin, if you will.

The development of the EDW inside the Hadoop ecosystem can be less stringent and more flexible than traditional initiatives of this kind. At a minimum, the EDW team should be aware of all that goes into the cluster, and information on what goes in should be shared freely. A progressive implementation of metadata standards across the organization will help this process become streamlined in due course. However, this does not have to be overly rigid as it would be on an RDBMS, which ensures that the EDW team is not in the way of what users of the cluster want to achieve.

There is no doubt that HCatalog will help in disseminating this information across the organization, but only if standards are agreed on how the catalog is built and what supporting documentation is created and checked in along with it. As in all things, this has to evolve, and there is no easy way around it.
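
As a rough illustration of what a lightweight, evolving catalog standard could look like, the sketch below requires only a few fields at registration and lets teams enrich an entry later. This is not the HCatalog API. The `Catalog` and `CatalogEntry` classes, the required fields and the sample paths are all assumptions made up for this example.

```python
from dataclasses import dataclass, field

# Assumed minimal standard: every entry needs these fields at registration time.
REQUIRED_FIELDS = ("name", "owner", "location")


@dataclass
class CatalogEntry:
    name: str
    owner: str
    location: str
    # Optional metadata that teams can fill in progressively.
    columns: dict = field(default_factory=dict)
    description: str = ""


class Catalog:
    """Hypothetical in-memory catalog, standing in for a shared metadata layer."""

    def __init__(self):
        self._entries = {}

    def register(self, **metadata):
        """Accept an entry as long as the minimal agreed fields are present."""
        missing = [f for f in REQUIRED_FIELDS if not metadata.get(f)]
        if missing:
            raise ValueError(f"missing required metadata: {missing}")
        entry = CatalogEntry(**metadata)
        self._entries[entry.name] = entry
        return entry

    def enrich(self, name, **updates):
        """Add or update optional metadata on an existing entry."""
        entry = self._entries[name]
        for key, value in updates.items():
            if not hasattr(entry, key):
                raise AttributeError(f"unknown metadata field: {key}")
            setattr(entry, key, value)
        return entry

    def undocumented(self):
        """List entries still lacking a description or column information."""
        return sorted(
            name for name, e in self._entries.items()
            if not e.description or not e.columns
        )


catalog = Catalog()
catalog.register(name="orders", owner="finance", location="/data/raw/orders")
catalog.register(name="web_clicks", owner="marketing", location="/data/raw/web_clicks")
catalog.enrich(
    "orders",
    description="Daily order extracts",
    columns={"customer_id": "string", "amount": "decimal"},
)
backlog = catalog.undocumented()
```

The EDW team stays aware of everything that lands (registration is mandatory but cheap), while the `undocumented` list gives a gentle backlog of entries that still need supporting documentation.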

The EDW team does not need to issue edicts on how data gets there, as long as there is agreement on how often it gets there. The only area where the EDW team needs to determine the method of data acquisition, its schema and its frequency is for data sources whose acquisition the EDW team is responsible for.
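
An agreement on "how often" can be checked without caring how the data arrived. The following is a minimal sketch under assumed conditions: the dataset names, the agreed intervals and the arrival timestamps are hypothetical, and in practice the last-arrival times would come from whatever the cluster records about each landing.

```python
from datetime import datetime, timedelta

# Hypothetical agreements: dataset name -> maximum allowed gap between arrivals.
AGREED_FREQUENCY = {
    "orders": timedelta(hours=24),
    "web_clicks": timedelta(hours=1),
}


def stale_datasets(last_arrival, now, agreed=AGREED_FREQUENCY):
    """Return datasets whose latest arrival is older than agreed, or that never arrived."""
    stale = []
    for name, max_gap in agreed.items():
        arrived = last_arrival.get(name)
        if arrived is None or now - arrived > max_gap:
            stale.append(name)
    return sorted(stale)


now = datetime(2015, 9, 16, 12, 0)
last_arrival = {
    "orders": datetime(2015, 9, 16, 2, 0),       # 10 hours ago, within the 24-hour agreement
    "web_clicks": datetime(2015, 9, 16, 9, 30),  # 2.5 hours ago, beyond the 1-hour agreement
}
late = stale_datasets(last_arrival, now)
```

The check says nothing about which tool loaded the data; it only holds each source to the frequency its owners agreed to.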

This haze of uncertainty, which you no doubt detect and dislike like the haze over Singapore, does exist in this architecture; I won’t deny it. But given the speed of evolution of this ecosystem and the demands different stakeholders will place upon it within an organization, it is likely that different teams using Hadoop are going to use different sets of tools to put data in, take it out, and do everything in between. To ignore this fact and assume that this is like an Oracle database with Informatica on the input and Cognos on the output will be akin to missing the Hadoop Elephant for the tall grass it’s foraging in.

Immerse yourself in the flexible, loosely coupled topology of the Hadoop landscape, take in the sights and sounds, take what you want, and realize that the sum of all this chaos is your EDW, better than ever before, alive with information and possibility.
