Hadoop Warehouse
Posted on October 2nd, 2015

Hadoop World, with Strata thrown in for broader relevance, was a great place to be. It was a whirlwind of activity: glimpses of a fast-changing landscape mixed with healthy doses of salesmanship and a Sparkling reminder (Spark was everywhere and on everyone's lips) of what's in and what's out, and of how some older vendors (Syncsort, and to a lesser extent IBM) have been inventive enough to stake a claim to a piece of the pie everywhere, including here.

The conference floor was littered with BI and Data Discovery vendors supporting Hadoop, and it was well-nigh impossible to walk a meter or two without bumping into at least one vendor in that space, if not two.

Of course there was talk of Data Lakes, Reservoirs, Irrigation (What? STOP! That's when you know you've gone too far with metaphors) and a horde of animals, wild and domesticated, running rampant across the terrain of the conference. Apache Kafka appeared to be a form of relief (in name only, though) from this horde as you took a moment, Kafkaesque if you like, to ponder how these Lakes, Kudus, Hawqs, RedPoints, tamrs, Datos and countless others require you to "Splunk" your money in for returns that may or may not be worth your time.

There were some interesting speakers in the keynotes, and two stood out in my mind. The first, Joseph Sirosh with "What 0-50 million users in 7 days can teach us about big data", was brilliant in getting across how Azure can scale your business, if you need it, from 0 to 50 million users in a week, without blatantly promoting Microsoft. More importantly, he showed that in today's business climate anything you put out to the world can scale up faster than you could envision in your wildest dreams. This is especially true for any small player putting an app or website out there with just a tiny hope that it will become something substantial in due course. Due course can now be a matter of minutes to just a few hours, so be prepared for it. His talk was entertaining and informative, and well worth watching.

The second, and the most outstanding, was Maciej Ceglowski with "Haunted by Data". It was a passionate plea for reflection from everybody working in the data sciences, and an early warning not to lose our sense of right and wrong while the power of data beckons us to arm ourselves further to compete. I will say no more except that it was rollickingly hilarious and yet the most meaningful oration of the entire conference. Watch it.

Lastly, the security layers and infrastructure in and around Hadoop have to improve. There were some promising trends in this area, though it is surprisingly still lagging given that, as one of the speakers mentioned above showed, this thing grows like the Hulk given the right conditions.
In conclusion, I'll say I'll be there next year to find out how this industry has progressed and what I can get for my clients from it.

The large majority of vendors, barring Microsoft, Intel, Tableau and a handful of others, fell short of appealing to any Enterprise except those willing to spare money for experimentation. I found most vendors lacking any comparison to established tools in the Enterprise infrastructure, such as Kafka (vendor: Confluent) against MQ Series or MSMQ, MemSQL and VoltDB against Oracle TimesTen (or whatever Oracle sells now in this space), or some of the database offerings (Cloudera and other Hadoop ecosystem vendors) against, say, Oracle Real Application Clusters. They appeared to just assume that since it's from the "cool" Open Source world, it's got to be good. With all these corporations running rampant in this "cool" world, I don't think it's quite so cool any longer, so let's recognize that and get down to business, and fast.

I have no doubt that the architecture of Hadoop, and of the many vendor products that use the amazing engineering within Hadoop (directly or in spirit), in most cases beats those slightly older technologies. However, it would be good to see head-to-head comparisons to help move some of these pre-Hadoop systems into the Hadoop infrastructure. That's just the plea from this lone Enterprise Data and Solutions Architect.

by Sridhar Kolinjavadi on September 16th, 2015

Get your act together, the time is now

Enterprise Data Warehouses (EDWs) and Operational Data Stores have been the mainstay of many an organization for an Enterprise view of data.

Different organizations have had varying levels of success getting there. A whole lot of design, ETL and more than a little back and forth (two steps forward if lucky, and one back) goes into an EDW's development.

Behind it all, more often than not, there are committees of Data Architects from different parts of the business, each presenting their view of which data from their line of business makes the most sense in the EDW, with varying demands for what the EDW should provide in return. Once that is settled, there comes a time to agree on how data from the sources gets to the EDW, how often, and how accuracy will be maintained in the EDW.

The EDW architect is the anchor around which these disjoint and competing interests revolve, and it's a rare organization (and most probably a small one) where a coherent vision for the EDW can be drawn, executed and accomplished.

EDWs are, more often than not, White Elephants. While aspiring to become the single source of an Enterprise view of data, they almost always fail in that goal and end up being just another data store that downstream analytical stores go to when left with no other option. It's a rare case, or an unchanging business model, where the EDW as envisioned fulfills that purpose by the time it's implemented.

Add to that the difficulty of adapting the EDW when an upstream system changes or is replaced, and it is a Big ticking timer before the difficulties of maintaining, paying for and sufficiently resourcing the EDW outweigh its benefits, which are glorious in logic yet limited in the light of day.

Hadoop is the first serious technology option we have had that allows a loosely coupled EDW to be built. Perhaps the time for increasing the success rate of EDWs is right now.

Let's look at how Hadoop is different from an EDW built on an RDBMS or other similar database technology. Let me warn you that I don't intend to highlight the Hadoop toolset, the reliability of HDFS or the ways you can get data out. These are all topics for another day, and quite honestly it's possible a few new ones have crept in since I started writing this piece.

Hadoop allows you to consume just about anything you have to give it. This lets us increase the number of sources that can be correlated within the EDW, now and in the future, which in effect makes the E in EDW ring truer and more representative than ever before.
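To make "consume just about anything" a little more concrete, here is a minimal sketch in plain Python rather than any Hadoop tool. The sample data, field names and function names are all my own assumptions for illustration. It keeps two sources in their native formats (CSV and JSON lines), applies a schema only when reading, and correlates them on a shared key.

```python
import csv
import io
import json

# Hypothetical raw inputs, kept in their native formats as they might land in a cluster.
ORDERS_CSV = "customer_id,amount\n42,19.99\n7,5.00\n42,3.50\n"
CLICKS_JSONL = (
    '{"customer_id": "42", "page": "/pricing"}\n'
    '{"customer_id": "7", "page": "/home"}\n'
)


def read_orders(raw):
    """Apply a schema to the CSV text only at read time."""
    for row in csv.DictReader(io.StringIO(raw)):
        yield {"customer_id": row["customer_id"], "amount": float(row["amount"])}


def read_clicks(raw):
    """Apply a schema to the JSON lines only at read time."""
    for line in raw.splitlines():
        if line.strip():
            yield json.loads(line)


def correlate(orders, clicks):
    """Combine both sources into one per-customer view keyed on customer_id."""
    view = {}
    for order in orders:
        entry = view.setdefault(order["customer_id"], {"total": 0.0, "pages": []})
        entry["total"] += order["amount"]
    for click in clicks:
        entry = view.setdefault(click["customer_id"], {"total": 0.0, "pages": []})
        entry["pages"].append(click["page"])
    return view


view = correlate(read_orders(ORDERS_CSV), read_clicks(CLICKS_JSONL))
```

Neither source had to be reshaped into a common table before it was stored; the shape is imposed only at the point of use, which is the property I am leaning on here.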

HCatalog allows us to progressively manage the metadata layer on these varying sets of data, which would not have lived together in an RDBMS. This tool, or something like it, is going to become more and more important to Hadoop, the linchpin if you will.

The development of the EDW inside the Hadoop ecosystem can be less stringent and more flexible than traditional initiatives of this kind. At a minimum, the EDW team should be aware of all that goes into the cluster, and information on what goes in should be shared freely. A progressive implementation of metadata standards across the organization will help this process become streamlined in due course. However, this does not have to be overly rigid like it would be on an RDBMS, which ensures that the EDW team is not in the way of what users of the cluster want to achieve.

There is no doubt that HCatalog will help in disseminating this information across the organization, but only if there are agreed standards on how this catalog is built and what subsidiary documentation is created and checked in along with it. As in all things, this has to evolve and there is no easy way around it.
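As a rough illustration of what an agreed standard could look like in practice, here is a small Python sketch. It does not use the HCatalog API; the required field names and the sample entry are purely my assumptions. The idea is only that each dataset registered in the catalog carries a minimum set of documentation, and that this can be checked mechanically.

```python
# Hypothetical minimum documentation an organization might agree on for every catalog entry.
REQUIRED_FIELDS = ("owner", "source_system", "description", "refresh_frequency")


def missing_fields(entry):
    """Return the required fields that are absent or blank in a catalog entry."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(entry.get(field, "")).strip()
    ]


# Hypothetical entry, as a team might submit it alongside a new dataset.
entry = {
    "table": "web_clickstream",
    "owner": "marketing-analytics",
    "source_system": "web servers",
    "description": "",
}

problems = missing_fields(entry)
```

Here `problems` lists `description` and `refresh_frequency`, so the entry would go back to its owner before being accepted. The list of required fields is exactly the part that has to evolve with the organization.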

The EDW team does not need to issue edicts on how data gets there, as long as there is agreement on how often it gets there. The only area where the EDW team needs to determine the method of data acquisition, its schema and its frequency is the data sources whose acquisition the EDW is responsible for.
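An agreement on frequency can be checked without caring how the data arrived. The sketch below, again plain Python with hypothetical dataset names and intervals of my own choosing, flags datasets whose last load is older than the agreed interval.

```python
from datetime import datetime, timedelta

# Hypothetical agreements: dataset name -> longest gap allowed between loads.
AGREED_FREQUENCY = {
    "crm_contacts": timedelta(days=1),
    "web_clickstream": timedelta(hours=1),
}


def overdue_datasets(last_loaded, now):
    """Return the names of datasets that were never loaded or are past their agreed interval."""
    overdue = []
    for name, max_gap in AGREED_FREQUENCY.items():
        loaded_at = last_loaded.get(name)
        if loaded_at is None or now - loaded_at > max_gap:
            overdue.append(name)
    return sorted(overdue)


# Hypothetical load times, however each team chose to get its data in.
now = datetime(2015, 9, 16, 12, 0)
last_loaded = {
    "crm_contacts": datetime(2015, 9, 16, 0, 0),
    "web_clickstream": datetime(2015, 9, 16, 9, 0),
}

late = overdue_datasets(last_loaded, now)
```

The check says nothing about which tool did the loading, which is the point: the EDW team polices the agreement, not the method.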

This haze of uncertainty, which you no doubt detect and dislike like the haze over Singapore, does exist in this architecture; I won't deny it. But given the speed of evolution of this ecosystem and the demands different stakeholders within an organization will place on it, it is likely that different teams using Hadoop are going to use different sets of tools to put data in, take it out and do everything in between. To ignore this fact and assume that this is like an Oracle database with Informatica on the input and Cognos on the output ports would be akin to missing the Hadoop Elephant for the tall grass it's foraging in.

Immerse yourself in the flexible, loosely coupled topology of the Hadoop landscape, take in the sights and sounds, take what you want, and realize that the sum of all this chaos is your EDW, better than ever before, alive with information and possibility.