Wednesday, December 31, 2014

The Case for Big Data: Redux

It's December 31st, 2014, and everyone is getting introspective or creating their "top N" lists for 2015. So, not having been active for the past year, I figured I would start writing again with my own end-of-year predictions.

I've often said that data analytics and cloud computing were made for each other. In fact, I believe this so strongly that I've included this in talks I've given at various conferences and academic institutions. Of course, it's clear that the user community has adopted Hadoop as a de facto standard analytics tool running on whatever cloud service provider's infrastructure. The trick is, how will this evolve in the coming years?

AWS was the first cloud service provider (CSP) to offer a data analytics platform on demand. Soon thereafter, as Hadoop matured, other CSPs followed, including Google, Rackspace and Greenplum (acquired by EMC in 2010 for $300 million, subsequently spun off under Pivotal and rebranded to HAWQ), among others. Few would dispute that this signals the start of the initial wave of adoption of large-scale data analytics by innovators and early adopters.

Clearly, larger organizations constrained by regulations and laws will opt to build in house. The corollary is that, for smaller-scale use and for the vast majority of adopters, the clear path is to use these low-cost service providers as a test bed for onboarding this new paradigm.

Accenture took a stab at this (video of the presentation here, deck here, white paper here) and does a relatively good job (albeit slightly dated to 2013) of estimating the TCO of an on-premises deployment ($21,845), then using that figure as the budget for an equivalent AWS deployment, which yields an estimated instance count for each of three potential flavors (68x m1.xlarge, 20x m2.4xlarge, 13x cc.8xlarge). The model somewhat oversimplifies hardware acquisition and assumes a refresh cycle of some sort, so take it for what it's worth.
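To see how that budget-to-instances conversion works, here is a minimal sketch. The hourly rates and runtime assumption below are placeholders of my own, not the study's actual figures, so the counts won't match Accenture's; plug in real pricing to reproduce theirs.

```python
# Minimal sketch: converting a fixed TCO budget into on-demand instance
# counts. The hourly rates below are illustrative placeholders, NOT
# actual AWS pricing or the figures from the Accenture study.
budget = 21_845          # budget in USD, per the on-premises TCO estimate
hours = 24 * 30          # assume one month of round-the-clock runtime

hourly_rate = {          # hypothetical on-demand rates (USD/hour)
    "m1.xlarge": 0.35,
    "m2.4xlarge": 0.98,
    "cc.8xlarge": 2.00,
}

for flavor, rate in hourly_rate.items():
    count = budget // (rate * hours)
    print(f"{flavor}: ~{count:.0f} instances for the month")
```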


That said, organizations using public CSP services need to do some extra math to figure out whether there is an inflection point; in other words, does it become more expensive to grow beyond a certain threshold in an on-demand environment than to deploy in-house on bare metal? There is no easy answer to this question, because it depends on the organization's choice of hardware vendor, utilization rates, labor costs, and so on.
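Here is a minimal sketch of that break-even math; every figure below is an assumption of mine, so substitute your own vendor quotes, utilization rates and labor costs.

```python
# Minimal sketch of the inflection-point math: per node-year, when does
# owning hardware beat renting it? Every figure here is an assumption.
on_demand_hourly = 0.50       # assumed cloud cost per node-hour (USD)
utilization = 1.0             # fraction of the year the node is busy
hours_per_year = 24 * 365

capex_per_node = 6_000        # assumed purchase price per bare-metal node
refresh_years = 3             # assumed hardware refresh cycle
opex_per_node_year = 1_500    # assumed power, space, and admin labor

cloud = on_demand_hourly * hours_per_year * utilization
onprem = capex_per_node / refresh_years + opex_per_node_year

print(f"cloud:   ${cloud:,.0f} per node-year")
print(f"on-prem: ${onprem:,.0f} per node-year")

# Break-even utilization: below this, on-demand wins; above it, bare
# metal wins.
print(f"break-even utilization: {onprem / (on_demand_hourly * hours_per_year):.0%}")
```

With these made-up numbers, a cluster that stays busy more than about 80% of the time favors bare metal; idle-heavy workloads favor on-demand.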

So how does this evolve in the coming years? Data analytics adoption will continue to grow, and I think we'll see at least the following:

  • More public service providers entering the market with analytics offerings;
  • New tools being offered as a service;
  • A growing skill gap that organizations will have to scramble to fill.
You decide what comes of these predictions.

Friday, April 11, 2014

To know the future is to change the future

Earlier this week I was stuck in traffic, thinking about time travel (don't ask) and the time travel paradox: basically, if you travel back in time and prevent your grandfather from meeting your grandmother, you would never exist because your parent would never be born. At the same time, there was a discussion on the radio about social issues and how to break negative outcome cycles (like dropping out of school). So, naturally, I wondered what effect having knowledge of a likely future outcome would have on that future. I know: geeky, dorky and confusing all at the same time...

The current state of Predictive Analytics and Big Data is that Data Scientists study and manipulate data to create models in order to test hypotheses. So, it stands to reason that the better a model becomes at predicting future behavior, the closer we get to seeing into the future; in other words, predictive models are akin to the arcane art of predicting the future. Thus, according to the time travel paradox, changing something as a consequence of this knowledge would necessarily change the future and therefore break the model or invalidate the prediction. 

Let's come back to social issues like high school dropout rates. If we create a model to predict dropout rates, it will help us determine which segment of the teenage population is at risk of dropping out. Now, if we instruct social workers to monitor and educate at-risk teens, and this (ideally) causes rates to drop, we will have broken the model: it no longer reflects reality, and therefore we can no longer see into the future.

This is a little naive, of course, because statistical models can, and SHOULD, be adjusted in an iterative fashion. In fact, if we take this adjustment into account along with the availability of real-time data streams, we would expect the resulting predictions to evolve, which would mean that no matter what action we take based on our model, it would always be accurate (within statistical parameters, of course).
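To make that concrete, here is a minimal sketch of such a feedback loop using scikit-learn's SGDClassifier, which supports incremental updates via partial_fit (the loss is named "log_loss" in scikit-learn 1.1+). The features, outcomes and risk threshold are synthetic placeholders, not a real dropout dataset.

```python
# Minimal sketch of the iterative loop described above: update a
# dropout-risk model each semester as new outcomes arrive, so the
# predictions track a reality the interventions themselves keep changing.
# Features, outcomes, and the risk threshold are synthetic placeholders.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss")   # online logistic regression
classes = np.array([0, 1])               # 0 = stayed in school, 1 = dropped out

for semester in range(6):
    X = rng.normal(size=(200, 4))        # placeholder features (attendance, grades, ...)
    y = (X[:, 0] + rng.normal(size=200) > 1.0).astype(int)  # placeholder outcomes
    model.partial_fit(X, y, classes=classes)  # incremental update, not a full refit

    at_risk = model.predict_proba(X)[:, 1] > 0.5
    print(f"semester {semester}: {at_risk.sum()} students flagged for intervention")
```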

Just don't try going back in time. I advise against it.

Tuesday, January 7, 2014

Data science and ethical considerations

I think we can all agree that data science is growing in importance and popularity as a means to increase insight into, and meaningful interaction with, customers. Personal information and browsing histories are common inputs for recommendation engines. However, the technology has evolved to include a very subversive tool in the effort to market to you more efficiently: YOU.

Marketers conceive of marketing campaigns all the time. Sometimes, if they have access to a Data Scientist, they test a campaign before rolling it out. Let's say the campaign is a holiday coupon, either 10% or 20% off. The Data Scientist would design a basic experiment to determine whether profits are higher with the 10% or the 20% coupon, and Marketing would decide which discount to offer based on the outcome. Sounds innocuous, right? Actually, it's a bit more complicated than that. I'll explain.
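As a minimal sketch of what that experiment might look like (all numbers invented for illustration):

```python
# Minimal sketch of the coupon experiment: randomly assigned customers
# see either the 10% or the 20% offer, and we compare profit per
# customer. All numbers are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical profit per customer: the deeper discount converts more
# shoppers but earns less per sale.
profit_10 = rng.normal(loc=9.0, scale=4.0, size=5_000)   # 10%-off group
profit_20 = rng.normal(loc=8.5, scale=4.0, size=5_000)   # 20%-off group

t_stat, p_value = stats.ttest_ind(profit_10, profit_20)
print(f"mean profit: 10% -> ${profit_10.mean():.2f}, 20% -> ${profit_20.mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# Marketing rolls out whichever coupon wins, if the difference is significant.
```

Note that this is human experimentation, which brings us to the wrinkle.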

In the pharma/biotech industry, experiments are subject to regulations, are sanctioned by government entities and are overseen by ethics and acceptable use committees. This provides the necessary control to prevent unnecessary and potentially harmful experimentation on humans.

In business, and increasingly in consumer marketing, this control is distinctly lacking: companies with access to data can experiment on consumers without the oversight applied to the health sciences. This becomes even more important when we consider that digital marketing efforts now experiment on, and exploit, what can arguably be called psychological vulnerabilities, such as subverting an individual's decision-making process by presenting them with an ad that contains their own likeness (i.e., the person depicted in the ad is morphed to look like the consumer). And all of this can be done using private information and likenesses taken from sites on which pictures are posted, such as Facebook.

Should Data Scientists and Marketers be held to a higher standard than they currently are? If they are manipulating consumers and testing on them, perhaps some sort of oversight or code of ethics is in order.