Monday, November 28, 2016

Oh, Spotify, much is to learn about your users.

Anyone else use Spotify's free version? Anyone else annoyed by the ads? That's what I thought. (Note: before someone spouts off that I should spring for a premium account or switch music streaming services if I don't like the ads, read this post through first.)

What is it about those ads that we find annoying? Their frequency? Their duration? While those are irritating, they are part and parcel of a free service. So, no, that's not my problem. What concerns me is that Spotify doesn't know its customers. Oh sure, they have demographic information, but they don't "get" me. So, you're asking, what does that mean?


Screen capture from Spotify desktop app showing ad playing.

The ads that Spotify runs are largely meaningless to me and, I suspect, to many of its users. Clearly these are paid spots (I'm unaware of exactly how Spotify monetizes its ads), but their relevance to me is low, so I am unlikely to click through. The screen capture above shows what I mean: the ad is for Bruno Mars' new album, a pop artist, while my playlist is largely hard rock (e.g., Soundgarden), metal (e.g., Metallica), electronic (e.g., Skrillex, Chemical Brothers) and some rap (e.g., Beastie Boys). Pop music really doesn't figure in my rotation.

A better ad would be the banner below. I am male, in my 40s, with two kids, and conscious of my expenses and the safety net I provide for my family. (While the content leaves a lot to be desired, insurance is surely on my mind.) Unfortunately, this appears to be a random banner rotation, since the two banners before it were for hip-hop artists.


Screen capture from Spotify desktop app showing banner ad.

My playlist is hosted by Spotify, which means they have access to my selections and all the metadata attached to each track: genre, artist, album, etc. Based on this metadata, Spotify could understand my preferences and musical tastes, and therefore present me with relevant ads that would interest me enough to click through and learn more.
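To make this concrete, here's a minimal sketch of the kind of genre-based targeting I'm describing. The playlist, ad inventory and matching rule are all made up for illustration; this is not Spotify's actual system.

```python
from collections import Counter

# Hypothetical playlist metadata -- each track tagged with a genre,
# as tracks in a streaming catalog already are.
playlist = [
    {"artist": "Soundgarden", "genre": "hard rock"},
    {"artist": "Metallica", "genre": "metal"},
    {"artist": "Skrillex", "genre": "electronic"},
    {"artist": "Chemical Brothers", "genre": "electronic"},
    {"artist": "Beastie Boys", "genre": "rap"},
]

# Illustrative ad inventory, keyed by the genre each campaign targets.
ad_inventory = {
    "pop": "Bruno Mars - new album",
    "metal": "Metallica - tour tickets",
    "electronic": "Electronic music festival",
}

def pick_ad(playlist, inventory):
    """Serve the ad whose target genre appears most often in the playlist."""
    counts = Counter(track["genre"] for track in playlist)
    # Rank the user's genres by frequency; take the first with a matching ad.
    for genre, _ in counts.most_common():
        if genre in inventory:
            return inventory[genre]
    return None  # no match: fall back to the untargeted rotation

print(pick_ad(playlist, ad_inventory))  # → Electronic music festival
```

Even this crude frequency count would serve me an electronic-music ad instead of a pop spot. A real system would obviously weigh play counts, recency, collaborative signals and campaign pricing on top of raw genre tallies.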


While Spotify is struggling financially, it is looking to score more advertisers to increase its ad-based revenue stream. ("Spotify’s ad sales revenue has doubled year-over-year for the past two years. But the plan is to dial things up and start exploring larger deals with major ad holding companies...") However, advertisers will want to measure the ROI of the ads they place with Spotify. If Spotify continues to place ads more or less randomly, that ROI will be lower than it could be. If, instead, Spotify were to "get" its customers better, those ads could be targeted to the correct segments rather than relying on a shotgun approach and hoping that some users click through.


This all amounts to providing additional value to Spotify's users and to Spotify's advertising customers. Knowing its user segmentation and preferences will allow Spotify to:

  • Target ads better, and
  • Offer advertisers/marketers more insight into their customers.

Then, maybe Spotify can improve its ad-based revenue stream by charging a premium to place targeted ads in front of relevant customers with a high click-through history.

Know thy user.

Wednesday, November 9, 2016

Polls, Analytics and Backlashes

It's been a while since I've posted anything here (for various reasons), but the fact that we've just seen an historic election in the United States of America warranted jotting down a few thoughts. This election completely upset most, if not all, predictions made by expert pundits, pollsters and analytics experts. And this, I believe, is because of the effect that massive leads and one-sided reporting have on the human psyche.

Nate Silver's FiveThirtyEight.com (@NateSilver538) has followed the election campaign closely and analyzed the available data in its polls-plus forecast ("What polls, the economy and historical data tell us about Nov. 8."), providing a view of the likely outcome of the election. Twenty-four hours ago, on November 8th, 2016, here's what that forecast looked like:
On November 9th, 2016, we now know that this forecast was incorrect and that Mr. Trump is President-Elect. So how can this prediction, rooted in polls, economics and historical data, have been so wrong?

First, let me bring another prediction and result to the discussion. Consider the Alberta general election of 2012:

Prediction
Source.
The numbers in white above are forecasts of the number of seats each party will win in the Provincial Legislature, and the bars indicate the possible range of seats for each party. Green is the Wildrose Party; blue is the Conservative Party.

Result
 Source.
The coloured numbers are the actual seats won. Notice how the Wildrose Party was limited to 17 seats while the Conservative Party won significantly more seats than forecast. The reason is that polling frequency was not adequate, meaning that pollsters missed a last-minute shift in voter intention - and by last minute, we mean the last two days of the campaign.
"Wildrose's support simply cratered, and to an extent that no model or method could have anticipated." (Eric Grenier, ThreeHundredEight.com; Source)
"There is the possibility the polls somewhat over-estimated Wildrose support in the final week of the campaign and that the swing was not as dramatic as the numbers would suggest. But it seems very likely that Danielle Smith would have won an election held last week – and that a large enough number of Albertans changed their minds and opted for the Tories to swing the election at the very last moment." (Eric Grenier, ThreeHundredEight.com; Source)
This is what I call a backlash vote. Pollsters indicated that an upstart, further-right-wing party would replace the long-standing incumbent Conservatives. This may have caused a change in voter intention, and/or possibly mobilized voters who did not originally intend to vote at all, in an effort to temper the predicted change.

I don't believe that Mr. Trump defied the odds. Rather, I believe that there was a backlash vote, like in Alberta, that was intended to send a message as well as temper the forecasts and foregone conclusions of a landslide win. In fact, landslide wins of these proportions are not common and are often accompanied by some prevailing social context. In the case of this Presidential election, the context was that of an anti-establishment election, as noted by some in social media.

The bottom line is that, regardless of the advances in technology and the ever-increasing use of social media, there will always be a human component that is relatively unpredictable and can thwart even the most sophisticated analysis and data modelling. Hopefully, the AI eventually used to analyze and predict election outcomes won't decide that we're too erratic to govern ourselves and wipe us out...


Disclaimer: I am Canadian. This post is intended to simply point out how trusting analytics and data analysis can be dangerous without context and supporting anecdotal evidence. This is why we say that Data Scientists working in an industry should have industry specific knowledge that can provide context to results.

Friday, April 17, 2015

Marketing Up in the Air

You know how we're always talking about one-to-one marketing, big data, analytics and the long tail? And how we've made a science of marketing to defined demographics? Well, I think we've overlooked an opportunity to deliver tailor-made marketing to a very captive population of consumers.

Air Canada and United Airlines list a combined average of 5,800 flights per day (1), each carrying an assumed average of 120 passengers; if only half of these flights are equipped to deliver in-flight entertainment, that's still 2,900 flights. Mathematically, that's about 348,000 passengers per day and more than 127,000,000 per year. (For perspective, IATA states that over 3.1 billion people took commercial flights in 2013, and that number was slated to grow to 3.3 billion in 2014.) (2)

Spafax lists in-flight advertising rates in its brief on Emirates, along with the relatively high-level demographics of passengers (3). A 60-second spot run for 10+ months costs an advertiser $160,650/month; being conservative, let's say $125,000/month (so $1,500,000 for a year; all dollar amounts in USD unless otherwise specified). That's an average cost of about $0.012 per passenger.
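For those who want to check my math, here's the back-of-the-envelope calculation; every input is one of the assumptions stated above, not actual carrier or Spafax data.

```python
# Back-of-the-envelope math from the post (all figures are the post's
# stated assumptions, not carrier data).
flights_per_day = 5_800        # combined Air Canada + United average
equipped_share = 0.5           # assume half have in-flight entertainment
passengers_per_flight = 120    # assumed average load

equipped_flights = flights_per_day * equipped_share            # 2,900
passengers_per_day = equipped_flights * passengers_per_flight  # 348,000
passengers_per_year = passengers_per_day * 365                 # 127,020,000

annual_spend = 125_000 * 12    # $1.5M/year at the conservative rate
cost_per_passenger = annual_spend / passengers_per_year

print(f"{passengers_per_year:,.0f} passengers/year")   # → 127,020,000
print(f"${cost_per_passenger:.4f} per passenger")      # → $0.0118
```

A little over a cent per passenger, as claimed, which is exactly why even modestly better targeting would be worth money to an advertiser.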

Now imagine segmenting these passengers into demographics based on traditional parameters, such as those listed in Spafax's document, and add to it additional granularity on spending habits, loyalty program information, social media usage, travel destination, class of travel, content viewed in-flight, seat position, duration of stay and other possible parameters. Getting the idea here?

Airlines are missing out on the opportunity to offer marketers and advertisers the ability to target their ads to passengers who are otherwise unable to avoid the marketing while in flight; there is no fridge to raid, the bathrooms leave a lot to be desired (sorry AC and UAL, but airline bathrooms are a far cry from what is acceptable in my home) and passengers aren't likely to strike up a conversation with their seat mate while waiting out the ads.

Granted, the numbers may not add up because we're using conservative, assumed and estimated figures, but the logic stands:
captive audience + individual data + catalog of ads = increased reach
Would it be such a stretch to use readily available data to deliver content that is more relevant to a passenger? I don't think so, especially since the BMW, Jaguar and Land Rover ads that Air Canada has run over the past year may appeal more to Business and First Class passengers than the large majority of Sardine Class passengers who fly every year. Targeting ads can only help marketers and, ideally, make the flying experience much more enjoyable for passengers who might actually relate to the ads being presented to them.

(1) Source: Air Canada and United Airlines web sites
(2) http://www.iata.org/pressroom/pr/Pages/2013-12-30-01.aspx Accessed April 17, 2015.
(3) http://www.spafax.com/download/media-kits/emirates-inflight-mediakit.pdf Accessed April 17, 2015.

Wednesday, December 31, 2014

The Case for Big Data: Redux

It's December 31st, 2014, and everyone is getting introspective or creating their "top N" lists for 2015. So, not having been active for the past year, I figured I would start writing again with my own end-of-year predictions.

I've often said that data analytics and cloud computing were made for each other. In fact, I believe this so strongly that I've included this in talks I've given at various conferences and academic institutions. Of course, it's clear that the user community has adopted Hadoop as a de facto standard analytics tool running on whatever cloud service provider's infrastructure. The trick is, how will this evolve in the coming years?

AWS was the first cloud service provider (CSP) to offer a data analytics platform on demand. Soon thereafter, as Hadoop matured, other CSPs followed, such as Google, Rackspace and Greenplum (acquired by EMC in 2010 for $300 million and subsequently spun off under Pivotal and rebranded as HAWQ), among others. I don't think anyone will dispute that this signals the beginning of the initial wave of adoption of large-scale data analytics by innovators and early adopters.

Clearly, larger organizations constrained by regulations and laws will opt to build in-house. The corollary is that, for smaller-scale use and for the vast majority of adopters, the clear path is to use these low-cost service providers as a test bed for onboarding this new paradigm.

Accenture took a stab at this (video of presentation here, deck here, white paper here) and does a relatively good job (albeit slightly dated, to 2013) of describing the TCO of an on-premises deployment ($21,845), then using this as the budget for AWS, which yields an estimated number of instances across three potential flavors (68x m1.xlarge, 20x m2.4xlarge, 13x cc.8xlarge). This model slightly oversimplifies the acquisition of a number of servers and assumes a refresh cycle of some sort. Take that for what it's worth.


That said, organizations using public CSP services need to do some extra math to figure out if there is an inflection point; in other words, does it become more expensive to grow beyond a certain threshold in an on-demand environment than to deploy in-house on bare metal? There is no easy answer to this question because it's going to depend on the organization's preference for hardware vendor, utilization rates, labor, etc.
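As a sketch, that inflection-point math might look like the following. Every number here is an invented placeholder; plug in your own vendor quotes, utilization rates and labor costs.

```python
# Rough break-even sketch: on-demand cloud vs. an on-premises cluster.
# All figures are illustrative assumptions, not vendor pricing.

def cumulative_cloud_cost(months, monthly_rate=9_000):
    """Pay-as-you-go: cost grows linearly with time used."""
    return monthly_rate * months

def cumulative_onprem_cost(months, capex=150_000, monthly_opex=3_000):
    """Up-front hardware spend plus ongoing power/cooling/labor."""
    return capex + monthly_opex * months

def breakeven_month(horizon=60):
    """First month at which owning becomes cheaper than renting, if any."""
    for m in range(1, horizon + 1):
        if cumulative_onprem_cost(m) < cumulative_cloud_cost(m):
            return m
    return None  # within this horizon, renting stays cheaper

print(breakeven_month())  # → 26
```

With these made-up numbers, owning beats renting a little past the two-year mark; with different utilization or pricing assumptions, the break-even point moves or disappears entirely, which is exactly why each organization has to run its own version of this math.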

So how does this evolve in the coming years? Data analytics adoption will continue to grow and, I think at least, that we'll see:

  • More public service providers entering the market with analytics offerings;
  • New tools being offered as a service;
  • A growing skills gap that organizations will have to scramble to fill.
You decide what comes of these predictions.

Friday, April 11, 2014

To know the future is to change the future

Earlier this week I was stuck in traffic and was thinking about time travel (don't ask) and the time travel paradox--basically, if you travel back in time and prevent your Grandfather from meeting your Grandmother, you would never exist because your parent would never be born. At the same time, there was a discussion on the radio about social issues and how to break negative outcome cycles (like dropping out of school). So, naturally, I wondered what effect having knowledge of a likely future outcome would have on that future. I know, geeky, dorky and confusing all at the same time...

The current state of Predictive Analytics and Big Data is that Data Scientists study and manipulate data to create models in order to test hypotheses. So, it stands to reason that the better a model becomes at predicting future behavior, the closer we get to seeing into the future; in other words, predictive models are akin to the arcane art of predicting the future. Thus, according to the time travel paradox, changing something as a consequence of this knowledge would necessarily change the future and therefore break the model or invalidate the prediction. 

Let's come back to social issues like high school dropout rates. If we create a model to predict dropout rates, it will help us determine which segment of the teenage population is at risk of dropping out. Now, if we instruct social workers to monitor and educate at-risk teens about dropping out, which (ideally) causes rates to drop, we will have broken the model: it no longer reflects reality, and therefore we can no longer see into the future.

This is a little naive, of course, because statistical models can be adjusted, and SHOULD be adjusted, in an iterative fashion. In fact, if we take this adjustment into account along with the availability of real-time data streams, we would expect the resulting predictions to evolve, which means that no matter what action we take based on our model, it would always be accurate (within statistical parameters, of course).
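Here's a toy illustration of that feedback loop, with entirely synthetic data and a deliberately crude one-feature "model": a classifier fit once before the intervention goes stale, while one refit on post-intervention data tracks the new reality.

```python
# Toy feedback loop: predict dropout from absence counts, intervene,
# then compare a stale model against one refit on fresh data.
# All data is synthetic and deliberately simple.

def fit_threshold(records):
    """Pick the absence count that best separates dropouts from graduates."""
    best_t, best_acc = 0, 0.0
    for t in range(0, 51):
        correct = sum((absences > t) == dropped for absences, dropped in records)
        acc = correct / len(records)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def accuracy(t, records):
    return sum((a > t) == d for a, d in records) / len(records)

# Before the intervention: students with >20 absences tend to drop out.
before = [(a, a > 20) for a in range(0, 41)]
# After: counseling keeps students with up to 30 absences in school.
after = [(a, a > 30) for a in range(0, 41)]

static_t = fit_threshold(before)  # trained once, never updated
refit_t = fit_threshold(after)    # refit on post-intervention data

print(f"static model on new data: {accuracy(static_t, after):.2f}")
print(f"refit model on new data:  {accuracy(refit_t, after):.2f}")
```

The stale model keeps flagging students the intervention has already saved; the refit one adapts. That's the iterative adjustment in miniature.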

Just don't try going back in time. I advise against it.

Tuesday, January 7, 2014

Data science and ethical considerations

I think we can all agree that data science is growing in importance and popularity as a means to increase insight into, and meaningful interaction with, customers. Personal information and browsing histories are common inputs for recommendation engines. However, the technology has evolved to include a very subversive tool in the effort to market to you more efficiently: YOU.

Marketers conceive of marketing campaigns all the time. Sometimes, if they have access to a Data Scientist, they test a campaign before rolling it out. Let's say the campaign is a holiday coupon, either 10% or 20% off. The Data Scientist would design a basic experiment to determine whether profits are higher with the 10% or the 20% coupon, and Marketing would decide which discount to offer based on the outcome. Sounds innocuous, right? Actually, it's a bit more complicated than that. I'll explain.
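To sketch what that experiment might look like in code: all prices, margins, redemption rates and sample sizes below are invented, and the "shoppers" are simulated rather than real test subjects.

```python
import random

# A minimal sketch of the coupon A/B test described above.
# Every number here is a made-up assumption.
random.seed(42)

BASE_PRICE = 100.0  # sticker price of the item
COST = 60.0         # our cost of goods sold

def run_group(discount, redemption_rate, n=10_000):
    """Simulate n shoppers offered a coupon; return total profit."""
    profit = 0.0
    for _ in range(n):
        if random.random() < redemption_rate:  # shopper redeems the coupon
            profit += BASE_PRICE * (1 - discount) - COST
    return profit

# Assumption: the deeper discount lifts redemption but eats margin.
profit_10 = run_group(0.10, redemption_rate=0.15)
profit_20 = run_group(0.20, redemption_rate=0.22)

winner = "10% coupon" if profit_10 > profit_20 else "20% coupon"
print(f"10%: ${profit_10:,.0f}  20%: ${profit_20:,.0f} -> offer the {winner}")
```

Note how close the two arms can land: per redemption the 10% coupon nets $30 and the 20% coupon nets $20, so the outcome hinges on how much the deeper discount actually lifts redemption. That's precisely the question the experiment on real consumers is answering.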

In the pharma/biotech industry, experiments are subject to regulations, are sanctioned by government entities and are overseen by ethics and acceptable use committees. This provides the necessary control to prevent unnecessary and potentially harmful experimentation on humans.

In business, and increasingly in consumer marketing, there is a distinct lack of such control, meaning that companies with access to data can experiment on consumers without the oversight applied to the health sciences. This becomes even more important when we consider that digital marketing efforts are now experimenting on, and exploiting, what can arguably be called psychological vulnerabilities: for example, subverting an individual's decision-making process by presenting them with an ad that contains their own likeness (i.e., the person depicted in the ad is morphed to look like the consumer). And all of this can be done using private information and individuals' likenesses from sites where pictures are posted, such as Facebook.

Should Data Scientists and Marketers be held to a higher standard than they are currently? If they are manipulating consumers and testing on them, perhaps some sort of oversight or code of ethics is in order.

Monday, August 26, 2013

Why is it so difficult to find Data Scientists?

Remember my friend Sue, the self-proclaimed big data "Stuper User"? Well, as you may recall, Sue's company was interested in customer data analytics and the ability to extract insights for their marketing campaigns. The problem is, there is a shortage of people with the necessary skills to provide those insights.

This is not surprising given that data science is only now coming into its own. Harvard Business Review called the Data Scientist the "sexiest job of the 21st century", but not just anyone can be a Data Scientist: as Joel Greenhouse writes in the Huffington Post, statistical know-how is the foundation of the profession.

This is what differentiates an analyst from the scientist: the analyst will run the queries and statistical tests, but the scientist will design experiments and tease out those oh-so-valuable insights that everyone is talking about. And this, folks, is why it is so difficult to find an individual who:
  1. Understands your business and industry;
  2. Has the necessary statistical background;
  3. Is technically savvy enough to understand how databases work;
  4. Can write the programs used to test hypotheses (in SAS, R, Erlang, etc.);
  5. Is able to craft simple and coherent reports that are actionable.
Educational institutions are falling over themselves to create/capitalize on master's-level certificate and graduate degree programs in data science, business intelligence, business analytics, etc. And they're apparently not cheap! They range from $10,000 to $60,000 and anywhere from 10 months to a couple of years in duration. Time will tell whether these programs are graduating data scientists or analysts.

In the meantime, Sue's company will continue to search for someone who can fill its need for customer insight in a market that has a shortage of available candidates.