Friday, November 8, 2013

Thrive on the Data Deluge! - A Review of the Book, Big Data - Part I

The authors, Viktor Mayer-Schonberger and Kenneth Cukier, have done a great job of tracing the emergence of Big Data. They also support their points about Big Data with interesting live case studies from various fields. Though the writing is not on par with Tim Harford or Malcolm Gladwell in keeping readers glued to the book, I found it comprehensively informative! The thorough research by the authors has made this book rich in information and insight, and that more than compensates for the lack of a racy style.

The book takes off with the 2009 flu outbreak, which Google could trace almost in real time by leveraging Big Data! This typical Big Data case study captures all the elements of the Big Data phenomenon. Out of Google's 3 billion searches per day, the 50 million most common search terms in the USA were examined, and many unexpected terms were found to correlate with the outbreak! The CDC (Centers for Disease Control and Prevention) couldn't trace the outbreak in real time! The case study has all three characteristics of Big Data:
  1. More: A humongous database, almost touching n=all
  2. Messiness: No exactitude in the data, though for 'Small Data' applications exactitude is a prerequisite
  3. Correlation: Unexpected correlations, not necessarily causes. Hence the stress is on the 'what's, not on the 'why's.
To me the last characteristic seems restrictive. Probably, in the initial years of Big Data the stress will be on the 'what's, but once we have enough resources we will move towards the 'why's, and then to the next levels of 'why's!
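To make the 'what over why' idea concrete, here is a minimal sketch of the kind of correlation hunt the flu case study describes. It is not Google's actual method (which tested tens of millions of terms against CDC records); the weekly series below are made-up illustrative numbers.

```python
import numpy as np

# Hypothetical weekly series: frequency of one search term vs. reported
# flu cases (illustrative numbers only, not real data).
searches_per_week = np.array([120, 150, 310, 560, 900, 640, 300, 180], dtype=float)
reported_flu_cases = np.array([40, 55, 140, 260, 410, 300, 150, 70], dtype=float)

# Pearson correlation coefficient: +1 means the term rises and falls
# exactly in step with the outbreak, 0 means no linear relationship.
r = np.corrcoef(searches_per_week, reported_flu_cases)[0, 1]
print(f"correlation = {r:.2f}")

# In a Big Data setting this check would run over millions of candidate
# terms, keeping the best-correlated ones as proxies -- the 'what',
# with no claim about the 'why'.
```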

The authors offer a good application case study of Farecast by Oren Etzioni, which was later acquired by Microsoft. In 2003, Farecast started by analyzing just 12 thousand airfares to save money for airline ticket buyers. Later, Farecast analyzed nearly 200 billion flight-price records to offer better price advice to air travellers! The authors also list Oren Etzioni's other ventures: MetaCrawler, which was taken over by InfoSpace; Netbot, a price-comparison engine taken over by Excite; and ClearForest, which extracted meaning from text content and was taken over by Reuters!

The data deluge we are experiencing is most pronounced in astronomy, genomics and the stock exchanges, but it is spreading to all other fields. In astronomy, while the Sloan Digital Sky Survey collected 140 terabytes (TB) of data between 2000 and 2010, an upcoming telescope in Chile due to open in 2016 will collect the same quantum, 140 TB, every 5 days. That is roughly a 730X improvement in the rate of collection! The authors point out that while it took nearly a decade and a whole army of scientists to sequence the roughly 3 billion base pairs of the human genome by 2003, today similar sequencing can be done on a single machine in a day, at least a 3,000X improvement in 10 years! Similarly, in the US about 7 billion shares are traded every day, and nearly 5 billion of them are traded by algorithms, not human beings!
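A quick back-of-the-envelope check of those factors, using only the figures quoted above:

```python
# Sloan Digital Sky Survey: 140 TB over ~10 years vs. the Chilean
# telescope's 140 TB every 5 days -- compare the collection rates.
old_rate_tb_per_day = 140 / (10 * 365)            # ~0.04 TB/day
new_rate_tb_per_day = 140 / 5                     # 28 TB/day
print(new_rate_tb_per_day / old_rate_tb_per_day)  # ~730x

# Genome sequencing: roughly a decade vs. about a day.
print((10 * 365) / 1)                             # ~3,650x, i.e. "at least 3,000x"
```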

According to the authors, Google processes about 1 petabyte (PB) of data per hour; 10 million images are uploaded to Facebook every hour, an hour of video is uploaded to YouTube every second, and around 600 million tweets were being sent every day in 2013, with annual growth of about 200%. That gives an idea of the rapid strides of Big Data in everyday life! The main tools available to leverage this Big Data are MapReduce from Google, the open-source tool Hadoop, and NoSQL databases. The authors conclude that the stock of data is growing about four times faster than the world economy, while processing power is growing about nine times faster.
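For readers new to these tools, here is a toy illustration of the map-and-reduce pattern that MapReduce and Hadoop popularized. It is plain Python against no real cluster API; counting words across two tiny documents stands in for the kind of aggregation that is normally spread over thousands of machines.

```python
from collections import defaultdict

documents = [
    "big data is more data",
    "more data beats better algorithms",
]

# Map phase: each document independently emits (word, 1) pairs.
# On a real cluster these calls would run in parallel on many machines.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: group the emitted pairs by key (word).
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

# Reduce phase: combine the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. 'data': 3, 'more': 2, 'big': 1, ...
```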

They offer an interesting example of how images, when available in sufficient quantity, i.e., more than 24 frames per second, turn into video: a vivid example of a quantitative change producing a qualitative difference. In my view, going forward this corresponds to the 'augmented reality' apps we have already started experiencing. The data deluge is leading to augmented reality in real time. Presently, decisions are taken and executed literally in split seconds in the case of algorithm-based share trades! And as the Internet of Things (IoT) grows, we will see similar split-second execution in other fields too.

In the first chapter, the authors offer a brief overview of the characteristics of Big Data. In the chapters devoted to each characteristic, they offer exciting live case studies that provide a better understanding of the concepts.

In the chapter about more data, they offer a history of data collection, which started on a large scale with the Egyptians and the Chinese. Largely, the purpose of such data collection was to identify sources of income for the state. The US started its 1880 census, whose tabulation went on for 8 years, and the data was obsolete by the time it was available! The authors trace the emergence of IBM to the adoption of Herman Hollerith's punch cards and tabulating machines for the 1890 census, which could be completed in about a year.

While collecting and processing humongous data was extremely difficult, 'sample' data was sufficient for useful interpretations. Some 300 years back, the foundations of statistical methods were laid to glean useful information from sample data. The randomness of the sample trumped its quantity: a properly random sample keeps the error margin to around 3% with 95% confidence. Randomness, not sheer size, ensured better interpretation of sample data. Exactitude of the data is a necessity in this regime, as even small errors in a small sample affect the results disproportionately. The fascination with exactitude continued even as the availability and processability of data kept increasing.
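The '95% of the time, about 3% error' figure is just the standard margin-of-error arithmetic for a simple random sample. A small sketch of the calculation, taking the worst case p = 0.5 for a proportion:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a proportion estimated from a
    simple random sample of size n (z = 1.96 for 95% confidence)."""
    return z * math.sqrt(p * (1 - p) / n)

# Roughly 1,100 randomly chosen respondents are enough for ~3%,
# regardless of how large the underlying population is.
for n in (100, 400, 1100, 10000):
    print(n, f"{margin_of_error(n):.1%}")
# 100 -> 9.8%, 400 -> 4.9%, 1100 -> 3.0%, 10000 -> 1.0%
```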

But sampled data cannot be drilled down into deeper sub-groups; it is collected in an optimized manner for specific objectives. In short, sampling was the solution for a world suffering from data scarcity.

To differentiate between sampling and using almost all the data (n=all), the authors offer convincing examples. Whereas 23andMe offers the analysis of a few genetic markers for a couple of hundred dollars, Steve Jobs got his entire genome sequenced at a six-figure price. Jobs and his medical team had the advantage of targeting the specific sequences to treat him. Whereas data sampling excludes outliers for efficiency, Big Data offers exciting insights based precisely on the outliers!

The authors exemplify these advantages through Steven Levitt's study of sumo wrestling matches in Japan. Levitt's team collected all the data about sumo matches and could conclusively show that match-fixing was happening in the bouts leading up to the final matches. Though the data in this case was not humongous, they could collect all of it, a few gigabytes. Similarly, when Farecast used more airfare data, it attained better accuracy in spite of the messiness of the data.

Today, data streams in through sensors and through clicks and likes on the web, while webcams and CCTVs beam images and videos. There is an ever-accelerating trend of data being captured, communicated and shared. We are at the cusp of the 'Internet of Things', which heralds communication between 'devices' to facilitate the necessary decisions and actions.

The authors offer a very apt example of the Lytro camera. This revolutionary camera captures the entire light field, consisting of 11 million rays! So the photographer can choose to focus, refocus, change perspective, ... the possibilities are immense. The same applies to the humongous data collected by today's sensors. Both these developments can be attributed to the exponential growth in the complexity of chips and the corresponding reduction in the cost of manufacturing them.

We have entered the data deluge (Big Data) era, hence we can drill down through multiple levels of data and get insights at every level. Additionally, we have the advantage of serendipitous discoveries: we need not hypothesize in advance. We can get insights from outliers too, which gives a distinct edge over Small Data practitioners.

Talking about the messiness of data, the authors offer a glimpse of the history of measurement and the necessity of exactitude. That necessity arose from the paucity of data, storage and processing power! Exactitude was, and still is, required in many fields. The journey of exactitude started in the 13th century and reached its height in 19th-century France, which proudly possessed the international standards of physical measurement! The emergence of quantum mechanics in the 1920s moved the focus to uncertainty!

Messiness creeps into data in many ways; it could simply be the myriad ways of referring to the same thing. The authors cite the Big Data expert D J Patil, who points out that IBM may be referred to variously as International Business Machines, T. J. Watson Labs, and so on. The precision of sensors in a refinery or a vineyard may vary. Formats and types may vary across relational databases. A sentiment analysis of tweets may be utterly messy by conventional standards. But the authors assure us that tolerating messiness pays off when humongous data can be acquired, stored and processed, and the data deluge era has already reached that point! Not only is the data generated, conveyed, shared and stored, it is also processed (thanks to Hadoop!) almost in real time. The authors offer many examples: if computers almost always end up winning chess endgames, it is because all combinations of endgames with six or fewer pieces can be accommodated in about 1 terabyte!

The authors offer the case study of the development of a grammar checker by two scientists at Microsoft, Michele Banko and Eric Brill. When they fed 10 million words to two different algorithms, they got 75% and 86% accuracy; when they increased the training data to a billion words, the accuracies rose to above 95% and 94%, respectively! The authors also offer exciting stories from the world of machine translation. In 1954, IBM researchers, fresh from the victory of translating 205 well-chosen sentence pairs, announced that machine translation would be possible within 3 to 5 years, but they eventually had to concede defeat. In the 1990s, IBM scientists used around 3 million English-French sentence pairs from Canadian parliamentary proceedings, but the plug was pulled on the project by the late '90s! The Google Translate project started with a corpus of about 1 trillion words, amounting to 95 billion English sentences. By mid-2012, the project covered 60 languages and was ready to accept voice input in 12 of them. It could even offer translation between, say, Hindi and Catalan!

The authors emphasize the challenge of achieving neatness or exactitude with huge data. On Flickr, 6 billion images have been uploaded by 75 million users; it is impossible to categorize such a huge collection neatly, hence the 'tags'. Often even tags with incorrect spellings will do. In real life, not many things are neatly categorized; approximation is fine. Facebook shows the exact number of 'like's while the count is in the hundreds, but the moment it crosses 1,000, approximation sets in: 2K, 5K, ... The same goes for timestamps in Gmail: 12 min., 1 hr., 2 days, ...
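How such display-level approximation might look in code, as a purely hypothetical helper (the book does not describe Facebook's or Gmail's actual logic):

```python
def approx_count(n):
    """Show exact counts while they are small, switch to rounded
    'K'/'M' figures once extra precision stops adding information."""
    if n < 1000:
        return str(n)                      # 847 stays 847
    if n < 1_000_000:
        return f"{n / 1000:.0f}K"          # 5,321 -> "5K"
    return f"{n / 1_000_000:.1f}M"         # 2,400,000 -> "2.4M"

for n in (847, 5321, 2_400_000):
    print(n, "->", approx_count(n))
```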

Whereas structured-query-based solutions give precise answers, Big Data solutions like Hadoop offer quick answers that are approximate, because they process messy data. The authors offer the example of 73 billion Visa transactions which took 1 month to process for precise answers, while a Big Data solution offered an approximate answer in 13 minutes, roughly a 3,000X improvement in speed! When decisions and executions can happen in near real time, organizations get the power to beat the market comfortably.

They also offer the example of ZestFinance, founded by Douglas Merrill, a former CIO of Google. The organization analyzes a huge number of 'weaker variables' to make lending decisions, and its default rate is about 33% better than industry-standard practices. The authors also share that only 5% of digital data is structured. Hence Big Data analysis, with sample sizes nearing n=all, gives us answers with sufficient 'plasticity', which is closer to reality! In instances like this, the authors' deep understanding shines.

The chapter on correlation takes off with Greg Linden, who in 1997 wrote product-recommendation code for a startup. The startup was Amazon. Linden discovered that recommendations based on similar items bought by other customers worked best, and the 'item-to-item collaborative filtering' technique was duly patented. Linden could build the recommendation engine by relying on the 'what' rather than the 'why': correlation rather than causality provides the optimal solution when leveraging Big Data. The authors share that Amazon found the recommendations generated by the code drove more sales than the reviews written by its in-house reviewers, and that three-quarters of Netflix's new orders are based on such recommendations.
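A minimal sketch of the item-to-item idea, not Amazon's patented algorithm but the core intuition: represent each item by the vector of users who bought it, and recommend the items whose vectors are most similar to what the customer already has. The toy purchase matrix and titles below are made up.

```python
import numpy as np

# Toy purchase matrix: rows = users, columns = items (1 = bought).
items = ["Big Data", "Freakonomics", "Moneyball", "A Cookbook"]
purchases = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# Item-to-item similarity: compare columns (items), not rows (users).
def similar_items(item_index):
    target = purchases[:, item_index]
    scores = [(cosine(target, purchases[:, j]), items[j])
              for j in range(len(items)) if j != item_index]
    return sorted(scores, reverse=True)

# "Customers who bought 'Big Data' also bought..."
print(similar_items(items.index("Big Data")))
```

Notice that the code never asks why two books are bought together; the correlation in purchase histories is enough to generate the recommendation.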

The authors trace correlation back to Francis Galton, Charles Darwin's cousin, who in 1888 found a correlation between a person's height and the length of the forearm. Such correlations were hard to verify because of the paucity of data. The authors then offer the case study of Walmart, which, with around $450 billion in revenue, had almost become the largest consignment shop in the world; that figure exceeds the individual GDPs of more than 80% of the countries around the world. In 2004, Teradata (earlier part of National Cash Register) could 'data-mine' Walmart's servers to correlate hurricane warnings with the sale of flashlights (which was expected) and Pop-Tarts (which was unexpected). Walmart gained from the insight by stocking Pop-Tarts alongside flashlights whenever hurricanes were predicted!

The authors offer numerous correlations found using Big Data. Fair Isaac Corporation (now FICO) could correlate its Medication Adherence Score with how long people have stayed at the same address, how long they have held the same job, their marital status and whether they own a car! The authors cite the WSJ's reporting that Experian could offer its Income Insight score, estimating income from credit history alone, for about $1, less than a tenth of the cost of verifying income directly. Aviva achieved a 25X cost benefit by predicting health status from credit reports, consumer-marketing data and hobbies. The authors also refer to Target's way of identifying pregnant customers, drawing on Charles Duhigg's book The Power of Habit. UPS undertakes preventive maintenance of its fleet of more than 60,000 vehicles by continuously collecting data from sensors on various parts of the vehicles. The same method is applied to continuously monitoring bridges, chemical plants and buildings, among others.

In healthcare, ECGs generate around 1,000 readings per second, yet medical practitioners use only a handful of them. Dr. Carolyn McGregor of the University of Ontario Institute of Technology, along with IBM, conducted a study collecting 1,260 data points per second, including blood pressure, oxygen level and heart rate, from premature babies (preemies). The study could reveal problems suffered by preemies 24 hours before the outward symptoms showed up, and the system could beat even experienced doctors. Such systems are helpful in the care of preemies, other patients and older people. Nowadays lifelogging, the trend of recording almost all vital signs through myriad sensors, has taken off among techno-aficionados, and that data can be leveraged to help both them and the wider population!
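As a hedged illustration of the spirit of such monitoring (Dr. McGregor's actual system, built on IBM's stream-processing technology, is far more sophisticated), here is a toy rolling z-score check that flags readings drifting away from a recent baseline. The simulated heart-rate stream is invented for the example.

```python
from collections import deque
import statistics

def detect_drift(stream, window=60, threshold=3.0):
    """Flag readings more than `threshold` standard deviations away
    from the mean of the last `window` readings (toy rolling z-score)."""
    history = deque(maxlen=window)
    alerts = []
    for t, value in enumerate(stream):
        if len(history) == window:
            mean = statistics.fmean(history)
            sd = statistics.pstdev(history) or 1e-9
            if abs(value - mean) / sd > threshold:
                alerts.append((t, value))
        history.append(value)
    return alerts

# Simulated heart-rate stream: stable around 150 bpm, then a subtle drop.
stream = [150 + (i % 3) for i in range(120)] + [138, 136, 135, 134]
print(detect_drift(stream))   # flags the readings after the drop
```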

Our quest for simple linear relations and causality is understandable, as we lived in a 'small data', or data-scarcity, era. But reality is complex: even where a few factors show causality and linearity, that linearity has its limits. The authors do a wonderful job of highlighting this through good case studies. The relation between happiness and income is linear only up to a certain income level; beyond that, income ceases to impact happiness linearly. Measles immunization rates showed a similarly non-linear tendency. The authors also bring in network effects and network analysis, which can now be studied through the data available from social networks like Facebook and LinkedIn. Daniel Kahneman's fast and slow thinking systems are used to illustrate the fallacy of jumping to causes too quickly, which often turn out to be wrong. They take up the case of Joseph Meister, the first person to be inoculated against rabies by Louis Pasteur, on 6th July 1885; it turns out that only about 1 in 7 people bitten by a rabid dog contracts rabies, so Meister might well have survived anyway. The authors also note the finding of a Kaggle study which indicated that orange cars experienced fewer accidents; our natural tendency is to hunt for causes for the same! The authors do a commendable job of demonstrating the limitations of 'small data' era practices, and conclude that in the Big Data era causality won't be discarded, but it will definitely be knocked down from the pedestal it occupies now.

The authors demonstrate the benefits of correlation in solving complex real-life problems through the case study of exploding manholes in New York City. The manhole lids weigh up to 300 pounds, and when they explode they can shoot up several floors high! Con Edison has laid 94,000 miles of underground cable, roughly 3.5 times the Earth's circumference, and in Manhattan alone there are 51,000 manholes and service boxes. Since those cables have been laid since the 1930s, the data available is both messy and humongous: service boxes alone are noted in 38 different ways, as SB, SX, S?B, ... Con Edison approached Columbia University for help. The team used all the available data and mined it to arrive at 106 predictors; 44% of the severe incidents occurred in the top 10% of manholes ranked by the team's predictions. The team identified the two main predictors: the age of the cables and prior incidents. One may surmise these factors to be obvious; the authors quote the network theorist Duncan Watts, "Everything is obvious once you know the answer." But when one considers that 106 parameters had to be weighed to isolate just two major factors, one can appreciate the complexity.
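A hedged sketch of the evaluation behind the '44% of severe incidents in the top 10% of manholes' claim. The Columbia team's real model combined 106 predictors mined from decades of messy records; here two made-up features, invented coefficients and a synthetic label stand in, purely to show how ranking by predicted risk is scored.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 51_000                                  # manholes and service boxes

# Two illustrative predictors (synthetic): cable age and prior incidents.
cable_age = rng.uniform(0, 80, n)           # years
prior_incidents = rng.poisson(0.2, n)

# Synthetic ground truth: older cables and prior trouble raise the odds
# of a severe incident (purely made-up coefficients).
logit = -6 + 0.05 * cable_age + 1.2 * prior_incidents
incident = rng.random(n) < 1 / (1 + np.exp(-logit))

# A simple risk score ranking manholes by the same two predictors.
risk_score = 0.05 * cable_age + 1.2 * prior_incidents
top10 = np.argsort(risk_score)[-n // 10:]   # top 10% riskiest manholes

coverage = incident[top10].sum() / incident.sum()
print(f"share of incidents in top 10%: {coverage:.0%}")
```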

The authors address the interesting debate that has opened up since the emergence of Big Data: does Big Data herald the end of theory? Chris Anderson, then editor-in-chief of Wired, wrote a 2008 cover story on 'The Petabyte Age', claiming that so much data allows us to say that 'correlation is enough'. The authors rightly argue that though data abundance no longer forces us to hypothesize in advance, Big Data itself rests on theory, and theory shapes the methods and thereby the outcomes. They also quote danah boyd and Kate Crawford, pointing out that Google used 'search terms' as a proxy for flu, which shows that a rational basis grounded in theory is still needed. The need for theory continues, but Big Data has definitely effected a fundamental shift in how we address the opportunities and challenges emerging around us!

The rest of the book covers 'datafication', the value to be gained, the implications, the means of control, and what's next. The review of these topics will be covered in the next part.