Big Data: History, Development, Application, and Dangers

By: The FHE Team

History & Development

The background of big data traces back to the beginning of measurement itself. Measurement and basic counting were practiced in the Indus Valley, Mesopotamia, and Egypt as early as the third millennium B.C., at the dawn of the earliest civilizations. Over time, the accuracy and use of measurement continued to improve, making later methods for measuring area and volume possible.

A numeral system that emerged in India around the first century A.D. was improved by Persian mathematicians and then refined by the Arabs into the Arabic numerals we use today. Latin translations spread the system across Europe by the 12th century and fueled an explosion of mathematics.

Mathematics would eventually ally with data. One of the earliest recorded pairings occurred in 1494, when Luca Pacioli, a Franciscan monk, published a book on the commercial application of mathematics and explained a new accounting format called double-entry bookkeeping, which enabled merchants to compare profits and losses. This revolutionized business, particularly banking.

The Scientific Revolution of the 1600s and 1700s saw a growing interest in measurement and mathematics as powerful tools for understanding the world and reality.

Sir Francis Galton's 1888 discovery that a man's height correlated with the length of his forearm was an important precedent for big data. The authors explain that when two data values are statistically related, the relationship can be quantified as a correlation, so that a change in one value can predict a change in the other.

Currently, statistical correlation is one of the primary uses of big data, both for understanding cause and effect and for applying predictive analysis to understand, prepare for, and in some instances influence the future. Supercomputers run algorithms that identify correlations in uploaded data and surface valuable insights.
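To make the idea concrete, here is a minimal sketch, in Python, of how a Galton-style correlation can be computed and then used for prediction. The height and forearm figures are invented for illustration; nothing here comes from the book itself.

```python
import numpy as np

# Hypothetical measurements in centimetres; the values are invented for illustration.
height = np.array([165, 170, 172, 175, 178, 180, 183, 188])
forearm = np.array([25.1, 25.9, 26.2, 26.8, 27.1, 27.5, 28.0, 28.8])

# Pearson correlation coefficient: how strongly the two measurements move together.
r = np.corrcoef(height, forearm)[0, 1]
print(f"correlation r = {r:.3f}")

# A least-squares line turns the correlation into a prediction:
# given a new height, estimate the forearm length.
slope, intercept = np.polyfit(height, forearm, 1)
print(f"predicted forearm for a 176 cm man: {slope * 176 + intercept:.1f} cm")
```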

By the close of the 19th century, all the components of big data were in place, and data management and analysis were in vogue, although limits remained on how much information could be analyzed and stored. The computer age would come to solve that problem.

Application

Computers expanded the variety and range of the things we could record and capture, and the introduction of the internet multiplied that capacity exponentially (see Disruptive Technologies: The Internet of Things). Our online behavior is now used to determine our tastes and preferences, with sites such as Google, Twitter, Facebook, and LinkedIn (among many others) recording, analyzing, and storing even private information about our health, relationships, finances, and almost anything else we can think of. Sensors, even in our phones, track our every move, and computers allow us to quantify everything under the sun, from location to heart rate to engine vibration.

Computers can measure, record, analyze, and store data on a near limitless scale, with processing speeds and storage capacity improving daily. The authors note that it originally took a decade to sequence the three billion base pairs of the human genome, while by 2012 the same amount of DNA could be sequenced in a single day.

Computers have enabled us to move on from analyzing small samples of data, drawing conclusions with varying margins of error, and basing entire theories on those limited samples, to analyzing the entire data set and gaining a far more exact insight into a given subject. The authors express this as N=all.

Even in the recent past, we had to look for correlations in data by hand, choose proxies, and run correlations against those proxies for validation. Today we can load even disorganized data and have “intelligent” algorithms find correlations we may never have suspected. While that potential is great, gleaning insights from correlations has its downside: correlations can be coincidental and may not reflect a true causal relationship. Careful interpretation of computer output is still necessary.
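The caveat about coincidental correlations is easy to demonstrate. The sketch below, using purely made-up random data, shows that when an algorithm screens enough unrelated series, a seemingly strong correlation appears by chance alone, which is why careful interpretation remains necessary.

```python
import numpy as np

rng = np.random.default_rng(42)
# 1,000 completely unrelated random series, 50 observations each.
data = rng.normal(size=(1000, 50))

# Correlate every other series against the first one and keep the strongest match.
target = data[0]
corrs = np.array([np.corrcoef(target, row)[0, 1] for row in data[1:]])
best = corrs[np.abs(corrs).argmax()]
print(f"strongest correlation found in pure noise: {best:.2f}")
# With enough candidate series, a "strong" correlation shows up by chance alone.
```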

Practical Application

In 2009, Google applied big data to the search terms run through its engine in order to track the spread of the H1N1 pandemic in real time. The researchers took data from the Centers for Disease Control and Prevention (CDC) on flu outbreaks between 2003 and 2008 and compared it against the most popular search terms entered into Google during that period. Google's system looked for correlations between the frequency with which certain queries were entered and how the flu spread over time and space. The software found a combination of some 45 search terms whose predictions showed a high correlation with official figures countrywide. This predictive model actually proved more reliable than state figures at the height of the 2009 pandemic.
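The book does not publish Google's method, but the kind of screening described can be sketched as follows, using hypothetical weekly CDC flu counts and per-term query frequencies (the data and the simple top-45 average are illustrative assumptions, not Google's actual model):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
weeks = 260  # roughly the 2003-2008 window

# Hypothetical data: weekly CDC flu counts and query frequencies for candidate terms.
cdc_flu = pd.Series(rng.poisson(100, weeks), name="cdc_flu")
queries = pd.DataFrame(
    {f"term_{i}": cdc_flu * rng.uniform(0, 1) + rng.normal(0, 50, weeks)
     for i in range(500)}
)

# Rank every candidate term by how well its frequency tracks the official counts,
# then keep the best-correlated handful (the text above mentions some 45 terms).
correlations = queries.corrwith(cdc_flu).sort_values(ascending=False)
top_terms = correlations.head(45)

# A crude combined predictor: the average of the top-correlated terms.
predictor = queries[top_terms.index].mean(axis=1)
print("combined correlation with CDC counts:", round(predictor.corr(cdc_flu), 3))
```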

Google has also used big data to gather billions of translated pages from across the internet and build its own database, which allows any uploaded text to be translated; the service is known as Google Translate. Because Google uses far more data than other translation services, its system outperforms the competition.

Other Google projects, such as Google Books, rely on the same big data model.

As one would expect, some if not most of the information Google collects is also channeled into its profit-making ventures. For example, while Google's Street View cars are advertised as a service for Google Maps, they have been known to collect data from open Wi-Fi connections, and the data they gather has also fed the development of Google's driverless car technology. Google is certainly in the driver's seat for some serious profits from this disruptive technology.

Targeted Advertisement

Amazon.com, a pioneer of targeted advertising, became a big data user when Greg Linden, one of its software engineers, realized the company could do better than the average results of its in-house review project. When Amazon.com first came online in 1995, about a dozen critics were hired to review its books in what was called ‘the Amazon voice.’ Linden designed software that could identify associations between books and recommend titles to customers based on their previous choices. When Amazon compared the sales driven by the computer-generated recommendations against those driven by the in-house reviews, the data-derived recommendations performed far better, and the approach went on to revolutionize e-commerce.
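The book does not spell out Linden's algorithm, but a minimal sketch of item-to-item recommendation based on purchase co-occurrence conveys the idea; the titles and purchase histories below are invented.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical purchase histories: each set is one customer's books.
orders = [
    {"Dune", "Foundation", "Hyperion"},
    {"Dune", "Foundation"},
    {"Dune", "Neuromancer"},
    {"Foundation", "Hyperion"},
]

# Count how often each pair of books is bought by the same customer.
co_counts = defaultdict(Counter)
for basket in orders:
    for a, b in combinations(sorted(basket), 2):
        co_counts[a][b] += 1
        co_counts[b][a] += 1

def recommend(book, k=3):
    """Return the k titles most often bought alongside `book`."""
    return [title for title, _ in co_counts[book].most_common(k)]

print(recommend("Dune"))  # e.g. ['Foundation', 'Hyperion', 'Neuromancer']
```

Real systems weight these counts by popularity and work over far larger catalogues, but the core step, finding items that co-occur with what a customer already bought, is the same.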

Facebook also tracks user locations, ‘status updates,’ and ‘likes’ to determine which ads to display to a user. These targeted ads can seem invasive and, to many, a bit creepy. An analytics team reviews people's behaviors and locations and determines which ads to show them. Finding correlations between users and their likely needs has since become a standard model of advertising.

Manufacturers have started using big data to streamline their operations while improving safety. Sensors placed on machinery monitor the patterns in the vibration, stress, heat, and sound data the machines produce and detect changes that might signal future problems. These early-detection and prediction systems help avert breakdowns and ensure timely maintenance.
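As an illustrative sketch (not any particular manufacturer's system), one simple way to flag such changes is to compare each new sensor reading against a rolling baseline; the vibration data below are simulated.

```python
import numpy as np

def flag_anomalies(readings, window=50, threshold=3.0):
    """Flag readings that drift more than `threshold` standard deviations
    away from the mean of the previous `window` readings."""
    readings = np.asarray(readings, dtype=float)
    flags = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mean, std = baseline.mean(), baseline.std()
        if std > 0 and abs(readings[i] - mean) > threshold * std:
            flags.append(i)  # candidate sign of wear or an impending fault
    return flags

# Simulated vibration data: steady noise with a drift near the end.
rng = np.random.default_rng(1)
vibration = np.concatenate([rng.normal(1.0, 0.05, 500), rng.normal(1.6, 0.05, 20)])
print(flag_anomalies(vibration))  # indices where vibration departs from its baseline
```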

Big data can also boost efficiency outside the factory: UPS routinely uses data-driven algorithms to determine more efficient and safer routes for its trucks. This has reduced accidents, fuel consumption, and other costs. UPS also fits its vehicles with sensors that identify potential breakdowns, using the same approach the authors describe above.
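UPS's actual routing system is far more sophisticated, but a toy sketch of data-driven route ordering, here a greedy nearest-neighbor pass over invented delivery stops, illustrates the basic idea:

```python
import math

# Hypothetical delivery stops as (name, latitude, longitude); coordinates are invented.
stops = [("A", 40.71, -74.00), ("B", 40.73, -73.99), ("C", 40.70, -74.02), ("D", 40.75, -73.98)]
depot = ("Depot", 40.72, -74.01)

def dist(p, q):
    # Straight-line distance stands in for real travel-time and safety data.
    return math.hypot(p[1] - q[1], p[2] - q[2])

def nearest_neighbor_route(start, remaining_stops):
    """Order stops greedily, always driving to the closest remaining stop."""
    route, current, remaining = [], start, list(remaining_stops)
    while remaining:
        nxt = min(remaining, key=lambda s: dist(current, s))
        route.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return route

print([s[0] for s in nearest_neighbor_route(depot, stops)])
```

A production system would minimize a richer cost (fuel, left turns, accident risk, delivery windows) rather than straight-line distance, but the structure of the problem is the same.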

Car makers use the same model to study how their vehicles are used on the road and apply that information to improve their designs based on drivers' behavior.

Predictive models and sensors have also been employed by governments to predict possible dangers in infrastructure and to schedule maintenance before disaster strikes. In 2009, Michael Flowers, appointed by Mayor Michael Bloomberg to head New York City's first analytics department, used big data to fight crime by building a model that analyzed incoming calls and predicted which were likely false and which were legitimate.
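The book does not detail Flowers' model, so the following is only a hedged illustration: a basic classifier (using scikit-learn) trained on hypothetical complaint features to rank which calls deserve attention first.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 1000

# Hypothetical features for each complaint: building age (years),
# prior violations on record, and whether the property is tax-delinquent.
X = np.column_stack([
    rng.integers(1, 120, n),   # building age
    rng.poisson(1.5, n),       # prior violations
    rng.integers(0, 2, n),     # tax delinquent (0/1)
])
# Hypothetical labels: True = legitimate complaint, False = likely false alarm.
y = (0.02 * X[:, 0] + 0.8 * X[:, 1] + 1.2 * X[:, 2] + rng.normal(0, 1, n)) > 3

model = LogisticRegression(max_iter=1000).fit(X, y)

# Score new complaints so inspectors can visit the riskiest addresses first.
new_complaints = np.array([[95, 4, 1], [10, 0, 0]])
print(model.predict_proba(new_complaints)[:, 1])  # probability each is legitimate
```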

Startups, Health Care and Governments

Big data has also helped to launch predictive business models such as Farecast, a business that predicted air ticket prices, saving its customers a great deal of money while raking in profits. Its founder, Oren Etzioni, simply collected data on air ticket prices from travel websites and analyzed how prices changed as the flight date approached, building a model that could predict prices and save travelers an average of $50 per ticket. He later sold Farecast to Microsoft for $110 million and went on to found Decide.com, which applied the same successful model to consumer goods, this time saving consumers some $100 per product purchased.
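Farecast's actual model is proprietary, but the underlying idea can be sketched as fitting a trend to a route's recent fares and advising whether to buy now or wait; the prices below are invented.

```python
import numpy as np

# Hypothetical observed fares for one route, indexed by days before departure.
days_before = np.array([60, 50, 40, 30, 20, 10])
fare = np.array([310, 295, 290, 305, 340, 395])

# Fit a simple quadratic trend to the fare history.
trend = np.poly1d(np.polyfit(days_before, fare, 2))

today = 25             # days before departure
next_week = today - 7  # a week closer to departure
advice = "buy now" if trend(next_week) > trend(today) else "wait"
print(f"expected fare now: ${trend(today):.0f}, in a week: ${trend(next_week):.0f} -> {advice}")
```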

The ability to measure almost everything in the human body inspired IBM and a team of researchers from the University of Ontario Institute of Technology to develop software that analyzes physiological data from premature babies. This can help determine the likelihood of infection and how well the infants are responding to treatment.

Big data has also helped to reduce hospital readmission rates. For example, MedStar Washington Hospital used Microsoft's Amalga software and “analyzed several years of its anonymized medical records—patient demographics, tests, diagnoses, treatments, and more—” to identify the factors most strongly associated with readmission. At this hospital, a common factor turned out to be mental distress; treating it before discharge helped reduce readmission rates.

Companies such as 23andMe already sequence individuals' genomes to help detect specific genetic susceptibilities. Sequencing DNA this way is still costly, but people like Apple's Steve Jobs have undergone it in their battle against cancer; the procedure bought Jobs a few extra years, thanks to big data. This technology will become far more useful once it is affordable for everyone else.

Governments have been slow to catch on to the enormous uses of big data (for purposes other than surveillance, that is), even though they have access to a vast amount of information about their citizens. Some governments are catching on, though, especially those interested in curbing costs and ensuring safety and efficiency. The American government, naturally, has led the way, using big data to estimate the consumer price index (CPI) and to measure the inflation rate.

Open Data

Since some governments are taking their time applying big data methods to state business, many believe it would be more useful to make much of the information they hold freely available to organizations and individuals. The authors argue that governments act only as custodians of the information they collect and should release that data publicly for commercial and civic purposes, since the private sector would be more innovative with it.

To that end, the US government responded by opening a free data website, data.gov, where information from the federal government can be freely accessed. From 47 datasets in 2009, the site grew to some 450,000 datasets from 172 agencies three years later. The UK has also made strides in this regard, as have Australia, Chile, Brazil, and Kenya.

Big Data Ends Privacy: Enter Profiling

Some governments may be releasing more information, but all of them are stockpiling far more than they give away, and much of it is personal, private information. For example, “the U.S. National Security Agency (NSA) is said to intercept and store 1.7 billion emails, phone calls, and other communications every day, according to a Washington Post investigation in 2010…” As we have seen, the private sector does the same: the internet tracks our locations, what we like, and everything in between. We are losing our privacy.

Now, parole boards and police are using big data to profile people, predicting where crime might be higher and even whether or not to release a prisoner on parole. In many US states, people are ‘questioned’ based on their location and on whether an algorithm places them in a statistical category deemed likely to commit a crime.

The Solutions and Conclusion

The authors, as would be expected, suggest a few solutions. Regulation of big data use should be entrusted to internal and independent auditors (they even suggest a name, ‘algorithmists’) who would impartially and confidentially scrutinize big data practices for ethical or legal infractions as well as technical errors. The authors also suggest amending the law to accommodate big data.

While the big data era is just beginning, the authors remind us that not every societal problem is one big data can address. Businesses and governments obviously continue to make use of it, but it has been both used and misused, and our privacy has all but evaporated (governments spy on citizens and on other states, social media track people's likes and dislikes, and so on), even as big data has made communication much easier and its predictive models have helped us avert looming dangers.

The public jury is still out, with many people unsure whether to love big data or hate it, but it is a reality. Perhaps, the authors suggest, if regulation can help us keep some privacy, big data will be more welcome; otherwise, paranoia will remain high, and not without reason. People are being categorized according to where big data places them statistically rather than by any specific personal action, and some have been arrested or denied their freedom unnecessarily.