
Garbage in, garbage out in the age of big data

Paul Wallbank

As we move into the age of big data, it’s important that we understand the limitations and risks of bad information and poor programming.

UK tech site The Register reports that Google Flu Trends has been a dismal failure, with the service over-reporting the incidence of influenza by a factor of nearly 12.

The reason is that the algorithm used to detect a flu outbreak relies on people searching for the terms ‘flu’ or ‘influenza’, and it turns out we tend to over-react to a dose of the sniffles.

Google Flu Trends’ failure illustrates two important things about big data – the veracity of the data coming into the system and the validity of the assumptions underlying the algorithms processing the information.

In the case of Google Flu Trends both were flawed; the algorithm was based on incorrect assumptions while the incoming data was at best dubious.

The latter point is an important factor for the Internet of Machines. Instead of humans entering search terms, millions of sensors are pumping data into the system, so bad data from one sensor can have catastrophic effects on the rest of the network.
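To make the sensor problem concrete, here is a minimal sketch of how one faulty reading can distort an average, and how a simple range check limits the damage. The readings, the `plausible` helper and its temperature limits are all made up for illustration; a real system would need limits suited to its own sensors.

```python
from statistics import mean

# Hypothetical temperature readings (°C) from five sensors.
# The last sensor is faulty and reports a nonsense value.
readings = [21.0, 21.5, 20.5, 22.0, 999.0]


def plausible(value, low=-50.0, high=60.0):
    """Return True if a reading sits inside an assumed plausible range."""
    return low <= value <= high


# Keep only the readings that pass the range check.
clean = [r for r in readings if plausible(r)]

naive_average = mean(readings)   # skewed by the single bad sensor
checked_average = mean(clean)    # computed from the plausible readings only

print(f"Average with bad sensor:  {naive_average:.2f}")
print(f"Average after the check:  {checked_average:.2f}")
```

A range check like this only catches values that are obviously wrong; a sensor that drifts by a degree or two would slip through, which is why the rules applied to the data matter as much as the data itself.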

As managing data becomes a greater task for businesses and governments, making sure that data is trustworthy will be essential, and the rules that govern how the information is used will have to be robust.

In a small business environment, it’s possible to see a situation where a bad bank feed flags customers as not having paid their bills, triggering the accounting program to send out late payment letters while warning the proprietor that the company is about to go broke.
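One way to guard against that scenario is a sanity check that holds the letters back when the feed itself looks suspect. The sketch below is only an illustration: the `review_bank_feed` function, the customer IDs and the 50 per cent threshold are hypothetical, not taken from any real accounting package.

```python
def review_bank_feed(customer_ids, paid_ids, max_unpaid_share=0.5):
    """Decide whether late-payment letters can go out automatically.

    Returns (unpaid, hold): the customers the feed marks as unpaid, and a
    flag that is True when so many look unpaid that the feed is suspect.
    The 0.5 threshold is an assumed value for illustration only.
    """
    paid = set(paid_ids)
    unpaid = [c for c in customer_ids if c not in paid]
    share = len(unpaid) / len(customer_ids) if customer_ids else 0.0
    hold = share > max_unpaid_share
    return unpaid, hold


customers = ["A001", "A002", "A003", "A004"]

# A normal feed: one customer is late, so the letter can go out.
unpaid_ok, hold_ok = review_bank_feed(customers, ["A001", "A002", "A003"])

# A broken feed: no payments came through at all, so hold everything.
unpaid_bad, hold_bad = review_bank_feed(customers, [])

print(unpaid_ok, hold_ok)
print(unpaid_bad, hold_bad)
```

When the hold flag is set, the sensible response is for a person to look at the feed before any letters are sent, rather than trusting the software's verdict.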

That example alone should be a warning about why we need to be careful of bad data entering our businesses.

Hopefully the lessons of Google Flu Trends will save us from more serious mistakes as we come to depend on what algorithms tell us about the data.

Paul Wallbank is the publisher of Networked Globe. His personal blog, Decoding The New Economy, charts how our society is changing in the connected century.