The Hidden Risk of Artificial Intelligence and Big Data

Recent advances in artificial intelligence have been made possible by access to “big data” and cheap computing power. But can things go wrong?

Big data is suddenly everywhere. Where data (and information) used to be scarce and hard to find, we now face a deluge of it. In recent years, the amount of available data has grown exponentially. This, in turn, has been made possible by the huge growth in the number of devices that record data, as well as by the connectivity between all these devices through the Internet of Things. Everyone seems to be collecting, analyzing, and making money from big data, and either celebrating or fearing its power. Combined with modern computing power, it promises to solve almost any problem just by crunching the numbers.

But can big data live up to this hype? In some cases yes, in others no. On the one hand, there is no doubt that big data has already had a critical impact in certain areas. For example, almost every successful AI solution involves some serious number crunching.

First of all, it should be noted that while AI is currently very good at discovering patterns and relationships in large datasets, it is still not very smart (depending on your definition of intelligence, but that’s another story!). Number crunching can effectively find subtle patterns in our data, but it cannot directly tell us which of these correlations are truly meaningful.

Correlation vs Causation

We all know (or should know!) that “correlation does not imply causation.” However, the human mind is wired to look for patterns, and when we see lines sloping together and apparent patterns in our data, we find it hard to resist the urge to assign a cause.

Statistically, however, we cannot make that leap. Tyler Vigen, the author of Spurious Correlations, pokes fun at this on his website (which I highly recommend visiting for some entertaining statistics!). Some examples of such spurious correlations can be found in the figures below, where I’ve put together a few examples showing that ice cream seems to cause a lot of bad things, from wildfires to shark attacks to polio outbreaks.

[Figures: Sale of Ice Cream; Ice Cream Sales vs. Shark Attacks; The Real Cause of Polio]

Looking at these examples, one could argue that we probably should have banned ice cream a long time ago. Indeed, during the polio epidemics of the 1940s, public health experts advised people to forgo ice cream as part of the “polio diet.” Fortunately, they eventually concluded that polio outbreaks and ice cream consumption were correlated simply because polio outbreaks most often occurred in the summer.

In statistics, a spurious relationship or spurious correlation is a mathematical relationship in which two or more events or variables are associated but not causally related, due to either coincidence or the presence of a third, unseen factor (called a common response variable, confounding factor, or lurking variable). One example is the apparent correlation between ice cream sales and shark attacks (I’m pretty sure that rising ice cream sales don’t cause sharks to attack people). There is, however, a common factor behind both numbers: temperature. Warmer temperatures lead more people to buy ice cream and more people to go swimming. This lurking variable is the real cause of the apparent correlation. Fortunately, we have learned to separate correlation from causation, and we can still enjoy ice cream on a hot summer day without fear of polio outbreaks and shark attacks!
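The lurking-variable effect is easy to reproduce with simulated data. The sketch below is only an illustration: the variable names, coefficients, and noise levels are invented assumptions, not real sales or attack figures. Temperature drives both series, so they correlate strongly with each other, but once the temperature effect is regressed out of each series the remaining (partial) correlation is close to zero.

```python
import numpy as np

# All numbers below are made-up assumptions for illustration only.
rng = np.random.default_rng(42)
n_days = 1000

# The hidden common cause.
temperature = rng.normal(20, 8, n_days)

# Both quantities depend on temperature, but not on each other.
ice_cream_sales = 5.0 * temperature + rng.normal(0, 20, n_days)
shark_attacks = 0.3 * temperature + rng.normal(0, 2, n_days)


def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Raw correlation between the two series.
raw_r = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]

# Partial correlation: correlate what remains after controlling for temperature.
partial_r = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(shark_attacks, temperature),
)[0, 1]

print(f"raw correlation:     {raw_r:.2f}")
print(f"partial correlation: {partial_r:.2f}")
```

Controlling for a confounder this way only works if you know which variable to control for, which is exactly where an understanding of the problem comes in.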

Strength and limits of correlations

Given enough data, computing power, and statistical algorithms, patterns will be found. But are these patterns of any interest? Not all of them will be, as spurious patterns can easily outnumber meaningful ones. Big data combined with algorithms can be an extremely useful tool when applied correctly to the right problems. However, no scientist thinks that you can solve a problem by crunching data alone, no matter how powerful the statistical analysis is. You should always start your analysis from a deep understanding of the problem you are trying to solve.

Data science is the end of science (or is it?)

In June 2008, Chris Anderson, former Editor-in-Chief of Wired Magazine, wrote a provocative essay titled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete,” in which he states that, with enough data, the numbers speak for themselves. Correlation replaces causation, and science can progress without coherent models and unified theories.

The strength and versatility of this approach depend on the amount of data: the more data there is, the more powerful and effective the correlations found through computation are supposed to be. We can simply feed the numbers into powerful computers and let the statistical algorithms automatically find interesting patterns and insights.

Unfortunately, this simplistic approach to analysis has some potential pitfalls, which are well illustrated by an example from John Poppelaars’ blog:

“Suppose we want to create a predictive model for some variable Y. This could be, for example, the price of a company’s stock, the click-through rate of an online ad, or the weather next week. We then collect all the data that is available to us and put it into some statistical procedure to find the best possible predictive model for Y. The usual procedure is to first estimate the model using all variables, weed out the irrelevant ones (those that are not significant at some predetermined level of significance), re-estimate the model with the selected subset of variables, and repeat this procedure until a significant model is found. Simple enough, right?

“However, the method of analysis proposed by Anderson has several serious drawbacks. Let me illustrate. Following the example above, I created a set of data points for Y by drawing 100 samples from a uniform distribution between zero and one, so it’s random noise. I then created a set of 50 independent variables X(i) by drawing 100 samples from a uniform distribution between zero and one for each of them. So, all 50 independent variables are also random noise. I then estimated a linear regression model using all X(i) variables to predict Y. Since nothing is related (all variables are uniformly distributed and independent), R squared is expected to be zero, but it isn’t. It turns out to be 0.5. Not bad for a regression on random noise! Luckily, the model is not significant. The insignificant variables are removed step by step and the model is re-estimated. This procedure is repeated until a significant model is found. After several steps, a significant model is found with an adjusted R-squared of 0.4 and 7 variables with a significance level of at least 99%. Again, we are regressing random noise, there is no connection in it, but still, we find a significant model with 7 significant parameters. That’s what happens if we just feed the data into statistical algorithms to find patterns.”
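The experiment described in the quote can be approximated in a few lines. This is a rough sketch under stated assumptions, not the quoted author's actual code: the helper names `ols_fit` and `backward_eliminate` are hypothetical, the random seed is arbitrary, and significance is judged by a fixed cutoff on the t-statistic (|t| ≥ 2.6, roughly the two-sided 99% level at these sample sizes) instead of exact p-values. The exact numbers will therefore differ from those in the quote.

```python
import numpy as np


def ols_fit(X, y):
    """Least-squares fit with an intercept.

    Returns R^2 and the t-statistics of the slope coefficients.
    """
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])           # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    sigma2 = ss_res / (n - p - 1)                  # residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)          # coefficient covariance
    t_stats = beta / np.sqrt(np.diag(cov))
    return r2, t_stats[1:]                         # drop the intercept


def backward_eliminate(X, y, t_crit=2.6):
    """Drop the weakest variable until every remaining |t| >= t_crit.

    t_crit=2.6 is an approximation of the 99% two-sided threshold.
    """
    keep = list(range(X.shape[1]))
    while keep:
        r2, t_stats = ols_fit(X[:, keep], y)
        weakest = int(np.argmin(np.abs(t_stats)))
        if abs(t_stats[weakest]) >= t_crit:
            return keep, r2
        keep.pop(weakest)
    return keep, 0.0                               # nothing survived


rng = np.random.default_rng(0)
n_samples, n_vars = 100, 50
y = rng.uniform(0, 1, n_samples)                   # pure noise target
X = rng.uniform(0, 1, (n_samples, n_vars))         # pure noise predictors

r2_full, _ = ols_fit(X, y)
selected, r2_selected = backward_eliminate(X, y)

print(f"R^2 with all {n_vars} noise variables: {r2_full:.2f}")
print(f"variables surviving elimination: {len(selected)}")
print(f"R^2 of the reduced model: {r2_selected:.2f}")
```

With 50 predictors and only 100 samples, the full model has enough freedom to fit about half of the variance in pure noise, which is why the R squared lands far from zero. Whatever survives the elimination step is significant by construction of the search, not because it means anything.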

The larger the dataset, the louder the noise

A recent study proved that as datasets grow larger, they must contain arbitrary correlations. These correlations appear simply because of the size of the data, which means that many of the correlations found will be spurious. Unfortunately, too much information tends to behave like too little information.
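A small simulation gives a feel for this effect. It is a sketch with assumed numbers, not a reproduction of the study: the sample size, the variable counts, and the 0.4 threshold for calling a correlation "strong" are all arbitrary choices. Every column is independent noise, yet the number of strongly correlated pairs grows quickly as more variables are included.

```python
import numpy as np

rng = np.random.default_rng(7)
n_samples = 50          # observations per variable (assumed)
max_vars = 400          # largest number of variables tried (assumed)
threshold = 0.4         # arbitrary cutoff for a "strong" correlation

# Independent noise: no column is related to any other.
data = rng.normal(size=(n_samples, max_vars))


def count_strong_pairs(block, cutoff):
    """Count variable pairs whose absolute correlation exceeds cutoff."""
    corr = np.corrcoef(block, rowvar=False)
    upper = np.triu_indices(block.shape[1], k=1)   # each pair once
    return int(np.sum(np.abs(corr[upper]) > cutoff))


counts = {}
for n_vars in (10, 100, 400):
    # Use the first n_vars columns, so each set contains the previous one.
    counts[n_vars] = count_strong_pairs(data[:, :n_vars], threshold)
    n_pairs = n_vars * (n_vars - 1) // 2
    print(f"{n_vars:4d} variables, {n_pairs:6d} pairs, "
          f"{counts[n_vars]:4d} with |r| > {threshold}")
```

The fraction of spurious pairs stays roughly constant, but the number of pairs grows with the square of the number of variables, so the absolute number of phantom correlations explodes.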

This is a severe problem for applications that work with high-dimensional data. As an example, let’s say you collect data from thousands of sensors in an industrial plant and then mine that data for patterns in order to optimize performance. In such cases, you can easily be fooled into acting on phantom correlations rather than real indicators of operating performance. This could be very bad news, both financially and in terms of the safe operation of the plant.

Adding Data vs. Adding Information

As data scientists, we often argue that the best way to improve our AI model is to “add more data”. However, the idea that simply “adding more data” will magically improve the performance of your model may not be true. What we should be focusing on is “adding more information”. The distinction between “adding data” and “adding information” is crucial: adding more data does not necessarily mean adding more information (at least not useful and correct information). On the contrary, by blindly adding more and more data, we run the risk of adding data that contains misinformation, which can in turn reduce the performance of our models. Given the extensive access to data, this distinction becomes increasingly important to keep in mind.


So, should the issues above prevent you from making data-driven decisions? No, far from it. Data-driven decision-making is here to stay. It will become more and more valuable as we learn how to make the best use of all available data and information to improve performance, whether that means more clicks on your website or the optimal operation of an industrial plant.
