Big Data never gives an exact answer

Big Data as a concept comprises the 4 Vs listed on IBM's slide in the header of this post: Volume, Velocity, Variety, and Veracity. There is an interesting observation here: each of these components means that a Big Data Analyst (or Data Scientist) is never able to get exact results.

Indeed, if we are analyzing GDP, for example, then with a complete and exhaustive list of countries we can calculate the mean and standard deviation, build short-term forecasts, and so on. Each result will be exact, because it is computed with an arithmetic formula over the full data set. Next year the World Bank updates all the GDP figures, and we can run the calculations again.

Big Data is different. Each individual V has its own way of making the answer vague, even for the simplest analytical tasks.

Volume means that conventional algorithms take too long to process the data, so most likely you will analyze only a random subset of it and hope that the subset is representative. Yes, you can work in a system such as Hadoop; yes, you can write arbitrarily complex queries. But you always have to balance the resources spent against how much the accuracy of the answer matters, and most often the cost of the resources outweighs the benefit of the extra accuracy.
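To make the sampling trade-off concrete, here is a minimal sketch in Python. The data set is synthetic and the names (`estimate_mean`, `population`) are made up for illustration; the point is only that a sample gives an estimate plus an error margin, not the exact value.

```python
import random
import statistics

def estimate_mean(values, sample_size, seed=0):
    """Estimate the mean of `values` from a simple random sample.

    Returns the sample mean and its standard error (a rough measure of
    how far the estimate may sit from the true mean).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(sample) / sample_size ** 0.5
    return mean, std_err

# Synthetic stand-in for a data set too large to scan in full (assumption).
data_rng = random.Random(42)
population = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]

exact = statistics.fmean(population)            # the "expensive" full scan
estimate, std_err = estimate_mean(population, 2_000)

print(f"exact mean:    {exact:.3f}")
print(f"sampled mean:  {estimate:.3f} +/- {std_err:.3f}")
```

The sample uses 1% of the rows, and the price is an answer that is only known to within roughly a couple of standard errors.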

Velocity means that the data arrives at high speed: while your arithmetic formula for the mean is still running, several new values land in your database, and the computed average no longer reflects reality.
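A small sketch of the same effect, assuming a hypothetical `RunningMean` helper that updates the average one value at a time. Any number you read from it is a snapshot that goes stale as soon as more data arrives.

```python
class RunningMean:
    """Incrementally updated mean over a stream of values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental update: new_mean = old_mean + (value - old_mean) / n
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

stream = RunningMean()

# Values that were in the database when the calculation started.
for value in [10.0, 20.0, 30.0]:
    stream.update(value)
snapshot = stream.mean   # what gets reported

# Values that arrived while the report was being prepared.
for value in [100.0, 140.0]:
    stream.update(value)
current = stream.mean    # what is actually true by now

print(f"reported mean: {snapshot}, current mean: {current}")
```

The reported figure was correct for the data it saw; it is just no longer a description of the database.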

Variety means that a calculation that makes sense for one data type in your data set may be meaningless for another, or may have to be interpreted differently. Thus it is impossible to speak of accuracy as a whole, only of precision within a specified data type.
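As an illustration, consider a column that mixes numbers, strings, and missing values. The sketch below uses a hypothetical `summarize` function and made-up data; it shows that each statistic is only meaningful for the part of the data it applies to.

```python
from collections import Counter
import statistics

def summarize(values):
    """Summarize a mixed-type column, one statistic per data type."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = [v for v in values
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    text = [v for v in values if isinstance(v, str)]

    summary = {}
    if numeric:
        summary["numeric_mean"] = statistics.fmean(numeric)
    if text:
        # A mean is meaningless for strings; the most common value is not.
        summary["text_mode"] = Counter(text).most_common(1)[0][0]
    # Everything that fits neither interpretation is left out of both.
    summary["skipped"] = len(values) - len(numeric) - len(text)
    return summary

mixed = [3, 5.0, "red", "blue", "red", None, 4]
result = summarize(mixed)
print(result)
```

Neither the mean nor the mode describes the whole column, and one record is described by neither.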

Veracity simply means that you should not trust any data 100%, and therefore any result is wrong to some degree, simply because of the nature of the data.

Thus, one can argue that Big Data analysis is a kind of job where a precise answer is impossible in principle. This is just an observation, but every Data Scientist should keep it in mind when communicating insights to management.