“640KB ought to be enough for anybody” is an assertion allegedly made by Bill Gates at a computer trade show in 1981.
And here we are in 2020, and 1000KB (1MB) won’t be enough to save one high-quality photo of yourself.
That said, the “640KB is enough for anybody” statement is more likely a myth attributed to Gates. Either way, it goes a long way in showing how far we have come in terms of data storage capability.
And it is not just our storage capability: the volume of data we create and process per day worldwide is so large that it is a job just to write the figure down.
According to the World Economic Forum “the entire digital universe is expected to reach 44 Zettabytes by 2020”.
Ever heard of Zettabytes?
One zettabyte is equal to 10^21 bytes = 1,000,000,000,000,000,000,000 bytes
Hence 44 Zettabytes = 44,000,000,000,000,000,000,000 bytes
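For anyone who wants to check the zeros, here is a minimal Python sketch of the conversion. It assumes decimal (SI) units, where each step up is a factor of 1,000; the UNITS table and the to_bytes helper are illustrative names, not part of any library.

```python
# Decimal (SI) storage units expressed in bytes.
UNITS = {
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
}


def to_bytes(amount, unit):
    """Convert an amount in the given decimal unit to bytes."""
    return amount * UNITS[unit]


# The 44 zettabyte estimate quoted from the World Economic Forum.
digital_universe = to_bytes(44, "ZB")
print(f"{digital_universe:,} bytes")
```

Python integers have no size limit, so the full figure prints exactly, with no rounding.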
In fact, “if this number is correct, it will mean there are 40 times more bytes than there are stars in the observable universe”, according to the World Economic Forum.
I could go on and on with these stats.
65 billion messages are sent on WhatsApp per day. 5 billion searches are made per day on Google and 4 petabytes of data are created on Facebook per day.
We are generating so much data at an incredible speed today that we gave it a name: Big Data.
It makes sense to call it Big Data, but if it were up to me to do the naming, I would have called it “Gigantic Data”. But Big Data it is.
Now don’t be short-sighted: big data is not just about the volume of data available today. It’s more than just ‘volume’.
What is Big Data?
Big Data is voluminous, veracious and valuable data that contains great variety and arrives at great velocity, enhancing insights and enabling smart decisions.
I told you, there is more to big data than just volume.
Volume, Veracity, Value, Variety and Velocity make up what are known as the 5Vs of big data. Some choose to stick with 3Vs (Volume, Variety and Velocity) and others with 4Vs (Volume, Variety, Veracity and Velocity).
I will touch on the 5Vs of Big Data later on.
For many people and businesses, big data simply means a large volume of data. This is true, but stereotypical. As Chimamanda Adichie said in her TED Talk on the single story, “the problem with stereotypes is not that they are untrue, but that they are incomplete”.
Let’s say it took us 200 centuries to generate 1 petabyte of data. That is a huge volume of data, right? But in today’s world, such data would not be considered big data, because it took so long to arrive.
Why Big Data Matters
“Data is the new currency”: this phrase has become popular amongst tech enthusiasts, and it has some other variations, like “data is the new gold” and “data is the new oil”.
Michele Evans even pushed it further by writing on Forbes that “Data is the most important currency”.
All of these make the same point: data is highly valuable in today’s world.
Big Data matters because it’s valuable. It is valuable to individuals, businesses, governments and society at large.
Big Data initiatives are rated as “extremely important to 92% of companies over $250 million” according to a 2014 Accenture Big Data study.
Big Data, when analyzed, yields valuable information that helps businesses understand their market and better strategize their marketing efforts.
It helps individuals understand themselves and the world around them, enabling them to improve their lives greatly and more effectively.
And it helps governments understand their people, their needs and what is required to push their nations forward.
Information empowers and big data begets lots of information.
Examples of Big Data
There are many ways big data is generated in today’s world. As you know, there are different kinds of data and, as such, different kinds of big data.
With that in mind, here are some examples of big data:
Social Media Big Data
This is perhaps the most popular example of big data. Social media platforms like Facebook, YouTube, Twitter, Instagram and Pinterest produce millions (in some instances, billions) of pieces of user-generated content on a daily basis.
This user-generated content can be text, graphics, audio or video.
Consider YouTube, which sees 300 hours of video uploaded every minute. If you do the math, you know that’s a lot of data on a daily basis, since there are 1,440 minutes in a day.
Multiply 1,440 by 300 and you get 432,000 hours per day.
To put things in perspective, 432,000 hours is about 49 years.
Let that sink in!
About 49 years’ worth of video is uploaded to YouTube per day. And this figure is based on 2019 stats; the number of hours uploaded is expected to go higher in 2020.
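The upload arithmetic is easy to double-check with a few lines of Python. This is a minimal sketch, assuming the 300 hours per minute figure quoted here, 1,440 minutes in a day (24 × 60) and a 365-day year; the constant and function names are illustrative.

```python
UPLOAD_HOURS_PER_MINUTE = 300  # figure quoted in the article (2019 stats)
MINUTES_PER_DAY = 24 * 60      # 1,440 minutes in a day
HOURS_PER_YEAR = 24 * 365      # 8,760 hours, ignoring leap years


def daily_upload_hours(hours_per_minute):
    """Hours of video uploaded in one day at a constant per-minute rate."""
    return hours_per_minute * MINUTES_PER_DAY


hours = daily_upload_hours(UPLOAD_HOURS_PER_MINUTE)
years = hours / HOURS_PER_YEAR
print(f"{hours:,} hours of video per day, about {years:.1f} years of viewing")
```

Under these assumptions the result is 432,000 hours per day, which would take a little over 49 years to watch back to back.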
In May 2019, Google AI researchers said in a blog post that they used 2,000 “mannequin challenge” YouTube videos as a training dataset to create an AI model capable of depth prediction from videos in motion.
This instance just scratches the surface of how valuable those 300 hours of uploads per minute are.
Not to mention the use of social media big data by advertisers to target customers.
Autonomous Vehicle Test Data
Most of us who live in the AI world are conversant with the development trends of autonomous vehicles.
And we know that the development of autonomous vehicles relies heavily on Big Data, predominantly sensor data from test vehicles.
According to Texura, autonomous and ADAS test cars produce over 11 terabytes (TB) of data per day.
Leading self-driving car companies like Waymo generate tons of data annually from their test vehicles alone.
Since inception (in 2009), Waymo’s test cars have driven over 20 million miles on public roads and tens of billions of miles through computer simulations. More striking is the fact that in 2019 alone, its test vehicles drove 10 million miles on public roads, as many miles as they had accumulated in the previous ten years.
I’m a fan of math, plus math is fun.
So let’s do some math.
According to a report by Texura, the sensors in an autonomous vehicle record between 1.4 TB and around 19 TB per hour.
10 million miles in a year is the equivalent of about 27,400 miles per day, or roughly 1,142 miles per hour.
Although Waymo has not revealed its fleet size, 200 vehicles driving 8 hours per day at 17 mph would match its rate of 10 million miles per year.
That fleet would log 1,600 vehicle-hours per day (200 vehicles × 8 hours). Using these estimates, the amount of data generated by Waymo’s test cars would range between 2.2 petabytes (PB) and 30.4 PB of sensor data per day.
And that’s just for data generated on public roads, not considering the annual billions of miles it accumulates through computer simulations.
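The estimate can be reproduced with a short Python sketch. Everything here rests on the assumptions already stated: a hypothetical fleet of 200 vehicles driving 8 hours per day at 17 mph (Waymo has not published its fleet size), the 1.4 TB to 19 TB per hour sensor range from the report, and decimal units (1 PB = 1,000 TB). The names are illustrative.

```python
FLEET_SIZE = 200                  # assumed; Waymo has not revealed this
DRIVING_HOURS_PER_DAY = 8         # assumed per vehicle
AVERAGE_SPEED_MPH = 17            # assumed
TB_PER_HOUR_RANGE = (1.4, 19.0)   # per-vehicle sensor output quoted from the report
TB_PER_PB = 1000                  # decimal units


def fleet_miles_per_year(fleet, hours_per_day, speed_mph):
    """Total miles driven by the whole fleet in a 365-day year."""
    return fleet * hours_per_day * speed_mph * 365


def fleet_pb_per_day(fleet, hours_per_day, tb_per_hour):
    """Sensor data produced by the whole fleet per day, in petabytes."""
    vehicle_hours = fleet * hours_per_day
    return vehicle_hours * tb_per_hour / TB_PER_PB


miles = fleet_miles_per_year(FLEET_SIZE, DRIVING_HOURS_PER_DAY, AVERAGE_SPEED_MPH)
low, high = (
    fleet_pb_per_day(FLEET_SIZE, DRIVING_HOURS_PER_DAY, rate)
    for rate in TB_PER_HOUR_RANGE
)
print(f"{miles:,} miles per year")
print(f"{low:.1f} PB to {high:.1f} PB of sensor data per day")
```

Changing any of the three fleet assumptions scales the result linearly, so the range should be read as an order-of-magnitude estimate rather than a measurement.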
Stock Exchange Data
Stock Exchange data are a prime example of Big Data. Just picture the scene at the headquarters of your country’s stock exchange. What is the predominant thing that comes to your mind? Numbers. Stock prices going up and down.
Because you are smart, you know that those numbers are valuable data and voluminous too, right? Yes, they are.
As a case study, the New York Stock Exchange generates 4-5 terabytes of data per day. Of course, you know that these data are being used by corporations and individuals for various purposes, like stock investing and predictions.
Another example is the data generated by popular apps that require an internet connection, like the Google app and the Chrome browser.
Characteristics of Big Data
Big Data are characterized by the 5Vs: Volume, Variety, Velocity, Veracity and Value.
Some characterizations of big data are based on the 3Vs or the 4Vs, but as the understanding of big data has evolved, most businesses characterize big data with the 5Vs, or at the very least recognize the other Vs.
Volume of data is a fundamental characteristic of big data. It’s just common sense that big data should be enormous in size.
What volume of data is considered big enough to be represented as big data?
How big is big data?
The reality is that there is no predefined minimum volume of data that is considered as big data.
It varies between organizations. For one organization, big data may be 10 TB of data, and for another, it may be 10 PB of data.
As someone who has an engineering foundation, when I think of velocity, I think of speed: “how fast”, or in this sense, the frequency of data arrival.
Voluminous data has to arrive fast enough to be big data. In fact, for some businesses, if data doesn’t arrive on time, it becomes useless (worthless).
Decades ago, organizations relied on historical data to make decisions because of technological limitations in collecting and analyzing data in real time.
But in this era, with technological advancements, businesses now have the capability to collect and analyze data in real time.
Velocity in big data is not just about data arriving with great frequency; it’s more about data arriving as close to real time as possible.
Real-time data gives organizations, especially businesses, a strategic and competitive advantage.
Real-time data collection and analysis is largely down to technology, which is why we credit technology for the advent of big data.
Enormous amounts of data are being produced at a great frequency today; what’s more, a lot of this data is different in kind.
When the word variety is thrown at me, what immediately comes to my mind is “different kinds of something”.
So when it’s stated that over 4 petabytes of data are created every day on Facebook, these are different kinds of data.
They may include text, graphics, video and audio; they may be structured or unstructured.
So when the term “Big Data” is mentioned, you should understand that it refers to data that contains great variety: voluminous data that are distinctively different.
Though there are different kinds of big data, big data analysts generally group them into two categories: structured and unstructured data.
Structured data are traditional data types that have been in existence since “back in the days”. These are data stored in the form of spreadsheets (rows and columns) and databases, like a standard bank statement.
Structured data has well-defined relational structures (I guess that’s why it is referred to as structured).
But rarely does voluminous data present itself in a well-defined order.
Pictures, tweets and voice recordings created on social media platforms today arrive in no specific order: they arrive unstructured.
Consequently, most of the data in big data is unstructured.
With the great variety of data (structured and unstructured) in big data, big data analysts have the job of collecting, analyzing and making sense of the data.
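To make the difference concrete, here is a small Python sketch. The bank statement rows and the social media post are made-up sample data, and pulling out hashtags with a regular expression is just one simple way of imposing structure on free text.

```python
import csv
import io
import re

# Structured: every row follows the same columns, like a bank statement.
# (Hypothetical sample data.)
statement_csv = """date,description,amount
2020-01-03,Grocery store,-54.20
2020-01-05,Salary,2500.00
"""
rows = list(csv.DictReader(io.StringIO(statement_csv)))
balance_change = sum(float(row["amount"]) for row in rows)

# Unstructured: free text with no fixed fields, so any structure
# has to be extracted before it can be analyzed.
post = "Loving the new #bigdata course, thanks @data_school! #learning"
hashtags = re.findall(r"#(\w+)", post)

print(round(balance_change, 2))
print(hashtags)
```

The structured rows can be summed directly because each column has a known meaning, while the post needs an extraction step before anything can be counted or compared.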
Veracity relates to the reliability of the data. It’s not enough that voluminous data are arriving in real time. How trustworthy are these data?
Organizations need to rely on the fact that the data they are collecting are truly representative.
Big data veracity, in general, relates to the accuracy (quality and precision) of a dataset and the degree of trustworthiness of the data source and processing.
Value is the last piece of the puzzle, and it seems obvious. Of course, the data has to be valuable, but value comes in different shapes and sizes.
Consider a scenario where a self-driving car company collects voluminous data relating to plant diseases and pests in real time.
Of course, that data is valuable, but is it valuable to the self-driving car company that is working on developing fully autonomous vehicles?
The primary reason why organizations carry out big data projects is to generate some sort of value for themselves.
If the big data project is not generating any form of value for the organization, then that project is a waste of the organization’s resources.