The first discussion of Big Data appeared in a 2001 article by Mr Doug Laney, at the time an analyst at META Group. The paper did not use the term Big Data, but it discussed for the first time the three main characteristics of Big Data: Volume, Velocity and Variety. The term Big Data only started appearing online in 2006-2007 and has taken hold since then.
Today, a Google search for "Big Data definition" produces about 1.96Bn results. This abundance of definitions inevitably causes significant confusion about what Big Data is and when organisations need to start looking at Big Data applications and solutions. Furthermore, considering only the amount of data is not sufficient, since some organisations routinely process hundreds of terabytes per month while others struggle with hundreds of gigabytes.
A more useful approach is to look at Big Data in a business-centric manner and consider its effectiveness within an organisational context. This perspective leads to a definition focused purely on business value and not on technical aspects: we are dealing with Big Data when we cannot obtain the required information within the timeframe necessary for it to add value to organisational activities. To rephrase, organisations need the information to be available before certain events; otherwise, it is useless.
The three Big Data characteristics or 3V’s, identified by Mr Laney in his work, form the foundation used to build Big Data business initiatives and technology infrastructure. While new characteristics constantly appear and broaden the original definition, they often seem redundant and pretentious, created with a marketing purpose in mind. The original 3V’s are discussed below.
Volume, as the name implies, refers to the size of the datasets that need to be processed. Before discussing volume, we need to define how it is measured. As consumers and professionals, we are familiar with the kilobyte, megabyte, and gigabyte.
However, Big Data volumes go well beyond these quantities, so the larger units need to be defined. The list below explains the terms currently used to measure data quantities, expressed in bytes:
• Kilobyte – 1,000 bytes
• Megabyte – 1,000 kilobytes
• Gigabyte – 1,000 megabytes
• Terabyte – 1,000 gigabytes
• Petabyte – 1,000 terabytes
• Exabyte – 1,000 petabytes
• Zettabyte – 1,000 exabytes
• Yottabyte – 1,000 zettabytes
The above definitions are the so-called “decimal definitions” considered by the law courts to be the most appropriate in trade and commerce. The highlighted volumes are the ones that are considered Big Data.
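The decimal scale above can be illustrated with a short Python sketch. The function name `format_bytes` and the sample values are illustrative assumptions, not part of any standard library; the only thing taken from the list is that each unit is 1,000 times the previous one.

```python
# Decimal units from the list above; each is 1,000 times the previous one.
UNITS = ["bytes", "kilobytes", "megabytes", "gigabytes", "terabytes",
         "petabytes", "exabytes", "zettabytes", "yottabytes"]


def format_bytes(num_bytes: int) -> str:
    """Express a byte count in the largest decimal unit that keeps the value >= 1."""
    value = float(num_bytes)
    index = 0
    # Step up one unit at a time until the value drops below 1,000
    # or we run out of named units.
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    return f"{value:.1f} {UNITS[index]}"


print(format_bytes(1_500))          # 1.5 kilobytes
print(format_bytes(33 * 1000**7))   # 33.0 zettabytes
```

A zettabyte is 1000^7 bytes under this scale, which is why the second call multiplies by `1000**7`.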
To put volume in context, it is worth noting that, according to IDC, the amount of all data on Earth in 2018 was 33 Zettabytes [1]. IDC projects that this amount will grow to 175 Zettabytes by 2025. Moreover, the emergence of COVID-19 and the associated rise in digital technology usage are likely to increase this figure even further.
Velocity is the second characteristic of Big Data. It refers to the speed at which data is created and the rate at which it is processed and consumed. The emergence of new business models, innovative applications and the widespread use of portable devices have increased velocity significantly.
The US Federal Reserve [3] estimates that a total of 24.4Bn general-purpose credit card transactions were made in 2012, while in 2018 that figure grew to 40.9Bn, an increase of 68%. Moreover, the electronic payments trend is likely to accelerate further because of COVID-19, since electronic transactions were the only option for most people during the lockdowns, and people are now very comfortable with digital technology. This trend, however, was visible even before the pandemic, when banks in some countries, such as Australia, started reducing the number of their Automated Teller Machines (ATMs).
Increased e-payment volumes are just one example of increasing data velocity. Another example is social media. Microsoft, LinkedIn's parent company, reports that in Q4-2020 engagements on LinkedIn were up by 31% [2]. These engagements include text and other types of data, such as video, audio and graphics. This assortment of data brings us to the last characteristic of Big Data: variety.
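The growth figure quoted above can be checked with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and the annualised rate assumes smooth compounding over the six years from 2012 to 2018, which the source does not claim.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative growth from old to new, as a percentage."""
    return (new - old) / old * 100


def annual_growth_rate(old: float, new: float, years: int) -> float:
    """Compound annual growth rate as a percentage (assumes smooth compounding)."""
    return ((new / old) ** (1 / years) - 1) * 100


# Transaction counts in billions, as quoted in the text.
total = percent_increase(24.4, 40.9)
yearly = annual_growth_rate(24.4, 40.9, years=6)

print(f"Total increase 2012-2018: {total:.1f}%")   # 67.6%, i.e. about 68%
print(f"Implied annual growth:    {yearly:.1f}%")
```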
In the context of Big Data, Variety refers to the types of data that need to be processed. There are three main types of data we need to deal with:
Structured – this data resides within enterprise systems, and its structure is well defined. Examples include Payroll, Finance, or other ERP systems. In each case, a database stores all data. An example of such a data record is an HR system’s employee record. It will contain as a minimum an employee ID, first name, last name and other fields, as required.
Structured data has been around since the early 80s. It is the easiest to process and is the smallest of the three types in terms of quantity.
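As a rough illustration of what "well defined" means here, the employee record described above could be modelled as follows. This is only a sketch: the class name, the field types and the optional `department` field are assumptions made for the example, not a description of any particular HR system.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmployeeRecord:
    """A fixed, well-defined structure: every record has the same typed fields."""
    employee_id: int
    first_name: str
    last_name: str
    department: str = ""  # hypothetical extra field ("other fields, as required")


record = EmployeeRecord(employee_id=1001, first_name="Ada", last_name="Lovelace")

# Because the structure is known in advance, every record maps directly
# onto the columns of a database table.
print(asdict(record))
```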
Semi-structured – this data type consists of large volumes of individual records with small size and a simple record structure. An example would be the data sent by an intelligent power meter to a central system. Each packet has the same format: timestamp – 10 bytes, location – 10 bytes, consumption – 10 bytes + other information – 80 bytes.
Thus, each reading takes 110 bytes. However, this small figure is misleading: with readings every 5 seconds, a city of 500,000 households generates about 950GB per day (110*12*60*24*500,000 bytes). Within a 30-day month, this dataset will grow to about 28.5 Terabytes, and after one year, its size will reach roughly 347 Terabytes.
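The packet layout and the volume arithmetic can be sketched in Python. The field encodings, sample values and the `pack_reading` helper are assumptions for illustration only; the source specifies just the field sizes, the 5-second interval and the number of households. A 30-day month and a 365-day year are assumed.

```python
import struct

# Fixed layout from the text: three 10-byte fields plus 80 bytes of other data.
# "<" selects standard sizes with no alignment padding.
PACKET_FORMAT = "<10s10s10s80s"
PACKET_SIZE = struct.calcsize(PACKET_FORMAT)  # 110 bytes


def pack_reading(timestamp: str, location: str, consumption: str,
                 other: bytes = b"") -> bytes:
    """Pack one meter reading; short fields are padded with null bytes."""
    return struct.pack(PACKET_FORMAT,
                       timestamp.encode("ascii"),
                       location.encode("ascii"),
                       consumption.encode("ascii"),
                       other)


packet = pack_reading("1609459200", "METER00042", "0000001.25")

READINGS_PER_DAY = (60 // 5) * 60 * 24   # one reading every 5 seconds
HOUSEHOLDS = 500_000

daily_bytes = PACKET_SIZE * READINGS_PER_DAY * HOUSEHOLDS
monthly_bytes = daily_bytes * 30          # assumes a 30-day month
yearly_bytes = daily_bytes * 365          # assumes a 365-day year

print(f"Packet size: {len(packet)} bytes")             # 110 bytes
print(f"Per day:     {daily_bytes / 1000**3:.1f} GB")   # 950.4 GB
print(f"Per month:   {monthly_bytes / 1000**4:.1f} TB") # 28.5 TB
print(f"Per year:    {yearly_bytes / 1000**4:.1f} TB")  # 346.9 TB
```

The point of the sketch is that the record structure stays trivial while the volume comes entirely from the number of devices and the sampling frequency.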
Intelligent electricity meters are just one example of semi-structured data. With the continued proliferation of Internet-of-Things (IoT) devices, semi-structured data will be the fastest-growing one of the three types.
Unstructured – strictly speaking, this data is still structured. However, in this case, we are dealing with many different structures and formats. A more accurate term would be multi-structured, but unstructured is the term in common use.
Figure 1. Internet Activity per Minute of Day in 2021. Source: domo.com
Examples of unstructured data include social media content, such as audio, video, graphics, and text. Data from external systems and from enterprise sources, such as Word files, emails, and PDFs, also belongs here.
Figure 1 shows the wide variety of data items generated every minute in 2021. Some highlights contributing to Big Data include users sharing 240k Facebook photos, watching 16 million TikTok videos, and hosting 856 minutes of Zoom webinars.
Figure 1 highlights the continued growth of Big Data in all three characteristics: volume, velocity, and variety. However, this infographic presents only part of the picture, namely the data generated by the activities of individual consumers. Even higher data volumes come from organisations in various industries. This growth is likely to accelerate during COVID-19 and afterwards, as organisations adopt new technologies and deploy new infrastructure, while "connected" consumers confidently adopt new ways of connecting, shopping and working.