Big data usually includes data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set. With this difficulty, a new platform of "big data" tools has arisen to handle sensemaking over large quantities of data, as in the Apache Hadoop Big Data Platform.
In 2012, Gartner updated its definition as follows: "Big data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
A 2016 definition states that "Big data represents the information assets characterized by such a high volume, velocity and variety to require specific technology and analytical methods for its transformation into value".
A 2018 definition states "Big data is where parallel computing tools are needed to handle data", and notes, "This represents a distinct and clearly defined change in the computer science used, via parallel programming theories, and losses of some of the guarantees and capabilities made by Codd’s relational model."
(Source: http://en.wikipedia.org/wiki/Big_data )
Big Data Repositories
Big data repositories have existed in many forms for years, often built by corporations with a special need. Commercial vendors historically offered parallel database management systems for big data beginning in the 1990s.
Teradata Corporation in 1984 marketed the parallel processing DBC 1012 system. Teradata systems were the first to store and analyze 1 terabyte of data in 1992. Hard disk drives were 2.5 GB in 1991 so the definition of big data continuously evolves according to Kryder's Law. Teradata installed the first petabyte class RDBMS based system in 2007. As of 2017, there are a few dozen petabyte class Teradata relational databases installed, the largest of which exceeds 50 PB. Systems up until 2008 were 100% structured relational data. Since then, Teradata has added unstructured data types including XML, JSON, and Avro.
In 2000, Seisint Inc. (now LexisNexis Group) developed a C++-based distributed file-sharing framework for data storage and query. The system stores and distributes structured, semi-structured, and unstructured data across multiple servers. Users can build queries in a C++ dialect called ECL. In 2004, LexisNexis acquired Seisint Inc., and in 2008 it acquired ChoicePoint, Inc. and their high-speed parallel processing platform. The two platforms were merged into HPCC (High-Performance Computing Cluster) Systems, and in 2011 HPCC was open-sourced under the Apache v2.0 License. Quantcast File System became available at about the same time.
CERN and other physics experiments have collected big data sets and analyzed them via high-performance computing (supercomputers). The big data movement, however, presently uses commodity map-reduce architectures.
In 2004, Google published a paper on a process called MapReduce. The MapReduce concept provides a parallel processing model to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step). The results are then gathered and delivered as the output (the Reduce step). An implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop. Apache Spark was developed in 2012 in response to limitations in the MapReduce paradigm, as it adds the ability to set up many operations (not just map followed by reduce).
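The Map and Reduce steps described above can be sketched in a few lines of Python. This is only an illustrative word count run on one machine, not Google's or Hadoop's implementation: the function names (map_chunk, shuffle, reduce_group, word_count) and the sample input are assumptions made for this example, and a thread pool stands in for the parallel nodes of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the intermediate values by key, between the map and reduce steps."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: combine all values gathered for one key into one result."""
    return key, sum(values)


def word_count(chunks, workers=4):
    # Each chunk is mapped independently; the thread pool is a stand-in
    # (an assumption of this sketch) for separate nodes in a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    groups = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in groups.items())


# Hypothetical input, already split into three chunks.
chunks = ["big data tools", "big data platforms", "parallel data processing"]
counts = word_count(chunks)
print(counts)
```

Because each call to map_chunk depends only on its own chunk, the map work can be spread over as many workers as there are chunks. Only the grouping step needs to see all of the intermediate results.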
Big Data - Dimensions
Big data - Four dimensions: Volume, Velocity, Variety, and Veracity (IBM document)
Examples of big data in enterprises
Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes—even petabytes—of information.
Analyze the 12 terabytes of Tweets created each day to improve product sentiment analysis
Convert 350 billion annual meter readings to better predict power consumption
Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
Scrutinize 5 million trade events created each day to identify potential fraud
Analyze 500 million daily call detail records in real-time to predict customer churn faster
Variety: Big data is any type of data - structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.
Monitor hundreds of live video feeds from surveillance cameras to target points of interest
Exploit the 80% data growth in images, video and documents to improve customer satisfaction
Veracity: Establishing trust in big data presents a huge challenge as the variety and number of sources grows.
McKinsey Article on Big Data
11 Feb 2016
Evolution of Big
Analytics 1.0—the era of “business intelligence.”
Analytics 1.0 marked the start of gaining an objective, deep understanding of important business phenomena and giving managers the fact-based comprehension to go beyond intuition when making decisions. For the first time, data about production processes, sales, customer interactions, and more were recorded, aggregated, and analyzed.
Updated 20 November 2018, 11 Feb 2016, 28 Feb 2013