The Data Life Cycle

To put data science in context, we present phases of the data life cycle, from data generation to data interpretation. These phases transform raw bits into value for the end user. Data science is thus much more than data analysis, e

and "extracting" represents the work done in all phases of the data life cycle (see Figure 1).
The cycle starts with the generation of data. People generate data: every search query we perform, link we click, movie we watch, book we read, picture we take, message we send, and place we go contributes to the massive digital footprint we each generate. Walmart collects 2.5 petabytes of unstructured data from 1 million customers every hour (DeZyre, 2015). Sensors generate data: more and more sensors monitor the health of our physical infrastructure, e.g., bridges, tunnels, and buildings; provide ways to be energy efficient, e.g., automatic lighting and temperature control in our rooms at work and at home; and ensure safety on our roads and in public spaces, e.g., video cameras used for traffic control and for security protection. As the promise of the Internet of Things plays out, we will have more and more sensors generating more and more data. At the other extreme from small, cheap sensors, we also have large, expensive, one-of-a-kind scientific instruments, which also generate unfathomable amounts of data. The latest round of the Intergovernmental Panel on Climate Change (IPCC) will produce up to 80 petabytes of data (Balaji et al., 2018). The Large Synoptic Survey Telescope is expected to build, over a period of 10 years, a 500 petabyte database of images and a 15 petabyte catalog of text data (LSST Project Office, 2018). The total amount of Large Hadron Collider data already collected is close to one exabyte (Albrecht et al., 2019).
After generation comes collection. Not all data generated is collected, perhaps out of choice because we do not need or want to, or for practical reasons because the data streams in faster than we can process. Consider how data are sent from expensive scientific instruments, such as the IceCube Neutrino Detector at the South Pole.
Since there are only five polar-orbiting satellites, there are only certain windows of opportunity to transmit restricted amounts of data from the ground to the air (IceCube South Pole Neutrino Observatory, 2019).
Suppose we drop data between the generation and collection stages: could we possibly miss the very event we are trying to detect? Deciding what to collect defines a filter on the data we generate.
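To make the idea of collection as a filter concrete, here is a minimal sketch in Python. It assumes a fixed transmission budget and a per-reading priority score. The `Reading` type, its `priority` field, the greedy selection rule, and all numbers are hypothetical illustrations, not a description of how IceCube or any real instrument chooses what to send.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    timestamp: int      # when the reading was generated
    size_bytes: int     # how much bandwidth it would cost to transmit
    priority: float     # hypothetical score of how interesting it looks


def collect(readings, budget_bytes):
    """Greedily keep the highest-priority readings that fit in the budget.

    Everything not returned is generated but never collected.
    """
    kept = []
    remaining = budget_bytes
    for r in sorted(readings, key=lambda r: r.priority, reverse=True):
        if r.size_bytes <= remaining:
            kept.append(r)
            remaining -= r.size_bytes
    # Return the survivors in their original time order.
    return sorted(kept, key=lambda r: r.timestamp)


generated = [
    Reading(timestamp=0, size_bytes=400, priority=0.2),
    Reading(timestamp=1, size_bytes=300, priority=0.9),
    Reading(timestamp=2, size_bytes=500, priority=0.6),
    Reading(timestamp=3, size_bytes=200, priority=0.1),
]
collected = collect(generated, budget_bytes=1000)
dropped = [r for r in generated if r not in collected]
```

In this toy run the reading at timestamp 0 is dropped. If the priority score misjudged it, the event it recorded is lost for good, which is exactly the risk the question above raises.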
After collection comes processing. Here we mean everything from data cleaning, data wrangling, and data formatting to data compression, for efficient storage, and data encryption, for secure storage.
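As a rough illustration of a few of these processing steps, the sketch below cleans a handful of records, formats them as JSON, and compresses the result using only the Python standard library. The record layout and the sensor names are invented for the example, and encryption is left out because the standard library provides no cipher; a real pipeline would add it with a vetted cryptography library.

```python
import json
import zlib


def clean(records):
    """Drop records with missing fields and normalize the rest."""
    cleaned = []
    for rec in records:
        if rec.get("sensor") is None or rec.get("value") is None:
            continue  # cleaning: discard incomplete records
        cleaned.append({
            "sensor": rec["sensor"].strip().lower(),  # wrangling: one naming style
            "value": float(rec["value"]),             # formatting: one numeric type
        })
    return cleaned


def pack(records):
    """Serialize to JSON and compress for storage."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return zlib.compress(payload)


def unpack(blob):
    """Reverse of pack: decompress and parse."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical raw sensor records, including one with a missing value.
raw = [
    {"sensor": " Bridge-7 ", "value": "3.5"},
    {"sensor": "tunnel-2", "value": None},
    {"sensor": "BRIDGE-7", "value": 4},
]
stored = pack(clean(raw))
restored = unpack(stored)
```

The round trip through `unpack` returns the cleaned records unchanged, which is the property a storage-oriented processing step should preserve.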
After processing comes storage. Here the bits are laid down in memory. Today we think of storage in terms of magnetic tape and hard disk drives, but in the future, especially for long-term, infrequently accessed storage, we will see novel uses of optical technology (Anderson et al., 2018) and even DNA storage devices (Bornholt et al., 2016).
After storage comes management. We are careful to store our data in ways both to optimize expected access patterns and to provide as much generality as possible. Decades of work in database systems have led us to optimal systems for managing relational databases, but the kinds of data we generate are not always a good fit for such systems. We now have structured and unstructured data, data of many types (e.g., text, audio, image, video), and data that arrive at different velocities. We need to create and use different kinds of metadata for these dimensions of heterogeneity to maximize our ability to access and modify the data for subsequent analysis.
Now comes analysis. When most people think of what data science is, what they mean is data analysis. Here, we include all the computational and statistical techniques for analyzing data for some purpose: the algorithms and methods that underlie artificial intelligence (AI), data mining, machine learning, and statistical inference, be they to gain knowledge or insights, build classifiers and predictors, or infer causality. For sure, data analysis is at the heart of data science. Large amounts of data power today's machine learning algorithms. The recent successes of the application of deep learning to different domains, from image and language understanding to

Figure 1. The Data Life Cycle