1.1. Chapter 1: Introduction

Data are descriptions of the world around us, collected through observation and
stored on computers. Computers enable us to infer properties of the world from
these descriptions. Data science is the discipline of drawing conclusions from
data using computation. There are three core aspects of effective data
analysis: exploration, prediction, and inference. This text develops a
consistent approach to all three, introducing statistical ideas and fundamental
ideas in computer science concurrently. We focus on a minimal set of core
techniques that can be applied to a vast range of real-world
applications. A foundation in data science requires not only understanding
statistical and computational techniques, but also recognizing how they apply
to real scenarios.

For whatever aspect of the world we wish to study—whether it’s the Earth’s
weather, the world’s markets, political polls, or the human mind—data we
collect typically offer an incomplete description of the subject at hand. A
central challenge of data science is to draw reliable conclusions using this
partial information.

In this endeavor, we will combine two essential tools: computation and
randomization. For example, we may want to understand climate change trends
using temperature observations. Computers will allow us to use all available
information to draw conclusions. Rather than focusing only on the average
temperature of a region, we will consider the whole range of temperatures
together to construct a more nuanced analysis. Randomness will allow us to
consider the many different ways in which incomplete information might be
completed. Rather than assuming that temperatures vary in a particular way, we
will learn to use randomness as a way to imagine many possible scenarios that
are all consistent with the data we observe.
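
As a rough illustration of these two ideas, the sketch below summarizes a small set of temperatures by more than their average, then uses random resampling with replacement to imagine many alternative versions of the same data. The temperature values, the `resampled_means` helper, and the choice of resampling are all assumptions made for this example; this is one possible technique, not necessarily the method this text has in mind here.

```python
import random
import statistics

# Hypothetical daily temperatures (degrees Celsius) for one region.
# These values are invented for illustration; they are not real observations.
observed = [11.2, 13.5, 9.8, 15.1, 12.4, 14.0, 10.6, 16.3, 12.9, 13.7]

# Computation: look at the whole range of values, not only the average.
summary = {
    "mean": statistics.mean(observed),
    "min": min(observed),
    "max": max(observed),
}


def resampled_means(values, repetitions, seed=0):
    """Return the mean of each of many random resamples of values.

    Each resample draws len(values) items from values with replacement.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(repetitions):
        resample = rng.choices(values, k=len(values))
        means.append(statistics.mean(resample))
    return means


# Randomness: each resample is one imagined scenario built from the data.
means = resampled_means(observed, repetitions=1000)
print("observed summary:", summary)
print("range of resampled means:", min(means), "to", max(means))
```

The spread of the resampled means gives a rough sense of how much the average could vary if the incomplete data had turned out slightly differently.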

Applying this approach requires learning to program a computer, and so this
text includes a complete introduction to programming, interleaved with the
statistical material, that assumes no prior knowledge. Readers with
programming experience will find that we cover several
topics in computation that do not appear in a typical introductory computer
science curriculum. Data science also requires careful reasoning about numerical
quantities, but this text does not assume any background in mathematics or
statistics beyond basic algebra. You will find very few equations in this text.
Instead, techniques are described to readers in the same language in which they
are described to the computers that execute them—a programming language.