1.1.2. Statistical Techniques

The discipline of statistics has long addressed the same fundamental challenge
as data science: how to draw robust conclusions about the world using incomplete
information. One of the most important contributions of statistics is a
consistent and precise vocabulary for describing the relationship between
observations and conclusions. This text continues in the same tradition,
focusing on a set of core inferential problems from statistics: testing
hypotheses, estimating confidence, and predicting unknown quantities.

Data science extends the field of statistics by taking full advantage of
computing, data visualization, machine learning, optimization, and access
to information. The combination of fast computers and the Internet gives
anyone the ability to access and analyze
vast datasets: millions of news articles, full encyclopedias, databases for
any domain, and massive repositories of music, photos, and video.

Applications to real datasets motivate the statistical techniques that we
describe throughout the text. Real data often do not follow regular patterns or
match standard equations. The interesting variation in real data can be lost by
focusing too much attention on simplistic summaries such as average values.
Computers enable a family of methods based on resampling that apply to a wide
range of different inference problems, take into account all available
information, and require few assumptions or conditions. Although these
techniques have often been reserved for advanced courses in statistics, their
flexibility and simplicity are a natural fit for data science applications.
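
To make the idea of resampling concrete, here is a minimal sketch of one such method, the bootstrap percentile interval, written with only Python's standard library. The function name `bootstrap_ci`, its parameters, and the small `sample` list are illustrative assumptions rather than anything defined in this text. The sketch repeatedly draws new samples of the same size from the observed data, with replacement, and records how much a chosen statistic varies across those draws.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.median, reps=2000, level=0.95, seed=0):
    """Approximate a percentile interval for `stat` by resampling `sample`.

    All names and defaults here are hypothetical choices for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    # Each resample draws n values from the sample, with replacement,
    # and records the statistic computed on that resample.
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(reps))
    # Cut off an equal share of the sorted estimates from each tail.
    tail = (1 - level) / 2
    low = estimates[int(tail * reps)]
    high = estimates[min(reps - 1, int((1 - tail) * reps))]
    return low, high

# Made-up observations with one extreme value, where an average alone misleads.
sample = [2, 3, 3, 4, 5, 5, 6, 8, 13, 40]
low, high = bootstrap_ci(sample)
print("Approximate 95% interval for the median:", low, "to", high)
```

The only assumption this sketch leans on is that the observed sample resembles the population it came from. The same function works for other statistics by passing a different `stat`, such as `statistics.mean`.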