Practical, accessible guide to becoming a data scientist, updated to include the latest advances in data science and related fields.
Becoming a data scientist is hard. The job focuses on mathematical tools, but also demands fluency with software engineering, understanding of a business situation, and deep understanding of the data itself. This book provides a crash course in data science, combining all the necessary skills into a unified discipline.
The focus of The Data Science Handbook is on practical applications and the ability to solve real problems, rather than theoretical formalisms that are rarely needed in practice. Among its key points are:
- An emphasis on software engineering and coding skills, which play a significant role in most real data science problems.
- Extensive sample code, detailed discussions of important libraries, and a solid grounding in core concepts from computer science (computer architecture, runtime complexity, and programming paradigms).
- A broad overview of important mathematical tools, including classical techniques in statistics, stochastic modeling, regression, numerical optimization, and more.
- Extensive tips about the practical realities of working as a data scientist, including understanding related job functions, project life cycles, and the varying roles of data science in an organization.
- Exactly the right amount of theory. A solid conceptual foundation is required for fitting the right model to a business problem, understanding a tool’s limitations, and reasoning about discoveries.
Data science is a quickly evolving field, and this 2nd edition has been updated to reflect the latest developments, including the revolution in AI that has come from Large Language Models and the growth of ML Engineering as its own discipline. Much of data science has become a skillset that anybody can have, making this book not only for aspiring data scientists, but also for professionals in other fields who want to use analytics as a force multiplier in their organization.
Table of Contents
Preface to the First Edition xvii
Preface to the Second Edition xix
1 Introduction 1
1.1 What Data Science Is and Isn’t 2
1.2 This Book’s Slogan: Simple Models Are Easier to Work With 3
1.3 How Is This Book Organized? 4
1.4 How to Use This Book? 4
1.5 Why Is It All in Python, Anyway? 4
1.6 Example Code and Datasets 5
1.7 Parting Words 5
Part I The Stuff You’ll Always Use 7
2 The Data Science Road Map 9
2.1 Frame the Problem 10
2.2 Understand the Data: Basic Questions 11
2.3 Understand the Data: Data Wrangling 12
2.4 Understand the Data: Exploratory Analysis 12
2.5 Extract Features 13
2.6 Model 14
2.7 Present Results 14
2.8 Deploy Code 14
2.9 Iterating 15
2.10 Glossary 15
3 Programming Languages 17
3.1 Why Use a Programming Language? What Are the Other Options? 17
3.2 A Survey of Programming Languages for Data Science 18
3.3 Where to Write Code 20
3.4 Python Overview and Example Scripts 21
3.5 Python Data Types 25
3.6 GOTCHA: Hashable and Unhashable Types 30
3.7 Functions and Control Structures 31
3.8 Other Parts of Python 33
3.9 Python’s Technical Libraries 35
3.10 Other Python Resources 39
3.11 Further Reading 39
3.12 Glossary 40
3a Interlude: My Personal Toolkit 41
4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning 43
4.1 The Worst Dataset in the World 43
4.2 How to Identify Pathologies 44
4.3 Problems with Data Content 44
4.4 Formatting Issues 46
4.5 Example Formatting Script 49
4.6 Regular Expressions 50
4.7 Life in the Trenches 53
4.8 Glossary 54
5 Visualizations and Simple Metrics 55
5.1 A Note on Python’s Visualization Tools 56
5.2 Example Code 56
5.3 Pie Charts 56
5.4 Bar Charts 58
5.5 Histograms 59
5.6 Means, Standard Deviations, Medians, and Quantiles 61
5.7 Boxplots 62
5.8 Scatterplots 64
5.9 Scatterplots with Logarithmic Axes 65
5.10 Scatter Matrices 67
5.11 Heatmaps 68
5.12 Correlations 69
5.13 Anscombe’s Quartet and the Limits of Numbers 71
5.14 Time Series 72
5.15 Further Reading 75
5.16 Glossary 75
6 Overview: Machine Learning and Artificial Intelligence 77
6.1 Historical Context 77
6.2 The Central Paradigm: Learning a Function from Example 78
6.3 Machine Learning Data: Vectors and Feature Extraction 79
6.4 Supervised, Unsupervised, and In-Between 79
6.5 Training Data, Testing Data, and the Great Boogeyman of Overfitting 80
6.6 Reinforcement Learning 81
6.7 ML Models as Building Blocks for AI Systems 82
6.8 ML Engineering as a New Job Role 82
6.9 Further Reading 83
6.10 Glossary 83
7 Interlude: Feature Extraction Ideas 85
7.1 Standard Features 85
7.2 Features that Involve Grouping 86
7.3 Preview of More Sophisticated Features 86
7.4 You Get What You Measure: Defining the Target Variable 87
8 Machine-Learning Classification 89
8.1 What Is a Classifier, and What Can You Do with It? 89
8.2 A Few Practical Concerns 90
8.3 Binary Versus Multiclass 90
8.4 Example Script 91
8.5 Specific Classifiers 92
8.6 Evaluating Classifiers 102
8.7 Selecting Classification Cutoffs 105
8.8 Further Reading 106
8.9 Glossary 106
9 Technical Communication and Documentation 109
9.1 Several Guiding Principles 109
9.2 Slide Decks 112
9.3 Written Reports 114
9.4 Speaking: What Has Worked for Me 115
9.5 Code Documentation 117
9.6 Further Reading 117
9.7 Glossary 117
Part II Stuff You Still Need to Know 119
10 Unsupervised Learning: Clustering and Dimensionality Reduction 121
10.1 The Curse of Dimensionality 121
10.2 Example: Eigenfaces for Dimensionality Reduction 123
10.3 Principal Component Analysis and Factor Analysis 125
10.4 Scree Plots and Understanding Dimensionality 127
10.5 Factor Analysis 127
10.6 Limitations of PCA 128
10.7 Clustering 128
10.8 Further Reading 133
10.9 Glossary 134
11 Regression 135
11.1 Example: Predicting Diabetes Progression 136
11.2 Fitting a Line with Least Squares 137
11.3 Alternatives to Least Squares 139
11.4 Fitting Nonlinear Curves 139
11.5 Goodness of Fit: R² and Correlation 141
11.6 Correlation of Residuals 142
11.7 Linear Regression 142
11.8 LASSO Regression and Feature Selection 144
11.9 Further Reading 145
11.10 Glossary 145
12 Data Encodings and File Formats 147
12.1 Typical File Format Categories 147
12.2 CSV Files 149
12.3 JSON Files 150
12.4 XML Files 151
12.5 HTML Files 153
12.6 Tar Files 154
12.7 GZip Files 155
12.8 Zip Files 155
12.9 Image Files: Rasterized, Vectorized, and/or Compressed 156
12.10 It’s All Bytes at the End of the Day 157
12.11 Integers 158
12.12 Floats 158
12.13 Text Data 159
12.14 Further Reading 161
12.15 Glossary 161
13 Big Data 163
13.1 What Is Big Data? 163
13.2 When to Use – And not Use – Big Data 164
13.3 Hadoop: The File System and the Processor 165
13.4 Example PySpark Script 165
13.5 Spark Overview 166
13.6 Spark Operations 168
13.7 PySpark DataFrames 169
13.8 Two Ways to Run PySpark 170
13.9 Configuring Spark 170
13.10 Under the Hood 172
13.11 Spark Tips and Gotchas 172
13.12 The MapReduce Paradigm 173
13.13 Performance Considerations 174
13.14 Further Reading 175
13.15 Glossary 176
14 Databases 177
14.1 Relational Databases and MySQL® 178
14.2 Key–Value Stores 183
14.3 Wide-Column Stores 183
14.4 Document Stores 184
14.5 Further Reading 186
14.6 Glossary 186
15 Software Engineering Best Practices 187
15.1 Coding Style 187
15.2 Version Control and Git for Data Scientists 189
15.3 Testing Code 191
15.4 Test-Driven Development 193
15.5 AGILE Methodology 194
15.6 Further Reading 194
15.7 Glossary 194
16 Traditional Natural Language Processing 197
16.1 Do I Even Need NLP? 197
16.2 The Great Divide: Language Versus Statistics 198
16.3 Example: Sentiment Analysis on Stock Market Articles 198
16.4 Software and Datasets 200
16.5 Tokenization 201
16.6 Central Concept: Bag-of-Words 201
16.7 Word Weighting: TF-IDF 202
16.8 n-Grams 202
16.9 Stop Words 203
16.10 Lemmatization and Stemming 203
16.11 Synonyms 204
16.12 Part of Speech Tagging 204
16.13 Common Problems 204
16.14 Advanced Linguistic NLP: Syntax Trees, Knowledge, and Understanding 206
16.15 Further Reading 207
16.16 Glossary 207
17 Time Series Analysis 209
17.1 Example: Predicting Wikipedia Page Views 210
17.2 A Typical Workflow 213
17.3 Time Series Versus Time-Stamped Events 213
17.4 Resampling and Interpolation 214
17.5 Smoothing Signals 216
17.6 Logarithms and Other Transformations 217
17.7 Trends and Periodicity 217
17.8 Windowing 217
17.9 Brainstorming Simple Features 218
17.10 Better Features: Time Series as Vectors 219
17.11 Fourier Analysis: Sometimes a Magic Bullet 220
17.12 Time Series in Context: The Whole Suite of Features 222
17.13 Further Reading 222
17.14 Glossary 222
18 Probability 225
18.1 Flipping Coins: Bernoulli Random Variables 225
18.2 Throwing Darts: Uniform Random Variables 226
18.3 The Uniform Distribution and Pseudorandom Numbers 227
18.4 Nondiscrete, Noncontinuous Random Variables 228
18.5 Notation, Expectations, and Standard Deviation 230
18.6 Dependence, Marginal, and Conditional Probability 231
18.7 Understanding the Tails 232
18.8 Binomial Distribution 234
18.9 Poisson Distribution 234
18.10 Normal Distribution 235
18.11 Multivariate Gaussian 236
18.12 Exponential Distribution 237
18.13 Log-Normal Distribution 238
18.14 Entropy 238
18.15 Further Reading 240
18.16 Glossary 240
19 Statistics 243
19.1 Statistics in Perspective 243
19.2 Bayesian Versus Frequentist: Practical Tradeoffs and Differing Philosophies 244
19.3 Hypothesis Testing: Key Idea and Example 245
19.4 Multiple Hypothesis Testing 246
19.5 Parameter Estimation 247
19.6 Hypothesis Testing: t-Test 248
19.7 Confidence Intervals 250
19.8 Bayesian Statistics 252
19.9 Naive Bayesian Statistics 253
19.10 Bayesian Networks 253
19.11 Choosing Priors: Maximum Entropy or Domain Knowledge 254
19.12 Further Reading 255
19.13 Glossary 255
20 Programming Language Concepts 257
20.1 Programming Paradigms 257
20.2 Compilation and Interpretation 264
20.3 Type Systems 266
20.4 Further Reading 267
20.5 Glossary 267
21 Performance and Computer Memory 269
21.1 A Word of Caution 269
21.2 Example Script 270
21.3 Algorithm Performance and Big-O Notation 272
21.4 Some Classic Problems: Sorting a List and Binary Search 273
21.5 Amortized Performance and Average Performance 276
21.6 Two Principles: Reducing Overhead and Managing Memory 277
21.7 Performance Tip: Use Numerical Libraries When Applicable 278
21.8 Performance Tip: Delete Large Structures You Don’t Need 280
21.9 Performance Tip: Use Built-In Functions When Possible 280
21.10 Performance Tip: Avoid Superfluous Function Calls 280
21.11 Performance Tip: Avoid Creating Large New Objects 281
21.12 Further Reading 281
21.13 Glossary 281
Part III Specialized or Advanced Topics 283
22 Computer Memory and Data Structures 285
22.1 Virtual Memory, the Stack, and the Heap 285
22.2 Example C Program 286
22.3 Data Types and Arrays in Memory 286
22.4 Structs 287
22.5 Pointers, the Stack, and the Heap 288
22.6 Key Data Structures 292
22.7 Further Reading 297
22.8 Glossary 297
23 Maximum-Likelihood Estimation and Optimization 299
23.1 Maximum-Likelihood Estimation 299
23.2 A Simple Example: Fitting a Line 300
23.3 Another Example: Logistic Regression 301
23.4 Optimization 302
23.5 Gradient Descent 303
23.6 Convex Optimization 306
23.7 Stochastic Gradient Descent 307
23.8 Further Reading 308
23.9 Glossary 308
24 Deep Learning and AI 309
24.1 A Note on Libraries and Hardware 310
24.2 A Note on Training Data 310
24.3 Simple Deep Learning: Perceptrons 311
24.4 What Is a Tensor? 314
24.5 Convolutional Neural Networks 315
24.6 Example: The MNIST Handwriting Dataset 317
24.7 Autoencoders and Latent Vectors 318
24.8 Generative AI and GANs 321
24.9 Diffusion Models 323
24.10 RNNs, Hidden State, and the Encoder–Decoder 324
24.11 Attention and Transformers 325
24.12 Stable Diffusion: Bringing the Parts Together 326
24.13 Large Language Models and Prompt Engineering 327
24.14 Further Reading 328
24.15 Glossary 329
25 Stochastic Modeling 331
25.1 Markov Chains 331
25.2 Two Kinds of Markov Chain, Two Kinds of Questions 333
25.3 Hidden Markov Models and the Viterbi Algorithm 334
25.4 The Viterbi Algorithm 336
25.5 Random Walks 337
25.6 Brownian Motion 338
25.7 ARIMA Models 339
25.8 Continuous-Time Markov Processes 339
25.9 Poisson Processes 340
25.10 Further Reading 341
25.11 Glossary 341
26 Parting Words: Your Future as a Data Scientist 343
Index 345
About the Author
Field Cady is a data scientist, researcher, and author based in Seattle, WA, USA. He has worked for a range of companies including Google, the Allen Institute for Artificial Intelligence, and several startups. He received a BS in physics and math from Stanford and did graduate work in computer science at Carnegie Mellon. He is the author of The Data Science Handbook (Wiley 2017).