Contents
List of Figures ix
List of Tables xiii
Preface xv
About the Authors xxiii
1 Introduction 1
1.1 A Brief History of Data Science . . . . . . . . . . 1
1.2 Data Science Role and Skill Tracks . . . . . . . . 5
1.2.1 Engineering . . . . . . . . . . . . . . . . . 7
1.2.2 Analysis . . . . . . . . . . . . . . . . . . . 8
1.2.3 Modeling/Inference . . . . . . . . . . . . . 10
1.3 What Kind of Questions Can Data Science Solve? 15
1.3.1 Prerequisites . . . . . . . . . . . . . . . . 15
1.3.2 Problem Type . . . . . . . . . . . . . . . 18
1.4 Structure of Data Science Team . . . . . . . . . 20
1.5 Data Science Roles . . . . . . . . . . . . . . . . . 24
2 Soft Skills for Data Scientists 31
2.1 Comparison between Statistician and Data Scientist . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Beyond Data and Analytics . . . . . . . . . . . . 33
2.3 Three Pillars of Knowledge . . . . . . . . . . . . 35
2.4 Data Science Project Cycle . . . . . . . . . . . . 36
2.4.1 Types of Data Science Projects . . . . . . 36
2.4.2 Problem Formulation and Project Planning Stage . . . . . . . . . . . . . . . . . . . . 38
2.4.3 Project Modeling Stage . . . . . . . . . . 40
2.4.4 Model Implementation and Post Production Stage . . . . . . . . . . . . . . . . . . 41
2.4.5 Project Cycle Summary . . . . . . . . . . 42
2.5 Common Mistakes in Data Science . . . . . . . . 43
2.5.1 Problem Formulation Stage . . . . . . . . 43
2.5.2 Project Planning Stage . . . . . . . . . . . 44
2.5.3 Project Modeling Stage . . . . . . . . . . 45
2.5.4 Model Implementation and Post Production Stage . . . . . . . . . . . . . . . . . . 46
2.5.5 Summary of Common Mistakes . . . . . . 47
3 Introduction to the Data 49
3.1 Customer Data for a Clothing Company . . . . . 49
3.2 Swine Disease Breakout Data . . . . . . . . . . . 51
3.3 MNIST Dataset . . . . . . . . . . . . . . . . . . 53
3.4 IMDB Dataset . . . . . . . . . . . . . . . . . . . 53
4 Big Data Cloud Platform 57
4.1 Power of Cluster of Computers . . . . . . . . . . 58
4.2 Evolution of Cluster Computing . . . . . . . . . 59
4.2.1 Hadoop . . . . . . . . . . . . . . . . . . . 59
4.2.2 Spark . . . . . . . . . . . . . . . . . . . . 60
4.3 Introduction of Cloud Environment . . . . . . . 60
4.3.1 Open Account and Create a Cluster . . . 61
4.3.2 R Notebook . . . . . . . . . . . . . . . . . 62
4.3.3 Markdown Cells . . . . . . . . . . . . . . 63
4.4 Leverage Spark Using R Notebook . . . . . . . . 64
4.5 Databases and SQL . . . . . . . . . . . . . . . . 71
4.5.1 History . . . . . . . . . . . . . . . . . . . 71
4.5.2 Database, Table and View . . . . . . . . . 72
4.5.3 Basic SQL Statement . . . . . . . . . . . 74
4.5.4 Advanced Topics in Database . . . . . . . 78
5 Data Pre-processing 79
5.1 Data Cleaning . . . . . . . . . . . . . . . . . . . 81
5.2 Missing Values . . . . . . . . . . . . . . . . . . . 84
5.2.1 Impute missing values with median/mode 85
5.2.2 K-nearest neighbors . . . . . . . . . . . . 86
5.2.3 Bagging Tree . . . . . . . . . . . . . . . . 88
5.3 Centering and Scaling . . . . . . . . . . . . . . . 88
5.4 Resolve Skewness . . . . . . . . . . . . . . . . . 90
5.5 Resolve Outliers . . . . . . . . . . . . . . . . . . 93
5.6 Collinearity . . . . . . . . . . . . . . . . . . . . . 97
5.7 Sparse Variables . . . . . . . . . . . . . . . . . . 100
5.8 Re-encode Dummy Variables . . . . . . . . . . . 101
6 Data Wrangling 105
6.1 Summarize Data . . . . . . . . . . . . . . . . . . 107
6.1.1 dplyr package . . . . . . . . . . . . . . . . 107
6.1.2 apply(), lapply() and sapply() in base R . . 116
6.2 Tidy and Reshape Data . . . . . . . . . . . . . . 120
7 Model Tuning Strategy 125
7.1 Variance-Bias Trade-Off . . . . . . . . . . . . . . 126
7.2 Data Splitting and Resampling . . . . . . . . . . 134
7.2.1 Data Splitting . . . . . . . . . . . . . . . 135
7.2.2 Resampling . . . . . . . . . . . . . . . . . 145
8 Measuring Performance 151
8.1 Regression Model Performance . . . . . . . . . . 151
8.2 Classification Model Performance . . . . . . . . . 155
8.2.1 Confusion Matrix . . . . . . . . . . . . . . 157
8.2.2 Kappa Statistic . . . . . . . . . . . . . . . 159
8.2.3 ROC . . . . . . . . . . . . . . . . . . . . . 161
8.2.4 Gain and Lift Charts . . . . . . . . . . . . 163
9 Regression Models 167
9.1 Ordinary Least Square . . . . . . . . . . . . . . . 168
9.1.1 The Magic P-value . . . . . . . . . . . . . 173
9.1.2 Diagnostics for Linear Regression . . . . . 176
9.2 Principal Component Regression and Partial Least Square . . . . . . . . . . . . . . . . . . . . . . . 180
10 Regularization Methods 189
10.1 Ridge Regression . . . . . . . . . . . . . . . . . . 190
10.2 LASSO . . . . . . . . . . . . . . . . . . . . . . . 195
10.3 Elastic Net . . . . . . . . . . . . . . . . . . . . . 199
10.4 Penalized Generalized Linear Model . . . . . . . 201
10.4.1 Introduction to glmnet package . . . . . . 201
10.4.2 Penalized logistic regression . . . . . . . . 206
11 Tree-Based Methods 217
11.1 Tree Basics . . . . . . . . . . . . . . . . . . . . . 217
11.2 Splitting Criteria . . . . . . . . . . . . . . . . . . 221
11.2.1 Gini impurity . . . . . . . . . . . . . . . . 222
11.2.2 Information Gain (IG) . . . . . . . . . . . 223
11.2.3 Information Gain Ratio (IGR) . . . . . . . 224
11.2.4 Sum of Squared Error (SSE) . . . . . . . . 226
11.3 Tree Pruning . . . . . . . . . . . . . . . . . . . . 228
11.4 Regression and Decision Tree Basic . . . . . . . . 232
11.4.1 Regression Tree . . . . . . . . . . . . . . . 232
11.4.2 Decision Tree . . . . . . . . . . . . . . . . 236
11.5 Bagging Tree . . . . . . . . . . . . . . . . . . . . 241
11.6 Random Forest . . . . . . . . . . . . . . . . . . . 245
11.7 Gradient Boosted Machine . . . . . . . . . . . . 249
11.7.1 Adaptive Boosting . . . . . . . . . . . . . 250
11.7.2 Stochastic Gradient Boosting . . . . . . . 252
12 Deep Learning 259
12.1 Feedforward Neural Network . . . . . . . . . . . 263
12.1.1 Logistic Regression as Neural Network . . 263
12.1.2 Stochastic Gradient Descent . . . . . . . . 265
12.1.3 Deep Neural Network . . . . . . . . . . . 266
12.1.4 Activation Function . . . . . . . . . . . . 270
12.1.5 Optimization . . . . . . . . . . . . . . . . 274
12.1.6 Deal with Overfitting . . . . . . . . . . . . 282
12.1.7 Image Recognition Using FFNN . . . . . . 284
12.2 Convolutional Neural Network . . . . . . . . . . 298
12.2.1 Convolution Layer . . . . . . . . . . . . . 299
12.2.2 Padding Layer . . . . . . . . . . . . . . . 303
12.2.3 Pooling Layer . . . . . . . . . . . . . . . . 304
12.2.4 Convolution Over Volume . . . . . . . . . 308
12.2.5 Image Recognition Using CNN . . . . . . 311
12.3 Recurrent Neural Network . . . . . . . . . . . . 317
12.3.1 RNN Model . . . . . . . . . . . . . . . . . 320
12.3.2 Long Short Term Memory . . . . . . . . . 323
12.3.3 Word Embedding . . . . . . . . . . . . . . 326
12.3.4 Sentiment Analysis Using RNN . . . . . . 328
Appendix 337
13 Handling Large Local Data 339
13.1 readr . . . . . . . . . . . . . . . . . . . . . . . . 339
13.2 data.table— enhanced data.frame . . . . . . . . . 347
14 R code for data simulation 359
14.1 Customer Data for Clothing Company . . . . . . 359
14.2 Swine Disease Breakout Data . . . . . . . . . . . 364
Bibliography 369
Index 37