Thursday, November 16, 2023

Hui Lin and Ming Li - Practitioner’s Guide to Data Science - Book Information - Notes

Contents

List of Figures
List of Tables
Preface
About the Authors

1 Introduction
  1.1 A Brief History of Data Science
  1.2 Data Science Role and Skill Tracks
    1.2.1 Engineering
    1.2.2 Analysis
    1.2.3 Modeling/Inference
  1.3 What Kind of Questions Can Data Science Solve?
    1.3.1 Prerequisites
    1.3.2 Problem Type
  1.4 Structure of Data Science Team
  1.5 Data Science Roles

2 Soft Skills for Data Scientists
  2.1 Comparison between Statistician and Data Scientist
  2.2 Beyond Data and Analytics
  2.3 Three Pillars of Knowledge
  2.4 Data Science Project Cycle
    2.4.1 Types of Data Science Projects
    2.4.2 Problem Formulation and Project Planning Stage
    2.4.3 Project Modeling Stage


    2.4.4 Model Implementation and Post Production Stage
    2.4.5 Project Cycle Summary
  2.5 Common Mistakes in Data Science
    2.5.1 Problem Formulation Stage
    2.5.2 Project Planning Stage
    2.5.3 Project Modeling Stage
    2.5.4 Model Implementation and Post Production Stage
    2.5.5 Summary of Common Mistakes

3 Introduction to the Data
  3.1 Customer Data for a Clothing Company
  3.2 Swine Disease Breakout Data
  3.3 MNIST Dataset
  3.4 IMDB Dataset

4 Big Data Cloud Platform
  4.1 Power of Cluster of Computers
  4.2 Evolution of Cluster Computing
    4.2.1 Hadoop
    4.2.2 Spark
  4.3 Introduction of Cloud Environment
    4.3.1 Open Account and Create a Cluster
    4.3.2 R Notebook
    4.3.3 Markdown Cells
  4.4 Leverage Spark Using R Notebook
  4.5 Databases and SQL
    4.5.1 History
    4.5.2 Database, Table and View
    4.5.3 Basic SQL Statement
    4.5.4 Advanced Topics in Database

5 Data Pre-processing
  5.1 Data Cleaning
  5.2 Missing Values
    5.2.1 Impute missing values with median/mode
    5.2.2 K-nearest neighbors


    5.2.3 Bagging Tree
  5.3 Centering and Scaling
  5.4 Resolve Skewness
  5.5 Resolve Outliers
  5.6 Collinearity
  5.7 Sparse Variables
  5.8 Re-encode Dummy Variables

6 Data Wrangling
  6.1 Summarize Data
    6.1.1 dplyr package
    6.1.2 apply(), lapply() and sapply() in base R
  6.2 Tidy and Reshape Data

7 Model Tuning Strategy
  7.1 Variance-Bias Trade-Off
  7.2 Data Splitting and Resampling
    7.2.1 Data Splitting
    7.2.2 Resampling

8 Measuring Performance
  8.1 Regression Model Performance
  8.2 Classification Model Performance
    8.2.1 Confusion Matrix
    8.2.2 Kappa Statistic
    8.2.3 ROC
    8.2.4 Gain and Lift Charts

9 Regression Models
  9.1 Ordinary Least Square
    9.1.1 The Magic P-value
    9.1.2 Diagnostics for Linear Regression
  9.2 Principal Component Regression and Partial Least Square

10 Regularization Methods
  10.1 Ridge Regression
  10.2 LASSO


  10.3 Elastic Net
  10.4 Penalized Generalized Linear Model
    10.4.1 Introduction to glmnet package
    10.4.2 Penalized logistic regression

11 Tree-Based Methods
  11.1 Tree Basics
  11.2 Splitting Criteria
    11.2.1 Gini impurity
    11.2.2 Information Gain (IG)
    11.2.3 Information Gain Ratio (IGR)
    11.2.4 Sum of Squared Error (SSE)
  11.3 Tree Pruning
  11.4 Regression and Decision Tree Basic
    11.4.1 Regression Tree
    11.4.2 Decision Tree
  11.5 Bagging Tree
  11.6 Random Forest
  11.7 Gradient Boosted Machine
    11.7.1 Adaptive Boosting
    11.7.2 Stochastic Gradient Boosting

12 Deep Learning
  12.1 Feedforward Neural Network
    12.1.1 Logistic Regression as Neural Network
    12.1.2 Stochastic Gradient Descent
    12.1.3 Deep Neural Network
    12.1.4 Activation Function
    12.1.5 Optimization
    12.1.6 Deal with Overfitting
    12.1.7 Image Recognition Using FFNN
  12.2 Convolutional Neural Network
    12.2.1 Convolution Layer
    12.2.2 Padding Layer
    12.2.3 Pooling Layer
    12.2.4 Convolution Over Volume
    12.2.5 Image Recognition Using CNN


  12.3 Recurrent Neural Network
    12.3.1 RNN Model
    12.3.2 Long Short Term Memory
    12.3.3 Word Embedding
    12.3.4 Sentiment Analysis Using RNN

Appendix

13 Handling Large Local Data
  13.1 readr
  13.2 data.table - enhanced data.frame

14 R code for data simulation
  14.1 Customer Data for Clothing Company
  14.2 Swine Disease Breakout Data

Bibliography

Index





