Thursday, November 16, 2023

Data Science - Case Studies

https://hdsr.mitpress.mit.edu/pub/hnptx6lq/release/10

https://towardsdatascience.com/case-study-applying-a-data-science-process-model-to-a-real-world-scenario-93ae57b682bf

https://www.cambridgespark.com/en-gb/case-studies/carrefour-case-study

Inventory Analysis Case Study data files - PwC

https://www.pwc.com/us/en/careers/university-relations/data-and-analytics-case-studies-files.html

Google search results for "data science case study" are interesting.

Data Analytics and Data Mining - Difference Explained

Data analytics can be classified into three categories:

Descriptive analytics: Describes the collected data or dataset with clear visualization and summary.

Predictive analytics: Predicts the future behavior of interest. Provides scenario analysis.

Prescriptive analytics: Makes or suggests smart decisions based on the predictive results. Optimizes the solution based on the results of predictive analytics.
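The three categories above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the monthly sales figures, the straight-line forecast, and the margin and holding-cost parameters are hypothetical and do not come from any of the sources listed here.

```python
from statistics import mean, pstdev

# Hypothetical monthly unit sales (illustrative data only).
sales = [100, 110, 120, 130, 140, 150]
months = list(range(1, len(sales) + 1))


def describe(values):
    """Descriptive analytics: summarize the collected data."""
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "std": pstdev(values),
    }


def fit_line(xs, ys):
    """Least-squares slope and intercept for a straight-line trend."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def predict(xs, ys, x_new):
    """Predictive analytics: extrapolate the fitted trend to a new month."""
    slope, intercept = fit_line(xs, ys)
    return slope * x_new + intercept


def best_order(forecast, candidates, unit_margin=5.0, unit_holding_cost=2.0):
    """Prescriptive analytics: pick the stocking level with the best profit,
    given the forecast. Margin and holding cost are assumed values."""

    def profit(quantity):
        sold = min(quantity, forecast)
        unsold = max(quantity - forecast, 0)
        return unit_margin * sold - unit_holding_cost * unsold

    return max(candidates, key=profit)


summary = describe(sales)
forecast = predict(months, sales, 7)
order = best_order(forecast, candidates=[140, 160, 180])
print(summary, forecast, order)
```

The prescriptive step consumes the output of the predictive step, which mirrors the statement that prescriptive analytics optimizes a solution based on predictive results.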

All three categories of data analytics have to be used to make a decision based on data. For data analytics to be effective across the many different decisions within a company, the company needs to involve at least three kinds of people with different skills:

Business experts: Some set the problem objective, and some provide the decision model, which is based on domain knowledge. The decision model indicates the data to be collected, the processes from which the data will be collected, and the period for which data needs to be collected.

Information technology experts: They design the database, which is likely to be filled during transaction processing, and they also manage the database.

Data analysis experts: They understand data mining, statistical, and operations research (OR) techniques.

Data analytics, as explained, is an objective-oriented process that aims to make smart decisions. The goal is set first, and the data is then analyzed to reach the decision that helps achieve the goal efficiently.

Data mining focuses on identifying undiscovered patterns and establishing hidden relationships embedded in the dataset. Data mining is a part of predictive analytics.
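As one hedged illustration of "identifying undiscovered patterns", the sketch below counts which pairs of items appear together in shopping baskets. Counting co-occurring pairs is a simplified first step of association-rule mining; that technique is chosen here only as an example and is not named in the text above. The baskets and the `min_support` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (illustrative data only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
    {"milk", "bread"},
]


def frequent_pairs(baskets, min_support):
    """Return item pairs whose support (share of baskets containing both
    items) is at least min_support."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


pairs = frequent_pairs(transactions, min_support=0.6)
print(pairs)
```

With these baskets, only bread and butter occur together in at least 60% of transactions, which is the kind of hidden relationship a data mining step would surface for further analysis.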

Ud. 16.11.2023

First published 15.5.2015

Hui Lin and Ming Li - Practitioner’s Guide to Data Science - Book Information - Notes

Contents

List of Figures
List of Tables
Preface
About the Authors

1 Introduction
1.1 A Brief History of Data Science
1.2 Data Science Role and Skill Tracks
1.2.1 Engineering
1.2.2 Analysis
1.2.3 Modeling/Inference
1.3 What Kind of Questions Can Data Science Solve?
1.3.1 Prerequisites
1.3.2 Problem Type
1.4 Structure of Data Science Team
1.5 Data Science Roles

2 Soft Skills for Data Scientists
2.1 Comparison between Statistician and Data Scientist
2.2 Beyond Data and Analytics
2.3 Three Pillars of Knowledge
2.4 Data Science Project Cycle
2.4.1 Types of Data Science Projects
2.4.2 Problem Formulation and Project Planning Stage
2.4.3 Project Modeling Stage
2.4.4 Model Implementation and Post Production Stage
2.4.5 Project Cycle Summary
2.5 Common Mistakes in Data Science
2.5.1 Problem Formulation Stage
2.5.2 Project Planning Stage
2.5.3 Project Modeling Stage
2.5.4 Model Implementation and Post Production Stage
2.5.5 Summary of Common Mistakes

3 Introduction to the Data
3.1 Customer Data for a Clothing Company
3.2 Swine Disease Breakout Data
3.3 MNIST Dataset
3.4 IMDB Dataset

4 Big Data Cloud Platform
4.1 Power of Cluster of Computers
4.2 Evolution of Cluster Computing
4.2.1 Hadoop
4.2.2 Spark
4.3 Introduction of Cloud Environment
4.3.1 Open Account and Create a Cluster
4.3.2 R Notebook
4.3.3 Markdown Cells
4.4 Leverage Spark Using R Notebook
4.5 Databases and SQL
4.5.1 History
4.5.2 Database, Table and View
4.5.3 Basic SQL Statement
4.5.4 Advanced Topics in Database

5 Data Pre-processing
5.1 Data Cleaning
5.2 Missing Values
5.2.1 Impute missing values with median/mode
5.2.2 K-nearest neighbors
5.2.3 Bagging Tree
5.3 Centering and Scaling
5.4 Resolve Skewness
5.5 Resolve Outliers
5.6 Collinearity
5.7 Sparse Variables
5.8 Re-encode Dummy Variables

6 Data Wrangling
6.1 Summarize Data
6.1.1 dplyr package
6.1.2 apply(), lapply() and sapply() in base R
6.2 Tidy and Reshape Data

7 Model Tuning Strategy
7.1 Variance-Bias Trade-Off
7.2 Data Splitting and Resampling
7.2.1 Data Splitting
7.2.2 Resampling

8 Measuring Performance
8.1 Regression Model Performance
8.2 Classification Model Performance
8.2.1 Confusion Matrix
8.2.2 Kappa Statistic
8.2.3 ROC
8.2.4 Gain and Lift Charts

9 Regression Models
9.1 Ordinary Least Square
9.1.1 The Magic P-value
9.1.2 Diagnostics for Linear Regression
9.2 Principal Component Regression and Partial Least Square

10 Regularization Methods
10.1 Ridge Regression
10.2 LASSO
10.3 Elastic Net
10.4 Penalized Generalized Linear Model
10.4.1 Introduction to glmnet package
10.4.2 Penalized logistic regression

11 Tree-Based Methods
11.1 Tree Basics
11.2 Splitting Criteria
11.2.1 Gini impurity
11.2.2 Information Gain (IG)
11.2.3 Information Gain Ratio (IGR)
11.2.4 Sum of Squared Error (SSE)
11.3 Tree Pruning
11.4 Regression and Decision Tree Basic
11.4.1 Regression Tree
11.4.2 Decision Tree
11.5 Bagging Tree
11.6 Random Forest
11.7 Gradient Boosted Machine
11.7.1 Adaptive Boosting
11.7.2 Stochastic Gradient Boosting

12 Deep Learning
12.1 Feedforward Neural Network
12.1.1 Logistic Regression as Neural Network
12.1.2 Stochastic Gradient Descent
12.1.3 Deep Neural Network
12.1.4 Activation Function
12.1.5 Optimization
12.1.6 Deal with Overfitting
12.1.7 Image Recognition Using FFNN
12.2 Convolutional Neural Network
12.2.1 Convolution Layer
12.2.2 Padding Layer
12.2.3 Pooling Layer
12.2.4 Convolution Over Volume
12.2.5 Image Recognition Using CNN
12.3 Recurrent Neural Network
12.3.1 RNN Model
12.3.2 Long Short Term Memory
12.3.3 Word Embedding
12.3.4 Sentiment Analysis Using RNN

Appendix
13 Handling Large Local Data
13.1 readr
13.2 data.table - enhanced data.frame
14 R code for data simulation
14.1 Customer Data for Clothing Company
14.2 Swine Disease Breakout Data

Bibliography
Index

What is Data Science - 2023 Multiple Explanations

BLOG@CACM

What is Data Science?

By Koby Mike, Orit Hazzan

Communications of the ACM, February 2023, Vol. 66 No. 2, Pages 12-13

https://cacm.acm.org/magazines/2023/2/268943-what-is-data-science/fulltext

References

1. Alvargonzález, D. Multidisciplinarity, interdisciplinarity, transdisciplinarity, and the sciences. International Studies in the Philosophy of Science, 25(4), 2011, 387–403. https://doi.org/10.1080/02698595.2011.623366

3. Chang, W. L., Grady, N., et al. NIST big data interoperability framework: Volume 1, big data definitions, 2015.

4. Conway, D. The data science venn diagram. Datist, 2010. http://www.dataists.com/2010/09/the-data-science-venn-diagram/

5. Davenport, T. H. and Patil, D. Data scientist: The sexiest job of the 21st century. Harvard Business Review, 90(5), 2012, 70–76.

7. Gray, J. EScience – A transformed scientific method. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt, 2007.

9. Irizarry, R. A. The role of academia in data science education, 2020.

11. Skiena, S. S. The data science design manual. Springer, 2017.

12. Taylor, D. Battle of the Data Science Venn Diagrams. KDnuggets. https://www.kdnuggets.com/battle-of-the-data-science-venn-diagrams.html/, 2016.

https://www.linkedin.com/pulse/data-science-process-methodology-pratibha-kumari-jha/

https://www.linkedin.com/pulse/methodology-data-science-andre-luiz-coelho-da-silva/?trk=organization_guest_main-feed-card_reshare_feed-article-content

Data Science vs Data Analytics: What Are the Similarities & Differences?

The main differences between data science and data analytics involve the methods and tools for working with data, as well as career paths, titles & salaries.

RICE UNIVERSITY

Department of Computer Science

https://csweb.rice.edu/academics/graduate-programs/online-mds/blog/data-science-vs-data-analytics

https://www.nature.com/articles/s41562-023-01562-4

Overview of Data Science - Oracle Cloud

Oracle Cloud Infrastructure (OCI) Data Science is a fully managed and serverless platform for data science teams to build, train, and manage machine learning models.

https://docs.oracle.com/en-us/iaas/data-science/using/overview.htm

March 29, 2023

Data Science in Finance

CMU Explanation - Detailed explanation

https://www.cmu.edu/mscf/news/2023/data-science-in-finance.html

Wednesday, November 15, 2023

Data Science - Books - Bibliography - Introduction

https://towardsdatascience.com/learn-on-towards-data-science-52245bc91451

Practitioner’s Guide to Data Science

By Hui Lin, Ming Li

1st Edition

First Published 2023

eBook Published 24 May 2023

Based on industry experience, this book outlines real-world scenarios and discusses pitfalls that data science practitioners should avoid. It also covers the big data cloud platform and the art of data science, such as soft skills. The authors use R as the primary tool and provide code for both R and Python.

This book is for readers who want to explore possible career paths and eventually become data scientists. This book comprehensively introduces various data science fields, soft and programming skills in data science projects, and potential career paths. Traditional data-related practitioners such as statisticians, business analysts, and data analysts will find this book helpful in expanding their skills for future data science careers. Undergraduate and graduate students from analytics-related areas will find this book beneficial to learn real-world data science applications. Non-mathematical readers will appreciate the reproducibility of the companion R and Python code.

Key Features:

• It is hands-on. We provide the data and repeatable R and Python code in notebooks. Readers can repeat the analysis in the book using the data and code provided. We also suggest that readers modify the notebook to perform analyses with their data and problems, if possible. The best way to learn data science is to do it!

TABLE OF CONTENTS

Chapter 1|28 pages

Introduction

Chapter 2|18 pages

Soft Skills for Data Scientists

Chapter 3|8 pages

Introduction to the Data

Chapter 4|22 pages

Big Data Cloud Platform

Chapter 5|26 pages

Data Pre-processing

Chapter 6|22 pages

Data Wrangling

Chapter 7|26 pages

Model Tuning Strategy

Chapter 8|16 pages

Measuring Performance

Chapter 9|20 pages

Regression Models

Chapter 10|30 pages

Regularization Methods

Chapter 11|42 pages

Tree-Based Methods

Chapter 12|78 pages

Deep Learning

https://linhui.org/hui's_files/datascientist1#(20)

https://scholar.google.com/citations?user=PAArLQIAAAAJ&hl=en&oi=sra

https://scholar.google.com/citations?user=PAArLQIAAAAJ&hl=en

https://linhui.org/

https://github.com/happyrabbit

https://scientistcafe.com/

A Tour of Data Science: Learn R and Python in Parallel

Nailong Zhang

A Tour of Data Science: Learn R and Python in Parallel covers the fundamentals of data science, including programming, statistics, optimization, and machine learning in a single short book. It does not cover everything, but rather, teaches the key concepts and topics in Data Science. It also covers two of the most popular programming languages used in Data Science, R and Python, in one source.

Key features:

Allows you to learn R and Python in parallel

Covers statistics, programming, optimization and predictive modelling, and the popular data manipulation tools – data table and pandas

Provides a concise and accessible presentation

Includes machine learning algorithms implemented from scratch, linear regression, lasso, ridge, logistic regression, gradient boosting trees, etc.

Appealing to data scientists, statisticians, quantitative analysts, and others who want to learn programming with R and Python from a data science perspective.

https://books.google.co.in/books?id=zVYAEAAAQBAJ

A Hands-On Introduction to Data Science

Chirag Shah

Cambridge University Press, 02-Apr-2020 - Business & Economics - 424 pages

This book introduces the field of data science in a practical and accessible manner.

The foundational ideas and techniques of data science are provided, allowing students to easily develop a firm understanding of the subject. The material will have continual relevance even after tools and technologies change.

Using popular data science tools such as Python and R, the book offers many examples of real-life applications, with practice ranging from small to big data. A suite of online material for both instructors and students provides a strong supplement to the book, including datasets, chapter slides, solutions, sample exams and curriculum suggestions. This entry-level textbook is ideally suited to readers from a range of disciplines wishing to build a practical, working knowledge of data science.

https://books.google.co.in/books?id=rljPDwAAQBAJ

Data Science Job: How to become a Data Scientist

Przemek Chojecki, 31-Jan-2020 - Computers - 100 pages

Data Scientist is one of the hottest jobs on the market right now. Demand for data science is huge and will only grow, and it seems like it will grow much faster than the actual number of data scientists. So if you want to make a career change and become a data scientist, now is the time.

This book will guide you through the process. From my experience of working with multiple companies as a project manager, a data science consultant or a CTO, I was able to see the process of hiring data scientists and building data science teams. I know what’s important to land your first job as a data scientist, what skills you should acquire, what you should show during a job interview.

https://books.google.co.in/books?id=h0PZDwAAQBAJ

Foundations of Data Science

Avrim Blum, John Hopcroft, Ravindran Kannan

Cambridge University Press, 23-Jan-2020 - Computers - 432 pages

This book provides an introduction to the mathematical and algorithmic foundations of data science, including machine learning, high-dimensional geometry, and analysis of large networks.

Topics include the counterintuitive nature of data in high dimensions, important linear algebraic techniques such as singular value decomposition, the theory of random walks and Markov chains, the fundamentals of and important algorithms for machine learning, algorithms and analysis for clustering, probabilistic models for large networks, representation learning including topic modelling and non-negative matrix factorization, wavelets and compressed sensing.

Important probabilistic techniques are developed including the law of large numbers, tail inequalities, analysis of random projections, generalization guarantees in machine learning, and moment methods for analysis of phase transitions in large random graphs. Additionally, important structural and complexity measures are discussed such as matrix norms and VC-dimension. This book is suitable for both undergraduate and graduate courses in the design and analysis of algorithms for data.

https://books.google.co.in/books?id=koHCDwAAQBAJ

Data Science and Intelligent Applications: Proceedings of ICDSIA 2020

Ketan Kotecha, Vincenzo Piuri, Hetalkumar N. Shah, Rajan Patel

Springer Nature, 17-Jun-2020 - Technology & Engineering - 576 pages

This book includes selected papers from the International Conference on Data Science and Intelligent Applications (ICDSIA 2020), hosted by Gandhinagar Institute of Technology (GIT), Gujarat, India, on January 24–25, 2020. The proceedings present original and high-quality contributions on theory and practice concerning emerging technologies in the areas of data science and intelligent applications. The conference provides a forum for researchers from academia and industry to present and share their ideas, views and results, while also helping them approach the challenges of technological advancements from different viewpoints.

The contributions cover a broad range of topics, including: collective intelligence, intelligent systems, IoT, fuzzy systems, Bayesian networks, ant colony optimization, data privacy and security, data mining, data warehousing, big data analytics, cloud computing, natural language processing, swarm intelligence, speech processing, machine learning and deep learning, and intelligent applications and systems. Helping strengthen the links between academia and industry, the book offers a valuable resource for instructors, students, industry practitioners, engineers, managers, researchers, and scientists alike.

p.217 Human activity recognition

https://books.google.co.in/books?id=eSbsDwAAQBAJ

Data Science and Productivity Analytics

Editors: Charles, Vincent, Aparicio, Juan, Zhu, Joe (Eds.)

Table of contents (15 chapters)

Data Envelopment Analysis and Big Data: Revisit with a Faster Method Pages 1-34

Khezrimotlagh, Dariush (et al.)

Data Envelopment Analysis (DEA): Algorithms, Computations, and Geometry Pages 35-56

Dulá, José H.

An Introduction to Data Science and Its Applications Pages 57-81

Rabasa, Alex (et al.)

Identification of Congestion in DEA Pages 83-119

Mehdiloo, Mahmood (et al.)

Data Envelopment Analysis and Non-parametric Analysis Pages 121-160

Villa, Gabriel (et al.)

The Measurement of Firms’ Efficiency Using Parametric Techniques Pages 161-199

Orea, Luis

Fair Target Setting for Intermediate Products in Two-Stage Systems with Data Envelopment Analysis Pages 201-226

An, Qingxian (et al.)

Fixed Cost and Resource Allocation Considering Technology Heterogeneity in Two-Stage Network Production Systems Pages 227-249

Ding, Tao (et al.)

Efficiency Assessment of Schools Operating in Heterogeneous Contexts: A Robust Nonparametric Analysis Using PISA 2015 Pages 251-277

Cordero, Jose Manuel (et al.)

A DEA Analysis in Latin American Ports: Measuring the Performance of Guayaquil Contecon Port Pages 279-309

Morales-Núñez, Emilio J. (et al.)

Effects of Locus of Control on Bank’s Policy - A Case Study of a Chinese State-Owned Bank Pages 311-335

Xu, Cong (et al.)

A Data Scientific Approach to Measure Hospital Productivity Pages 337-358

Daneshvar Rouyendegh (B. Erdebilli), Babak (et al.)

Environmental Application of Carbon Abatement Allocation by Data Envelopment Analysis Pages 359-389

Yu, Anyu (et al.)

Pension Funds and Mutual Funds Performance Measurement with a New DEA (MV-DEA) Model Allowing for Missing Variables Pages 391-413

Badrizadeh, Maryam (et al.)

Sharpe Portfolio Using a Cross-Efficiency Evaluation Pages 415-439

Landete, Mercedes (et al.)

https://www.springer.com/gp/book/9783030433833

Special Issue on Data Science for Better Productivity

Data science for better productivity

Vincent Charles, Juan Aparicio & Joe Zhu

Journal of the Operational Research Society

Volume 72, 2021 - Issue 5: Special Issue Data Science for Better Productivity

Afsharian, M. (2019). A frontier-based facility location problem with a centralised view of measuring the performance of the network. Journal of the Operational Research Society, 72(5), 1058–1074. https://doi.org/10.1080/01605682.2019.1639476

Bougnol, M.-L., & Dulà, J. (2020). Improving productivity using government data: The case of US Centers for Medicare & Medicaid's ‘Nursing Home Compare’. Journal of the Operational Research Society, 72(5), 1075–1086. https://doi.org/10.1080/01605682.2020.1724056

Del Vecchio, M., Kharlamov, A., Parry, G., & Pogrebna, G. (2020). Improving productivity in Hollywood with data science: Using emotional arcs of movies to drive product and service innovation in entertainment industries. Journal of the Operational Research Society, 72(5), 1110–1137. https://doi.org/10.1080/01605682.2019.1705194

Grimaldi, D., Fernandez, V., & Carrasco, C. (2019). Exploring data conditions to improve business performance. Journal of the Operational Research Society, 72(5), 1087–1098. https://doi.org/10.1080/01605682.2019.1590136

Ihrig, S., Ishizaka, A., Brech, C., & Fliedner, T. (2019). A new hybrid method for the fair assignment of productivity targets to indirect corporate processes. Journal of the Operational Research Society, 72(5), 989–1001. https://doi.org/10.1080/01605682.2019.1639477

Jiang, R., Yang, Y., Chen, Y., & Liang, L. (2019). Corporate diversification, firm productivity and resource allocation decisions: The data envelopment analysis approach. Journal of the Operational Research Society, 72(5), 1002–1014. https://doi.org/10.1080/01605682.2019.1568841

Li, Y., & Chen, W. (2019). Entropy method of constructing a combined model for improving loan default prediction: A case study in China. Journal of the Operational Research Society, 72(5), 1099–1109. https://doi.org/10.1080/01605682.2019.1702905

Lin, S.-W., Lu, W.-M., & Lin, F. (2020). Entrusting decisions to the public service pension fund: An integrated predictive model with additive network DEA approach. Journal of the Operational Research Society, 72(5), 1015–1032. https://doi.org/10.1080/01605682.2020.1718011

Routh, P., Roy, A., & Meyer, J. (2020). Estimating customer churn under competing risks. Journal of the Operational Research Society, 72(5), 1138–1155. https://doi.org/10.1080/01605682.2020.1776166

Shi, Y., Zhu, J., & Charles, V. (2020). Data science and productivity: A bibliometric review of data science applications and approaches in productivity evaluations. Journal of the Operational Research Society, 72(5), 975–988. https://doi.org/10.1080/01605682.2020.1860661

Summerfield, N. S., Deokar, A. V., Xu, M., & Zhu, W. (2020). Should drivers cooperate? Performance evaluation of cooperative navigation on simulated road networks using network DEA. Journal of the Operational Research Society, 72(5), 1042–1057. https://doi.org/10.1080/01605682.2019.1700766

Zhu, J. (2020). DEA under big data: Data enabled analytics and network data envelopment analysis. Annals of Operations Research, 1–23. In press. https://doi.org/10.1007/s10479-020-03668-8

Zhu, W., Liu, B., Lu, Z., & Yu, Y. (2020). A DEALG methodology for prediction of effective customers of internet financial loan products. Journal of the Operational Research Society, 72(5), 1033–1041. https://doi.org/10.1080/01605682.2019.1700188

https://www.tandfonline.com/doi/full/10.1080/01605682.2021.1892466

Ud. 16.11.2023, 3.45 am Austin, Texas

Pub. 16.7.2021

What is Data Science? - An Introduction to Data Science - New Developments

What is Data Science? - An Introduction to Data Science

Decision making driven by data or data analysis is age-old. But new data processing technology allows people to process data in ways that were not possible before. Hence data will drive business decisions much more intensively in the next decade.

IT departments are no longer content with just providing technology for processing data. The discipline and the profession of IT are getting involved in finding and understanding the relevance of new data sources, big and small.

The practice of business intelligence is expanding to develop capabilities for analyzing and visualizing structured and unstructured data for its relevance to business decision making, and then building applications that run periodically, at intervals that can be as short as seconds, for example to support crime or fraud prevention activities.

Data science is the name of this emerging discipline.

Data Science Tutorial 1 - Video

edureka!

More videos are available on YouTube on Data Science

Concise Visual Summary of Deep Learning Architectures
Basically neural network architectures
http://www.datasciencecentral.com/profiles/blogs/concise-visual-summary-of-deep-learning-architectures

http://www.datasciencecentral.com has a number of articles on data science.

Data Science - New Developments

2023

https://www.unofficialgoogledatascience.com

50 Years of Data Science
David Donoho
Journal of Computational and Graphical Statistics
Volume 26, 2017 - Issue 4
Pages 745-766 Published online: 19 Dec 2017
https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734

2020
The 2020 Data Science Dictionary—Key Terms You Need to Know
https://www.datasciencecentral.com/profiles/blogs/top-data-science-skills-for-2020-1

Trends in Artificial Intelligence and Data Science for 2020
https://www.datasciencecentral.com/profiles/blogs/trends-in-artificial-intelligence-and-data-science-for-2020-by

Top 5 Data Science Trends for 2020
https://www.datasciencecentral.com/profiles/blogs/top-5-data-science-trends-for-2020

Updated in 2020: on 14 March 2020

7 June 2017, 2 September 2014

Wednesday, October 18, 2023

Search Results Spam

https://news.ycombinator.com/item?id=21622322

https://www.kevin-indig.com/the-problem-with-spam-and-search/

https://support.google.com/websearch/thread/144091989/continuous-spam-results-in-google-search-first-page-result?hl=en

https://searchengineland.com/google-web-spam-report-less-than-1-of-search-results-visited-spammy-301158

https://damonmccoy.com/papers/www2016-cloud.pdf