Machine Learning for Business Analytics
Machine learning—also known as data mining or data analytics—is a fundamental part of data science. It is used by organizations in a wide variety of arenas to turn raw data into actionable information.
Machine Learning for Business Analytics: Concepts, Techniques and Applications in RapidMiner provides a comprehensive introduction and an overview of this methodology. This best-selling textbook covers both statistical and machine learning algorithms for prediction, classification, visualization, dimension reduction, rule mining, recommendations, clustering, text mining, experimentation and network analytics. Along with hands-on exercises and real-life case studies, it also discusses managerial and ethical issues for responsible use of machine learning techniques.
This is the seventh edition of Machine Learning for Business Analytics, and the first using RapidMiner software. This edition also includes:
- A new co-author, Amit Deokar, who brings experience teaching business analytics courses using RapidMiner
- Integrated use of RapidMiner, an open-source machine learning platform that has become commercially popular in recent years
- An expanded chapter focused on discussion of deep learning techniques
- A new chapter on experimental feedback techniques including A/B testing, uplift modeling, and reinforcement learning
- A new chapter on responsible data science
- Updates and new material based on feedback from instructors teaching MBA, Masters in Business Analytics and related programs, undergraduate, diploma and executive courses, and from their students
- A full chapter devoted to relevant case studies with more than a dozen cases demonstrating applications for the machine learning techniques
- End-of-chapter exercises that help readers gauge and expand their comprehension of, and competency with, the material presented
- A companion website with more than two dozen data sets, and instructor materials including exercise solutions, slides, and case solutions
This textbook is an ideal resource for upper-level undergraduate and graduate-level courses in data science, predictive analytics, and business analytics. It is also an excellent reference for analysts, researchers, and data science practitioners working with quantitative data in management, finance, marketing, operations management, information systems, computer science, and information technology.
Contents
Foreword by Ravi Bapna
Preface to the RapidMiner Edition
Acknowledgments
Part I Preliminaries
Chapter 1 Introduction
1.1 What Is Business Analytics?
1.2 What Is Machine Learning?
1.3 Machine Learning, AI, and Related Terms
1.4 Big Data
1.5 Data Science
1.6 Why Are There So Many Different Methods?
1.7 Terminology and Notation
1.8 Road Maps to This Book
1.9 Using RapidMiner Studio
Chapter 2 Overview of the Machine Learning Process
2.1 Introduction
2.2 Core Ideas in Machine Learning
2.3 The Steps in a Machine Learning Project
2.4 Preliminary Steps
2.5 Predictive Power and Overfitting
2.6 Building a Predictive Model with RapidMiner
2.7 Using RapidMiner for Machine Learning
2.8 Automating Machine Learning Solutions
2.9 Ethical Practice in Machine Learning
Problems
Part II Data Exploration and Dimension Reduction
Chapter 3 Data Visualization
3.1 Introduction
3.2 Data Examples
3.3 Basic Charts: Bar Charts, Line Charts, and Scatter Plots
3.4 Multidimensional Visualization
3.5 Specialized Visualizations
3.6 Summary: Major Visualizations and Operations, by Machine Learning Goal
Chapter 4 Dimension Reduction
4.1 Introduction
4.2 Curse of Dimensionality
4.3 Practical Considerations
4.4 Data Summaries
4.5 Correlation Analysis
4.6 Reducing the Number of Categories in Categorical Attributes
4.7 Converting a Categorical Attribute to a Numerical Attribute
4.8 Principal Component Analysis
4.9 Dimension Reduction Using Regression Models
4.10 Dimension Reduction Using Classification and Regression Trees
Problems
Part III Performance Evaluation
Chapter 5 Evaluating Predictive Performance
5.1 Introduction
5.2 Evaluating Predictive Performance
5.3 Judging Classifier Performance
5.4 Judging Ranking Performance
5.5 Oversampling
Problems
Part IV Prediction and Classification Methods
Chapter 6 Multiple Linear Regression
6.1 Introduction
6.2 Explanatory vs. Predictive Modeling
6.3 Estimating the Regression Equation and Prediction
6.4 Variable Selection in Linear Regression
Problems
Chapter 7 k-Nearest Neighbors (k-NN)
7.1 The k-NN Classifier (Categorical Label)
7.2 k-NN for a Numerical Label
7.3 Advantages and Shortcomings of k-NN Algorithms
Appendix: Computing Distances Between Records in RapidMiner
Problems
Chapter 8 The Naive Bayes Classifier
8.1 Introduction
8.2 Applying the Full (Exact) Bayesian Classifier
8.3 Solution: Naive Bayes
8.4 Advantages and Shortcomings of the Naive Bayes Classifier
Problems
Chapter 9 Classification and Regression Trees
9.1 Introduction
9.2 Classification Trees
9.3 Evaluating the Performance of a Classification Tree
9.4 Avoiding Overfitting
9.5 Classification Rules from Trees
9.6 Classification Trees for More Than Two Classes
9.7 Regression Trees
9.8 Improving Prediction: Random Forests and Boosted Trees
9.9 Advantages and Weaknesses of a Tree
Problems
Chapter 10 Logistic Regression
10.1 Introduction
10.2 The Logistic Regression Model
10.3 Example: Acceptance of Personal Loan
10.4 Logistic Regression for Multi-class Classification
10.5 Example of Complete Analysis: Predicting Delayed Flights
Appendix: Logistic Regression for Ordinal Classes
Problems
Chapter 11 Neural Networks
11.1 Introduction
11.2 Concept and Structure of a Neural Network
11.3 Fitting a Network to Data
11.4 Required User Input
11.5 Exploring the Relationship Between Predictors and Target Attribute
11.6 Deep Learning
11.7 Advantages and Weaknesses of Neural Networks
Problems
Chapter 12 Discriminant Analysis
12.1 Introduction
12.2 Distance of a Record from a Class
12.3 Fisher’s Linear Classification Functions
12.4 Classification Performance of Discriminant Analysis
12.5 Prior Probabilities
12.6 Unequal Misclassification Costs
12.7 Classifying More Than Two Classes
12.8 Advantages and Weaknesses
Problems
Chapter 13 Generating, Comparing, and Combining Multiple Models
13.1 Automated Machine Learning (AutoML)
13.2 Explaining Model Predictions
13.3 Ensembles
13.4 Summary
Problems
Part V Intervention and User Feedback
Chapter 14 Interventions: Experiments, Uplift Models, and Reinforcement Learning
14.1 A/B Testing
14.2 Uplift (Persuasion) Modeling
14.3 Reinforcement Learning
14.4 Summary
Problems
Part VI Mining Relationships Among Records
Chapter 15 Association Rules and Collaborative Filtering
15.1 Association Rules
15.2 Collaborative Filtering
15.3 Summary
Problems
Chapter 16 Cluster Analysis
16.1 Introduction
16.2 Measuring Distance Between Two Records
16.3 Measuring Distance Between Two Clusters
16.4 Hierarchical (Agglomerative) Clustering
16.5 Non-Hierarchical Clustering: The k-Means Algorithm
Problems
Part VII Forecasting Time Series
Chapter 17 Handling Time Series
17.1 Introduction
17.2 Descriptive vs. Predictive Modeling
17.3 Popular Forecasting Methods in Business
17.4 Time Series Components
17.5 Data Partitioning and Performance Evaluation
Problems
Chapter 18 Regression-Based Forecasting
18.1 A Model with Trend
18.2 A Model with Seasonality
18.3 A Model with Trend and Seasonality
18.4 Autocorrelation and ARIMA Models
Problems
Chapter 19 Smoothing and Deep Learning Methods for Forecasting
19.1 Smoothing Methods: Introduction
19.2 Moving Average
19.3 Simple Exponential Smoothing
19.4 Advanced Exponential Smoothing
19.5 Deep Learning for Forecasting
Problems
Part VIII Data Analytics
Chapter 20 Social Network Analytics
20.1 Introduction
20.2 Directed vs. Undirected Networks
20.3 Visualizing and Analyzing Networks
20.4 Social Data Metrics and Taxonomy
20.5 Using Network Metrics in Prediction and Classification
20.6 Collecting Social Network Data with RapidMiner
20.7 Advantages and Disadvantages
Problems
Chapter 21 Text Mining
21.1 Introduction
21.2 The Tabular Representation of Text: Term–Document Matrix and “Bag-of-Words”
21.3 Bag-of-Words vs. Meaning Extraction at Document Level
21.4 Preprocessing the Text
21.5 Implementing Machine Learning Methods
21.6 Example: Online Discussions on Autos and Electronics
21.7 Example: Sentiment Analysis of Movie Reviews
21.8 Summary
Problems
Chapter 22 Responsible Data Science
22.1 Introduction
22.2 Unintentional Harm
22.3 Legal Considerations
22.4 Principles of Responsible Data Science
22.5 A Responsible Data Science Framework
22.6 Documentation Tools
22.7 Example: Applying the RDS Framework to the COMPAS Example
22.8 Summary
Problems
Part IX Cases
Chapter 23 Cases
23.1 Charles Book Club
23.2 German Credit
23.3 Tayko Software Cataloger
23.4 Political Persuasion
23.5 Taxi Cancellations
23.6 Segmenting Consumers of Bath Soap
23.7 Direct-Mail Fundraising
23.8 Catalog Cross-Selling
23.9 Time Series Case: Forecasting Public Transportation Demand
23.10 Loan Approval
References
Data Files Used in the Book
Index
About the Authors
Galit Shmueli, PhD, is Distinguished Professor at National Tsing Hua University’s Institute of Service Science, College of Technology Management. She has designed and instructed business analytics courses since 2004 at University of Maryland, Statistics.com, The Indian School of Business, and National Tsing Hua University, Taiwan.
Peter C. Bruce is Founder of the Institute for Statistics Education at Statistics.com, and Chief Learning Officer at Elder Research, Inc.
Amit V. Deokar, PhD, is Associate Dean of Undergraduate Programs and an Associate Professor of Management Information Systems at the Manning School of Business at University of Massachusetts Lowell. Since 2006, he has developed and taught courses in business analytics, with expertise in using the RapidMiner platform. He is an Association for Information Systems Distinguished Member Cum Laude.
Nitin R. Patel, PhD, is co-founder and lead researcher at Cytel Inc. He was also a co-founder of Tata Consultancy Services. A Fellow of the American Statistical Association, Dr. Patel has served as a visiting professor at the Massachusetts Institute of Technology and at Harvard University. He is a Fellow of the Computer Society of India and was a professor at the Indian Institute of Management, Ahmedabad, for 15 years.