Elements of Structured Data
Rectangular Data
Nonrectangular Data Structures
Estimates of Location
Median and Robust Estimates
Further Reading
Standard Deviation and Related Estimates
Example: Variability Estimates of State Population
Exploring the Data Distribution
Frequency Table and Histograms
Further Reading
Mode
Further Reading
Scatterplots
Exploring Two or More Variables
Two Categorical Variables
Visualizing Multiple Variables
Random Sampling and Sample Bias
Random Selection
Sample Mean versus Population Mean
Selection Bias
Further Reading
Central Limit Theorem
Further Reading
Resampling versus Bootstrapping
Confidence Intervals
Normal Distribution
Long-Tailed Distributions
Student’s t-Distribution
Binomial Distribution
Poisson and Related Distributions
Exponential Distribution
Weibull Distribution
Summary
A/B Testing
Why Just A/B? Why Not C, D…?
Hypothesis Tests
Alternative Hypothesis
Further Reading
Permutation Test
Exhaustive and Bootstrap Permutation Test
For Further Reading
P-Value
Type 1 and Type 2 Errors
Further Reading
Further Reading
Further Reading
Further Reading
F-Statistic
Further Reading
Chi-Square Test: A Resampling Approach
Fisher’s Exact Test
Further Reading
Further Reading
Sample Size
Summary
Simple Linear Regression
Fitted Values and Residuals
Prediction versus Explanation (Profiling)
Multiple Linear Regression
Assessing the Model
Model Selection and Stepwise Regression
Prediction Using Regression
Confidence and Prediction Intervals
Dummy Variables Representation
Ordered Factor Variables
Correlated Predictors
Confounding Variables
Testing the Assumptions: Regression Diagnostics
Influential Values
Partial Residual Plots and Nonlinearity
Polynomial
Generalized Additive Models
Naive Bayes
The Naive Solution
Further Reading
Covariance Matrix
A Simple Example
Logistic Regression
Logistic Regression and the GLM
Predicted Values from Logistic Regression
Linear and Logistic Regression: Similarities and Differences
Further Reading
Confusion Matrix
Precision, Recall, and Specificity
AUC
Further Reading
Undersampling
Data Generation
Exploring the Predictions
K-Nearest Neighbors
Distance Metrics
Standardization (Normalization, Z-Scores)
KNN as a Feature Engine
A Simple Example
Measuring Homogeneity or Impurity
Predicting a Continuous Value
Further Reading
Bagging
Variable Importance
Boosting
XGBoost
Hyperparameters and Cross-Validation
Principal Components Analysis
Computing the Principal Components
Further Reading
A Simple Example
Interpreting the Clusters
Hierarchical Clustering
The Dendrogram
Measures of Dissimilarity
Multivariate Normal Distribution
Selecting the Number of Clusters
Scaling and Categorical Variables
Dominant Variables
Problems with Clustering Mixed Data