Learn to answer questions about statistical theory

Data science, a blend of computer science and statistics, is a growing field. Yet many who come up through computer science lack the statistical grounding to put their programming skills to good use.

Many data scientists focus on machine learning and deep learning. Unfortunately, few understand statistical basics such as A/B testing or the shape of a normal distribution. Your models can be useless if you do not understand concepts such as over-fitting and under-fitting.

Supervised learning trains on labeled data, usually for prediction, while unsupervised learning works with unlabeled data, usually for exploratory analysis. Unsupervised methods help with clustering, density estimation, and dimensionality reduction.
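The contrast can be sketched in plain Python. The data, labels, and cluster centers below are invented for illustration: the supervised step is a 1-nearest-neighbor prediction from labeled examples, and the unsupervised step is one assignment pass of k-means over unlabeled points.

```python
# Labeled examples for supervised learning (feature, label); invented data.
labeled = [(1.0, "small"), (1.2, "small"), (8.0, "large"), (8.5, "large")]
# Unlabeled points for unsupervised learning.
unlabeled = [1.1, 7.9, 1.3, 8.2]

# Supervised: predict a label for a new point from the labeled examples
# (1-nearest-neighbor on a single feature).
def predict(x):
    return min(labeled, key=lambda pair: abs(pair[0] - x))[1]

# Unsupervised: group the unlabeled points into two clusters
# (one assignment step of k-means, with invented starting centers).
centers = [1.0, 8.0]
clusters = {0: [], 1: []}
for x in unlabeled:
    nearest = min(range(2), key=lambda i: abs(centers[i] - x))
    clusters[nearest].append(x)

print(predict(2.0))   # label comes from the labeled examples
print(clusters)       # structure discovered without any labels
```

The supervised step needs the labels to say anything; the unsupervised step never looks at them.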
Most data scientists use the Anaconda distribution, available for download at https://www.anaconda.com/distribution/. It bundles NumPy for arrays, pandas for data frames, scikit-learn for machine learning algorithms, and matplotlib or Plotly for visualizations.
Selection bias arises when data are collected non-randomly, skewing results toward the demographics that were over-sampled. Techniques such as instrumental variables and propensity score matching help offset it statistically.
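A related correction, inverse-probability weighting, shows the idea in a few lines. This is a sketch, not propensity score matching itself: it assumes the sampling probabilities are already known, and all the numbers are invented for illustration.

```python
# Two demographic groups: A is sampled with probability 0.8, B with 0.2.
# True group means (invented): A = 10, B = 20; the groups are equal in size,
# so the true population mean is 15.
sample = [("A", 10.0)] * 8 + [("B", 20.0)] * 2   # the biased sample we observe
probs = {"A": 0.8, "B": 0.2}                      # assumed known sampling probs

# Naive mean is skewed toward the over-sampled group A.
naive_mean = sum(v for _, v in sample) / len(sample)

# Weight each observation by 1 / (its probability of being sampled).
weights = [1.0 / probs[g] for g, _ in sample]
weighted_mean = sum(w * v for w, (_, v) in zip(weights, sample)) / sum(weights)

print(naive_mean)      # biased estimate
print(weighted_mean)   # recovers the population mean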
Long format stores repeated observations in multiple rows, one per measurement, while wide format stores them in a single row with one column per measurement. For instance, a person's measurements over time can be expressed as separate columns on a single row (wide format) or as separate rows (long format).
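In pandas, `melt` converts wide to long and `pivot` converts back. The column names and values below are invented for illustration.

```python
import pandas as pd

# Wide format: one row per person, one column per time point.
wide = pd.DataFrame({
    "person": ["Alice", "Bob"],
    "t1": [5, 7],
    "t2": [6, 8],
})

# Wide -> long: each person-time measurement becomes its own row.
long = wide.melt(id_vars="person", var_name="time", value_name="score")

# Long -> wide again.
back = long.pivot(index="person", columns="time", values="score")

print(long)
```

Two people at two time points yield four rows in long format but only two rows in wide format.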
A normal distribution is just another term for the bell-shaped curve. It’s centered on its mean, symmetrical about it, and asymptotic: the tails approach zero but never reach it.
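These three properties can be checked directly with the standard library's `statistics.NormalDist`; the mean and standard deviation below are just the standard normal.

```python
from statistics import NormalDist

# A standard normal: mean 0, standard deviation 1.
nd = NormalDist(mu=0, sigma=1)

# Symmetry: the density is the same at +x and -x.
symmetric = abs(nd.pdf(1.5) - nd.pdf(-1.5)) < 1e-12

# Centered on its mean: half the probability lies below the mean.
centered = abs(nd.cdf(0) - 0.5) < 1e-12

# Asymptotic: even far out in the tail, the density is tiny but positive.
tail = nd.pdf(6)

print(symmetric, centered, tail)
```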
A/B testing is an experimental method that compares two versions of something by statistically analyzing the impact of changing a single attribute. It’s often used in marketing to compare different approaches.
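One common way to analyze an A/B test of conversion rates is a two-proportion z-test. The conversion counts below are invented for illustration, and the implementation uses only the standard library.

```python
from math import sqrt
from statistics import NormalDist

# Invented A/B test results: conversions and visitors per version.
conv_a, n_a = 200, 1000   # version A: 20% conversion
conv_b, n_b = 260, 1000   # version B: 26% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b

# Pooled proportion under the null hypothesis that both rates are equal.
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 2), round(p_value, 4))
```

A small p-value (conventionally below 0.05) suggests the difference between the two versions is unlikely to be noise.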
Sensitivity measures how well a classifier, such as a random forest or logistic regression, detects positive cases. It’s calculated by dividing the number of correctly predicted positive events (true positives) by the total number of actual positive events.
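The calculation takes a few lines of plain Python; the binary labels below are invented for illustration.

```python
# Invented actual and predicted binary labels (1 = positive event).
actual    = [1, 1, 1, 1, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 1, 0, 1]

# True positives: cases that are actually positive AND predicted positive.
true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
actual_positives = sum(actual)

sensitivity = true_positives / actual_positives
print(sensitivity)   # 4 true positives out of 5 actual positives
```

Note that the false positive at index 5 does not affect sensitivity; it would show up in specificity or precision instead.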
Over-fitting makes predictions too sensitive to noise in the training data, while under-fitting simplifies the data too much to capture the underlying pattern. You counter both by splitting data into training and test sets, possibly with cross-validation.
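The splitting logic itself can be sketched in plain Python; the 80/20 split, five folds, and stand-in data are invented for illustration, and the model-fitting step is left abstract.

```python
# Stand-in for 10 observations; real data would be feature/target pairs.
data = list(range(10))

# Hold-out split: train on 80%, evaluate on the remaining 20%.
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# 5-fold cross-validation: every observation lands in the held-out fold
# exactly once, so each one is used for both training and evaluation.
k = 5
folds = [data[i::k] for i in range(k)]
for fold in folds:
    cv_train = [x for x in data if x not in fold]
    cv_test = fold
    # ...fit the model on cv_train and score it on cv_test here...

print(train, test)
print(folds)
```

A large gap between training and test performance signals over-fitting; poor performance on both signals under-fitting.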