Statistical Theory - AllSumJobs

Data Science Technical Questions Statistical Theory

Lesson 1 of 7

The ASJ Team

Learn to answer questions about statistical theory

Data science, a blend of computer science and statistical knowledge, is a growing field. Yet many who come up through computer science lack the basic statistical knowledge to put it to good use.

Many data scientists focus on machine learning and deep learning. Unfortunately, far fewer understand statistical basics such as A/B testing and the shape of a normal distribution. Your code could be useless if you do not understand concepts such as over-fitting and under-fitting.

What is the difference between supervised and unsupervised learning?

Supervised learning trains on labeled data, usually to predict an outcome, as in classification and regression. Unsupervised learning works on unlabeled data, usually to analyze its structure, as in clustering, density estimation, and dimension reduction.
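
To make the contrast concrete, here is a minimal sketch in plain Python. The data, the labels, and the helper names (predict, two_means) are invented for illustration. The nearest-class-mean classifier and the 2-means clustering are just simple stand-ins for each family, not the only way to do either.

```python
# Toy 1-D data: the values and labels below are invented for illustration.
points = [0.8, 1.0, 1.2, 4.7, 5.0, 5.3]
labels = ["small", "small", "small", "large", "large", "large"]


def mean(values):
    return sum(values) / len(values)


# Supervised: the labels are used to learn one mean per class.
class_means = {
    label: mean([x for x, y in zip(points, labels) if y == label])
    for label in set(labels)
}


def predict(x):
    """Return the label whose class mean is closest to x."""
    return min(class_means, key=lambda label: abs(x - class_means[label]))


# Unsupervised: 2-means clustering never sees the labels.
def two_means(values, iterations=10):
    centroids = [min(values), max(values)]  # simple deterministic start
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1
            for x in values
        ]
        # Move each centroid to the mean of its members.
        for k in (0, 1):
            members = [x for x, a in zip(values, assignment) if a == k]
            if members:
                centroids[k] = mean(members)
    return assignment, centroids


prediction = predict(4.0)
clusters, centroids = two_means(points)
print("supervised prediction for 4.0:", prediction)
print("unsupervised cluster ids:", clusters)
```

The supervised model returns a named label because it was given names to learn from. The clustering only returns anonymous group numbers, which a person still has to interpret.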

What are good packages in Python?

Most people install the Anaconda distribution, available for download at https://www.anaconda.com/distribution/. It gives you NumPy for arrays, Pandas for data frames, Scikit-learn for machine learning algorithms, and matplotlib or plotly for visualizations.

What is selection bias?

Selection bias arises when the data are not a random sample of the population of interest, so results are skewed toward whichever groups were more likely to be included. Instrumental variables and propensity score matching help offset this statistically.
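
A small simulation can show the effect. This sketch assumes a made-up population and a made-up selection rule (response_probability) in which larger values are more likely to end up in the sample. The correction shown is inverse probability weighting, a simpler relative of propensity score methods and not matching itself. It also assumes the selection probabilities are known, which in practice they usually have to be estimated.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 1,000 people with values 1..1000.
population = list(range(1, 1001))
true_mean = sum(population) / len(population)


def response_probability(value):
    """Assumed selection mechanism: larger values respond more often."""
    return 0.1 + 0.8 * value / 1000


# Non-random sample: each person is included with their own probability.
sample = [v for v in population if random.random() < response_probability(v)]

# Naive estimate ignores how the sample was selected.
naive_mean = sum(sample) / len(sample)

# Inverse probability weighting: weight each respondent by 1 / P(selected).
weights = [1 / response_probability(v) for v in sample]
weighted_mean = sum(w * v for w, v in zip(weights, sample)) / sum(weights)

print(f"true mean:     {true_mean:.1f}")
print(f"naive mean:    {naive_mean:.1f}")
print(f"weighted mean: {weighted_mean:.1f}")
```

The naive mean lands well above the true mean because high values are over-represented. The weighted mean moves back toward the true mean.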

What is the difference between Long and Wide data?

Long format lists repeated observations of the same subject in multiple rows, while wide format lists them in a single row. For instance, a person's measurements over time can be expressed as individual columns on a single row (wide format) or as individual rows (long format).
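
The two layouts hold the same information, which a short reshaping sketch can show. The people, time points, and scores are hypothetical, and wide_to_long and long_to_wide are illustrative helpers written for this example.

```python
# Wide format: one row per person, one column per time point.
# The names and scores are hypothetical.
wide = [
    {"person": "Ann", "t1": 10, "t2": 12, "t3": 15},
    {"person": "Bob", "t1": 7, "t2": 9, "t3": 8},
]


def wide_to_long(rows, id_key, value_keys):
    """Turn each (row, time column) pair into its own row."""
    return [
        {id_key: row[id_key], "time": key, "score": row[key]}
        for row in rows
        for key in value_keys
    ]


def long_to_wide(rows, id_key):
    """Collect all rows for one id back into a single row."""
    by_id = {}
    for row in rows:
        by_id.setdefault(row[id_key], {id_key: row[id_key]})
        by_id[row[id_key]][row["time"]] = row["score"]
    return list(by_id.values())


long_rows = wide_to_long(wide, "person", ["t1", "t2", "t3"])
for row in long_rows:
    print(row)

# Converting back recovers the original wide table.
print(long_to_wide(long_rows, "person") == wide)
```

In Pandas, the same reshapes are usually done with DataFrame.melt (wide to long) and DataFrame.pivot (long to wide).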

What is a normal distribution?

A normal distribution is the distribution behind the familiar bell-shaped curve. It’s centered on its mean, symmetrical around it, and asymptotic: the tails get ever closer to the horizontal axis without touching it.
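
These properties can be checked numerically with Python's standard library. The sketch below uses a standard normal (mean 0, standard deviation 1) purely as a convenient example. The same checks work for any mean and standard deviation.

```python
from statistics import NormalDist

# A standard normal: mean 0, standard deviation 1.
z = NormalDist(mu=0, sigma=1)

# Symmetry: the density is the same at equal distances from the mean.
symmetric = abs(z.pdf(-1.5) - z.pdf(1.5)) < 1e-12
print("symmetric around the mean:", symmetric)

# Share of values within 1, 2, and 3 standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}
for k, share in within.items():
    print(f"within {k} sd: {share:.4f}")

# Asymptotic tails: the density shrinks toward zero but stays positive.
print("density at 6 sd is still positive:", z.pdf(6) > 0)
```

The three shares are the basis of the common 68-95-99.7 rule of thumb.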

What is A/B testing?

This is an experimental method that compares two versions, A and B, that differ in a single attribute. Subjects are randomly assigned to one version or the other, and the difference in outcomes is analyzed statistically. It’s often used in marketing to compare different approaches.
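
One common way to do the statistical analysis, when the outcome is a yes/no conversion, is a two-proportion z-test. The sketch below is one possible approach under that assumption. The visitor and conversion counts are hypothetical, and two_proportion_z_test is an illustrative helper, not a library function.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Probability of a gap at least this large if there were no real effect.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical results: 1,000 visitors saw each version of a page.
z_stat, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the gap between the two conversion rates is unlikely to be chance alone. The cutoff for "small" (0.05 is a common convention) should be chosen before the test is run.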

What is statistical power or sensitivity?

Sensitivity measures how well a classifier, such as a random forest or logistic regression, detects actual events. It’s calculated by dividing the number of correctly predicted events (true positives) by the total number of actual events.
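
As a worked example, sensitivity is the count of true positives divided by the count of all actual events (true positives plus false negatives). The labels below are hypothetical and the sensitivity helper is written only for this illustration.

```python
# Hypothetical labels: 1 marks an actual event, 0 a non-event.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]


def sensitivity(actual, predicted):
    """True positives divided by all actual positives."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    return true_pos / (true_pos + false_neg)


# 5 actual events, 4 of them caught by the classifier.
print(sensitivity(actual, predicted))  # prints 0.8
```

The false positive in the sixth position does not affect the result, because sensitivity only looks at cases where an event actually occurred. The helper assumes the data contain at least one actual event.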

What is over-fitting and under-fitting?

An over-fitted model is too sensitive to noise in the training data, while an under-fitted model simplifies the data too much to capture the real pattern. You detect and counter both by splitting the data into training and test sets, possibly with cross-validation.
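
The sketch below shows how a training/test split exposes both problems. Everything in it is an assumption made for illustration: the data are simulated from a straight line plus noise, a constant prediction stands in for an under-fitted model, and a nearest-neighbor lookup that memorizes the training set stands in for an over-fitted one. It uses a single split, not cross-validation.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical data: a straight-line signal plus random noise.
xs = [i / 80 for i in range(80)]
data = [(x, 3 * x + random.gauss(0, 0.3)) for x in xs]
random.shuffle(data)
train, test = data[:40], data[40:]


def mse(model, rows):
    """Mean squared error of a model over (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)


# Under-fit: ignore x and always predict the training mean.
def constant_model(x):
    return mean_y


# Over-fit: memorize the training set and copy the nearest point's y.
def nearest_model(x):
    return min(train, key=lambda row: abs(row[0] - x))[1]


# Reasonable fit: ordinary least-squares line.
slope = sum((x - mean_x) * (y - mean_y) for x, y in train) / sum(
    (x - mean_x) ** 2 for x, _ in train
)
intercept = mean_y - slope * mean_x


def linear_model(x):
    return intercept + slope * x


models = {"constant": constant_model, "nearest": nearest_model, "linear": linear_model}
results = {name: (mse(m, train), mse(m, test)) for name, m in models.items()}
for name, (train_mse, test_mse) in results.items():
    print(f"{name:9s} train MSE {train_mse:.3f}   test MSE {test_mse:.3f}")
```

The memorizing model has zero error on the training set but not on the test set, which is the signature of over-fitting. The constant model has high error on both, which is the signature of under-fitting.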