
Imputing Data

Brandles
3 min read · Feb 17, 2020


One ideal goal of data science is to distill an analysis to a single number when possible, then use that number to draw conclusions about your data. This is difficult when your data set is missing values. Simply ignoring missing data can lead to a significant loss of statistical power: a 35% loss for 10% missing data and a 98% loss for 30% missing data. Ignoring missing data can also lead to biased solutions and unreliable parameter estimates. To overcome these challenges, statisticians have developed sophisticated techniques, collectively called imputation, that handle missing data without ignoring them.

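Imagine receiving a dataset and being able to measure the reliability of the data. The reliability number we will use is Cronbach’s alpha (CA), a single value that summarizes how consistently a set of items measures the same thing. We can explore the effectiveness of different imputation methods by comparing the CA they produce.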
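To make the reliability number concrete, here is a minimal sketch of the standard Cronbach’s alpha formula: k / (k - 1) × (1 - sum of item variances / variance of the total score), where k is the number of items. The function name `cronbach_alpha` and the list-of-rows input layout are assumptions made for this illustration, and the sketch expects complete data with no missing values.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete data.

    rows: one list per respondent, each holding k item scores.
    (Hypothetical helper for illustration; not from the original article.)
    """
    k = len(rows[0])
    # Sample variance of each item (column) across respondents
    item_variances = [variance(col) for col in zip(*rows)]
    # Sample variance of each respondent's total score
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Three items that move together perfectly give an alpha close to 1
scores = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(cronbach_alpha(scores))
```

Values near 1 indicate that the items vary together. Values near 0 indicate that they are mostly unrelated.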

“For each of the four modern missing data treatments (Mean imputation, Regression imputation, Maximum Likelihood, and Multiple imputation): the Estimated Cronbach Alpha’s (with 95% CI) for four different percentages of missing data (10%, 20%, 30%, 40%) under three different missing data mechanisms (MCAR, MAR, MNAR)” SOURCE

The image above shows four different imputation methods. The vertical axis is CA, our reliability number. The horizontal axis is the percentage of values missing from the data set. The reliability of the original data set is plotted as a horizontal line labeled “True Cronbach’s alpha”. The reliability of the data set after imputing is shown as circles. The effectiveness of each imputation method is the distance between the circle…
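The comparison the figure describes can be sketched in a few lines for the simplest of the four methods, mean imputation. Everything here is an assumption made for illustration: the synthetic data, the helper names (`cronbach_alpha`, `knock_out`, `mean_impute`), and the choice to delete values completely at random (MCAR). It is not the procedure used to produce the figure.

```python
import random
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete data (rows = respondents, columns = items)."""
    k = len(rows[0])
    item_variances = [variance(col) for col in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def knock_out(rows, fraction, rng):
    """Set roughly `fraction` of the cells to None, completely at random (MCAR)."""
    return [[None if rng.random() < fraction else v for v in row] for row in rows]


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    col_means = [mean(v for v in col if v is not None) for col in zip(*rows)]
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


rng = random.Random(0)

# Synthetic data: 200 respondents, 5 items that share a latent trait plus noise
complete = []
for _ in range(200):
    trait = rng.gauss(0, 1)
    complete.append([trait + rng.gauss(0, 1) for _ in range(5)])

true_alpha = cronbach_alpha(complete)  # the horizontal line in the figure
print(f"true alpha: {true_alpha:.3f}")

for fraction in (0.1, 0.2, 0.3, 0.4):
    imputed = mean_impute(knock_out(complete, fraction, rng))
    alpha = cronbach_alpha(imputed)  # one circle in the figure
    print(f"{fraction:.0%} missing: alpha = {alpha:.3f}, "
          f"distance from true = {abs(alpha - true_alpha):.3f}")
```

A smaller distance means the imputed data set preserves the reliability of the original more closely. Swapping `mean_impute` for another method would let the same loop compare methods side by side.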
