Missing value patterns

Today’s blog post is about missing value patterns.

Knowing the number of missing values per column helps to get a feel for the completeness of the data and to quickly uncover potential problems. In order to get a better understanding, it also helps to look for missing value patterns, that is, groups of variables where, if one of the variables is missing, the others are usually missing as well.
This property is very common when the data has been generated by joining multiple tables together. For example, if you join a table containing the names of customers with a table containing their addresses, there might be some customers in the name table that do not have an entry in the address table. For these customers, all the columns with address data will be empty.

A good way to identify missing value patterns is by replacing each column with an indicator column that is equal to 1 if the value is not missing and 0 if the value is missing. Each observation basically becomes a vector of ones and zeroes and if two of these vectors are identical, it means that the corresponding observations exhibit the same missing value pattern. If you then do a GROUP BY using all indicator columns, you will get a distinct list of all missing value patterns along with the number of observations in each pattern. You can sort them by the number of observations in descending order to get an idea of the most frequent missing value patterns.

Leave a comment