Today’s blog article is about missing value representations.
Identifying missing values is not always as easy as finding all NULL values in a database or all NA values in an R data frame. Sometimes, missing data points are encoded by specific stand-in values. An empty string (a string of length zero), for example, can represent a missing value, or it can be a legitimate value that simply contains no characters. To make things even more complicated, some databases (Oracle, for instance) automatically convert zero-length strings to NULL, while others treat the two as distinct.
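You can see the second behavior with a quick experiment. Here is a minimal sketch using Python’s built-in sqlite3 module (SQLite keeps zero-length strings and NULL apart); the table and column names are made up for illustration:

```python
import sqlite3

# In-memory SQLite database with a single text column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT)")
conn.executemany(
    "INSERT INTO products (name) VALUES (?)",
    [("widget",), ("",), (None,)],  # a real value, an empty string, a NULL
)

# SQLite treats '' and NULL as different things:
empty = conn.execute("SELECT COUNT(*) FROM products WHERE name = ''").fetchone()[0]
null = conn.execute("SELECT COUNT(*) FROM products WHERE name IS NULL").fetchone()[0]
print(empty, null)  # 1 1
```

Running the same two queries against your own database tells you immediately which camp it falls into.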
I have also encountered data where missing values in text columns were encoded as ‘-’, ‘NA’, ‘na’, ‘NULL’ or ‘null’, among others. It’s always a good idea to calculate counts for all values of a text column, look at the most frequent values and check for patterns like those mentioned above. I also recommend checking how your database treats zero-length strings and counting how often they occur in your dataset.
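Such a frequency check only takes a few lines. A minimal sketch, assuming the column is available as a plain Python list (with pandas you would reach for value_counts() instead); the column content is made up:

```python
from collections import Counter

# Hypothetical text column with several missing-value stand-ins mixed in.
city = ["Berlin", "NA", "Hamburg", "-", "Berlin", "null", "", "Munich", "NA"]

# Count every distinct value and inspect the most frequent ones.
counts = Counter(city)
for value, n in counts.most_common():
    print(repr(value), n)

# Flag values that match the suspicious patterns mentioned above.
suspects = {"-", "NA", "na", "NULL", "null", ""}
found = {v for v in counts if v in suspects}
print(found)  # {'NA', '-', 'null', ''}
```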
Don’t forget to run another check for strings that consist only of whitespace characters. They are not of length zero, but they are effectively empty and can also stand for missing values.
I recommend running checks for all of the above missing-value representations. If you’re building a data product and need clean data further down the line, replace them with actual NULL values to avoid confusion.
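The whitespace check and the replacement step can be combined into one small normalization pass. A sketch, assuming the set of stand-ins below is what your own frequency check actually turned up:

```python
# Hypothetical stand-ins observed in this dataset; adjust to your findings.
MISSING_TOKENS = {"-", "NA", "na", "NULL", "null"}

def normalize(value):
    """Map known stand-ins and whitespace-only strings to None (i.e. NULL)."""
    if value is None:
        return None
    if value.strip() == "" or value in MISSING_TOKENS:
        return None
    return value

raw = ["Berlin", "NA", "   ", "", "Hamburg", "-"]
clean = [normalize(v) for v in raw]
print(clean)  # ['Berlin', None, None, None, 'Hamburg', None]
```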
Unfortunately, numeric columns suffer from similar problems. Depending on the origin of your data, a value of zero can mean that the measured variable was actually zero, but it can also mean that zero was inserted somewhere during data collection whenever a value was missing. Another popular fallback for missing data is minus one, often used in ID columns where, for whatever reason, no ID is available. Other popular fallback values are 99, 999, -99, -999 and so on.
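A quick scan for these sentinels works the same way as the text-column check. A minimal sketch with a made-up numeric column:

```python
from collections import Counter

# Hypothetical numeric column; -1 and 999 here are fallbacks, not real data.
ages = [34, 41, -1, 27, 999, 41, -1, 53, 999, 999]

# Sentinel values worth checking in any numeric column.
SENTINELS = {0, -1, 99, 999, -99, -999}

counts = Counter(ages)
for s in sorted(SENTINELS):
    if counts[s]:
        print(f"sentinel {s} occurs {counts[s]} times")
```

Whether a hit is really a missing value still requires domain judgment; an age of 99 may be perfectly legitimate, while an age of -1 cannot be.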
Sometimes, people get creative and replace missing numerical values with the mean of the column. This is tricky because it is not obvious that any data was missing at all. When this is done to a continuous variable, plotting a kernel density estimate or calculating counts for each value in the column can help you spot the unusual spike in frequency at exactly the mean.
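The value-count variant of this check can be sketched in a few lines; the measurements below are made up so that 20.0 plays the role of the imputed mean:

```python
from collections import Counter

# Hypothetical continuous measurements where missing values were replaced
# by the (earlier) column mean of 20.0 at some processing step upstream.
values = [18.2, 21.7, 20.0, 19.4, 20.0, 22.1, 20.0, 20.0, 17.9, 20.0]

counts = Counter(values)
most_common_value, frequency = counts.most_common(1)[0]

# In a truly continuous column, exact repeats are rare; one value occurring
# many times, close to the current column mean, hints at mean imputation.
mean = sum(values) / len(values)
print(most_common_value, frequency, round(mean, 2))  # 20.0 5 19.93
```

Note that the spike sits suspiciously close to the column mean, which is exactly the pattern to look for.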
In columns with dates, look out for dates that are too far in the past to make sense in your data. For example, since Unix time counts the seconds since January 1st, 1970, that date is a commonly used fallback when a date is missing. Another value that should catch your attention is January 1st, 1900, the earliest date in Microsoft Excel’s default date system. When you encounter either of these values, you can almost always be sure that it is not the actual value of the field, but some fallback due to missing data.
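Flagging these two suspects is straightforward. A minimal sketch with a made-up date column:

```python
from datetime import date

# Fallback dates worth flagging: the Unix epoch and Excel's day one.
SUSPICIOUS_DATES = {date(1970, 1, 1), date(1900, 1, 1)}

# Hypothetical date column.
birthdays = [
    date(1985, 6, 12),
    date(1970, 1, 1),   # almost certainly a fallback, not a birthday
    date(1992, 3, 3),
    date(1900, 1, 1),   # same here
]

flagged = [d for d in birthdays if d in SUSPICIOUS_DATES]
print(flagged)
```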
Sometimes, data records are only valid for a certain time period. For example, a product may have been listed on your webshop from 01.03.2016 (START_DATE) to 31.08.2018 (END_DATE). Another product may have a START_DATE of 01.05.2017 and still be sold in your shop, so its END_DATE is empty, i.e. NULL. The problem: if you select all currently valid products with the condition WHERE CURRENT_DATE BETWEEN PRODUCT.START_DATE AND PRODUCT.END_DATE, the comparison against a NULL END_DATE evaluates to unknown rather than true, so exactly the still-valid products are filtered out. Sometimes, people solve this by setting END_DATE to some unrealistically high value like 31.12.2099 or 31.12.2999. Even though you may often see this in practice, it is not a good idea: encoding “never” as a magic year is essentially how the millennium bug, also known as the Y2K bug, came to be.
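The cleaner fix is to handle the NULL case explicitly in the query instead of inventing a far-future date. A sketch using Python’s built-in sqlite3, with hypothetical table and column names based on the example above (dates are stored as ISO-8601 strings so that string comparison matches date order):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT, start_date TEXT, end_date TEXT)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [
        ("discontinued", "2016-03-01", "2018-08-31"),
        ("still_listed", "2017-05-01", None),  # open-ended: END_DATE is NULL
    ],
)

# BETWEEN silently drops rows with a NULL end date; spell the NULL case out.
today = "2020-01-15"
rows = conn.execute(
    """SELECT name FROM product
       WHERE start_date <= ?
         AND (end_date IS NULL OR end_date >= ?)""",
    (today, today),
).fetchall()
print(rows)  # [('still_listed',)]
```

The same pattern (an explicit IS NULL branch, or COALESCE around the end date) works in any SQL dialect, and the data itself stays honest: an open-ended record really is stored as NULL.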