New cover for my book

Old cover
New cover

I’ve been dissatisfied with the cover of my book (link) for a while. The old cover was okay-ish, but the red text block with the title, subtitle and author was a generic template from the Amazon Kindle cover designer, and the whole cover was rather boring.

The new cover is designed to be a little bit “louder”: high contrast, big letters, and it still looks great and distinctive even when shrunk to thumbnail size. I’m happy with how it turned out.

I kept the misty forest in the background because I like it as a metaphor for a vast dataset that is impossible to take in at a single glance. Something so large that it’s easy to get lost in it and not see the wood for the trees.

The washing machine is obviously a play on the “clean data” that we’re all after. Running something through a washing machine takes time and a lot of “iteration cycles”, just like building a data product.

Invisible control characters

I can’t count how many hours of my life I have wasted searching for invisible control characters. These are characters that don’t show up in the text but do things like control formatting (e.g. the non-breaking space) or give instructions to the printer.
It’s one of mankind’s biggest mysteries how people manage to get these random characters into the data, but every time they do, a data scientist somewhere is driven to insanity.

The mischievous thing about these special characters is that they often make data import tools fail with error messages that don’t say what exactly went wrong.
On top of that, it’s hard to check specifically for invisible control characters. I once ran into errors when trying to read a large CSV file into R. After hours of trying to figure out what was wrong, I started reading the file as plain text, removing all the characters I knew, and checking what was left. After stripping all alphabetical and numerical characters with regular expression replacements, I was left with punctuation marks. Removing those was an iterative process, since I did not have a complete list of all possible punctuation marks. Finally, when there was almost nothing left in the file, I discovered a few of these control characters. It’s important to do this with a tool that has a visual representation of non-printable characters, such as Notepad++. Otherwise you will keep searching for the source of your troubles for a long time.
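A more direct way to hunt these characters down is a short R script. Here is a minimal sketch, assuming the troublesome file is called data.csv (a made-up name), that lists every line containing characters outside the printable range:

```r
# Read the file as plain text, ignoring end-of-file warnings.
raw_text <- readLines("data.csv", warn = FALSE)

# [:print:] covers letters, digits, punctuation and the normal space;
# everything else (apart from the tab, which we tolerate here) is a
# control or otherwise invisible character.
suspects <- regmatches(raw_text, gregexpr("[^[:print:]\t]", raw_text))

# Report the affected lines and the offending characters as Unicode code points.
for (i in which(lengths(suspects) > 0)) {
  codepoints <- utf8ToInt(paste(suspects[[i]], collapse = ""))
  cat("line", i, ":", sprintf("U+%04X", codepoints), "\n")
}
```

Depending on the file encoding and your locale, the pattern may need some tweaking, but it usually narrows the search down to a handful of lines.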

Export limits

Today’s blog post is about export limits in software programs.

Truncated exports are a common problem in many software programs that handle large amounts of data. To avoid long loading times that would ruin the usability of a program, software developers like to truncate large result sets. This is a reasonable solution that works well in most use cases, but it means they rarely anticipate a situation where the user wants to export the complete set of data. Many programs don’t even issue a warning to inform the user about the truncation; they simply cap the number of returned rows at a predefined number and claim to have finished the export successfully.

The same goes for many APIs that return data on request, such as REST interfaces. APIs are usually built to answer very specific queries and return a small number of results, so their default export limits are usually much smaller than those of server or desktop applications.

When you rely on an export function of a program that you are not familiar with, make sure to check if it has any restrictions regarding the number of observations it can export. If you’re requesting data from the users of a program that handles data, ask them to make sure that any such restrictions are turned off before they run the export.
Also, count the number of records in your data. When your data has exactly 10,000 or 1,000,000 records, it’s very likely that you have hit an export limit and need to go back and fix that before continuing with your analysis.
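As a quick sanity check, something like the following sketch can run right after the import; df is a hypothetical data frame holding the imported data:

```r
# Flag record counts that coincide with common export limits.
n <- nrow(df)
if (n %in% c(1e3, 1e4, 1e5, 1e6)) {
  warning("Exactly ", format(n, big.mark = ","),
          " records - this could be a truncated export.")
}
```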

Missing value representations

Today’s blog article is about missing value representations.

Identifying missing values is not always as easy as finding all NULL values in a database or all NA values in an R data frame. Sometimes, empty fields are encoded by specific values that stand in for the missing data point. For example, an empty string, that is, a string of length zero, can represent a missing value, or it can simply be a legitimate value that happens to contain no characters. To make things even more complicated, some databases automatically convert zero-length strings to NULL while others treat them as different objects.
I have also encountered data where missing values in text columns were encoded with ‘-’, ‘NA’, ‘na’, ‘NULL’ or ‘null’, among others. It’s always a good idea to calculate counts for all values of a text column, look at the most frequent values and check for patterns like the ones mentioned above. I also recommend checking how your database treats zero-length strings and doing a quick check on how often they occur in your dataset.
Don’t forget to do another check for strings that consist only of whitespace characters. They are not of length zero, but they are effectively empty and could also stand for missing values.
I recommend running checks for any of the above missing value representations. If you’re building a data product and you need clean data further down the line, replace them with actual NULL values to avoid confusion.
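As an illustration, here is a minimal sketch in R; the data frame df and its text column city are made up for the example:

```r
# Look at the most frequent values of the column.
head(sort(table(df$city), decreasing = TRUE), 20)

# Common stand-ins for missing text values, plus whitespace-only strings.
na_strings <- c("", "-", "NA", "na", "NULL", "null")
is_missing <- !is.na(df$city) &
  (df$city %in% na_strings | grepl("^[[:space:]]+$", df$city))
sum(is_missing)

# Replace the stand-ins with real NA values to avoid confusion later on.
df$city[is_missing] <- NA
```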

Unfortunately, numeric columns have similar problems. Depending on the origin of your data, a value of zero can mean that the measured variable was actually zero, but it can also mean that somewhere in the data collection process, zero was inserted whenever a value was missing. Another popular fallback for missing data is minus one, often used in ID columns where, for whatever reason, no ID is available. Other popular fallback values are 99, 999, -99, -999 and so on.
Sometimes, people get creative and replace missing numerical values with the mean of the column. This can be tricky because it’s not obvious that any data was missing. When this is done to a continuous variable, plotting a kernel density estimate or calculating counts for each value in the column can help you spot the unusual spike in frequency at that particular value.
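Both checks are quick to do in R. This sketch assumes a data frame df with a numeric column price (both names made up):

```r
# Values like 0, -1, 99 or 999 that occur suspiciously often stand out here.
head(sort(table(df$price), decreasing = TRUE), 10)

# A kernel density estimate makes a spike at the column mean easy to spot.
plot(density(df$price, na.rm = TRUE))
abline(v = mean(df$price, na.rm = TRUE), lty = 2)  # dashed line at the mean
```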

In columns with dates, look out for dates that are too far in the past to make sense in your data. For example, since Unix time counts the seconds since January 1st, 1970, this is a commonly used fallback value when a date is missing. Another value that should catch your attention is January 1st, 1900, which is the oldest possible value in Microsoft Excel. When you encounter this value, you can almost always be sure that this is not the actual value of the field, but some fallback due to missing data.
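A simple way to screen for these values is to count how often they occur; a sketch with a made-up data frame df and date column order_date:

```r
# Count occurrences of typical fallback dates (Unix epoch, Excel minimum).
fallback_dates <- as.Date(c("1970-01-01", "1900-01-01"))
sum(df$order_date %in% fallback_dates)
```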
Sometimes, data records are only valid for a certain time period. For example, a product may have been listed on your webshop from 01.03.2016 (START_DATE) to 31.08.2018 (END_DATE). Another product may have a START_DATE of 01.05.2017 and still be sold in your shop, so its END_DATE is empty, i.e. NULL. The problem with this is that you may want to select all currently valid products using the condition WHERE CURRENT_DATE BETWEEN PRODUCT.START_DATE AND PRODUCT.END_DATE, which doesn’t work when END_DATE is NULL. Sometimes, people solve this problem by setting END_DATE to some unrealistically high value like 31.12.2099 or 31.12.2999. Even though you will often see this in practice, it’s not a good idea. After all, this is how the millennium bug, also known as the Y2K bug, came to be.
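A cleaner approach is to treat a missing END_DATE explicitly as “still valid” when querying. Here is a sketch of that logic in R, using a made-up data frame products with Date columns START_DATE and END_DATE:

```r
# Select all currently valid products; a missing END_DATE means "still sold".
today <- Sys.Date()
currently_valid <- subset(
  products,
  START_DATE <= today & (is.na(END_DATE) | END_DATE >= today)
)
```

In SQL, the equivalent is to extend the WHERE clause so that rows with a NULL END_DATE are also accepted, rather than inventing a far-future sentinel date.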

Missing value patterns

Today’s blog post is about missing value patterns.

Knowing the number of missing values per column helps to get a feel for the completeness of the data and to quickly uncover potential problems. In order to get a better understanding, it also helps to look for missing value patterns, that is, groups of variables where, if one of the variables is missing, the others are usually missing as well.
This property is very common when the data has been generated by joining multiple tables together. For example, if you join a table containing the names of customers with a table containing their addresses, there might be some customers in the name table that do not have an entry in the address table. For these customers, all the columns with address data will be empty.

A good way to identify missing value patterns is to replace each column with an indicator column that is equal to 1 if the value is present and 0 if it is missing. Each observation basically becomes a vector of ones and zeros, and if two of these vectors are identical, the corresponding observations exhibit the same missing value pattern. If you then do a GROUP BY over all indicator columns, you get a distinct list of all missing value patterns along with the number of observations in each pattern. You can sort them by the number of observations in descending order to get an idea of the most frequent missing value patterns.
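Here is a sketch of that idea in base R, again with a made-up data frame df:

```r
# 1 = value present, 0 = value missing.
indicators <- as.data.frame(ifelse(is.na(df), 0L, 1L))

# Group by all indicator columns and count the observations per pattern.
patterns <- aggregate(
  list(n_observations = rep(1L, nrow(indicators))),
  by = indicators,
  FUN = sum
)

# Most frequent missing value patterns first.
patterns[order(-patterns$n_observations), ]
```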