Gain insight into the process of cleaning data for a specific Kaggle competition, including a step by step overview.
By Brett Romero, Open Data Kosovo.
This article on cleaning data is Part III in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, it is recommended that you go back and read Part I and Part II.
In this part we will focus on cleaning the data provided for the Airbnb Kaggle competition.

When we talk about cleaning data, what exactly are we talking about? Generally when people talk about cleaning data, there are a few specific things they are referring to:
Missing data in general is one of the trickier issues that is dealt with when cleaning data. Broadly there are two solutions:
1. Deleting/Ignoring rows with missing values
The simplest solution available when faced with missing values is to not use the records with missing values when training your model. However, there are some issues to be aware of before you starting deleting masses of rows from your dataset.
The first is that this approach only makes sense if the number of rows with missing data is relatively small compared to the dataset. If you are finding that you will be deleting more than around 10% of your dataset due to rows having missing values, you may need to reconsider.
The second issue is that in order to delete the rows containing missing data, you have to be confident that the rows you are deleting do not contain information that is not contained in other rows. For example, in the current Airbnb dataset we have seen that many users have not provided their age. Can we assume that the people who chose not to provide their age are the same as the users who did? Or are they likely to represent a different type of user, perhaps an older and more privacy conscious user, and therefore a user that is likely to make different choices on which countries to visit? If the answer is the latter, we probably do not want to just delete the records.
2. Filling in the Values
The second broad option for dealing with missing data is to fill the missing values with a value. But what value to use? This depends on a range of factors, including the type of data you are trying to fill.
If the data is categorical (i.e. countries, device types, etc.), it may make sense to simply create a new category that will represent ‘unknown’. Another option may be to fill the values with the most common value for that column (the mode). However, because these are broad methods for filling the missing values, this may oversimplify your data and/or make your final model less accurate.
For numerical values (for example the age column) there are some other options. Given that in this case using the mode to fill values makes less sense, we could instead use the mean or median. We could even take an average based on some other criteria – for example filling the missing age values based on an average age for users that selected the same country_destination.
For both types of data (categorical and numerical), we can also use far more complicated methods to impute the missing values. Effectively, we can use a similar methodology that we are planning to use to predict the country_destination to predict the values in any of the other columns, based on the columns that do have data. And just like with modeling in general, there are an almost endless number of ways this can be done, which won’t be detailed here. For more information on this topic, the orange Python library provides some excellent documentation.
With that general overview out of the way, let’s start cleaning the Airbnb data. In relation to the datasets provided for the Airbnb Kaggle competition, we will focus our cleaning efforts on two files –train_users_2.csv and test_users.csv and leave aside sessions.csv.
Loading in the Data
The first step is to load the data from the CSV files using Python. To do this we will use the Pandas library and load the data from two filestrain_users_2.csv and test_users.csv. After loading, we will combine them into one dataset so that any cleaning (and later any other changes) will be done to all the data at once [1].
import pandas as pd# Import dataprint("Reading in data...")tr_filepath = "./train_users_2.csv"df_train = pd.read_csv(tr_filepath, header=0, index_col=None)te_filepath = "./test_users.csv"df_test = pd.read_csv(te_filepath, header=0, index_col=None)# Combine into one datasetdf_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)Clean the Timestamps
Once the data has been loaded and combined, the first cleaning step we will undertake is fixing the format of the dates – as we saw in Part II, at least one of the date columns looks like it is formatted as one long number. You may be wondering why this is necessary – after all, can’t we all see what the dates are supposed to represent when we look at the data?
The reason we need to convert the values in the date columns is that, if we want to do anything with those dates (e.g. subtract one date from another, extract the month of the year from each date etc.), it will be far easier if Python recognizes the values as dates. This will become much clearer next week when we start adding various new features to the training data based on this date information.
Luckily, fixing date formats is relatively easy. Pandas has a simple function, to_datetime, that will allow us to input a column and get the correctly formatted dates as a result. When using this function we also provide a parameter called ‘format’ that is like a regular expression for dates. In simpler terms, we are providing the function with a generalized form of the date so that it can interpret the data in the column. For example, for the date_account_created column we are telling the function to expect a four-digit year (%Y) followed by a ‘-’, then a two-digit month (%m), then ‘-’, then a two-digit day (%d) – altogether the expression would be ‘%Y-%m-%d’ (for the full list of directives that can be used, see here). For thetimestamp_first_active column, the date format provided is different so we adjust our expression accordingly.
Once we have fixed the date formats, we simply replace the existing date columns with the corrected data. Finally, because thedate_account_created column is sometimes empty, we replace the empty values with the value in the date_account_created column using the fillna function. The code for this step is provided below:
# Change Dates to consistent formatprint("Fixing timestamps...")df_all['date_account_created' = pd.to_datetime(df_all['date_account_created', format='%Y-%m-%d')df_all['timestamp_first_active' = pd.to_datetime(df_all['timestamp_first_active', format='%Y%m%d%H%M%S')df_all['date_account_created'.fillna(df_all.timestamp_first_active, inplace=True)Remove booking date field
Those following along and/or paying attention may have noticed that in the original dataset, there are three date fields, but we have only covered two above. The remaining date field, date_first_booking, we are going to drop (remove) from the training data altogether. The reason is that this field is only populated for users who have made a booking. For the data in training_users_2.csv, all the users that have a first booking country have a value in the date_first_booking column and for those that have not made a booking (country_destination = NDF) the value is missing. However, for the data in test_users.csv, the date_first_booking column is empty for all the records.

This means that this column is not going to be useful for predicting which country a booking will be made. What is more, if we leave it in the training dataset when building the model, it will likely increase the chances that the model predicts NDF as those are the records without dates in the training dataset. The code for removing the column is provided below:
# Remove date_first_booking columndf_all.drop('date_first_booking', axis=1, inplace=True)Clean the Age column
As identified in Part II, there are several age values that are clearly incorrect (unreasonably high or too low). In this step, we replace these incorrect values with ‘NaN’, which literally stands for Not a Number, but implies we do not know the age value. In other words we are changing the incorrect values into missing values. To do this, we create a simple function that intakes a dataframe (table), a column name, a maximum acceptable value (90) and a minimum acceptable value (15). This function will then replace the values in the specified column that are outside the acceptable range with NaN.
Again from Part II we know there were also a significant number of users who did not provide their age at all – so they also show up as NaN in the dataset. After we have converted the incorrect age values to NaN, we then change all the NaN values to -1.
The code for these steps is shown below:
import numpy as np# Remove outliers functiondef remove_outliers(df, column, min_val, max_val):col_values = df[column.valuesdf[column = np.where(np.logical_or(col_values<=min_val, col_values>=max_val), np.NaN, col_values)return df# Fixing age columnprint("Fixing age column...")df_all = remove_outliers(df=df_all, column='age', min_val=15, max_val=90)df_all['age'.fillna(-1, inplace=True)As mentioned earlier, there are several more complicated ways to fill in the missing values in the age column. We are selecting this simple method for two main reasons:
Identify and fill additional columns with missing values
From more detailed analysis of the data, you may have also realized there is one more column that has missing values – thefirst_affiliate_tracked column. In the same way we have been filling in the missing values in other columns, we now fill in the values in this column.
# Fill first_affiliate_tracked columnprint("Filling first_affiliate_tracked column...")df_all['first_affiliate_tracked'.fillna(-1, inplace=True)I
扫码加好友,拉您进群



收藏
