2016-07-03

Gain insight into the process of cleaning data for a specific Kaggle competition, including a step by step overview.

By Brett Romero, Open Data Kosovo.

This article on cleaning data is Part III in a series looking at data science and machine learning by walking through a Kaggle competition. If you have not done so already, it is recommended that you go back and read Part I and Part II.

In this part we will focus on cleaning the data provided for the Airbnb Kaggle competition.

Cleaning Data


When we talk about cleaning data, what exactly are we talking about? Generally, people are referring to a few specific tasks:

  • Fixing up formats – Often when data is saved or translated from one format to another (for example, in our case, from CSV into Python), some data may not be translated correctly. We saw a good example of this in the last article: the timestamp_first_active column contained numbers like 20090609231247 instead of timestamps in the expected format, 2009-06-09 23:12:47. A typical job when it comes to cleaning data is correcting these types of issues.
  • Filling in missing values – As we also saw in Part II, it is quite common for some values to be missing from datasets. This typically means that a piece of information was simply not collected. There are several options for handling missing data that will be covered below.
  • Correcting erroneous values – For some columns, there are values that can be identified as obviously incorrect. This may be a ‘gender’ column where someone has entered a number, or an ‘age’ column where someone has entered a value well over 100. These values either need to be corrected (if the correct value can be determined) or assumed to be missing.
  • Standardizing categories – More of a subcategory of ‘correcting erroneous values’, this type of data cleansing is so common it is worth special mention. In many (all?) cases where data is collected from users directly – particularly using free text fields – spelling mistakes, language differences or other factors will result in a given answer being provided in multiple ways. For example, when collecting data on country of birth, if users are not provided with a standardized list of countries, the data will inevitably contain multiple spellings of the same country (e.g. USA, United States, U.S. and so on). One of the main cleaning tasks often involves standardizing these values to ensure that there is only one version of each value.
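
To make this concrete, here is a minimal pandas sketch of that kind of standardization (the column name, values and spelling map are illustrative, not taken from the Airbnb data):

import pandas as pd

# Hypothetical free-text country-of-birth values
df = pd.DataFrame({'country_of_birth': ['USA', 'United States', 'U.S.', 'usa', 'Germany']})

# Map known variants (compared case-insensitively) to one canonical spelling
spelling_map = {'usa': 'United States', 'u.s.': 'United States', 'united states': 'United States'}
normalized = df['country_of_birth'].str.strip().str.lower()
df['country_of_birth'] = normalized.map(spelling_map).fillna(df['country_of_birth'])
print(df)  # all four US variants now read 'United States'
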
Options for Dealing With Missing Data


Missing data is one of the trickier issues to deal with when cleaning data. Broadly, there are two solutions:

1. Deleting/Ignoring rows with missing values

The simplest solution available when faced with missing values is to not use the records with missing values when training your model. However, there are some issues to be aware of before you start deleting masses of rows from your dataset.

The first is that this approach only makes sense if the number of rows with missing data is relatively small compared to the dataset. If you are finding that you will be deleting more than around 10% of your dataset due to rows having missing values, you may need to reconsider.

The second issue is that in order to delete the rows containing missing data, you have to be confident that the rows you are deleting do not contain information that is not contained in the other rows. For example, in the current Airbnb dataset we have seen that many users have not provided their age. Can we assume that the people who chose not to provide their age are otherwise the same as the users who did? Or are they likely to represent a different type of user – perhaps an older, more privacy-conscious user, and therefore one likely to make different choices about which countries to visit? If the answer is the latter, we probably do not want to just delete the records.
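
If, after thinking through those questions, deleting rows is still the right call, pandas makes it a one-liner (a generic sketch using the combined df_all dataframe we construct later in the step-by-step section, not part of the pipeline we actually build):

# Drop rows that have a missing value in any column
df_clean = df_all.dropna()

# Or drop only rows that are missing a specific column, e.g. age
df_clean = df_all.dropna(subset=['age'])
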

2. Filling in the Values

The second broad option for dealing with missing data is to fill the missing values in. But what value should we use? This depends on a range of factors, including the type of data you are trying to fill.

If the data is categorical (i.e. countries, device types, etc.), it may make sense to simply create a new category that will represent ‘unknown’. Another option may be to fill the values with the most common value for that column (the mode). However, because these are broad methods for filling the missing values, this may oversimplify your data and/or make your final model less accurate.
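
In pandas, both options are one-liners. Here is a minimal sketch with a made-up device-type series (not a column from the Airbnb files):

import pandas as pd

s = pd.Series(['Mac', 'Windows', None, 'Mac'])  # hypothetical categorical column
s_unknown = s.fillna('unknown')   # option 1: a new 'unknown' category
s_mode = s.fillna(s.mode()[0])    # option 2: fill with the most common value ('Mac')
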

For numerical values (for example the age column) there are some other options. Given that in this case using the mode to fill values makes less sense, we could instead use the mean or median. We could even take an average based on some other criteria – for example filling the missing age values based on an average age for users that selected the same country_destination.
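
As a sketch of those two options on the actual age column (far simpler than the approach taken below; note that the rows from test_users.csv have no country_destination, so the group-based option would only fill the training rows):

# Option 1: fill missing ages with the overall median
age_filled = df_all['age'].fillna(df_all['age'].median())

# Option 2: fill with the mean age of users who chose the same country_destination
group_means = df_all.groupby('country_destination')['age'].transform('mean')
age_filled = df_all['age'].fillna(group_means)
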

For both types of data (categorical and numerical), we can also use far more complicated methods to impute the missing values. Effectively, we can use a similar methodology to the one we are planning to use to predict the country_destination to predict the values in any of the other columns, based on the columns that do have data. And just like with modeling in general, there are an almost endless number of ways this can be done, which won’t be detailed here. For more information on this topic, the Orange Python library provides some excellent documentation.
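
As one illustration of what model-based imputation can look like, here is a hedged sketch using scikit-learn rather than Orange (the feature list is an assumption – any fully populated numeric columns would do – and df_all is the combined dataframe built in the step-by-step section below):

from sklearn.ensemble import RandomForestRegressor

# Assumption: signup_flow is numeric and always populated
features = ['signup_flow']

known = df_all[df_all['age'].notnull()]
missing = df_all[df_all['age'].isnull()]

# Train on the rows where age is known, then predict it for the rows where it is not
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(known[features], known['age'])
df_all.loc[df_all['age'].isnull(), 'age'] = model.predict(missing[features])
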




All replies
2016-7-3 07:54:15
Step by Step


With that general overview out of the way, let’s start cleaning the Airbnb data. Of the datasets provided for the Airbnb Kaggle competition, we will focus our cleaning efforts on two files – train_users_2.csv and test_users.csv – and leave aside sessions.csv.

Loading in the Data

The first step is to load the data from the CSV files using Python. To do this we will use the Pandas library and load the data from the two files, train_users_2.csv and test_users.csv. After loading, we will combine them into one dataset so that any cleaning (and later any other changes) will be done to all the data at once [1].

import pandas as pd

# Import data
print("Reading in data...")
tr_filepath = "./train_users_2.csv"
df_train = pd.read_csv(tr_filepath, header=0, index_col=None)
te_filepath = "./test_users.csv"
df_test = pd.read_csv(te_filepath, header=0, index_col=None)

# Combine into one dataset
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)

Clean the Timestamps

Once the data has been loaded and combined, the first cleaning step we will undertake is fixing the format of the dates – as we saw in Part II, at least one of the date columns looks like it is formatted as one long number. You may be wondering why this is necessary – after all, can’t we all see what the dates are supposed to represent when we look at the data?

The reason we need to convert the values in the date columns is that, if we want to do anything with those dates (e.g. subtract one date from another, extract the month of the year from each date etc.), it will be far easier if Python recognizes the values as dates. This will become much clearer next week when we start adding various new features to the training data based on this date information.

Luckily, fixing date formats is relatively easy. Pandas has a simple function, to_datetime, that will allow us to input a column and get the correctly formatted dates as a result. When using this function we also provide a parameter called ‘format’ that is like a regular expression for dates. In simpler terms, we are providing the function with a generalized form of the date so that it can interpret the data in the column. For example, for the date_account_created column we are telling the function to expect a four-digit year (%Y) followed by a ‘-’, then a two-digit month (%m), then ‘-’, then a two-digit day (%d) – altogether the expression would be ‘%Y-%m-%d’ (for the full list of directives that can be used, see here). For the timestamp_first_active column, the date format provided is different, so we adjust our expression accordingly.

Once we have fixed the date formats, we simply replace the existing date columns with the corrected data. Finally, because the date_account_created column is sometimes empty, we fill the empty values with the value from the timestamp_first_active column using the fillna function. The code for this step is provided below:

# Change Dates to consistent format
print("Fixing timestamps...")
df_all['date_account_created'] = pd.to_datetime(df_all['date_account_created'], format='%Y-%m-%d')
df_all['timestamp_first_active'] = pd.to_datetime(df_all['timestamp_first_active'], format='%Y%m%d%H%M%S')
df_all['date_account_created'].fillna(df_all.timestamp_first_active, inplace=True)
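
As a quick illustration of why this conversion matters (not part of the article’s code), the parsed columns now support component extraction and date arithmetic directly:

# Extract the month each account was created
account_month = df_all['date_account_created'].dt.month

# Subtract one date from another, e.g. the gap between first activity and account creation
lag = df_all['date_account_created'] - df_all['timestamp_first_active']
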

Remove booking date field

Those following along and/or paying attention may have noticed that the original dataset contains three date fields, but we have only covered two above. The remaining date field, date_first_booking, we are going to drop (remove) from the training data altogether. The reason is that this field is only populated for users who have made a booking. For the data in train_users_2.csv, all the users that have a first booking country have a value in the date_first_booking column, and for those that have not made a booking (country_destination = NDF) the value is missing. However, for the data in test_users.csv, the date_first_booking column is empty for all the records.

This means that this column is not going to be useful for predicting in which country a booking will be made. What is more, if we leave it in the training dataset when building the model, it will likely increase the chances that the model predicts NDF, since the records without dates in the training dataset are exactly the NDF records. The code for removing the column is provided below:

# Remove date_first_booking column
df_all.drop('date_first_booking', axis=1, inplace=True)

Clean the Age column

As identified in Part II, there are several age values that are clearly incorrect (unreasonably high or too low). In this step, we replace these incorrect values with ‘NaN’, which literally stands for Not a Number, but here simply means the age value is unknown. In other words, we are changing the incorrect values into missing values. To do this, we create a simple function that takes a dataframe (table), a column name, a minimum acceptable value (15) and a maximum acceptable value (90). This function then replaces the values in the specified column that fall outside the acceptable range with NaN.

Again from Part II we know there were also a significant number of users who did not provide their age at all – so they also show up as NaN in the dataset. After we have converted the incorrect age values to NaN, we then change all the NaN values to -1.

The code for these steps is shown below:

import numpy as np

# Remove outliers function
def remove_outliers(df, column, min_val, max_val):
    col_values = df[column].values
    df[column] = np.where(np.logical_or(col_values <= min_val, col_values >= max_val), np.NaN, col_values)
    return df

# Fixing age column
print("Fixing age column...")
df_all = remove_outliers(df=df_all, column='age', min_val=15, max_val=90)
df_all['age'].fillna(-1, inplace=True)

As mentioned earlier, there are several more complicated ways to fill in the missing values in the age column. We are selecting this simple method for two main reasons:

  • Clarity – this series of articles is going to be long enough without adding the complication of a complex methodology for imputing missing ages.
  • Questionable results – in my testing during the actual competition, I did test several more complex imputation methodologies. However, none of the methods I tested actually produced a better end result than the methodology outlined above.

Identify and fill additional columns with missing values

From more detailed analysis of the data, you may have also realized there is one more column that has missing values – the first_affiliate_tracked column. In the same way we have been filling in the missing values in other columns, we now fill in the values in this column.

# Fill first_affiliate_tracked column
print("Filling first_affiliate_tracked column...")
df_all['first_affiliate_tracked'].fillna(-1, inplace=True)




2016-7-3 07:55:17
There's no attachment

2016-7-3 07:56:02
zgq.8026 posted on 2016-7-3 07:55
There's no attachment
It's not a book, it's an article.

2016-7-3 07:56:59
zgq.8026 posted on 2016-7-3 07:55
There's no attachment
I'm so used to downloading pirated books here that I've started treating this place as a pirated-book depot, haha.

2016-7-28 14:54:12
This can save quite a bit of money.
