Bangda Sun

Practice makes perfect

First Journey through Kaggle

Comments from a Kaggle novice: my experiences and thoughts after two simple “Get-Started” competitions.

I just spent the last five weeks (nights and weekends, to be precise) on two Kaggle competitions. This feels like a good point to stop and summarize what I did and learned so far. It was a really nice experience, especially when I heard the disk spinning – it felt like the data was coming.

1. Motivation

Kaggle is a nice place to interact with machine learning experts from different countries and backgrounds. You can learn a lot from the kernels and discussion boards. My approach is to analyze the data and build models on my own until I run out of ideas for improvement, and only then refer to the kernels, since this helps me think independently and critically.

I have some experience with mathematical modeling contests, so I’m comfortable with modeling, prediction, and analysis. Competitions on Kaggle are more data oriented – you win if you know the data better than others do. On top of that, I’ve taken courses in statistical computing, machine learning, and programming, along with some projects, so there was no need to start from the very beginning. I could just fetch the data and start working on it.

2. Play with Data

The general procedure of a Kaggle competition can be decomposed into these parts:

Exploratory Data Analysis (EDA);
Feature Engineering;
Modeling;
Model Evaluation and Improvement.

Then submit the result and check the leaderboard; if you’re not satisfied with the score, you can jump back to any step and do it again.
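
To make this loop concrete, here is a minimal sketch in Python. The file names (train.csv, test.csv) and column names (id, target) are hypothetical placeholders, and the features are assumed to be already numeric:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# hypothetical file and column names: adjust for the actual competition
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

y = train["target"]
X = train.drop(columns=["id", "target"])      # assumes the features are already numeric
X_test = test.drop(columns=["id"])

# a quick baseline model; EDA and feature engineering would come before this in practice
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# write a submission file in the usual Kaggle format
submission = pd.DataFrame({"id": test["id"], "target": model.predict(X_test)})
submission.to_csv("submission.csv", index=False)
```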

3. EDA

EDA usually consists of variable interpretation and descriptive statistics accompanied by data visualizations. Here are some general steps I can summarize:

  1. Check the data types: numerical variables and categorical variables. Sometimes categorical variables use numbers but are still discrete, and they can also be classified as ordered / non-ordered variables.
  2. Get familiar with the response variable (target variable): for regression problems, check the distribution; ideally it’s not skewed. For classification problems, check the proportions of the different classes (binary or multi-class); sometimes the classes can be imbalanced.
  3. Read the data description files carefully, and try to interpret the variables as much as possible (google them if the description files don’t tell you enough). I know it’s time consuming, but you will surely benefit from it.
  4. Check for missing values in the features (predictors). Some missing cases are random (we can impute with the mean/median/mode or random values), while some follow a certain pattern (we need to find the relationship with other variables and apply an algorithm to impute). Sometimes visualizing the observations and variables is helpful.
  5. Use a correlation matrix (visualized as a heatmap) and pair plots to get a general idea of the correlations, and use correlation coefficients with caution when there are categorical variables (see the code sketch after this list).
  6. Collinearity can exist when there are a lot of features; this can make truly important variables look ‘unimportant’ and hurt your model.
  7. Check the skewness of the variables; this is similar to point (2).
  8. Handle outliers: you can delete them or treat them carefully, depending on your problem background.
  9. For visualization, you can get some basic ideas from my previous blog. Tricks like faceting, positioning (in bar plots), and conditioning plots are commonly used when you try to visualize multiple variables simultaneously. ggplot2 and seaborn are your best friends here, since you can map different variables to aesthetics.
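
To illustrate a few of these steps (data types, target distribution and skewness, missing values, and the correlation heatmap), here is a short sketch with pandas and seaborn; the file and column names are hypothetical:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")          # hypothetical file name

# (1) data types: numeric vs. categorical columns
print(train.dtypes.value_counts())

# (2) target distribution for a regression problem: check skewness
print("target skewness:", train["target"].skew())
sns.histplot(train["target"], kde=True)
plt.show()

# (4) missing values per feature
missing = train.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])

# (5) correlation heatmap over numeric features only
num_cols = train.select_dtypes(include=np.number).columns
sns.heatmap(train[num_cols].corr(), cmap="coolwarm", center=0)
plt.show()
```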

Sometimes the problem cannot be categorized so simply; for instance, some competitions seek solutions for detecting patterns in images.

4. Feature Engineering

Feature Engineering is super important for classical machine learning. There are several goals to be achieved in feature engineering:

  1. Create new features: you can extract them from existing variables; you can combine several variables (based on their meaning, interaction effects, PCA, etc.); you can apply transformations (logarithm or Box-Cox on skewed variables; normalization and standardization for extreme values; higher-order terms for non-linear relationships).
  2. Feature selection: for collinearity, you probably want to drop some variables, since other highly correlated variables still carry the information; some variables have so little variation that they contribute little to your prediction. Algorithms like random forests and boosting can return variable importances based on error reduction.
  3. Encoding: this is widely used for categorical variables. You can convert non-ordered variables into dummy variables (one-hot encoding), and convert ordered variables into integers (a short sketch follows this list).
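
Here is a short sketch of these ideas (transformation, encoding, and importance-based selection) with pandas and scikit-learn; the file and column names (area, neighborhood, quality) are hypothetical placeholders, and the remaining columns are assumed to be numeric:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

train = pd.read_csv("train.csv")          # hypothetical file and column names

# transformation: log1p on a skewed numeric variable
train["area_log"] = np.log1p(train["area"])

# one-hot encoding for a non-ordered categorical variable
train = pd.get_dummies(train, columns=["neighborhood"], dtype=int)

# integer encoding for an ordered categorical variable
quality_order = {"poor": 1, "fair": 2, "good": 3, "excellent": 4}
train["quality_enc"] = train["quality"].map(quality_order)

# feature selection hint: variable importances from a random forest
X = train.drop(columns=["id", "target", "quality"])
y = train["target"]
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```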

5. Modeling and Evaluation

Models roughly fall into two categories. One is models like linear regression: easy to interpret, and we can learn a lot by experimenting with them. The other is models like random forests, boosting, and neural networks, which are more likely to be ‘black boxes’.

We should keep the assumptions of the models/algorithms in mind at all times, such as normality and i.i.d. data. These can always be an entry point for improvement: we might search for other methods with more general assumptions.

Techniques like cross-validation (for model evaluation and parameter tuning), regularization, and the bootstrap are widely used; they are our old friends from machine learning courses, so I won’t discuss them much here :-).
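
As a small reminder of what this looks like in practice, here is a sketch of cross-validation with a regularized linear model in scikit-learn; X and y are assumed to be the (numeric) feature matrix and target from the earlier steps:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# X, y: feature matrix and target built in the previous steps (assumed numeric)

# 5-fold cross-validation of a regularized linear model
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
print("CV RMSE:", -scores.mean())

# tuning the regularization strength
grid = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5,
                    scoring="neg_root_mean_squared_error")
grid.fit(X, y)
print("best alpha:", grid.best_params_["alpha"])
```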

Here is one technique I want to mention – model stacking. It makes good use of out-of-fold predictions and is almost always a way to improve on your current single model (on Kaggle, at least); you create new friends for your model so they can work together. I will discuss this idea in a separate blog later.
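
To give a flavor of the idea before that post, here is a minimal sketch of out-of-fold stacking with scikit-learn; X, y, X_test and the particular choice of base models are assumptions for illustration, not from any specific competition:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# X, y, X_test: pandas objects from the earlier steps (hypothetical)
base_models = [RandomForestRegressor(n_estimators=200, random_state=0),
               GradientBoostingRegressor(random_state=0)]

kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros((len(X), len(base_models)))          # out-of-fold predictions = level-1 features
test_preds = np.zeros((len(X_test), len(base_models)))

for j, model in enumerate(base_models):
    for train_idx, valid_idx in kf.split(X):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        oof[valid_idx, j] = model.predict(X.iloc[valid_idx])
    # refit on all training data to predict the test set
    test_preds[:, j] = model.fit(X, y).predict(X_test)

# the meta-model only ever sees out-of-fold predictions, so it does not leak the training fit
meta = Ridge().fit(oof, y)
final_prediction = meta.predict(test_preds)
```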

6. Summary

This is just a quick summary; hopefully I will post more details about my two competitions in later blogs.

In addition, I have some thoughts after reading some kernels:

  • Coming from a statistics/mathematics background, I was surprised to see some Kagglers add 3rd/4th order terms for all numeric variables…

… and the public leaderboard score is really good, hmm …

  • They include every variable they created in their models; some models have more than 200 variables, and I think there could be collinearity issues for a linear regression model.