A practical comparison of XGBoost and LightGBM.
1. Introduction
My last two posts were paper readings for XGBoost and LightGBM, the official descriptions of these two GBM frameworks. However, many practical details are not mentioned there or are not described very clearly, and the two frameworks also differ in several features. So this time I'll give a summary comparison, starting from their mechanisms and moving on to practical usage.
2. Common Tricks
2.1 Tree Growing
Both XGBoost and LightGBM support Best-first Tree Growth, a.k.a. Leaf-wise Tree Growth, while many other GBM implementations use Depth-first Tree Growth, a.k.a. Depth-wise Tree Growth. To borrow the description from the LightGBM docs:
The leaf-wise method chooses the leaf with the maximum loss reduction to grow, rather than finishing the leaf growth level by level. With the number of leaves fixed, the leaf-wise method tends to achieve lower loss than the depth-wise method and converges much faster, but it is also more likely to overfit.
Here is the depth-wise tree growth.
Unlike LightGBM, XGBoost also supports depth-wise growth. The parameter is `grow_policy`, which defaults to `depthwise`; to use leaf-wise growth, switch it to `lossguide`.
For leaf-wise tree growth, the key parameters are:
- number of leaves:
  - XGBoost: `max_leaves` (need to set `grow_policy=lossguide`, otherwise it stays at 0)
  - LightGBM: `num_leaves`
- max depth:
  - XGBoost: `max_depth` (can be set to 0, meaning no limit, when `grow_policy=lossguide` and `tree_method=hist`)
  - LightGBM: `max_depth` (set to -1 means no limit)
- min data required in a leaf to split:
  - XGBoost: `min_child_weight` (strictly the minimum sum of instance hessian in a child, which equals the instance count for squared loss)
  - LightGBM: `min_data_in_leaf`
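To make the mapping concrete, here is a minimal sketch of roughly equivalent leaf-wise configurations in both libraries (the parameter values and the synthetic data are illustrative only, not recommendations):

```python
import numpy as np
import lightgbm as lgb
import xgboost as xgb

X, y = np.random.rand(500, 10), np.random.rand(500)

# LightGBM grows leaf-wise by default; control complexity via num_leaves.
lgb_params = {"objective": "regression", "num_leaves": 31,
              "max_depth": -1, "min_data_in_leaf": 20}
lgb_booster = lgb.train(lgb_params, lgb.Dataset(X, label=y), num_boost_round=50)

# XGBoost needs grow_policy="lossguide" (with tree_method="hist") to grow leaf-wise.
xgb_params = {"objective": "reg:squarederror", "tree_method": "hist",
              "grow_policy": "lossguide", "max_leaves": 31,
              "max_depth": 0,  # 0 = no depth limit under lossguide + hist
              "min_child_weight": 1}
xgb_booster = xgb.train(xgb_params, xgb.DMatrix(X, label=y), num_boost_round=50)
```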
2.2 Histogram-based Split Finding
Both XGBoost and LightGBM support a histogram-based algorithm for split finding. As mentioned in the XGBoost paper, the exact greedy (brute-force) split finding algorithm is time consuming: for the current feature being searched, it needs to sort the feature values and iterate through all of them. For faster training, the histogram-based algorithm buckets continuous feature values into discrete bins, which speeds up training and reduces memory usage.
LightGBM always uses the histogram-based algorithm. Related parameters are:
- `max_bin`: max number of bins that feature values will be bucketed into.
- `min_data_in_bin`: minimal number of data points inside one bin.
- `bin_construct_sample_cnt`: number of data points sampled to construct the histogram bins.
XGBoost offers the histogram-based algorithm as one option, specified by `tree_method` with choices:
- `auto`: (default) use a heuristic to choose the fastest method.
- `exact`: exact greedy algorithm.
- `approx`: approximate greedy algorithm using quantile sketch and gradient histogram.
- `hist`: fast histogram-optimized approximate greedy algorithm; with this option enabled, `max_bin` (default 256) can be tuned.
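As a quick sketch, the histogram-related knobs on each side look like this (the values shown are simply the library defaults, not tuning advice):

```python
# LightGBM: histogram binning is always on; these parameters control how bins are built.
lgb_hist_params = {"objective": "regression", "max_bin": 255,
                   "min_data_in_bin": 3, "bin_construct_sample_cnt": 200000}

# XGBoost: opt into the histogram algorithm, then tune the bin count.
xgb_hist_params = {"objective": "reg:squarederror",
                   "tree_method": "hist", "max_bin": 256}
```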
2.3 Missing Values Handling
Both XGBoost and LightGBM can handle missing values in the input data.
XGBoost supports missing values by default. As mentioned in the paper, instances with missing values are held aside at first, and the optimal default directions are then learned during training to get the best performance. In addition, XGBoost accepts sparse feature formats where only non-zero values are stored; in that case, the non-present entries are treated as missing.
LightGBM has several parameters for missing value handling:
- `use_missing`: defaults to `true`.
- `zero_as_missing`: defaults to `false`, which means only `np.nan` is considered missing.
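A minimal sketch of feeding data with NaNs to both libraries (synthetic data, default settings unless noted):

```python
import numpy as np
import lightgbm as lgb
import xgboost as xgb

X = np.random.rand(200, 5)
X[::10, 2] = np.nan                      # inject some missing values
y = np.random.rand(200)

# XGBoost: NaN is treated as missing by default; each split learns a
# default direction for missing values.
xgb.train({"objective": "reg:squarederror"}, xgb.DMatrix(X, label=y),
          num_boost_round=20)

# LightGBM: use_missing=true by default; set zero_as_missing=true to also
# treat zeros (e.g. from sparse input) as missing.
lgb.train({"objective": "regression", "use_missing": True,
           "zero_as_missing": False}, lgb.Dataset(X, label=y),
          num_boost_round=20)
```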
3. Different Tricks
3.1 Categorical Feature Handling
XGBoost currently only supports numerical inputs, which means that if categorical features are encoded as integers, they are treated as ordered numerical features, and this can introduce bias. Therefore one-hot encoding needs to be applied before feeding them into XGBoost.
LightGBM supports categorical inputs via `categorical_feature`. Notice:
- it only supports integer types, e.g. outputs from `sklearn.preprocessing.LabelEncoder`.
- the index starts from 0 and does not count the label column when the passed type is integer.
- all values should be less than `Int32.MaxValue` (2147483647).
- all negative values will be treated as missing values.
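A small sketch of passing a categorical column to LightGBM (the column names and data here are made up for illustration):

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "city": ["tokyo", "paris", "paris", "london", "tokyo"] * 40,
    "amount": np.random.rand(200),
})
y = np.random.rand(200)

# Encode the categorical column as non-negative integers first.
df["city"] = LabelEncoder().fit_transform(df["city"])

train_set = lgb.Dataset(df, label=y, categorical_feature=["city"])
booster = lgb.train({"objective": "regression"}, train_set, num_boost_round=20)
```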
In LightGBM, how the best split is found for categorical features is briefly described: the optimal solution is to split on a categorical feature by partitioning its categories into 2 subsets. If the feature has `k` categories, there are `2^(k-1) - 1` possible partitions, so the basic idea is to sort the categories according to the training objective at each split; more specifically, LightGBM sorts the histogram bins according to their accumulated values (`sum_gradient / sum_hessian`) and then finds the best split on the sorted histogram.
There are also some parameters to regularize categorical feature splits, such as `cat_smooth`, `cat_l2`, `max_cat_threshold`, and `min_data_per_group`.
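To illustrate the sorting idea, here is a simplified pure-Python sketch of the many-vs-many categorical split search. This is not LightGBM's actual implementation; its categorical regularization terms are omitted, and `lam` is just an L2 regularizer in the standard split-gain formula:

```python
def best_categorical_split(stats, lam=1.0):
    """stats maps each category to its accumulated (sum_gradient, sum_hessian)."""
    # Sort categories by accumulated gradient/hessian ratio.
    order = sorted(stats, key=lambda c: stats[c][0] / (stats[c][1] + 1e-12))
    G = sum(g for g, _ in stats.values())
    H = sum(h for _, h in stats.values())
    best_gain, best_subset = 0.0, None
    gl = hl = 0.0
    # Scan the sorted categories as if they were an ordered numerical feature.
    for i, cat in enumerate(order[:-1]):
        gl, hl = gl + stats[cat][0], hl + stats[cat][1]
        gr, hr = G - gl, H - hl
        gain = gl**2 / (hl + lam) + gr**2 / (hr + lam) - G**2 / (H + lam)
        if gain > best_gain:
            best_gain, best_subset = gain, set(order[:i + 1])
    return best_subset, best_gain

# Example: three categories with made-up gradient/hessian sums.
print(best_categorical_split({"a": (-4.0, 2.0), "b": (1.5, 2.0), "c": (3.0, 2.0)}))
```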
3.2 Boosters
As mentioned in the LightGBM paper, a novel technique called Gradient-based One-Side Sampling (GOSS) is used. It can be enabled by `boosting=goss`, with `top_rate` (between 0 and 1, the retain ratio of large-gradient data) and `other_rate` (between 0 and 1, the retain ratio of small-gradient data) specified.
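A minimal GOSS configuration sketch (the rates shown are LightGBM's defaults; the data is synthetic):

```python
import numpy as np
import lightgbm as lgb

X, y = np.random.rand(500, 10), np.random.rand(500)

# Keep the top 20% largest-gradient samples, randomly keep 10% of the rest.
goss_params = {"objective": "regression", "boosting": "goss",
               "top_rate": 0.2, "other_rate": 0.1}
booster = lgb.train(goss_params, lgb.Dataset(X, label=y), num_boost_round=50)
```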
In XGBoost, there are also multiple options for the booster (`booster`): `gbtree`, `gblinear`, and `dart`, with the default being `gbtree`.
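For example, switching to the DART booster is just a parameter change (`rate_drop` and `skip_drop` are DART-specific parameters; the values here are illustrative only):

```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(500, 10), np.random.rand(500)

dart_params = {"booster": "dart", "objective": "reg:squarederror",
               "rate_drop": 0.1,   # fraction of trees dropped each round
               "skip_drop": 0.5}   # probability of skipping the dropout
booster = xgb.train(dart_params, xgb.DMatrix(X, label=y), num_boost_round=50)
```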
4. Other Things to Notice
4.1 Feature Importance
Feature importance is a good way to validate and explain the results.
LightGBM returns feature importance by calling `Booster.feature_importance()`; the choice of `importance_type` selects between different measures of feature importance:
- `"split"`: the number of times the feature is used to split data across all trees
- `"gain"`: the total gain of the feature when it is used across all trees
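A quick usage sketch on synthetic data:

```python
import numpy as np
import lightgbm as lgb

X, y = np.random.rand(200, 5), np.random.rand(200)
booster = lgb.train({"objective": "regression"}, lgb.Dataset(X, label=y),
                    num_boost_round=20)

print(booster.feature_importance(importance_type="split"))  # split counts per feature
print(booster.feature_importance(importance_type="gain"))   # total gain per feature
```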
XGBoost returns feature importance by calling `Booster.get_score()`; the choice of `importance_type` includes:
- `"weight"`: the number of times the feature is used to split data across all trees
- `"gain"`: the average gain of the feature when it is used across all trees
- `"cover"`: the average coverage across all splits the feature is used in (the relative number of observations related to this feature)
- `"total_gain"`: the total gain of the feature across all splits
- `"total_cover"`: the total coverage across all splits the feature is used in
5. Summary
There are still other topics worth discussing for XGBoost and LightGBM. For example, they both support DART boosting, which can give better performance but makes parameter tuning trickier; and they can be used not only for regression and classification tasks, but also for learning-to-rank tasks. Anyway, I'll update this post if there are other interesting comparisons to be made.
Here I also attach two examples of parameter usage for LightGBM and XGBoost.
- LightGBM
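An illustrative LightGBM parameter setup on synthetic binary-classification data (placeholder values, not tuned for any particular dataset):

```python
import numpy as np
import lightgbm as lgb

X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

lgb_params = {
    "objective": "binary",
    "boosting": "gbdt",          # or "goss" / "dart"
    "num_leaves": 63,
    "max_depth": -1,
    "learning_rate": 0.05,
    "min_data_in_leaf": 20,
    "max_bin": 255,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,
    "metric": "auc",
}
booster = lgb.train(lgb_params, lgb.Dataset(X, label=y), num_boost_round=100)
```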
- XGBoost
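And an illustrative XGBoost counterpart (again, placeholder values only):

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

xgb_params = {
    "objective": "binary:logistic",
    "booster": "gbtree",         # or "dart" / "gblinear"
    "tree_method": "hist",
    "grow_policy": "lossguide",
    "max_leaves": 63,
    "max_depth": 0,
    "learning_rate": 0.05,
    "min_child_weight": 1,
    "max_bin": 256,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "eval_metric": "auc",
}
booster = xgb.train(xgb_params, xgb.DMatrix(X, label=y), num_boost_round=100)
```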