bucketing row aggregations
- more aggregations are implemented here: feature_extraction.py#L111 (a usage sketch follows the snippet below).
```python
import numpy as np
from scipy.stats import skew, kurtosis

# `row` is a single row of the dataset (a pandas Series);
# `non_zero_values` holds only its non-zero entries.
aggregations = {'non_zero_mean': non_zero_values.mean(),
                'non_zero_std': non_zero_values.std(),
                'non_zero_max': non_zero_values.max(),
                'non_zero_min': non_zero_values.min(),
                'non_zero_sum': non_zero_values.sum(),
                'non_zero_skewness': skew(non_zero_values),
                'non_zero_kurtosis': kurtosis(non_zero_values),
                'non_zero_median': non_zero_values.median(),
                'non_zero_q1': np.percentile(non_zero_values, q=25),
                'non_zero_q3': np.percentile(non_zero_values, q=75),
                'non_zero_log_mean': np.log1p(non_zero_values).mean(),
                'non_zero_log_std': np.log1p(non_zero_values).std(),
                'non_zero_log_max': np.log1p(non_zero_values).max(),
                'non_zero_log_min': np.log1p(non_zero_values).min(),
                'non_zero_log_sum': np.log1p(non_zero_values).sum(),
                'non_zero_log_skewness': skew(np.log1p(non_zero_values)),
                'non_zero_log_kurtosis': kurtosis(np.log1p(non_zero_values)),
                'non_zero_log_median': np.log1p(non_zero_values).median(),
                'non_zero_log_q1': np.percentile(np.log1p(non_zero_values), q=25),
                'non_zero_log_q3': np.percentile(np.log1p(non_zero_values), q=75),
                'non_zero_count': non_zero_values.count(),
                'non_zero_fraction': non_zero_values.count() / row.count()
                }
```
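For context, a minimal usage sketch of how such per-row aggregations can be applied to a whole DataFrame. The function name `aggregate_row`, the zero-filtering step, and the toy data are illustrative assumptions, not taken verbatim from feature_extraction.py, and only a few of the aggregations are repeated:

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

def aggregate_row(row):
    """Illustrative sketch: compute a few non-zero aggregations for one row."""
    non_zero_values = row[row != 0]
    return pd.Series({'non_zero_mean': non_zero_values.mean(),
                      'non_zero_sum': non_zero_values.sum(),
                      'non_zero_skewness': skew(non_zero_values),
                      'non_zero_count': non_zero_values.count(),
                      'non_zero_fraction': non_zero_values.count() / row.count()})

# Toy, mostly-zero data standing in for the real feature columns
df = pd.DataFrame(np.random.rand(5, 20) * (np.random.rand(5, 20) > 0.7))
row_features = df.apply(aggregate_row, axis=1)
```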
- added per-bucket aggregations: the columns are divided into buckets and the aggregations are calculated separately for each bucket. This is parametrized with `row_aggregations__bucket_nrs: "[1, 2]"`. With this setting, for instance, the aggregations are calculated once over 1 bucket (the entire set of columns) and once for each of 2 buckets of columns; see the sketch below.
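A minimal sketch of the bucketing idea. The helper name `bucket_aggregations`, the use of `np.array_split`, the feature-name prefixes, and the toy data are assumptions for illustration, not the exact implementation:

```python
import numpy as np
import pandas as pd

def bucket_aggregations(row, bucket_nrs=(1, 2)):
    """For each requested number of buckets, split the row's columns into that
    many chunks and compute aggregations on the non-zero values of each chunk."""
    features = {}
    for n_buckets in bucket_nrs:
        for i, bucket in enumerate(np.array_split(row.values, n_buckets)):
            non_zero = bucket[bucket != 0]
            prefix = 'bucket{}of{}_'.format(i, n_buckets)
            features[prefix + 'non_zero_mean'] = non_zero.mean() if non_zero.size else 0.0
            features[prefix + 'non_zero_sum'] = non_zero.sum()
            features[prefix + 'non_zero_count'] = non_zero.size
    return pd.Series(features)

# Toy, mostly-zero data standing in for the real feature columns
df = pd.DataFrame(np.random.rand(5, 20) * (np.random.rand(5, 20) > 0.7))
bucket_features = df.apply(bucket_aggregations, axis=1)
```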
- Fewer leaves, no row sampling, and more column sampling improved both local CV and the public LB. My guess is that the lower model complexity resulted in less overfitting.
- model parameters (a training sketch follows the config):

```yaml
# LightGBM
lgbm_random_search_runs: 0
lgbm__device: cpu # gpu cpu
lgbm__boosting_type: gbdt
lgbm__objective: rmse
lgbm__metric: rmse
lgbm__number_boosting_rounds: 10000
lgbm__early_stopping_rounds: 1000
lgbm__learning_rate: 0.001
lgbm__num_leaves: 16
lgbm__max_depth: -1
lgbm__min_child_samples: 1
lgbm__max_bin: 300
lgbm__subsample: 1.0
lgbm__subsample_freq: 1
lgbm__colsample_bytree: 0.1
lgbm__min_child_weight: 10
lgbm__reg_lambda: 0.1
lgbm__reg_alpha: 0.0
lgbm__scale_pos_weight: 1
lgbm__zero_as_missing: False
```
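For reference, a rough sketch of how these settings map onto the `lightgbm` Python API, assuming a recent lightgbm version. The names above come from the Neptune config; `num_boost_round` and the `early_stopping` callback are the standard lightgbm counterparts of `number_boosting_rounds` and `early_stopping_rounds`, and the data here is synthetic:

```python
import numpy as np
import lightgbm as lgb

# Synthetic stand-in data, just to make the sketch runnable
X = np.random.rand(1000, 50)
y = np.random.rand(1000)
train_set = lgb.Dataset(X[:800], label=y[:800])
valid_set = lgb.Dataset(X[800:], label=y[800:], reference=train_set)

params = {'device': 'cpu',
          'boosting_type': 'gbdt',
          'objective': 'rmse',
          'metric': 'rmse',
          'learning_rate': 0.001,
          'num_leaves': 16,
          'max_depth': -1,
          'min_child_samples': 1,
          'max_bin': 300,
          'subsample': 1.0,
          'subsample_freq': 1,
          'colsample_bytree': 0.1,
          'min_child_weight': 10,
          'reg_lambda': 0.1,
          'reg_alpha': 0.0,
          'scale_pos_weight': 1,
          'zero_as_missing': False}

model = lgb.train(params,
                  train_set,
                  num_boost_round=10000,
                  valid_sets=[valid_set],
                  callbacks=[lgb.early_stopping(stopping_rounds=1000)])
```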
- LightGBM with the new aggregations + projections (second best): 1.333 CV, 1.38 LB 🏆
