
bucketing row aggregations


blowfish 🐡

Feature Extraction

import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis

# `row` is a single pandas Series of column values for one training example;
# the statistics below are computed over its non-zero entries only
# (the filter `row != 0` is the assumed definition of "non-zero").
non_zero_values = row[row != 0]

aggregations = {'non_zero_mean': non_zero_values.mean(),
                'non_zero_std': non_zero_values.std(),
                'non_zero_max': non_zero_values.max(),
                'non_zero_min': non_zero_values.min(),
                'non_zero_sum': non_zero_values.sum(),
                'non_zero_skewness': skew(non_zero_values),
                'non_zero_kurtosis': kurtosis(non_zero_values),
                'non_zero_median': non_zero_values.median(),
                'non_zero_q1': np.percentile(non_zero_values, q=25),
                'non_zero_q3': np.percentile(non_zero_values, q=75),
                'non_zero_log_mean': np.log1p(non_zero_values).mean(),
                'non_zero_log_std': np.log1p(non_zero_values).std(),
                'non_zero_log_max': np.log1p(non_zero_values).max(),
                'non_zero_log_min': np.log1p(non_zero_values).min(),
                'non_zero_log_sum': np.log1p(non_zero_values).sum(),
                'non_zero_log_skewness': skew(np.log1p(non_zero_values)),
                'non_zero_log_kurtosis': kurtosis(np.log1p(non_zero_values)),
                'non_zero_log_median': np.log1p(non_zero_values).median(),
                'non_zero_log_q1': np.percentile(np.log1p(non_zero_values), q=25),
                'non_zero_log_q3': np.percentile(np.log1p(non_zero_values), q=75),
                'non_zero_count': non_zero_values.count(),
                'non_zero_fraction': non_zero_values.count() / row.count()
                }
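These aggregations are computed once per row. A minimal sketch of how they could be wrapped into a function and applied row-wise is shown below; the function name extract_row_aggregations, the reduced set of statistics, and the variable train_df are illustrative assumptions, not the repository's actual helpers.

import numpy as np
import pandas as pd

def extract_row_aggregations(row):
    # collect the non-zero entries of one row and compute a few of the statistics above
    non_zero_values = row[row != 0]
    aggregations = {'non_zero_mean': non_zero_values.mean(),
                    'non_zero_count': non_zero_values.count(),
                    'non_zero_fraction': non_zero_values.count() / row.count()}
    # ... remaining statistics from the dict above go here ...
    return pd.Series(aggregations)

# one feature vector per training row; `train_df` is assumed to hold the raw columns
row_features = train_df.apply(extract_row_aggregations, axis=1)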
  • added per-bucket aggregations, where the columns are divided into buckets and the aggregations above are calculated separately for each bucket. This is parametrized with:
  row_aggregations__bucket_nrs: "[1, 2]"

For instance, with the setting above the aggregations are calculated for 1 bucket (all columns at once) and then again for each of 2 buckets.
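As a rough illustration of the idea (not the exact implementation from this repository), the sketch below splits a row's columns into contiguous buckets with np.array_split and computes a couple of non-zero statistics per bucket; the helper name bucket_aggregations and the choice of statistics are assumptions made for the example.

import numpy as np
import pandas as pd

def bucket_aggregations(row, bucket_nrs=(1, 2)):
    """Illustrative per-bucket statistics for a single row."""
    features = {}
    for n_buckets in bucket_nrs:
        # split the row's columns into `n_buckets` contiguous chunks
        for i, bucket in enumerate(np.array_split(row.values, n_buckets)):
            non_zero = bucket[bucket != 0]
            prefix = 'bucket{}of{}'.format(i, n_buckets)
            features['{}_non_zero_mean'.format(prefix)] = non_zero.mean() if non_zero.size else 0.0
            features['{}_non_zero_count'.format(prefix)] = non_zero.size
    return pd.Series(features)

# usage: one per-bucket feature row per training row
# bucket_features = train_df.apply(bucket_aggregations, axis=1)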

Model

  • Fewer leaves, no row sampling, and more aggressive column sampling improved both local CV and the public LB. My guess is that the lower model complexity resulted in less overfitting.
  • model parameters (a sketch of how they map onto the LightGBM API follows this list)
# Light GBM
  lgbm_random_search_runs: 0
  lgbm__device: cpu # gpu cpu
  lgbm__boosting_type: gbdt
  lgbm__objective: rmse
  lgbm__metric: rmse
  lgbm__number_boosting_rounds: 10000
  lgbm__early_stopping_rounds: 1000
  lgbm__learning_rate: 0.001
  lgbm__num_leaves: 16
  lgbm__max_depth: -1
  lgbm__min_child_samples: 1
  lgbm__max_bin: 300
  lgbm__subsample: 1.0
  lgbm__subsample_freq: 1
  lgbm__colsample_bytree: 0.1
  lgbm__min_child_weight: 10
  lgbm__reg_lambda: 0.1
  lgbm__reg_alpha: 0.0
  lgbm__scale_pos_weight: 1
  lgbm__zero_as_missing: False
  • LightGBM with the new aggregations + projections (second best): 1.333 CV, 1.38 LB 🏆
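As a hedged illustration only, the sketch below plugs the configuration values above into LightGBM's native Python API; the actual pipeline wires these parameters through its own config machinery, and X_train, y_train, X_valid, y_valid are assumed to be prepared elsewhere.

import lightgbm as lgb

params = {
    'boosting_type': 'gbdt',
    'objective': 'rmse',            # alias for L2/RMSE regression
    'metric': 'rmse',
    'learning_rate': 0.001,
    'num_leaves': 16,
    'max_depth': -1,
    'min_child_samples': 1,
    'max_bin': 300,
    'subsample': 1.0,               # no row sampling
    'subsample_freq': 1,
    'colsample_bytree': 0.1,        # aggressive column sampling
    'min_child_weight': 10,
    'reg_lambda': 0.1,
    'reg_alpha': 0.0,
    'zero_as_missing': False,
}

train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

booster = lgb.train(
    params,
    train_set,
    num_boost_round=10000,
    valid_sets=[valid_set],
    callbacks=[lgb.early_stopping(stopping_rounds=1000)],
)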

Pipeline diagram

solution-6 diagram