
bucketing row aggregations


blowfish 🐡

Feature Extraction

import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis

# `row` is a single pandas Series of column values for one training example;
# the statistics below are computed over its non-zero entries only
# (the filter `row != 0` is the assumed definition of "non-zero").
non_zero_values = row[row != 0]

aggregations = {'non_zero_mean': non_zero_values.mean(),
                'non_zero_std': non_zero_values.std(),
                'non_zero_max': non_zero_values.max(),
                'non_zero_min': non_zero_values.min(),
                'non_zero_sum': non_zero_values.sum(),
                'non_zero_skewness': skew(non_zero_values),
                'non_zero_kurtosis': kurtosis(non_zero_values),
                'non_zero_median': non_zero_values.median(),
                'non_zero_q1': np.percentile(non_zero_values, q=25),
                'non_zero_q3': np.percentile(non_zero_values, q=75),
                'non_zero_log_mean': np.log1p(non_zero_values).mean(),
                'non_zero_log_std': np.log1p(non_zero_values).std(),
                'non_zero_log_max': np.log1p(non_zero_values).max(),
                'non_zero_log_min': np.log1p(non_zero_values).min(),
                'non_zero_log_sum': np.log1p(non_zero_values).sum(),
                'non_zero_log_skewness': skew(np.log1p(non_zero_values)),
                'non_zero_log_kurtosis': kurtosis(np.log1p(non_zero_values)),
                'non_zero_log_median': np.log1p(non_zero_values).median(),
                'non_zero_log_q1': np.percentile(np.log1p(non_zero_values), q=25),
                'non_zero_log_q3': np.percentile(np.log1p(non_zero_values), q=75),
                'non_zero_count': non_zero_values.count(),
                'non_zero_fraction': non_zero_values.count() / row.count()
                }
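These aggregations are computed once per row. A minimal sketch of how they could be wrapped into a function and applied row-wise is shown below; the function name extract_row_aggregations, the reduced set of statistics, and the variable train_df are illustrative assumptions, not the repository's actual helpers.

import numpy as np
import pandas as pd

def extract_row_aggregations(row):
    # collect the non-zero entries of one row and compute a few of the statistics above
    non_zero_values = row[row != 0]
    aggregations = {'non_zero_mean': non_zero_values.mean(),
                    'non_zero_count': non_zero_values.count(),
                    'non_zero_fraction': non_zero_values.count() / row.count()}
    # ... remaining statistics from the dict above go here ...
    return pd.Series(aggregations)

# one feature vector per training row; `train_df` is assumed to hold the raw columns
row_features = train_df.apply(extract_row_aggregations, axis=1)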
  • added per-bucket aggregations, where the columns are divided into buckets and the aggregations above are calculated separately for each bucket. This is parametrized with:
  row_aggregations__bucket_nrs: "[1, 2]"

For instance, with the setting above the aggregations are calculated for 1 bucket (all columns at once) and then again for each of 2 buckets.
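As a rough illustration of the idea (not the exact implementation from this repository), the sketch below splits a row's columns into contiguous buckets with np.array_split and computes a couple of non-zero statistics per bucket; the helper name bucket_aggregations and the choice of statistics are assumptions made for the example.

import numpy as np
import pandas as pd

def bucket_aggregations(row, bucket_nrs=(1, 2)):
    """Illustrative per-bucket statistics for a single row."""
    features = {}
    for n_buckets in bucket_nrs:
        # split the row's columns into `n_buckets` contiguous chunks
        for i, bucket in enumerate(np.array_split(row.values, n_buckets)):
            non_zero = bucket[bucket != 0]
            prefix = 'bucket{}of{}'.format(i, n_buckets)
            features['{}_non_zero_mean'.format(prefix)] = non_zero.mean() if non_zero.size else 0.0
            features['{}_non_zero_count'.format(prefix)] = non_zero.size
    return pd.Series(features)

# usage: one per-bucket feature row per training row
# bucket_features = train_df.apply(bucket_aggregations, axis=1)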

Model

  • Fewer leaves, no row sampling, and more aggressive column sampling improved both local CV and the public LB. My guess is that the lower model complexity resulted in less overfitting.
  • model parameters (a sketch of how they map onto the LightGBM API follows this list)
# Light GBM
  lgbm_random_search_runs: 0
  lgbm__device: cpu # gpu cpu
  lgbm__boosting_type: gbdt
  lgbm__objective: rmse
  lgbm__metric: rmse
  lgbm__number_boosting_rounds: 10000
  lgbm__early_stopping_rounds: 1000
  lgbm__learning_rate: 0.001
  lgbm__num_leaves: 16
  lgbm__max_depth: -1
  lgbm__min_child_samples: 1
  lgbm__max_bin: 300
  lgbm__subsample: 1.0
  lgbm__subsample_freq: 1
  lgbm__colsample_bytree: 0.1
  lgbm__min_child_weight: 10
  lgbm__reg_lambda: 0.1
  lgbm__reg_alpha: 0.0
  lgbm__scale_pos_weight: 1
  lgbm__zero_as_missing: False
  • LightGBM with the new aggregations + projections (second best): 1.333 CV, 1.38 LB 🏆
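As a hedged illustration only, the sketch below plugs the configuration values above into LightGBM's native Python API; the actual pipeline wires these parameters through its own config machinery, and X_train, y_train, X_valid, y_valid are assumed to be prepared elsewhere.

import lightgbm as lgb

params = {
    'boosting_type': 'gbdt',
    'objective': 'rmse',            # alias for L2/RMSE regression
    'metric': 'rmse',
    'learning_rate': 0.001,
    'num_leaves': 16,
    'max_depth': -1,
    'min_child_samples': 1,
    'max_bin': 300,
    'subsample': 1.0,               # no row sampling
    'subsample_freq': 1,
    'colsample_bytree': 0.1,        # aggressive column sampling
    'min_child_weight': 10,
    'reg_lambda': 0.1,
    'reg_alpha': 0.0,
    'zero_as_missing': False,
}

train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

booster = lgb.train(
    params,
    train_set,
    num_boost_round=10000,
    valid_sets=[valid_set],
    callbacks=[lgb.early_stopping(stopping_rounds=1000)],
)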

Pipeline diagram

solution-6 diagram