
Conversation

ShawnXuan
Contributor

No description provided.

@ShawnXuan ShawnXuan requested a review from guo-ran January 25, 2022 15:16
@guo-ran guo-ran merged commit 5ea2f47 into dev_dlrm Jan 28, 2022
@ShawnXuan ShawnXuan deleted the dev_dlrm_offline_auc branch February 4, 2022 12:39
ShawnXuan added a commit that referenced this pull request Mar 30, 2022
* wdl -> dlrm

* update train.py

* update readme temporary

* update

* update

* update

* update

* update

* update

* update arguments

* rm sparse optimizer

* update

* update

* update

* dot

* eager 1 device, old embedding

* eager consistent ok

* OK for train only

* rm transpose

* still only train OK

* use register_buffer

* train and eval ok

* embedding type

* dense to int

* log(dense+1)
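
The log(dense+1) transform is the usual Criteo preprocessing for the 13 dense columns. A minimal sketch, assuming the dense fields arrive as a float tensor (the clamp is an assumption, added only to keep the log well defined for non-positive raw values):

```python
import oneflow as flow

def transform_dense(dense_fields):
    # Raw Criteo dense values can be zero or negative, so clamp to 0
    # before applying log(x + 1).
    dense_fields = flow.clamp(dense_fields, min=0.0)
    return flow.log(dense_fields + 1.0)
```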

* eager OK

* rm model type

* ignore buffer

* update sh

* rm dropout

* update module

* one module

* update

* update

* update

* update

* labels dtype

* Dev dlrm parquet (#282)

* update

* backup

* parquet train OK

* update

* update

* update

* dense to float

* update

* add lr scheduler (#283)

* Dev dlrm eval partnum (#284)

* eval data part number

* fix

* support slots (#285)

* support slots

* self._origin in graph

* slots to consistent

* format

* fix speed (#286)

Co-authored-by: guo ran <[email protected]>

* Update dlrm.py

bmm -> matmul
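
For context, the pairwise dot interaction this commit touches can use a broadcasting matmul instead of an explicit bmm. A sketch with assumed shapes and names, not copied from dlrm.py:

```python
import oneflow as flow

def dot_interaction(x, ly):
    # x: (B, K) bottom-MLP output; ly: (B, 26, K) sparse-field embeddings.
    T = flow.cat([x.unsqueeze(1), ly], dim=1)  # (B, 27, K)
    # flow.matmul broadcasts over the batch dimension, so it can replace
    # flow.bmm(T, T.transpose(1, 2)) directly.
    Z = flow.matmul(T, T.transpose(1, 2))      # (B, 27, 27) pairwise dots
    return Z.flatten(1)  # the real model keeps only the lower-triangular pairs
```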

* Dev dlrm embedding split (#290)

* support embedding model parallel

* to consistent for embedding

* update sbp derivation

* fix

* update

* dlrm one embedding add options (#291)

* add options

* add fp16 and loss_scaler (#292)

* fix (#293)

* Dev dlrm offline auc (#294)

* calculate auc offline
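
This is the heart of the PR: instead of computing AUC inside the training job, evaluation dumps labels and predictions to disk and a separate step computes the metric. A minimal sketch with illustrative file names:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# The eval loop would save its outputs first, e.g.:
#   np.save("labels.npy", labels); np.save("preds.npy", preds)
labels = np.load("labels.npy")
preds = np.load("preds.npy")
print("offline AUC:", roc_auc_score(labels, preds))
```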

* fix one embedding module, rm optimizer conf (#296)

* calculate auc offline

* update

* add auc calculator

* fix

* format print

* add fused_interaction

* fix

* rm optimizer conf

* fix

Co-authored-by: ShawnXuan <[email protected]>

* refine embedding options (#299)

* refine options

* rename args

* fix arg

* Dev dlrm offline eval (#300)

* update offline auc

* update

* merge master

* Dev dlrm consistent 2 global (#303)

* consistent-

* update

* Dev dlrm petastorm (#306)

petastorm dataset

* bce with logits (#307)
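
Switching to BCE-with-logits moves the sigmoid out of the model and into the loss, which is the numerically safer form, especially under fp16. A sketch, assuming OneFlow's module mirrors the PyTorch API:

```python
import oneflow as flow

loss_fn = flow.nn.BCEWithLogitsLoss(reduction="mean")
logits = flow.randn(8, 1)                           # raw scores, no sigmoid in the model
labels = flow.randint(0, 2, (8, 1)).to(flow.float)  # click labels in {0, 1}
loss = loss_fn(logits, labels)                      # fused sigmoid + BCE
```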

* Dev dlrm make eval ds (#308)

* fix

* new val dataloader each time

* rm useless

* rm useless

* rm useless

* Dev dlrm vocab size (#309)

* fix

* new val dataloader each time

* rm useless

* rm useless

* rm useless

* vocab size

* fix fc(scores) init (#310)

* update dense relu (#311)

* update

* use naive logger

* rm logger.py

* update

* fix loss to local

* rm useless line

* remove to local

* rank 0

* fix

* add graph_train.py

* keep graph mode only in graph_train.py

* rm is_global

* update

* train one_embedding with graph

* update

* rm useless files

* rm more files

* update

* save -> save_model

* update eval arguments

* rm eval_save_dir

* mv import oneflow before sklearn.metrics; otherwise it does not work on onebrain
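
The fix amounts to ordering the imports like this:

```python
import oneflow as flow  # must come before sklearn.metrics, or it breaks on onebrain
from sklearn.metrics import roc_auc_score
```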

* rm useless lines

* print host and device mem after eval

* add auc calculation time

* update

* add fused_dlrm temporarily

* eager train

* shuffling_queue_capacity -> shuffle_row_groups
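
For reference, shuffle_row_groups is petastorm's reader option for shuffling at the parquet row-group level; a sketch of the reader setup (the dataset path is illustrative):

```python
from petastorm import make_batch_reader

with make_batch_reader("file:///data/criteo_parquet/train",
                       shuffle_row_groups=True) as reader:
    for batch in reader:
        pass  # feed batch to the trainer
```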

* update trainer for eager

* rm dataset type

* update

* update

* parquet dataloader

* rm fused_dlrm.py

* update

* update graph train

* update

* update

* update lr scheduler

* update

* update shell

* rm lr scheduler

* rm useless lines

* update

* update one embedding api

* fix

* change size_factor order

* fix eval loader

* rm debug lines

* rm train/eval subfolders

* files

* support test

* update oneembedding initializer

* update

* update

* update

* rm useless lines

* option -> options

* eval barrier

* update

* rm column_ids

* new api

* fix push pull job

* rm eager test

* rm graph test

* rm

* eager_train-

* rm

* merge graph train to train

* rm Embedding

* update

* rm vocab size

* rm test name

* rm split axis

* update

* train -> train_eval

* update

* replace class Trainer

* fix

* fix

* merge mlp and fused mlp

* pythonic

* interaction padding

* format

* left 3 store types

* left 3 store types

* use capacity_per_rank

* fix

* format

* update

* update

* update

* use 13 and 26

* update

* rm size factor

* update

* update

* update readme

* update

* update

* modify_read

* rm useless import

* add requirements.txt

* rm args.not_eval_after_training

* rm batch size per rank

* set default eval batches

* every_n_iter -> interval

* device_memory_budget_mb_per_rank -> cache_memory_budget_mb_per_rank

* dataloader-

* update

* update

* update

* update

* update

* update

* use_fp16-

* single py

* disable_fusedmlp

* 4 to 1

* new api

* add capacity

* Arguments description (#325)

* Arguments description

* rectify README.md

* column-

* make_table

* MultiTableEmbedding

* update store type

* update

* update readme

* update README

* update

* iter->step

* update README

* add license

* update README

* install oneflow nightly

* Add tools directory info to DLRM README.md (#328)

Co-authored-by: guo ran <[email protected]>
Co-authored-by: BakerMara <[email protected]>
Co-authored-by: BoWen Sun <[email protected]>
Co-authored-by: Xinman Liu <[email protected]>
Liuxinman added a commit that referenced this pull request May 20, 2022
* Add deepfm model (FM component missing)

* Add FM component

* Update README.md

* Fix loss bug; change weight initialization methods

* change lr scheduler to MultiStepLR
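
A sketch of the MultiStepLR setup, assuming OneFlow's scheduler mirrors PyTorch's (the milestones and gamma below are placeholders, not the repo's values):

```python
import oneflow as flow

model = flow.nn.Linear(16, 1)
opt = flow.optim.Adam(model.parameters(), lr=1e-3)
scheduler = flow.optim.lr_scheduler.MultiStepLR(opt, milestones=[10, 20], gamma=0.1)
criterion = flow.nn.MSELoss()

x, y = flow.randn(4, 16), flow.randn(4, 1)
for step in range(30):
    loss = criterion(model(x), y)
    loss.backward()
    opt.step()
    opt.zero_grad()
    scheduler.step()  # LR drops 10x after steps 10 and 20
```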

* Add dropout layer to dnn

* Add monitor for early stopping

* Simplify early stopping schema
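
A minimal sketch of such an early-stopping monitor (the patience and threshold are assumptions, not the repo's settings):

```python
class EarlyStopMonitor:
    """Stop when the monitored metric (e.g. val AUC) has not improved for `patience` evals."""

    def __init__(self, patience=3, min_delta=1e-4):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_evals = float("-inf"), 0

    def update(self, metric):
        if metric > self.best + self.min_delta:
            self.best, self.bad_evals = metric, 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience  # True -> stop training
```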

* Normal initialization for oneembedding; Adam optimizer; h52parquet

* Add logloss in eval for early stop

* Fix dataloader slicing bug

* Change lr schedule to reduce lr on plateau

* Refine train/val/test

* Add validation and test evaluation

* Update readme and help message

* use flow.roc_auc_score, prefetch eval batches, fix train step start time
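
Moving from sklearn to flow.roc_auc_score keeps the metric computation on the OneFlow side; a sketch, assuming the function takes (labels, predictions) in that order like its sklearn counterpart:

```python
import numpy as np
import oneflow as flow

labels = flow.tensor(np.array([0, 1, 1, 0], dtype=np.float32))
preds = flow.tensor(np.array([0.1, 0.8, 0.6, 0.3], dtype=np.float32))
auc = flow.roc_auc_score(labels, preds)  # assumed argument order
print(auc.numpy())
```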

* Delete unused args;
Change file path;
Add Throughput measurement.

* Add deepfm with MultiColOneEmbedding

* remove fusedmlp; change interaction class to function; keep val graph predict on GPU

* Use flow._C.binary_cross_entropy_loss;
Remove sklearn from env requirement;

* Fix early stop bug;
Check if path is valid before loading model

* Change auc time and logloss time to metrics time;
Remove last validation;

* replace view with keepdim;
replace nn.sigmoid with tensor.sigmoid

* change unsqueeze to keepdim;
use list in dataloader

* Use from numpy to reduce cast time
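
flow.from_numpy wraps the numpy buffer without copying (like torch.from_numpy), which is what saves the cast time; a sketch:

```python
import numpy as np
import oneflow as flow

batch = np.random.rand(8192, 13).astype(np.float32)
dense = flow.from_numpy(batch)  # zero-copy wrap; flow.tensor(batch) would copy
```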

* Add early stop and save best to args

* Reformat deepfm_train_eval

* Use BCEWithLogitsLoss

* Update readme;
Change early_stop to disable_early_stop;
Update train script

* Update README.md

* Fix early stop bugs

* Refine save best model help message

* Add scala script and spark launching shell script

* Delete h5_to_parquet.py

* Update readme.md

* Use real values in table size array example;
delete criteo_parquet.py

* Add split_criteo_kaggle.py

* Update readme.md

* Rename training script;
Update readme.md

* Update Readme.md (fix bad links)

* Update README.md

* Format files

* Add out_features in DNN

Co-authored-by: ShawnXuan <[email protected]>
Co-authored-by: guo ran <[email protected]>
Co-authored-by: BakerMara <[email protected]>
Co-authored-by: BoWen Sun <[email protected]>