Arguments description #325
RecommenderSystems/dlrm/README.md (outdated)

| Argument | Description | Default |
|----------|-------------|---------|
| model_save_dir | model saving directory | ./checkpoint |
| save_initial_model | save initial model parameters or not | |
| save_model_after_each_eval | save model after each eval | |
| not_eval_after_training | do eval after training | |
remove this (`not_eval_after_training`)
RecommenderSystems/dlrm/README.md (outdated)

| Argument | Description | Default |
|----------|-------------|---------|
| save_model_after_each_eval | save model after each eval | |
| not_eval_after_training | do eval after training | |
| data_dir | the data file directory | /dataset/dlrm_parquet |
| eval_batchs | <0: whole val ds, 0: do not val, >0: number of eval batches | -1 |
typo: `eval_batchs` should be `eval_batches`
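The three-way semantics documented for `eval_batches` (negative: evaluate on the whole validation dataset, zero: skip evaluation, positive: evaluate that many batches) can be sketched as a small helper. The function name, the `num_val_batches` argument, and the capping behavior are illustrative assumptions, not code from the repository:

```python
def resolve_eval_batches(eval_batches: int, num_val_batches: int) -> int:
    """Map the eval_batches argument to an actual number of batches to run.

    eval_batches < 0  -> evaluate on the whole validation dataset
    eval_batches == 0 -> skip evaluation entirely
    eval_batches > 0  -> evaluate exactly that many batches
    """
    if eval_batches < 0:
        return num_val_batches
    # Never ask for more batches than the validation set provides.
    return min(eval_batches, num_val_batches)
```

With the default of `-1`, evaluation always covers the full validation set regardless of its size.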
RecommenderSystems/dlrm/README.md (outdated)

| Argument | Description | Default |
|----------|-------------|---------|
| data_dir | the data file directory | /dataset/dlrm_parquet |
| eval_batchs | <0: whole val ds, 0: do not val, >0: number of eval batches | -1 |
| eval_batch_size | | 55296 |
| eval_batch_size_per_proc | | None |
remove this (`eval_batch_size_per_proc`)
RecommenderSystems/dlrm/README.md (outdated)

| Argument | Description | Default |
|----------|-------------|---------|
| eval_batch_size | | 55296 |
| eval_batch_size_per_proc | | None |
| eval_interval | | 10000 |
| batch_size | the data batch size in one step training | 55296 |
rename `batch_size` to `train_batch_size`
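The `batch_size` / `batch_size_per_proc` pair quoted above implies that the per-process size is derived from the global one when it is not set explicitly. A minimal sketch of that derivation, assuming the global batch size must divide evenly across ranks (the helper name is hypothetical):

```python
from typing import Optional

def derive_batch_size_per_proc(train_batch_size: int, world_size: int,
                               per_proc: Optional[int] = None) -> int:
    """Per-process batch size: an explicit value wins; otherwise split the
    global training batch size evenly across processes."""
    if per_proc is not None:
        return per_proc
    if train_batch_size % world_size != 0:
        raise ValueError("train_batch_size must be divisible by world_size")
    return train_batch_size // world_size
```

For example, the default global batch size of 55296 over 8 processes gives 6912 samples per process.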
RecommenderSystems/dlrm/README.md (outdated)

| Argument | Description | Default |
|----------|-------------|---------|
| eval_batch_size_per_proc | | None |
| eval_interval | | 10000 |
| batch_size | the data batch size in one step training | 55296 |
| batch_size_per_proc | | None |
remove this (`batch_size_per_proc`)
RecommenderSystems/dlrm/README.md (outdated)

| Argument | Description | Default |
|----------|-------------|---------|
| column_size_array | column_size_array | |
| persistent_path | path for persistent kv store | |
| store_type | | |
| device_memory_budget_mb_per_rank | | 8192 |
rename to `cache_memory_budget_mb_per_rank`
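The commit history in this PR confirms the rename (`device_memory_budget_mb_per_rank -> cache_memory_budget_mb_per_rank`). One common way to apply such a rename without breaking existing launch scripts is to keep the old flag as a deprecated alias in the argument parser. The parser setup below is an illustrative sketch under that assumption, not the repository's actual `train.py`:

```python
import argparse
import warnings

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--cache_memory_budget_mb_per_rank", type=int, default=8192)
    # Old spelling kept as a hidden alias so existing launch scripts keep working.
    parser.add_argument("--device_memory_budget_mb_per_rank", type=int, default=None,
                        help=argparse.SUPPRESS)
    args = parser.parse_args(argv)
    if args.device_memory_budget_mb_per_rank is not None:
        warnings.warn("--device_memory_budget_mb_per_rank is deprecated, "
                      "use --cache_memory_budget_mb_per_rank")
        args.cache_memory_budget_mb_per_rank = args.device_memory_budget_mb_per_rank
    return args
```

The old flag is hidden from `--help` via `argparse.SUPPRESS`, so new users only see the new name while old scripts still parse.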
RecommenderSystems/dlrm/README.md (outdated)

| Argument | Description | Default |
|----------|-------------|---------|
| persistent_path | path for persistent kv store | |
| store_type | | |
| device_memory_budget_mb_per_rank | | 8192 |
| use_fp16 | Run model with amp | |
rename `use_fp16` to `amp`
* wdl -> dlrm * update train.py * update readme temporary * update * update * udpate * update * update * update * update arguments * rm spase optimizer * update * update * update * dot * eager 1 device, old embedding * eager consistent ok * OK for train only * rm transpose * still only train OK * use register_buffer * train and eval ok * embedding type * dense to int * log(dense+1) * eager OK * rm model type * ignore buffer * update sh * rm dropout * update module * one module * update * update * update * update * labels dtype * Dev dlrm parquet (#282) * update * backup * parquet train OK * update * update * update * dense to float * update * add lr scheduler (#283) * Dev dlrm eval partnum (#284) * eval data part number * fix * support slots (#285) * support slots * self._origin in graph * slots to consistent * format * fix speed (#286) Co-authored-by: guo ran <[email protected]> * Update dlrm.py bmm -> matmul * Dev dlrm embedding split (#290) * support embedding model parallel * to consistent for embedding * update sbp derivation * fix * update * dlrm one embedding add options (#291) * add options * add fp16 and loss_scaler (#292) * fix (#293) * Dev dlrm offline auc (#294) * calculate auc offline * fix one embedding module, rm optimizer conf (#296) * calculate auc offline * update * add auc calculater * fix * format print * add fused_interaction * fix * rm optimizer conf * fix Co-authored-by: ShawnXuan <[email protected]> * refine embedding options (#299) * refine options * rename args * fix arg * Dev dlrm offline eval (#300) * update offline auc * update * merge master * Dev dlrm consistent 2 global (#303) * consistent- * update * Dev dlrm petastorm (#306) petastorm dataset * bce with logits (#307) * Dev dlrm make eval ds (#308) * fix * new val dataloader each time * rm usless * rm usless * rm usless * Dev dlrm vocab size (#309) * fix * new val dataloader each time * rm usless * rm usless * rm usless * vocab size * fix fc(scores) init (#310) * udate dense relu 
(#311) * update * use naive logger * rm logger.py * update * fix loss to local * rm usless line * remove to local * rank 0 * fix * add graph_train.py * keep graph mode only in graph_train.py * rm is_global * update * train one_embedding with graph * update * rm usless files * rm more files * update * save -> save_model * update eval arguments * rm eval_save_dir * mv import oneflow before sklearn.metrics, otherwise not work on onebrain * rm usless lines * print host and device mem after eval * add auc calculation time * update * add fused_dlrm temporarily * eager train * shuffling_queue_capacity -> shuffle_row_groups * update trainer for eager * rm dataset type * update * update * parquet dataloader * rm fused_dlrm.py * update * update graph train * update * update * update lr scheduler * update * update shell * rm lr scheduler * rm useless lines * update * update one embedding api * fix * change size_factor order * fix eval loader * rm debug lines * rm train/eval subfolders * files * support test * update oneembedding initlizer * update * update * update * rm usless lines * option -> options * eval barrier * update * rm column_ids * new api * fix push pull job * rm eager test * rm graph test * rm * eager_train- * rm * merge graph train to train * rm Embedding * update * rm vocab size * rm test name * rm split axis * update * train -> train_eval * update * replace class Trainer * fix * fix * merge mlp and fused mlp * pythonic * interaction padding * format * left 3 store types * left 3 store types * use capacity_per_rank * fix * format * update * update * update * use 13 and 26 * update * rm size factor * update * update * update readme * update * update * modify_read * rm usless import * add requirements.txt * rm args.not_eval_after_training * rm batch size per rank * set default eval batches * every_n_iter -> interval * device_memory_budget_mb_per_rank -> cache_memory_budget_mb_per_rank * dataloader- * update * update * update * update * update * update * 
use_fp16- * single py * disable_fusedmlp * 4 to 1 * new api * add capacity * Arguments description (#325) * Arguments description * rectify README.md * column- * make_table * MultiTableEmbedding * update store type * update * update readme * update README * update * iter->step * update README * add license * update README * install oneflow nightly * Add tools directory info to DLRM README.md (#328) Co-authored-by: guo ran <[email protected]> Co-authored-by: BakerMara <[email protected]> Co-authored-by: BoWen Sun <[email protected]> Co-authored-by: Xinman Liu <[email protected]>
* Add deepfm model(FM component missed) * Add FM component * Update README.md * Fix loss bug; change weight initialization methods * change lr scheduler to multistepLR * Add dropout layer to dnn * Add monitor for early stopping * Simplify early stopping schema * Normal initialization for oneembedding; Adam optimizer; h52parquet * Add logloss in eval for early stop * Fix dataloader slicing bug * Change lr schedule to reduce lr on plateau * Refine train/val/test * Add validation and test evaluation * Update readme and help message * use flow.roc_auc_score, prefetch eval batches, fix train step start time * Delete unused args; Change file path; Add Throughput measurement. * Add deepfm with MultiColOneEmbedding * remove fusedmlp; change interaction class to function; keep val graph predict in gpu * Use flow._C.binary_cross_entropy_loss; Remove sklearn from env requirement; * Fix early stop bug; Check if path valid before loading model * Change auc time and logloss time to metrics time; Remove last validation; * replace view with keepdim; replace nn.sigmoid with tensor.sigmoid * change unsqueeze to keepdim; use list in dataloader * Use from numpy to reduce cast time * Add early stop and save best to args * Reformat deepfm_train_eval * Use BCEWithLogitsLoss * Update readme; Change early_stop to disable_early_stop; Update train script * Update README.md * Fix early stop bugs * Refine save best model help message * Add scala script and spark launching shell script * Delete h5_to_parquet.py * Update readme.md * Use real values in table size array example; delete criteo_parquet.py * Add split_criteo_kaggle.py * Update readme.md * Rename training script; Update readme.md * Update Readme.md (fix bad links) * Update README.md * Format files * Add out_features in DNN Co-authored-by: ShawnXuan <[email protected]> Co-authored-by: guo ran <[email protected]> Co-authored-by: BakerMara <[email protected]> Co-authored-by: BoWen Sun <[email protected]>