Merged
56 commits
07f6ed3
add dcn files.
jiangyzy Mar 24, 2022
8882312
add README.md
jiangyzy Mar 24, 2022
e953d85
update readme.md, requirements.txt, train.sh. pretrained models cover…
jiangyzy Mar 25, 2022
d241dbf
deleted files
jiangyzy Mar 28, 2022
e2d76dd
deleted files
jiangyzy Mar 28, 2022
a632cb5
auto format by CI
oneflow-ci-bot Mar 28, 2022
62b98d0
deleted .gitignore
jiangyzy Mar 28, 2022
543a1bb
deleted .gitignore
jiangyzy Mar 28, 2022
6e3a757
updated files
jiangyzy Mar 28, 2022
68e8e49
modified nn.init.zeros_ and nn.init.xavier_normal_ in crossnet.
jiangyzy Mar 29, 2022
06dfe97
fix change form /scripts/swin_dataloader_compare_speed_with_pytorch.py
jiangyzy Mar 30, 2022
c672dc4
add processing frappe from csv to parqurt format files: tools/frap…
jiangyzy Mar 30, 2022
03c5763
modified frappe download link in README.md
jiangyzy Mar 30, 2022
d9dfd92
delete tools dir
jiangyzy Mar 31, 2022
249b794
add tools dir
jiangyzy Mar 31, 2022
b87dbab
update dcn_graph_train_eval files
jiangyzy Apr 1, 2022
f14901b
Merge branch 'main' of https://github.com/Oneflow-Inc/models into dcn…
jiangyzy Apr 13, 2022
bc1d6d6
update fuxi dcn graph train and eval files , new dataset make tool ba…
jiangyzy Apr 13, 2022
472645f
modified train.sh table_size_array
jiangyzy Apr 13, 2022
91a88a4
fix some erroe in fuxi_data_util when save csv
jiangyzy Apr 14, 2022
900cab4
Merge remote-tracking branch 'origin/dcn_fuxi_train_eval' into main
jiangyzy Apr 18, 2022
09f6501
Criteo dcn related files
jiangyzy Apr 19, 2022
bb364fb
modified README.md
jiangyzy Apr 19, 2022
6d477b8
modified dcn_train_eval.py some arguments name
jiangyzy Apr 20, 2022
3c9201b
create graph when lr_decay
jiangyzy Apr 24, 2022
e987402
deleted fm_persistent
jiangyzy Apr 24, 2022
2e50a5b
update dcn_train_eval.py
jiangyzy Apr 26, 2022
bd77d0f
formated file by
jiangyzy Apr 26, 2022
329f789
new tool dir , and modified dcn_train_eval.py/sh fake path
jiangyzy May 5, 2022
a589398
add feature_map_json argment
jiangyzy May 5, 2022
572d969
delete unnecessary and useless code
jiangyzy May 6, 2022
24fe34d
add cast in make_criteo_parquet.py, modified dcn_train_eval.py
jiangyzy May 12, 2022
d29f5ee
delete useless
jiangyzy May 12, 2022
6297862
add throughput
jiangyzy May 16, 2022
e14f763
Merge branch 'main' of https://github.com/Oneflow-Inc/models into cri…
jiangyzy May 16, 2022
b2dc7d6
add valid test samples arg
jiangyzy May 16, 2022
9bae48b
fix batch_size and train_batch_size mismatched problem
jiangyzy May 16, 2022
1960bd7
delete uesless print code
jiangyzy May 16, 2022
ed02f35
add a blank line in the bottom of dataset_config.yaml
jiangyzy May 17, 2022
8bfb01b
add requirements.txt, update README.md
jiangyzy May 17, 2022
5ea1f41
move loss=loss.numpy() to improve efficiency
jiangyzy May 17, 2022
22a0477
delete fuxi code in dcn_train_eval.py, add scala related files, upda…
jiangyzy May 18, 2022
dd87ca6
update README
jiangyzy May 18, 2022
062680b
remove RecommenderSystems/dcn/tools/make_criteo_parquet.py and Recom…
jiangyzy May 18, 2022
3adeb3b
simplified DNN module, modified test eval process and related READEM…
jiangyzy May 19, 2022
cc9b8c4
add Crossnet fuxi quote, modified directory description in Readme an…
jiangyzy May 19, 2022
1cef2f6
name auc loglogg in eval process as val_auc val_logloss, add pandas …
jiangyzy May 20, 2022
314831f
simplified train.sh and related README contents
jiangyzy May 20, 2022
2bb71e4
simplified L2,3,4 in train.sh
jiangyzy May 20, 2022
b2574fb
set size_factor default=3
jiangyzy May 20, 2022
fffafce
add dcn structure image
jiangyzy May 20, 2022
60f8505
update Crossnet implementation in README
jiangyzy May 21, 2022
b1f9403
update Crossnet implementation in README
jiangyzy May 21, 2022
104127f
update Crossnet implementation in README
jiangyzy May 21, 2022
8eac6b3
update Crossnet implementation in README
jiangyzy May 21, 2022
ee21320
update README
jiangyzy May 23, 2022
166 changes: 166 additions & 0 deletions RecommenderSystems/dcn/README.md
@@ -0,0 +1,166 @@
# Deep&Cross
[Deep & Cross Network](https://dl.acm.org/doi/10.1145/3124749.3124754) (DCN) keeps the benefits of a DNN model while learning bounded-degree feature crosses more efficiently. In particular, DCN applies explicit feature crossing at each layer, requires no manual feature engineering, and adds almost negligible complexity compared with a plain DNN model.
![DCN](https://user-images.githubusercontent.com/80230303/159417248-1975736f-3de8-4972-84e3-2f0f346cbc1a.png)


The OneFlow API is compatible with PyTorch, so modules implemented in PyTorch can be used in OneFlow with only minor code changes. We therefore adopted some implementations from [FuxiCTR](https://github.com/xue-pai/FuxiCTR/tree/v1.0.2). For example, `CrossInteractionLayer` is reused as the basic hidden layer of `CrossNet`, and the `CrossNet` module applies these hidden layers in a loop to obtain high-degree interactions across features.
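The loop over cross layers can be illustrated with a minimal, framework-free sketch. It assumes the cross layer from the DCN paper, `x_{l+1} = x_0 * (w_l · x_l) + b_l + x_l`, and uses plain Python lists instead of OneFlow tensors; the function names `cross_layer` and `cross_net` are hypothetical and are not the code in `dcn_train_eval.py`.

```python
def cross_layer(x0, xi, weight, bias):
    """One cross interaction: x0 * (weight . xi) + bias (no residual term)."""
    # Scalar projection of the current layer input onto the weight vector.
    proj = sum(w * x for w, x in zip(weight, xi))
    return [proj * a + b for a, b in zip(x0, bias)]


def cross_net(x0, weights, biases):
    """Apply the cross layers in a loop, adding the residual x_i at each step."""
    xi = list(x0)
    for weight, bias in zip(weights, biases):
        out = cross_layer(x0, xi, weight, bias)
        xi = [x + o for x, o in zip(xi, out)]
    return xi


# Toy input: two features, two cross layers with hand-picked parameters.
x0 = [1.0, 2.0]
weights = [[1.0, 0.0], [1.0, 0.0]]
biases = [[0.0, 0.0], [0.0, 0.0]]
result = cross_net(x0, weights, biases)
print(result)
```

Each pass through the loop raises the highest polynomial degree of the feature interactions by one, which is why the number of layers (`crossing_layers`) bounds the interaction degree.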

## Directory description
```
.
|-- tools
  |-- dcn_parquet.scala      # Read Criteo Kaggle data and export it in Parquet format
  |-- split_criteo.py        # Split the Criteo Kaggle dataset into train/val/test CSV files
  |-- launch_spark.sh        # Spark launch shell script
|-- dcn_train_eval.py        # OneFlow DCN train/val/test script with the OneEmbedding module
|-- train.sh                 # DCN training shell script
|-- requirements.txt         # Python package requirements file
└── README.md                # Documentation
```


## Arguments description
We use exactly the same default values as the [DCN_criteo_x4_001](https://github.com/openbenchmark/BARS/tree/master/ctr_prediction/benchmarks/DCN/DCN_criteo_x4_001) experiment in FuxiCTR.
|Argument Name|Argument Explanation|Default Value|
|-----|---|------|
|data_dir|the data file directory|*Required Argument*|
|num_train_samples|the number of training samples|36672493|
|num_valid_samples|the number of validation samples|4584062|
|num_test_samples|the number of test samples|4584062|
|shard_seed|seed for shuffling parquet data|2022|
|model_load_dir|model loading directory|None|
|model_save_dir|model saving directory|None|
|save_best_model|save best model or not|False|
|save_initial_model|save initial model parameters or not|False|
|save_model_after_each_eval|save model or not after each evaluation|False|
|embedding_vec_size|embedding vector dimension size|128|
|batch_norm|batch norm used in DNN|False|
|dnn_hidden_units|hidden units list of DNN|"1000,1000,1000,1000,1000"|
|crossing_layers|number of CrossNet layers|3|
|net_dropout|dropout rate of DNN|0.2|
|embedding_regularizer|rate of embedding layer regularizer|None|
|net_regularizer|rate of Crossnet and DNN layer regularizer|None|
|disable_early_stop|disable early stop or not|False|
|patience|number of epochs to wait before early stopping|2|
|min_delta|minimal delta of metric Monitor|1.0e-6|
|lr_factor|learning rate decay factor|0.1|
|min_lr|minimal learning rate|1.0e-6|
|learning_rate|learning rate|0.001|
|size_factor|size factor of OneEmbedding|3|
|valid_batch_size|valid batch size|10000|
|valid_batches|number of valid batches|1000|
|test_batch_size|test batch size|10000|
|test_batches|number of test batches|1000|
|train_batch_size|train batch size|10000|
|train_batches|number of train batches|15000|
|loss_print_interval|training loss print interval|100|
|table_size_array|table size array for sparse fields|*Required Argument*|
|persistent_path|path for OneEmbedding persistent kv store|*Required Argument*|
|store_type|OneEmbedding persistent kv store type: `device_mem`, `cached_host_mem` or `cached_ssd` |cached_ssd|
|cache_memory_budget_mb|size of cache memory budget on each device in megabytes when `store_type` is `cached_host_mem` or `cached_ssd`|8192|
|amp|enable Automatic Mixed Precision(AMP) training|False|
|loss_scale_policy|loss scale policy for AMP training: `static` or `dynamic`|static|

#### Early Stopping Scheme

The model is evaluated at the end of every epoch. If the early stopping criterion is met at that point, training stops.

The metric monitored for early stopping is `val_auc - val_log_loss`, and the mode is `max`. Tune `patience` and `min_delta` as needed.

If you want to disable early stopping, simply add `--disable_early_stop` to [train.sh](https://github.com/Oneflow-Inc/models/blob/criteo_dcn/RecommenderSystems/dcn/train.sh).
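A minimal sketch of such a monitor follows. It assumes that "criterion met" means the monitored value has not improved by more than `min_delta` for `patience` consecutive evaluations; the class name `EarlyStopMonitor` is hypothetical, and the actual logic in `dcn_train_eval.py` may differ (for example, it may also decay the learning rate).

```python
class EarlyStopMonitor:
    """Track `val_auc - val_logloss` in `max` mode and signal when to stop."""

    def __init__(self, patience=2, min_delta=1.0e-6):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.stale_epochs = 0  # consecutive evaluations without improvement

    def update(self, val_auc, val_logloss):
        """Record one evaluation; return True if training should stop."""
        metric = val_auc - val_logloss
        if metric > self.best + self.min_delta:
            self.best = metric
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


# Illustrative values only, not measured results.
monitor = EarlyStopMonitor(patience=2, min_delta=1.0e-6)
history = [(0.80, 0.46), (0.81, 0.45), (0.81, 0.45), (0.80, 0.46)]
decisions = [monitor.update(auc, logloss) for auc, logloss in history]
print(decisions)
```

Raising `patience` tolerates longer plateaus, while a larger `min_delta` requires a bigger gain before an epoch counts as an improvement.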


## Getting started
If you'd like to quickly train a OneFlow DCN model, please follow the steps below:
### Installing OneFlow and Dependencies
1. Install the nightly release of OneFlow with CUDA 11.5 support:
```
python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/master/cu115
```
For more information on how to install OneFlow, please refer to the [OneFlow Installation Tutorial](https://github.com/Oneflow-Inc/oneflow#install-oneflow).

2. Check `requirements.txt` and install the dependencies manually, or execute:
```bash
python3 -m pip install -r requirements.txt
```
### Dataset
The Criteo dataset comes from the [2014-kaggle-display-advertising-challenge-dataset](https://www.kaggle.com/competitions/criteo-display-ad-challenge/overview). Since the original download link is no longer valid, you can download it [here](https://www.kaggle.com/datasets/mrkmakr/criteo-dataset) instead.

Each sample contains:
- Label - Target variable that indicates if an ad was clicked (1) or not (0).
- I1-I13 - A total of 13 columns of integer features (mostly count features).
- C1-C26 - A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes.
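The column layout described above can be written down as a short sketch. It assumes the raw Criteo file is tab-separated with no header row and that missing values appear as empty fields; the helper `parse_row` is hypothetical and is not part of the tools in this directory.

```python
LABEL = "Label"
DENSE_COLUMNS = [f"I{i}" for i in range(1, 14)]   # I1..I13, integer features
SPARSE_COLUMNS = [f"C{i}" for i in range(1, 27)]  # C1..C26, hashed categorical features
COLUMNS = [LABEL] + DENSE_COLUMNS + SPARSE_COLUMNS


def parse_row(line):
    """Map one tab-separated line to {column: raw string}; empty string means missing."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))


# Synthetic row for illustration (not real Criteo data).
sample = "\t".join(["1"] + ["3"] * 13 + ["68fd1e64"] * 26)
row = parse_row(sample)
print(row["Label"], row["I1"], row["C26"])
```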


1. Download the [Criteo Kaggle dataset](https://www.kaggle.com/c/criteo-display-ad-challenge) and then split it using `split_criteo_kaggle.py`.

2. Launch a Spark shell using [launch_spark.sh](https://github.com/Oneflow-Inc/models/blob/criteo_dcn/RecommenderSystems/dcn/tools/launch_spark.sh).

- Modify the SPARK_LOCAL_DIRS as needed

```shell
export SPARK_LOCAL_DIRS=/path/to/your/spark/
```

- Run `bash launch_spark.sh`

3. Load [dcn_parquet.scala](https://github.com/Oneflow-Inc/models/blob/criteo_dcn/RecommenderSystems/dcn/tools/dcn_parquet.scala) into your Spark shell with `:load dcn_parquet.scala`.

4. Call the `makeDCNDataset(srcDir: String, dstDir:String)` function to generate the dataset.

```scala
makeDCNDataset("/path/to/your/src_dir", "/path/to/your/dst_dir")
```

After the parquet dataset is generated, the dataset information is printed. It includes the number of samples in each split and the table size array, both of which are needed for training.

```txt
train samples = 36672493
validation samples = 4584062
test samples = 4584062
table size array:
649,9364,14746,490,476707,11618,4142,1373,7275,13,169,407,1376,1460,583,10131227,2202608,305,24,12517,633,3,93145,5683,8351593,3194,27,14992,5461306,10,5652,2173,4,7046547,18,15,286181,105,142572
```
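As a rough sketch, the printed summary can be converted into the values that the training arguments expect. This assumes the output has exactly the layout shown above; `parse_dataset_info` is a hypothetical helper, not something provided by the tools in this directory.

```python
def parse_dataset_info(text):
    """Parse the printed dataset summary into sample counts and table sizes."""
    info = {}
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    for i, line in enumerate(lines):
        if "=" in line:
            # e.g. "train samples = 36672493" -> {"train_samples": 36672493}
            key, value = line.split("=", 1)
            info[key.strip().replace(" ", "_")] = int(value)
        elif line.startswith("table size array"):
            # The comma-separated sizes are on the following line.
            info["table_size_array"] = [int(v) for v in lines[i + 1].split(",")]
    return info


summary = """
train samples = 36672493
validation samples = 4584062
test samples = 4584062
table size array:
649,9364,14746,490,476707,11618,4142,1373,7275,13,169,407,1376,1460,583,10131227,2202608,305,24,12517,633,3,93145,5683,8351593,3194,27,14992,5461306,10,5652,2173,4,7046547,18,15,286181,105,142572
"""
info = parse_dataset_info(summary)
print("--num_train_samples", info["train_samples"])
print("--table_size_array", ",".join(str(v) for v in info["table_size_array"]))
```

The array has 39 entries, which matches the 13 integer plus 26 categorical columns described earlier.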


## Start training with OneFlow
The following command launches 8 OneFlow DCN training and evaluation processes on a node with 8 GPU devices. Specify `data_dir` for the input data and `persistent_path` for the OneEmbedding persistent store.

`table_size_array` is closely related to the sparse features of the input data. Each sparse field, such as `C1` or any other `C*` field in the Criteo dataset, corresponds to an embedding table with its own capacity of unique feature ids. This capacity is also called the `number of rows` or `size of embedding table`, and the embedding table is initialized with this value. `table_size_array` holds the `size of embedding table` for every sparse field and is also used to estimate the capacity for OneEmbedding.
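To get a feel for the scale involved, the sketch below sums the table sizes and computes the raw size of the embedding weights alone. It assumes float32 values (4 bytes each) and ignores optimizer state, `size_factor`, and caching, so it is only a lower-bound illustration and not how OneEmbedding actually computes its capacity; `embedding_weight_bytes` is a hypothetical helper.

```python
TABLE_SIZE_ARRAY = (
    "649,9364,14746,490,476707,11618,4142,1373,7275,13,169,407,1376,"
    "1460,583,10131227,2202608,305,24,12517,633,3,93145,5683,8351593,"
    "3194,27,14992,5461306,10,5652,2173,4,7046547,18,15,286181,105,142572"
)


def embedding_weight_bytes(table_size_array, embedding_vec_size, bytes_per_value=4):
    """Total rows across all tables times vector size times bytes per value."""
    sizes = [int(v) for v in table_size_array.split(",")]
    return sum(sizes) * embedding_vec_size * bytes_per_value


total_rows = sum(int(v) for v in TABLE_SIZE_ARRAY.split(","))
total_bytes = embedding_weight_bytes(TABLE_SIZE_ARRAY, embedding_vec_size=16)
print(f"rows: {total_rows}, weights: {total_bytes / 2**30:.2f} GiB")
```

A handful of fields with millions of rows dominate the total, which is why the choice of `store_type` and `cache_memory_budget_mb` matters for these tables.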

```bash
DEVICE_NUM_PER_NODE=8
DATA_DIR=your_path/criteo_parquet
PERSISTENT_PATH=your_path/persistent1
MODEL_SAVE_DIR=your_path/model_save_dir

python3 -m oneflow.distributed.launch \
--nproc_per_node $DEVICE_NUM_PER_NODE \
--nnodes 1 \
--node_rank 0 \
--master_addr 127.0.0.1 \
dcn_train_eval.py \
--data_dir $DATA_DIR \
--model_save_dir $MODEL_SAVE_DIR \
--persistent_path $PERSISTENT_PATH \
--table_size_array "649,9364,14746,490,476707,11618,4142,1373,7275,13,169,407,1376,1460,583,10131227,2202608,305,24,12517,633,3,93145,5683,8351593,3194,27,14992,5461306,10,5652,2173,4,7046547,18,15,286181,105,142572" \
--store_type 'cached_host_mem' \
--cache_memory_budget_mb 2048 \
--dnn_hidden_units "1000, 1000, 1000, 1000, 1000" \
--crossing_layers 4 \
--embedding_vec_size 16

```

You can modify these settings in [train.sh](https://github.com/Oneflow-Inc/models/blob/criteo_dcn/RecommenderSystems/dcn/train.sh) and then run it with:

`bash train.sh`



