Open
81 commits
7e29a06
script
Panlichen Jun 21, 2022
e016552
half p2b scripts
Panlichen Jun 23, 2022
437bd08
use to_local to print real partial tensor
Panlichen Jun 23, 2022
be1ee37
done p2b test script
Panlichen Jun 23, 2022
ee4c6e0
update test_p2b.py
Panlichen Jun 23, 2022
5db457f
update scripts
Panlichen Jun 23, 2022
4267c00
update scripts
Panlichen Jun 24, 2022
18cad18
Merge branch 'Oneflow-Inc:main' into test_ofccl
Panlichen Jun 24, 2022
45ee712
update scripts
Panlichen Jun 24, 2022
fd543ba
update scripts
Panlichen Jun 25, 2022
fa136f6
scripts
Panlichen Jun 25, 2022
c0728e0
update scripts
Panlichen Jun 28, 2022
1b67b63
update gitignore
Panlichen Jun 28, 2022
b32320d
Merge branch 'test_ofccl' into test_vlog
Panlichen Jun 28, 2022
6c35f28
update scripts for vlog
Panlichen Jun 28, 2022
caaf7d1
update scripts
Panlichen Jun 28, 2022
62641f0
update imageNet path for machines
Panlichen Sep 27, 2022
b6972b8
polish scripts
Panlichen Oct 10, 2022
b300237
script
Panlichen Oct 14, 2022
44821eb
script
Panlichen Oct 24, 2022
b63e1eb
no eval for resnet
Panlichen Oct 24, 2022
fdfb1ac
Merge branch 'test_ofccl' of github.com:Panlichen/models into test_ofccl
Panlichen Oct 24, 2022
21142af
no eval for resnet
Panlichen Oct 24, 2022
9403a74
.
Panlichen Oct 24, 2022
3d00cfa
script
Panlichen Oct 26, 2022
540f3d0
train resnet script
Panlichen Nov 18, 2022
2594084
use all epoch
Panlichen Nov 18, 2022
b3832ba
test 1 epoch
Panlichen Nov 18, 2022
e5adf38
not use log file
Panlichen Nov 19, 2022
23502bc
4 rank use 0,1,4,5
Panlichen Nov 19, 2022
bb83ede
script
Panlichen Nov 21, 2022
f5a4218
script
Panlichen Nov 26, 2022
628c2aa
scripts
Panlichen Nov 29, 2022
6f240d7
scripts
Panlichen Nov 30, 2022
c0833cd
scripts
Panlichen Dec 1, 2022
390a9a2
scripts
Panlichen Dec 5, 2022
69ca481
script
Panlichen Dec 15, 2022
7a4edc6
flow.boxing.nccl.set_fusion_max_ops_num(1)
Panlichen Dec 16, 2022
9ebe384
update scripts
Panlichen Dec 19, 2022
10c3e48
report tp with small interval
Panlichen Dec 20, 2022
be8657f
28 is occupied
Panlichen Dec 22, 2022
d585a5b
update path
Panlichen Dec 22, 2022
83864e7
update path
Panlichen Dec 22, 2022
1d83e6a
update path
Panlichen Dec 22, 2022
f397558
scripts
Panlichen Dec 29, 2022
553ef2c
Merge branch 'test_ofccl' of github.com:Panlichen/models into test_ofccl
Panlichen Dec 29, 2022
0665347
scripts
Panlichen Jan 6, 2023
eb385af
path
Panlichen Jan 6, 2023
96ad2a5
open nego
Panlichen Jan 6, 2023
8abff9f
scripts
Panlichen Jan 8, 2023
0682893
scripts
Panlichen Jan 10, 2023
3b3f2b2
DEVICE_NUM_PER_NODE as cmd line param
Panlichen Jan 11, 2023
01e7102
scripts
Panlichen Jan 13, 2023
60bf8a3
scripts
Panlichen Jan 17, 2023
35a1418
debug 2 card resnet
Panlichen Jan 30, 2023
e267c0a
gdb 20 iter
Panlichen Feb 3, 2023
f1e0ce2
hyperparameter
Panlichen Feb 5, 2023
ae84921
In shell, == compares strings; consider using -eq.
Panlichen Feb 6, 2023
49ba6cc
adjust env
Panlichen Feb 8, 2023
1f9b631
28 8-card parameters; 200 iter
Panlichen Feb 27, 2023
315783d
+ dedicated scripts
Panlichen Feb 27, 2023
612d9a5
4-card parameters
Panlichen Feb 27, 2023
ba94f62
27 resnet parameters.
Panlichen Feb 27, 2023
43fa12d
scripts
Panlichen Apr 11, 2023
117f785
scripts
Panlichen Apr 12, 2023
93b0f78
scripts
Panlichen Apr 13, 2023
22cec6c
Update README.md
Panlichen Apr 13, 2023
338bf1d
Update README.md
Panlichen Apr 14, 2023
1d2597a
Update README.md
Panlichen Apr 14, 2023
bfa1724
+scripts
Panlichen Apr 24, 2023
549e674
scripts
Panlichen Apr 25, 2023
0b68c93
nego parameter
Panlichen Apr 25, 2023
27246b3
scripts
Panlichen Apr 26, 2023
bbcb4f5
scripts
Panlichen Apr 28, 2023
8c668e7
scripts
Panlichen May 4, 2023
9762a67
scripts
Panlichen May 4, 2023
4a02a39
Merge branch 'test_ofccl' of github.com:Panlichen/models into test_ofccl
Panlichen May 4, 2023
f64d071
debug scripts
Panlichen May 5, 2023
c421de5
scripts
Panlichen May 5, 2023
4253ed8
+ para 4090 scripts; try processing the dataset
Panlichen May 10, 2024
cca9b3f
+4090_para scripts
Panlichen May 12, 2024
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -29,6 +29,7 @@ wheels/
MANIFEST
output/
log/
nsys/

# PyInstaller
# Usually these files are written by a python script from a template
@@ -149,3 +150,4 @@ result
imagenette2/
nanodataset/
data-test/
*checkpoints/
157 changes: 8 additions & 149 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,152 +1,11 @@
# OneFlow-Models
**Models and examples implemented with OneFlow (version >= 0.5.0).**
Please refer to the [official repository](https://github.com/Oneflow-Inc/models) for detailed documentation.

## Introduction
**English** | [简体中文](/README_zh-CN.md)

OneFlow-Models is an open source repo which contains official implementation of different models built on OneFlow. In each model, we provide at least two scripts `train.sh` and `infer.sh` for a quick start. For each model, we provide a detailed `README` to introduce the usage of this model.

## Features
- various models and pretrained weight
- easy use for beginners

## Quick Start
Please check out the following **demos** for a quick start:
- **image classification** [quick start lenet demo](Demo/quick_start_demo_lenet/lenet.py)
- **speaker recognition** [speaker identification demo](Demo/speaker_identification_demo)

## Model List
<details>
<summary> <b> Image Classification </b> </summary>

- [Lenet](https://github.com/Oneflow-Inc/models/blob/main/Demo/quick_start_demo_lenet/lenet.py)
- [Alexnet](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/alexnet)
- [VGG16/19](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/vgg)
- [Resnet50](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/resnet50)
- [InceptionV3](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/inception_v3)
- [Densenet](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/densenet)
- [Resnext50_32x4d](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/resnext50_32x4d)
- [Shufflenetv2](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/shufflenetv2)
- [MobilenetV2](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/mobilenetv2)
- [mobilenetv3](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/mobilenetv3)
- [Ghostnet](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/ghostnet)
- [RepVGG](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/repvgg)
- [DLA](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/DLA)
- [PoseNet](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/poseNet)
- [Scnet](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/scnet)
- [Mnasnet](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/mnasnet)
- [ViT](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/ViT)

</details>

<details>
<summary> <b> Video Classification </b> </summary>

- [TSN](https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/video/TSN)

</details>


<details>
<summary> <b> Object Detection </b> </summary>

- [CSRNet](https://github.com/Oneflow-Inc/models/tree/main/Vision/detection/CSRNet)

</details>

<details>
<summary> <b> Semantic Segmentation </b> </summary>

- [FODDet](https://github.com/Oneflow-Inc/models/tree/main/Vision/segmentation/FODDet)
- [FaceSeg](https://github.com/Oneflow-Inc/models/tree/main/Vision/segmentation/FaceSeg)
- [U-Net](https://github.com/Oneflow-Inc/models/tree/main/Vision/segmentation/U-Net)

</details>

<details>
<summary> <b> Generative Adversarial Networks </b> </summary>

- [DCGAN](https://github.com/Oneflow-Inc/models/tree/main/Vision/gan/DCGAN)
- [SRGAN](https://github.com/Oneflow-Inc/models/tree/main/Vision/gan/SRGAN)
- [Pix2Pix](https://github.com/Oneflow-Inc/models/tree/main/Vision/gan/Pix2Pix)
- [CycleGAN](https://github.com/Oneflow-Inc/models/tree/main/Vision/gan/CycleGAN)

</details>

<details>
<summary> <b> Neural Style Transform </b> </summary>

- [FastNeuralStyle](https://github.com/Oneflow-Inc/models/tree/main/Vision/style_transform/fast_neural_style)

</details>


<details>
<summary> <b> Person Re-identification </b> </summary>

- [BoT](https://github.com/Oneflow-Inc/models/tree/main/Vision/reid/BoT)

</details>


<details>
<summary> <b> Natural Language Processing </b> </summary>

- [RNN](https://github.com/Oneflow-Inc/models/tree/main/NLP/rnn)
- [Seq2Seq](https://github.com/Oneflow-Inc/models/tree/main/NLP/seq2seq)
- [LSTMText](https://github.com/Oneflow-Inc/models/tree/main/NLP/LSTMText)
- [TextCNN](https://github.com/Oneflow-Inc/models/tree/main/NLP/TextCNN)
- [Transformer](https://github.com/Oneflow-Inc/models/tree/main/NLP/Transformer)
- [Bert](https://github.com/Oneflow-Inc/models/tree/main/NLP/bert-oneflow)
- [CPT](https://github.com/Oneflow-Inc/models/tree/main/NLP/CPT)
- [MoE](https://github.com/Oneflow-Inc/models/tree/main/NLP/MoE)

</details>

<details>
<summary> <b> Audio </b> </summary>

- [SincNet](https://github.com/Oneflow-Inc/models/tree/main/Audio/SincNet)
- [Wav2Letter](https://github.com/Oneflow-Inc/models/tree/main/Audio/Wav2Letter)
- [AM_MobileNet1D](https://github.com/Oneflow-Inc/models/tree/main/Audio/AM-MobileNet1D)
- [Speech-Emotion-Analyzer](https://github.com/Oneflow-Inc/models/tree/main/Audio/Speech-Emotion-Analyzer)
- [Speech-Transformer](https://github.com/Oneflow-Inc/models/tree/main/Audio/Speech-Transformer)
- [CycleGAN-VC2](https://github.com/Oneflow-Inc/models/tree/main/Audio/CycleGAN-VC2)
- [MaskCycleGAN-VC](https://github.com/Oneflow-Inc/models/tree/main/Audio/MaskCycleGAN-VC)
- [StarGAN-VC](https://github.com/Oneflow-Inc/models/tree/main/Audio/StarGAN-VC)
- [Adaptive_Voice_Conversion](https://github.com/Oneflow-Inc/models/tree/main/Audio/Adaptive_Voice_Conversion)
- [Opentransformer](https://github.com/Oneflow-Inc/models/tree/main/Audio/Opentransformer)
</details>

<details>
<summary> <b> Deep Reinforcement Learning </b> </summary>

- [FlappyBird](https://github.com/Oneflow-Inc/models/tree/main/DeepReinforcementLearning/FlappyBird)
</details>

<details>
<summary> <b> Quantization Aware Training </b> </summary>

- [Quantization](https://github.com/Oneflow-Inc/models/tree/main/Quantization)
</details>

## Installation and Environment setup
**Install Oneflow**

https://github.com/Oneflow-Inc/oneflow#install-with-pip-package

**Build custom ops from source**

In the root directory, run:
```bash
mkdir build
cd build
cmake ..
make -j$(nproc)
```
Example of using ops:
```python
from ops import RoIAlign
pooler = RoIAlign(output_size=(14, 14), spatial_scale=2.0, sampling_ratio=2)
```
## Running experiments in the OCCL paper
```shell
cd Vision/classification/image/resnet50/examples
bash train_ofccl_graph_distributed_fp32.sh <NUM_LOCAL_GPUS>
```

Notes:
- Prepare the ImageNet dataset in advance.
- If the environment variable `ONEFLOW_ENABLE_OFCCL` in [train_ofccl_graph_distributed_fp32.sh](Vision/classification/image/resnet50/examples/train_ofccl_graph_distributed_fp32.sh#L16) is set to `1`, OCCL is used during training; otherwise, NCCL is used.
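
The backend toggle described in the notes can be read as the following rule. This is only a sketch: `backend_for` is a hypothetical helper for illustration and is not part of the repository or of OneFlow.

```shell
# Hypothetical helper (not in the repository): report which collective
# backend the note above implies for a given ONEFLOW_ENABLE_OFCCL value.
backend_for() {
  if [ "${1:-}" = "1" ]; then
    echo "OCCL"
  else
    # Any other value, including unset or empty, falls back to NCCL.
    echo "NCCL"
  fi
}

# Report the backend implied by the current environment.
backend_for "${ONEFLOW_ENABLE_OFCCL:-}"
```

Only the exact value `1` selects OCCL under this reading; values such as `true` or `yes` would still fall through to NCCL.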
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
# set -aux
clear

# Default to 2 GPUs per node unless the caller sets DEVICE_NUM_PER_NODE.
if [ -z "$DEVICE_NUM_PER_NODE" ]; then
    DEVICE_NUM_PER_NODE=2
fi
MASTER_ADDR=127.0.0.1
NUM_NODES=1
NODE_RANK=0

export GLOG_vmodule=nn_graph*=1,plan_util*=1,of_collective_actor*=1,of_collective_boxing_kernels*=1
# export GLOG_v=1
export GLOG_logtostderr=1

echo ONEFLOW_OFCCL_SKIP_NEGO=$ONEFLOW_OFCCL_SKIP_NEGO
echo ONEFLOW_OFCCL_CHAIN=$ONEFLOW_OFCCL_CHAIN
echo GLOG_vmodule=$GLOG_vmodule
echo GLOG_v=$GLOG_v
echo GLOG_logtostderr=$GLOG_logtostderr

echo DEVICE_NUM_PER_NODE=$DEVICE_NUM_PER_NODE

export PYTHONUNBUFFERED=1
echo PYTHONUNBUFFERED=$PYTHONUNBUFFERED
export NCCL_LAUNCH_MODE=PARALLEL
echo NCCL_LAUNCH_MODE=$NCCL_LAUNCH_MODE
# export NCCL_DEBUG=INFO
export ONEFLOW_DEBUG_MODE=1
export ONEFLOW_PROFILER_KERNEL_PROFILE_KERNEL_FORWARD_RANGE=1
export ONEFLOW_ENABLE_OFCCL=1

CHECKPOINT_SAVE_PATH="./graph_distributed_fp32_checkpoints"
if [ ! -d "$CHECKPOINT_SAVE_PATH" ]; then
    mkdir "$CHECKPOINT_SAVE_PATH"
fi

OFRECORD_PATH=/home/panlichen/dataset/ImageNet/ofrecord

OFRECORD_PART_NUM=256
LEARNING_RATE=0.768
MOM=0.875
EPOCH=50
# TRAIN_BATCH_SIZE=96
# VAL_BATCH_SIZE=50
TRAIN_BATCH_SIZE=20
VAL_BATCH_SIZE=20

# SRC_DIR=/path/to/models/resnet50
SRC_DIR=$(realpath "$(dirname "$0")/..")

# Name the nsys report after the collective backend in use.
if [ "$ONEFLOW_ENABLE_OFCCL" == "1" ]; then
    NSYS_FILE="ofccl_resnet"
else
    NSYS_FILE="nccl_resnet"
fi

rm -rf ./log
mkdir ./log

# RUN_TYPE selects the launcher: PURE (default), GDB, or NSYS.
if [ -z "$RUN_TYPE" ]; then
    RUN_TYPE="PURE"
fi

if [ "$RUN_TYPE" == "PURE" ];then
cmd="python3 -m oneflow.distributed.launch"
elif [ "$RUN_TYPE" == "GDB" ];then
cmd="gdb -ex r --args python3 -m oneflow.distributed.launch"
elif [ "$RUN_TYPE" == "NSYS" ];then
cmd="nsys profile -f true --trace=cuda,cudnn,cublas,osrt,nvtx -o nsys/$NSYS_FILE python3 -m oneflow.distributed.launch"
fi

unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY

$cmd \
--nproc_per_node $DEVICE_NUM_PER_NODE \
--nnodes $NUM_NODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
$SRC_DIR/train.py \
--save $CHECKPOINT_SAVE_PATH \
--ofrecord-path $OFRECORD_PATH \
--ofrecord-part-num $OFRECORD_PART_NUM \
--num-devices-per-node $DEVICE_NUM_PER_NODE \
--lr $LEARNING_RATE \
--momentum $MOM \
--num-epochs $EPOCH \
--train-batch-size $TRAIN_BATCH_SIZE \
--val-batch-size $VAL_BATCH_SIZE \
--use-gpu-decode \
--scale-grad \
--graph \
--fuse-bn-relu \
--fuse-bn-add-relu
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
# set -aux
clear

# Default to 2 GPUs per node unless the caller sets DEVICE_NUM_PER_NODE.
if [ -z "$DEVICE_NUM_PER_NODE" ]; then
    DEVICE_NUM_PER_NODE=2
fi
MASTER_ADDR=127.0.0.1
NUM_NODES=1
NODE_RANK=0

export GLOG_vmodule=nn_graph*=1,plan_util*=1,of_collective_actor*=1,of_collective_boxing_kernels*=1
# export GLOG_v=1
export GLOG_logtostderr=1

echo ONEFLOW_OFCCL_SKIP_NEGO=$ONEFLOW_OFCCL_SKIP_NEGO
echo ONEFLOW_OFCCL_CHAIN=$ONEFLOW_OFCCL_CHAIN
echo GLOG_vmodule=$GLOG_vmodule
echo GLOG_v=$GLOG_v
echo GLOG_logtostderr=$GLOG_logtostderr

echo DEVICE_NUM_PER_NODE=$DEVICE_NUM_PER_NODE

export PYTHONUNBUFFERED=1
echo PYTHONUNBUFFERED=$PYTHONUNBUFFERED
export NCCL_LAUNCH_MODE=PARALLEL
echo NCCL_LAUNCH_MODE=$NCCL_LAUNCH_MODE
# export NCCL_DEBUG=INFO
export ONEFLOW_DEBUG_MODE=1
export ONEFLOW_PROFILER_KERNEL_PROFILE_KERNEL_FORWARD_RANGE=1
export ONEFLOW_ENABLE_OFCCL=1

CHECKPOINT_SAVE_PATH="./graph_distributed_fp32_checkpoints"
if [ ! -d "$CHECKPOINT_SAVE_PATH" ]; then
    mkdir "$CHECKPOINT_SAVE_PATH"
fi

OFRECORD_PATH=/dataset/ImageNet/ofrecord

OFRECORD_PART_NUM=256
LEARNING_RATE=0.768
MOM=0.875
EPOCH=50
# TRAIN_BATCH_SIZE=96
# VAL_BATCH_SIZE=50
TRAIN_BATCH_SIZE=20
VAL_BATCH_SIZE=20

# SRC_DIR=/path/to/models/resnet50
SRC_DIR=$(realpath "$(dirname "$0")/..")

# Name the nsys report after the collective backend in use.
if [ "$ONEFLOW_ENABLE_OFCCL" == "1" ]; then
    NSYS_FILE="ofccl_resnet"
else
    NSYS_FILE="nccl_resnet"
fi

rm -rf ./log
mkdir ./log

# RUN_TYPE selects the launcher: PURE (default), GDB, or NSYS.
if [ -z "$RUN_TYPE" ]; then
    RUN_TYPE="PURE"
fi

if [ "$RUN_TYPE" == "PURE" ];then
cmd="python3 -m oneflow.distributed.launch"
elif [ "$RUN_TYPE" == "GDB" ];then
cmd="gdb -ex r --args python3 -m oneflow.distributed.launch"
elif [ "$RUN_TYPE" == "NSYS" ];then
cmd="nsys profile -f true --trace=cuda,cudnn,cublas,osrt,nvtx -o nsys/$NSYS_FILE python3 -m oneflow.distributed.launch"
fi

unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY

$cmd \
--nproc_per_node $DEVICE_NUM_PER_NODE \
--nnodes $NUM_NODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
$SRC_DIR/train.py \
--save $CHECKPOINT_SAVE_PATH \
--ofrecord-path $OFRECORD_PATH \
--ofrecord-part-num $OFRECORD_PART_NUM \
--num-devices-per-node $DEVICE_NUM_PER_NODE \
--lr $LEARNING_RATE \
--momentum $MOM \
--num-epochs $EPOCH \
--train-batch-size $TRAIN_BATCH_SIZE \
--val-batch-size $VAL_BATCH_SIZE \
--use-gpu-decode \
--scale-grad \
--graph \
--fuse-bn-relu \
--fuse-bn-add-relu
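
The `RUN_TYPE` dispatch used by both scripts in this diff could also be written as a single function. The sketch below is an assumption-laden rewrite, not the author's code: `select_launcher` is a hypothetical name, and the error branch for unknown values is an addition (the original scripts leave `cmd` empty in that case).

```shell
# Hypothetical refactoring of the RUN_TYPE if/elif chain from the scripts.
# $1: run type (PURE, GDB, or NSYS; defaults to PURE)
# $2: nsys output name (only used for NSYS; defaults to nccl_resnet)
select_launcher() {
  run_type="${1:-PURE}"
  nsys_file="${2:-nccl_resnet}"
  launch="python3 -m oneflow.distributed.launch"
  case "$run_type" in
    PURE) echo "$launch" ;;
    GDB)  echo "gdb -ex r --args $launch" ;;
    NSYS) echo "nsys profile -f true --trace=cuda,cudnn,cublas,osrt,nvtx -o nsys/$nsys_file $launch" ;;
    *)
      # Addition: fail loudly instead of running an empty command.
      echo "unknown RUN_TYPE: $run_type" >&2
      return 1
      ;;
  esac
}

# Print the launcher prefix for an nsys-profiled OCCL run.
select_launcher NSYS ofccl_resnet
```

A script could then use `cmd=$(select_launcher "$RUN_TYPE" "$NSYS_FILE") || exit 1` before the `$cmd \` invocation, so a mistyped `RUN_TYPE` stops the run early.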