Description
As a ppml developer, I want to run distributed PyTorch on k8s with SGX support (using Gramine), in order to provide a secure environment for our customers to run trusted deep learning applications.
- Create bigdl-ppml-gramine-base image
- Create bigdl-ppml-trusted-deep-learning-gramine-base image
- Test with the MNIST dataset, distributed across three machines
- Modify PyTorch and GLOO accordingly to support our trusted deep learning applications.
- Find out why the self-built PyTorch is so slow compared to PyTorch installed with pip
- Problem: the PyTorch build cannot find the correct MKL library due to weird `pip install mkl` behavior
- Fix it in bigdl-ppml-trusted-deep-learning-gramine-base image
- Regression test to confirm the much better performance
- Provide a reference image, namely bigdl-ppml-trusted-deep-learning-gramine-ref image
- Finish initial Dockerfile
- Simple regression tests using the MNIST dataset
- Merged
- Enable TLS support in PyTorch by changing GLOO_DEVICE_TRANSPORT from TCP to TCP_TLS (see the TLS sketch after this list)
- Test whether the required certificates can be passed in using a Kubernetes secret
- Simple regression tests using the MNIST dataset
- A basic README doc that illustrates how to prepare the required keys and certificates
- We decided to pass in the certificates using k8s secrets
- Integrate K8S with our bigdl-ppml-trusted-deep-learning-gramine-ref image as an example
- Local deployment with static yaml files.
- Deployment using a script.
- Fix the performance problem caused by the default CPU sharing policy in Kubernetes (the CPU manager policy must be set to static).
- Why PyTorch on k8s is slow -> caused by CPU topology; CPUs need to be allocated on the same NUMA node.
- Hyper-threading causes problems in Gramine: only one of the two logical cores on each physical core can be used.
- Benchmark
- Compare local vs. 2 machines vs. 4 machines
- Compare native vs. SGX mode
- Compare performance on k8s
- Compare performance with different batch sizes (later)
- Performance tuning using Nano/OpenMP
- Test using OpenMP to improve performance without code changes in native mode (result: noticeable performance gain)
- Test using jemalloc to improve performance without code changes in native mode (result: noticeable performance gain)
- Test using IPEX -> no obvious performance improvement in native mode.
- Test using OpenMP together with jemalloc to improve performance in SGX mode
- Test using gradient accumulation to reduce Gramine overhead (see the gradient-accumulation sketch after this list).
- Decide whether we can use data partitioning for our DDP model: instead of downloading the entire dataset, each node should hold only part of it (also touched on in the sketch after this list).
- End-to-end security
- Encrypt the input data; decrypt it and run the training inside the SGX environment (see the encryption sketch after this list)
- Save the model parameters after encrypting them
- Remote attestation integration test
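A minimal sketch of how the TCP_TLS transport can be enabled from the training script is shown below. The key, certificate, and CA file paths are placeholders (in our k8s setup they would be files mounted from a Kubernetes secret), and the `GLOO_DEVICE_TRANSPORT_TCP_TLS_*` variable names should be double-checked against the PyTorch version baked into the image.

```python
import os
import torch.distributed as dist

# Placeholder paths; in the k8s deployment these would be mounted from a secret.
TLS_DIR = "/ppml/keys"

os.environ["GLOO_DEVICE_TRANSPORT"] = "TCP_TLS"
os.environ["GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY"] = f"{TLS_DIR}/server.key"
os.environ["GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT"] = f"{TLS_DIR}/server.crt"
os.environ["GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE"] = f"{TLS_DIR}/ca.crt"

# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are expected to be set by the launcher.
dist.init_process_group(backend="gloo")
```

With this in place, the Gloo collectives used by DDP go over TLS instead of plain TCP.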
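The gradient-accumulation and data-partitioning items can be illustrated with stock DDP APIs (`DistributedSampler` and `no_sync`); this is only a sketch, not the actual training script. Note that `DistributedSampler` only partitions indices, so by itself it does not avoid downloading the full dataset to every node.

```python
import contextlib
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

ACCUM_STEPS = 4  # one all_reduce every four batches, as in test case 5

def train(model, dataset, optimizer, loss_fn, epochs, batch_size=64):
    # Each rank iterates only over its own shard of the dataset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    ddp_model = DDP(model)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)
        optimizer.zero_grad()
        for step, (x, y) in enumerate(loader):
            sync = (step + 1) % ACCUM_STEPS == 0
            # no_sync() skips the gradient all_reduce for all but the last
            # micro-batch of each accumulation window, reducing the network
            # (and Gramine) overhead per sample.
            ctx = contextlib.nullcontext() if sync else ddp_model.no_sync()
            with ctx:
                loss = loss_fn(ddp_model(x), y) / ACCUM_STEPS
                loss.backward()
            if sync:
                optimizer.step()
                optimizer.zero_grad()
```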
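For the end-to-end security items, a minimal sketch of the decrypt-then-train and encrypt-before-save steps is shown below, assuming a symmetric Fernet key from the `cryptography` package. In the real pipeline the key would be provisioned through the PPML KMS after remote attestation rather than handled directly like this.

```python
import io
import torch
from cryptography.fernet import Fernet  # assumption: cryptography is installed in the image

def load_encrypted(path, key):
    # Decrypt the input file inside the enclave, then deserialize it in memory.
    with open(path, "rb") as f:
        plaintext = Fernet(key).decrypt(f.read())
    return torch.load(io.BytesIO(plaintext))

def save_encrypted_state_dict(model, path, key):
    # Serialize the parameters in memory, encrypt them, and write only ciphertext to disk.
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    with open(path, "wb") as f:
        f.write(Fernet(key).encrypt(buf.getvalue()))
```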
PyTorch modifications
| PyTorch modifications | Fix | Additional Info |
|---|---|---|
| hipGetDeviceCount() raises an error when no GPU is available | Apply patch to PyTorch to suppress the error | pytorch/pytorch#80405 This problem has been fixed in PyTorch 1.13.0. |
Gramine modifications
| Gramine modifications | Fix | Additional Info |
|---|---|---|
| An error is returned when the MSG_MORE flag is used | Apply patch to Gramine to ignore this flag | analytics-zoo/gramine#6 |
| ioctl is not handled correctly | Apply patch | analytics-zoo/gramine#7 |
| Gramine cannot handle signals correctly | Needs a fix later | gramineproject/gramine#1034 |
GLOO modifications
| GLOO modifications | Fix | Additional Info |
|---|---|---|
| getifaddrs uses the NETLINK socket domain, which is unsupported in Gramine | Apply patch to GLOO to read the network interface to use from an environment variable | analytics-zoo/gloo#1 |
Other modifications
| Other modifications | Fix | Additional Info |
|---|---|---|
| Enable runtime domain configuration in Gramine | Add sys.enable_extra_runtime_domain_names_conf = true in the manifest | None |
| The datasets package provided by Hugging Face uses a flock file lock | Apply patch to use an fcntl file lock | Gramine only supports fcntl file locks |
| PyArrow error caused by the AWS SDK version | Downgrade PyArrow to 6.0.1 | |
| Insufficient memory during training | Increase SGX_MEM_SIZE | This is closely related to the batch size: a larger batch size generally requires a larger SGX_MEM_SIZE. Once EDMM is available, this can be ignored |
| The default k8s CPU manager policy (none) shares CPUs among pods | Change the CPU manager policy to static according to here | A related issue |
| Change the topology manager policy | Change the topology manager policy to best-effort or single-numa-node according to here | Ref |
| Disable hyper-threading on the servers | Disable hyper-threading on the server and configure kubelet accordingly | The use of hyper-threading may cause security problems such as side-channel attacks; therefore, it is not supported by Gramine |
Hyper-threading
It seems that hyper-threading also has an impact on the performance of distributed training.
In native mode, Kubernetes will try to allocate logical cores on the same physical cores. For instance, if the user requests 24 logical cores and each physical core has two hardware threads, the 24 logical cores will be placed on 12 physical cores by default.
This behavior is described here
In SGX mode with the Gramine LibOS, hyper-threading appears to have no effect. We tried to allocate 12 logical cores on 6 physical cores, but in the end only 6 logical cores were functional. A comparison can be seen in the following two figures:
Enable TCP_TLS during computation
Check here
Optimization
Intel-openmp
Intel OpenMP (intel-openmp) is currently not supported in SGX mode. The related error and the corresponding Gramine issue:
openat(AT_FDCWD, "/dev/shm/__KMP_REGISTERED_LIB_1_0", O_RDWR|O_CREAT|O_EXCL|0xa0000, 0666) = -2
gramineproject/gramine#827
Gramine-patched openmp
As recommended here, this patched OpenMP can bring better performance inside SGX enclaves.
However, after adding it to LD_PRELOAD, PyTorch training in native mode crashes with a segmentation fault.
Besides, the LD_PRELOAD environment variable has to be set in bash.manifest.template, which means it cannot be changed after the image is built.
Intel Extension for PyTorch (IPEX)
IPEX shows almost no speedup for the PERT training workload, and in SGX mode it causes errors because of fork.
bash: warning: setlocale: LC_ALL: cannot change locale (C.UTF-8)
Illegal instruction (core dumped)
/usr/lib/python3/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.12) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
Traceback (most recent call last):
File "/ppml/examples/pert_ipex.py", line 12, in <module>
import intel_extension_for_pytorch as ipex
File "/usr/local/lib/python3.7/dist-packages/intel_extension_for_pytorch/__init__.py", line 24, in <module>
from . import cpu
File "/usr/local/lib/python3.7/dist-packages/intel_extension_for_pytorch/cpu/__init__.py", line 2, in <module>
from . import runtime
File "/usr/local/lib/python3.7/dist-packages/intel_extension_for_pytorch/cpu/runtime/__init__.py", line 3, in <module>
from .multi_stream import MultiStreamModule, get_default_num_streams, \
File "/usr/local/lib/python3.7/dist-packages/intel_extension_for_pytorch/cpu/runtime/multi_stream.py", line 43, in <module>
class MultiStreamModule(nn.Module):
File "/usr/local/lib/python3.7/dist-packages/intel_extension_for_pytorch/cpu/runtime/multi_stream.py", line 90, in MultiStreamModule
cpu_pool: CPUPool = CPUPool(node_id=0),
File "/usr/local/lib/python3.7/dist-packages/intel_extension_for_pytorch/cpu/runtime/cpupool.py", line 32, in __init__
self.core_ids = get_core_list_of_node_id(node_id)
File "/usr/local/lib/python3.7/dist-packages/intel_extension_for_pytorch/cpu/runtime/runtime_utils.py", line 20, in get_core_list_of_node_id
num_of_nodes = get_num_nodes()
File "/usr/local/lib/python3.7/dist-packages/intel_extension_for_pytorch/cpu/runtime/runtime_utils.py", line 4, in get_num_nodes
return int(subprocess.check_output('lscpu | grep Socket | awk \'{print $2}\'', shell=True))
jemalloc
jemalloc can bring better performance in native mode.
In SGX mode, it causes the training speed to gradually degrade, and eventually training becomes slower than without jemalloc.
After applying jemalloc and intel-openmp in native mode, the execution time for 4 nodes dropped from 5450s to 4125s.
Test cases and test data
Test case 0
No optimizations
Environment: Docker, native mode, TLS enabled
Baseline: none
4-node distributed training, 60k samples
5450.44s 10.5797
Test case 1
Measure the performance gain from the jemalloc and intel-omp environment variable settings
Environment: Docker, native mode, TLS enabled
Baseline: results from test case 0
4-node distributed training, 60k samples
4125.75s 13.9611
Test case 2
Measure the performance gain from jemalloc alone
Environment: Docker, SGX mode, TLS enabled
Baseline: performance data previously measured on k8s
4-node distributed training, 60k samples
7115.37s 8.1436
Performance degradation appeared: training becomes slower and slower. One training epoch on k8s takes roughly 8500s.
Test case 3
Measure the performance gain from jemalloc together with the Gramine-patched OpenMP
Environment: Docker, SGX mode, TLS enabled
Baseline: performance data from test case 2
4-node distributed training, 60k samples
7015.12s 8.24
The same performance degradation as in test case 2 appeared: training becomes slower and slower.
Test case 4
Use OpenMP parameters:
export OMP_SCHEDULE=STATIC
export OMP_PROC_BIND=CLOSE
Environment: Docker, SGX mode, TLS enabled
Baseline: k8s test data
4-node distributed training, 60k samples
8520.7s 6.76
Test case 5
Use gradient accumulation: perform one all_reduce after every four batches
Environment: Docker, SGX mode, TLS enabled
Baseline: k8s test data
4-node distributed training, 60k samples
6749.17s 8.53467
Perf
The performance data can be found here