DataLoader
=========================================

Deep learning increasingly relies on datasets that are too large to fit in memory. Previously, working with a large dataset meant loading it into memory all at once; when that is impossible due to memory limits, an efficient data generation scheme is needed. This is not only about handling memory pressure on large datasets, but also about making data loading fast enough through multi-processing or multi-threading. We call this data generation object a 'DataLoader'.

Because DataLoaders are so important, each framework provides its own DataLoader module. Intel® Low Precision Optimization Tool needs to calibrate the inputs/outputs of each layer of the model, but framework-specific DataLoaders have different features and APIs, which makes it hard to use them in a uniform way inside the tool. In addition, the tool treats batch size as a tuning parameter, meaning it can dynamically change the batch size to reach the accuracy target. A third reason is ease of use: a unified DataLoader API makes it possible to configure a dataloader in the yaml file without any code modification. Considering all these advantages, the tool implements its own internal DataLoader.

A DataLoader takes a dataset as its input parameter and loads data from that dataset when needed.

A Dataset is a container that holds all the data a dataloader should use; its items can be fetched by index or consumed through an iterator. You can implement a specific Dataset by inheriting from the Dataset class and implementing either the `__iter__` method or the `__getitem__` method; when implementing `__getitem__`, implementing `__len__` is also recommended.

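The index-based (map-style) variant described above can be sketched as follows. This is a minimal illustration of the interface only: the class is plain Python and does not actually inherit from lpot's Dataset base class, and all names and data here are made up for the example.

```python
class SquaresDataset:
    """Illustrative map-style dataset: holds (input, label) pairs fetched by index."""

    def __init__(self, n):
        # Precompute toy samples; a real dataset would index files or records.
        self.data = [(i, i * i) for i in range(n)]

    def __getitem__(self, index):
        # Fetch a single sample by index.
        return self.data[index]

    def __len__(self):
        # Recommended alongside __getitem__ so a dataloader knows the dataset size.
        return len(self.data)


ds = SquaresDataset(4)
print(len(ds))   # 4
print(ds[2])     # (2, 4)
```

A dataloader built on such a dataset only needs indexing and length to batch and shuffle the samples.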
A Dataset uses Transform as its data processing component. Transforms fall into three categories, each aimed at a different stage of the data processing life cycle:

 1. preprocessing

 2. postprocessing

 3. general

A general Transform can be used in both preprocessing and postprocessing. You can implement a specific transform by inheriting from the Transform class and implementing the `__call__` method. Usually, the DataLoader uses preprocessing transforms, while postprocessing transforms are used to feed correctly processed data to the metric for updating. Transforms can also be composed into a single transform that applies them serially.

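The `__call__`-based transform and the serial composition described above can be sketched in plain Python. The class names here (`ComposeTransform`, `ScaleTransform`, `ShiftTransform`) are illustrative inventions, not lpot APIs, and the classes do not inherit from lpot's Transform base class.

```python
class ComposeTransform:
    """Chains several transforms; each element must be callable (implement __call__)."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, sample):
        # Apply each transform serially, feeding each output into the next.
        for t in self.transforms:
            sample = t(sample)
        return sample


class ScaleTransform:
    def __init__(self, factor):
        self.factor = factor

    def __call__(self, sample):
        return sample * self.factor


class ShiftTransform:
    def __init__(self, offset):
        self.offset = offset

    def __call__(self, sample):
        return sample + self.offset


pipeline = ComposeTransform([ScaleTransform(2), ShiftTransform(1)])
print(pipeline(10))  # 21, i.e. 10 * 2 + 1 applied in order
```

Because composition produces another callable, a composed pipeline can be handed to a dataset anywhere a single transform is expected.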
Preprocessing transforms are launched in the Dataset's `__getitem__` or `__next__` method, which means a transform is applied after the dataloader has loaded batched data and before the data is given to the model for inference. This reduces memory use compared with loading and processing all data at once. Postprocessing transforms are used in LPOT's internal evaluation function to process the inference results before they are consumed by the metric.

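The per-item, fetch-time application of a preprocessing transform can be sketched like this. Again the class name and data are hypothetical; the point is only that the transform runs inside `__getitem__`, so raw data is processed one sample at a time rather than all up front.

```python
class LazyDataset:
    """Applies its transform at fetch time, not when the raw data is loaded."""

    def __init__(self, raw, transform=None):
        self.raw = raw              # raw samples, kept unprocessed in memory
        self.transform = transform  # optional callable applied per item

    def __getitem__(self, index):
        sample = self.raw[index]
        if self.transform is not None:
            # Preprocessing happens here, on a single sample, at access time.
            sample = self.transform(sample)
        return sample

    def __len__(self):
        return len(self.raw)


ds = LazyDataset([1, 2, 3], transform=lambda x: x * 10)
print(ds[1])  # 20 -- only the requested sample was transformed
```

Only samples that are actually fetched pay the transform cost, which is what keeps peak memory low for large datasets.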
# How to use it

## Config dataloader in yaml file
In this case the dataloader is created after the Quantization object is initialized. Since calibration and evaluation may use different transforms and datasets, you can configure separate dataloaders in the yaml file.

```yaml
quantization:                 # optional. tuning constraints on model-wise for advanced users to reduce the tuning space.
  calibration:
    sampling_size: 300        # optional. default value is 100 samples. used to set how many samples in the calibration dataset are used.
    dataloader:
      dataset:
        ImageFolder:
          root: /path/to/calibration/dataset
      transform:
        RandomResizedCrop:
          size: 224
        RandomHorizontalFlip: {}
        ToTensor: {}
        Normalize:
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]

evaluation:                   # optional. required if user doesn't provide eval_func in lpot.Quantization.
  accuracy:                   # optional. required if user doesn't provide eval_func in lpot.Quantization.
    metric:
      topk: 1
    dataloader:
      batch_size: 30
      dataset:
        ImageFolder:
          root: /path/to/evaluation/dataset
      transform:
        Resize:
          size: 256
        CenterCrop:
          size: 224
        ToTensor: {}
        Normalize:
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
  performance:                # optional. used to benchmark performance of the passing model.
    configs:
      cores_per_instance: 4
      num_of_instance: 7
    dataloader:
      batch_size: 1
      dataset:
        ImageFolder:
          root: /path/to/evaluation/dataset
      transform:
        Resize:
          size: 256
        CenterCrop:
          size: 224
        ToTensor: {}
        Normalize:
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
```

## Create a user-specific dataloader

```python
import mxnet as mx
from lpot import Quantization, common

# Any iterable yielding batched (input, label) data can serve as a dataloader;
# here an MXNet ImageRecordIter is used directly. Variables such as `dataset`,
# `batch_size`, `data_shape`, `args`, etc. are defined elsewhere in the user script.
calib_data = mx.io.ImageRecordIter(path_imgrec=dataset,
                                   label_width=1,
                                   preprocess_threads=data_nthreads,
                                   batch_size=batch_size,
                                   data_shape=data_shape,
                                   label_name=label_name,
                                   rand_crop=False,
                                   rand_mirror=False,
                                   shuffle=args.shuffle_dataset,
                                   shuffle_chunk_seed=args.shuffle_chunk_seed,
                                   seed=args.shuffle_seed,
                                   dtype=data_layer_type,
                                   ctx=args.ctx,
                                   **combine_mean_std)

quantizer = Quantization('conf.yaml')
quantizer.model = common.Model(fp32_model)
quantizer.calib_dataloader = calib_data
quantizer.eval_dataloader = calib_data
q_model = quantizer()
```