Accelerate computer vision training using GPU preprocessing with NVIDIA DALI on Amazon SageMaker

CV data preprocessing generally consists of two CPU-intensive steps: JPEG decoding and use-case-specific augmentations. GPUs can perform some of these operations faster than CPUs, at tera-floating point operations per second (TFLOPS). However, for the GPU to perform such operations, the data needs to be available in GPU memory. GPU utilization varies with model complexity, with larger models requiring more GPU resources. The challenge lies in optimizing the steps of the training pipeline so that the GPU doesn't starve for the data it needs to perform its computations, thereby maximizing total resource utilization. The ratio between the required data preprocessing load on the CPU and the amount of model computation on the GPU often depends on the use case. As described in another post, a CPU bottleneck typically occurs when this ratio exceeds the ratio between the total CPU and GPU compute capability.
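That heuristic can be sketched in a few lines of Python (the function name and units here are our own illustration, not from the post):

```python
def is_cpu_bound(preprocess_load, model_compute, cpu_capacity, gpu_capacity):
    """Heuristic: a CPU bottleneck is likely when the ratio of data
    preprocessing load (on the CPU) to model computation (on the GPU)
    exceeds the ratio of total CPU to GPU compute capability.

    All arguments should use the same unit of work (for example, TFLOPS).
    """
    return (preprocess_load / model_compute) > (cpu_capacity / gpu_capacity)


# A heavy augmentation load on a GPU-rich instance is likely CPU-bound
print(is_cpu_bound(preprocess_load=2.0, model_compute=10.0,
                   cpu_capacity=1.0, gpu_capacity=100.0))  # → True
```

The comparison makes the trade-off explicit: adding GPUs (raising `gpu_capacity`) without reducing preprocessing load pushes the pipeline toward a CPU bottleneck.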
Experiment introduction
In this post, we show an example with a PyTorch CV model to visualize resource bottlenecks with Debugger. We compare the performance gains from offloading data preprocessing to the GPU using NVIDIA DALI across three model architectures with increasing augmentation load factors.
The following is the overall architecture for the training setup using SageMaker that we use to benchmark the various training job trials.

AWS customers are increasingly training and fine-tuning large computer vision (CV) models with hundreds of terabytes of data and millions of parameters. Advanced driver assistance systems (ADAS) train perception models to detect pedestrians, road signs, vehicles, traffic lights, and other objects. Identity verification systems in the financial services industry train CV models to verify the authenticity of the person claiming the service with live camera images and official ID documents.
With growing data sizes and increasing model complexity, there is a need to address performance bottlenecks within training jobs to reduce costs and turnaround times. Training bottlenecks include storage and network throughput to move data, as well as updating model checkpoints, gradients, and parameters. Another common bottleneck with deep learning models is under-utilization of expensive GPU resources due to CPU-bound preprocessing.
Customers want to identify these bottlenecks with debugging tools and improve preprocessing with optimized libraries and other best practices.
Amazon SageMaker enables machine learning (ML) practitioners to build, train, and deploy high-quality models with a broad set of purpose-built ML capabilities. These fully managed services take care of the undifferentiated infrastructure heavy lifting involved in ML projects. SageMaker provides open-source training containers for deep learning frameworks like PyTorch (toolkit, SDK), TensorFlow (toolkit, SDK), and Apache MXNet (toolkit, SDK).
In this post, we focus specifically on identifying CPU bottlenecks in a SageMaker training job with SageMaker Debugger and moving data preprocessing operations to the GPU with the NVIDIA Data Loading Library (DALI). The repository uses SageMaker training with two implementations of data preprocessing: NVIDIA DALI on GPU, and the PyTorch dataloader on CPU, as shown in the following diagram.

We perform the following trials:

Augmentation Load Factor | Trial A (Seconds/Epoch) | Trial B (Seconds/Epoch) | Training Time Improvement (%)

The colors purple and yellow in the heat map indicate utilization near 100% and 0%, respectively. Trial A shows the CPU bottleneck, with all CPU cores at maximum utilization while the GPU is under-utilized with frequently stalled cycles. This bottleneck is resolved in Trial B, with less than 100% CPU utilization and higher GPU utilization during the data preprocessing stage.
You can access this report either from Studio or from the Amazon Simple Storage Service (Amazon S3) bucket that holds the training outputs. It shows the compute usage statistics of the CPU and GPU for the minimum, maximum, p99, p50, and p90 percentiles for the training jobs.
The following screenshot shows system statistics for Trial A.


The goal is to balance the load between the CPUs and GPUs by moving the compute-intensive operations of JPEG decoding and augmentation to the GPU. Because we're increasing the size of the computation graph that runs on the GPUs, we need to make sure there is sufficient unused GPU memory for the data preprocessing operations. You can achieve this with a smaller training batch size.
We run SageMaker training jobs with the PyTorch Estimator (framework version 1.8.1) for two epochs with a batch size of 32 images (298 steps per epoch). We use three CV models of increasing complexity: ResNet18, ResNet50, and ResNet152 from the PyTorch pretrained model repository, Torchvision.
Depending on the use case, training CV models often requires heavy data augmentation operations, such as dilation and Gaussian blur. We replicate a real-world scenario by introducing a heavy data augmentation load. We use a load factor to repeat the operations of horizontal flip, vertical flip, and random rotation N times before resizing, normalizing, and cropping the image.
Compare and identify CPU bottlenecks with Debugger
We measure training time and system metrics with Debugger to identify possible root causes and bottlenecks. Debugger helps capture and visualize real-time model training metrics and resource utilization data. These can be captured programmatically using the SageMaker Python SDK or visually through SageMaker Studio.
For the trials conducted with the ResNet18 model and an augmentation load of 12x, we used the smdebug library in util_debugger.py to produce the following heat map of CPU and GPU utilization for the two training jobs.
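Debugger's system monitoring, which produces these utilization metrics, is enabled by passing a profiler configuration to the estimator. The following is a sketch only; the script name, role placeholder, and instance type are illustrative, not taken from the repository.

```python
from sagemaker.debugger import FrameworkProfile, ProfilerConfig
from sagemaker.pytorch import PyTorch

# Sample system metrics (CPU, GPU, memory, I/O) every 500 ms
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(),
)

estimator = PyTorch(
    entry_point="train.py",              # illustrative script name
    role="<your-sagemaker-role-arn>",
    framework_version="1.8.1",
    py_version="py3",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    profiler_config=profiler_config,
    hyperparameters={"epochs": 2, "batch-size": 32},
)
# estimator.fit({"train": "s3://<bucket>/train/"})
```

With this configuration in place, the utilization heat maps and percentile statistics discussed in this section become available in Studio and in the training job's S3 output location.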

We can observe that the p95 CPU utilization dropped from 100% in Trial A to 64.87% in Trial B, thereby addressing the data preprocessing bottleneck. The p95 GPU utilization increased from 31% to 55%, with further headroom to process more load.
For more details about the various options for using Debugger to gain insights and recommendations, see Identify bottlenecks, improve resource utilization, and reduce ML training costs with the deep profiling feature in Amazon SageMaker Debugger.
Experiment results
The following table shows the training time in seconds per epoch for the different augmentation loads in both trials, using a batch size of 32 when training the three ResNet models for two epochs. The visualization charts following the table summarize the insights from these results.

The following screenshot shows system statistics for Trial B.

If the number of workers is too low, it might increase the training time and affect model convergence. Training framework binaries also provide low-level configurations to take full advantage of CPUs, such as using advanced vector extensions, where applicable.

Identify the right instance type – Depending on the data preprocessing load and model complexity, choosing instances with an optimal ratio of GPUs to CPUs can balance the load and reduce bottlenecks. This in turn speeds up training. SageMaker provides a range of instances with different CPU/GPU ratios, memory, storage types, and bandwidth to choose from. Learning more about choosing right-sized resources and picking the right GPU can help you select the appropriate instance.

Move augmentation to the data preparation stage – Identifying operations that can be moved to the raw training data generation stage can free up CPU cycles during training. These are typically preprocessing steps that don't depend on a hyperparameter and don't need to be applied randomly. It's important not to increase the size of the training data excessively, because this might increase the network traffic to load the data and cause an I/O bottleneck.
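For example, a deterministic resize can be applied once, offline, before training. The following is a sketch; the flat directory layout, JPEG-only input, and target size are assumptions.

```python
from pathlib import Path

from PIL import Image


def preprocess_offline(src_dir: str, dst_dir: str, size=(256, 256)) -> int:
    """Apply deterministic, non-random steps (such as resize) once, ahead of
    training, so the per-epoch CPU work shrinks. Random augmentations stay
    in the training loop. Returns the number of images written."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src_dir).glob("*.jpg")):
        img = Image.open(path).convert("RGB").resize(size)
        img.save(dst / path.name, quality=90)  # keep files small to limit I/O
        count += 1
    return count
```

The `quality=90` setting reflects the trade-off from the bullet above: keeping the preprocessed files compact limits the extra network traffic that this technique can otherwise introduce.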

The following chart illustrates how Trial B has consistently lower training time than Trial A across all three models with increasing augmentation load factors. This is because of efficient resource utilization, with data preprocessing offloaded to the GPU in Trial B. A more complex model (ResNet152 over ResNet18) has more parameters to train, and therefore more time is spent in the forward and backward passes on the GPU. This results in an increase in total training time compared to a less complex model across both trials.

ResNet18 shows an increase in improvement from 48.88% to 72.59% when training with 1x and 12x augmentation loads, respectively.
ResNet152 shows an improvement of 33.95%, compared to 72.59% for ResNet18 at the same augmentation load of 12x.
The improvement in training time decreases with increasing model complexity, because GPU utilization is higher for the training job itself and the GPU is less available for the data preprocessing job.

Offload data preprocessing to other machines – GPUs are the most expensive resources when training models. One way to address the cost and performance bottleneck is to move some of the CPU-heavy data preprocessing activity to dedicated workers on separate instances with only CPU cores. It's important to consider that such a remote worker machine must have sufficient network bandwidth for data transfer to ensure an overall performance improvement.

Best practices to address data preprocessing CPU bottlenecks
Beyond the approach we discussed in this post, the following are additional best practices for addressing preprocessing CPU bottlenecks:

The following chart illustrates the percentage improvement in training times with Trial B over Trial A across the different models and augmentation loads. The improvement in training time increases with increasing augmentation load and decreases with increasing model complexity. We can observe the following:

Training CV models typically requires complex, multi-stage data processing pipelines that include loading, decoding, cropping, resizing, and other use-case-specific augmentations. These pipelines natively run on CPUs and often become a bottleneck, limiting the performance and scalability of training jobs.
In this post, we demonstrated how you can use Debugger to identify resource bottlenecks and improve performance by moving data preprocessing operations to the GPU with SageMaker and NVIDIA DALI. We demonstrated a training time improvement of 72%, 37%, and 43% for ResNet50, ResNet152, and ResNet18, respectively, for a constant augmentation load.
Several factors help determine whether data preprocessing on GPU will improve performance for your CV training pipeline. These include the computational complexity of the model and the augmentation operations, the existing utilization of GPUs, the training instance type, the data format, and individual JPEG image sizes. The demonstrated performance improvement reduces the total cost, because the cost of a training job depends on the instance type and training time.
Try out the sample code used for this benchmarking experiment and the other best practices described in this post.

Balance the load among available CPUs – Assigning an optimal number of workers in a multi-CPU-core system can help balance the data preprocessing load and avoid having some CPU cores consistently more active than others. When increasing the number of workers, account for the potential bottleneck that can occur with respect to the concurrency limits of the file system.
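A simple heuristic for picking the worker count is sketched below; the reserve and cap values are our assumptions, not recommendations from the post.

```python
import os


def pick_num_workers(reserved_cores: int = 2, max_workers: int = 16) -> int:
    """Leave a couple of cores free for the training loop and other processes,
    and cap the count to stay under file-system concurrency limits."""
    cores = os.cpu_count() or 1
    return max(1, min(max_workers, cores - reserved_cores))


# Typically passed to torch.utils.data.DataLoader(num_workers=pick_num_workers())
print(pick_num_workers())
```

Profiling with Debugger, as shown earlier, is the reliable way to tune these two knobs for a specific instance type and dataset.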

About the Authors
Sayon focuses on helping customers across Europe, the Middle East, and Africa design and deploy ML solutions in production. In his free time, he loves to travel, explore cuisines and cultures, and is passionate about photography.
He has over 12 years of work experience as a data scientist, machine learning practitioner, and software developer. In his spare time, Hasan likes to explore nature and spend time with friends and family.
