CAS MSc Thesis Presentation

Hybrid Posit and Fixed Point Hardware for Quantized DNN Inference

Zep Kleijweg

The recently introduced posit number system was designed as a replacement for IEEE 754 floating point, alleviating some of its shortcomings. Because the distribution of posit values resembles the data distributions found in deep neural networks (DNNs), posits are a good alternative to fixed-point numbers in DNNs: they can achieve high inference accuracy at low precision. Numerical accuracy matters most in the first and last network layers, which are therefore often computed with higher-precision fixed-point numbers than the hidden layers. These layers can instead be computed with low-precision posits, reducing memory access energy consumption and the required memory bandwidth, while the hidden layers are still computed with cheaper fixed-point arithmetic.
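To illustrate how posits trade fraction precision for dynamic range, the sketch below decodes an n-bit posit word (sign, run-length-encoded regime, exponent, fraction) into a float. The function name and the defaults (8-bit word, es = 0) are illustrative assumptions, not the configuration used in the thesis.

```python
def decode_posit(bits: int, n: int = 8, es: int = 0) -> float:
    """Decode an n-bit posit with es exponent bits into a float (sketch)."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")  # NaR (Not a Real)
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask  # negative posits: decode the two's complement
    # Regime: a run of identical bits after the sign, ended by the opposite bit.
    first = (bits >> (n - 2)) & 1
    run, i = 0, n - 2
    while i >= 0 and ((bits >> i) & 1) == first:
        run += 1
        i -= 1
    k = run - 1 if first == 1 else -run
    i -= 1  # skip the regime terminator bit (if present)
    # Exponent bits; bits cut off at the word boundary are taken as zero.
    exp = 0
    for _ in range(es):
        exp <<= 1
        if i >= 0:
            exp |= (bits >> i) & 1
            i -= 1
    # Remaining bits form the fraction of the significand 1.f
    frac_bits = i + 1
    frac = bits & ((1 << frac_bits) - 1) if frac_bits > 0 else 0
    value = 1 + frac / (1 << frac_bits) if frac_bits > 0 else 1.0
    value *= 2.0 ** (k * (1 << es) + exp)
    return -value if sign else value
```

Near 1.0 the regime is short and many fraction bits remain, so precision is high; far from 1.0 the regime grows and precision tapers off, matching the bell-shaped weight and activation distributions in DNNs.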
An inference accuracy analysis quantifies the effect of this approach on the VGG16 network for the ImageNet image classification task. Using 8-bit posits for the first and last network layers instead of 16-bit fixed point degrades top-5 accuracy by only 0.24%; the hidden layers are computed with 8-bit fixed point in both cases.
The design of a parameterized systolic array accelerator performing exact accumulation is proposed, which can be used in a scale-out system alongside fixed-point systolic array tiles. To increase hardware utilization, a hybrid posit decoder is designed that enables fixed-point computation on the posit hardware. With this hardware, the entire network can be computed using 8-bit data instead of 16 bits for some layers, reducing both energy consumption and the complexity of the memory hierarchy.
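For contrast with the posit decoder above, the cheaper fixed-point representation used for the hidden layers can be sketched as a simple scale-and-clamp quantizer. The Q-format split chosen here (six fraction bits in an 8-bit word) is an illustrative assumption, not the thesis configuration.

```python
def quantize_fixed(x: float, frac_bits: int = 6, word_bits: int = 8) -> int:
    """Quantize a real value to a signed fixed-point word (sketch)."""
    scale = 1 << frac_bits
    lo = -(1 << (word_bits - 1))        # most negative representable code
    hi = (1 << (word_bits - 1)) - 1     # most positive representable code
    return max(lo, min(hi, round(x * scale)))  # scale, round, saturate

def dequantize_fixed(q: int, frac_bits: int = 6) -> float:
    """Recover the real value represented by a fixed-point code."""
    return q / (1 << frac_bits)
```

Unlike posits, fixed point spends its bits uniformly across a fixed range, which is why larger words are needed to cover the wider dynamic range of the first and last layers.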

Overview of MSc Thesis Presentation