
It is the movement of data, not arithmetic operations, that dominates the energy costs of deep learning inference. In this work, we focus on reducing these data movement costs by reducing the number of unique weights in a network. The thinking goes that if the number of unique weights were kept small enough, the entire network could be distributed and stored on the processing elements (PEs) within accelerators, substantially reducing the data movement costs of weight reads.

To this end, we investigate the merits of a method we call Weight Fixing Networks (WFN). We design the approach to realise four model-outcome objectives: i) very few unique weights, ii) low-entropy weight encodings, iii) unique weight values amenable to energy-saving forms of hardware multiplication, and iv) lossless task performance.
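
As a concrete, if informal, illustration of objectives (i) and (ii), the sketch below counts the unique weight values across a whole network and the entropy of its weight-value distribution. It assumes a PyTorch model (an untrained torchvision ResNet-18, chosen only as a stand-in) and is not the WFN code itself.

    # Illustrative sketch (not the paper's code): measuring objectives (i) and (ii)
    # for an arbitrary PyTorch model -- the count of unique weight values across the
    # whole network and the entropy of the weight-value distribution.
    import torch
    import torchvision

    model = torchvision.models.resnet18()  # stand-in model, randomly initialised

    # Concatenate every parameter tensor into one flat vector (whole-network view).
    all_weights = torch.cat([p.detach().flatten() for p in model.parameters()])

    # Objective (i): how many distinct weight values does the network use?
    values, counts = torch.unique(all_weights, return_counts=True)
    print(f"unique weights: {values.numel()} of {all_weights.numel()} total")

    # Objective (ii): entropy (in bits) of the weight-value distribution; lower
    # entropy means the weights can be encoded, stored, and moved more cheaply.
    probs = counts.float() / counts.sum()
    entropy_bits = -(probs * probs.log2()).sum()
    print(f"weight-space entropy: {entropy_bits.item():.2f} bits")

A network that re-uses a small pool of values heavily scores low on both counts, which is what WFN optimises for.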

Some of these goals conflict. To best balance these conflicts, we combine a few novel (and some well-trodden) tricks: a novel regularisation term (i, ii), a view of clustering cost as relative distance change (i, ii, iv), and a focus on whole-network re-use of weights (i, iii). The method is applied iteratively, and we achieve state-of-the-art (SOTA) results across the relevant metrics. Our ImageNet experiments demonstrate lossless compression using 50x fewer unique weights than, and half the weight-space entropy of, SOTA quantisation approaches.
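
To make the "clustering cost as relative distance change" idea more tangible, the sketch below snaps a weight to its nearest candidate centre only when the move is small relative to the weight's own magnitude. This is an illustrative reading of the abstract rather than the actual WFN algorithm; the threshold delta, the fix_weights helper, and the power-of-two centre pool are hypothetical placeholders.

    # Minimal sketch of reading clustering cost as *relative* distance change:
    # a weight w is only snapped ("fixed") to a cluster centre c when the move
    # is small relative to w itself, i.e. |w - c| / |w| <= delta. The centres
    # and threshold below are hypothetical placeholders, not the paper's values.
    import torch

    def fix_weights(weights: torch.Tensor,
                    centres: torch.Tensor,
                    delta: float = 0.1):
        """Snap each weight to its nearest centre if the relative change is small.

        Returns the (partially) fixed weights and a mask of which entries moved.
        """
        # Distance from every weight to every candidate centre.
        dists = (weights.unsqueeze(-1) - centres).abs()   # shape [N, K]
        nearest = dists.argmin(dim=-1)                     # shape [N]
        nearest_centre = centres[nearest]                  # shape [N]

        # Relative distance change: cheap to absorb for large weights,
        # expensive for small weights near zero.
        rel_change = (weights - nearest_centre).abs() / weights.abs().clamp_min(1e-12)
        fixable = rel_change <= delta

        fixed = torch.where(fixable, nearest_centre, weights)
        return fixed, fixable

    # Toy usage: power-of-two centres are friendly to cheap hardware multiplies.
    centres = torch.tensor([-0.5, -0.25, -0.125, 0.0, 0.125, 0.25, 0.5])
    w = torch.randn(10) * 0.3
    fixed_w, mask = fix_weights(w, centres, delta=0.1)
    print(f"fixed {int(mask.sum())} of {w.numel()} weights this iteration")

Under such a criterion, large weights absorb a given absolute shift cheaply while small weights resist being moved, which is one way of prioritising which weights to fix on each iteration of an iterative scheme.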