Gradient Dosage

2022 Sep 11

I think an underappreciated way of understanding why neural networks do what they do is the concept of gradient dosage. It's such a simple idea that it seems almost dumb, but at the same time, it's a very intuitive way of explaining lots of things about neural networks that seem otherwise mysterious.
The basic idea is that a neural net contains various groups of neurons capable of recognizing particular patterns. We'll call these feature recognizers. To work well, a feature recognizer needs to maintain a certain set of weights. Thanks to regularization techniques like weight decay, and to general noise in the gradient, these weights tend to drift back toward randomness unless backpropagated gradients consistently reinforce them. The neural network is thus a little like an ecosystem, with the feature recognizers competing for sustenance in the form of reinforcing gradients. Any feature recognizer that doesn't collect a big enough gradient dosage dies.
Of course, we shouldn't take this metaphor too far; it would be false to say that the feature recognizers arise through a process of evolution. Rather, the gradients can shape the feature recognizers themselves. By updating the weights of a weak feature recognizer, gradient descent can produce a strong one, without any kind of reproduction or competition being involved.
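To make the sustenance picture concrete, here's a minimal sketch (all the constants are invented for illustration) of a single weight under SGD with weight decay: absent a reinforcing gradient, decay shrinks it geometrically toward zero, while a small but consistent gradient g holds it at an equilibrium of g/wd.

```python
lr, wd, steps = 0.1, 0.01, 2000  # learning rate, weight decay, SGD steps

# A weight that collects no gradient dose: weight decay alone
# shrinks it geometrically, w <- (1 - lr * wd) * w.
w_starved = 1.0

# A weight fed a small, consistent reinforcing gradient g each step.
w_fed, g = 1.0, 0.02

for _ in range(steps):
    w_starved *= 1 - lr * wd
    w_fed += lr * (g - wd * w_fed)

print(f"starved: {w_starved:.3f}")  # ~0.135, on its way to zero
print(f"fed:     {w_fed:.3f}")      # ~1.865, approaching g/wd = 2.0
```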
Indeed, rather than being a force of creative destruction, it seems that competition can sometimes lead to reduced performance and increased path-dependence when training neural networks. This is the point made by this recent paper, which dubs the effect "Gradient Starvation". Consider a case where three different feature recognizers each predict the class of an image with 90% accuracy, and the three features are independent. If the network starts out by mainly listening to the first feature recognizer, the first will get a very high gradient dose, since it is being relied on to achieve 90% prediction accuracy. The other two feature recognizers could be used to improve accuracy on the remaining 10% of images, but that's a gradient dose roughly 9 times smaller! Since all three feature recognizers are equally good, the network would be best off listening to all of them and taking the majority opinion as its prediction, but one feature starves the others of gradient dosage, leading to suboptimal results. You can read the full paper for a much better explanation, as well as a regularization technique the authors propose to mitigate this phenomenon.
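Here's the back-of-envelope version of that dose accounting, under the simplifying assumption that an example's gradient contribution is proportional to its residual error, so examples the network already classifies confidently contribute roughly nothing:

```python
acc = 0.9  # accuracy of each of the three independent features alone

# While the network learns to rely on feature 1, essentially every
# example is still misclassified, so feature 1 collects gradient from
# the whole dataset (call that a relative dose of 1.0).
dose_feature_1 = 1.0

# Once feature 1 is in place, the 90% of examples it fits contribute
# ~zero gradient; features 2 and 3 can only collect their dose from
# the remaining 10% of examples.
dose_features_2_3 = 1.0 - acc

print(dose_feature_1 / dose_features_2_3)  # 9.0 -- "roughly 9 times less"

# Yet the majority vote of three independent 90% features is right
# whenever at least two of them agree with the label:
majority = acc**3 + 3 * acc**2 * (1 - acc)
print(f"{majority:.3f}")  # 0.972 -- well above any single feature's 0.9
```

So a network that listened to all three features would gain over seven points of accuracy, but the dose imbalance makes it hard for gradient descent to get there.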
Gradient dosage also provides an interesting explanation for the fact that neural networks tend to learn functions that generalize well from the training data to the test data. To the best of my knowledge, this explanation was first proposed by Quintin Pope. A general feature recognizer is one that applies to a lot of training examples, while a special-purpose feature recognizer might apply to just a single example (as when the network is simply memorizing the training data). From this, we can see that the general feature recognizer is going to get a high gradient dose, since it will be reinforced on almost every example. The special-purpose feature recognizer, on the other hand, is only reinforced on a handful of training examples, so its gradient dose will be orders of magnitude smaller. This is a pleasingly mechanistic explanation. Once again, you can read the full article for more details.
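A toy simulation of this (again, all constants are invented) pits a general recognizer, reinforced on most examples, against a special-purpose one that fires only on a single memorized example out of a thousand. With weight decay in play, each weight settles near g * fire_rate / wd, so the memorizer's weight collapses toward zero:

```python
import random

lr, wd, g = 0.1, 0.01, 0.05  # learning rate, weight decay, gradient when firing
n_examples, epochs = 1000, 50

# Probability that each recognizer fires (and so gets reinforced)
# on a randomly drawn training example.
fire_prob = {"general": 0.8, "special-purpose": 1 / n_examples}

random.seed(0)
for name, p in fire_prob.items():
    w = 1.0  # start both from an already-formed recognizer
    for _ in range(epochs * n_examples):
        reinforce = g if random.random() < p else 0.0
        w += lr * (reinforce - wd * w)  # SGD step with weight decay
    # Steady state is roughly g * p / wd: about 4.0 for the general
    # recognizer, about 0.005 (i.e. dead) for the special-purpose one.
    print(f"{name}: {w:.3f}")
```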