Artificial neural networks work the following way: you have a bunch of “neurons”, each with several inputs and one output. Each input has a weight associated with it; the larger the weight, the more influence that input has on the neuron’s output. These weights need to be represented in our computers somehow, and usually people use IEEE 754 floating point numbers. But these numbers take up a lot of space (32 or 16 bits each).
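To make this concrete, here is a minimal sketch of a single neuron in Python with NumPy: a weighted sum of the inputs plus a bias, passed through an activation function. The values and names are made up for illustration; the point is that each weight is stored as a 32-bit float.

```python
import numpy as np

# A toy "neuron": weighted sum of inputs plus a bias, squashed by an activation.
weights = np.array([0.8, -1.2, 0.05], dtype=np.float32)  # one 32-bit float per input
bias = np.float32(0.1)
inputs = np.array([1.0, 0.5, -2.0], dtype=np.float32)

output = np.tanh(inputs @ weights + bias)  # larger |weight| -> more influence on the output

print(output)          # the neuron's single output value
print(weights.nbytes)  # 3 weights * 4 bytes = 12 bytes at 32-bit precision
```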
So one approach people have invented is to use a more compact representation of these weights (10, 8, down to 2 bits). This process is called quantisation. Having a smaller representation makes running the model faster, because models are currently limited by memory bandwidth (how long it takes to read the weights from memory), so going from 32 bits to 2 bits potentially gives a 16x speed-up. The surprising part is that the models still produce decent results even after a lot of the information in the weights has been “thrown away”.
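Here is a minimal sketch of one simple scheme, symmetric 8-bit quantisation, just to show the idea of trading precision for space. The block-wise formats used by real inference runtimes are more elaborate, but the principle is the same: rescale the floats onto a small integer range, store the integers, and multiply back by the scale when computing.

```python
import numpy as np

# Toy symmetric linear quantisation of a weight vector to 8 bits.
weights = np.random.randn(1024).astype(np.float32)   # pretend these are layer weights

scale = np.abs(weights).max() / 127                   # map the largest |weight| to 127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

dequantised = q.astype(np.float32) * scale            # approximation the model computes with

print(weights.nbytes, "->", q.nbytes)                 # 4096 bytes -> 1024 bytes (4x smaller)
print(np.abs(weights - dequantised).max())            # per-weight error stays small
```

Going from 32-bit floats to 8-bit integers already shrinks the weights 4x; pushing to 4 or 2 bits shrinks them further at the cost of a coarser approximation of each weight.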