I think where it ends up being an especially big problem is with low-luma regions that have subtle gradients. A particularly common example is a scene showing the sky at night. I'm not sure what the crux of the issue is, but I think it may be some combination of the following hypotheses:
1. Psychovisual models are overly pessimistic about how much detail viewers can actually see in dark scenes, so bit rate is cut in those scenes too aggressively.
2. Large areas of subtle gradation push the motion predictor toward big 16x16 blocks. This would be bad on two counts: i) the deblocker drops to its weakest setting on boundaries between motion-compensated blocks, and ii) even when it does run, it only reaches a maximum of 3 pixels from the edge, so on 16x16 blocks it would do little more than make the blocks look fuzzy (see the first sketch after this list).
3. Heavier quantization in dark areas smooths away faint, transient noise like film grain, but the edge detector still sees that noise in the source frame, and that causes the deblocker to turn off (or at least weaken) on those blocks. Had the noise survived, it would perceptually mask the boundaries; because it has been lost to other compression steps, the block boundaries stand out instead. This particular hypothesis suggests the problem could be reduced without modifying the spec by adding a simple luma-sensitive denoise preprocessing step (see the second sketch after this list).
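To make the second count in hypothesis 2 concrete, here's a toy 1-D sketch. It is not the real H.264 deblocking filter: flattening each 16-pixel block to its mean, the 5-tap box smoother, and the specific values are all stand-ins I chose for illustration. The point is just that a filter confined to 3 pixels either side of a boundary cannot restore a gradient across a 16-pixel plateau.

```python
import numpy as np

ramp = np.linspace(16.0, 24.0, 64)            # subtle low-luma gradient
banded = np.repeat(ramp.reshape(4, 16).mean(axis=1), 16)
# each 16-px block flattened to its block mean, standing in for the case
# where the AC coefficients quantize away and only the DC value survives

filtered = banded.copy()
for edge in (16, 32, 48):                     # block boundaries
    lo, hi = edge - 3, edge + 3               # 3-px reach on each side of the edge
    filtered[lo:hi] = np.convolve(banded[lo - 2:hi + 2],
                                  np.ones(5) / 5.0, mode="valid")

print(np.abs(filtered - banded)[4:12].max())               # 0.0: plateau interiors untouched
print(banded[16] - banded[15], filtered[16] - filtered[15])  # step shrinks but survives
```

Running it, the plateau interiors come back unchanged and the step at each boundary is only attenuated, which is exactly the "fuzzy blocks" outcome described above.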
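And for hypothesis 3, here's a rough sketch of the kind of luma-sensitive denoise preprocessing I mean, in plain NumPy. Everything in it is an assumption on my part: the function name `luma_sensitive_denoise`, the cutoff of 48 luma levels, the 5-tap binomial blur, and the linear blend are placeholders for illustration, not anything taken from an actual encoder.

```python
import numpy as np

def luma_sensitive_denoise(luma: np.ndarray,
                           dark_cutoff: float = 48.0,
                           strength: float = 1.0) -> np.ndarray:
    """Blend each pixel toward a blurred copy, more strongly where the frame is dark.

    `luma` is an 8-bit luma plane as a float array; `dark_cutoff` and `strength`
    are illustrative knobs, not values from any real encoder.
    """
    # Separable 5-tap binomial blur as a stand-in for a real denoiser.
    kernel = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    blurred = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="same"), 1, luma)
    blurred = np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="same"), 0, blurred)

    # Per-pixel blend weight: 1.0 at black, falling to 0.0 at dark_cutoff.
    weight = np.clip(1.0 - luma / dark_cutoff, 0.0, 1.0) * strength
    return luma + weight * (blurred - luma)

# Usage: a synthetic 360x640 dark gradient with faux grain, denoised before encoding.
frame = np.tile(np.linspace(16, 40, 640), (360, 1))
frame += np.random.default_rng(0).normal(0, 2.0, frame.shape)   # faux film grain
clean = luma_sensitive_denoise(frame)
```

The blend weight falls to zero above the cutoff, so brighter regions pass through untouched; the idea is simply that if the grain never reaches the encoder, the deblocker's edge detector has nothing in the source frame to trip over.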