They cannot test their patches in a way that tells them exactly how the damage comparison will turn out. It is impossible for them to get accurate data on how a job's overall damage will look across the entire community. At most they can run a very small sample-size test, and they have most likely done so.

The metrics only reveal themselves once the content is in the hands of the masses, where a proper sample size can be taken. This is why they (usually) make very small, careful, iterative changes.