That will be the problem. While they create a level playing field for all products to be tested, they don't put each product in its target realm.
Their taxi/oil test, although very comprehensive in its scope, chose an environment where the differences between dino and synth are least challenged. A taxi on conventional, under the prescribed criteria, already got (probably) 300-400k, and much more if the fuel component was stretched out into real miles.
No one drives like that outside of a taxi (for the most part). Now if they had done this and a flat-out track sequence, and an Arctic cold-starting sequence, and a Death Valley endurance sequence, then they might cover most of the stuff you might encounter over a broad sampling. Then you could draw sensible conclusions based on how far your own service intruded into those conditions.