This isn't actually much of a problem, and IMHO the majority of the work can be avoided with a few tweaks to the proposition.
First off, the easiest approach to working out basic threshold values has already been done by logs. You just work out a series of averages from your big table of results and go from there. There's really no need to have an internal team do this (beyond maybe throwing the QA team's scores into the pot so the players can get scored from day one); once the code's in place, it'll do the maths itself. As the game evolves, so will the results.
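To show what I mean by the code doing the maths itself, here's a rough Python sketch. The rank letters, the cutoff multipliers and the sample scores are all made up by me for illustration; nothing here is how SE or any log site actually does it.

```python
from statistics import mean

# Hypothetical rank cutoffs, expressed as multiples of the logged average.
# These numbers are assumptions for illustration only.
RANK_CUTOFFS = [("S", 1.25), ("A", 1.00), ("B", 0.75), ("C", 0.50)]


def rank_thresholds(logged_scores):
    """Derive the minimum score for each rank from the average of all logged results."""
    avg = mean(logged_scores)
    return {rank: avg * factor for rank, factor in RANK_CUTOFFS}


def grade(score, thresholds):
    """Return the highest rank whose threshold the score meets, or 'D' if none."""
    for rank, _factor in RANK_CUTOFFS:
        if score >= thresholds[rank]:
            return rank
    return "D"


# Made-up sample results, e.g. seeded with the QA team's scores on day one.
logged = [800, 1000, 1200]
thresholds = rank_thresholds(logged)
player_rank = grade(1100, thresholds)
```

Every time new results get added to the table you just call rank_thresholds again, so the cutoffs drift along with how people are actually performing.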
Secondly, avoiding issues with party comp disparities is doable, but it's not perfect. Going by potency per second takes much of the RNG away and also levels the playing field somewhat by ignoring party buffs and such. That still leaves jobs that might compromise their own rotation to maintain a buff for others, as well as jobs where some of their potency rests on RNG (e.g. Bloodletter procs).
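For anyone unsure what going by potency per second would look like, here's a minimal Python sketch. The action names and potency values are placeholders I invented, not real tooltip numbers, and the function is only my assumption of how you'd tally it: add up the base potency of everything the player used and divide by the fight length, without ever looking at crits or party buffs.

```python
def potency_per_second(actions, duration_seconds):
    """Sum the base potency of every action used and divide by the fight length.

    `actions` is a list of (name, base_potency) pairs. Crits, direct hits and
    party buffs never enter the calculation, which is the whole point.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    total_potency = sum(potency for _name, potency in actions)
    return total_potency / duration_seconds


# Placeholder actions with made-up potencies.
used = [("Skill A", 150), ("Skill B", 200), ("Skill A", 150)]
pps = potency_per_second(used, 10)
```

Note that a proc-based action still only shows up in the list when the proc actually happens, so this doesn't remove that bit of RNG.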
As long as the ranks were reasonably lenient, I don't think these details would be impactful enough to actually prevent a decent player from scoring as they should. And let's face it, SE aren't going to implement something like this in casual content whilst expecting people to perform at a 99th percentile level.
Yeah agreed, I've said it so many times, but at the risk of sounding like a stuck record, Amdapor Keep was a fantastic dungeon because it did such a good job of preparing people for the end game.