I am not a game designer or engineer, so I don't have the specifics, but I do know it has something to do with how many different details can be displayed at once. I am probably not going to explain this well, but...
Imagine Limsa has 30 players gathered around the Aetheryte. There are currently 5 races, each with 2 variants, so that's 10 possible racial types. Some have only minor differences like skin color or irises, but others, like the Hyur, have a more notable difference in body. Then there are 4 or so different faces for each race, and 2 genders for each. Then there are other details like makeup, tattoos, and tails. On top of that, each of those players is wearing different gear. Each of those 30 players can look very different from the others in both model and gear, and the more differences there are and/or the more players there are, the bigger the load on the server.
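To put rough numbers on that, here is a quick back-of-the-envelope sketch in Python. The counts are just the ones from my post (and "4 or so" faces is a guess), the names like `options` and `with_extra_faces` are made up for illustration, and it ignores makeup, tattoos, tails, and gear entirely, so the real number of possible looks is much higher. It only counts combinations; it says nothing about how the game actually stores or sends this data.

```python
from math import prod

# Counts taken from the post above; the face count is a rough guess.
options = {
    "race": 5,
    "variant_per_race": 2,
    "face": 4,
    "gender": 2,
}

# Every choice multiplies the number of distinct base models.
base_models = prod(options.values())
print(base_models)  # 80


def with_extra_faces(extra):
    """Hypothetical helper: base model count if each race got `extra` more faces."""
    bumped = dict(options, face=options["face"] + extra)
    return prod(bumped.values())


print(with_extra_faces(2))  # 120
```

So even before gear, adding just 2 faces per race takes the count from 80 to 120 base models, because every new option multiplies with all the others.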
All those differences eat up resources, and the more players you have, especially if they are gathered in one area, the more work the engine has to do to keep up. A game with only one player character, or only a few player characters that are pre-rendered and either don't change or have limited options to change their appearance, uses MUCH less of those resources. This is one reason why single-player games can look so much better than MMOs. So I imagine that more face options means more differences, which means more work for the engine to do and more data eating up bandwidth.
It's a sad truth that despite the advances in technology, we are still hamstrung by the tech of our time. Can you imagine if/when a new type of wireless internet is developed so people across the globe have equal ping no matter how far they are from the main server? Or how much more graphically realistic MMOs could look without bandwidth limitations? Gaming 100 years from now will probably be amazing.