In general, I think the best way to test creatives (on Facebook or otherwise, although Facebook does have some unique features -- more below) is to start with A/B tests at the concept level and then move into budget allocation and multivariate testing with variants of proven concepts. This means that I tend to think about A/B testing in a concept vs. concept context, with variant selection happening dynamically within an ad set.
What I mean by this, and what you allude to in your question, is that you need a lot of traffic to support a real A/B test, and that traffic should be roughly identical in composition and volume across arms, so it gets expensive to test a number of variants simultaneously. When forcing a hard "decision point," I like to gather as much data around each option as possible at minimal cost -- a good way to do that is with an A/B test that compares two concepts against each other for clicks. Think of the concept as the essence of the ad: the thing that is being showcased. This can be split-tested with a small number of concepts to produce guidance around what content the ad should portray to best capture potential users' attention.
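To make the traffic requirement concrete, here is a rough sketch of a standard sample-size calculation for comparing two click-through rates. The function name, the CTR values, and the 5% significance / 80% power defaults are my own illustrative assumptions, not figures from any real campaign.

```python
from math import ceil, sqrt
from statistics import NormalDist


def impressions_per_arm(ctr_a, ctr_b, alpha=0.05, power=0.8):
    """Approximate impressions needed per arm to detect ctr_a vs. ctr_b.

    Uses the textbook normal-approximation formula for a two-sided
    test of two proportions. All inputs are hypothetical planning values.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (ctr_a + ctr_b) / 2                    # average rate under the null
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(ctr_a * (1 - ctr_a) + ctr_b * (1 - ctr_b))
    ) ** 2
    return ceil(numerator / (ctr_a - ctr_b) ** 2)


# Hypothetical example: telling a 1.0% CTR apart from a 1.2% CTR.
n = impressions_per_arm(0.010, 0.012)
print(f"impressions needed per arm: {n}")
# Every additional arm in the same test needs roughly this many impressions too.
print(f"total for 2 arms: {2 * n}, total for 6 arms: {6 * n}")
```

With CTRs in this range the answer lands in the tens of thousands of impressions per arm, and the total scales with the number of arms, which is why I keep the hard split tests to a couple of concepts.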
Next, you can split-test a few variants of the aesthetic tone of that ad: is it dark, light, realistic, cartoonish, etc. This is more specific than the first test but still conceptual enough to be a reasonable fit for a split test.
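For either of these split tests, one common way to decide whether the difference in clicks is real is a two-proportion z-test. The sketch below is just that standard test, with a hypothetical function name and made-up click counts; it is not something the ad platform gives you in this form.

```python
from math import sqrt
from statistics import NormalDist


def ctr_z_test(clicks_a, impressions_a, clicks_b, impressions_b):
    """Two-sided z-test for a difference between two click-through rates.

    Returns (z, p_value). A positive z means arm B has the higher CTR.
    """
    p_a = clicks_a / impressions_a
    p_b = clicks_b / impressions_b
    # Pooled CTR under the null hypothesis that both arms perform the same.
    pooled = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se = sqrt(pooled * (1 - pooled) * (1 / impressions_a + 1 / impressions_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up numbers: concept A vs. concept B with equal traffic.
z, p = ctr_z_test(clicks_a=400, impressions_a=40_000,
                  clicks_b=500, impressions_b=40_000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

A small p-value suggests the winning concept (or tone) is worth carrying into the next stage; a large one suggests the test needs more traffic or the two options are simply too similar to separate.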
Once the tone of the concept is captured, you can move into testing specific variants against each other using normal budget-optimization within an ad set: this isn't a test per se but rather a dynamic process that allocates budget to the best variants as data is gathered and conversions are counted. These variants can all sit in one ad set and be pruned as needed (or simply ignored as they stop being served impressions). Here's a simple diagram of this process:
There's an additional step that can happen after Test 3 (the within-ad-set variant testing), which is to move the best-performing variants into an evergreen ad set that contains the best performers across all concepts.
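To give a feel for what dynamic allocation within an ad set looks like, here is a toy simulation using Thompson sampling, a textbook bandit method. To be clear, this is my stand-in for illustration only: I am not claiming it is how Facebook's delivery system actually works, and the function name and the "true" rates are invented.

```python
import random


def simulate_allocation(true_rates, rounds=20_000, seed=0):
    """Simulate impression allocation across variants with Thompson sampling.

    true_rates holds hypothetical per-impression success rates, which are
    unknown to the allocator. Returns how many impressions each variant got.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    served = [0] * k
    for _ in range(rounds):
        # Draw a plausible rate for each variant from its Beta posterior
        # (uniform Beta(1, 1) prior), then serve the variant with the best draw.
        draws = [rng.betavariate(1 + successes[i], 1 + failures[i])
                 for i in range(k)]
        chosen = max(range(k), key=draws.__getitem__)
        served[chosen] += 1
        if rng.random() < true_rates[chosen]:
            successes[chosen] += 1
        else:
            failures[chosen] += 1
    return served


# Three hypothetical variants of one proven concept.
served = simulate_allocation([0.02, 0.05, 0.10])
for i, count in enumerate(served):
    print(f"variant {i}: {count} impressions")
```

The weaker variants stop receiving impressions as evidence accumulates, which mirrors the "pruned as needed, or simply ignored" behavior described above. Variants that end up with most of the delivery are the natural candidates for the evergreen ad set.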
Note that what I'm not doing in this process is definitively determining which single variant of all possible concepts performs the best: that would, as the question implies, be very time consuming and very expensive. What I am doing is using my intuition around my product, my users, and advertising in general to produce a handful of concepts that I feel capture the essence of my product and will resonate well with users. I am then systematically calibrating those concepts until a manageable number of performant ad creatives has been generated. I like this passage from a seminal paper on the topic of online experimentation, Online Controlled Experiments at Large Scale:
Organizations should consider a large number of initial ideas and have an efficient and reliable mechanism to narrow them down to a much smaller number of ideas that are ultimately implemented and released to users in online controlled experiments. For this funnel of ideas to be efficient, low cost methods such as pitching ideas and reviewing mockups are needed to evaluate and narrow down the large number of ideas at the top of the funnel. Controlled experiments are typically not suitable to evaluate ideas at the top of the funnel because they require each idea to be implemented sufficiently well to deploy and run on real users, and this feature development cost can be high. Hence, at the top of the funnel more ideas are evaluated using low-cost techniques, but with lower fidelity. Conversely, at the bottom of the funnel there are fewer ideas to evaluate and the organization should use more reliable methods to evaluate them, with controlled experiments being the most reliable and preferred method.
I want to let Facebook's budget-allocation algorithm do its job after I have gathered data around high-level concepts; I don't want to let it loose on a huge number of variants, as that will increase my cost and reduce my operating speed.