It's not that doing it with VCOs is costly...it's the whole concept itself. Full polyphony means that each voice signal controls its own fully independent synth, so you have to replicate the VCOs, VCF, VCAs, EGs, LFOs and so on for every voice until you arrive at your final output mixer, where you'll mix the different voices together for a single mono or stereo output.
The next step down from this isn't actual polyphony. It's something referred to as 'paraphony': each voice signal still controls its own set of sound generators, but instead of replicating the rest of the audio and control chains per voice, the mixdown to a single signal happens right after the VCOs, and that mix then goes through a single VCF, VCA and so on to the output. This method actually makes more sense in a modular context, since you can branch and recombine all sorts of paths along that post-VCO chain for sonic variation and arrive at a more controllable (and affordable!) system as a result. This is what I'd recommend as an approach, as a true polyphonic modular is, by its nature, going to be very spendy and also hell to patch and control. For a sense of the scale involved, think something along the lines of Junkie XL's MU 'wall' or Hans Zimmer's monster wall rig of Moog, PPG and Roland modules.
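To put rough numbers on the difference, here's a quick back-of-the-envelope sketch in Python. The per-voice chain (2 VCOs, 1 VCF, 1 VCA, 2 EGs, 1 LFO) is purely an assumption for illustration, and the function names are made up; swap in whatever your own voice design needs.

```python
# Hypothetical per-voice chain, for illustration only; a real patch may differ.
PER_VOICE_CHAIN = {"VCO": 2, "VCF": 1, "VCA": 1, "EG": 2, "LFO": 1}


def full_polyphony(voices, chain=PER_VOICE_CHAIN):
    """Every voice gets its own complete chain, plus one final output mixer."""
    counts = {module: n * voices for module, n in chain.items()}
    counts["mixer"] = 1
    return counts


def paraphony(voices, chain=PER_VOICE_CHAIN):
    """Only the VCOs are replicated per voice; the post-VCO chain is shared."""
    counts = dict(chain)
    counts["VCO"] = chain["VCO"] * voices
    counts["mixer"] = 1  # the mixdown that sits right after the VCOs
    return counts


def total(counts):
    """Total number of modules in a build."""
    return sum(counts.values())


if __name__ == "__main__":
    for voices in (1, 4, 8):
        print(
            f"{voices} voice(s): "
            f"full poly = {total(full_polyphony(voices))} modules, "
            f"paraphonic = {total(paraphony(voices))} modules"
        )
```

With those assumed numbers, the fully polyphonic build grows by a whole chain for every added voice, while the paraphonic one only grows by the VCOs. It only counts modules, so it says nothing about the extra patching and control headaches, which scale up just as badly.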