Voice is the reference name on the TTS server (currently only "default" is meaningful). Speed multiplies the generation rate.
default
data:audio/wav;base64,...
No file selected. Min ~3s, max ~30s of clear speech.
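As a rough illustration of the length hint and the data:audio/wav;base64,... form above, the sketch below checks a WAV clip against the ~3s to ~30s range and encodes it as a data URI. It assumes the hint applies to an uploaded reference clip and that the clip is sent as a base64 data URI; neither is confirmed here. The names wav_duration_seconds, reference_clip_to_data_uri and make_silent_wav are hypothetical helpers, not part of any TTS server API, and only the Python standard library is used.

```python
import base64
import io
import wave

# Approximate bounds taken from the hint above; treated as hard limits here
# purely for illustration.
MIN_SECONDS = 3.0
MAX_SECONDS = 30.0


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of a PCM WAV clip in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_clip_to_data_uri(wav_bytes: bytes) -> str:
    """Check the clip length, then encode it as a data:audio/wav;base64 URI."""
    duration = wav_duration_seconds(wav_bytes)
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        raise ValueError(
            f"clip is {duration:.1f}s; expected {MIN_SECONDS:.0f}-{MAX_SECONDS:.0f}s"
        )
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    return "data:audio/wav;base64," + encoded


def make_silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Build a mono 16-bit silent WAV in memory (demo input only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


if __name__ == "__main__":
    uri = reference_clip_to_data_uri(make_silent_wav(5.0))
    print(uri[:40] + "...")
```

The check only measures length. Whether the clip contains clear speech cannot be determined this way and would need to be judged separately.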