I'd like to export transcription data that I can use in an interactive transcript, i.e. a transcript where each word is highlighted as it is spoken in the audio (the same way words are highlighted as they are spoken in the Descript app itself).
The VTT spec allows for karaoke-style cues:
00:16.500 --> 00:18.500
When the moon <00:17.500>hits your eye

00:00:18.500 --> 00:00:20.500
Like a <b><00:19.000>big-a <00:19.500></b>pizza <00:20.000>pie

00:00:20.500 --> 00:00:21.500
That's <00:00:21.000>amore
Another option would be to export a JSON file that contains an array of words with a start_time and end_time value for each.
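To make the JSON idea concrete, here is a rough Python sketch of what such an export might look like and how a player could find the word to highlight. The top-level `words` key, the `text` field, and the use of seconds for `start_time` and `end_time` are assumptions on my part, not an existing Descript format.

```python
import bisect
import json

# Hypothetical export shape: key names and units (seconds) are assumptions.
EXPORT = json.loads("""
{"words": [
  {"text": "When", "start_time": 16.5, "end_time": 16.8},
  {"text": "the",  "start_time": 16.8, "end_time": 17.0},
  {"text": "moon", "start_time": 17.0, "end_time": 17.5},
  {"text": "hits", "start_time": 17.5, "end_time": 17.9}
]}
""")


def active_word_index(words, t):
    """Return the index of the word being spoken at time t (seconds), or None."""
    starts = [w["start_time"] for w in words]
    # Last word whose start_time is <= t (words are assumed sorted by start_time).
    i = bisect.bisect_right(starts, t) - 1
    if i >= 0 and t < words[i]["end_time"]:
        return i
    return None
```

A player would call `active_word_index(EXPORT["words"], audio.currentTime)` (or its equivalent) on each time update and highlight the returned word.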
Ideally you'd be able to export to both of these formats and have some control over how the exported data is formatted. For example, an option for whether the current word in a line inside a VTT file should be bolded.
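As an illustration of the kind of formatting control I mean, here is a minimal Python sketch that turns a list of timed words into VTT in two ways: one karaoke-style cue per line, or one cue per word with the current word wrapped in `<b>`. The function names and the word dictionary shape (`text`, `start_time`, `end_time` in seconds) are hypothetical, and escaping of `&` and `<` in the word text is left out for brevity.

```python
def fmt(t):
    """Format seconds as a WebVTT timestamp (hh:mm:ss.mmm)."""
    ms = round(t * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def karaoke_cue(words):
    """One cue for the whole line, with an inline timestamp before each later word."""
    parts = [words[0]["text"]]
    parts += [f"<{fmt(w['start_time'])}>{w['text']}" for w in words[1:]]
    header = f"{fmt(words[0]['start_time'])} --> {fmt(words[-1]['end_time'])}"
    return header + "\n" + " ".join(parts)


def bold_current_cues(words):
    """One cue per word, repeating the line with the current word in <b>."""
    cues = []
    for i, w in enumerate(words):
        line = " ".join(
            f"<b>{x['text']}</b>" if j == i else x["text"]
            for j, x in enumerate(words)
        )
        cues.append(f"{fmt(w['start_time'])} --> {fmt(w['end_time'])}\n{line}")
    return "\n\n".join(cues)


# Sample line with assumed word timings.
LINE = [
    {"text": "That's", "start_time": 20.5, "end_time": 21.0},
    {"text": "amore", "start_time": 21.0, "end_time": 21.5},
]
print(karaoke_cue(LINE))
print()
print(bold_current_cues(LINE))
```

The second style is more verbose, but it would still show the current word in players that ignore inline cue timestamps.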