Maybe Whisper? This GitHub repo: https://github.com/linto-ai/whisper-timestamped
says that Whisper can produce timestamps for speech segments. However, I don't know if that's what you want, since Whisper might only be able to do that when it is transcribing the actual audio, rather than editing another text file.
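
If it helps, here is a rough sketch of what using it might look like, based on my reading of the repo's README. The file name `audio.wav`, the `tiny` model size, and the helper names `format_segments` and `transcribe_with_timestamps` are my own placeholders, and I'm assuming the result is a dict with a `segments` list whose entries carry `start`, `end`, and `text`. Note that it transcribes the audio itself, which is exactly the limitation I mentioned.

```python
def format_segments(result):
    """Turn a Whisper-style result dict into 'start --> end  text' lines.

    Assumes each segment has 'start' and 'end' (seconds) and 'text'.
    """
    lines = []
    for seg in result.get("segments", []):
        lines.append(f"{seg['start']:.2f} --> {seg['end']:.2f}  {seg['text'].strip()}")
    return lines


def transcribe_with_timestamps(audio_path, model_name="tiny"):
    """Hypothetical wrapper around whisper-timestamped (pip install whisper-timestamped)."""
    # Imported lazily so format_segments can be used without the package installed.
    import whisper_timestamped as whisper

    audio = whisper.load_audio(audio_path)
    model = whisper.load_model(model_name, device="cpu")
    return whisper.transcribe(model, audio)


if __name__ == "__main__":
    # "audio.wav" is a placeholder path; replace it with your own recording.
    for line in format_segments(transcribe_with_timestamps("audio.wav")):
        print(line)
```

The formatting helper is kept separate from the transcription call so you can check the output shape on a hand-written dict before downloading a model.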