Google has announced a new translation tool that converts speech in one language into speech in another while preserving the speaker's original voice, the Daily Mail reported.
The tech giant's new system works without first converting the speech to text.
A first of its kind, the tool is able to do this while retaining the voice of the original speaker and making it sound 'more realistic', the tech giant said.
Google claims the system, dubbed 'Translatotron', will be able to retain the voice of the original speaker after translation while also understanding words better.
It can directly translate speech from one language into speech in another language, without relying on the intermediate text representation in either language, as is required in cascaded systems.
'Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language,' Google wrote in a blog post.
Currently, Google Translate's system uses three stages: automatic speech recognition, which transcribes speech as text; machine translation, which translates this text into another language; and text-to-speech synthesis, which uses this text to generate speech.
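The three-stage cascade can be sketched with toy stand-ins for each stage. The function names and the tiny phrase table below are illustrative placeholders, not Google's actual API; a real cascade would call trained models at every step:

```python
def automatic_speech_recognition(audio, lang):
    """Stage 1: transcribe speech to text (here, a canned lookup)."""
    return audio["transcript"]

def machine_translation(text, src, tgt):
    """Stage 2: translate the text (here, a tiny toy phrase table)."""
    table = {("es", "en"): {"hola mundo": "hello world"}}
    return table[(src, tgt)][text]

def text_to_speech(text, lang):
    """Stage 3: synthesize speech from text (here, a tagged placeholder)."""
    return {"lang": lang, "waveform_for": text}

def cascade_translate(audio, src="es", tgt="en"):
    # Each stage feeds the next; text is the intermediate representation.
    text = automatic_speech_recognition(audio, src)
    translated = machine_translation(text, src, tgt)
    return text_to_speech(translated, tgt)

out = cascade_translate({"transcript": "hola mundo"})
```

Because every stage depends on the previous one's text output, a recognition error in stage one propagates through translation and synthesis, which is part of what an end-to-end model avoids.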
The tech giant now says it will use a single model without the need for text.
'This system avoids dividing the task into separate stages,' the blog post by Google AI software engineers Ye Jia and Ron Weiss said.
According to the company، this will mean faster translation speed and fewer errors.
The system retains the speaker's voice by using spectrograms, a visual representation of the soundwaves, as its input.
It then generates spectrograms of the translated speech, relying on a neural vocoder and a speaker encoder so that the speaker's vocal characteristics stay the same once translated.
Google admitted that the system needs refining through further training of the algorithm.
Sound clips published in the post were more 'realistic' than a machine voice, but still unmistakably computer-generated.
HOW DOES 'TRANSLATOTRON' WORK?

Translatotron is based on a sequence-to-sequence network which takes source spectrograms, a visual representation of the soundwaves, as input and generates spectrograms of the translated content in the target language.
It also makes use of two other separately trained components: a neural vocoder that converts output spectrograms to waveforms, and, optionally, a speaker encoder that can be used to maintain the character of the source speaker's voice in the synthesized translated speech.
During training, the sequence-to-sequence model uses a multitask objective to predict source and target transcripts at the same time as generating target spectrograms.
However, no transcripts or other intermediate text representations are used during inference.
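The multitask objective described above can be sketched as a weighted sum of losses. The loss names and the weight below are hypothetical placeholders used only to illustrate the structure, not values from Google's model:

```python
def multitask_loss(spec_loss, src_transcript_loss, tgt_transcript_loss,
                   aux_weight=0.1):
    """Training objective sketch: the main term scores the generated
    target spectrograms, while auxiliary terms score predicted source
    and target transcripts. The auxiliary decoders guide training only;
    at inference the model emits spectrograms directly, with no text
    anywhere in the loop."""
    return spec_loss + aux_weight * (src_transcript_loss + tgt_transcript_loss)

total = multitask_loss(spec_loss=1.0,
                       src_transcript_loss=0.5,
                       tgt_transcript_loss=0.5)
```

Dropping the auxiliary terms at inference is what lets the system translate speech to speech without ever producing text.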