Description
The problem
Look at this translation
The original text is
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.[1] The reverse process is speech recognition.
The translation of this text is badly wrong:
Синтез речи является искусственным производством человеческого речь.
It looks much better if I select this text segment and translate it:
Синтез речи - это искусственное воспроизведение человеческой речи
So the problem here is not the translator itself. The problem is that the words are translated separately, so the translator does not get the context.
Potential solutions
This text is placed inside a block element:
We could schedule translation better, so that the whole text of a block is translated in a single context.
An example algorithm:
- find the parent block container of the node to be translated
- wait some time to collect more neighbors (I mean nodes inside the same block, not only direct siblings of the target node)
- form a single string that includes all texts in the block, with special markup that lets it be split back into segments
- translate this special string
- parse the translated string back into segments and validate it
- if validation fails and we get an incorrect number of elements, run the translation step again, maybe with a modified string, to get another result
- resolve each translation request in the chunk with its proper string
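The encode, translate, parse, and validate steps above could be sketched as follows. This is only a minimal illustration, not a proposed final design: the names `encodeSegments`, `decodeSegments`, and `translateBatch` are hypothetical, the numbered `<0>…</0>` markers are an assumed format chosen for the sketch, and the translator is passed in as a plain function. It also assumes segments are plain text that does not itself contain such markers.

```typescript
// Hypothetical translator signature; a real one would call a translation API.
type Translate = (text: string) => Promise<string>;

// Wrap every segment in a numbered marker so the joined string can be split back.
function encodeSegments(segments: string[]): string {
  return segments.map((text, i) => `<${i}>${text}</${i}>`).join('');
}

// Split a translated string back into segments.
// Returns null when validation fails (missing, duplicated, or unknown markers,
// or non-whitespace text outside of any marker).
function decodeSegments(encoded: string, expectedCount: number): string[] | null {
  const pattern = /<(\d+)>([\s\S]*?)<\/\1>/g;
  const slots: (string | undefined)[] = [];
  let filled = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(encoded)) !== null) {
    const index = Number(match[1]);
    if (index >= expectedCount || slots[index] !== undefined) return null;
    slots[index] = match[2];
    filled++;
  }
  const leftover = encoded.replace(pattern, '').trim();
  if (filled !== expectedCount || leftover !== '') return null;
  return slots as string[];
}

// Translate all segments of one block in a single request.
// On repeated validation failure, fall back to translating segments one by one,
// which loses the shared context but still resolves every request.
async function translateBatch(
  segments: string[],
  translate: Translate,
  maxAttempts = 2,
): Promise<string[]> {
  const encoded = encodeSegments(segments);
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const decoded = decodeSegments(await translate(encoded), segments.length);
    if (decoded !== null) return decoded;
  }
  return Promise.all(segments.map((text) => translate(text)));
}
```

Because the markers carry an index, the decoder in this sketch accepts segments that come back in a different order, which may happen when the target language uses a different word order. Retrying with the same string only helps if the translator is not deterministic, so a real implementation would probably modify the string between attempts, as suggested above.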
Example
For the source HTML below
<p><b>Speech synthesis</b> is the artificial production of human <a href="/wiki/Speech" title="Speech">speech</a>. A computer system used for this purpose is called a <b>speech synthesizer</b>, and can be implemented in <a href="/wiki/Software" title="Software">software</a> or <a href="/wiki/Computer_hardware" title="Computer hardware">hardware</a> products. A <b>text-to-speech</b> (<b>TTS</b>) system converts normal language text into speech; other systems render <a href="/wiki/Symbolic_linguistic_representation" title="Symbolic linguistic representation">symbolic linguistic representations</a> like <a href="/wiki/Phonetic_transcription" title="Phonetic transcription">phonetic transcriptions</a> into speech.<sup id="cite_ref-1" class="reference"><a href="#cite_note-1"><span class="cite-bracket">[</span>1<span class="cite-bracket">]</span></a></sup> The reverse process is <a href="/wiki/Speech_recognition" title="Speech recognition">speech recognition</a>.
</p>We could generate the following markup
<pre>
<pre>Speech synthesis</pre><pre> is the artificial production of human</pre>
<pre>speech</pre><pre>. A computer system used for
this purpose is called a <b>speech synthesizer</b>, and can be implemented in</pre>
<pre>software</pre><pre> or</pre>
<pre>hardware</pre><pre> products. A </pre><pre>text-to-speech</pre><pre> (</pre><b>TTS</b><pre>) system converts normal language
text into speech; other systems render</pre>
<pre>symbolic linguistic representations</pre><pre> like</pre><pre>phonetic transcriptions</pre><pre> into speech.</pre><pre><pre><pre>[</pre>1<pre>]</pre></pre></pre><pre> The reverse process is</pre>
<pre>speech recognition</pre><pre>.</pre>
</pre>that would be translated to
<предварительный>
<pre>Синтез речи</pre><pre> - это искусственное воспроизведение человеческой</pre>
<pre>речи</pre><pre>. Компьютерная система, используемая для
этой цели, называется <b>синтезатором речи</b> и может быть реализована в</pre>
<pre>Программное обеспечение</pre><pre> или</pre>
<pre>аппаратное обеспечение</pre><pre> продукты. A </pre><pre>преобразование текста в речь</pre><pre> (</pre><b>TTS</b><pre>) система преобразует обычный язык
текст в речь; другие системы воспроизводят</pre>
<pre>символические лингвистические представления</pre><pre> Нравится</pre><pre>фонетические транскрипции</pre><pre> в речь.</pre><pre><pre><pre>[</pre>1<pre>]</pre></pre></pre><pre> Обратный процесс - это</pre>
<pre>распознавание речи</pre><pre>.</pre>
>As we can see, the translated result is broken, because the first <pre> tag was translated as <предварительный>. So we have to find a proper format for translating batches of text.
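A failure like this can at least be detected before the result is parsed. The sketch below compares the sequence of tags in the source markup with the sequence in the translated markup. The function names `tagSequence` and `markupSurvived` are hypothetical, and the check assumes attribute-free tags that keep their original order, which may be too strict for languages that reorder phrases.

```typescript
// Extract the sequence of attribute-free tag tokens, e.g. ["<pre>", "</pre>"].
function tagSequence(markup: string): string[] {
  return markup.match(/<\/?[^<>\s]+>/g) || [];
}

// True when the translated markup contains exactly the same tags,
// in the same order, as the source markup.
function markupSurvived(source: string, translated: string): boolean {
  const expected = tagSequence(source);
  const actual = tagSequence(translated);
  return (
    expected.length === actual.length &&
    expected.every((tag, i) => tag === actual[i])
  );
}

const source = '<pre><pre>Speech synthesis</pre><pre>.</pre></pre>';
const broken = '<предварительный><pre>Синтез речи</pre><pre>.</pre></pre>';
const intact = '<pre><pre>Синтез речи</pre><pre>.</pre></pre>';

console.log(markupSurvived(source, broken)); // false
console.log(markupSurvived(source, intact)); // true
```

A check like this does not fix the format problem, it only tells the scheduler that the batch has to be retried or translated another way.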
The goals
We have to find a way to translate many texts with a single request and then parse the translated text back into separate segments.
The methodology must describe the approach and an algorithm for implementing it, including the text format and the parsing strategy.