YobiYoba web service API - Basic Methods
YobiYoba service - streaming and real-time STT API
- In the streaming mode the client host sends an audio stream and the service sends back an XML stream (one HTTP request per stream). HTTP chunked transfer encoding is used in both directions. The audio data should be sent while it is being recorded. It is possible to specify the average latency, but it cannot be less than 10-15s. Large latency values (up to a few minutes) give the best accuracy. The audio stream can last up to 12 hours. This mode is supported for all languages accepted by the bs_trans method.
- The real-time mode works like the streaming mode but with a very low latency. Note, however, that this mode produces a less complete XML (e.g. no punctuation, no speaker diarization) and that speech/non-speech detection must be performed by the client software if such a feature is needed. As in the streaming mode, the audio stream can last a few hours, but the real-time mode is also suited to interactive use. This mode is currently not supported for all languages. Contact us at yobiyoba@yobinext.com if you have specific needs.
Streaming mode
The streaming mode is activated by using the bs_xtrans method and by specifying the mandatory model and audiofile arguments. Here is an example request:
PUT /api?method=bs_xtrans&model=eng&audiofile=stream1.mp3 HTTP/1.1
api-key: your api key
User-Agent: ClientProgram/1.0
Host: member.yobiyoba.com:8095
Transfer-Encoding: chunked
[blank line]
[chunk#1]
[chunk#2]
...
[empty chunk]
You can use any chunk size, but it should be small enough not to affect the latency. Here is a curl example:
cat stream1.mp3 | curl -ksS -H "api-key: $api_key" -T - \
  "https://member.yobiyoba.com:8095/api?method=bs_xtrans&model=eng&audiofile=stream1.mp3"
Here the 'cat' process needs to be replaced by the streaming process to get the correct latency. For example, you can instead use another 'curl' process reading the audio stream.
The YobiYoba server also uses the HTTP chunked transfer encoding to send the XML transcription back to you, i.e. the XML stream must be read on the fly while you are sending the audio stream. Since an error can occur after you have received part of the transcript, you need to check for an error message at the end of the XML stream (last line). To properly end the audio stream, you have to send a final chunk of size zero.
The service continuously computes the latency and returns it in the <Latency> XML tag. Here is an example tag:
<Latency stime="1289.00" etime="1295.61" seg="9.7" avg="9.9"/>
This means that the average latency for the words in the segment specified by stime and etime is 9.7s and that the average latency since the start of the file is 9.9s.
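As an illustration of reading the XML stream on the fly, here is a minimal Python sketch. It assumes the response body is available as an iterable of byte chunks and that each <Latency> tag fits on a single line of the stream. The helper names iter_lines and scan_stream are illustrative and not part of the service. The format of the error message is not described here, so the sketch only returns the last non-empty line for the caller to inspect.

```python
import re
import xml.etree.ElementTree as ET

# Matches a self-closing <Latency .../> tag like the example above.
LATENCY_RE = re.compile(r"<Latency\b[^>]*/>")
LATENCY_ATTRS = ("stime", "etime", "seg", "avg")


def iter_lines(chunks):
    """Reassemble text lines from an iterable of byte chunks."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            yield line.decode("utf-8", errors="replace")
    if buf:  # trailing data without a final newline
        yield buf.decode("utf-8", errors="replace")


def scan_stream(chunks):
    """Collect latency reports and remember the last non-empty line.

    Returns (latencies, last_line); the caller should check last_line
    for an error message once the stream has ended.
    """
    latencies = []
    last_line = ""
    for line in iter_lines(chunks):
        if line.strip():
            last_line = line.strip()
        for tag in LATENCY_RE.findall(line):
            attrs = ET.fromstring(tag).attrib
            latencies.append(
                {k: float(attrs[k]) for k in LATENCY_ATTRS if k in attrs}
            )
    return latencies, last_line
```

In a real client, scan_stream would be fed from the HTTP response while the audio is still being sent, for example from a separate thread.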
The 'xlopt' parameter can be used to specify the latency. The minimum value is 15.0 (for 15s) and there is no maximum value. The default value is about 16s. Here is an example URI using this option specifying a delay of 60s:
https://.../api?method=bs_xtrans&model=eng&xlopt=60&audiofile=stream1.mp3
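The request URI can also be assembled programmatically. The following Python sketch builds a bs_xtrans URI and rejects xlopt values below the documented minimum of 15.0. The function name streaming_url and the constants are illustrative, and the host is the one used in the curl example above.

```python
from urllib.parse import urlencode

BASE_URL = "https://member.yobiyoba.com:8095/api"  # host taken from the curl example
MIN_XLOPT = 15.0  # documented minimum latency, in seconds


def streaming_url(model, audiofile, xlopt=None, base_url=BASE_URL):
    """Build a bs_xtrans request URI, optionally with a target latency."""
    params = {"method": "bs_xtrans", "model": model}
    if xlopt is not None:
        if xlopt < MIN_XLOPT:
            raise ValueError(f"xlopt must be at least {MIN_XLOPT}, got {xlopt}")
        params["xlopt"] = f"{xlopt:g}"  # 60 -> "60", 22.5 -> "22.5"
    params["audiofile"] = audiofile
    return f"{base_url}?{urlencode(params)}"


print(streaming_url("eng", "stream1.mp3", xlopt=60))
```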
Real-time mode
The real-time mode is activated by adding the option rtopt=1 to the bs_trans or cts_trans methods and by specifying the mandatory model and audiofile arguments. Here is an example request:
PUT /api?method=bs_trans&rtopt=1&model=eng&audiofile=stream1.mp3 HTTP/1.1
api-key: your api key
User-Agent: ClientProgram/1.0
Host: member.yobiyoba.com:8095
Transfer-Encoding: chunked
[blank line]
[chunk#1]
[chunk#2]
...
[empty chunk]
You can use any chunk size; however, to minimize the latency, each chunk should contain less than 200ms of audio, i.e. for a 64kbps audio coding, the chunk size should be less than 1600 bytes. Here is a curl example:
cat stream1.mp3 | curl -ksSN -H "api-key: $api_key" -T - \
  "https://member.yobiyoba.com:8095/api?method=bs_trans&rtopt=1&model=eng&audiofile=stream1.mp3"
Note that the curl -N option disables buffering of the XML stream so that it is not delayed. To process the audio in real time, you have to replace the 'cat' command in the above example with your recording process. Depending on your use case, you may need real-time voice activity detection in your recording process, as this functionality is not provided by the service.
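The relation between audio bitrate, chunk duration and chunk size can be sketched in Python as follows. The function names max_chunk_bytes and read_chunks are illustrative, and the sketch assumes a constant-bitrate encoding; it only shows how a client might size and cut its chunks, not how they are sent.

```python
import io


def max_chunk_bytes(bitrate_bps, max_ms):
    """Largest chunk size (bytes) holding at most max_ms of audio."""
    # bits per second * milliseconds / (8 bits per byte * 1000 ms per s)
    return bitrate_bps * max_ms // 8000


def read_chunks(stream, chunk_size):
    """Yield successive chunks of at most chunk_size bytes from a stream."""
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        yield data


# 64 kbps audio, 200 ms of audio per chunk
size = max_chunk_bytes(64_000, 200)
print(size)

# Stand-in for a recording process: 4000 bytes of dummy data.
for chunk in read_chunks(io.BytesIO(b"\x00" * 4000), size):
    print(len(chunk))
```

With a live recording, read_chunks would be given the recorder's output stream instead of the in-memory buffer used here.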
To use a text file for language model adaptation, you first need to upload the file and then specify this file in your STT request as follows:
curl ... "https://.../api?method=upload&textfile=mytext.txt" -T mytext.txt
curl ... -T - "https://.../api?method=bs_trans&rtopt=1&model=eng&textfile=mytext.txt&audiofile=stream1.mp3"
Model adaptation is not suitable for an interactive use case, as it adds a delay at the beginning of the stream. However, it is fine for long streams, as the delay is quickly compensated.