YobiYoba web service API - Basic Methods

REST API

YobiYoba service - streaming and real-time STT API

The YobiYoba STT service currently offers 3 submission modes: the file mode (default), the streaming mode, and the real-time mode.
  • With the default file mode, the client host submits a request that includes an audio file, and the service returns the XML result once the file has been fully processed. The latency is therefore proportional to the duration of the audio file. The file mode API is described here.
  • With the streaming mode, the client host sends an audio stream and the service sends back an XML stream (one HTTP request per stream). HTTP chunked transfer encoding is used in both directions. The audio data should be sent while it is being recorded. It is possible to specify the average latency, but it cannot be less than 10-15s. Large latency values (up to a few minutes) give the best accuracy. The audio stream can last up to 12 hours. This mode is supported for all languages accepted by the bs_trans method.
  • The real-time mode works like the streaming mode but with a very low latency. Note, however, that this mode produces a less complete XML (e.g. no punctuation, no speaker diarization) and that speech/non-speech detection must be performed by the client software if such a feature is needed. As with the streaming mode, the audio stream can last a few hours, but the real-time mode is also suited to interactive use. This mode is currently not supported for all languages. Contact us at yobiyoba@yobinext.com if you have specific needs.
    Both the streaming and real-time modes require URI-encoded requests, i.e. the HTTP body is reserved exclusively for the audio stream. If other files are needed to fully specify the query, they have to be uploaded beforehand using the upload method (more information is given below). Both modes accept the vocfile and textfile arguments.
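
Because every argument travels in the URI, a client has to percent-encode the query string itself. The following is a minimal Python sketch of that step. The helper name build_request_url and the BASE_URL constant are illustrative assumptions, not part of the service; only the host, the method names and the argument names come from this page.

```python
from urllib.parse import quote, urlencode

# Assumption: same host and port as in the examples on this page.
BASE_URL = "https://member.yobiyoba.com:8095/api"


def build_request_url(method, model, audiofile, **options):
    """Return a URI-encoded request URL (hypothetical helper).

    All arguments go into the query string, so the HTTP body
    stays free for the audio stream.
    """
    params = {"method": method, "model": model}
    params.update(options)            # e.g. rtopt=1, textfile="mytext.txt"
    params["audiofile"] = audiofile
    # quote_via=quote encodes a space as %20 rather than '+'.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(build_request_url("bs_xtrans", "eng", "stream1.mp3"))
print(build_request_url("bs_trans", "eng", "my stream.mp3", rtopt=1))
```

The second call shows why the encoding matters: a file name containing a space would otherwise produce an invalid request line.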

    Streaming mode

    The streaming mode is activated by using the bs_xtrans method and by specifying the mandatory model and audiofile arguments. Here is an example request:
    PUT /api?method=bs_xtrans&model=eng&audiofile=stream1.mp3 HTTP/1.1
    api-key: your api key
    User-Agent: ClientProgram/1.0
    Host: member.yobiyoba.com:8095
    Transfer-Encoding: chunked
    [blank line]
    [chunk#1]
    [chunk#2]
    ...
    [empty chunk]
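
For readers unfamiliar with the chunk notation in the request above, here is a small Python sketch of how HTTP/1.1 chunked transfer encoding frames the body on the wire. It is for illustration only: HTTP client libraries and 'curl -T -' normally do this framing for you, and the function name frame_chunks is an assumption of this sketch.

```python
def frame_chunks(chunks):
    """Yield each payload framed as an HTTP/1.1 chunk (illustrative helper).

    Each chunk is: size in hexadecimal, CRLF, the data, CRLF.
    The stream is closed by a zero-size chunk followed by a blank line.
    """
    for data in chunks:
        if data:  # skip empty payloads: a zero-size chunk would end the stream
            yield b"%x\r\n" % len(data) + data + b"\r\n"
    yield b"0\r\n\r\n"  # the [empty chunk] that terminates the audio stream


body = b"".join(frame_chunks([b"abc", b"hello world!!!!!"]))
print(body)
```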
    

    You can use any chunk size, but it should be small enough not to affect the latency. Here is a curl example:

    cat stream1.mp3 | curl -ksS -H "api-key: $api_key" -T - "https://member.yobiyoba.com:8095/api?method=bs_xtrans&model=eng&audiofile=stream1.mp3"

    Here the 'cat' process needs to be replaced by the streaming process to get the correct latency. For example, you can use another 'curl' process that reads the audio stream. To properly end the audio stream, you have to send a last chunk of size zero.

    The YobiYoba server also uses HTTP chunked transfer encoding to send back the XML transcription, i.e. the XML stream must be read on the fly while you are sending the audio stream. Since an error can occur after you have received part of the transcript, you also need to check for an error message at the end of the XML stream (last line).

    The service continuously computes the latency and returns it in the <Latency> XML tag. Here is an example tag:

    <Latency stime="1289.00" etime="1295.61" seg="9.7" avg="9.9"/>

    This means that the average latency for the words in the segment specified by stime and etime is 9.7s and that the average latency since the start of the file is 9.9s.
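
A client that wants to monitor the latency can extract these attributes from the tag. The sketch below uses Python's standard xml.etree.ElementTree module and assumes the tag has already been isolated as a standalone string; in a live XML stream you would more likely feed the data to an incremental parser. The function name parse_latency is illustrative.

```python
import xml.etree.ElementTree as ET


def parse_latency(tag_text):
    """Parse one <Latency .../> tag into a dict of floats (illustrative helper)."""
    elem = ET.fromstring(tag_text)
    if elem.tag != "Latency":
        raise ValueError("expected a <Latency> tag, got <%s>" % elem.tag)
    # stime/etime delimit the segment; seg and avg are latencies in seconds.
    return {name: float(elem.attrib[name])
            for name in ("stime", "etime", "seg", "avg")}


info = parse_latency('<Latency stime="1289.00" etime="1295.61" seg="9.7" avg="9.9"/>')
print(info["seg"], info["avg"])  # prints: 9.7 9.9
```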

    The 'xlopt' parameter can be used to specify the latency. The minimum value is 15.0 (for 15s) and there is no maximum value. The default value is about 16s. Here is an example URI using this option specifying a delay of 60s:

    https://.../api?method=bs_xtrans&model=eng&xlopt=60&audiofile=stream1.mp3
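
Since values below the minimum are not accepted, it can be convenient to check the requested latency on the client before opening the stream. This is a minimal sketch under that assumption; the helper streaming_query and the constant MIN_XLOPT are hypothetical names, and how the server reacts to an out-of-range value is not covered here.

```python
from urllib.parse import quote, urlencode

MIN_XLOPT = 15.0  # minimum latency in seconds, as stated above


def streaming_query(model, audiofile, latency_s=None):
    """Build the query string of a bs_xtrans request (hypothetical helper).

    latency_s is optional; when omitted, the service default applies.
    """
    params = {"method": "bs_xtrans", "model": model}
    if latency_s is not None:
        if latency_s < MIN_XLOPT:
            raise ValueError("xlopt must be at least %.1f seconds" % MIN_XLOPT)
        params["xlopt"] = f"{latency_s:g}"  # 60.0 -> "60", 22.5 -> "22.5"
    params["audiofile"] = audiofile
    return urlencode(params, quote_via=quote)


print(streaming_query("eng", "stream1.mp3", latency_s=60))
```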

    Real-time mode

    The real-time mode is activated by adding the option rtopt=1 to the bs_trans or cts_trans methods and by specifying the mandatory model and audiofile arguments. Here is an example request:
    PUT /api?method=bs_trans&rtopt=1&model=eng&audiofile=stream1.mp3 HTTP/1.1
    api-key: your api key
    User-Agent: ClientProgram/1.0
    Host: member.yobiyoba.com:8095
    Transfer-Encoding: chunked
    [blank line]
    [chunk#1]
    [chunk#2]
    ...
    [empty chunk]
    
    You can use any chunk size; however, to minimize the latency, each chunk should contain less than 200ms of audio, i.e. for 64kbps audio coding, the chunk size should be less than 1600 bytes. Here is a curl example:

    cat stream1.mp3 | curl -ksSN -H "api-key: $api_key" -T - "https://member.yobiyoba.com:8095/api?method=bs_trans&rtopt=1&model=eng&audiofile=stream1.mp3"

    Note that the curl -N option disables the buffering of the XML stream so as not to delay it. To process the audio in real time, you have to replace the 'cat' command in the above example with your recording process. Depending on your use case, you may need real-time voice activity detection in your recording process, as this functionality is not provided by the service.
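
The chunk-size guideline follows from simple arithmetic: at a constant 64kbps the stream carries 8000 bytes per second, so 1600 bytes hold 200ms of audio. The Python sketch below computes that bound for any bitrate and splits a byte stream accordingly. It assumes a constant-bitrate encoding, and the function names are illustrative.

```python
import io


def max_chunk_bytes(bitrate_bps, max_audio_ms):
    """Largest chunk size in bytes holding at most max_audio_ms of audio.

    Assumes a constant bitrate: bits/s * ms / (8 bits per byte * 1000 ms per s).
    """
    return bitrate_bps * max_audio_ms // 8000


def iter_chunks(stream, chunk_size):
    """Yield successive chunks of at most chunk_size bytes from a binary stream."""
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        yield data


size = max_chunk_bytes(64000, 200)
print(size)  # prints: 1600

# Stand-in for a recording process: 4000 bytes of silence.
audio = io.BytesIO(bytes(4000))
print([len(chunk) for chunk in iter_chunks(audio, size)])  # prints: [1600, 1600, 800]
```

In a real client, the stream passed to iter_chunks would be the output of the recording process, so that chunks are sent as soon as the audio becomes available.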

    To use a text file for language model adaptation, you first need to upload the file and then specify this file in your STT request as follows:

    1. curl ... "https://.../api?method=upload&textfile=mytext.txt" -T mytext.txt
    2. curl ... -T - "https://.../api?method=bs_trans&rtopt=1&model=eng&textfile=mytext.txt&audiofile=stream1.mp3"
    In step 1 you can use either a URI-encoded request (for a single file) or a MIME multi-part request to upload more than one file (e.g. textfile and vocfile).

    Model adaptation is not suitable for an interactive use case, as it adds a delay at the beginning of the stream. However, it is fine for long streams, as the delay is quickly compensated.