Realtime Speech Recognition (Streaming)

1. Introduction

This document guides you through realtime speech-to-text conversion using the Daglo Realtime STT gRPC API.

WARNING

  • Realtime speech recognition currently only supports Korean and English.
  • Audio containing loud singing or background music is not supported.

1) Key Features

  • Realtime speech recognition: Convert microphone input (audio stream) to text in realtime
  • High-accuracy text conversion: Apply the latest speech recognition technology for high-accuracy text conversion
  • Fast response time: Process speech in realtime for immediate text output

2. Prerequisites

1) Requirements

  • Knowledge of gRPC
  • Device with microphone connected
  • Daglo API account / API Token

3. Protocol Buffer Service Definition

Currently, the Realtime STT API only provides a Protocol Buffer service definition file and does not provide separate client SDKs. This file can be used to generate gRPC client libraries in any language that supports gRPC. For more information, see grpc.io.

protobuf
syntax = "proto3";

package dagloapis.speech.v1;

// A service that implements a speech recognition API
service Speech {
  // Speech recognition is performed in a bidirectional streaming manner, 
  // transmitting audio and receiving results simultaneously.
  rpc StreamingRecognize(stream StreamingRecognizeRequest)
      returns (stream StreamingRecognizeResponse) {}
}

// The top-level message sent by the client to the `StreamingRecognize` method.
// A `StreamingRecognizeRequest` message takes one of two forms.
// The first message must contain the `config` field and not `audio_content`.
// All messages sent after this must contain `audio_content` and not `config`.
message StreamingRecognizeRequest {
  // A streaming request is a streaming configuration or audio stream.
  oneof streaming_request {
    // Provides configuration information regarding audio and STT processing.
    RecognitionConfig config = 1;

    // Audio stream data. Consecutive audio data fragments are sent sequentially 
    // using `StreamingRecognizeRequest` messages.
    // Audio sources must be captured and transmitted using `LINEAR16` encoding.
    // The sampling rate should be `16000Hz`. Resample the audio if necessary.
    // Only mono (1 channel) audio is supported.
    bytes audio_content = 2;
  }
}

// Provides the recognizer with information on how to process the request.
message RecognitionConfig {
  // Select the language in which the audio will be provided.
  // Language codes follow the [BCP-47](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) standard.
  // This field can be omitted, but if omitted, the default value "ko-KR" will be applied.
  // Supported language codes:
  // - ko-KR
  // - en-US
  string language_code = 1;

  // If `true`, interim partial results are returned as soon as they are
  // available during an utterance (these carry the `is_final=false` flag).
  // If `false` or omitted, only results with `is_final=true` are returned.
  bool interim_results = 2;
}

// `StreamingRecognizeResponse` is the only message returned to the client by `StreamingRecognize`. 
// Zero or more `StreamingRecognizeResponse` messages are streamed to the client.
// Each response carries the `result` for the audio currently being processed, along with the total processed duration.
message StreamingRecognizeResponse {
  // The result corresponding to the audio portion currently being processed.
  StreamingRecognitionResult result = 1;

  // Total audio duration for the stream processed (in seconds).
  float total_duration = 2;
}

// Streaming speech recognition results corresponding to the audio portion currently being processed.
message StreamingRecognitionResult {
  // Transcript text representing the words spoken by the user.
  // In languages where words are separated by spaces, this transcript may begin with a space if it is not the first result.
  // Concatenating the results in order yields the entire transcript without additional delimiters.
  string transcript = 1;

  // If `false`, this `StreamingRecognitionResult` is a partial result that may still change.
  // If `true`, it is the finalized result for the audio segment from the previous
  // final result up to the end of one completed utterance.
  bool is_final = 2;
}

For an example Python client using Protobuf, see here.
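As a minimal sketch of the ordering rule above (the first request carries `config`, every later request carries `audio_content`), the generator below builds the request stream. `make_request` stands in for the generated `speech_pb2.StreamingRecognizeRequest` constructor, so the ordering logic is shown independently of the generated code.

```python
def streaming_requests(audio_chunks, make_request,
                       language_code="ko-KR", interim_results=True):
    """Yield requests in the order StreamingRecognize expects.

    The first message carries only `config`; every later message carries
    only `audio_content` (LINEAR16, 16000 Hz, mono).
    """
    # First message: configuration only, no audio.
    yield make_request(config={"language_code": language_code,
                               "interim_results": interim_results})
    # Subsequent messages: consecutive raw audio fragments.
    for chunk in audio_chunks:
        yield make_request(audio_content=chunk)
```

With the generated classes this can be passed straight to the stub, e.g. `stub.StreamingRecognize(streaming_requests(chunks, speech_pb2.StreamingRecognizeRequest), metadata=metadata)`; proto3's Python constructors accept a dict for message-typed fields such as `config`.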

4. Authorization

Authorization for the Realtime STT API uses Bearer authentication, using a token issued from the API Console.

When calling the StreamingRecognize RPC, you must add the authorization: Bearer <API_TOKEN> value as metadata (header).
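A sketch of attaching the token in a Python client; `YOUR_API_TOKEN` is a placeholder for the token issued from the API Console. Note that gRPC metadata keys must be lowercase.

```python
# Placeholder for the token issued from the API Console.
API_TOKEN = "YOUR_API_TOKEN"

# gRPC call metadata: a sequence of (key, value) pairs; keys are lowercase.
metadata = (("authorization", f"Bearer {API_TOKEN}"),)

# Passed per call on the stub generated from the service definition, e.g.:
# responses = stub.StreamingRecognize(requests, metadata=metadata)
```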

5. Troubleshooting

  • If an error occurs, the gRPC stream terminates. Implement reconnection logic in your client so recognition can resume after a disconnect.
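One way to implement such reconnection logic is a retry loop with exponential backoff, sketched below. `open_stream` is a hypothetical callable that opens `StreamingRecognize` and consumes responses until the stream ends or raises (e.g. `grpc.RpcError`); the retry limits and delays are illustrative, not values mandated by the API.

```python
import time


def run_with_reconnect(open_stream, max_retries=5, base_delay=0.5):
    """Reopen the stream when it fails; a sketch, not an official client.

    open_stream: callable that opens StreamingRecognize and consumes
    responses until the stream ends normally or raises an error.
    """
    attempt = 0
    while True:
        try:
            open_stream()
            return  # stream finished normally
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise  # give up after too many consecutive failures
            # Exponential backoff, capped at 10 seconds between attempts.
            time.sleep(min(base_delay * 2 ** (attempt - 1), 10.0))
```

In a real client you would typically catch `grpc.RpcError` specifically, and resume sending audio from the point of the last `is_final=true` result.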