模型编程能力评测
提示词
grop 语音识别文档
Speech
Groq API is the fastest speech-to-text solution available, offering OpenAI-compatible endpoints that enable real-time transcriptions and translations. With Groq API, you can integrate high-quality audio processing into your applications at speeds that rival human interaction.
API Endpoints
We support two endpoints:
Endpoint Usage API Endpoint
Transcriptions Convert audio to text https://api.groq.com/openai/v1/audio/transcriptions
Translations Translate audio to English text https://api.groq.com/openai/v1/audio/translations
Supported Models
Model ID Model Supported Language(s) Description
whisper-large-v3-turbo Whisper Large V3 Turbo Multilingual A fine-tuned version of a pruned Whisper Large V3 designed for fast, multilingual transcription tasks.
distil-whisper-large-v3-en Distil-Whisper English English-only A distilled, or compressed, version of OpenAI's Whisper model, designed to provide faster, lower cost English speech recognition while maintaining comparable accuracy.
whisper-large-v3 Whisper large-v3 Multilingual Provides state-of-the-art performance with high accuracy for multilingual transcription and translation tasks.
Which Whisper Model Should You Use?
Having more choices is great, but let's try to avoid decision paralysis by breaking down the tradeoffs between models to find the one most suitable for your applications:
If your application is error-sensitive and requires multilingual support, use whisper-large-v3.
If your application is less sensitive to errors and requires English only, use distil-whisper-large-v3-en.
If your application requires multilingual support and you need the best price for performance, use whisper-large-v3-turbo.
The following table breaks down the metrics for each model.
Model Cost Per Hour Language Support Transcription Support Translation Support Real-time Speed Factor Word Error Rate
whisper-large-v3 $0.111 Multilingual Yes Yes 189 10.3%
whisper-large-v3-turbo $0.04 Multilingual Yes No 216 12%
distil-whisper-large-v3-en $0.02 English only Yes No 250 13%
Working with Audio Files
Audio File Limitations
Max File Size
25 MB
Minimum File Length
0.01 seconds
Minimum Billed Length
10 seconds. If you submit a request less than this, you will still be billed for 10 seconds.
Supported File Types
`flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, `webm`
Single Audio Track
Only the first track will be transcribed for files with multiple audio tracks. (e.g. dubbed video)
Supported Response Formats
`json`, `verbose_json`, `text`
Audio Preprocessing
Our speech-to-text models will downsample audio to 16KHz mono before transcribing, which is optimal for speech recognition. This preprocessing can be performed client-side if your original file is extremely large and you want to make it smaller without a loss in quality (without chunking, Groq API speech endpoints accept up to 25MB). We recommend FLAC for lossless compression.
The following ffmpeg command can be used to reduce file size:
ffmpeg \
-i <your file> \
-ar 16000 \
-ac 1 \
-map 0:a \
-c:a flac \
<output file name>.flac
Working with Larger Audio Files
For audio files that exceed our size limits or require more precise control over transcription, we recommend implementing audio chunking. This process involves:
Breaking the audio into smaller, overlapping segments
Processing each segment independently
Combining the results while handling overlapping
To learn more about this process and get code for your own implementation, see the complete audio chunking tutorial in our Groq API Cookbook.
Using the API
The following are optional request parameters you can use in your transcription and translation requests:
Parameter Type Default Description
prompt string None Provide context or specify how to spell unfamiliar words (limited to 224 tokens).
response_format string json Define the output response format.
Set to verbose_json to receive timestamps for audio segments.
Set to text to return a text response.
temperature float None Specify a value between 0 and 1 to control the translation output.
language string None whisper-large-v3-turbo and whisper-large-v3 only!
Specify the language for transcription. Use ISO 639-1 language codes (e.g. "en" for English, "fr" for French, etc.). We highly recommend setting the language if you know it as specifying a language may improve transcription accuracy and speed.
Example Usage of Transcription Endpoint
The transcription endpoint allows you to transcribe spoken words in audio or video files.
Python
JavaScript
curl
The following is an example cURL request:
curl https://api.groq.com/openai/v1/audio/transcriptions \
-H "Authorization: bearer ${GROQ_API_KEY}" \
-F "file=@./sample_audio.m4a" \
-F model=whisper-large-v3-turbo \
-F temperature=0 \
-F response_format=json \
-F language=en
The following is an example response:
{
"text": "Your transcribed text appears here...",
"x_groq": {
"id": "req_unique_id"
}
}
Example Usage of Translation Endpoint
The translation endpoint allows you to translate spoken words in audio or video files to English.
Python
JavaScript
curl
The following is an example cURL request:
curl https://api.groq.com/openai/v1/audio/translations \
-H "Authorization: bearer ${GROQ_API_KEY}" \
-F "file=@./sample_audio.m4a" \
-F model=whisper-large-v3 \
-F prompt="Specify context or spelling" \
-F temperature=0 \
-F response_format=json
The following is an example response:
{
"text": "Your translated text appears here...",
"x_groq": {
"id": "req_unique_id"
}
}
Understanding Metadata Fields
When working with Groq API, setting response_format to verbose_json outputs each segment of transcribed text with valuable metadata that helps us understand the quality and characteristics of our transcription, including avg_logprob, compression_ratio, and no_speech_prob.
This information can help us with debugging any transcription issues. Let's examine what this metadata tells us using a real example:
{
"id": 8,
"seek": 3000,
"start": 43.92,
"end": 50.16,
"text": " document that the functional specification that you started to read through that isn't just the",
"tokens": [51061, 4166, 300, 264, 11745, 31256],
"temperature": 0,
"avg_logprob": -0.097569615,
"compression_ratio": 1.6637554,
"no_speech_prob": 0.012814695
}
As shown in the above example, we receive timing information as well as quality indicators. Let's gain a better understanding of what each field means:
id:8: The 9th segment in the transcription (counting begins at 0)
seek: Indicates where in the audio file this segment begins (3000 in this case)
start and end timestamps: Tell us exactly when this segment occurs in the audio (43.92 to 50.16 seconds in our example)
avg_logprob (Average Log Probability): -0.097569615 in our example indicates very high confidence. Values closer to 0 suggest better confidence, while more negative values (like -0.5 or lower) might indicate transcription issues.
no_speech_prob (No Speech Probability): 0.0.012814695 is very low, suggesting this is definitely speech. Higher values (closer to 1) would indicate potential silence or non-speech audio.
compression_ratio: 1.6637554 is a healthy value, indicating normal speech patterns. Unusual values (very high or low) might suggest issues with speech clarity or word boundaries.
Using Metadata for Debugging
When troubleshooting transcription issues, look for these patterns:
Low Confidence Sections: If avg_logprob drops significantly (becomes more negative), check for background noise, multiple speakers talking simultaneously, unclear pronunciation, and strong accents. Consider cleaning up the audio in these sections or adjusting chunk sizes around problematic chunk boundaries.
Non-Speech Detection: High no_speech_prob values might indicate silence periods that could be trimmed, background music or noise, or non-verbal sounds being misinterpreted as speech. Consider noise reduction when preprocessing.
Unusual Speech Patterns: Unexpected compression_ratio values can reveal stuttering or word repetition, speaker talking unusually fast or slow, or audio quality issues affecting word separation.
Quality Thresholds and Regular Monitoring
We recommend setting acceptable ranges for each metadata value we reviewed above and flagging segments that fall outside these ranges to be able to identify and adjust preprocessing or chunking strategies for flagged sections.
By understanding and monitoring these metadata values, you can significantly improve your transcription quality and quickly identify potential issues in your audio processing pipeline.
Prompting Guidelines
The prompt parameter (max 224 tokens) helps provide context and maintain a consistent output style. Unlike chat completion prompts, these prompts only guide style and context, not specific actions.
Best Practices
Provide relevant context about the audio content, such as the type of conversation, topic, or speakers involved.
Use the same language as the language of the audio file.
Steer the model's output by denoting proper spellings or emulate a specific writing style or tone.
Keep the prompt concise and focused on stylistic guidance.
我目的的代码实现
package com.litongjava.translator;
import java.io.ByteArrayOutputStream;
import java.util.function.Consumer;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;
import com.litongjava.translator.consts.AudioConst;
import com.litongjava.translator.utils.AudioUtils;
/**
* 录音器实现
*
* 1. 静音的判断:当前音频的分贝低于设定的阈值(默认 -7.5 dB)时认为是静音
* 2. 增加了开始、暂停、停止功能,分别由外部按钮控制
*/
public class AudioRecorder {
private static final float SAMPLE_RATE = 16000;
private static final int SAMPLE_SIZE = 16; // 16位单声道
private static final int SILENCE_DURATION = 1500; // 静音持续时间(ms)
// 静音阈值(单位:分贝),默认 -7.5 dB
private double silenceThreshold = AudioConst.DEFAULT_SILENCE_THRESHOLD;
private final Consumer<byte[]> dataConsumer;
// 录音控制标志
private volatile boolean running = false;
private volatile boolean paused = false;
private TargetDataLine line;
public AudioRecorder(Consumer<byte[]> dataConsumer) {
this.dataConsumer = dataConsumer;
}
/**
* 启动录音线程
* 如果已运行,则不会重复启动
*/
public void startRecording() {
if (running) {
return;
}
running = true;
paused = false;
new Thread(this::recordingLoop).start();
}
/**
* 录音主循环
*/
private void recordingLoop() {
try {
AudioFormat format = new AudioFormat(SAMPLE_RATE, SAMPLE_SIZE, 1, true, false);
DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
if (!AudioSystem.isLineSupported(info)) {
System.err.println("Line not supported");
running = false;
return;
}
line = (TargetDataLine) AudioSystem.getLine(info);
line.open(format);
line.start();
ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] buffer = new byte[1024];
long lastSoundTime = System.currentTimeMillis();
while (running) {
if (paused) {
// 暂停时,不读取数据,等待恢复
Thread.sleep(100);
continue;
}
int count = line.read(buffer, 0, buffer.length);
if (count > 0) {
byte[] data = new byte[count];
System.arraycopy(buffer, 0, data, 0, count);
// 将数据传给UI更新波形和状态
dataConsumer.accept(data);
// 判断是否静音
if (isSilence(data)) {
// 如果静音持续时间超过设置,则认为一句结束
if (System.currentTimeMillis() - lastSoundTime > SILENCE_DURATION) {
AudioProcessor.processAudio(out.toByteArray());
out.reset();
}
} else {
// 有声音则更新最后声音的时间戳
lastSoundTime = System.currentTimeMillis();
out.write(buffer, 0, count);
}
}
}
// 录音停止时关闭数据线
if (line != null) {
line.stop();
line.close();
}
} catch (LineUnavailableException | InterruptedException e) {
e.printStackTrace();
}
}
/**
* 判断一段音频数据是否为静音
*/
private boolean isSilence(byte[] buffer) {
short[] samples = AudioUtils.convertBytesToShorts(buffer);
if (samples.length == 0) {
return true; // 无数据视为静音
}
double rms = AudioUtils.calculateRMS(samples);
double db = 20 * Math.log10(rms + 1e-12);
return db < silenceThreshold;
}
public void setSilenceThreshold(double threshold) {
this.silenceThreshold = threshold;
}
public double getSilenceThreshold() {
return this.silenceThreshold;
}
// 控制接口
public void pauseRecording() {
paused = true;
}
public void resumeRecording() {
paused = false;
}
public void stopRecording() {
running = false;
}
public boolean isRunning() {
return running;
}
public boolean isPaused() {
return paused;
}
}
package com.litongjava.translator;
public class AudioProcessor {
public static void processAudio(byte[] audioData) {
// arc
String text = asr(audioData);
// 调用翻译
//String translation = Translator.translateToChinese(pythonResult);
// 更新UI
MainUI.updateUI(text);
}
private static String asr(byte[] audioData) {
return "1234";
}
}
package com.litongjava.translator;
import com.litongjava.translator.consts.AudioConst;
import com.litongjava.translator.utils.AudioUtils;
import javafx.application.Application;
import javafx.application.Platform;
import javafx.geometry.Insets;
import javafx.scene.Scene;
import javafx.scene.canvas.Canvas;
import javafx.scene.canvas.GraphicsContext;
import javafx.scene.control.Button;
import javafx.scene.control.Label;
import javafx.scene.control.Slider;
import javafx.scene.control.TextArea;
import javafx.scene.layout.GridPane;
import javafx.scene.layout.HBox;
import javafx.scene.paint.Color;
import javafx.stage.Stage;
public class MainUI extends Application {
private Label statusLabel = new Label("状态: 准备中");
private static TextArea translatedText = new TextArea();
private Canvas waveformCanvas;
private GraphicsContext gc;
private AudioRecorder audioRecorder;
@Override
public void start(Stage primaryStage) {
primaryStage.setTitle("实时语音翻译");
GridPane grid = new GridPane();
grid.setPadding(new Insets(10));
grid.setHgap(10);
grid.setVgap(10);
// 翻译结果区域
grid.add(new Label("翻译结果:"), 0, 0);
translatedText.setEditable(false);
translatedText.setWrapText(true);
grid.add(translatedText, 1, 0);
// 静音阈值设置
Slider thresholdSlider = new Slider(-60, 0, AudioConst.DEFAULT_SILENCE_THRESHOLD);
thresholdSlider.setBlockIncrement(1);
thresholdSlider.setMajorTickUnit(5);
thresholdSlider.setMinorTickCount(4);
thresholdSlider.setSnapToTicks(true);
thresholdSlider.setShowTickLabels(true);
thresholdSlider.setShowTickMarks(true);
grid.add(new Label("静音阈值:"), 0, 1);
grid.add(thresholdSlider, 1, 1);
// 状态显示
grid.add(statusLabel, 1, 2);
// 波形显示
grid.add(new Label("波形显示:"), 0, 3);
waveformCanvas = new Canvas(800, 150);
gc = waveformCanvas.getGraphicsContext2D();
grid.add(waveformCanvas, 1, 3);
// 三个控制按钮:开始、暂停、停止
Button startBtn = new Button("开始");
Button pauseBtn = new Button("暂停");
Button stopBtn = new Button("停止");
HBox buttonBox = new HBox(10, startBtn, pauseBtn, stopBtn);
grid.add(buttonBox, 1, 4);
Scene scene = new Scene(grid, 850, 500);
primaryStage.setScene(scene);
primaryStage.show();
// 添加窗口关闭时退出程序的处理逻辑
primaryStage.setOnCloseRequest(e -> {
if (audioRecorder != null && audioRecorder.isRunning()) {
audioRecorder.stopRecording();
}
Platform.exit();
System.exit(0);
});
// 初始化录音器
audioRecorder = new AudioRecorder(this::handleAudioData);
// 设置初始静音阈值
audioRecorder.setSilenceThreshold(thresholdSlider.getValue());
thresholdSlider.valueProperty().addListener((obs, oldVal, newVal) -> {
audioRecorder.setSilenceThreshold(newVal.doubleValue());
});
// 按钮事件处理
startBtn.setOnAction(e -> {
// 如果录音器已经启动但处于暂停状态,则恢复录音;否则启动新的录音线程
if (audioRecorder.isRunning() && audioRecorder.isPaused()) {
audioRecorder.resumeRecording();
statusLabel.setText("状态: 录音中");
} else if (!audioRecorder.isRunning()) {
audioRecorder.startRecording();
statusLabel.setText("状态: 录音中");
}
});
pauseBtn.setOnAction(e -> {
if (audioRecorder.isRunning() && !audioRecorder.isPaused()) {
audioRecorder.pauseRecording();
statusLabel.setText("状态: 暂停");
}
});
stopBtn.setOnAction(e -> {
if (audioRecorder.isRunning()) {
audioRecorder.stopRecording();
statusLabel.setText("状态: 停止");
}
});
}
/**
* 更新翻译文本(可供其他线程调用)
*/
public static void updateUI(String translated) {
Platform.runLater(() -> {
translatedText.setText(translated);
});
}
/**
* 从 AudioRecorder 回调的音频数据,更新UI线程中的状态和波形
*/
private void handleAudioData(byte[] audioData) {
Platform.runLater(() -> {
// 转换为 short 数组
short[] samples = AudioUtils.convertBytesToShorts(audioData);
if (samples.length == 0) {
return;
}
// 计算 RMS 并转换为 dB
double rms = AudioUtils.calculateRMS(samples);
double db = 20 * Math.log10(rms + 1e-12);
// 计算平均幅度 dB(可选,用于展示不同的音量测量方式)
double sum = 0;
for (short sample : samples) {
sum += Math.abs(sample);
}
double avg = sum / samples.length;
double volumeDb = 20 * Math.log10(avg / 32768.0 + 1e-12);
// 获取当前静音阈值
double threshold = audioRecorder.getSilenceThreshold();
// 根据阈值判断状态
String curStatus = (db < threshold) ? "静音" : "录音中";
String statusText = String.format("状态: %s | RMS分贝: %.1f dB | 平均音量: %.1f dB | 阈值: %.1f dB", curStatus, db, volumeDb, threshold);
statusLabel.setText(statusText);
// 绘制波形
drawWaveform(samples);
});
}
/**
* 绘制波形
*/
private void drawWaveform(short[] samples) {
double width = waveformCanvas.getWidth();
double height = waveformCanvas.getHeight();
double yCenter = height / 2;
gc.clearRect(0, 0, width, height);
gc.setStroke(Color.BLUE);
gc.setLineWidth(1);
gc.beginPath();
for (int i = 0; i < samples.length; i++) {
// 将 short 映射到 [-1, 1]
double sample = samples[i] / 32768.0;
double x = (double) i / samples.length * width;
double y = yCenter - (sample * yCenter);
if (i == 0) {
gc.moveTo(x, y);
} else {
gc.lineTo(x, y);
}
}
gc.stroke();
}
public static void main(String[] args) {
launch(args);
}
}
package com.litongjava.translator.utils;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
public class AudioUtils {
/**
* 将16位单声道音频字节数组转换为short数组
*/
public static short[] convertBytesToShorts(byte[] audioData) {
if (audioData.length < 2) {
return new short[0];
}
if (audioData.length % 2 != 0) {
byte[] adjusted = new byte[audioData.length - 1];
System.arraycopy(audioData, 0, adjusted, 0, adjusted.length);
audioData = adjusted;
}
short[] samples = new short[audioData.length / 2];
ByteBuffer bb = ByteBuffer.wrap(audioData).order(ByteOrder.LITTLE_ENDIAN);
bb.asShortBuffer().get(samples);
return samples;
}
/**
* 计算短数组的RMS值
*/
public static double calculateRMS(short[] samples) {
double sum = 0.0;
for (short s : samples) {
double normalized = s / 32768.0;
sum += normalized * normalized;
}
return Math.sqrt(sum / samples.length);
}
}
需要完成的功能
1.用户可以再页面上自主选择模型
2.基于okhttp封装GroqSpeechClient.
3. 在 AudioProcessor中 调用GroqSpeechClient进行语音识别
4. 增加请求限制 如果超出次数则 合并音频再发送,下面是请求限制信息
Speech To Text
ID Requests per Minute Requests per Day Audio Seconds per Hour Audio Seconds per Day
distil-whisper-large-v3-en 100 200,000 600,000 10,000,000
whisper-large-v3 100 130,000 160,000 4,000,000
whisper-large-v3-turbo 400 200,000 200,000 4,000,000
5.点击开始后保存录音文件 点击暂停不录音 点击关闭停止录音.
请完成代码