Our company is building its web conferencing system in-house, and I spent half a day agonizing over whether to use the Whisper API or the Speech-to-Text API for speech transcription. Whisper, at least in its open-source CLI form, runs on the client and is free to use, but its performance depends on each machine's specs, so we can't guarantee reliability across the whole network. We would also need a batch job to separate the audio from the video.
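For the audio-separation batch, a rough sketch might look like the following. It assumes ffmpeg is installed and on the PATH; the function names, the set of video extensions, and the choice of 16 kHz mono WAV output are my own illustrative assumptions, not something settled for the actual system.

```python
import subprocess
from pathlib import Path
from typing import List

# Assumed recording formats; adjust to whatever the conferencing system produces.
VIDEO_SUFFIXES = {".mp4", ".mkv", ".mov", ".webm"}


def build_extract_command(video: Path, out_dir: Path) -> List[str]:
    """Build an ffmpeg command that drops the video stream and writes mono 16 kHz WAV."""
    target = out_dir / (video.stem + ".wav")
    return [
        "ffmpeg",
        "-y",              # overwrite an existing output file
        "-i", str(video),  # input recording
        "-vn",             # discard the video stream
        "-ac", "1",        # downmix to one channel
        "-ar", "16000",    # resample to 16 kHz
        str(target),
    ]


def extract_all(src_dir: Path, out_dir: Path) -> List[Path]:
    """Run the extraction for every video file in src_dir (hypothetical batch entry point)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for video in sorted(src_dir.iterdir()):
        if video.suffix.lower() not in VIDEO_SUFFIXES:
            continue
        cmd = build_extract_command(video, out_dir)
        subprocess.run(cmd, check=True)  # raises if ffmpeg exits with an error
        written.append(Path(cmd[-1]))
    return written
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to check the arguments without having ffmpeg available.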

I tried out speech transcription with the Whisper CLI. I tested as many of the major web conferencing tools as I could, and Whisper, as you'd expect from OpenAI, is a clear step ahead of the rest. Even the base model reaches this level of accuracy. It has no speaker diarization, but the diarization accuracy of the other services is questionable to begin with. Unsurprisingly, accuracy is higher for English.
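For reference, a trial run like this could be scripted roughly as below. This is only a sketch: it assumes the open-source `openai-whisper` package is installed so that the `whisper` command is on the PATH, and the helper names and file paths are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_whisper_command(
    audio: Path,
    out_dir: Path,
    model: str = "base",
    language: Optional[str] = None,
) -> List[str]:
    """Build a Whisper CLI invocation that writes a plain-text transcript."""
    cmd = [
        "whisper", str(audio),
        "--model", model,           # "base" is the model size mentioned above
        "--output_format", "txt",
        "--output_dir", str(out_dir),
    ]
    if language is not None:
        # Without --language, Whisper tries to detect the language itself.
        cmd += ["--language", language]
    return cmd


def transcribe(audio: Path, out_dir: Path, language: Optional[str] = None) -> None:
    """Run the CLI and raise if it exits with an error (hypothetical wrapper)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_whisper_command(audio, out_dir, language=language), check=True)
```

Calling `transcribe(Path("audio/meeting.wav"), Path("transcripts"), language="ja")` would run the base model on one file. Swapping in a larger model is a one-argument change, at the cost of more load on the client machine.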