Options
The documentation below covers customizing the app in detail. If you are just getting started, check the getting started guide.
You can choose from the following actions:
- Stay on top
- Hide
- Exit
You can choose from the following actions:
- Settings (Shortcut: F2): open the settings menu
- Log (Shortcut: Ctrl + F1): open the log window
- Export Directory: open the export directory
- Log Directory: open the log directory
- Model Directory: open the model directory
You can choose from the following actions:
- Transcribed speech subtitle window (Shortcut: F3): shows the result of the transcription in the recording session, but in a detached window that acts like a subtitle box.
- Translated speech subtitle window (Shortcut: F4): shows the result of the translation in the recording session, but in a detached window that acts like a subtitle box.
Preview:
Windows users can further customize it to remove the background by right-clicking the window and choosing the clickthrough/transparent option.
You can choose from the following actions:
- About (Shortcut: F1)
- Open Documentation / Wiki
- Visit Repository
- Check for Update
Select the model for transcription. You can choose between the following:
- Tiny
- Base
- Small
- Medium
- Large
Each model has different requirements and produces different results. For more information, you can check the whisper repository directly.
Select the method for translation.
- Whisper (to English only, from the 99 available languages)
- Google Translate (133 target languages, 94 of which are compatible with Whisper as a source language)
- LibreTranslate v1.5.1 (45 target languages, 43 of which are compatible with Whisper as a source language)
- MyMemoryTranslator (127 target languages, 93 of which are compatible with Whisper as a source language)
Set the language to translate from. The languages available in this option differ depending on the method selected in the Translate option.
Set the language to translate to. The languages available in this option differ depending on the method selected in the Translate option.
Swap the languages in the From and To options. This will also swap the textbox results.
Clear the textbox result.
Set the device Host API for recording.
Set the mic device for recording. The devices available differ depending on the Host API selected in the HostAPI option.
Set the speaker device for recording. The devices available differ depending on the Host API selected in the HostAPI option. (Only on Windows 8 and above)
Set the task to do when recording. The tasks available are:
- Transcribe
- Translate
Set the input for recording. The inputs available are:
- Microphone
- Speaker
Copy the textbox result.
Open the tool dropdown menu. The tools available are:
- Export recorded results
- Align results
- Refine results
- Translate results
Start recording. The button will change to Stop while recording.
Import a file to transcribe; this will open its own modal window.
Whether to check for a new update on every app startup. (Default checked)
Whether to ask for confirmation when the record button is pressed.
Whether to suppress the notification showing that the app is now hidden in the tray. (Default unchecked)
Whether to suppress any device-related warnings that might show up. (Default unchecked)
Whether to show the audio input visualizer when recording. (Default checked)
Whether to show the audio input visualizer in the settings menu. (Default checked)
By default, the app is bundled with the Sun Valley custom theme. You should also be able to add custom themes, with some limitations; instructions are located in the readme in the theme folder.
Set the log folder location by pressing the button on the right. Actions available:
- Open folder
- Change log folder
- Set back to default
- Empty log folder
Whether to log the record session verbosely. (Default unchecked)
Whether to keep the log files. If unchecked, the log files will be deleted every time the app runs. (Default unchecked)
Set the log level. (Default DEBUG)
Whether to show debug logs for the record session. Enabling this might slow down the app. (Default unchecked)
Whether to save the audio recorded during the record session into the debug folder, located at speechtranslate/debug. The audio will be saved there as .wav files and, if unchecked, will be deleted automatically on every run. Enabling this might slow down the app. (Default unchecked)
Whether to show debug logs for the translation session. (Default unchecked)
Set the model folder location by pressing the button on the right. Actions available:
- Open folder
- Change model folder
- Set back to default
- Empty model folder
- Download model
Whether to automatically check if the model is available when the settings menu is first opened. (Default unchecked)
You can download a model by pressing the download button. Each model has different requirements and produces different results. You can read more about it here.
Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
---|---|---|---|---|---|
tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
base | 74 M | base.en | base | ~1 GB | ~16x |
small | 244 M | small.en | small | ~2 GB | ~6x |
medium | 769 M | medium.en | medium | ~5 GB | ~2x |
large | 1550 M | N/A | large | ~10 GB | 1x |
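For reference, these names map directly to the model names accepted by openai-whisper. A minimal sketch of downloading and loading one into a custom folder (the folder and file names here are placeholders; the app manages its own model folder internally):

```python
import whisper

# Downloads the model on first use into the given folder, then loads it.
model = whisper.load_model("tiny", download_root="models")
result = model.transcribe("audio.wav")
print(result["text"])
```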
Note
Speaker input only works on Windows 8 and above.
Alternatively, you can create a loopback to capture your system audio as a virtual input (like a mic input) using one of these guides/tools: (Voicemeeter on Windows) - (YT Tutorial) - (pavucontrol on Ubuntu with PulseAudio) - (BlackHole on macOS)
Set the sample rate for the input device. (Default 16000)
Set the channels for the input device. (Default 1)
Set the chunk size for the input device. (Default 1024)
Whether to automatically set the sample rate based on the input device. (Default unchecked for microphone and checked for speaker)
Whether to automatically set the channels based on the input device. (Default unchecked for microphone and checked for speaker)
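For context, here is a minimal sketch of how these three values are typically used together when opening an input stream. PyAudio is an assumption here for illustration; the app may use a patched variant for speaker loopback:

```python
import pyaudio

SAMPLE_RATE = 16000  # "Sample Rate" option
CHANNELS = 1         # "Channels" option
CHUNK_SIZE = 1024    # "Chunk Size" option: frames read per call

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=CHANNELS,
                rate=SAMPLE_RATE, input=True,
                frames_per_buffer=CHUNK_SIZE)
data = stream.read(CHUNK_SIZE)  # one chunk of raw audio bytes
stream.stop_stream()
stream.close()
p.terminate()
```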
Set the rate for transcribing the audio, in milliseconds. (Default 300)
Conversion method used to feed audio to the whisper model. (Default Numpy Array)
Numpy array is the default and recommended method: it is faster and more efficient. If you hit device- or conversion-related errors in the record session, try the temporary wav file method instead; it is a little slower and less efficient but might be more accurate in some cases. When using a wav file, the I/O on the recorded file might slow down the app significantly, especially on long buffers. Both settings resample the audio to a 16 kHz sample rate; the difference is that the numpy array method uses scipy to resample the audio, while the temporary wav file method uses whisper's default.
Use a numpy array to feed the model. This method is faster because there is no need to write the audio out to a wav file.
Use a temporary wav file to feed the model. This might slow down the process because of the file I/O operation, but it might help fix device-related errors (which rarely happen). When both VAD and Demucs are enabled in the record session, this option will be used automatically.
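A minimal sketch of the numpy-array path, assuming 16-bit PCM input (the function and variable names are illustrative, not the app's actual code):

```python
import numpy as np
from scipy.signal import resample

def bytes_to_whisper_input(raw: bytes, src_rate: int) -> np.ndarray:
    # int16 PCM -> float32 in [-1.0, 1.0], the range whisper expects
    audio = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    # resample to whisper's 16 kHz using scipy, as the option describes
    target_len = int(len(audio) * 16000 / src_rate)
    return resample(audio, target_len).astype(np.float32)
```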
Set the minimum buffer input (in seconds) for the input to be considered valid. The input must be at least this many seconds long before being passed to Whisper. (Default 0.4)
Set the maximum buffer size for the audio, in seconds. (Default 10)
Set the maximum number of sentences. One sentence equals one buffer, so if the max buffer is 10 seconds, the words within those 10 seconds form one sentence. (Default 5)
If enabled, removes the limit on the results kept in memory when recording.
Whether to enable the threshold. If enabled, the app will only transcribe the audio when it is above the threshold. (Default checked)
If set to auto, VAD (voice activity detection) will be used for the threshold. The VAD uses WebRTC VAD through py-webrtcvad. (Default checked)
If set to auto, the user needs to select the VAD sensitivity to filter out noise; the higher the sensitivity, the more noise is filtered out. If not set to auto, the user needs to set the threshold manually.
Whether to automatically break the buffer when silence is detected for more than 1 second.
Whether to use Silero VAD alongside WebRTC VAD. (Note that this option might not be available for every device option)
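For context, a minimal sketch of a WebRTC VAD check via py-webrtcvad (the frame length and sensitivity below are illustrative assumptions, not the app's settings):

```python
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness: 0 (least) to 3 (most)
SAMPLE_RATE = 16000     # WebRTC VAD accepts 8000/16000/32000/48000 Hz
FRAME_MS = 30           # frames must be 10, 20, or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono samples

def is_speech(frame: bytes) -> bool:
    assert len(frame) == FRAME_BYTES
    return vad.is_speech(frame, SAMPLE_RATE)
```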
Set the separator for the text result. (Default \n)
Whether to use faster-whisper. (Default checked)
Set the decoding preset. (Default Beam Search). You can choose between the following:
- Greedy: sets the temperature parameter to 0.0, with best of, beam size, and patience all set to none
- Beam Search: sets the temperature parameter with a fallback of 0.2, so the temperatures are 0.0, 0.2, 0.4, 0.6, 0.8, 1.0; both best of and beam size are set to 3, and patience is set to 1
- Custom: set your own decoding options
Temperature to use for sampling.
Number of candidates when sampling with non-zero temperature.
Number of beams in beam search; only applicable when temperature is zero.
If the gzip compression ratio is higher than this value, treat the decoding as failed. (Default 2.4)
If the average log probability is lower than this value, treat the decoding as failed. (Default -1.0)
If the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to logprob_threshold, consider the segment as silence. (Default 0.72)
Optional text to provide as a prompt for the first window. (Default empty)
Optional text to prefix the current context. (Default empty)
Comma-separated list of token ids to suppress during sampling. '-1' will suppress most special characters except common punctuations. (Default empty)
Maximum initial timestamp to use for the first window. (Default 1.0)
If true, suppresses blank output. (Default checked)
If true, provides the previous output of the model as a prompt for the next window; disabling this may make the text inconsistent across windows. (Default checked)
If true, uses fp16 for inference. (Default checked)
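To illustrate how the options above fit together, here is a minimal sketch of an equivalent call using plain openai-whisper with the Beam Search preset values (an assumption for illustration; the app itself may route these options through stable-ts or faster-whisper, and the file name is a placeholder):

```python
import whisper

model = whisper.load_model("tiny")
result = model.transcribe(
    "audio.wav",
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # Beam Search fallback sequence
    best_of=3,
    beam_size=3,
    patience=1.0,
    compression_ratio_threshold=2.4,
    logprob_threshold=-1.0,
    no_speech_threshold=0.72,
    condition_on_previous_text=True,
    fp16=True,
)
print(result["text"])
```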
Command line arguments / parameters to be used. It has the same options as when using stable-ts with the CLI, but with some parameters removed because they are set in the app / GUI. All of the parameters are:
# [device]
* description: device to use for PyTorch inference (A Cuda compatible GPU and PyTorch with CUDA support are still required for GPU / CUDA)
* type: str, default cuda
* usage: --device cpu
# [cpu_preload]
* description: load model into CPU memory first then move model to specified device; this reduces GPU memory usage when loading model.
* type: bool, default True
* usage: --cpu_preload True
# [dynamic_quantization]
* description: whether to apply Dynamic Quantization to model to reduce memory usage (~half less) and increase inference speed at cost of slight decrease in accuracy; Only for CPU; NOTE: overhead might make inference slower for models smaller than 'large'
* type: bool, default False
* usage: --dynamic_quantization
# [prepend_punctuations]
* description: Punctuations to prepend to the next word
* type: str, default "'“¿([{-"
* usage: --prepend_punctuations "<punctuation>"
# [append_punctuations]
* description: Punctuations to append to the previous word
* type: str, default "\"'.。,,!!??::”)]}、"
* usage: --append_punctuations "<punctuation>"
# [gap_padding]
* description: padding to prepend to each segment for word timing alignment; used to reduce the probability of the model predicting timestamps earlier than the first utterance
* type: str, default " ..."
* usage: --gap_padding "padding"
# [word_timestamps]
* description: extract word-level timestamps using the cross-attention pattern and dynamic time warping, and include the timestamps for each word in each segment; disabling this will prevent segments from splitting/merging properly.
* type: bool, default True
* usage: --word_timestamps True
# [regroup]
* description: whether to regroup all words into segments with more natural boundaries; specify a string for customizing the regrouping algorithm; ignored if [word_timestamps]=False.
* type: str, default "True"
* usage: --regroup "regroup_option"
# [ts_num]
* description: number of extra inferences to perform to find the mean timestamps
* type: int, default 0
* usage: --ts_num <number>
# [ts_noise]
* description: percentage of noise to add to audio_features to perform inferences for [ts_num]
* type: float, default 0.1
* usage: --ts_noise 0.1
# [suppress_silence]
* description: whether to suppress timestamps where audio is silent at segment-level and word-level if [suppress_word_ts]=True
* type: bool, default True
* usage: --suppress_silence True
# [suppress_word_ts]
* description: whether to suppress timestamps where audio is silent at word-level; ignored if [suppress_silence]=False
* type: bool, default True
* usage: --suppress_word_ts True
# [suppress_ts_tokens]
* description: whether to use silence mask to suppress silent timestamp tokens during inference; increases word accuracy in some cases, but tends to reduce 'verbatimness' of the transcript; ignored if [suppress_silence]=False
* type: bool, default False
* usage: --suppress_ts_tokens True
# [q_levels]
* description: quantization levels for generating timestamp suppression mask; acts as a threshold to marking sound as silent; fewer levels will increase the threshold of volume at which to mark a sound as silent
* type: int, default 20
* usage: --q_levels <number>
# [k_size]
* description: Kernel size for average pooling waveform to generate suppression mask; recommend 5 or 3; higher sizes will reduce detection of silence
* type: int, default 5
* usage: --k_size 5
# [time_scale]
* description: factor for scaling audio duration for inference; greater than 1.0 'slows down' the audio; less than 1.0 'speeds up' the audio; 1.0 is no scaling
* type: float
* usage: --time_scale <value>
# [vad]
* description: whether to use Silero VAD to generate timestamp suppression mask; Silero VAD requires PyTorch 1.12.0+; Official repo: https://github.com/snakers4/silero-vad
* type: bool, default False
* usage: --vad True
# [vad_threshold]
* description: threshold for detecting speech with Silero VAD. (Default: 0.35); low threshold reduces false positives for silence detection
* type: float, default 0.35
* usage: --vad_threshold 0.35
# [vad_onnx]
* description: whether to use ONNX for Silero VAD
* type: bool, default False
* usage: --vad_onnx True
# [min_word_dur]
* description: only allow suppressing timestamps that result in word durations greater than this value
* type: float, default 0.1
* usage: --min_word_dur 0.1
# [demucs]
* description: whether to reprocess the audio track with Demucs to isolate vocals/remove noise; Demucs official repo: https://github.com/facebookresearch/demucs
* type: bool, default False
* usage: --demucs True
# [demucs_output]
* description: path(s) to save the vocals isolated by Demucs as WAV file(s); ignored if [demucs]=False
* type: str
* usage: --demucs_output "<path>"
# [only_voice_freq]
* description: whether to only use sound between 200 - 5000 Hz, where the majority of human speech is.
* type: bool
* usage: --only_voice_freq True
# [strip]
* description: whether to remove spaces before and after text on each segment for output
* type: bool, default True
* usage: --strip True
# [tag]
* description: a pair of tags used to change the properties of a word at its predicted time; SRT Default: '<font color=\"#00ff00\">', '</font>'; VTT Default: '<u>', '</u>'; ASS Default: '{\\1c&HFF00&}', '{\\r}'
* type: str
* usage: --tag "<start_tag> <end_tag>"
# [reverse_text]
* description: whether to reverse the order of words for each segment of text output
* type: bool, default False
* usage: --reverse_text True
# [font]
* description: word font for ASS output(s)
* type: str, default 'Arial'
* usage: --font "<font_name>"
# [font_size]
* description: word font size for ASS output(s)
* type: int, default 48
* usage: --font_size 48
# [karaoke]
* description: whether to use progressive filling highlights for karaoke effect (only for ASS outputs)
* type: bool, default False
* usage: --karaoke True
# [threads]
* description: number of threads used by torch for CPU inference; supersedes MKL_NUM_THREADS/OMP_NUM_THREADS
* type: int
* usage: --threads <value>
# [mel_first]
* description: process the entire audio track into a log-Mel spectrogram first instead of in chunks
* type: bool
* usage: --mel_first
# [demucs_option]
* description: Extra option(s) to use for Demucs; Replace True/False with 1/0; E.g. --demucs_option "shifts=3" --demucs_option "overlap=0.5"
* type: str
* usage: --demucs_option "<option>"
# [refine_option]
* description: Extra option(s) to use for refining timestamps; Replace True/False with 1/0; E.g. --refine_option "steps=sese" --refine_option "rel_prob_decrease=0.05"
* type: str
* usage: --refine_option "<option>"
# [model_option]
* description: Extra option(s) to use for loading the model; Replace True/False with 1/0; E.g. --model_option "in_memory=1" --model_option "cpu_threads=4"
* type: str
* usage: --model_option "<option>"
# [transcribe_option]
* description: Extra option(s) to use for transcribing/alignment; Replace True/False with 1/0; E.g. --transcribe_option "ignore_compatibility=1"
* type: str
* usage: --transcribe_option "<option>"
# [save_option]
* description: Extra option(s) to use for text outputs; Replace True/False with 1/0; E.g. --save_option "highlight_color=ffffff"
* type: str
* usage: --save_option "<option>"
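For example, to run inference on the CPU with Silero VAD and word timestamps enabled, the parameter field could be set to an illustrative combination of the flags documented above, such as:

```
--device cpu --vad True --vad_threshold 0.35 --word_timestamps True
```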
Whether to enable postprocessing of the result by filtering it.
The path to the filter file (.json) containing all the filters in the different languages supported by Whisper. A base filter file is provided by default; users can customize it if they want.
Punctuation to ignore when filtering. (Default "',.?!)
Whether to strip any spaces when filtering. (Default checked)
Whether the case of the string needs to match. (Default unchecked)
Similarity rate to use when not using "Exact match" for comparing the filter and the result in the segment. (Default 0.75)
Whether the string needs to be exactly the same as the result in the segment to be removed. (Default unchecked for record and checked for file import)
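A minimal sketch of what a 0.75 similarity comparison could look like; SequenceMatcher is an assumption for illustration, and the app may use a different similarity metric:

```python
from difflib import SequenceMatcher

def matches_filter(segment: str, filter_text: str,
                   threshold: float = 0.75, ignore_case: bool = True) -> bool:
    # optionally strip spaces and ignore case, mirroring the options above
    a, b = segment.strip(), filter_text.strip()
    if ignore_case:
        a, b = a.lower(), b.lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(matches_filter("thanks for watching!", "Thanks for watching"))  # True
```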
Set the mode for export. You can choose between the following:
- Segment level
- Word level
segment_level=True + word_level=True
00:00:07.760 --> 00:00:09.900
But<00:00:07.860> when<00:00:08.040> you<00:00:08.280> arrived<00:00:08.580> at<00:00:08.800> that<00:00:09.000> distant<00:00:09.400> world,
segment_level=True + word_level=False
00:00:07.760 --> 00:00:09.900
But when you arrived at that distant world,
segment_level=False + word_level=True
00:00:07.760 --> 00:00:07.860
But
00:00:07.860 --> 00:00:08.040
when
00:00:08.040 --> 00:00:08.280
you
00:00:08.280 --> 00:00:08.580
arrived
...
Can choose between the following:
- Text
- Json
- SRT
- ASS
- VTT
- TSV
- CSV
It is recommended to always have the json output enabled, just in case you want to further modify the results with the tool menu in the main menu.
Whether to visualize which parts of the audio will likely be suppressed (i.e. marked as silent).
Set the export folder location
Whether to automatically open the export folder after a file import.
Whether to enable removing words that repeat consecutively.
Example 1: "This is is is a test." -> "This is a test." If you set max words to 1, it will remove the last two "is".
Example 2: "This is is is a test this is a test." -> "This is a test." If you set max words to 4, it will remove the second "is" and third "is", then remove the last "this is a test", because that phrase consists of 4 words and the max words is 4.
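A minimal sketch of this repeat-removal logic (illustrative only; the app's actual implementation may differ, for example in how punctuation and case are handled):

```python
import string

def _norm(word: str) -> str:
    # compare case-insensitively, ignoring surrounding punctuation
    return word.strip(string.punctuation).lower()

def remove_consecutive_repeats(text: str, max_words: int = 1) -> str:
    words = text.split()
    changed = True
    while changed:
        changed = False
        for n in range(max_words, 0, -1):          # try longest phrases first
            for i in range(len(words) - 2 * n + 1):
                left = [_norm(w) for w in words[i:i + n]]
                right = [_norm(w) for w in words[i + n:i + 2 * n]]
                if left == right:                   # phrase immediately repeats
                    del words[i + n:i + 2 * n]      # drop the later copy
                    changed = True
                    break
            if changed:
                break
    return " ".join(words)

print(remove_consecutive_repeats("This is is is a test.", max_words=1))
# -> "This is a test."
```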
Set the maximum number of words allowed in each segment. (Default unset)
Set the maximum number of characters allowed in each segment. (Default unset)
Whether to use a line break or to split into separate segments at split points. (Default Split)
Whether to evenly split a segment in length if it exceeds max_chars or max_words.
Amount to slice the filename from the start
Amount to slice the filename from the end
Set the filename export format. It is recommended to always include one of the task formats, because without it the files might get mixed up and could be overwritten. The following are the options for all the export formats:
Default value: %Y-%m-%d %f {file}/{task-lang}
To folderize the result, you can use / in the format. Example: {file}/{task-lang-with}
Available parameters:
----- Parameters that can be used in any situation -----
{strftime format such as %Y %m %d %H %M %f ...}
To see the full list of strftime format, see https://strftime.org/
{file}
Will be replaced with the file name
{lang-source}
Will be replaced with the source language if available.
Example: english
{lang-target}
Will be replaced with the target language if available.
Example: french
{transcribe-with}
Will be replaced with the transcription model name if available.
Example: tiny
{translate-with}
Will be replaced with the translation engine name if available.
Example: google translate
----------- Parameters only related to task ------------
{task}
Will be replaced with the task name.
Example: transcribed or translated
{task-lang}
Will be replaced with the task name alongside the language.
Example: transcribed english or translated english to french
{task-with}
Will be replaced with the task name alongside the model or engine name.
Example: transcribed with tiny or translated with google translate
{task-lang-with}
Will be replaced with the task name alongside the language and model or engine name.
Example: transcribed english with tiny or translated english to french with google translate
{task-short}
Will be replaced with the shortened task name.
Example: tc or tl
{task-short-lang}
Will be replaced with the shortened task name alongside the language.
Example: tc english or tl english to french
{task-short-with}
Will be replaced with the shortened task name alongside the model or engine name.
Example: tc tiny or tl google translate
{task-short-lang-with}
Will be replaced with the shortened task name alongside the language and the model or engine name.
Example: tc english with tiny or tl english to french with google translate
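A minimal sketch of how such a format string could expand (illustrative; the parameter values below are examples, not the app's code):

```python
from datetime import datetime

fmt = "%Y-%m-%d %f {file}/{task-lang}"
params = {
    "{file}": "interview",           # source file name
    "{task-lang}": "transcribed english",
}
# strftime fills in the date/time directives; braces pass through untouched
name = datetime.now().strftime(fmt)
for key, value in params.items():
    name = name.replace(key, value)
print(name)  # e.g. "2024-01-31 123456 interview/transcribed english"
```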
Set the proxy list for HTTPS. Proxies are separated by a new line, tab, or space.
Set the proxy list for HTTP. Proxies are separated by a new line, tab, or space.
Set the host for LibreTranslate. Examples:
- If you are hosting it locally, you can set it to http://127.0.0.1:5000
- If you are using the official instance, you can set it to https://libretranslate.com
Set the API key for LibreTranslate.
Whether to suppress the warning when the API key is empty.
Set the maximum number of characters shown in the textbox.
Set the maximum number of characters shown per line in the textbox.
Set the font for the textbox.
Whether to colorize the text based on the confidence value when available. (Default checked)
Whether to automatically scroll to the bottom when new text is added.
Set the color for low confidence values. (Default #ff0000)
Set the color for high confidence values. (Default #00ff00)
Set what the colorizing applies to. You can choose between the following:
- Segment
- Word
You can only choose one of the options.
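For context, a minimal sketch of confidence-based colorizing as a linear interpolation between the low- and high-confidence colors (an assumption about how the gradient works; the app may map confidence differently):

```python
def confidence_color(conf: float, low: str = "#ff0000", high: str = "#00ff00") -> str:
    # conf is expected in [0.0, 1.0]; blend each RGB channel linearly
    lo = [int(low[i:i + 2], 16) for i in (1, 3, 5)]
    hi = [int(high[i:i + 2], 16) for i in (1, 3, 5)]
    mix = [round(l + (h - l) * conf) for l, h in zip(lo, hi)]
    return "#%02x%02x%02x" % tuple(mix)

print(confidence_color(0.9))  # greenish for high confidence
```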
Go to: Download Page - Wiki Home - Code