Speech to Text

admin · February 7, 2024, 12:30pm

Speech is a fundamental part of being human.

88.io provides tools to help in the push towards citizens not only owning their data but also the intelligence that comes from their data.

A major weakness of many voice recognition systems is that they send your voice to the Cloud for recognition or for training.

Speech is a simple and convenient user interface. Unfortunately, its use has been dominated by Cloud platforms, e.g. Apple, Google and Amazon.

By taking advantage of Partition AI, Private Cyberspaces come with their own independent Voice Recognition system, which you can train privately and use across different platforms.

Client Voice Recognition
Server Voice Recognition

With your own Entity Agent, you are the only one with access to your voice in order to:

Give Voice Commands to your Agent
Train your Agent to recognise your Voice Commands

NOTHING is sent out of your personal device. The process does NOT use any external APIs; all commands and their training remain on your device.

Speech Quality

We have tuned the STT to work even on the traditional telephone network (using A-law codec with 8kHz sampling rate).
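
As a rough illustration of what telephone-quality audio involves, the sketch below expands 8-bit G.711 A-law bytes into 16-bit linear PCM samples, which is the form most STT engines expect as input. It follows the commonly used reference decoding steps; the function names are illustrative and this is not the 88.io implementation.

```python
SAMPLE_RATE_HZ = 8000  # telephone network sampling rate mentioned above


def alaw_to_linear(a_val: int) -> int:
    """Expand one 8-bit G.711 A-law byte to a 16-bit signed PCM sample."""
    a_val ^= 0x55                      # undo the even-bit inversion applied on the wire
    t = (a_val & 0x0F) << 4            # 4-bit mantissa
    seg = (a_val & 0x70) >> 4          # 3-bit segment (exponent)
    if seg == 0:
        t += 8
    else:
        t = (t + 0x108) << (seg - 1)
    return t if a_val & 0x80 else -t   # in A-law a set sign bit means positive


def decode_alaw(data: bytes) -> list[int]:
    """Decode a buffer of A-law bytes into a list of PCM samples."""
    return [alaw_to_linear(b) for b in data]


def duration_seconds(data: bytes) -> float:
    """One A-law byte is one sample, so duration is length / sample rate."""
    return len(data) / SAMPLE_RATE_HZ
```

At 8 kHz only frequencies up to 4 kHz survive, which is why an engine usually needs tuning to work well on such audio.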

Introduction to STT

With Private Cyberspace, EVERYONE (yes, you) gets to train their own STT engine. For those who want to learn a bit about the technology behind the STT they use every day, the following are some good introductions:

admin · February 7, 2024, 12:43pm

Default STT Engine

Kaldi

For commands the default engine is Kaldi.

Whisper

The default engine for general STT is OpenAI Whisper

Some projects using Whisper:

GitHub - m-bain/whisperX: WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
GitHub - pluja/whishper: Transcribe any audio to text, translate and edit subtitles 100% locally with a web UI. Powered by whisper models!
GitHub - thewh1teagle/vibe: Transcribe on your own!

Vosk

Languages

English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino, Ukrainian, Kazakh, Swedish, Japanese, Esperanto, Hindi, Czech, Polish

Software

Asterisk
GitHub - alphacep/vosk-asterisk: Speech Recognition in Asterisk with Vosk Server
Rhasspy Server
voskJs/examples at master · solyarisoftware/voskJs · GitHub

Sherpa

GitHub - k2-fsa/sherpa: Speech-to-text server framework with next-gen Kaldi

admin · February 11, 2024, 9:48pm

Faster Whisper

Faster Whisper, which is based on OpenAI's Whisper, can be used in most Private Cyberspace deployments.

Currently you can speak in 99 languages to your Entity Agent:

    "en": "english",
    "zh": "chinese",
    "de": "german",
    "es": "spanish",
    "ru": "russian",
    "ko": "korean",
    "fr": "french",
    "ja": "japanese",
    "pt": "portuguese",
    "tr": "turkish",
    "pl": "polish",
    "ca": "catalan",
    "nl": "dutch",
    "ar": "arabic",
    "sv": "swedish",
    "it": "italian",
    "id": "indonesian",
    "hi": "hindi",
    "fi": "finnish",
    "vi": "vietnamese",
    "he": "hebrew",
    "uk": "ukrainian",
    "el": "greek",
    "ms": "malay",
    "cs": "czech",
    "ro": "romanian",
    "da": "danish",
    "hu": "hungarian",
    "ta": "tamil",
    "no": "norwegian",
    "th": "thai",
    "ur": "urdu",
    "hr": "croatian",
    "bg": "bulgarian",
    "lt": "lithuanian",
    "la": "latin",
    "mi": "maori",
    "ml": "malayalam",
    "cy": "welsh",
    "sk": "slovak",
    "te": "telugu",
    "fa": "persian",
    "lv": "latvian",
    "bn": "bengali",
    "sr": "serbian",
    "az": "azerbaijani",
    "sl": "slovenian",
    "kn": "kannada",
    "et": "estonian",
    "mk": "macedonian",
    "br": "breton",
    "eu": "basque",
    "is": "icelandic",
    "hy": "armenian",
    "ne": "nepali",
    "mn": "mongolian",
    "bs": "bosnian",
    "kk": "kazakh",
    "sq": "albanian",
    "sw": "swahili",
    "gl": "galician",
    "mr": "marathi",
    "pa": "punjabi",
    "si": "sinhala",
    "km": "khmer",
    "sn": "shona",
    "yo": "yoruba",
    "so": "somali",
    "af": "afrikaans",
    "oc": "occitan",
    "ka": "georgian",
    "be": "belarusian",
    "tg": "tajik",
    "sd": "sindhi",
    "gu": "gujarati",
    "am": "amharic",
    "yi": "yiddish",
    "lo": "lao",
    "uz": "uzbek",
    "fo": "faroese",
    "ht": "haitian creole",
    "ps": "pashto",
    "tk": "turkmen",
    "nn": "nynorsk",
    "mt": "maltese",
    "sa": "sanskrit",
    "lb": "luxembourgish",
    "my": "myanmar",
    "bo": "tibetan",
    "tl": "tagalog",
    "mg": "malagasy",
    "as": "assamese",
    "tt": "tatar",
    "haw": "hawaiian",
    "ln": "lingala",
    "ha": "hausa",
    "ba": "bashkir",
    "jw": "javanese",
    "su": "sundanese",

There is also an "auto" language option, which attempts to detect the language you are speaking automatically, but its performance is LOWER than when you tell it to focus on one specific spoken language.
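
A minimal sketch of how a fixed language versus "auto" might be passed to Faster Whisper, assuming the open-source faster-whisper Python package is installed. The helper names, the shortened language set and the CPU/int8 settings are illustrative assumptions, not the Private Cyberspace configuration.

```python
# Subset of the language codes listed above, shortened for brevity (assumption).
SUPPORTED = {"en", "zh", "de", "es", "fr", "ja"}


def resolve_language(choice: str):
    """Map a user choice to the `language` argument; None asks for auto-detection."""
    choice = choice.strip().lower()
    if choice == "auto":
        return None
    if choice not in SUPPORTED:
        raise ValueError(f"unsupported language code: {choice!r}")
    return choice


def transcribe(path: str, choice: str = "auto", model_size: str = "base") -> str:
    """Transcribe an audio file locally and return the text."""
    # Imported lazily so resolve_language() works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, info = model.transcribe(path, language=resolve_language(choice))
    return " ".join(segment.text.strip() for segment in segments)
```

For example, transcribe("command.wav", "en") skips detection, while transcribe("command.wav") leaves the engine to guess the language first.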

Models

Whisper has a number of models which you can pick for your Private Cyberspace depending on the compute power of the hardware you have access to.

Model	Parameters	Memory	Speed	Default
Tiny	39 M	~1 GB	~32x
Base	74 M	~1 GB	~16x	Y
Small	244 M	~2 GB	~6x
Medium	769 M	~5 GB	~2x
Large	1550 M	~10 GB	1x

People who own less powerful personal hardware can scale down from the default Base model to the Tiny model, while people who share more powerful community hardware can scale up to larger models.
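
The scaling rule can be sketched as a small helper that picks the largest model whose approximate memory need (taken from the table above) fits the hardware. The function name and the fallback to Tiny are assumptions made for illustration.

```python
# (model name, approximate memory in GB) from the table above, smallest first
MODELS = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large", 10)]


def pick_model(available_gb: float) -> str:
    """Return the largest model whose approximate memory need fits."""
    chosen = "tiny"  # assumed fallback when even the smallest estimate does not fit
    for name, need_gb in MODELS:
        if need_gb <= available_gb:
            chosen = name
    return chosen
```

With about 1 GB free this returns "base", the default; with 4 GB it returns "small".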

A unique advantage of Private Cyberspace is the availability of the Partition AI layer on top of Whisper, enabling you to achieve much better results with smaller models than is possible with Whisper alone.

admin · August 8, 2024, 1:59am

Base Words

These are the words that your agent initially understands, which you can start training your agent with.

Numbers

zero
one
two
three
four
five
six
seven
eight
nine

Direction

up
down
left
right
stop

Status

yes
no

This vocabulary is always there in addition to other vocabularies, to handle words that your agent cannot recognise.
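
As a sketch only, the base vocabulary above could be held as a small table that turns a recognised word into a category and a value. The names BASE_WORDS and classify are hypothetical, not part of the agent.

```python
BASE_WORDS = {
    "numbers": ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"],
    "direction": ["up", "down", "left", "right", "stop"],
    "status": ["yes", "no"],
}


def classify(word: str):
    """Return (category, value) for a base word, or None if it is not one."""
    w = word.strip().lower()
    for category, words in BASE_WORDS.items():
        if w in words:
            # Numbers become their digit; other words are returned unchanged.
            value = words.index(w) if category == "numbers" else w
            return category, value
    return None
```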

Command

Command Menu

0 - Help
1 - Tracking - Active / Passive
2 - Venue - Arrive / Depart
3 - Agent - to do: conversation with AnythingLLM API
4 - Contact - lost property, attractive person, missing person
5 - Advertisement - reality ads, ads on vehicles, signage
6 - News - other interesting events NOT covered by categories above
7 - Review - good performer, good restaurant, traffic delay
8 - Hazard - pot hole on road, people with flu symptoms, broken vehicle, rubbish on road
9 - Emergency - crime, medical, fire

Acknowledge Menu

0 - No
1 - Yes
2 - Sub-Menu
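
One way the two menus could be driven by spoken digits is sketched below. The dictionaries mirror the menus above; the select helper and its behaviour for unknown input are assumptions for illustration.

```python
COMMAND_MENU = {
    0: "Help", 1: "Tracking", 2: "Venue", 3: "Agent", 4: "Contact",
    5: "Advertisement", 6: "News", 7: "Review", 8: "Hazard", 9: "Emergency",
}
ACKNOWLEDGE_MENU = {0: "No", 1: "Yes", 2: "Sub-Menu"}

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def select(menu: dict, spoken: str):
    """Return the menu entry for a spoken digit, or None if there is no match."""
    w = spoken.strip().lower()
    if w not in DIGITS:
        return None
    return menu.get(DIGITS.index(w))
```

Saying "nine" at the Command Menu selects Emergency, and "one" at the Acknowledge Menu confirms it.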

Tracking

Active

Browser - once every 1 minute
OwnTracks iOS - Move Mode - every 5 minutes OR after moving more than 50 meters
OwnTracks Android - Move Mode - every 10 seconds
Home Assistant iOS -
Home Assistant Android - High Accuracy Mode -

Passive

Browser - check once every 5 minutes
OwnTracks iOS - Significant Mode - every 5 minutes AND after moving more than 500 meters
OwnTracks Android - Move Mode - every 5 minutes AND after moving more than 50 meters
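
The time and distance rules above can be read as a simple decision: report when the interval has elapsed OR/AND the device has moved far enough. The sketch below models only that reading; the class and constant names are hypothetical and it does not describe how the tracking apps work internally.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportRule:
    interval_s: float    # time between reports, in seconds
    distance_m: float    # movement threshold, in meters
    require_both: bool   # True for "AND" rules, False for "OR" rules

    def should_report(self, elapsed_s: float, moved_m: float) -> bool:
        timed_out = elapsed_s >= self.interval_s
        moved = moved_m > self.distance_m
        return (timed_out and moved) if self.require_both else (timed_out or moved)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# The two iOS rules listed above
ACTIVE_IOS = ReportRule(interval_s=300, distance_m=50, require_both=False)
PASSIVE_IOS = ReportRule(interval_s=300, distance_m=500, require_both=True)
```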

admin · August 8, 2024, 2:03am

Web Browser STT

Runs inside the web browser and can be used OFFLINE. This is the most private option, as your voice never leaves your phone. It is in development and currently of limited capacity.

For a demonstration of web browser STT, go to https://speech.88.io and see how well your pre-trained agent can already recognise your voice BEFORE training.

Currently each word is represented by 696 spectrogram values that hold the frequency information of the word as you pronounce it.
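
To give a feel for how a fixed-length vector like this can be used, the sketch below matches a 696-value vector against stored per-word templates by cosine similarity. This is a generic technique chosen for illustration; the matching method actually used by the browser STT is not described here, and all names are hypothetical.

```python
import math

FEATURE_LEN = 696  # values per word, as stated above


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def nearest_word(features, templates: dict) -> str:
    """Return the template word whose vector is most similar to `features`."""
    if len(features) != FEATURE_LEN:
        raise ValueError(f"expected {FEATURE_LEN} values, got {len(features)}")
    return max(templates, key=lambda word: cosine(features, templates[word]))
```

Training in this picture would mean storing or refining one template per word from your own recordings.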

admin · August 8, 2024, 2:04am

Rhasspy

https://speech.quuvoo4ohcequuox.0.88.io/

admin · August 13, 2024, 4:25am

VOSK

admin · August 13, 2024, 4:25am

FunASR

admin · August 24, 2024, 3:19am

Willow

admin · September 3, 2024, 10:57pm

Phoneme Recognition

1. Allosaurus

GitHub - xinjli/allosaurus: Allosaurus is a pretrained universal phone recognizer for more than 2000 languages

2. Multilingual-PR

GitHub - ASR-project/Multilingual-PR: Phoneme Recognition using pre-trained models Wav2vec2, HuBERT and WavLM. Throughout this project, we compared specifically three different self-supervised models, Wav2vec (2019, 2020), HuBERT (2021) and WavLM (2022) pretrained on a corpus of English speech that we will use in various ways to perform phoneme recognition for different languages with a network trained with Connectionist Temporal Classification (CTC) algorithm.

3. Pocketsphinx

Phoneme Recognition (caveat emptor) – CMUSphinx Open Source Speech Recognition

CMU Pronouncing Dictionary

sphinxdict

Sphinx-compatible version of the CMU dictionary for Speech to Text.

AA
AE
AH
AO
AW
AY
B
CH
D
DH
EH
ER
EY
F
G
HH
IH
IY
JH
K
L
M
N
NG
OW
OY
P
R
S
SH
T
TH
UH
UW
V
W
Y
Z
ZH
SIL

p/cmusphinx/code - Revision 13291: /trunk/cmudict/sphinxdict

4. Phoneme Conversion

Conversion between two phoneme formats is required in some cases, for example, between the International Phonetic Alphabet (IPA) and ARPABET, which is used by CMU Sphinx:

ɔ 	AO
ɔː 	AO
o       AO
oː      AO
ɑ 	AA
ɑː 	AA
ɒ 	AO
iː 	IY
i 	IY
uː 	UW
u 	UW
ɛ 	EH
ɪ 	IH
ʊ 	UH
ʌ 	AH
ɐ	AH
ə 	AH
æ 	AE
a 	AE
e 	AE
eɪ 	EY
aɪ 	AY
oʊ 	OW
aʊ 	AW
əʊ      OW
iə      EH
eə      EH
ɔɪ 	OY
ɝ 	ER
ɜ       ER
ɜː      ER
ɹ       R
r       R
p 	P
b 	B
t 	T
d 	D
k 	K
ɡ 	G
ʧ       CH
tʃ 	CH
dʒ 	JH
f 	F
v 	V
θ 	TH
ð 	DH
s 	S
z 	Z
ʃ 	SH
ʒ 	ZH
h 	HH
m 	M
n 	N
ŋ 	NG
l 	L
j 	Y
w 	W
ʔ 	Q
'
ˈ
ː
ˌ
+SPACE+  SIL
x       K
ɲ       N
### ɑ̃      N
### ɣ       ZH
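
A minimal sketch of applying the table above, assuming a greedy longest-match scan so that two-symbol entries such as tʃ or oʊ win over their single-symbol parts. Only part of the table is included, stress and length marks are dropped, and the function name is hypothetical.

```python
IPA_TO_ARPABET = {
    # two-symbol entries (diphthongs, long vowels, affricates)
    "eɪ": "EY", "aɪ": "AY", "oʊ": "OW", "aʊ": "AW", "əʊ": "OW", "ɔɪ": "OY",
    "iː": "IY", "uː": "UW", "ɔː": "AO", "ɑː": "AA", "ɜː": "ER",
    "tʃ": "CH", "dʒ": "JH",
    # single-symbol vowels
    "i": "IY", "u": "UW", "ɔ": "AO", "o": "AO", "ɑ": "AA", "ɒ": "AO",
    "ɛ": "EH", "ɪ": "IH", "ʊ": "UH", "ʌ": "AH", "ə": "AH", "æ": "AE",
    "a": "AE", "e": "AE", "ɝ": "ER",
    # single-symbol consonants
    "p": "P", "b": "B", "t": "T", "d": "D", "k": "K", "ɡ": "G",
    "f": "F", "v": "V", "θ": "TH", "ð": "DH", "s": "S", "z": "Z",
    "ʃ": "SH", "ʒ": "ZH", "h": "HH", "m": "M", "n": "N", "ŋ": "NG",
    "l": "L", "ɹ": "R", "r": "R", "j": "Y", "w": "W",
    # a space between words becomes silence
    " ": "SIL",
}
DROPPED = {"'", "ˈ", "ˌ", "ː"}  # marks with no ARPABET equivalent in the table


def ipa_to_arpabet(ipa: str) -> list[str]:
    """Convert an IPA string to a list of ARPABET symbols."""
    out = []
    i = 0
    while i < len(ipa):
        pair = ipa[i:i + 2]
        if len(pair) == 2 and pair in IPA_TO_ARPABET:
            out.append(IPA_TO_ARPABET[pair])   # longest match first
            i += 2
        elif ipa[i] in IPA_TO_ARPABET:
            out.append(IPA_TO_ARPABET[ipa[i]])
            i += 1
        elif ipa[i] in DROPPED:
            i += 1                             # skip stress and length marks
        else:
            raise ValueError(f"no mapping for {ipa[i]!r}")
    return out
```

Checking pairs first matters because, for example, oʊ must become OW rather than AO followed by UH.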