Speech to Text

Speech is a fundamental part of being human.

88.io provides tools to help in the push towards citizens not only owning their data but also the intelligence that comes from their data.

A major weakness of many voice recognition systems is that they send your voice to the Cloud for recognition or for training.

Speech is a simple and convenient user interface. Unfortunately, its use has been dominated by Cloud platforms, e.g. Apple, Google and Amazon.

By taking advantage of Partition AI, Private Cyberspaces come with their own independent Voice Recognition system, which can be trained privately by you and used across different platforms.

  • Client Voice Recognition
  • Server Voice Recognition

With your own Entity Agent, you are the only one with access to your voice in order to:

  1. Give Voice Commands to your Agent
  2. Train your Agent to recognise your Voice Commands

NOTHING is sent out of your personal device. The process does NOT use any external APIs; all commands and their training remain on your device.

Speech Quality

We have tuned the STT to work even on the traditional telephone network (using A-law codec with 8kHz sampling rate).
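
Telephone audio of this kind arrives as one A-law byte per sample, 8000 samples per second, and usually has to be expanded to linear PCM before a recogniser can use it. The following is a minimal sketch of the standard G.711 A-law expansion in plain Python. The function names are hypothetical and this is not taken from the 88.io code; it only illustrates the codec named above.

```python
SAMPLE_RATE_HZ = 8000  # telephone-network sampling rate mentioned above


def alaw_to_linear(byte: int) -> int:
    """Expand one 8-bit A-law code word to a signed 16-bit PCM sample."""
    byte ^= 0x55                      # undo the bit inversion applied by the encoder
    magnitude = (byte & 0x0F) << 4    # 4-bit mantissa
    segment = (byte & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    # In A-law the sign bit is set for positive samples.
    return magnitude if byte & 0x80 else -magnitude


def decode_alaw(payload: bytes) -> list[int]:
    """Decode a buffer of A-law bytes into a list of PCM samples."""
    return [alaw_to_linear(b) for b in payload]


def duration_seconds(payload: bytes) -> float:
    """Length of an A-law buffer in seconds at 8 kHz (one byte per sample)."""
    return len(payload) / SAMPLE_RATE_HZ
```

For example, `decode_alaw(bytes([0xD5, 0x55]))` gives the two smallest-magnitude samples, `[8, -8]`.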

Introduction to STT

With Private Cyberspace, EVERYONE (yes, you) gets to train their own STT engine. For those who want to learn a bit about the technology behind the STT they use every day, the following are some good introductions:

Default STT Engine

Kaldi

For commands the default engine is

Whisper

The default engine for general STT is OpenAI Whisper.

Some projects using Whisper:

Vosk

Languages

English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino, Ukrainian, Kazakh, Swedish, Japanese, Esperanto, Hindi, Czech, Polish

Software

Sherpa

Faster Whisper

Faster Whisper, which is based on OpenAI's Whisper, can be used in most Private Cyberspace deployments.

Currently you can speak in 99 languages to your Entity Agent:

    "en": "english",
    "zh": "chinese",
    "de": "german",
    "es": "spanish",
    "ru": "russian",
    "ko": "korean",
    "fr": "french",
    "ja": "japanese",
    "pt": "portuguese",
    "tr": "turkish",
    "pl": "polish",
    "ca": "catalan",
    "nl": "dutch",
    "ar": "arabic",
    "sv": "swedish",
    "it": "italian",
    "id": "indonesian",
    "hi": "hindi",
    "fi": "finnish",
    "vi": "vietnamese",
    "he": "hebrew",
    "uk": "ukrainian",
    "el": "greek",
    "ms": "malay",
    "cs": "czech",
    "ro": "romanian",
    "da": "danish",
    "hu": "hungarian",
    "ta": "tamil",
    "no": "norwegian",
    "th": "thai",
    "ur": "urdu",
    "hr": "croatian",
    "bg": "bulgarian",
    "lt": "lithuanian",
    "la": "latin",
    "mi": "maori",
    "ml": "malayalam",
    "cy": "welsh",
    "sk": "slovak",
    "te": "telugu",
    "fa": "persian",
    "lv": "latvian",
    "bn": "bengali",
    "sr": "serbian",
    "az": "azerbaijani",
    "sl": "slovenian",
    "kn": "kannada",
    "et": "estonian",
    "mk": "macedonian",
    "br": "breton",
    "eu": "basque",
    "is": "icelandic",
    "hy": "armenian",
    "ne": "nepali",
    "mn": "mongolian",
    "bs": "bosnian",
    "kk": "kazakh",
    "sq": "albanian",
    "sw": "swahili",
    "gl": "galician",
    "mr": "marathi",
    "pa": "punjabi",
    "si": "sinhala",
    "km": "khmer",
    "sn": "shona",
    "yo": "yoruba",
    "so": "somali",
    "af": "afrikaans",
    "oc": "occitan",
    "ka": "georgian",
    "be": "belarusian",
    "tg": "tajik",
    "sd": "sindhi",
    "gu": "gujarati",
    "am": "amharic",
    "yi": "yiddish",
    "lo": "lao",
    "uz": "uzbek",
    "fo": "faroese",
    "ht": "haitian creole",
    "ps": "pashto",
    "tk": "turkmen",
    "nn": "nynorsk",
    "mt": "maltese",
    "sa": "sanskrit",
    "lb": "luxembourgish",
    "my": "myanmar",
    "bo": "tibetan",
    "tl": "tagalog",
    "mg": "malagasy",
    "as": "assamese",
    "tt": "tatar",
    "haw": "hawaiian",
    "ln": "lingala",
    "ha": "hausa",
    "ba": "bashkir",
    "jw": "javanese",
    "su": "sundanese",

There is also an "auto" language option, which attempts to detect the language you are speaking automatically, but its performance is LOWER than when you tell it to focus on a specific spoken language.
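
As a rough illustration of the choice between a fixed language and "auto", the sketch below maps a user's selection to the `language` argument of the faster-whisper package, where `None` asks the model to detect the language itself. The helper names (`language_argument`, `transcribe`) and the shortened language table are assumptions made for this example, not part of Private Cyberspace.

```python
# Small subset of the language table above, for illustration only.
LANGUAGES = {"en": "english", "zh": "chinese", "de": "german", "fr": "french"}


def language_argument(choice: str):
    """Turn a user's choice ('auto', a code, or a name) into a language code or None."""
    choice = choice.strip().lower()
    if choice == "auto":
        return None                      # None means: let the model detect the language
    if choice in LANGUAGES:
        return choice
    for code, name in LANGUAGES.items():
        if name == choice:
            return code
    raise ValueError(f"unsupported language: {choice!r}")


def transcribe(path: str, choice: str = "auto", model_size: str = "base"):
    """Transcribe an audio file locally; requires the faster-whisper package."""
    from faster_whisper import WhisperModel  # imported lazily so the helper above works without it

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, info = model.transcribe(path, language=language_argument(choice))
    text = "".join(segment.text for segment in segments)
    return info.language, text
```

Calling `transcribe("command.wav", "english")` (a hypothetical file name) skips detection, while `transcribe("command.wav")` lets the model guess the language first.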

Models

Whisper has a number of models which you can pick for your Private Cyberspace depending on the compute power of the hardware you have access to.

Model    Parameters   Memory   Speed   Default
Tiny     39 M         ~1 GB    ~32x
Base     74 M         ~1 GB    ~16x    Y
Small    244 M        ~2 GB    ~6x
Medium   769 M        ~5 GB    ~2x
Large    1550 M       ~10 GB   1x

People who own less powerful personal hardware can scale down from the default Base model to the Tiny model, while people who share more powerful community hardware can scale up to larger models.
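
The scaling rule can be sketched as a small lookup over the table above: pick the largest model that fits in the available memory and is still fast enough. The figures are copied from the table; the function name `pick_model` and the `min_speed` parameter are assumptions made for this example.

```python
# (name, parameters in millions, approx. memory in GB, approx. relative speed)
MODELS = [
    ("tiny", 39, 1, 32),
    ("base", 74, 1, 16),
    ("small", 244, 2, 6),
    ("medium", 769, 5, 2),
    ("large", 1550, 10, 1),
]


def pick_model(available_gb: float, min_speed: float = 1) -> str:
    """Return the largest model that fits in memory and meets the speed floor."""
    fitting = [
        name
        for name, _params, memory_gb, speed in MODELS
        if memory_gb <= available_gb and speed >= min_speed
    ]
    if not fitting:
        raise ValueError("no model fits the given memory and speed limits")
    return fitting[-1]  # MODELS is ordered from smallest to largest
```

With 1 GB available this returns the default `"base"`; asking for `min_speed=20` on the same hardware scales down to `"tiny"`.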

A unique advantage of Private Cyberspace is the availability of the Partition AI layer on top of Whisper, enabling you to achieve much better results with smaller models than is possible with Whisper alone.

Base Words

These are the words that your agent initially understands, which you can start training your agent with.

Numbers

  1. zero
  2. one
  3. two
  4. three
  5. four
  6. five
  7. six
  8. seven
  9. eight
  10. nine

Direction

  1. up
  2. down
  3. left
  4. right
  5. stop

Status

  1. yes
  2. no
  • This vocabulary is always there in addition to other vocabularies, to handle words that your agent cannot recognise.

Command

Command Menu

0 - Help
1 - Tracking - Active / Passive
2 - Venue - Arrive / Depart
3 - Agent - conversation with the AnythingLLM API (to do)
4 - Contact - lost property, attractive person, missing person
5 - Advertisement - reality ads, ads on vehicles, signage
6 - News - other interesting events NOT covered by categories above
7 - Review - good performer, good restaurant, traffic delay
8 - Hazard - pot hole on road, people with flu symptoms, broken vehicle, rubbish on road
9 - Emergency - crime, medical, fire

Acknowledge Menu

0 - No
1 - Yes
2 - Sub-Menu
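
Because the base vocabulary already contains the digits zero to nine, a recognised digit word can select an entry in either menu. The following is a minimal sketch of that lookup; the data structures and the `select` function are assumptions made for illustration, and only the menu entries themselves come from the lists above.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

COMMAND_MENU = {
    0: "Help", 1: "Tracking", 2: "Venue", 3: "Agent", 4: "Contact",
    5: "Advertisement", 6: "News", 7: "Review", 8: "Hazard", 9: "Emergency",
}

ACKNOWLEDGE_MENU = {0: "No", 1: "Yes", 2: "Sub-Menu"}


def select(menu: dict, spoken_word: str):
    """Return the menu entry for a recognised digit word, or None if there is none."""
    word = spoken_word.strip().lower()
    if word not in DIGIT_WORDS:
        return None                      # not a digit the agent knows
    return menu.get(DIGIT_WORDS.index(word))
```

For example, `select(COMMAND_MENU, "nine")` returns `"Emergency"`, and `select(ACKNOWLEDGE_MENU, "five")` returns `None` because that menu has no entry 5.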

Tracking

Active

Browser - once every 1 minute
OwnTracks iOS - Move Mode - every 5 minutes OR after moving more than 50 meters
OwnTracks Android - Move Mode - every 10 seconds
Home Assistant iOS -
Home Assistant Android - High Accuracy Mode -

Passive

Browser - check once every 5 minutes
OwnTracks iOS - Significant Mode - every 5 minutes AND after moving more than 500 meters
OwnTracks Android - Move Mode - every 5 minutes AND after moving more than 50 meters
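
The difference between the OR and AND rules above can be made concrete with a small sketch. It models only the two OwnTracks iOS entries (5 minutes OR 50 meters, and 5 minutes AND 500 meters); the `Rule` class and `should_report` function are hypothetical names and do not describe how OwnTracks itself is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    interval_s: float    # minimum time between reports, in seconds
    distance_m: float    # distance that must be exceeded, in meters
    require_both: bool   # True for "AND" rules, False for "OR" rules


ACTIVE_IOS = Rule(interval_s=300, distance_m=50, require_both=False)
PASSIVE_IOS = Rule(interval_s=300, distance_m=500, require_both=True)


def should_report(rule: Rule, elapsed_s: float, moved_m: float) -> bool:
    """Decide whether a new location report is due under the given rule."""
    time_ok = elapsed_s >= rule.interval_s
    distance_ok = moved_m > rule.distance_m
    if rule.require_both:
        return time_ok and distance_ok
    return time_ok or distance_ok
```

Under the active rule a 60 meter move triggers a report after only 10 seconds, while under the passive rule a 100 meter move is ignored even after 5 minutes.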

Web Browser STT

Runs inside the web browser and can be used OFFLINE. This is the most private option, as your voice never leaves your phone. It is in development and currently of limited capacity.

For a demonstration of web browser STT, go to https://speech.88.io and see how much of your voice your pre-trained agent can already recognise BEFORE training.

Currently, each word is represented by 696 spectrogram values that hold the frequency information of the word as pronounced by you.
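
To give a feel for what such a fixed-length spectrogram vector looks like, here is a minimal pure-Python sketch that cuts a word into frames and keeps a few frequency magnitudes per frame. The 24 frames by 29 frequency bins layout is purely an assumption, chosen only because it multiplies to 696; the source does not say how the 696 values are arranged, and the real front end may differ in every detail.

```python
import cmath
import math

N_FRAMES = 24  # assumption: number of time slices per word
N_BINS = 29    # assumption: frequency bins per slice (24 * 29 = 696)


def word_features(samples, n_frames=N_FRAMES, n_bins=N_BINS):
    """Return n_frames * n_bins spectral magnitudes for one spoken word."""
    frame_len = len(samples) // n_frames
    if frame_len < 2 * n_bins:
        raise ValueError("recording too short for the requested resolution")
    features = []
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        for k in range(1, n_bins + 1):
            # Naive discrete Fourier transform of this frame at bin k.
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n, x in enumerate(frame)
            )
            features.append(abs(acc) / frame_len)
    return features
```

Whatever the recording length, the result always has 696 entries, which is what makes words of different durations comparable.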

Rhasspy

https://speech.quuvoo4ohcequuox.0.88.io/

VOSK

FunASR

Willow

Phoneme Recognition

1. Allosaurus

2. Multilingual-PR

3. Pocketsphinx

CMU Pronouncing Dictionary

sphinxdict

Sphinx-compatible version of the CMU dictionary for Speech to Text.

AA
AE
AH
AO
AW
AY
B
CH
D
DH
EH
ER
EY
F
G
HH
IH
IY
JH
K
L
M
N
NG
OW
OY
P
R
S
SH
T
TH
UH
UW
V
W
Y
Z
ZH
SIL

4. Phoneme Conversion

Conversion between two phoneme formats is required in some cases, for example between the International Phonetic Alphabet (IPA) and ARPABET, which is used by CMU Sphinx:

ɔ 	AO
ɔː 	AO
o       AO
oː      AO
ɑ 	AA
ɑː 	AA
ɒ 	AO
iː 	IY
i 	IY
uː 	UW
u 	UW
ɛ 	EH
ɪ 	IH
ʊ 	UH
ʌ 	AH
ɐ	AH
ə 	AH
æ 	AE
a 	AE
e 	AE
eɪ 	EY
aɪ 	AY
oʊ 	OW
aʊ 	AW
əʊ      OW
iə      EH
eə      EH
ɔɪ 	OY
ɝ 	ER
ɜ       ER
ɜː      ER
ɹ       R
r       R
p 	P
b 	B
t 	T
d 	D
k 	K
ɡ 	G
ʧ       CH
tʃ 	CH
dʒ 	JH
f 	F
v 	V
θ 	TH
ð 	DH
s 	S
z 	Z
ʃ 	SH
ʒ 	ZH
h 	HH
m 	M
n 	N
ŋ 	NG
l 	L
j 	Y
w 	W
ʔ 	Q
'
ˈ
ː
ˌ
+SPACE+  SIL
x       K
ɲ       N
### ɑ̃      N
### ɣ       ZH
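
A table like the one above can be applied with a greedy longest-match scan, so that two-character symbols such as "oʊ" or "tʃ" win over their single-character prefixes. The sketch below uses only a subset of the table and drops the stress and length marks that have no ARPABET target; the function name `ipa_to_arpabet` and the choice to raise an error on unknown symbols are assumptions made for this example.

```python
# Subset of the IPA to ARPABET table above, for illustration only.
IPA_TO_ARPABET = {
    "ɔː": "AO", "ɔ": "AO", "o": "AO", "ɑː": "AA", "ɑ": "AA",
    "iː": "IY", "i": "IY", "uː": "UW", "u": "UW",
    "ɛ": "EH", "ɪ": "IH", "ʊ": "UH", "ʌ": "AH", "ə": "AH", "æ": "AE",
    "eɪ": "EY", "aɪ": "AY", "oʊ": "OW", "aʊ": "AW", "ɔɪ": "OY",
    "tʃ": "CH", "dʒ": "JH", "ʃ": "SH", "ʒ": "ZH", "θ": "TH", "ð": "DH",
    "p": "P", "b": "B", "t": "T", "d": "D", "k": "K", "ɡ": "G",
    "f": "F", "v": "V", "s": "S", "z": "Z", "h": "HH",
    "m": "M", "n": "N", "ŋ": "NG", "l": "L", "ɹ": "R", "r": "R",
    "j": "Y", "w": "W", " ": "SIL",
}

# Stress and length marks that are dropped during conversion.
IGNORED = {"'", "ˈ", "ː", "ˌ"}

LONGEST = max(len(symbol) for symbol in IPA_TO_ARPABET)


def ipa_to_arpabet(ipa: str) -> list[str]:
    """Convert an IPA string to ARPABET phones using greedy longest match."""
    phones = []
    i = 0
    while i < len(ipa):
        for size in range(LONGEST, 0, -1):
            chunk = ipa[i:i + size]
            if chunk in IPA_TO_ARPABET:
                phones.append(IPA_TO_ARPABET[chunk])
                i += len(chunk)
                break
        else:
            if ipa[i] not in IGNORED:
                raise ValueError(f"no ARPABET mapping for {ipa[i]!r}")
            i += 1  # skip stress or length mark
    return phones
```

For example, `ipa_to_arpabet("həˈloʊ")` yields `["HH", "AH", "L", "OW"]`: the stress mark is skipped and "oʊ" is matched as one diphthong rather than as "o" followed by "ʊ".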