I’ve complained previously about Google’s romanization of Thai, which is hopeless and looks like this:
M?w n??ng xy?? bn m? th?hi w
and the RTGS transcription doesn’t mark tone. I can’t really experiment much more with someone else’s machine translation before I sort this out, so here’s an interim result from my home-brewed romanizer:
CCCCC CCvC*CCCVCCVCVCVCVC
vC*VC*VCVCCCCC*CVCCVC CCCCCCCCVCCVvCCCCVCVCVC (3 CVCVCC C.C. 2426-13 CVCVCVCC C.C. 2463) vC*CCCVCVCvCCCCCC*CV* 40 vCCCVCVCCCvC*CCCVCVCCCCvCC*VvC*VCCV*CVC vCVCCC*CV* 4 vCCCvC*CCCVCCVCVCCVCCCVCCCCVCVCVCVC CCVCVCCVCCCC vCVCCCvC*CCCvC*CCCVCCVCVCVCVC vCCCVCVCCCvC*CCCVCCCVCvCC*VvC*VCCV*CVC
As you can see, it distinguishes consonants (C), vowels that are written after the consonant (V), vowels that are written before the consonant but that you would transliterate after it (v), and single-byte UTF-8 characters, which pass through unchanged, and it renders anything that appears after U+0E44 as an asterisk. Now that I can reduce everything to integers, which in PHP is needlessly fiddly, I can build a finite-state transducer.
This sounds much more cat’s-whiskers-and-practical-electronics-for-the-technical-man than it really is.
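For the curious, the classification step described above can be sketched in a few lines. This is a rough reconstruction in Python rather than the PHP actually used, and the function names `classify_char` and `classify` are made up for illustration. The code-point ranges are my reading of the Unicode Thai block: only the U+0E44 cut-off comes from the description above, and the rest is an assumption.

```python
# Sketch only: ranges below are assumptions based on the Unicode Thai block,
# apart from the "everything after U+0E44 becomes *" rule.

def classify_char(ch: str) -> str:
    """Map one character to C, V, v, * or leave it unchanged."""
    cp = ord(ch)  # reduce the character to an integer code point
    if 0x0E01 <= cp <= 0x0E2E:
        return "C"  # consonant letters
    if 0x0E30 <= cp <= 0x0E3A:
        return "V"  # vowel signs written after, above or below the consonant
    if 0x0E40 <= cp <= 0x0E44:
        return "v"  # vowels written before the consonant
    if 0x0E45 <= cp <= 0x0E7F:
        return "*"  # tone marks and everything else after U+0E44
    return ch       # ASCII digits, spaces, punctuation pass through


def classify(text: str) -> str:
    """Classify a whole string, one code point at a time."""
    return "".join(classify_char(ch) for ch in text)


if __name__ == "__main__":
    print(classify("แมว"))  # vCC
    print(classify("น้ำ"))  # C*V
```

A string of symbols like this is exactly the kind of input a finite-state transducer wants: a small alphabet, one symbol per code point.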