I keep thinking U+0E00 onwards is the private use area

I’ve complained previously about Google’s romanization of Thai, which is hopeless and looks like this:

M?w n??ng xy?? bn m? th?hi w

and the RTGS transcription doesn’t mark tone. I can’t really experiment much more with someone else’s machine translation before I sort this out, so here’s an interim result from my cobbled-together-at-home romanizer:

CCCCC CCvC*CCCVCCVCVCVCVC
vC*VC*VCVCCCCC*CVCCVC CCCCCCCCVCCVvCCCCVCVCVC (3 CVCVCC C.C. 2426-13 CVCVCVCC C.C. 2463) vC*CCCVCVCvCCCCCC*CV* 40 vCCCVCVCCCvC*CCCVCVCCCCvCC*VvC*VCCV*CVC vCVCCC*CV* 4 vCCCvC*CCCVCCVCVCCVCCCVCCCCVCVCVCVC CCVCVCCVCCCC vCVCCCvC*CCCvC*CCCVCCVCVCVCVC vCCCVCVCCCvC*CCCVCCCVCvCC*VvC*VCCV*CVC

As you can see, it distinguishes consonants, vowels that go after the consonant, vowels that are written before the consonant but that you would transliterate after it, and single-byte UTF-8 characters, and it renders anything that appears after U+0E44 as an asterisk. Now that I can reduce everything to integers, which in PHP is needlessly fiddly, I can build a finite-state transducer.
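For illustration only, here is roughly what that classification step looks like in Python rather than PHP. It is a sketch, not the romanizer itself: the function names are made up, and the range boundaries are assumptions read off the Unicode Thai block (consonants at U+0E01 to U+0E2E, following vowels at U+0E30 to U+0E3A, leading vowels at U+0E40 to U+0E44). Anything it does not recognise comes out as a question mark.

```python
def classify(ch: str) -> str:
    """Map one character to a class symbol (hypothetical helper)."""
    cp = ord(ch)  # the "reduce everything to integers" step
    if cp < 0x80:
        return ch   # single-byte UTF-8 (ASCII) passes through unchanged
    if 0x0E01 <= cp <= 0x0E2E:
        return "C"  # consonant
    if 0x0E30 <= cp <= 0x0E3A:
        return "V"  # vowel written after or around the consonant
    if 0x0E40 <= cp <= 0x0E44:
        return "v"  # vowel written before the consonant
    if 0x0E45 <= cp <= 0x0E7F:
        return "*"  # everything after U+0E44: tone marks, Thai digits, etc.
    return "?"      # outside ASCII and the ranges above (assumed fallback)


def skeleton(text: str) -> str:
    """Reduce a string to its class skeleton, one symbol per code point."""
    return "".join(classify(ch) for ch in text)


print(skeleton("แมว"))  # vCC
```

Python makes the integer step a one-liner with `ord`, which is most of the reason for not sketching this in PHP.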

This sounds much more cat’s-whiskers-and-practical-electronics-for-the-technical-man than it really is.
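To give a flavour of how little machinery is involved, here is a toy two-state transducer in Python. It is only a sketch under loud assumptions: it does one job (moving a leading vowel to after the single consonant that follows it), it knows nothing about consonant clusters or tone, the name `reorder_leading_vowels` is invented for the example, and it is not the PHP romanizer described above.

```python
LEADING_VOWELS = range(0x0E40, 0x0E45)  # U+0E40..U+0E44, written before the consonant
CONSONANTS = range(0x0E01, 0x0E2F)      # U+0E01..U+0E2E (assumed consonant range)


def reorder_leading_vowels(text: str) -> str:
    """Two states: nothing held, or one leading vowel held back."""
    out = []
    held = None  # None = start state; otherwise the vowel waiting for its consonant
    for ch in text:
        cp = ord(ch)
        if held is None:
            if cp in LEADING_VOWELS:
                held = ch                # consume input, emit nothing yet
            else:
                out.append(ch)           # copy input straight to output
        else:
            if cp in CONSONANTS:
                out.extend([ch, held])   # emit the consonant first, then the vowel
            else:
                out.extend([held, ch])   # no consonant followed: give up and flush
            held = None
    if held is not None:
        out.append(held)                 # input ended while a vowel was still held
    return "".join(out)


print(reorder_leading_vowels("แมว"))  # มแว
```

A real transducer for this would need more states (clusters, vowels written on both sides of the consonant), but the shape stays the same: read a symbol, maybe emit something, move to the next state.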
