Category Archives: th

Syllabification in Thai

We conclude that in order to work out whether a string of characters in Thai text represents an open or closed syllable, you have to know up front what the text actually says. This makes trying to determine how Thai … Continue reading

Posted in th | Leave a comment

I didn’t build a finite-state transducer in the end

I just used regexes and a big switch… case… statement. Sorry. Maybe next time.

No sir, I can’t abugida

The peculiar and rebarbative romanization Google Translate uses for Thai is ISO 11940, which is 86 Swiss francs to you at the time of writing. I’m trying to work out something between that and the rather lossy RTGS, but my … Continue reading

I keep thinking U+0E00 onwards is the private use area

I’ve complained previously about Google’s romanization of Thai, which is hopeless and looks like this:

Mæw nạ̀ng xyū̀ bn mæ thṭhi w

And the RTGS transcription doesn’t mark tone. I can’t really experiment much more with someone else’s machine translation … Continue reading
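For the record, the Thai block runs from U+0E00 to U+0E7F, while the private use area in the Basic Multilingual Plane starts at U+E000, so the two are easy to transpose. A quick check in Python using only the standard `unicodedata` module; the helper name `describe` is made up for illustration:

```python
import unicodedata

def describe(ch: str) -> tuple[str, str]:
    """Return (general category, name) for one character; hypothetical helper."""
    # Private use code points have no name, so supply a default instead of raising.
    return unicodedata.category(ch), unicodedata.name(ch, "<unnamed>")

thai = describe("\u0e01")     # first letter in the Thai block
private = describe("\ue000")  # first code point of the BMP private use area

print(thai)
print(private)
```

The general category is the reliable signal here: Thai letters are `Lo` (other letter), private use code points are `Co`.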

Combining characters

Lots of Indic scripts are abugidas, which means that a “consonant” on its own means consonant + schwa (roughly), and otherwise you have consonants and vowels as usual. What comes as a surprise to lots of people used to the … Continue reading
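As a small illustration of the combining-character point in Thai, here is a minimal Python sketch: the syllable กิ is stored as a consonant followed by a separate vowel sign, and that vowel sign is a nonspacing mark. The sample string and the helper name `code_points` are my own choices for illustration, not taken from the post.

```python
import unicodedata

def code_points(text: str) -> list[tuple[str, str, str]]:
    """List (U+XXXX, general category, name) per code point; hypothetical helper."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]

# KO KAI followed by SARA I: two code points, drawn as one cluster on screen.
syllable = "\u0e01\u0e34"

for row in code_points(syllable):
    print(row)
```

What looks like a single glyph on screen is two code points, which is why naive character counting and slicing go wrong on this kind of text.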

Ergativity

REVISION: If languages are ergative, like Basque and others I can’t remember right now, then they mark the subjects of transitive verbs, but not those of intransitive verbs. So “John opened the door and ran away” would require two “Johns”, one in … Continue reading

Thai, which turns out to have an abugida like Hindi or Tamil, marks tone on the initial consonant sometimes and also on the vowel; it seems that Google have rolled their own transliteration based very closely on the script rather than using one of the lossy pre-canned ones

Ah.

Your own rules are made to be broken

I may cheat and find a less rebarbative, browser-friendlier romanization of Thai than the one I’m getting from Google. It would help to be able to type rather than cut and paste.

Classifiers

C̄hạn sụ̄̂x s̄xng m̂ā
C̄hạn sụ̄̂x m̂ā s̄ām tạw
C̄hạn sụ̄̂x s̄ī̀ m̂ā
C̄hạn sụ̄̂x m̂ā s̄ib

Japanese, Korean and Chinese tend not to have separate plural forms for nouns, but use classifiers instead, like “rasher” in “five rashers of … Continue reading

Every cat chases some dog.

There are, as well you know, two readings for that sentence. What does it look like in Thai? According to Google, like this:

สุนัขแมวทุก chases บาง

Maybe this gives us a clue about the training corpus. “Every cat loves some … Continue reading
