Distinguishing between Finnish and English

I wrote a quick-and-dirty script to detect whether an entry is in English or Finnish. It’s based on two explicit lists of signifiers (frequent words that occur in only one language). Each signifier is assigned a weight to compensate for the fact that a signifier can still turn up in text written in the other language.
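
Roughly, the idea looks like the sketch below. It is in Python rather than PHP, and it is not the original script: the word lists, the weights, and the names `detect_language` and `tokenize` are all made up for illustration. Each word in the text adds its weight to the score of the language it signifies, and the higher total wins.

```python
import re

# Hypothetical signifier lists: frequent words assumed to occur in only one
# of the two languages. The weights are illustrative guesses, not the values
# used in the original script.
ENGLISH_SIGNIFIERS = {"the": 1.0, "and": 1.0, "of": 1.0, "that": 1.0, "with": 1.0, "is": 0.5}
FINNISH_SIGNIFIERS = {"ja": 1.0, "ei": 1.0, "että": 1.0, "oli": 1.0, "mutta": 1.0, "kuin": 0.5}


def tokenize(text):
    """Lowercase the text and split it into runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def detect_language(text):
    """Return "en", "fi", or "unknown" based on summed signifier weights."""
    scores = {"en": 0.0, "fi": 0.0}
    for word in tokenize(text):
        scores["en"] += ENGLISH_SIGNIFIERS.get(word, 0.0)
        scores["fi"] += FINNISH_SIGNIFIERS.get(word, 0.0)
    if scores["en"] == scores["fi"]:
        # No signifiers at all, or an exact tie: refuse to guess.
        return "unknown"
    return "en" if scores["en"] > scores["fi"] else "fi"


if __name__ == "__main__":
    print(detect_language("The cat sat on the mat and that was the end of it"))
    print(detect_language("Se oli hyvä päivä, mutta ei ollut aurinkoa ja satoi"))
```

Because the scores are summed over the whole entry, a Finnish entry that quotes a few English words should still come out as Finnish as long as its own signifiers outweigh the quoted ones.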

It’s actually much easier to match English than Finnish.

I couldn’t find any real natural language models for PHP, which is really a shame. A Markovian language classifier would’ve been exactly what I needed. I bet Ispell would work.

---