The battle over the way we should speak

On the increasing usage of improper English, Joan Acocella of The New Yorker notes:

English is a melding of the languages of the many different peoples who have lived in Britain; it has also changed through commerce and conquest. English has always been a ragbag, and that encouraged further permissiveness. In the past half century or so, however, this situation has produced a serious quarrel, political as well as linguistic, with two combatant parties: the prescriptivists, who were bent on instructing us in how to write and speak; and the descriptivists, who felt that all we could legitimately do in discussing language was to say what the current practice was.

But the most curious flaw in the descriptivists’ reasoning is their failure to notice that it is now they who are doing the prescribing. By the eighties, the goal of objectivity had been replaced, at least in the universities, by the postmodern view that there is no such thing as objectivity: every statement is subjective, partial, full of biases and secret messages. And so the descriptivists, with what they regarded as their trump card—that they were being accurate—came to look naïve, and the prescriptivists, with their admission that they held a specific point of view, became the realists, the wised-up.

Source: New Yorker

I guess that makes me closer to a descriptivist, since I think there’s nothing wrong with Singlish.

How words are learned

MIT researcher Deb Roy wanted to understand how his infant son learned language — so he wired up his house with videocameras to catch every moment (with exceptions) of his son’s life, then parsed 90,000 hours of home video to watch “gaaaa” slowly turn into “water.” Astonishing, data-rich research with deep implications for how we learn.

Deb Roy: The birth of a word

Watch what you retweet

You think you can just retweet something and get away with it? Well, actually you probably can. BUT not in China.


Cheng disappeared ten days later, on what was to be her wedding day, her whereabouts unknown until it emerged this week that she had been detained and sentenced by local police.

“Sentencing someone to a year in a labour camp, without trial, for simply repeating another person’s clearly satirical observation on Twitter demonstrates the level of China’s repression of online expression” said Sam Zarifi, Amnesty International’s Director for the Asia-Pacific.

The offending tweet was originally posted by Cheng’s fiancé Hua Chunhui, mocking China’s young nationalist demonstrators who had smashed Japanese products in protest over a maritime incident between China and Japan involving the disputed Diaoyu/Senkaku islands.

Hua’s original tweet said “Anti-Japanese demonstrations, smashing Japanese products, that was all done years ago by Guo Quan [an activist and expert on the Nanjing Massacre]. It’s no new trick. If you really wanted to kick it up a notch, you’d immediately fly to Shanghai to smash the Japanese Expo pavilion.” [Source:

The poor lady has been sentenced to “re-education through labor.” An additional note: Twitter is blocked in China, so the only way for someone there to tweet is to go through a VPN.

An unrelated note: Chinese is a really compact written language. In 140 characters you can write a sentence in English, but in Chinese you can write a short paragraph about your day.

What my Chinese name means in Japanese

Well, nobody really explained to me why my name was chosen for me. Later on, I assumed that it is probably a good name and that I just have to trust whoever gave it to me, which would be my grandfather. My Chinese name means something like ‘good’ and ‘great’; maybe my grandfather couldn’t decide between the two.

I got particularly interested in how my Chinese name would be read in Japanese as kanji.


  • 嘉 means “applaud”, “esteem” or “praise”. When used in context, it typically means good and is pronounced as ‘ka’.
  • 偉 means “admirable”, “conceited” or “excellent”. When used in context, it typically means greatness and is pronounced as ‘i’ or ‘erai’.


Complete the sentence in Chinese

Found another joke in my mailbox; this time it’s about 造句 (sentence making):





5. 题目 (prompt): 又 又








I really should avoid turning this into a Chinese 博客 (blog).

List of stop words

Stop words, sometimes known as stopwords or noise words (in the case of SQL Server), are words that are filtered out before or after the processing of natural language data (text). Hans Peter Luhn, one of the pioneers in information retrieval, is credited with coining the phrase and using the concept in his design. Stop word selection is controlled by human input and not automated. This is sometimes seen as a negative approach to the natural articles of speech. (Source: Wikipedia)

Here’s a list of stop words compiled from Mark Sanderson’s Information Retrieval linguistic utilities stop word list. It has been formatted as a PHP array for easy use:

[code lang="php"]
// Stop words compiled from Mark Sanderson's IR linguistic utilities list.
$stop_words = array(
    "a", "about", "above", "across", "after", "afterwards", "again", "against", "all", "almost",
    "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst",
    "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere",
    "are", "around", "as", "at", "back", "be", "became", "because", "become", "becomes",
    "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between",
    "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "cannot", "cant",
    "co", "computer", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do",
    "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else",
    "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere",
    "except", "few", "fifteen", "fifty", "fill", "find", "fire", "first", "five", "for",
    "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get",
    "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here",
    "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how",
    "however", "hundred", "i", "ie", "if", "in", "inc", "indeed", "interest", "into",
    "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less",
    "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more",
    "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely",
    "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor",
    "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one",
    "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out",
    "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same",
    "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show",
    "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something",
    "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that",
    "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore",
    "therein", "thereupon", "these", "they", "thick", "thin", "third", "this", "those", "though",
    "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward",
    "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
    "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence",
    "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which",
    "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with",
    "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves"
);
[/code]

And here is a list of Google stop words. I can’t recall where I got this from, but there are numerous sites with such information. Once again it is formatted as a PHP array, which you can quite easily convert to a Java array:

[code lang="php"]
// Commonly cited Google stop words (original source not recorded).
$google_stop_words = array(
    "I", "a", "about", "an", "are", "as", "at", "be", "by", "com",
    "de", "en", "for", "from", "how", "in", "is", "it", "la", "of",
    "on", "or", "that", "the", "this", "to", "was", "what", "when", "where",
    "who", "will", "with", "und", "the", "www"
);
[/code]

This is useful for filtering out common words in an English paragraph that may be deemed insignificant. This is one of the things I used to implement something like a tag discoverer based on word frequency.
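For the curious, here is a minimal sketch of how a word-frequency tag discoverer along those lines could look. It is not the original implementation: the function name `discover_tags`, the `$limit` parameter, the shortened stop word list and the sample text are all made up for illustration, and it only handles plain ASCII English.

```php
<?php
// Hypothetical helper: suggest tags from the most frequent non-stop words.
function discover_tags(string $text, array $stop_words, int $limit = 5): array
{
    // Lowercase the stop words and flip them into keys for quick lookups.
    $stop = array_flip(array_map('strtolower', $stop_words));

    // Lowercase the text and pull out runs of ASCII letters and apostrophes.
    preg_match_all("/[a-z']+/", strtolower($text), $matches);

    $counts = array();
    foreach ($matches[0] as $word) {
        // The lists above store "cant", "hasnt" etc. without apostrophes.
        $word = str_replace("'", '', $word);
        if ($word === '' || isset($stop[$word])) {
            continue; // skip empty tokens and stop words
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    arsort($counts); // most frequent words first
    return array_slice(array_keys($counts), 0, $limit);
}

// Usage with a deliberately tiny stop word list (assumption for the demo).
$demo_stop_words = array('a', 'the', 'is', 'of', 'and', 'in', 'to');
$demo_text = 'The cat and the dog chased the cat; the dog is in the garden of the cat.';
print_r(discover_tags($demo_text, $demo_stop_words, 2)); // cat, dog
```

In practice you would pass in one of the full arrays above instead of the tiny demo list. Words that occur equally often come back in no particular order, so treat the tail of the result as a rough suggestion rather than a ranking.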

I want to go MAKДOHAЛД’C

Ever wonder how to pronounce Russian words like компью́тер, студе́нт, па́спорт? Actually, they aren’t that hard: Gadling teaches you to read the Cyrillic alphabet in 5 minutes. I took longer.

And that’s “McDonald’s” in the title.

By the way, why is it “pronounce” but “pronunciation”?