Tag Archives: language

List of stop words

Stop words, sometimes known as stopwords or noise words (in the case of SQL Server), are words that are filtered out before or after the processing of natural language data (text). Hans Peter Luhn, one of the pioneers of information retrieval, is credited with coining the phrase and using the concept in his design. The list is controlled by human input and is not automated. This is sometimes seen as a negative approach to the natural articles of speech. (Source: Wikipedia)

Here’s a list of stop words, compiled from Mark Sanderson’s Information Retrieval linguistic utilities stop words list. It has been formatted as a PHP array for easy use:

$stop_words = array("a", "about", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "computer", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere",
"six", "sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thick", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves");

And here is a list of Google stop words. I can’t recall where I got it from, but there are numerous sites with such information. Once again, it is formatted as a PHP array, which you can quite easily convert to a Java array:

$google_stop_words = array("I", "a", "about", "an", "are", "as", "at", "be", "by", "com", "de", "en", "for", "from", "how", "in", "is", "it", "la", "of", "on", "or", "that", "the", "this", "to", "was", "what", "when", "where", "who", "will", "with", "und", "www");

This is useful for filtering out common words in an English paragraph that may be deemed insignificant. It is one of the things I used to implement something like a tag discoverer based on word frequency.
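As a rough illustration of the tag-discoverer idea, here is a minimal sketch, assuming PHP 7 or later. The function name discover_tags, its parameters, and the short sample stop list are hypothetical and only for illustration; this is not the code I originally wrote. It lowercases the text, drops every word found in the stop list, and counts what remains.

```php
<?php
// Hypothetical helper (not from the original post): returns the most
// frequent non-stop words in $text as an array of word => count.
function discover_tags(string $text, array $stop_words, int $limit = 5): array
{
    // Lowercase the text and split on anything that is not a-z or an apostrophe.
    $words = preg_split("/[^a-z']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // Flip the stop list into a lookup table so each check is a key lookup.
    $stop_lookup = array_flip(array_map('strtolower', $stop_words));

    $counts = array();
    foreach ($words as $word) {
        if (isset($stop_lookup[$word])) {
            continue; // skip stop words
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    // Sort by count, highest first, keeping the words as keys.
    arsort($counts);

    return array_slice($counts, 0, $limit, true);
}

// A short sample list for the demo; swap in the full stop word array in practice.
$sample_stop_words = array("the", "on");
$top = discover_tags("The cat sat on the mat. The cat ran.", $sample_stop_words, 1);
print_r($top);
```

With the sample text, "cat" comes out on top with a count of 2. Words with equal counts may come out in any order, so a real tag discoverer would probably want a tie-breaking rule as well.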

I want to go to MAKДOHAЛД’C

Ever wonder how to pronounce Russian words like компью́тер, студе́нт, па́спорт? Actually, they aren’t that hard: Gadling teaches you to read the Cyrillic alphabet in 5 minutes. I took longer.

And the word in the title is “McDonald’s”.

By the way, why is it “pronounce” but “pronunciation”?

Why is learning programming so hard for some people?

I’ll try to explain based on my experience explaining Java to some friends. I have never been through formal programming training, which probably makes me a poor teacher.

To me, it is because they cannot accept the language as it is. They ask: why are things done this way? Why not another way? Modern programming languages are so abstract that it’s hard to see how the lower-level components interact.

Some learners need to fiddle with the lower layers before they can accept and understand the higher-level components. We just build tools on top of the lowest layer and then add more and more layers, thinking it makes life simpler.

I once lost my patience and said to a friend, “Why can’t you just memorize it? It’s by design. If you don’t like it, then design your own language.” Actually, that’s just an excuse, because the real reason is too long to explain. It’s like telling a primary school kid that light travels in a straight line even though you know full well it doesn’t, and thank god it doesn’t.

Perhaps all that layering does make life simpler for those who are already programmers, but it makes learning a lot harder.

Worst language I wrote in – Fortran

One of the worst programming languages I have ever written in is Fortran. It has a rather limited set of features. I dreaded going to work every day to stare at lines of code that basically represented the equations of a cholesterol research paper.

I was using the g77 compiler. The only thing I can remember is the nonsensical representation of while loops. It has only the most basic support for structured programming. I had the impression that programmers in the 1970s were like artists: they painted a first layer, then a second, then a third, and if there was a mistake they covered it up with a thick coat of paint. The whole program was like a gibberish piece of code, and no amount of comments ever made my life easier the next day.

However, it was through Fortran that I started appreciating more modern programming languages. I look at for-each loops and imagine how confusing it would be to represent the same code in Fortran.

Every time I hear people whine about how many lines of code there are and how confusing a chunk of code looks, I wish they could see things from my point of view. Imagine the number of mistakes that were made and later corrected, supposedly for the better.

Obama campaign introduces Al the shoe salesman

This is a brilliant ad by the Obama campaign. For those of you who aren’t familiar with what’s going on, American politics is really interesting. McCain-Palin (Republicans) brought phrases such as “hockey moms”, “Joe Six Pack” and “Joe the plumber” into American newspapers. These phrases are used to stereotype the typical American.

What got me interested in politics is not the results the politicians are going to deliver. After all, when you live thousands of miles away from the USA, it makes little difference who gets elected. What made me look at politics is the speeches or, more precisely, the ingenious use of the English language to reach people emotionally.

Introducing characters is just one way of doing so. As stupid as these phrases sound, people actually remember them. You can laugh at them, but as long as you talk about them (even in a negative way), you are indirectly spreading the politicians’ message.

I think of these characters as stock characters (in the theater arts sense), as they’re recycled time and again for every election. Politicians just rebrand them in small ways to make them sound new again.

The McCain-Palin campaign has numerous such characters. I’m sick of them, but I still laugh at them (alone, since no one in Singapore bothers about US politics). Anyway, here’s one endorsed by the Obama campaign:

Obama campaign introduces Al the shoe salesman

Find out your tax cut under Barack’s plan at http://taxcut.barackobama.com whether you’re single or married with children.

Previously, John McCain repeatedly mentioned Joe the Plumber during his speeches, claiming he is a concerned citizen who prefers the McCain tax plans.

Just to digress

For those who know the location of my other blog: it was a tough decision whether to put this post on this blog or on that one, which is rather US-focused. In the end I figured I should put it here, since I want this blog to have more of my opinion. The other blog is visited by McCain supporters, and they blast me even when I just post a video that’s pro-Obama. That’s freedom of expression for me, I guess.

And speaking of “plumber”, Uzyn corrected me on my pronunciation. I had always been pronouncing it as “plumb-ber”, and had read it wrong for many years. “Plum-er,” he corrected me.