Monthly Archives: June 2007

extracting fields in shell

A lot of shell scripts require processing some kind of data structured in fields or columns separated by special characters ( space, coma, semi colon, etc... )

This is a short tutorial that shows you how you can extract the fields in a stream of data. There are several ways of doing this and each has it's advantages of disadvantages.

Here is what I use:

  1. Using cut

    The 'cut' program will allow you to extract the fields separated by one character. you can specify which field to extract, and what is the field separator.
    Example: echo "a:b:c" | cut -f2 -d':' will output b
    The cut program has the advantage that it is simple to use, almost ( all ) Unix flavors have it included in the base distribution and is relatively lightweight ( ~33Kb with no library dependency other then libc on my gentoo Linux )
    The problem with cut is that the field separator can only be a single character.

  2. Using awk

    awk is a pattern scanning and processing language somehow similar perl. Actually it is believed that perl was inspired by languages like awk, perl, C, and some others. Awk is a lot more flexible then cur and can do a lot more. You can actually specify a regular expression for the field separator.
    Here is an example for extracting the fields separated by one or more spaces:
    echo "a b c"|awk '{print $2}' - this will print the second field. As you can see I have not specified any separator because awk uses <space> as the default separator. <space> means any number of spaces here.
    You can specify a different field separator by using the -F parameter.

  3. Using a shell function

    this may be the simplest and fastest solution but will only work if the field separator is composed of spaces or tabs only. As you may know the parameters are passed to a shell function separated by spaces. so you can just make a function that has the sole purpose of returning the field ( parameter ) you want.
    If I want to get the third field from a line I would do a function like this

    1.  

    getfield a b ccc ddd would display 'ccc' . This is more useful in a script where you need to get a field value from a variable containing some text but not so mush with whole files.

Do you know any other/better method ? Feel free to share them in the comments

Comment relish optimization

Comment Relish is a wordpress plugin that sends an email the first time someone comments on your blog. The plugin allows you to define the message in wp-admin and you can embed tags like author's name, email, website and others in the message.

This can be used to send a welcome message to the first time commentators and maybe invite them to subscribe to your rss feed.

The Problem

John Chow, tried to use it on his blog but the plugin and his huge number of comments ( 43000 ? ) put the site down for about 2 hours. He said the plugin misbehaved and he would not recommend it if you have a lot of comments.

I did a little simulation on a wordpres blog with about 25000 comments and I can see how the plugin misbehaves :). It took more then tho minutes to select the email addresses that should be emailed before I decided to stop it, because that's just unacceptable, but I think it would have taken a lot more.

I managed to optimize it easily by just modifying a field in the table used by this plugin. If you don't care how I did it and just want the optimized plugin skip to the end of the post. If not then read on...

How it works

The plugin uses a table ( cr_emailed ) to remember which addresses received an email, so that it does not send a message more then once. Basically the plugin just inserts a new record in this table every time someone new comments on the blog. It tries to see if someone is new by doing a LEFT JOIN select on two tables: comments and cr_emailed, and then filters the results that have cr_emailed.email = NULL. The problem with this is that the email column has no index so if you have 43000 comments the query will have a huge result set and mysql will have to process each result to find out if cr_emailed.email is null ( the email has not received a message yet ) .

The solution

This was really simple, just convert the email field to a varchar ( needed in order to be able to set a fixed length key on this field ) and then add an index on it.

After doing this the query that took more then two minutes, now takes a little less then 1 second.

Some one on John Chow's blog suggested to add an index to wp_comments.comment_author_email. I tried that but I noticed no improvements, and using explain on the sql query shows the index on comment_author_email is not used, so that's useless.

Looking more over the code I noticed another weird thing. The query executed to find the email addresses looks like this:

  1. span style="color: #ff0000;">'1'

I wonder why the author used the posts table in this query? The results from the posts table are not used anywhere in the code. I really couldn't see a good reason for keeping the posts table in this query so I removed the posts table from this query, making it even faster.

Get the source code

Here is a diff file between my version and the original version and here is a my new version . If you have this plugin already installed you have to uninstall it and remove the wp_cr_emailed table and then copy the new version and activate it in wp-admin.

I'm using the new version on this blog and is working but I have few comments here. I have only tested the modifications on this blog and on a fictional blog with 25000 automatically generated comments so the usual disclaimer applies... you know all that "WARNING NO WARRANTIES" stuff...

If you have a blog with many comments and you want to try this, let me know how it works in a real environment.

Google set to kill link ads

Google has a way of reporting paid links now. They say buying links is an attempt to game their pagerank algorithm and they want you to report sites that sell or buy paid links.

They agree links are a good way of advertising and are not against it, but they want those that display text link to put the rel="nofollow" attribute in the links. Using the "nofollow" attribute will means that GoogleBot will not follow the link, thus will not use it when computing the page rank for the destination url.

I think the only reason you would want text links on a site is because of that, to get a higher page rank and relevance, so by requiring webmasters to use nofollow, they are just killing text link advertising networks like Text Link Ads that work especially because they sell link ads that are followed and transfer page rank.

Google says this violates their guidelines. How can you violate a guideline, you can violate a rule, but if it's just a guideline that means you shouldn't be penalized for not following it.
And there are other problems with this policy. Links are supposed to mean that the owner of the site thinks that some other site is relevant, and that is why he links to it. Paid or not it can be relevant. Page rank is about relevancy, right ?
If I want my site in google search ads, I pay google for it, does that mean my ads are not relevant ? Google says it shows contextual ads because they are relevant to the content the user is seeing. It seems to me, it's relevant only if you pay google for the ad.

And here's another problem: How can google tell if the person that reports such violation does not lie? If I want to get my competitor out of google index or set him on a lower page rank I could just report him for buying text links. A lot of web sites have text links pointing to them, paid or not. It's hard to tell. Some disclose them, other's don't. This may influence the ones that do disclose them, not to disclose the links anymore. Why add to the risk of being reported?

css class names in IE 6

I just realized today that IE 6 class names must not start with something else other then [a-zA-Z]. So don't you dare name your classes  like _class1 cause it will not work in IE6.

It works well in Firefox 2.0, IE7, opera and safari and that will just make it harder when you'll try to discover the problem.

I know this seems like a lame mistake and something any designer should know, but well I'm not really a designer and some things you never know unless they hit you.

Agloco viewbar available

AGLOCO the company that pays you shares to browse the net, has finally released their view bar.

This is the tool bar you install in internet explorer or firefox that will display ads that are contextually targeted, then all you have to do is use your browser and navigate like you always do, but now you're getting paid for it.

AGLOCO aims to be the company that is 100% owned by it's members. The idea is that if you see ads you should be paid for it, after all, the ads are renting your screen's real estate.  the company is founded by the people that had a similar company in the past,  All Advantage, but the new model is promising because they will not spend money directly, instead they will pay the members in shares, that members should be able to exchange at some rate in the future. 

Get the viewbar, install it and start owning a piece of AGLOCO.

The temple of sun

Sun Microsystems created an online flash game for developers named "the temple of sun" as an attempt to get developers to try Sun studio 11.

The game is pretty much simplistic , you're in a temple similar to those created by the maya civilization but this one was created by developers to protect Sun Studio 11, because the people or their time were just not ready for it. you have to avoid some obstacles, collect some gold and answer C++ questions, and at the final you have to answer a question that can only be answered if you download sun studio and compile some code on it.

Unforunately sun studio doesn't install on my gentoo machine so there goes my chance to win up to $5000.

Linux apps anywhere

LINA is a virtual machine that aims to run linux applications on any operating systems.

The concept is similar to the Java VM, write your application in C, C++ ( and other languages supported in the future ), compile it with LINA and it will run on Windows, Linux, Mac OSX and maybe other OS. You will be able to run command line programs as well as gui applications that rely on GTK or QT and they will look native to the OS were you will run them.
Of course there is some overhead that will make your applications around 2x slower, but they say the performance will improve.

LINA VM and the tools that will be used to compile applications will be released using a dual license similar to MySQL's licensing model: GPL for those that release their applications under GPL and commercial for those that want to release proprietary applications.

The developement seems to have started 4 years ago but nothing has been released yet. The plan is to release the virtual machine and compiler tools this month.

Yet another perfect server

A of howtos start like that "The perfect Server" or "The Perfect Descktop" or "The perfect Setup". Howtoforge has lots of howtos like that. Their latest "the perfect server" literally shows you, ( because the howto has more screenshots then words 🙂 ) how to install Centos 4.5 and the servers that you need for hosting sites with mysql, php, email ftp and a control panel - ISPConfig.

A howto for each version

This is not their first "the perfect server for centos" howto, they had one for centos 4.4 and one for centos 5.0. Wonder how much different is Centos 4.4 from 4.5 and more exactly how much different if the howto for 4.4 from the one for 4.5? Well they replaced the version everywhere in the text and luckily the install gui was similar so they did not have to change the screenshots :). For 5.0 they did not have so much luck, they had to change the screenshots, and a add some more.

Made for Robots

What I don't like about these kind of howtos is that they don't explain much about the software you're installing, and they don't tell you why you do what they tell you to do. At some point they show you a screenshot and the only text is "in the next screen click next". That's not very useful, you don't learn anything like that. If something goes wrong or if you're trying to set things up on a different distro or disfferent OS you will not know what to do unless you're lucky to find a similar howto for your distro.

One other thing I don't like is that they show you even how to install Centos. Why not do a separate howto just for that, this way I will not get bored by those installation screenshots and they would get more space for writing some more/good details about what they are doing.