Extracting fields in shell

A lot of shell scripts need to process data structured in fields or columns separated by special characters (space, comma, semicolon, etc.).

This is a short tutorial that shows how you can extract fields from a stream of data. There are several ways of doing this, and each has its advantages and disadvantages.

Here is what I use:

  1. Using cut

    The 'cut' program extracts fields separated by a single character. You specify which field to extract with -f and the field separator with -d.
    Example: echo "a:b:c" | cut -f2 -d':' will output b
    The advantages of cut are that it is simple to use, almost all Unix flavors include it in the base distribution, and it is relatively lightweight (~33 KB with no library dependency other than libc on my Gentoo Linux).
    The problem with cut is that the field separator can only be a single character.
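To make the single-character limitation concrete, here is a small sketch. The sample record and the variable names are made up for illustration; only the -d and -f options of cut are used.

```shell
# Sample colon-separated record (illustrative value, not from a real file).
line="root:x:0:0:root:/root:/bin/bash"

# One field: the first colon-separated field.
user=$(printf '%s\n' "$line" | cut -d: -f1)

# Several fields at once: fields 1 and 7, still joined by the delimiter.
user_shell=$(printf '%s\n' "$line" | cut -d: -f1,7)

# cut does not collapse repeated delimiters: every single space starts a
# new field, so field 2 of "a  b" (two spaces) is the empty string.
empty=$(printf 'a  b\n' | cut -d' ' -f2)

echo "$user"
echo "$user_shell"
echo "[$empty]"
```

The last case is the practical consequence of the limitation: input aligned with a variable number of spaces usually needs awk or the shell itself rather than cut.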

  2. Using awk

    awk is a pattern scanning and processing language somewhat similar to Perl. In fact, Perl was inspired by languages like awk, sed, C, and some others. awk is a lot more flexible than cut and can do a lot more. In particular, you can specify a regular expression as the field separator.
    Here is an example for extracting the fields separated by one or more spaces:
    echo "a b c" | awk '{print $2}' - this will print the second field. As you can see, I have not specified any separator, because awk splits on whitespace by default: any run of spaces or tabs counts as a single separator.
    You can specify a different field separator by using the -F parameter.
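As a rough illustration of -F, the sketch below uses a single-character separator and then a regular expression. The sample strings and variable names are invented for the example.

```shell
# A single-character separator works much like cut -d.
second=$(echo "a:b:c" | awk -F: '{print $2}')

# A separator longer than one character is treated as a regular expression.
# Here fields are separated by a comma followed by any number of spaces.
third=$(echo "a,b,   c" | awk -F', *' '{print $3}')

# With the default separator, runs of spaces count as one separator.
# NF is the number of fields, so $NF is the last field.
last=$(echo "one two   three" | awk '{print $NF}')

echo "$second"
echo "$third"
echo "$last"
```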

  3. Using a shell function

    This may be the simplest and fastest solution, since no external program is started, but it only works if the field separator consists of spaces or tabs. As you may know, the shell splits an unquoted command line into words on whitespace, and each word is passed to a function as a separate parameter. So you can write a function whose sole purpose is to return the field (parameter) you want.
    If I want to get the third field from a line, I would write a function like this:

    getfield() { echo "$3"; }

    getfield a b ccc ddd would display 'ccc'. This is more useful in a script where you need to get a field value from a variable containing some text, but not so much with whole files.
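The same idea can be generalised so that the field number is passed as an argument. This is only a sketch: nthfield is a hypothetical name, not something from a standard library, and it assumes the requested field exists and that the text contains no wildcard characters such as *.

```shell
# Hypothetical helper: print the Nth whitespace-separated field.
# Usage: nthfield N word...
nthfield() {
    n=$1
    shift                # drop N itself; $1, $2, ... are now the fields
    shift "$((n - 1))"   # drop the fields that come before the Nth one
    printf '%s\n' "$1"
}

line="a b ccc ddd"

# $line is deliberately left unquoted so that the shell splits it into words.
third=$(nthfield 3 $line)

echo "$third"
```

Because the unquoted expansion is also subject to filename expansion, a field such as * would be replaced by file names; running set -f beforehand is one way to avoid that.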

Do you know any other or better methods? Feel free to share them in the comments.
