Posted on December 8, 2012 at 6:33 pm in bash, FAQ, Shell Scripting
Shell Scripting

The most basic method of reading the contents of a file is a while loop with its input redirected:

while read ## no name supplied so the variable REPLY is used

do

  : do something with "$REPLY" here

done < "$kjv"

The file will be stored, one line at a time, in the variable REPLY. More commonly, one or more variable names will be supplied as arguments to read:

while read name phone

do

  printf "Name: %-10s\tPhone: %s\n" "$name" "$phone"

done < "$file"

The lines are split using the characters in IFS as word delimiters. If the file named in $file contains these two lines:

John 555-1234

Jane 555-7531

the output of the previous snippet will be as follows:

Name: John     Phone: 555-1234

Name: Jane     Phone: 555-7531

By changing the value of IFS before the read command, other characters can be used for word splitting. The same script, using only a hyphen in IFS instead of the default space, tab, and newline, would produce this:

$ while IFS=- read name phone

> do

> printf "Name: %-10s\tPhone: %s\n" "$name" "$phone"

> done < "$file"

Name: John 555 Phone: 1234

Name: Jane 555 Phone: 7531

Placing an assignment in front of a command causes it to be local to that command and does not change its value elsewhere in the script.
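A minimal sketch of that scoping rule, using a made-up sample line rather than a file. The names line, old_ifs, and ifs_state are introduced here for illustration only:

```shell
# Sample data line; the value is made up for illustration
line="John 555-1234"

# Remember the current IFS so it can be compared afterwards
old_ifs=$IFS

# IFS is a hyphen only for the duration of this one read command
# (-r keeps any backslashes in the input literal)
IFS=- read -r name phone <<EOF
$line
EOF

printf 'name=%s phone=%s\n' "$name" "$phone"

# The assignment did not leak into the rest of the script
if [ "$IFS" = "$old_ifs" ]; then
  ifs_state=unchanged
else
  ifs_state=changed
fi
printf 'IFS is %s\n' "$ifs_state"
```

The read sees the hyphen as its only delimiter, so name receives "John 555" and phone receives "1234", while IFS itself still holds its earlier value afterwards.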

To read the King James version of the Bible (henceforth referred to as KJV), the field separator IFS should be set to a colon so that lines can be split into book, chapter, verse, and text, each being assigned to a separate variable.

kjvfirsts, Print Book, Chapter, Verse, and First Words from KJV

while IFS=: read book chapter verse text

do

firstword=${text%% *}

printf "%s %s:%s %s\n" "$book" "$chapter" "$verse" "$firstword"

done < "$kjv"

The output (with more than 31,000 lines replaced by a single ellipsis) looks like this:

Genesis 001:001 In

Genesis 001:002 And

Genesis 001:003 And

...

Revelation 022:019 And

Revelation 022:020 He

Revelation 022:021 The

On my computer, a 1.6GHz Pentium 4 with many applications running, this script takes more than half a minute to run. The same task written in awk takes about a quarter of the time. See the section on awk later in this chapter for the script.

The awk programming language is often used in shell scripts when the shell itself is too slow (as in this case) or when features not present in the shell are required (for example, arithmetic using decimal fractions). The language is explained in somewhat more detail in the following section.
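As a small illustration of the decimal-fraction case, here is a sketch comparing the shell's integer arithmetic with awk. The variable names int_result and dec_result are made up for this example:

```shell
# Shell arithmetic is integer-only, so the fraction is discarded
int_result=$(( 10 / 4 ))

# awk keeps the fraction; a BEGIN action runs without reading any input
dec_result=$(awk 'BEGIN { printf "%.2f", 10 / 4 }')

printf 'shell: %s  awk: %s\n' "$int_result" "$dec_result"
```

The shell yields 2, whereas awk yields 2.50.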

External Commands

You can accomplish many tasks using the shell without calling any external commands. Some scripts use one or more external commands to provide data for the script to process. Other scripts are best written with nothing but external commands.

Often, the functionality of an external command can be duplicated within the shell; sometimes it cannot. Sometimes using the shell is the most efficient method; sometimes it is the slowest. Here I’ll cover a number of external commands that process files and show how they are used (and often misused). These are not detailed explanations of the commands; usually they are an overview with, in most cases, a look at how they are used, or misused, in shell scripts.


cat

One of the most misused commands, cat reads all the files on its command line and prints their contents to the standard output. If no file names are supplied, cat reads the standard input. It is an appropriate command when more than one file needs to be read or when a file needs to be included with the output of other commands:

cat *.txt | tr aeiou AEIOU > upvowel.txt

{

  date     ## Print the date and time

  cat report.txt     ## Print the contents of the file

  printf "Signed: "     ## Print "Signed: " without a newline

  whoami    ## Print the user's login name

} | mail -s "Here is the report" [email protected]

It is not necessary when the file or files could have been placed on the command line:

cat thisfile.txt | head -n 25 > thatfile.txt ## WRONG

head -n 25 thisfile.txt > thatfile.txt    ## CORRECT

It is useful when more than one file (or none) needs to be supplied to a command that cannot take a file name as an argument or can take only a single file, as in redirection. It is useful when one or more file names may or may not be on the command line. If no files are given, the standard input is used:

cat "$@" | while read x; do whatever; done

The same thing can be done using process substitution, the advantage being that variables modified within the while loop will be visible to the rest of the script. The disadvantage is that it makes the script less portable.

while read x; do : whatever; done < <( cat "$@" )
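The difference in variable visibility can be seen in a short sketch. It requires bash, since process substitution is not available in a plain POSIX shell. The sample input and the names piped_count and subst_count are made up for illustration:

```shell
# Three sample input lines (made up for illustration)
input=$'one\ntwo\nthree'

# Pipeline: the while loop runs in a subshell, so its count is lost
piped_count=0
printf '%s\n' "$input" | while read -r x; do
  piped_count=$(( piped_count + 1 ))
done

# Process substitution (bash-specific): the loop runs in the current shell
subst_count=0
while read -r x; do
  subst_count=$(( subst_count + 1 ))
done < <( printf '%s\n' "$input" )

printf 'pipeline: %s  process substitution: %s\n' "$piped_count" "$subst_count"
```

With bash's default settings, piped_count is still 0 after the first loop, while subst_count ends up as 3.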

Another frequent misuse of cat is to use the output as a list with for:

for line in $( cat "$kjv" ); do n=$(( ${n:-0} + 1 )); done

That script does not put lines into the line variable; it reads each word into it. The value of n will be 795989, which is the number of words in the file. There are 31,102 lines in the file. (And if you really wanted that information, you would use the wc command.)


head

By default, head prints the first ten lines of each file on the command line, or from the standard input if no file name is given. The -n option changes that default:

$ head -n 1 "$kjv"

Genesis:001:001:In the beginning God created the heaven and the earth.

The output of head, like that of any command, can be stored in a variable:

filetop=$( head -n 1 "$kjv")

In that instance, head is unnecessary; this shell one-liner does the same thing without any external command:

read filetop < "$kjv"

Using head to read one line is especially inefficient when the variable then has to be split into its constituent parts:
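A minimal sketch of that slower approach, assuming a colon-delimited line like the first line of the KJV file shown earlier. The temporary file sample and the helper variable rest are made up for illustration:

```shell
# Hypothetical sample file standing in for "$kjv"
sample=$(mktemp)
printf '%s\n' 'Genesis:001:001:In the beginning God created the heaven and the earth.' > "$sample"

# One external command to fetch the line...
filetop=$( head -n 1 "$sample" )

# ...then a chain of parameter expansions to peel off each field
book=${filetop%%:*}     # everything before the first colon
rest=${filetop#*:}      # drop the book
chapter=${rest%%:*}
rest=${rest#*:}         # drop the chapter
verse=${rest%%:*}
text=${rest#*:}         # what remains is the verse text

printf '%s\n' "$book" "$chapter" "$verse" "${text%% *}"
rm -f "$sample"
```

Each field needs its own expansion, and the whole thing still pays for starting an external process.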



That can be accomplished much more rapidly with read:

$ IFS=: read book chapter verse text < "$kjv"

$ sa "$book" "$chapter" "$verse" "${text%% *}"





Even reading multiple lines into variables can be faster using the shell instead of head:


{

  read line1

  read line2

  read line3

  read line4

} < "$kjv"

or, you can put the lines into an array:

for n in {1..4}

do

  read lines[${#lines[@]}]

done < "$kjv"

In bash-4.0 and later, the builtin command mapfile can also be used to populate an array:

mapfile -tn 4 lines < "$kjv"
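A small sketch of mapfile on a throwaway file, assuming bash 4.0 or later. The file name sample and its contents are made up for illustration. The -t option strips the trailing newline from each line, and -n 4 stops after four lines:

```shell
# Hypothetical six-line sample file
sample=$(mktemp)
printf '%s\n' one two three four five six > "$sample"

# Read at most four lines into the array "lines", without trailing newlines
mapfile -t -n 4 lines < "$sample"

printf '%s lines read; last is %s\n' "${#lines[@]}" "${lines[3]}"
rm -f "$sample"
```

Only the first four lines end up in the array, so the last element is "four".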


touch

The default action of touch is to update the timestamp of a file to the current time, creating an empty file if it doesn’t exist. An argument to the -d option changes the timestamp to that time rather than the present. It is not necessary to use touch to create a file. The shell can do it with redirection:

> filename

Even to create multiple files, the shell is faster:

for file in {a..z}$RANDOM

do

  > "$file"

done


ls

Unless used with one or more options, the ls command offers little functional advantage over shell file name expansion. Both list files in alphabetical order. If you want the files displayed in neat columns across the screen, ls is useful. If you want to do anything with those file names, it can be done better, and often more safely, in the shell.

With options, however, it’s a different matter. The -l option prints more information about the file, including its permissions, owner, size, and date of modification. The -t option sorts the files by last modification time, most recent first. The order (whether by name or by time) is reversed with the -r option.

I often see ls misused in a manner that can break a script. File names containing spaces are an abomination, but they are so common nowadays that scripts must take their possibility (or should I say inevitability?) into account. In the following construction (that I see all too often), not only is ls unnecessary, but its use will break the script if any file names contain spaces:

for file in $(ls); do

The result of command substitution is subject to word splitting, so file will be assigned to each word in a file name if it contains spaces:

$ touch {zzz,xxx,yyy}\ a ## create 3 files with a space in their names

$ for file in $(ls *\ *); do echo "$file"; done

xxx

a

yyy

a

zzz

a

On the other hand, using file name expansion gives the desired (that is, correct) results:

$ for file in *\ *; do echo "$file"; done

xxx a

yyy a

zzz a


cut

The cut command extracts portions of a line, specified either by character or by field. It reads from files listed on the command line or from the standard input if no files are specified. The selection to be printed is made by using one of three options, -b, -c, and -f, which stand for bytes, characters, and fields. Bytes and characters differ only when used in locales with multibyte characters. Fields are delimited by a single tab (consecutive tabs delimit empty fields), but that can be changed with the -d option.

The -c option is followed by one or more character positions. Multiple columns (or fields when the -f option is used) can be expressed by a comma-separated list or by a range:

$ cut -c 22 "$kjv" | head -n3

e

h

o

$ cut -c 22,24,26 "$kjv" | head -n3

ebg

h a

o a

$ cut -c 22-26 "$kjv" | head -n3

e beg

he ea

od sa

A frequent misuse of cut is to extract a portion of a string. Such manipulations can be done with shell parameter expansion. Even if it takes two or three steps, it will be much faster than calling an external command.

$ boys="Brian,Carl,Dennis,Mike,Al"

$ printf "%s\n" "$boys" | cut -d, -f3 ## WRONG

Dennis

$ IFS=,     ## Better, no external command used

$ boyarray=( $boys )

$ printf "%s\n" "${boyarray[2]}"

Dennis

$ temp=${boys#*,*,} ## Better still, and more portable

$ printf "%s\n" "${temp%%,*}"

Dennis


wc

To count the number of lines, words, or bytes in a file, use wc. By default, it prints all three pieces of information in that order followed by the name of the file. If multiple file names are given on the command line, it prints a line of information for each one and then the total:

$ wc "$kjv" /etc/passwd

31102 795989 4639798 /home/chris/kjv.txt

50    124    2409 /etc/passwd

31152 796113 4642207 total

If there are no files on the command line, wc reads from the standard input:

$ wc < "$kjv"

31102 795989 4639798

The output can be limited to one or two pieces of information by using the -c, -w, or -l option. If any options are used, wc prints only the information requested:

$ wc -l "$kjv"

31102 /home/chris/kjv.txt

Newer versions of wc have another option, -m, which prints the number of characters, which will be less than the number of bytes if the file contains multibyte characters. The default output remains the same, however.

As with so many commands, wc is often misused to get information about a string rather than a file. To get the length of a string held in a variable, use parameter expansion: ${#var}. To get the number of words, use set and the special parameter $#:

set -f

set -- $var

echo $#

To get the number of lines, use this:

IFS=$'\n'

set -f

set -- $var

echo $#
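The three measurements can be collected in one place. The sketch below is only one way to arrange it: it runs the word and line counts inside command substitutions, which are subshells, so that set -f, the changed IFS, and the positional parameters do not spill into the rest of the script. The sample text and the names chars, words, lines, and nl are made up for illustration:

```shell
# Sample multi-line string (made up for illustration)
var='the quick brown fox
jumps over
the lazy dog'

# A literal newline, for use as IFS
nl='
'

# Length in characters, by parameter expansion
chars=${#var}

# Word count: split on the default IFS, with globbing off
words=$( set -f; set -- $var; echo $# )

# Line count: split on newlines only
lines=$( IFS=$nl; set -f; set -- $var; echo $# )

printf 'chars=%s words=%s lines=%s\n' "$chars" "$words" "$lines"
```

For this sample the result is 43 characters, 9 words, and 3 lines.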

Regular Expressions

Regular expressions (often called regexes or regexps) are a more powerful form of pattern matching than file name globbing and can express a much wider range of patterns more precisely. They range from very simple (a letter or number is a regex that matches itself) to the mind-bogglingly complex. Long expressions are built with a concatenation of shorter expressions and, when broken down, are not hard to understand.

There are similarities between regexes and file-globbing patterns: a list of characters within square brackets matches any of the characters in the list. An asterisk, however, matches zero or more occurrences of the preceding character, not any string of characters as in file name expansion. A dot matches any character, so .* matches any string of any length, much as an asterisk does in a globbing pattern.
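To make the difference in the meaning of the asterisk concrete, here is a sketch that tests the same pattern text, ab*c, once as a regex and once as a glob. It assumes bash, whose [[ … ]] supports the =~ regex operator. The word list and variable names are made up for illustration:

```shell
re='^ab*c$'      # regex: a, then zero or more b, then c
regex_list=
glob_list=

for w in ac abc abbbc abXc; do
  # =~ is bash's regex match operator inside [[ ... ]]
  if [[ $w =~ $re ]]; then
    regex_list="${regex_list:+$regex_list }$w"
  fi
  # case uses globbing: ab, then any string, then c
  case $w in
    ab*c) glob_list="${glob_list:+$glob_list }$w" ;;
  esac
done

printf 'regex matches: %s\n' "$regex_list"
printf 'glob matches:  %s\n' "$glob_list"
```

The regex accepts ac, abc, and abbbc; the glob accepts abc, abbbc, and abXc.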

Three important commands use regular expressions: grep, sed, and awk. The first is used for searching files, the second for editing files, and the third for almost anything because it is a complete programming language in its own right.


grep

grep searches files on the command line, or the standard input if no files are given, and prints lines matching a string or regular expression.

$ grep ':0[57]0:001:' "$kjv" | cut -c -78

Genesis:050:001:And Joseph fell upon his father’s face, and wept upon him, and

Psalms:050:001:The mighty God, even the LORD, hath spoken, and called the eart

Psalms:070:001:MAKE HASTE, O GOD, TO DELIVER ME; MAKE HASTE TO HELP ME, O LORD

Isaiah:050:001:Thus saith the LORD, Where is the bill of your mother’s divorce

Jeremiah:050:001:The word that the LORD spake against Babylon and against the

The shell itself could have done the job:

while read line

do

  case $line in

    *0[57]0:001:*) printf "%s\n" "${line:0:78}" ;;

  esac

done < "$kjv"

but it takes many times longer.

Often grep and other external commands are used to select a small number of lines from a file and pipe the results to a shell script for further processing:

$ grep 'Psalms:023' "$kjv" |

> {

> total=0

> while IFS=: read book chapter verse text

> do

>    set -- $text ## put the verse into the positional parameters

>    total=$(( $total + $# )) ## add the number of parameters

> done

> echo $total

> }



grep should not be used to check whether one string is contained in another. For that, there is case or bash’s expression evaluator, [[ … ]].
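Both alternatives can be sketched as follows. The sample strings and the names haystack, needle, found_case, and found_test are made up for illustration. The case form works in any POSIX shell, while the [[ … ]] form assumes bash:

```shell
haystack='In the beginning God created the heaven and the earth.'
needle='God'

# case: the quoted "$needle" is matched literally between two wildcards
case $haystack in
  *"$needle"*) found_case=yes ;;
  *)           found_case=no ;;
esac

# [[ ... ]]: the right-hand side of == is treated as a pattern
if [[ $haystack == *"$needle"* ]]; then
  found_test=yes
else
  found_test=no
fi

printf 'case: %s  [[ ]]: %s\n' "$found_case" "$found_test"
```

Neither form starts an external process, which is the point of avoiding grep for a single string.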


sed

For replacing a string or pattern with another string, nothing beats the stream editor sed. It is also good for pulling a particular line or range of lines from a file. To get the first three lines of the book of Leviticus and convert the name of the book to uppercase, you’d use this:

$ sed -n '/Lev.*:001:001/,/Lev.*:001:003/ s/Leviticus/LEVITICUS/p' "$kjv" |

> cut -c -78

LEVITICUS:001:001:And the LORD called unto Moses, and spake unto him out of th

LEVITICUS:001:002:Speak unto the children of Israel, and say unto them, If any

LEVITICUS:001:003:If his offering be a burnt sacrifice of the herd, let him of

The -n option tells sed not to print anything unless specifically told to do so; the default is to print all lines whether modified or not. The two regexes, enclosed in slashes and separated by a comma, define a range from the line that matches the first one to the line that matches the second; s is a command to search and replace and is probably the most often used.

When modifying a file, the standard Unix practice is to save the output to a new file and then move it to the place of the old one if the command is successful:

sed 's/this/that/g' "$file" > tempfile && mv tempfile "$file"

Some recent versions of sed have an -i option that will change the file in situ. If used, the option should be given a suffix to make a backup copy in case the script mangles the original irretrievably:

sed -i.bak 's/this/that/g' "$file"

More complicated scripts are possible with sed, but they quickly become very hard to read. This example is far from the worst I’ve seen, but it takes much more than a glance to figure out what it is doing. (It searches for Jesus wept and prints lines containing it along with the lines before and after; you can find a commented version at

sed -n '/Jesus wept/ !{
h
}
/Jesus wept/ {
N
x
G
p
a\
---
s/.*\n.*\n\(.*\)$/\1/
h
}' "$kjv"

As you’ll see shortly, the same program in awk is comparatively easy to understand.


awk

awk is a pattern scanning and processing language. An awk script is composed of one or more condition-action pairs. The condition is applied to each line in the file or files passed on the command line or to the standard input if no files are given. When the condition resolves successfully, the corresponding action is performed.

The condition may be a regular expression, a test of a variable, an arithmetic expression, or anything that produces a nonzero or nonempty result. It may represent a range by giving two conditions separated by a comma; once a line matches the first condition, the action is performed until a line matches the second condition. For example, this condition matches input lines 10 to 20 inclusive (NR is a variable that contains the current line number):

NR == 10, NR == 20
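A quick way to see the range at work is to feed awk numbered input. This is a sketch that assumes bash for the {1..30} brace expansion; the variable name selected is made up for illustration:

```shell
# Generate the numbers 1 to 30, one per line, and keep input lines 10 to 20.
# With no action given, awk prints the matching lines.
selected=$(printf '%s\n' {1..30} | awk 'NR == 10, NR == 20')

printf '%s\n' "$selected"
```

Because each input line holds its own line number, the output is the eleven numbers from 10 to 20.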

There are two special conditions, BEGIN and END. The action associated with BEGIN is performed before any lines are read. The END action is performed after all the lines have been read or another action executes an exit statement.

The action can be any computation task. It can modify the input line, it can save it in a variable, it can perform a calculation on it, it can print some or all of the line, and it can do anything else you can think of.

Either the condition or the action may be missing. If there is no condition, the action is applied to all lines. If there is no action, matching lines are printed.
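Both shortcuts can be tried on a few lines of made-up data. The names data, positive, and names are introduced here for illustration only:

```shell
# Sample input: a name and a number on each line (made up for illustration)
data='apple 3
banana 0
cherry 7'

# Condition only: lines whose second field is greater than zero are printed
positive=$(printf '%s\n' "$data" | awk '$2 > 0')

# Action only: the action runs on every line
names=$(printf '%s\n' "$data" | awk '{ print $1 }')

printf '%s\n' "$positive" "---" "$names"
```

The first command prints the apple and cherry lines unchanged; the second prints the first field of all three lines.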

Each line is split into fields based on the contents of the variable FS. By default, it is any whitespace. The fields are numbered: $1, $2, and so on. $0 contains the entire line. The variable NF contains the number of fields in the line.
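A short sketch of those variables on a single made-up line. Since NF holds the number of fields, $NF refers to the last field. The names line and summary are introduced for illustration:

```shell
line='In the beginning God created'

# NF is the field count; $1 is the first field and $NF the last
summary=$(printf '%s\n' "$line" | awk '{ print NF ": first=" $1 ", last=" $NF }')

printf '%s\n' "$summary"
```

For this line the summary reads "5: first=In, last=created".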

In the awk version of the kjvfirsts script, the field separator is changed to a colon using the -F command-line option (Listing 8-2). There is no condition, so the action is performed for every line. It splits the fourth field, the verse itself, into words, and then it prints the first three fields and the first word of the verse.

kjvfirsts-awk, Print Book, Chapter, Verse, and First Words from the KJV

awk -F: ' ## -F: sets the field delimiter to a colon

{

  ## split the fourth field into an array of words

  split($4,words," ")

  ## printf the first three fields and the first word of the fourth

  printf "%s %s:%s %s\n", $1, $2, $3, words[1]

}' "$kjv"

To find the shortest verse in the KJV, the next script computes the length of each line minus the length of the book name, using the length() function. If the result is less than the smallest value seen so far, it is stored in min, and the line is stored in verse. At the end, the line stored in verse is printed.

$ awk -F: 'BEGIN { min = 999 } ## set min larger than any verse length

length($0) - length($1) < min {

  min = length($0) - length($1)

  verse = $0

}

END { print verse }' "$kjv"

John:011:035:Jesus wept.

As promised, here is an awk script that searches for a string (in this case, Jesus wept) and prints it along with the previous and next lines:

awk '/Jesus wept/ {

  print previousline

  print $0

  n = 1

  next

}

n == 1 {

  print $0

  print "---"

  n = 2

}

{

  previousline = $0

}' "$kjv"

To total a column of numbers:

$ printf "%s\n" {12..34} | awk '{ total += $1 }

> END { print total }'

529


This has been a very rudimentary look at awk. There will be a few more awk scripts later in the book, but for a full understanding, there are various books on awk.
