MX : Unix C Shell Scripting

I can accomplish a lot of tasks using simple Unix shell scripts, and this is a short guide to give you ideas about how to get started. While you can use GUIs like CCP4i for a series of tasks, automating those tasks or applying them to a large number of datasets or PDB files is something beyond the scope of most GUIs. You're also beholden to what options the GUI-writer considered important. If you get comfortable with Unix shell scripting you can do whatever you want as long as programs exist to do it. The next step is to write your own programs in C, C++, Fortran or the trendy language of the month, but we're not going to get into that here.

C Shells and Bourne Again Shells

A Unix shell is just the interpreter that adds a more human interface to the Unix operating system. Graphical desktop environments have supplanted many of the commoner applications of these shells, but if you're logging in from home the Unix shell is your primary interface. Get comfortable with it. Historically the C shell (csh) and Bourne shell (sh) were the commonest ones and variants of them are still your main options today. Tcsh (TENEX C shell) is an enhanced version of csh and is found widely - I use this as my default shell. Bash (Bourne Again Shell) is an enhanced version of sh and is also found quite widely. There are also several other shells like ksh, zsh etc - see the Wikipedia entry for Unix shells.

A certain amount of zealotry and dogma can accompany the choice of shell. Csh and tcsh are probably the more common ones in crystallography because the enhancements in tcsh were available more widely than bash at some time in the past when we were all still working on Silicon Graphics hardware. For those of us coming from VMS, the ability to use up-arrow to recall a previous command put tcsh ahead of sh by miles. All crystallographic packages that I'm aware of come as a set of Unix programs, although often with GUI capability incorporated. In the case of CCP4i for example the GUI basically invokes the underlying CCP4 program running in something resembling shell mode. Programs that are solely GUI based, like Coot or Pymol, also have scripting capabilities built into them.

Program suites often come with two different startup or configuration files compatible with the different (sh, csh) shell syntaxes, however when there is only one config file version it is more likely to be written in C shell than in Bourne shell. There's a difference between using shells for command line execution and using them to write elaborate scripts. Perl and Tcl (and perhaps even Python) are better suited to such scripts. Sh/Bash zealots like to point out C shell limitations in the widely disseminated page C shell programming considered harmful but frankly there are better scripting languages to do most of the more advanced system administration functions in Unix.

Consequently, learn Perl or Tcl (or even C) if you want to do excessively cute things with scripts, and Keep It Simple in tcsh or bash.

You can also RTFM tcsh. There are any number of shell guides and introductions if you just Google for them.

Writing the script

Emacs or vi or their variants (xemacs, vim) are the main ways to write your script. Programs like Nedit and Pico/Nano are alternatives. Scripts need to be written as "plain text" aka ASCII text and have no extra formatting information and the aforementioned programs are good at that. Whatever editor you choose, do NOT write your script in a word processor like Micro$loth Word or Textedit on OSX. Word will write all sorts of formatting codes into the file that you don't want and which will spawn errors on script execution. You just want to save the characters that you type into the editor window and nothing else.

The simple non-graphical editors like emacs, vi and nano are also the main option for doing script development on remote machines where you're probably logging in over ssh or telnet. I'm writing this using xemacs, for example.

Hash Pling Slash (#!/)

This somewhat odd series of characters tells the system what shell you want the script to execute with. It's not all that safe to assume that Unix will use your current shell to execute the script, so the first line of your command file should look like:

#!/bin/csh -f

When you run a script (e.g. with ./hello_world.csh) the system reads this line and understands it to mean: execute the script as input to the program listed after the #! characters. You can write Perl scripts, for example, with #!/usr/bin/perl in the header of the script file, and do the same with tcl, awk, python etc. It's a very useful feature. In the case shown above the "-f" flag tells csh not to execute your .cshrc when it starts up - this speeds script execution and hopefully it inherits the relevant shell environment when it starts from the command shell. This also avoids any commands in .cshrc that really should be in .login (things that modify the terminal settings, for example). Remove the -f if there seem to be problems with picking up program setups - in particular aliases do not always seem to get inherited.

The archetypal "Hello World" shell script could look like:

#!/bin/csh -f
#
# this is a comment
#
echo "hello world"

which illustrates the first (interpreter) line, the comment lines beginning with "#" and a small piece of code that does something - using the command "echo" to print "hello world" on the screen. Comments are useful, nay essential in more complex scripts. This example runs under C shell because that's what we told it to start under. This same script will execute just fine in sh if you change the first line to read

#!/bin/sh

Download hello_world.csh and/or hello_world.sh as C shell or Bourne shell versions respectively. Although file extensions are not mandatory in Unix they are a very useful self-documenting feature for what the script is written as, and adding an extension distinguishes a script from a compiled program (which is either called a.out or more conventionally lacks any sort of name extension).

Making it runnable

If you write the script using a regular text editor, the script is essentially just a bit of text data sitting around on disk. You will still need to make the script executable or you won't be able to run it. Use

chmod u+x hello_world.csh

to make it executable by yourself, or

chmod a+x hello_world.csh

to make it executable by everyone. See the docs for chmod. "+x" means "add the execute bit" to the permissions for the file, and the character preceding the "+" tells chmod who that should apply to (u+x = yourself, g+x = group, o+x = world, a+x = all, etc). A bare "+x" with no preceding character applies to everyone, filtered by your umask. Editors create text files without the execute bit for security and sanity reasons, so any arbitrary text file you create will not be executable unless you explicitly enable it.

How you can actually execute this script depends on how you've told the shell where to look for executable programs. In Unix this is usually via the PATH variable, which is a list of directories telling the shell where to look for executable scripts and programs. Including the current directory (".") in PATH is somewhat of a security risk. If "." is in there, you can invoke the program by name "hello_world.csh" if it is in your current directory. If "." is not in PATH you'll have to tell the shell where to look, i.e. "./hello_world.csh". To look at your path do the following in C shell:

echo $path

which gives a space-delimited list of directories. Or in Bourne shell:

echo $PATH

which will be a colon-delimited list. One of the differences between C shell and Bourne shell is that the former tends to have lower case variables and the latter tends to have upper case ones but there's some coexistence between the two variable types.

You can always use the full path to the file as its name and execute it that way:

/Users/phil/scripts/hello_world.csh

will run the program without consulting the PATH variable because you've explicitly given the location of the file. Absolute paths to the file (the name begins with /, as in "/Users/phil/scripts/hello_world.csh") and relative paths to the file (e.g. "scripts/hello_world.csh" if you're in /Users/phil) would both work if you're pointing to the right file location. I tend to use absolute paths in scripts to refer to file locations because if the programs are in the user environment via an alias often that's not transferred to (inherited by) the script when it runs.

Simple C shell Syntax

The "C Shell Field Guide" by Anderson & Anderson is the definitive reference book, but to summarise simple syntax:

# signifies a comment line
the first word on the line is assumed to be an executable program, script or equivalent built-in command - the program can have its path explicitly specified (/bin/ls or ./hackit), or implicitly specified (ls). In the latter case the shell consults the PATH variable to find the first match to the program "ls".
"string" or 'string' designates a character string - in the double-quote case "$test" is converted to the value of the variable "test" if it exists; if it does not, the shell will throw an error. The value of '$test' is in fact literally $test - it does not do variable substitution inside single quote marks. The shell strips off the outermost set of quotes if it's an argument to a program/script so the program doesn't see them.
set name = value assigns a value to the variable "name" which can be referenced by $name. Variables are lower case by convention. Variable values are always strings, except when they are evaluated as numbers using the "@" math method. Put a space on both sides of the = sign (a space on only one side causes problems).
alias allows you to make shorthands. I use alias ll 'ls -alF' to define a command "ll" that is similar to some other flavors of unix.
Simple flow control similar to the C programming language like:
```
if (expression) then
  ...
endif
```
and
```
while (expression)
  ...
end
```
also work. The C-like switch statement is also supported.
Simple lists are enclosed with parentheses: (a b c d e) is a list with each of the letters as a list member. (*.mtz) generates a list populated by the filenames matching "*.mtz" in the current directory. There are no lists of lists.
```
foreach varname (list)
...
end
```
is a flow control structure that takes advantage of lists, assigning varname to each list element in turn.
Command line parameters: $0 is the name of the script as it was invoked. $1...$9 are the first nine command-line parameters. $# is the number of command line parameters (a tcsh shorthand for $#argv). $* is all the command line parameters.
The == and != operators compare strings, while <, >, <= and >= compare integers, so simple comparisons like "if (18 < 23)" work fine.
- "str1 == str2" checks the identity of two strings
- "str1 != str2" checks to see if the strings are different
- "str1 =~ expr" checks if the string matches the expression which may contain "*"
- "str1 !~ expr" checks that the string does not match the expression which may contain "*"

If you're writing shell scripts I assume you already know a little about redirection, but to reiterate:

> filename - redirects output to the filename - it will overwrite existing files, not append.
>& filename - redirects standard output and standard error to the filename
>> filename and >>& filename appends to, rather than overwrites, the file.
< filename - redirects input from the file to the program
<< word - see below
| - connect the standard output of the prior command/program to the standard input of the subsequent command/program (i.e. a pipe).
|& - same thing as | but also connects standard error to standard input

The & forms are useful only when you're trying to specifically trap error messages. Do not confuse this usage of & with the "run command in background" method of appending & to the end of the command line. If you have "noclobber" set in tcsh the redirection commands may refuse to overwrite existing files or append to files not already in existence. You have to use ">!" and ">>!" to override this behavior - consult the tcsh manual.

The csh/tcsh feature that we are most concerned with is how to get my data into my program. Specifically you want to get the shell to shove a series of lines into the program being executed rather than interpret them in shell syntax. One very tedious way to achieve this is to do:

echo "first line of program input" > instructions.dat
echo "second line of program input" >> instructions.dat
echo "third line of program input" >> instructions.dat
program_name < instructions.dat

i.e. writing the input syntax to a file via the standard shell syntax for redirection > and >> and then getting it to read from that file via <. There are many situations in which the "program < instructions" paradigm is used, but in the shell there is another way to write the example above via a more compact version:

program_name << EOF-prog
first line of program input
second line of program input
third line of program input
EOF-prog

which says "take everything between the two instances of EOF-prog and treat it as input to the program". Note that the last EOF-prog is not seen by the program - the input is terminated at the third line. EOF-prog is not some sort of special string, it's just any old word. If you have the word in single or double quotes, variable substitution in the program input lines is disabled, but otherwise happens if there's no quotation of the word (this is actually useful in shell programming). If your data lines contain "$" as input and the shell keeps throwing "variable not set" errors, you might want to use single quotes: <<'EOF' rather than the plain <<EOF, the former disables the substitutions. The quote from the manual is:

Reads the C Shell input up to a line that is identical to word. word is not 
subjected to variable, file name or command substitution, and each input 
line is compared to word before any substitutions are done on the input line. 
Unless a quoting \, ", ', or ` appears in word, variable and command 
substitution is performed on the intervening lines, allowing \ to quote $, 
\ and `. Commands that are substituted have all blanks, tabs, and newlines 
preserved, except for the final newline which is dropped. The resultant text 
is placed in an anonymous temporary file that is given to the command as its 
standard input.

I use this construction all the time in program scripts, as in this one to run SHELXC on some data:

#!/bin/csh -f
# 
# run SHELXC
#
/usr/local/shelx/macosx/shelxc se1 << EOF
HREM se1rm.sca
PEAK se1pk.sca
INFL se1in.sca
LREM se1lo.sca
CELL 35.104  63.496  76.458 90. 90. 90.
SPAG P212121
FIND 8
NTRY 50
EOF
#

SHELXC sees input commands starting at "HREM ..." and ending at "NTRY ...". I'm lazy and always use the string EOF for this method of delimiting program data. However you will reduce the potential for mayhem if you make each "EOF" have a distinct name and make sure they appear as pairs. I think any simple string is valid, not just the ones based on "EOF" but EOF is an acronym for End Of File so it has some archaic relevance to what we are doing. Less so if you're not a programmer. In any event make it something distinctive, probably partially upper case, so as not to make it look like data or the name of a program.

Getting Into Trouble With More Advanced Shell Syntax

Various examples:

You could create a simple disk space monitoring script:

#!/bin/csh -f
#
#
while (1)
sleep 60
df -kl
end

which introduces the syntax for while....end and also the sleep command - this one just sits there and runs "df -kl" every minute until you kill it using something like control-C. "1" and "0" are the same as "true" and "false".

You can simplify laborious tasks for doing things like calculating the Mean Fractional Isomorphous Difference (MFID) between all possible pairs of MTZ (.mtz) files containing single datasets:

#!/bin/csh -f
#
echo "" > mfid.log
foreach file1 (*.mtz)
 foreach file2 (*.mtz)
  echo "Using $file1 and $file2" >> mfid.log
  ./mfid.csh $file1 $file2 >> mfid.log
 end
end

OK, so there's a lot going on in this short script. First the syntax (*.mtz) is a list of filenames matching the pattern specified (i.e. all files that end in .mtz). This is an example of "globbing". Globbing and regular expressions have related but different syntax, with the latter being more powerful. The () parentheses turn the matches into a list. The syntax foreach name (list) ..... end cycles through this list one filename (list entry) at a time and assigns the list value in turn to the variable whose name you specified. I used two nested "foreach" loops in the above example. Then the mfid.csh script takes the two file names as arguments and the output of that script gets concatenated to the log file "mfid.log". The two echo commands first create a blank log file and then write information on the filenames to the log file (but mfid.csh could also write those names to the output, as an alternative).

Now, how do you get mfid.csh to accept the filenames as arguments ? Well the shell allows this via special variables $0, $1, $2 etc:

#!/bin/csh -f
#
\rm mfid_merged.mtz
#
cad HKLIN1 $1 HKLIN2 $2 HKLOUT  mfid_merged.mtz << eof-cad              

RESOLUTION OVERALL 50.0 6.
SYMMETRY P6122
TITLE  merge two data files together

LABIN  FILE 1 E1=F   E2=SIGF
CTYPE  FILE 1 E1=F   E2=Q  
LABOUT FILE 1 E1=FN1 E2=SIGFN1

LABIN  FILE 2 E1=F   E2=SIGF 
CTYPE  FILE 2 E1=F   E2=Q 
LABOUT FILE 2 E1=FN2 E2=SIGFN2

END
eof-cad

so here the value of the first argument ($1) and the second argument ($2) are used as filenames assigned to HKLIN1 and HKLIN2 by the program "cad". Scaleit, the program that actually calculates the MFID, is not shown but is also part of this script. $0 is the name of the script itself, and it can be useful to print that for debugging purposes. As before I'm using "eof-cad" to delimit the data that is read by the program as input data. For this script the data is constant, but in fact one could incorporate variable values into the program data using the conventional $variable syntax.

More examples could go here if I felt that they would do more good than harm.

Tests for file name existence:

if (-e $1) echo "File $1 exists"
if (! -e $1) echo "File $1 does not exist"

More tests like that could use the test command - see "man test". The ones built into tcsh include: -e (exists); -o (owned by user); -r (readable); -w (writable); -x (executable); -z (zero size); -d (is directory). All these tests fail if the file does not exist.

Tests for numeric values:

if ($a > $b) echo "A ($a) is more than B ($b)"
if ($a == 9) echo "A is equal to 9"
if ($a <= 9) echo "A is less than or equal to 9"

String comparisons:

if ("$q" == "yes") echo "Answer is yes"

If you get to here you are way beyond the point where you should have read the C Shell Field Guide or at least reviewed the WikiBooks on C shell scripting.

File name modifiers

Consider a code snippet like:

 
set file = /Users/phil/mx/UnixShellScripting.html

then you might want to do some work on the filename, perhaps modified, or use the parent directory of the file. You can pull this off using simple shell variable modifiers, which work as "$variable:t" i.e. the variable name followed by a colon and character.

$file:h returns the head, /Users/phil/mx
$file:t returns the tail, UnixShellScripting.html
$file:e returns the extension, html
$file:r returns the root, /Users/phil/mx/UnixShellScripting which now lacks the extension
$file:r:t returns just the filename without extension, UnixShellScripting

Try this for yourself with:

set file = /Users/phil/mx/UnixShellScripting.html
echo head $file:h
echo tail $file:t
echo extn  $file:e
echo root $file:r
echo stub $file:r:t

giving

head /Users/phil/mx
tail UnixShellScripting.html
extn html
root /Users/phil/mx/UnixShellScripting
stub UnixShellScripting

NOTE that there's nothing in this script code that requires the file to exist, so it can be done with arbitrary filename-like strings. I use constructions of the type {$file:r}_new.dat to create new filenames automatically from the previous name with new extensions.

Making the SHELXC script smarter

Recall the SHELXC script I used earlier:

#!/bin/csh -f
# 
# run SHELXC
#
/usr/local/shelx/macosx/shelxc se1 << EOF
HREM se1rm.sca
PEAK se1pk.sca
INFL se1in.sca
LREM se1lo.sca
CELL 35.104  63.496  76.458 90. 90. 90.
SPAG P212121
FIND 8
NTRY 50
EOF
#

It certainly does its job - I use this sort of thing all the time - but it is less automated than it could be. The file names, cell dimensions and space group are hard wired - you have to edit the script each time. The first step toward automation would be to replace some of these with variables. For example if this is a dedicated MAD script, PEAK, INFL and HREM could be replaced by $1, $2 and $3 respectively:

#!/bin/csh -f
# 
# run SHELXC
#
/usr/local/shelx/macosx/shelxc se1 << EOF
PEAK $1
INFL $2
HREM $3
CELL 35.104  63.496  76.458 90. 90. 90.
SPAG P212121
FIND 8
NTRY 50
EOF
#

so you could then invoke it as shelxc.csh se1pk.sca se1in.sca se1rm.sca. You could also test for the existence of each file using the "-e" file test (see above for examples) and you could also test for the number of arguments using the special variable $# which reports the number of command line arguments. You could obviously create a switched script that automatically compensates for 1, 2 and 3 arguments. In this case I've simply added a couple of syntactical tests:

#!/bin/csh -f
# 
# run SHELXC
#
if ($# != 3) then
  echo "Syntax: $0 : peak.sca infl.sca hrem.sca"
  exit 1
endif
if (! -e $1) then
  echo "File $1 does not exist"
  exit 2
endif
if (! -e $2) then
  echo "File $2 does not exist"
  exit 2
endif
if (! -e $3) then
  echo "File $3 does not exist"
  exit 2
endif
#
/usr/local/shelx/macosx/shelxc se1 << EOF
PEAK $1
INFL $2
HREM $3
CELL 35.104  63.496  76.458 90. 90. 90.
SPAG P212121
FIND 8
NTRY 50
EOF
#

So this is the start of making the script smarter. Finally I've attempted to extract the cell dimensions and space group automatically from the first .sca file. We know that the cell dimensions and space group are in the third line of the .sca file. We can grab the line by doing "head -3 $1 | tail -1" but we want to assign this to a variable. This is possible using the backwards quotes `` which when surrounding a Unix shell command - even one containing a pipe as in this case - execute the command and store the result in a variable.

set header = `head -3 $1 | tail -1`

then it remains to extract the values from the line. We can do this via awk where we can print specific fields using the syntax:

echo "a b c d e f" | awk '{print $1,$2,$3,$4,$5,$6}'

The $1...$6 field identifiers are obvious in the awk script, but remember to enclose them in single quotes or the shell will attempt to do variable substitution. We can wrap the header extractor command and the awk commands to get cell dimensions and space group as:

set header = `head -3 $1 | tail -1`
set cell = `echo $header | awk '{print $1,$2,$3,$4,$5,$6;}'`
set space = `echo $header | awk '{print $7;}'`

which uses three sets of `cmd1 | cmd2` to parse the line. There are doubtless more compact ways to do this within awk, but this also facilitates debugging. The script isn't bomb-proof since large cell dimension values will make the number fields adjacent without a space, but for that we'd need a program that can read formally formatted strings. C shell is not really that beast.

#!/bin/csh -f
# 
# run SHELXC
#
# syntax checking
#
if ($# != 3) then
  echo "Syntax: $0 : peak.sca infl.sca hrem.sca"
  exit 1
endif
if (! -e $1) then
  echo "File $1 does not exist"
  exit 2
endif
if (! -e $2) then
  echo "File $2 does not exist"
  exit 2
endif
if (! -e $3) then
  echo "File $3 does not exist"
  exit 2
endif
#
# Extract cell dimensions and space group
#
set header = `head -3 $1 | tail -1`
set cell = `echo $header | awk '{print $1,$2,$3,$4,$5,$6;}'`
set space = `echo $header | awk '{print $7;}'`
#
#  Execute shelxc
#
/usr/local/shelx/macosx/shelxc se1 << EOF
PEAK $1
INFL $2
HREM $3
CELL $cell
SPAG $space
FIND 8
NTRY 50
EOF
#

So we've added the

set value = `command`

syntax, and also simple string manipulation using awk to come up with a more versatile script that can detect the number of files, and pull out cell dimensions and space group. This makes the new script shelxc.csh more useful in an automated setting where you could use it to scan multiple datasets for MAD solution. More embellishments could be added to the script (automatic SAD/MAD switching), etc. But this does give you a flavor of what can be achieved with some simple scripting.

Mathematics

Mathematical calculations in C shell are problematic, with unwieldy syntax and limited functionality. Just use some other script language if you want to do such calculations. C shell cannot do floating point calculations. Read more here if you are some sort of masochist determined to use C shell for simple math. If you're going to do that sort of thing Perl, Tcl or Python are almost certainly better options.

If you're feeling brave:

set nn = 5+4    - $nn has the value '5+4' since everything is a string and math ops are (inconveniently) not implicitly used
@ nn = 5 + 4    - the "@" symbol causes math conversions, so nn is now 9 (put spaces around the operator)
@ nn++          - nn becomes 10
Simple arithmetic expressions using +, -, *, /, ++, -- are allowed.

Debugging

Echo is your main friend here. Since you'll basically be handling variables, just echo the values to the screen. You can also start up the shell in more verbose mode via:

#!/bin/csh -fv

where the "v" means verbose. In tcsh "set echo" and "set verbose" accomplish many of the same things. Be aware that you start to get a lot of output with these options that can take some time to weed through, so a few targeted echo statements are probably your best initial point of attack.
