Regular Expressions

Objectives

  • Why does Unix use Text files?
  • Regular Expressions
  • Regular Expression Definitions
  • Regular Expression Usage
  • Text processing tools
  • Homework

Why does Unix use Text files?

I would love to give you an authoritative answer, but I don’t know of one, so I will give you my best guess.

The reason is that text files are easy to understand and easy to manipulate. Contrary to what people think, programmers want people to understand how to use their tools, and the easiest way to make something understood is to use text. Remember that programs are written as text before the compiler works on them. You need editors to create programs and to document programs, so why not use the same tools to configure them?

Since people communicate in words, why not allow programs to read text? It is easier to understand what is going on, and it allows many different types of input and output devices.

So the creators of Unix built many tools to work with text. The first, of course, was an editor. The granddaddy of the editors was ed. The man page for ed starts out saying:

       ed  is  a  line-oriented  text  editor.  It is used to create,
       display, modify and otherwise manipulate text files.  red is a
       restricted ed: it can only edit files in the current directory
       and  cannot  execute  shell commands.

       If  invoked with a file argument, then a copy of file is read
       into the editor's buffer.  Changes are made to this copy and
       not directly to file itself.  Upon quitting ed, any changes not
       explicitly saved  with  a  `w' command are lost.

       Editing  is  done  in two distinct modes: command and input.
       When first invoked, ed is in command mode.  In this mode
       commands are read from the standard input and executed to
       manipulate the contents  of  the  editor buffer.  

OK, does this sound familiar to anyone? I hope so: these are the modes used in vi. Now, you might wonder why anyone would use a line editor. The answer lies in the early terminals, which were basically printers with a keyboard attached. You didn’t move around on the printed page; that would have been absurd. Now, don’t get me wrong, I do not want to edit files using ed. It can be done, but it is not easy.

Ed includes a long list of commands for moving around, finding words, replacing one string with another, and so on. The syntax looks something like this section of the manual:

       (.,.)s/re/replacement/
       (.,.)s/re/replacement/g
       (.,.)s/re/replacement/n
               Replaces text in the addressed lines matching a regular
               expression re with replacement.  By default, only the
               first match in each line is replaced.  If the `g'
               (global) suffix is given, then every match is
               replaced.  The `n' suffix, where n is a positive number,
               causes only the nth match to be replaced.  It is an
               error if no substitutions are performed on any of the
               addressed lines.  The current address is set to the last
               line affected.

               re  and replacement may be delimited by any character
               other than space and newline (see the `s' command
               below).  If one or two of the last delimiters is
               omitted,  then  the  last  line  affected  is 
               printed as though the print suffix `p' were specified.


               An  unescaped  `&' in replacement is replaced by the
               currently matched text.  The character sequence
               `\m', where m is a number in the range [1,9], is
               replaced by the mth backreference expression of the
               matched  text.  If replacement consists of a single
               `%', then replacement from the last substitution
               is used.  Newlines may be embedded in replacement if
               they are escaped with a backslash (\).

       (.,.)s  Repeats the last substitution.  This form of the `s'
               command accepts a count suffix `n', or any com-
               bination  of  the  characters  `r', `g', and `p'.  If a
               count suffix `n' is given, then only the nth
               match is replaced.  The `r' suffix causes the regular
               expression of  the  last  search  to  be  used
               instead of that of the last substitution.  The `g'
               suffix toggles the global suffix of the last
               substitution.  The `p' suffix toggles the print suffix
               of the last substitution.  The current address
               is set to the last line affected.

       (.,.)t(.)
               Copies  (i.e., transfers) the addressed lines to after
               the right-hand destination address, which may
               be the address 0 (zero).  The current address is set to
               the last line copied.

       u       Undoes the last command and restores the current
               address to what it was before the command.  The
               global commands `g', `G', `v', and `V' are treated
               as a single command by undo.  `u' is its own inverse.

       (1,$)v/re/command-list
               Applies command-list to each of the addressed lines not
               matching a regular expression re.   This  is similar to
               the `g' command. 

       (1,$)V/re/
               Interactively  edits  the  addressed lines not matching
               a regular expression re.  This is similar to the `G'
               command. 

       (1,$)w file
               Writes the addressed lines to file.  Any previous
               contents of file  is  lost  without  warning.   If
               there  is  no default filename, then the default
               filename is set to file, otherwise it is unchanged.
               If no filename is specified, then the default filename
               is used.  The current address is unchanged.

These are not the easiest commands to remember, but they work. This functionality led to the creation of the line editor ex. Ex is an extremely powerful editor, provided you remember the commands. The next editor to come along was vi, which is in fact the visual mode of the ex editor.

Now, ed was never designed to deal with large files. Remember that the original computers did not have a lot of RAM. So a stream editor called sed was created. Sed uses the same commands as ed, but it does not read the entire file into memory at once. Instead, it works on the file line by line. This allows it to run on a computer with 8 MB of memory and edit a file 100 MB in size.
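
To make the substitute command concrete, here is a minimal sketch using sed, which accepts the same s/re/replacement/ form shown in the ed manual excerpt above. The function names and the sample text are made up for illustration; only standard sed behavior is assumed.

```shell
#!/bin/sh
# Hypothetical helper names; each reads standard input and writes standard output.

# Default: only the first match on each line is replaced.
first_only()  { sed 's/cat/dog/'; }

# The g (global) suffix replaces every match on the line.
every_match() { sed 's/cat/dog/g'; }

# A numeric suffix replaces only the nth match (here the 2nd).
second_only() { sed 's/cat/dog/2'; }

# An unescaped & in the replacement stands for the matched text.
mark_match()  { sed 's/cat/[&]/g'; }

printf 'cat cat cat\n' | first_only    # dog cat cat
printf 'cat cat cat\n' | every_match   # dog dog dog
printf 'cat cat cat\n' | second_only   # cat dog cat
printf 'cat cat cat\n' | mark_match    # [cat] [cat] [cat]
```

Because sed handles one line at a time, each of these would behave the same way on a file far larger than the machine's memory.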

All of these editors share a common trait: they use regular expressions, which let them match patterns of text rather than only fixed words.

Regular Expressions

First of all, let’s recognize that regular expression patterns are one of the things that turn people off Unix. When you see an expression like:

    DISPLAY=`who am i -l | perl -ne '/\((.+)\)/ ; print $1'`":0.0"

Most people’s eyes start to cross. But let’s take this apart to see what it really means.

  1. We are using the program who, which displays who is logged on to the computer. The who man page starts with:

                /usr/bin/who [ -abdHlmpqrstTu ] [ file ]
                /usr/bin/who -q [ -n x ] [ file ]
                /usr/bin/who am i
                /usr/bin/who am I
    
         The who utility can list the  user's  name,  terminal  line,
         login  time,  elapsed  time  since  activity occurred on the
         line, and the process-ID of the command interpreter  (shell)
         for   each  current  UNIX  system  user.   

    We are using the who command with the am i arguments, which limits the output to the current user’s own login.

  2. Next, the output of the who command is piped into perl to do some string handling. The expression /\((.+)\)/ grabs the information within the parentheses. Or, in somewhat more verbose English:
    • The beginning and ending / start and end the regular expression.
    • Then we look for a single opening parenthesis with \(.
    • Then we look for one or more characters with (.+). The unescaped parentheses form a group, and perl saves whatever the group matched in $1.
    • Lastly, we look for a closing parenthesis with \).
  3. The captured string is then printed with the statement print $1.
  4. This is enclosed in backticks ` to cause it to be executed and the output saved as a value.
  5. We then append the string :0.0 to the end of that value.
  6. Lastly, we assign the result to the variable DISPLAY.
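
The same steps can be sketched without perl. This is a rough equivalent using sed, not the one-liner above: the function name extract_host, the variable display_value, and the sample who line are all invented for illustration, and real who output varies between systems.

```shell
#!/bin/sh
# extract_host (hypothetical name): print the text between the first "("
# and the last ")" on each input line, like $1 in the perl version.
# In sed's basic syntax a bare ( is literal and \( ... \) marks the group,
# which is the reverse of perl. "..*" means one or more characters.
extract_host() {
    sed -n 's/[^(]*(\(..*\)).*/\1/p'
}

# Made-up sample of what "who am i" might print for a remote login.
sample='jmoore     pts/3        Oct 18 10:58 (workstation.example.com)'

# $( ... ) does the same job as the backticks: run the command, keep its output.
display_value="$(printf '%s\n' "$sample" | extract_host):0.0"

echo "$display_value"   # workstation.example.com:0.0
```

The value is stored in display_value rather than DISPLAY so that trying the sketch does not disturb a running X session.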

Now, isn’t that as clear as mud!?! OK, let’s take a look at a man page describing regular expressions. Source: perldoc perlre

       Regular Expressions

       The patterns used in Perl pattern matching derive from those
       supplied in the Version 8 regex routines.  (The routines
       are derived (distantly) from Henry Spencer's freely
       redistributable reimplementation of the V8 routines.)  See
       the Version 8 Regular Expressions entry elsewhere in this
       document for details.

       In particular the following metacharacters have their
       standard egrep-ish meanings:

           \   Quote the next metacharacter
           ^   Match the beginning of the line
           .   Match any character (except newline)
           $   Match the end of the line (or before newline at the end)
           |   Alternation
           ()  Grouping
           []  Character class

       By default, the "^" character is guaranteed to match only
       the beginning of the string, the "$" character only the
       end (or before the newline at the end), and Perl does
       certain optimizations with the assumption that the string
       contains only one line.  Embedded newlines will not be
       matched by "^" or "$".  You may, however, wish to treat a
       string as a multi-line buffer, such that the "^" will
       match after any newline within the string, and "$" will
       match before any newline.  At the cost of a little more
       overhead, you can do this by using the /m modifier on the
       pattern match operator.  (Older programs did this by
       setting "$*", but this practice is now deprecated.)

       To simplify multi-line substitutions, the "." character
       never matches a newline unless you use the "/s" modifier,
       which in effect tells Perl to pretend the string is a
       single line--even if it isn't.  The "/s" modifier also
       overrides the setting of "$*", in case you have some
       (badly behaved) older code that sets it in another module.

       The following standard quantifiers are recognized:

           *      Match 0 or more times
           +      Match 1 or more times
           ?      Match 1 or 0 times
           {n}    Match exactly n times
           {n,}   Match at least n times
           {n,m}  Match at least n but not more than m times

       (If a curly bracket occurs in any other context, it is
       treated as a regular character.)  The "*" modifier is
       equivalent to "{0,}", the "+" modifier to "{1,}", and the
       "?" modifier to "{0,1}".  n and m are limited to integral
       values less than a preset limit defined when perl is
       built.  This is usually 32766 on the most common
       platforms.  
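
A quick way to get a feel for these metacharacters and quantifiers is to try them. The sketch below uses grep -E, whose extended syntax shares the egrep-style operators listed above, rather than perl itself; the helper names matches and report and the sample strings are assumptions for illustration.

```shell
#!/bin/sh
# matches PATTERN STRING (hypothetical helper): succeed if STRING matches PATTERN.
matches() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

# report PATTERN STRING (hypothetical helper): print whether the string matches.
report() {
    if matches "$1" "$2"; then
        echo "'$2' matches $1"
    else
        echo "'$2' does not match $1"
    fi
}

report '^colou?r$'  'color'    # ? makes the u optional
report '^colou?r$'  'colour'
report '^a.c$'      'abc'      # . matches any one character
report '^a.c$'      'ac'       # but it must match something
report '^(ab)+$'    'ababab'   # () groups, + repeats the group
report 'cat|dog'    'hotdog'   # | is alternation
report '^a{2,3}$'   'aaaa'     # {n,m} bounds the repeat count
report '^a\.c$'     'abc'      # \ quotes the metacharacter, so only a literal . fits
```

Anchoring each pattern with ^ and $ makes the quantifier examples strict; without the anchors, a{2,3} would still be found inside aaaa.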

Another commonly used piece is the pair of square brackets [ ]. They are defined as follows in the man page for regex. Source: man 7 regex

       A  bracket  expression is a list of characters enclosed in
       `[]'.  It normally matches any single character  from  the
       list  (but  see  below).   If the list begins with `^', it
       matches any single character (but see below) not from  the
       rest of the list.  If two characters in the list are sepa-
       rated by `-', this is shorthand  for  the  full  range  of
       characters  between those two (inclusive) in the collating
       sequence, e.g. `[0-9]' in ASCII matches any decimal digit.
       It  is  illegal  for two ranges to share an endpoint, e.g.
       `a-c-e'.  Ranges  are  very  collating-sequence-dependent,
       and portable programs should avoid relying on them.

       To  include  a  literal `]' in the list, make it the first
       character (following a possible `^').  To include  a  lit-
       eral `-', make it the first or last character, or the sec-
       ond endpoint of a range.  To use  a  literal  `-'  as  the
       first  endpoint of a range, enclose it in `[.' and `.]' to
       make it a collating element (see below).  With the  excep-
       tion  of  these  and some combinations using `[' (see next
       paragraphs), all other special characters, including  `\',
       lose  their  special significance within a bracket expres-
       sion.

       Within a bracket expression, a collating element (a  char-
       acter,  a  multi-character sequence that collates as if it
       were a single character, or a collating-sequence name  for
       either)  enclosed in `[.' and `.]' stands for the sequence
       of characters of that collating element.  The sequence  is
       a  single  element  of  the  bracket expression's list.  A
       bracket expression containing a multi-character  collating
       element  can  thus  match more than one character, e.g. if
       the collating sequence includes a `ch' collating  element,
       then the RE `[[.ch.]]*c' matches the first five characters
       of `chchcc'.
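
Bracket expressions are easier to absorb with a few concrete cases. The sketch below uses sed to delete or keep characters; the function names are invented for illustration, and LC_ALL=C is set so that the ranges mean plain ASCII ranges, since the manual warns that ranges depend on the collating sequence.

```shell
#!/bin/sh
# Use the C locale so [0-9] and [a-z] mean the plain ASCII ranges.
LC_ALL=C
export LC_ALL

# [0-9] matches any single decimal digit.
strip_digits()   { sed 's/[0-9]//g'; }

# A leading ^ inverts the list: anything that is not a digit.
keep_digits()    { sed 's/[^0-9]//g'; }

# A literal ] must come first in the list; [ needs no special treatment.
strip_brackets() { sed 's/[][]//g'; }

# A literal - goes last so it is not read as a range.
strip_lower_and_dash() { sed 's/[a-z-]//g'; }

printf 'a1b22c333\n' | strip_digits           # abc
printf 'a1b22c333\n' | keep_digits            # 122333
printf 'x[1]y\n'     | strip_brackets         # x1y
printf 'ab-CD_ef\n'  | strip_lower_and_dash   # CD_
```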

Regular Expression Definitions

OK, let’s start by looking at definitions for the terms used in regular expressions. Here is a web page which does a good job of defining the terms.

Regular expressions is the term used for a codified method of searching ‘invented’ or ‘defined’ by the American mathematician Stephen Kleene.

What is the definition of a regular expression? Let’s look at a formal definition: regex7

Regular Expression Usage

Now that we have played with some of the definitions, let’s get down to seeing how to apply some of these ideas. Here is an article I found on the web about using regular expressions with web pages.

So What’s A $#!%% Regular Expression, Anyway?!

OK, now that we have seen these uses, let’s try looking at some tutorial information on regular expressions.

This comes from the Rute tutorial, which I highly recommend: 5. Regular Expressions

Text processing tools

The Unix system includes many text processing tools. A few of them are diff, aspell, indent, less, cat, cut, sort, cksum, comm, csplit, expand, fmt, fold, head, join, md5sum, nl, od, paste, pr, ptx, sha1sum, split, sum, tac, tail, tr, tsort, unexpand, uniq, wc, sed, awk, egrep, and perl. Let’s take a look at what these tools are good at.

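#
# tac: cat backwards; reverse the order of lines in a file
#
# awk prefixes each line with its line number and a '#',
# sort orders on that number, highest first,
# and sed strips the prefix again.
awk '{print NR "#" $0}' "$@" |
sort -t'#' -k1,1nr |
sed 's/^[0-9]*#//'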
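
As one more illustration of combining these tools, here is a sketch of a word-frequency counter built from tr, sed, sort, uniq, and awk. It is not taken from the material above; the function name word_freq and the sample sentence are assumptions for illustration.

```shell
#!/bin/sh
# word_freq (hypothetical name): read text on standard input and print
# "count word" lines, most frequent word first.
word_freq() {
    tr -cs 'A-Za-z' '\n' |      # turn every run of non-letters into one newline
    tr 'A-Z' 'a-z' |            # fold upper case to lower case
    sed '/^$/d' |               # drop the empty line a leading non-letter would leave
    sort |                      # group identical words together
    uniq -c |                   # collapse each group to "count word"
    sort -k1,1nr -k2 |          # highest count first, ties in alphabetical order
    awk '{print $1, $2}'        # normalize the spacing uniq adds
}

printf 'The cat saw the dog. The dog ran!\n' | word_freq
```

Each stage does one small job and passes plain text to the next, which is the style of tool the list above is built around.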

Homework

I would like you to take a look at the man pages for egrep, sed, and awk. Create at least two shell scripts for each tool demonstrating how to use it on one of these text files: comic.txt or baseball.txt



Written by John F. Moore

Last Revised: Wed Oct 18 11:01:31 EDT 2017

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.