I would love to give you an authoritative answer, but I can’t, since I don’t know one. I will give you my best guess, though.
The reason is that they are easy to understand and easy to manipulate. Contrary to what people think, programmers want people to understand how to use their tools, and the easiest way to make something understood is to use text. Remember that programs are written as text before the compiler works on them. You need editors to create programs and you need editors to document programs, so why not use the same tools to configure the programs?
Since people communicate in words, why not allow the programs to read text? It is easier to understand what is going on, and it allows many different types of input and output devices.
So, the creators of Unix built many tools to work with text. The first, of course, was an editor. The granddaddy of the editors was ed. The man page for ed starts out saying:
ed is a line-oriented text editor. It is used to create,
display, modify and otherwise manipulate text files. red is a
restricted ed: it can only edit files in the current directory
and cannot execute shell commands.
If invoked with a file argument, then a copy of file is read
into the editor's buffer. Changes are made to this copy and
not directly to file itself. Upon quitting ed, any changes not
explicitly saved with a `w' command are lost.
Editing is done in two distinct modes: command and input.
When first invoked, ed is in command mode. In this mode
commands are read from the standard input and executed to
manipulate the contents of the editor buffer.
OK, does this sound familiar to anyone? I hope so: these are the modes used in vi. Now, you might wonder why anyone would use a line editor. The answer lies in the early terminals. They were basically printers with a keyboard attached. You couldn’t move around on a printed page; the idea was absurd. Now, don’t get me wrong, I do not want to edit files using ed. It can be done, but it is not easy.
Ed included a long list of commands for moving around, finding words, replacing one string with another, etc. The semantics look something like this section of the manual:
(.,.)s/re/replacement/
(.,.)s/re/replacement/g
(.,.)s/re/replacement/n
Replaces text in the addressed lines matching a regular
expression re with replacement. By default, only the
first match in each line is replaced. If the `g'
(global) suffix is given, then every match to be
replaced. The `n' suffix, where n is a postive number,
causes only the nth match to be replaced. It is an
error if no substitutions are performed on any of the
addressed lines. The current address is set the last
line affected.
re and replacement may be delimited by any character
other than space and newline (see the `s' command
below). If one or two of the last delimiters is
omitted, then the last line affected is
printed as though the print suffix `p' were specified.
An unescaped `&' in replacement is replaced by the
currently matched text. The character sequence
`\m', where m is a number in the range [1,9], is
replaced by the mth backreference expression of the
matched text. If replacement consists of a single
`%', then replacement from the last substitution
is used. Newlines may be embedded in replacement if
they are escaped with a backslash (\).
(.,.)s Repeats the last substitution. This form of the `s'
command accepts a count suffix `n', or any com-
bination of the characters `r', `g', and `p'. If a
count suffix `n' is given, then only the nth
match is replaced. The `r' suffix causes the regular
expression of the last search to be used
instead of the that of the last substitution. The `g'
suffix toggles the global suffix of the last
substitution. The `p' suffix toggles the print suffix
of the last substitution The current address
is set to the last line affected.
(.,.)t(.)
Copies (i.e., transfers) the addressed lines to after
the right-hand destination address, which may
be the address 0 (zero). The current address is set to
the last line copied.
u Undoes the last command and restores the current
address to what it was before the command. The
global commands `g', `G', `v', and `V'. are treated
as a single command by undo. `u' is its own inverse.
(1,$)v/re/command-list
Applies command-list to each of the addressed lines not
matching a regular expression re. This is similar to
the `g' command.
(1,$)V/re/
Interactively edits the addressed lines not matching
a regular expression re. This is similar to the `G'
command.
(1,$)w file
Writes the addressed lines to file. Any previous
contents of file is lost without warning. If
there is no default filename, then the default
filename is set to file, otherwise it is unchanged.
If no filename is specified, then the default filename
is used. The current address is unchanged.
Not the easiest commands to remember, but they work. This functionality led to the creation of the line editor ex. Ex is an extremely powerful editor, provided you remember the commands. The next editor to come along was vi. Vi is in fact the visual mode of the ex editor.
Now, ed was never designed to deal with large files. Remember that the original computers did not have a lot of RAM. So they created a stream editor called sed. Sed uses the same commands as ed, but it does not read the entire file into memory at any one time. Instead it works on the file line by line. This allows it to run on a computer with 8 MB of memory and edit a file 100 MB in size.
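As a rough illustration, here is a small sketch that pipes two made-up lines through sed, using the same s/re/replacement/ forms described in the ed manual excerpt. The sample text and the variable names are invented for the example.

```shell
# Two sample lines; sed reads them one line at a time from standard input.
sample='one fish two fish
red fish blue fish'

# With no suffix, only the first match on each line is replaced.
first_only=$(printf '%s\n' "$sample" | sed 's/fish/cat/')

# The g suffix replaces every match on each line.
every_match=$(printf '%s\n' "$sample" | sed 's/fish/cat/g')

# A numeric suffix replaces only the nth match (here, the second).
second_only=$(printf '%s\n' "$sample" | sed 's/fish/cat/2')

printf '%s\n' "$first_only" "$every_match" "$second_only"
```

The input never has to fit in memory as a whole, which is the point of a stream editor.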
All of these editors share a common trait: they use regular expressions, which let them match patterns more complex than literal words.
First of all, let’s recognize that regular expression patterns are one of the things that turn people off Unix. When you see an expression like:
DISPLAY=`who am i -l | perl -ne '/\((.+)\)/ ; print $1'`":0.0"
Most people’s eyes start to cross. But let’s take this apart to see what it really means.
We are using the program who, which displays who is logged on to the computer. The who man page starts with:
/usr/bin/who [ -abdHlmpqrstTu ] [ file ]
/usr/bin/who -q [ -n x ] [ file ]
/usr/bin/who am i
/usr/bin/who am I
The who utility can list the user's name, terminal line,
login time, elapsed time since activity occurred on the
line, and the process-ID of the command interpreter (shell)
for each current UNIX system user.
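We are using the who command with the am i arguments, which limits the output to our own login session.
The output is piped to perl. The pattern /\((.+)\)/ captures whatever sits between the parentheses (for a remote login, typically the name of the host we came from), and print $1 prints that captured text.
Lastly, the backquotes substitute that text into the command line, ":0.0" is appended to it, and we assign the value to the variable DISPLAY.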
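To try the idea without a remote login, here is a minimal sketch that runs the same kind of extraction on a canned line. The line itself, the user name, and the host name are made up, and sed stands in for perl, so treat it as an approximation of the original command rather than a replacement for it.

```shell
# Hypothetical line in the style of `who am i` output for a remote login.
who_line='jmoore     pts/3        Oct 18 11:01 (workstation.example.com)'

# Keep only the text between the parentheses. In sed's basic syntax a
# bare ( is literal, while \( ... \) marks the group referred to by \1.
remote_host=$(printf '%s\n' "$who_line" | sed -n 's/.*(\(.*\)).*/\1/p')

# Append the display and screen numbers, as the original command does.
DISPLAY="${remote_host}:0.0"
printf '%s\n' "$DISPLAY"
# prints workstation.example.com:0.0
```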
Now, isn’t that as clear as mud!?! OK, let’s take a look at a man page describing regular expressions. Source: perldoc perlre
Regular Expressions
The patterns used in Perl pattern matching derive from those
supplied in the Version 8 regex routines. (The routines
are derived (distantly) from Henry Spencer's freely
redistributable reimplementation of the V8 routines.) See
the Version 8 Regular Expressions entry elsewhere in this
document for details.
In particular the following metacharacters have their
standard egrep-ish meanings:
\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line (or before newline at the end)
| Alternation
() Grouping
[] Character class
By default, the "^" character is guaranteed to match only
the beginning of the string, the "$" character only the
end (or before the newline at the end), and Perl does
certain optimizations with the assumption that the string
contains only one line. Embedded newlines will not be
matched by "^" or "$". You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will
match after any newline within the string, and "$" will
match before any newline. At the cost of a little more
overhead, you can do this by using the /m modifier on the
pattern match operator. (Older programs did this by
setting "$*", but this practice is now deprecated.)
To simplify multi-line substitutions, the "." character
never matches a newline unless you use the "/s" modifier,
which in effect tells Perl to pretend the string is a
single line--even if it isn't. The "/s" modifier also
overrides the setting of "$*", in case you have some
(badly behaved) older code that sets it in another module.
The following standard quantifiers are recognized:
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
(If a curly bracket occurs in any other context, it is
treated as a regular character.) The "*" modifier is
equivalent to "{0,}", the "+" modifier to "{1,}", and the
"?" modifier to "{0,1}". n and m are limited to integral
values less than a preset limit defined when perl is
built. This is usually 32766 on the most common
platforms.
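The metacharacters and quantifiers listed above can be tried out with grep -E (the current spelling of egrep), which shares these "egrep-ish" meanings. This is only a sketch: the word list and the match helper are invented for the example, and Perl's own regular expressions support more than is shown here.

```shell
# Sample words, one per line.
words='ct
cat
caat
caaat
dog'

# Hypothetical helper: print the words matching an extended regular
# expression, joined on one line with spaces.
match() {
    printf '%s\n' "$words" | grep -E "$1" | paste -sd ' ' -
}

star=$(match '^ca*t$')         # a repeated 0 or more times: ct cat caat caaat
plus=$(match '^ca+t$')         # a repeated 1 or more times: cat caat caaat
question=$(match '^ca?t$')     # a present 0 or 1 times: ct cat
exact=$(match '^ca{2}t$')      # exactly two a's: caat
at_least=$(match '^ca{2,}t$')  # two or more a's: caat caaat
range=$(match '^ca{1,2}t$')    # one or two a's: cat caat
either=$(match '^(ct|dog)$')   # alternation inside a group: ct dog
any_char=$(match '^d.g$')      # . matches any one character: dog

printf '%s\n' "$star" "$plus" "$question" "$exact" \
    "$at_least" "$range" "$either" "$any_char"
```

The ^ and $ anchors keep each pattern from matching in the middle of a longer word.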
Another commonly used piece is the bracket expression, written with square brackets [ ]. It is defined as follows in the man page for regex. Source: man 7 regex
A bracket expression is a list of characters enclosed in
`[]'. It normally matches any single character from the
list (but see below). If the list begins with `^', it
matches any single character (but see below) not from the
rest of the list. If two characters in the list are sepa-
rated by `-', this is shorthand for the full range of
characters between those two (inclusive) in the collating
sequence, e.g. `[0-9]' in ASCII matches any decimal digit.
It is illegal for two ranges to share an endpoint, e.g.
`a-c-e'. Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
To include a literal `]' in the list, make it the first
character (following a possible `^'). To include a lit-
eral `-', make it the first or last character, or the sec-
ond endpoint of a range. To use a literal `-' as the
first endpoint of a range, enclose it in `[.' and `.]' to
make it a collating element (see below). With the excep-
tion of these and some combinations using `[' (see next
paragraphs), all other special characters, including `\',
lose their special significance within a bracket expres-
sion.
Within a bracket expression, a collating element (a char-
acter, a multi-character sequence that collates as if it
were a single character, or a collating-sequence name for
either) enclosed in `[.' and `.]' stands for the sequence
of characters of that collating element. The sequence is
a single element of the bracket expression's list. A
bracket expression containing a multi-character collating
element can thus match more than one character, e.g. if
the collating sequence includes a `ch' collating element,
then the RE `[[.ch.]]*c' matches the first five characters
of `chchcc'.
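A few of these bracket expression rules can be checked with grep. The sample lines and variable names below are invented for the example, and LC_ALL=C is set on the range examples as a precaution, since the man page warns that ranges depend on the collating sequence.

```shell
# Sample lines to search.
lines='room 101
no digits here
a]b
up-down'

# A range: any line containing a decimal digit.
has_digit=$(printf '%s\n' "$lines" | LC_ALL=C grep '[0-9]')

# A leading ^ negates the list: lines made up only of non-digits.
no_digit=$(printf '%s\n' "$lines" | LC_ALL=C grep '^[^0-9]*$')

# A literal ] must come first in the list.
has_bracket=$(printf '%s\n' "$lines" | grep '[]]')

# A literal - is placed last (or first) so it is not read as a range.
has_hyphen=$(printf '%s\n' "$lines" | grep '[_-]')

printf '%s\n' "$has_digit" "$no_digit" "$has_bracket" "$has_hyphen"
```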
OK, let’s start by looking at definitions for the terms used in regular expressions. Here is a web page which does a good job of defining the terms.
Regular expressions is the term used for a codified method of searching ‘invented’ or ‘defined’ by the American mathematician Stephen Kleene.
What is the definition of regular expressions? Let’s look at a formal definition: regex7
Now that we have played with some of the definitions, let’s get down to seeing how to apply some of these ideas. Here is an article I found on the web about using regular expressions with Web pages.
So What’s A $#!%% Regular Expression, Anyway?!
OK, now that we have seen these uses, let’s try looking at some tutorial information on regular expressions.
This comes from the Rute tutorial, which I highly recommend. 5. Regular Expressions
The Unix system includes many text processing tools. A few of them are diff, aspell, indent, less, cat, cut, sort, cksum, comm, csplit, expand, fmt, fold, head, join, md5sum, nl, od, paste, pr, ptx, sha1sum, split, sum, tac, tail, tr, tsort, unexpand, uniq, wc, sed, awk, egrep, and perl. Let’s take a look at what these tools are good at.
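#
# tac: cat backwards; reverse the order of lines in a file
#
# Prefix each line with its line number, sort on that number in
# descending order, then strip the prefix again.
awk '{print NR "#" $0}' "$@" |
sort -t'#' -k1,1nr |
sed 's/^[0-9]*#//'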
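To watch the pipeline work without a file, the same three stages can be fed a few lines on standard input. This is a small sketch with made-up input, written with the POSIX -k key syntax for sort; the variable name is just for the example.

```shell
# Number the lines, sort by that number in descending order, and strip
# the number again, so the input comes out in reverse order.
reversed=$(printf 'first\nsecond\nthird\n' |
    awk '{print NR "#" $0}' |
    sort -t'#' -k1,1nr |
    sed 's/^[0-9]*#//')

printf '%s\n' "$reversed"
# prints third, second, first on separate lines
```

Each tool does one small job: awk adds the key, sort reorders, and sed cleans up.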
I would like you to take a look at the man pages for egrep, sed, and awk. Create at least two shell scripts for each tool demonstrating how to use it on one of these text files: comic.txt or baseball.txt
Written by John F. Moore
Last Revised: Wed Oct 18 11:01:31 EDT 2017