Text Processing
March 2012
John F. Moore
john.wpcug@lions-wing.net
Revision 1.0, March 2012, JFM
Abstract
In the Unix world there have been a number of systems for creating
documentation. A number of them are still around and doing well.
This talk will explore a few of them to give you an idea how you can
convert words on a page into a presentation document.
Creative Commons
This work is licensed under a
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License
A Book
We are all familiar with the written word, and I would imagine that
all of us have written things on the computer. So the question is: how
do we get from words on a page to some presentation format?
To answer this question, let's look at some of the systems used. For
this talk I am going to ignore word processing programs such as
OpenOffice Writer, AbiWord, KWord, and others. For more
information on this topic have a look at
Linux Word Processing
Man Pages
One of the first programs for formatting text on Unix was roff.
TYPSET and RUNOFF
RUNOFF is a direct predecessor of the runoff document formatting
program of Multics, which in turn was the ancestor of the roff and
nroff document formatting programs of Unix, and their descendants. It
was also the ancestor of FORMAT for the IBM System/360, and of course
indirectly for every computerized word processing system.
A descendant of this program is still in use on Linux systems to
format man pages. Let's have a look at a man page.
For more information on man page structure, here is a good page from
the internet: Man Page Howto
Tip
In one terminal run "man groff", and in another terminal run
"zless /usr/share/man/man1/groff.1.gz" to compare the formatted page
with its source.
WordStar
One of the early word processing programs for CP/M, and later DOS, was
WordStar.
WordStar
WordStar is a word processor application, published by MicroPro
International, originally written for the CP/M operating system but
later ported to DOS, that enjoyed a dominant market share during the
early- to mid-1980s. Although Seymour I. Rubinstein was the principal
owner of the company, Rob Barnaby was the sole author of the early
versions of the program; starting with WordStar 4.0, the program was
built on new code written principally by Peter Mierau.
WordStar was a text-based word processing program, meaning that it
worked with files that were essentially text, with markup
language-like formatting commands (such as the "dot commands"); this
made the files relatively small. By contrast, most word processors
today are code-based, and save their documents in much larger files.
Even though this was never a Linux tool, its method of saving the
data has a lot of similarity to man pages. I include it here because
it used text files with embedded format commands. The difference was
that the interface was WYSIWYG, so there was never a need to run the
files through a separate formatting step.
Here is the specification for the WordStar File Format.
TeX and LaTeX
One of the next systems that I experimented with was the TeX system.
TeX
Together with the Metafont language for font description and the
Computer Modern family of typefaces, TeX was designed with two main
goals in mind: to allow anybody to produce high-quality books using a
reasonable amount of effort, and to provide a system that would give
exactly the same results on all computers, now and in the future.
TeX is a popular means by which to typeset complex mathematical
formulae; it has been noted as one of the most sophisticated digital
typographical systems in the world. TeX is popular in academia,
especially in mathematics, computer science, economics, engineering,
physics, statistics, and quantitative psychology. It has largely
displaced Unix troff, the other favored formatter, in many Unix
installations, which use both for different purposes. It is now also
being used for many other typesetting tasks, especially in the form of
LaTeX and other template packages.
This system takes some time to learn, but the output is often worth
the work, especially if you take the time to use its macro ability.
An example of this power comes from a project I did for my son's grade
school. The idea was to get the kids to write books. So 3rd and 4th
graders would write up a story. Then a parent would take the story
and type it into a template file I provided. I would then take their
file and merge it with a number of macros to produce a book printed
on regular 8 1/2 by 11 paper, in a two-up format. You would fold
the pages in the middle and have an 8 1/2 by 5 1/2 inch
book. Once we wrapped the pages in a cover, the students could take
home a book they wrote themselves and I published for them. Needless to
say, they were thrilled.
Tip
View these files in a terminal: amato.tex, cas-320.ps, cas-320.tex,
tolly.poem.ps, tolly.poem.tex
Markup Language
Now that we have seen how a document can be tagged for how it will
look, let's take a different approach to documents. This time let's see
what happens if we mark up the text for what it is instead of what it
will look like.
Markup Language
A markup language is a modern system for annotating a document in a
way that is syntactically distinguishable from the text. The idea and
terminology evolved from the "marking up" of manuscripts, i.e., the
revision instructions by editors, traditionally written with a blue
pencil on authors' manuscripts. Examples are typesetting instructions
such as those found in troff and LaTeX, or structural markers such as
XML tags. Markup is typically omitted from the version of the text
that is displayed for end-user consumption. Some markup languages,
such as HTML, have presentation semantics, meaning that their
specification prescribes how the structured data are to be presented,
but other markup languages, like XML, have no predefined semantics.
A well-known example of a markup language in widespread use today is
HyperText Markup Language (HTML), one of the document formats of the
World Wide Web. HTML, which is an instance of SGML (though, strictly,
it does not comply with all the rules of SGML), follows many of the
markup conventions used in the publishing industry in the
communication of printed work between authors, editors, and printers.
HTML
The first common use of a markup language for most people is web
pages. Let's take a look at some tutorials on creating web pages.
There are hundreds of web sites that want to teach you about writing
HTML. Here are a few that are among the best, at least in my opinion:
Web page creations for Teachers. WRITING HTML is one of the oldest
tutorials I know, but I think it is still one of the best.
Tizag's Tutorials is another old-timer web page tutorial.
Beginner's Web Site Creating Guide guides you with little to distract
you. Although this site has not been updated since 2008, it is still
good.
W3Schools is a more recent addition to tutorials.
Learn to Create Websites has tutorials on many web and non-web
technologies related to markup languages.
Let's discuss markup in more detail. HTML was the first time many
people started using markup to indicate what something was as opposed
to how it looks. In HTML we mark things with opening and closing
tags. The closing tag repeats the name of the opening tag with a
slash / before it, so <P> is closed by </P>. Here are some of the
opening tags. All of these have closing tags except where marked.
<TITLE> (Title)
<H1> (Header Level 1)
<H2> (Header Level 2)
<P> (Paragraph)
<UL> (Bullet List)
<OL> (Ordered List)
<DL> (Definition List)
<LI> (List Item)
<A HREF="URL"></A> (Link to a web page)
<IMG SRC="URL"> (Link to an image, no closing tag)
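The opening and closing tag pairing can be checked mechanically. Here is a minimal sketch using the HTML parser from Python's standard library; the sample page, the TagLogger class, and the unclosed_tags function are made up for illustration and are not part of the talk's files.

```python
from html.parser import HTMLParser

# Hypothetical sample page built from the tags listed above.
SAMPLE = """<TITLE>My Page</TITLE>
<H1>Welcome</H1>
<P>Some text with a <A HREF="https://example.org/">link</A>.</P>
<UL>
<LI>First item</LI>
<LI>Second item</LI>
</UL>
<IMG SRC="photo.png">
"""


class TagLogger(HTMLParser):
    """Record every opening and closing tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        # The parser reports tag names in lowercase.
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


def unclosed_tags(html):
    """Return the names of tags that were opened but never closed."""
    parser = TagLogger()
    parser.feed(html)
    parser.close()
    return sorted(set(parser.opened) - set(parser.closed))


print(unclosed_tags(SAMPLE))
```

For this sample the only tag reported as unclosed is img, which matches the list above.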
XML
XML is a newer markup language. It was designed to allow
documents to be shared and understood between computers. Here is how
Wikipedia defines XML.
XML
Extensible Markup Language (XML) is a markup language that defines a
set of rules for encoding documents in a format that is both
human-readable and machine-readable. It is defined in the XML 1.0
Specification produced by the W3C, and several other related
specifications, all gratis open standards.
The design goals of XML emphasize simplicity, generality, and
usability over the Internet.[7] It is a textual data format with
strong support via Unicode for the languages of the world. Although
the design of XML focuses on documents, it is widely used for the
representation of arbitrary data structures, for example in web
services.
XML markup can be used to exchange records between different computers
and even different operating systems. One of its strengths is that
you can create a DTD (Document Type Definition) that defines a set of
tags and the rules for using them. Once a DTD exists it can be used
to create documents which can be easily understood on both ends of a
transfer. XML documents are often used in manufacturing automation
where a manufacturer and a parts supplier exchange information between
computers. The XML documents serve as the transfer document so two
totally different systems can understand what each other is saying
without a custom interface.
One group created an XML schema for document publishing called
DocBook.
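To make the exchange idea concrete, here is a minimal sketch in Python of a sender building a parts order and a receiver reading it back. The order and item tags, the supplier and part attributes, and both function names are invented for this example; a real exchange would use the tags its DTD defines. Note that the standard library parser used here checks that the XML is well formed but does not validate it against a DTD.

```python
import xml.etree.ElementTree as ET


def build_order(supplier, items):
    """Sender side: turn (part number, quantity) pairs into XML text."""
    order = ET.Element("order", {"supplier": supplier})
    for part_number, quantity in items:
        item = ET.SubElement(order, "item", {"part": part_number})
        item.text = str(quantity)
    return ET.tostring(order, encoding="unicode")


def read_order(xml_text):
    """Receiver side: recover the supplier and the list of items."""
    order = ET.fromstring(xml_text)
    items = [(item.get("part"), int(item.text)) for item in order.findall("item")]
    return order.get("supplier"), items


# Hypothetical order passed from one system to the other as plain text.
message = build_order("Acme", [("A-100", 25), ("B-220", 4)])
print(message)
print(read_order(message))
```

Because the message is plain text with agreed tags, the two sides could be written in different languages on different operating systems and still understand each other.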
Docbook
What is Docbook
DocBook is a schema (available in several languages including RELAX
NG, SGML and XML DTDs, and W3C XML Schema) maintained by the DocBook
Technical Committee of OASIS. It is particularly well suited to books
and papers about computer hardware and software (though it is by no
means limited to these applications).
I found DocBook useful and have used it for several years to write
these presentations. Let's take a look at a presentation and the
source for the presentation.
First let's look at the source code: WhyNot Source.
Once this is typed into the computer, it is converted into
XHTML using a script and then looks like this:
WhyNot Presentation
Asciidoc
This last markup is the one I am using for this talk. It has many
similarities to the markup used by Wikipedia.
Asciidoc Introduction
AsciiDoc is a text document format for writing notes, documentation,
articles, books, ebooks, slideshows, web pages, man pages and
blogs. AsciiDoc files can be translated to many formats including
HTML, PDF, EPUB, man page.
AsciiDoc is highly configurable: both the AsciiDoc source file syntax
and the backend output markups (which can be almost any type of
SGML/XML markup) can be customized and extended by the user.
The interesting thing about AsciiDoc is the number of formats it can
be converted into. It makes the work of typing up a talk like this
fairly straightforward. The downside of this format is that it
moves back to presentation markup as opposed to content markup.
For an example of AsciiDoc, let's use their website.
As a comparison, let's look at the markup used in the
Wikipedia Main Page
Presenting and Transforming Content
We have discussed two types of markup in this talk: Descriptive markup
and Presentational markup. Which one fits depends on how the
information will be used. The argument goes back to
the early computer documentation systems. Examples of Presentational
markup come from roff typesetting and TeX markup. Examples of
Descriptive markup come from IBM BookMaster and SGML.
Presentational markup aims to define how the output is going to be
presented: which font, what line spacing, where to highlight, etc.
This type of markup is commonly used in individual documents or web
pages. Its use in books is well known and was in fact the basis for
some of the original computer markup.
Descriptive markup aims to define what the content of the document
is: author name, abstract, section, block quote, reference, etc. The
purpose of this type of markup is to make it obvious to the computer
what the words represent. The computer can then use programs to
convert this markup into presentation markup for output to different
devices. Additionally, Descriptive markup allows computers to create
indexes and search capabilities over a large number of documents.
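Here is a minimal sketch in Python of both uses of Descriptive markup: converting it to presentation markup, and pulling out one kind of content for an index. The article, title, author, abstract, and para tags are loosely modeled on DocBook but are assumptions for this example, as are the rule table and the function names.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical document marked up for what things are, not how they look.
DOC = """<article>
  <title>Text Processing</title>
  <author>John F. Moore</author>
  <abstract>A tour of documentation systems.</abstract>
  <para>We are all familiar with the written word.</para>
</article>"""

# One possible presentation: which HTML tag displays each kind of content.
# A different table could target a different output device.
HTML_RULES = {"title": "h1", "author": "address", "abstract": "blockquote", "para": "p"}


def to_html(xml_text, rules=HTML_RULES):
    """Convert descriptive markup into presentation markup (HTML)."""
    root = ET.fromstring(xml_text)
    lines = []
    for child in root:
        tag = rules.get(child.tag, "div")  # unknown content falls back to a plain block
        lines.append(f"<{tag}>{escape(child.text or '')}</{tag}>")
    return "\n".join(lines)


def find_field(xml_text, name):
    """Collect every piece of content marked with the given tag, e.g. for an index."""
    return [element.text for element in ET.fromstring(xml_text).iter(name)]


print(to_html(DOC))
print(find_field(DOC, "author"))
```

The same input feeds both functions. Nothing in the document itself says how a title should look, which is what lets the presentation rules change without touching the text.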
Here is an interesting article I came across which discusses markup:
Introducing Markup Languages.
This was implemented using SCRIPT for IBM.
Another discussion of using computers to document knowledge is
presented in
Wiki: A Systems Programming Productivity Tool
An interesting crossover language covering both types of markup might
be AsciiDoc, described above, since it seems to use Presentational
markup but includes tools to convert to Descriptive markup.
Example of AsciiDoc Processing
Here is what I was able to produce from this same
input file using different processing steps.
The original file for this talk is text.adoc.
To create this web page I used the command 'asciidoc -a
data-uri -a icons -a toc -a max-width=55em text.adoc', which produced
this version of the HTML document.
To create a DocBook version of this file I used the command 'asciidoc
-f /etc/asciidoc/docbook.conf --backend=docbook -o text.xml
text.adoc', which produced the file
text-docbook.xml. I then converted the
output using a script I have named 'process.sh text-docbook'. This
processing, when combined with the CSS style sheet style-ob.css,
produced this version: text-docbook.html.
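If you run these conversions often, the two asciidoc commands can be kept in a small script. Here is a minimal sketch in Python that only builds and prints the command lines quoted above; the function names are made up, and the process.sh step is left out because its contents are not shown here.

```python
import shlex


def html_command(source):
    """Arguments for the HTML version, copied from the command quoted above."""
    return ["asciidoc", "-a", "data-uri", "-a", "icons", "-a", "toc",
            "-a", "max-width=55em", source]


def docbook_command(source, output):
    """Arguments for the DocBook version, copied from the command quoted above."""
    return ["asciidoc", "-f", "/etc/asciidoc/docbook.conf",
            "--backend=docbook", "-o", output, source]


# Print each command as it would be typed at a shell prompt.
for command in (html_command("text.adoc"), docbook_command("text.adoc", "text.xml")):
    print(shlex.join(command))
```

To actually run a command, each list can be handed to subprocess.run(command, check=True) on a machine where asciidoc is installed.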