Abstract

Text documents, Word documents, PDF files, Markup documents, Markdown documents, Web pages, *.docx, *.html, *.odt, *.tex: confused yet? On a computer we have many types of documents for expressing thoughts, ideas, or just letters. Let's look at some methods of marking up documents.

Now that we know what types of documents we are creating, how can we use them in a wiki or a web browser?

File Types

We spend a lot of time on the computer reading documents, when we are not watching videos or listening to music. But have you ever wondered what the files contain? We take for granted that when we double click on a file the computer will open the document in the correct application to view or edit it.

We are going to focus on documents for reading tonight. We are going to look at the underlying formats of the documents.

For a previous version of this talk, look at Text Processing.

Text Document

Text Documents

This is the oldest type of document on the computer. The numbers representing characters are defined by the ASCII (American Standard Code for Information Interchange) standard.

ASCII abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Most modern character-encoding schemes are based on ASCII, although they support many additional characters. ASCII

This was a standard to allow numbers to represent letters, numbers, and control characters on a computer. We can see the ASCII Codes Table here.
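To make the idea of "numbers representing characters" concrete, here is a minimal Python sketch. The sample string is arbitrary; `ord`, `chr`, and `str.encode` are standard Python built-ins.

```python
# Show the ASCII number behind each character of a short string.
text = "Hi!"

codes = [ord(ch) for ch in text]          # character -> number
print(codes)                              # [72, 105, 33]

# Going the other way: numbers -> characters.
print("".join(chr(n) for n in codes))     # Hi!

# Control characters are ASCII codes too: 9 is tab, 10 is line feed.
print(ord("\t"), ord("\n"))               # 9 10

# A plain text file is just these numbers stored one byte per character.
raw = text.encode("ascii")
print(raw)                                # b'Hi!'
```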

Under Windows you can create a text file with the Notepad editor or any other text or programming editor. The upside of text files is that they are readable on any computer, although the line endings differ between Windows, Mac, and Linux/Unix. The downside is that the characters have no formatting, no font resizing, nothing to make the documents more interesting.
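The line-ending difference can be seen by looking at the raw bytes of a file. The sketch below is only an illustration: the function name `detect_line_ending` and the sample byte strings are made up for this example.

```python
def detect_line_ending(data: bytes) -> str:
    """Return a label for the line-ending convention found in raw file bytes."""
    if b"\r\n" in data:
        return "CRLF (Windows)"
    if b"\r" in data:
        return "CR (classic Mac OS)"
    if b"\n" in data:
        return "LF (Linux/Unix, modern macOS)"
    return "no line endings"


# Two hypothetical files with the same text but different line endings.
windows_text = b"first line\r\nsecond line\r\n"
unix_text = b"first line\nsecond line\n"

print(detect_line_ending(windows_text))
print(detect_line_ending(unix_text))

# Converting Windows line endings to Unix ones is a simple byte replacement.
converted = windows_text.replace(b"\r\n", b"\n")
print(converted == unix_text)
```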

Formatted Documents

Now let's look at methods of defining the formatting of a document. Formatting means that we tell the document what we want it to look like.

I will review a few formatting methods. I am not going to discuss how to use the tools to write the documents; I am focusing on the formatting codes embedded in the documents.

This is not meant to be an all-inclusive list, only the ones that I think are demonstrative.

Man Docs

This format, known as RUNOFF or roff, was used on Unix systems to create the documentation for commands. It was one of the earliest formatting methods. The idea was to embed commands in the file to specify the formatting.

The files are basically ASCII files with commands beginning with a period in column one. As an example of what they look like, have a look at Man Page Template and Example, or How should a formatted man page look?
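As a rough illustration of "commands beginning with a period in column one", the Python sketch below separates command lines from text lines. The sample is a hand-written fragment in the style of a man page source, not taken from the linked pages, and `split_roff` is a hypothetical helper name.

```python
# A small hand-written fragment in the style of a man page source file.
# .TH, .SH and .B are requests from the traditional "man" macro set.
SAMPLE = """.TH HELLO 1
.SH NAME
hello \\- print a friendly greeting
.SH SYNOPSIS
.B hello
[name]
"""


def split_roff(source):
    """Separate formatting commands (period in column one) from plain text."""
    requests, text = [], []
    for line in source.splitlines():
        if line.startswith("."):
            requests.append(line)
        else:
            text.append(line)
    return requests, text


requests, text = split_roff(SAMPLE)
print("commands:", requests)
print("text:    ", text)
```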

RTF

Introduced and documented by Microsoft, the Rich Text Format (RTF) represents a method of encoding formatted text and graphics for use within applications. What is a RTF file

I am not going to attempt to explain the format, but if you want more details let me refer you to Wikipedia Rich Text Format - Code Syntax.

Wordstar

Wordstar

WordStar is a word processor application for microcomputers. It dominated the market in the early and mid-1980s, succeeding the market leader Electric Pencil. It was published by MicroPro International, originally written for the CP/M-80 operating system, and later written also for MS-DOS and other 16-bit PC OSes. WordStar

WordStar used the upper (8-bit) ASCII values, those above 127, to define the formatting. For those interested, you can find a definition at wordstar file format release 7.0.

Wordperfect

WordPerfect (WP) is a word processing application, now owned by Corel, with a long history on multiple personal computer platforms. At the height of its popularity in the 1980s and early 1990s, it was the dominant player in the word processor market, displacing the prior market leader WordStar. WordPerfect

This was a favorite word processor for legal offices for many years due to some special features. One feature that was prized was its Reveal Codes feature.

Reveal Codes

Present since the earliest versions of WordPerfect, the Reveal Codes feature distinguishes it from other word processors; Microsoft Word’s equivalent is much less powerful. It displays and allows editing the codes, reduces retyping, and enables easy formatting changes. It is a second editing screen that can be toggled open and closed, and sized as desired.

The codes for formatting and locating text are displayed, interspersed with tags and the occasional objects, with the tags and objects represented by named tokens. This provides a more detailed view to troubleshoot problems than with styles-based word processors, and object tokens can be clicked with a pointing device to directly open the configuration editor for the particular object type, e.g. clicking on a style token brings up the style editor with the particular style type displayed. WordPerfect had this feature already in its DOS incarnations. Reveal codes

DOCX

Office Open XML (also informally known as OOXML) is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents. The format was initially standardized by the Ecma (as ECMA-376), and by the ISO and IEC (as ISO/IEC 29500) in later versions.

Microsoft Office 2010 provides read support for ECMA-376, read/write support for ISO/IEC 29500 Transitional, and read support for ISO/IEC 29500 Strict. Microsoft Office 2013 and Microsoft Office 2016 additionally support both reading and writing of ISO/IEC 29500 Strict. While Office 2013 and onward have full read/write support for ISO/IEC 29500 Strict, Office Open XML

This format can be read and written by both Open Office and Microsoft Word. This type of formatting is common to many users in the Windows and Mac world. This format is based on the Office Open XML standard.

The docx file is actually a zip archive of a number of XML files. To see what this means have a look at Sample.docx.archive which is what a docx file looks like when you use unzip -l on it. Inside you will find a file named document.xml which contains the actual text from the document. Here is the sample.docx for comparison.
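The same inspection can be sketched in Python with the standard `zipfile` module. To stay self-contained, the example builds a tiny stand-in archive in memory instead of opening Sample.docx; the XML inside is a simplified placeholder, and in a real docx file document.xml sits inside a folder named word/.

```python
import io
import zipfile

# Build a tiny stand-in archive in memory. A real .docx has more parts
# (styles, relationships, and so on); the XML bodies here are simplified
# placeholders, not valid Office Open XML.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr(
        "word/document.xml",
        "<document><p>Hello from inside the archive</p></document>",
    )

# Reading it back works the same way for a real file: pass the file name,
# for example zipfile.ZipFile("sample.docx"), instead of the buffer.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()                       # like `unzip -l`
    body = archive.read("word/document.xml").decode("utf-8")

print(names)
print(body)
```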

For those who would like to know more about the docx structure, I would recommend An Informal Introduction to DOCX, a fairly detailed article about the structure of a docx file.

TEX / LATEX

TeX (/tɛx/, see below), stylized within the system as TEX, is a typesetting system which was designed and written by Donald Knuth and first released in 1978. TeX is a popular means of typesetting complex mathematical formulae; it has been noted as one of the most sophisticated digital typographical systems. TeX

TeX and LaTeX share the same underlying system. LaTeX is built on top of TeX and comes with a number of predefined macros to make the formatting easier.

The system was designed by a mathematician to correctly display math formulas. Here is an example: Example Tex Math.
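For a feel of what the source for a formula looks like, here is a minimal LaTeX sketch. It is a generic example (the quadratic formula), not the content of the linked example page.

```latex
\documentclass{article}
\begin{document}
The roots of $ax^2 + bx + c = 0$ are
\[
  x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.
\]
\end{document}
```

The text between `\[` and `\]` describes the formula in plain ASCII; the typeset fraction and square root only appear after the file is processed.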

To see what a LaTeX document looks like before and after processing have a look at Getting Started with LaTeX.

One major advantage of this system is that you can define a template with comments. The author then inserts the text in the appropriate places and the document comes out with a specific look and feel.

As an example of what you can do with LaTeX, here is a template I gave to the second grade parents to enter a book written by the second graders, Amato.tex, and here is the output after processing, Amato.pdf.

Presentation Format

Presentation format is a method of displaying documents over multiple platforms. This category was created because the formatting of many documents changes when they are viewed on different systems.

PS

PostScript is a very powerful language that looks a bit like Forth, another computer language. From the beginning, PostScript needed a pretty powerful system to run on. In fact, during the first years of its existence, PostScript printers had more processing power than the Macintoshes that were connected to them.

PostScript offered some huge advantages that other systems did not offer:

PostScript is device-independent. This means that a PostScript file can run on any PostScript device. On a laser printer, you get 300 dpi output, while the same file gives you beautiful and crisp 2400 or 2540 dpi output on an imagesetter. For users, this meant that they were no longer tied to one manufacturer and could choose the devices that best fit their purpose. The history of PostScript

PostScript was a method of both printing and distributing a file. Computers can use a program called Ghostscript to view a PostScript file. To give you an idea of what you can do with PostScript, here is a PostScript file, tiger.ps, and here is the output of the file when printed, tiger.png.

PDF

PDF started off on the dream of a paperless office, as the pet project of one of Adobe’s founders, John Warnock. Initially, it was an internal project at Adobe to create a file format so documents could be spread throughout the company and displayed on any computer using any operating system. In his paper which led to the development of PDF, John Warnock wrote: ‘Imagine being able to send full text and graphics documents (newspapers, magazine articles, technical manuals, etc.) over electronic mail distribution networks. These documents could be viewed on any machine and any selected document could be printed locally. This capability would truly change the way information is managed.’ The history of PDF

PDF has largely replaced PostScript as a display format on computers. Its advantage is that it displays the document consistently across any computer system. Originally you could not edit PDF documents; this has changed with the arrival of PDF editors. It still remains the best tool for exchanging written documents between computers.

Markup Documents

Markup documents differ from formatted documents in that the text is tagged for what it represents instead of how it looks. Some of these document types can also define what a specific tag looks like, but the look is attached to the tag rather than to each individual piece of text.

In this section we will discuss the markup language as well as tools for editing and transforming them.

SGML

SGML (Standard Generalized Markup Language) is an openly documented and freely implementable international standard for semantic markup of textual documents in a manner that permits the separation of the underlying content from the formatting instructions for display or printing.

SGML was designed to enable the sharing of machine-readable documents across different technical environments and to support a long readable life, particularly as required for documents produced in government, law, and industry. An SGML file is encoded as plain text. As a “generalized” markup language, it incorporated the concept of a “document type definition” (DTD). For a particular document application, a company or government agency would develop an SGML DTD, declaring names and constraints for document elements and their attributes. Standard Generalized Markup Language (SGML). ISO 8879:1986

This is a long way of saying that the government picked a markup standard that is used for formal documents. The DTD is a Document Type Definition, which declares which tags and attributes are allowed and how they may be combined. This standard is used by industry when preparing documents for the military so they can be universally understood by both people and computers.

To get an idea of what an SGML document looks like, have a look at: Article headers.

This standard has been largely replaced by XML, which is a subset of SGML.

HTML

One of the most popular applications of SGML came with the development of HyperText Markup Language (HTML) by Tim Berners Lee in the late 1980s (Raggett, Lam, Alexander & Kmiec, 1998). Since its development HTML has somewhat become a victim of its own popularity, as it was rapidly adopted and extended in many ways, beyond its original vision. It remains popular today, though as a presentation technology, and is considered unsuitable as a general purpose data storage format.

When it comes to data storage and interchange, HTML is a bad fit, as it was originally intended as a presentation technology, while SGML is considered too complex for general use. XML bridges this gap by being both human and machine readable, while being flexible enough to support platform and architecture independent data interchange. A Brief History of XML

Prior to XML came HTML, which was designed to share research papers on the internet. HTML is interpreted by the browser, so its display is not always consistent across different browsers. It has also been expanded to accommodate a number of extensions for movies and sound.

If you want to write HTML code directly, you can learn how from this tutorial: Writing HTML. This was written originally for teachers. The only tools you need are a text editor like Notepad and a web browser.
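To show what hand-written HTML looks like and how a program sees its tags, here is a small Python sketch using the standard `html.parser` module. The page content is a made-up example and `TagCollector` is a hypothetical class name.

```python
from html.parser import HTMLParser

# A minimal hand-written page, the kind you could type into Notepad.
PAGE = """<!DOCTYPE html>
<html>
  <head><title>My First Page</title></head>
  <body>
    <h1>Hello</h1>
    <p>This is a <em>tagged</em> paragraph.</p>
  </body>
</html>"""


class TagCollector(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(PAGE)
print(collector.tags)
```

Saving the text of `PAGE` to a file ending in .html and opening it in a browser shows the rendered version of the same tags.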

XML

XML

Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. XML

This markup language has taken over from both SGML and HTML as the most attractive markup language.

XML is a markup language that can be used for many different purposes. It is used for web pages, but it can also be used for databases as a backup or exchange mechanism. This subject could have a whole talk of its own, but for now if you want to learn more have a look at XML Tutorial or A Really, Really, Really Good Introduction to XML.
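As a small illustration of XML being both human-readable and machine-readable, the Python sketch below reads a record with the standard `xml.etree.ElementTree` module. The tag names and values are invented for the example.

```python
import xml.etree.ElementTree as ET

# A hypothetical record; the tag names are made up for illustration.
SOURCE = """<book>
  <title>Our Class Book</title>
  <author>Second Grade</author>
</book>"""

root = ET.fromstring(SOURCE)

# Each child element becomes one entry: tag name -> text content.
record = {child.tag: child.text for child in root}

print(root.tag)
print(record)
```

The tags say what each piece of text is (a title, an author), not how it should look, which is what lets the same file serve as a data exchange format.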

Docbook

DocBook is a semantic markup language for technical documentation. It was originally intended for writing technical documents related to computer hardware and software, but it can be used for any other sort of documentation.

As a semantic language, DocBook enables its users to create document content in a presentation-neutral form that captures the logical structure of the content; that content can then be published in a variety of formats, including HTML, XHTML, EPUB, PDF, man pages, Web help and HTML Help, without requiring users to make any changes to the source. DocBook

This is a form of markup that allows a single source document and multiple output formats. This is sometimes referred to as write once, publish multiples. To learn more consult the bible for DocBook at Docbook.org. Or for a good tutorial see An introduction to DocBook, a flexible markup language worth learning. We will see later a method of transforming docbook to other formats using Pandoc.

Markdown

Markdown is a lightweight markup language that you can use to add formatting elements to plaintext text documents. Created by John Gruber in 2004, Markdown is now one of the world’s most popular markup languages. What is Markdown?

Markdown is a simplified way of adding formatting to a file, which can then be converted to a number of different formats. It is used in wikis as well as many other document types. As an example, this document was written in markdown.
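To give a feel for how Markdown text turns into another format, here is a toy Python converter that knows only two rules: `#` headings and `*emphasis*`. It is a sketch for illustration, not a real Markdown implementation, and `tiny_markdown` is a hypothetical name.

```python
import re


def tiny_markdown(line):
    """Convert one line using only two Markdown rules: # headings and *emphasis*."""
    # *word* becomes <em>word</em>
    line = re.sub(r"\*(.+?)\*", r"<em>\1</em>", line)
    # One to six leading # characters followed by a space mark a heading.
    heading = re.match(r"(#{1,6}) (.*)", line)
    if heading:
        level = len(heading.group(1))
        return f"<h{level}>{heading.group(2)}</h{level}>"
    return f"<p>{line}</p>"


heading_html = tiny_markdown("### Abstract")
paragraph_html = tiny_markdown("Markdown is *lightweight* markup.")
print(heading_html)
print(paragraph_html)
```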

WIKI

What is a ‘wiki’ defined as today? This term “wiki” actually means quick in Hawaiian. The journey from that definition to today’s definition of “a website that allows collaborative editing of its content and structure by its users” is quite the interesting story, best told by Ward Cunningham, the father of the modern wiki.

The important part of wikis—what makes them different from any other type of website—is collaborative editing by the users. Think about that for a moment: the ability for the users of a wiki to collaboratively edit it. If you can read it, you can edit it. It seems simple at first, yet profoundly powerful in practice—and it’s what both Wikipedia and WikiLeaks have in common. What Are Wikis, and Why Should You Use Them?

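A wiki is useful because it allows you to edit its pages from anywhere through a web browser. The formatting language on a wiki page is a lightweight markup in the same spirit as Markdown; MediaWiki, used later in this talk, has its own syntax called wikitext.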

Pandoc

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library.

Pandoc can convert between numerous markup and word processing formats, including, but not limited to, various flavors of Markdown, HTML, LaTeX and Word docx. For the full lists of input and output formats, see the --from and --to options below. Pandoc can also produce PDF output: see creating a PDF, below.

Pandoc’s enhanced version of Markdown includes syntax for tables, definition lists, metadata blocks, footnotes, citations, math, and much more. Pandoc User’s Guide

Pandoc is sometimes referred to as the Swiss army knife of documentation. It gets that title because it can convert from many formats to many others.

For anyone interested in exploring pandoc themselves, the installation instructions are at Installing pandoc.

Here is a web page showing how pandoc can convert markdown to a number of other formats: Pandoc Samples.

Finally if you want a different look to your pandoc html files here are a few web pages with suggestions.

Displaying your Information

In this section we will discuss how to use internet service tools to display information. We will focus on using a wiki and a traditional web server.

Both types of internet servers can display multiple types of data from simple text to videos and music. Additionally you can use these tools to share information via file transfer.

Wiki

Now that we have seen multiple types of text formatting, let's have a look at the wiki we designed last time we met. Mine is at Main Page. Your link will be different from mine. The talk about creating the wiki is at Raspberry Pi Web Server.

The first thing we are going to do is to log in, in the upper right corner. During the setup you created a log in for yourself. Use that log in now to allow editing. Hopefully you wrote down that log in information when you created it. Otherwise you will need to repeat the section Mediawiki Configuration to restore the log in information.

For a complete text on Mediawiki I recommend looking at the book Working with MediaWiki. I especially recommend chapter 4 about formatting.

Text Formatting Markup

Let's take a page from the Mediawiki instruction pages about formatting: Help:Formatting

Adding Help to Main Page

Once I was logged in I added the Mediawiki Help:contents to the getting started list on the main page. Here is how I did it.

  1. I logged into the wiki

  2. I clicked on the Edit Source button

  3. I made a new entry just below Getting Started

  4. I added the line
    • [https://www.mediawiki.org/wiki/Cheatsheet Cheatsheet]
  5. I clicked save changes at the bottom.

Adding a PDF document

I suggested last meeting that you could use a wiki to save manuals for appliances in your home. Let me go through the steps I used to add the Free42 Manual to the wiki. I already had the manual on my computer.

  1. From the main page I clicked on Upload File under tools.

  2. I selected the file via the choose file button. I then filled in the fields requested, and clicked upload file.

  3. I created a new wiki page named Calculator-42.

  4. I edited the page to say: Notice the tags <pdf> and </pdf> around the file name.

    If want to use the Free42 simulator for the HP-42 calculator see <pdf>File:Free42-Manual-2.pdf</pdf>

  5. I saved the page and returned to main page.

  6. I edited the main page adding the section and link to the calculator-42 page: Notice the internal links use [[ and ]].

    === Sample PDF page ===

    [[Calculator-42]]
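The three pieces of wiki markup used in the steps above (an external link, an internal link, and a section heading) follow simple patterns. The Python sketch below builds them from their parts; the function names are hypothetical helpers, and the output is text you would paste into the wiki editor.

```python
def external_link(url, label):
    """External link: single square brackets, URL first, then the label."""
    return f"[{url} {label}]"


def internal_link(page):
    """Internal link to another wiki page: double square brackets."""
    return f"[[{page}]]"


def heading(title, level=3):
    """Section heading: the title wrapped in `level` equals signs on each side."""
    marks = "=" * level
    return f"{marks} {title} {marks}"


print(external_link("https://www.mediawiki.org/wiki/Cheatsheet", "Cheatsheet"))
print(heading("Sample PDF page"))
print(internal_link("Calculator-42"))
```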

Web Server

Now that I have described how markup and markdown look, let's take a look at the method I use to create my web site. By now it should be obvious that I write my talks in pandoc markdown. I then use a custom set of CSS files and templates I came across on the internet. Finally, I use a Makefile to handle all the details of maintaining the files.

My method is somewhat simple.

  1. I create a directory to hold a new talk. I choose the base directory from my current top level directories: Lessons in Linux, Maker’s Movement, STEM, or Troubleshooting.

  2. I cd to the new folder and create a file ending in .md. For this talk I created the file stem/editing-words/editing.md

  3. I open the file in a programmer's editor, Emacs, and add the heading like this.

    % Editing Words, Formatted and Tagged
    % John F. Moore
    % April 2022
    <!--
    # File: /mnt/john/work/1and1/stem/editing-words/editing.md
    # Date: Wed 30 Mar 2022 11:14:16 EDT
    # Author: John F. Moore <john.moore@lions-wing.net>
    # Last Revised: Time-stamp: <2022-04-20 14:51:55 john> maintained by emacs
    # Description: Word and web page editing
    -->

    ### Abstract ###

    > Text documents, Word documents, PDF files, Markup documents,

  1. The first line is the title

  2. The second line is the author

  3. The third line is the date

  4. The symbol <!-- begins a comment, and --> ends it.

  5. The lines starting with # are my standard header for a file. Note: the line Last Revised: is a time stamp maintained by emacs.

  4. I add the third level header ### Abstract ### and put in a summary of what the talk is about.

  5. From there on I write what I want to talk about. I include quotes, links, and pictures. When I want a YouTube video I copy the embed share code directly into the file.

  6. I open the file named Makefile in the base directory and add a new entry for the file with options. Using make simplifies the conversion of the markdown file to an HTML file.

    editing-words/editing.html: editing-words/editing.md $(FORMATING)
        echo Build: editing.html
        $(SED) "s/%DATE%/$(UDATE)/" base-footer.html >footer.html
        $(PANDOC) editing-words/editing.md -o editing-words/editing.html -f markdown --css ../../template.css --toc -A footer.html -t html5 -s --template standalone.html

  7. As I progress in the file I run make to recreate the HTML file, which I view in a web browser. I store any files or images used in the talk in the directory.

  8. When I get the talk complete I add an entry to the top index.md file and the sub directory index.md files.

  9. Finally, when everything is working correctly on my laptop, I use the program FileZilla to upload the files to my website.
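The three lines starting with % in the heading shown earlier form what pandoc calls a title block: title, author, and date, in that order. The Python sketch below reads such a block by hand to show how little structure is involved; `read_title_block` is a hypothetical helper, not part of pandoc, and it ignores pandoc's rules for multi-line titles and multiple authors.

```python
def read_title_block(markdown_text):
    """Read leading % lines as title, author, and date (in that order)."""
    fields = []
    for line in markdown_text.splitlines():
        if not line.startswith("%"):
            break                      # the title block ends at the first other line
        fields.append(line[1:].strip())
    # Pad so that a short block still unpacks into three values.
    fields += [""] * (3 - len(fields))
    title, author, date = fields[:3]
    return {"title": title, "author": author, "date": date}


SOURCE = """% Editing Words, Formatted and Tagged
% John F. Moore
% April 2022

### Abstract ###
"""

info = read_title_block(SOURCE)
print(info)
```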

Adding to the local Web Server

As an example, let's use the sample.md as a starting point. From there I will process the file with the command:

pandoc -f markdown -t html5 sample.md -s -o sample.html

This produces the file sample.html
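If you would rather drive the conversion from a script, the same command can be assembled in Python. This is a sketch under a few assumptions: `pandoc_command` is a hypothetical helper, the flags are the ones used in the command above, and the conversion is only attempted when pandoc is installed and sample.md exists in the current directory.

```python
import os
import shutil
import subprocess


def pandoc_command(source, target, from_format="markdown", to_format="html5"):
    """Build the argument list for the pandoc call shown above."""
    return ["pandoc", "-f", from_format, "-t", to_format, source, "-s", "-o", target]


command = pandoc_command("sample.md", "sample.html")
print(" ".join(command))

# Only attempt the conversion when pandoc is installed and the input exists.
if shutil.which("pandoc") and os.path.exists("sample.md"):
    subprocess.run(command, check=True)
```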

Now it is time to upload the file to the web server. Start PuTTY's SFTP client: from a command line type psftp 192.168.1.142, where 192.168.1.142 is the IP address of your server. You will be asked for a user name (pi) and a password. Once you have logged in, type the command put sample.html to transfer the file. After the transfer is complete, type exit to leave the psftp program.

Now you need to log in to the server using PuTTY again. Once you are connected, you need to make yourself the admin user: type the command sudo -i to become the root user. Now move the file to the base of the web server. As root, type the command mv /home/pi/sample.html /var/www/html/sample.html. This moves the file sample.html to the root of the web server. You need to do this as root because the directory the HTML pages live in is owned by root.

Now let's look for the file. In a web browser, type the address http://192.168.1.142/sample.html, which should show you the sample file on your web server.

So now that you know how to upload web pages to your web server, you can either add HTML pages or enter data using the wiki.


Written by John F. Moore

Last Revised: Wed 20 Apr 2022 06:03:22 PM EDT

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.