Strings in CMU.TXT


Data stream formats in the Andrew User Interface System 
Wilfred J. Hansen
Andrew Consortium
Carnegie Mellon University
(The Andrew Toolkit (ATK) is the architecture and tools for building
applications in the Andrew User Interface System.)
In order to support the inclusion of arbitrary objects in multi-media
editors, the Andrew Toolkit requires data objects to conform to a set of
conventions for their file representation. A data object must write its
data enclosed in a begin/end marker pair. The marker must include a tag
denoting the type of the object being written and a unique identifier,
used for referencing the data object by other data objects. If a data
object includes other data objects, they must be properly nested. The
begin/end markers make it possible to find the data associated with an
object without actually parsing the data.
For example, a text with an embedded picture has the format:
\begindata{text,1}
<text data>
\begindata{picture,2}
<picture data>
\enddata{picture,2}
\view {pictureview,2}
<more text data>
\enddata{text,1}
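The nesting rule can be illustrated with a short Python sketch that locates
every data object by its markers alone, without interpreting the data in
between. It is not part of ATK; the names MARKER and object_spans are
hypothetical, and it assumes each marker starts at the beginning of a line.

```python
import re

# Matches a marker line such as \begindata{text,1} or \enddata{picture,2}.
MARKER = re.compile(r"^\\(begin|end)data\{\s*([^,{}]+?)\s*,\s*(\d+)\s*\}")


def object_spans(lines):
    """Return (type, id, first_line, last_line) for every data object.

    Line numbers are zero-based indexes into `lines`. Inner objects are
    reported before the objects that contain them.
    """
    stack, spans = [], []
    for number, line in enumerate(lines):
        m = MARKER.match(line)
        if not m:
            continue  # ordinary data line; never parsed
        kind, obj_type, obj_id = m.group(1), m.group(2), int(m.group(3))
        if kind == "begin":
            stack.append((obj_type, obj_id, number))
        else:
            # An \enddata must close the most recently opened object.
            if not stack or stack[-1][:2] != (obj_type, obj_id):
                raise ValueError(f"unbalanced \\enddata on line {number}")
            _, _, start = stack.pop()
            spans.append((obj_type, obj_id, start, number))
    if stack:
        raise ValueError("missing \\enddata")
    return spans
```

Applied to the eight lines of the example above, this reports the picture
object on lines 2 through 4 and the text object on lines 0 through 7.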
In order to transport files across most networks, data streams use only
printable 7-bit ASCII characters, including tab, space and new-line, and
keep line lengths below 80 characters.
____________________________________
Text format
Text data streams in the Andrew User Interface System follow the general
principles for Andrew Toolkit data streams. The overall structure
of a text data stream is
A. \begindata line
B. \textdsversion line
C. \template line
D. definitions of additional styles
E. the text body itself
F. styled text
G. embedded objects in text body
H. \enddata line
Subsequent sections of this document describe each of these components.
As usual in ATK, the appropriate way to read or write the data stream is
to call upon the corresponding Read or Write method from the AUIS
distribution. Only in this way is your code likely to continue to work in
the face of changes to the data stream definition. Moreover, there are
a number of special features--mostly outdated data streams--that are
implemented in the code, but not described here.
A. \begindata line
Standard ATK begindata line having the form
\begindata{text,99999}
where 99999 is some identifying number unique within this data stream.
B. \textdsversion line
This line always has the form
\textdsversion{12}
There exist files written with earlier data stream versions having values
other than 12.
C. \template line
If the file utilizes a style template, there will be a line of this form:
\template{default}
where 'default' is whatever template name is used. This template name is
the prefix of a filename: the suffix ".tpl" is appended to it and the
result is sought in the directories named in the user's atktemplatepath
preference value. If there is no such preference value, the default
directory is $ANDREWDIR/lib/tpls.
'default' is the most usual template name. Every installation of AUIS
is expected to have $ANDREWDIR/lib/tpls/default.tpl and its styles are
not defined further in the document.
D. definitions of additional styles
A document may define and use styles that are not in the template. Each
such definition is two or more lines:
\define{internalstylename
menuname
attribute
. . .
attribute}
The internalstylename is lower case and may have digits, but no spaces.
There may be no menuname, in which case there is an empty line; if there
is a menuname line, it is of the form
menu:[Menu card name,Style name]
If there are no attributes, the closing '}' appears at the end of the
menuname line. Each attribute line is of the form
attr:[attributename basis units value]
where the first three are strings and the fourth is an integer, possibly
signed. The specific values allowed are beyond the scope of this document;
they do correspond closely to values in style.H.
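As an illustration only, the layout of a \define block can be read with a
few lines of Python. This is a sketch based on the description above, not
the ATK reader; parse_define, MENU and ATTR are hypothetical names, and the
block is assumed to be passed in as one string that ends at the closing '}'.

```python
import re

# menu:[Menu card name,Style name]
MENU = re.compile(r"^menu:\[([^,\]]*),([^\]]*)\]$")
# attr:[attributename basis units value]
ATTR = re.compile(r"^attr:\[(\S+) (\S+) (\S+) ([+-]?\d+)\]$")


def parse_define(block):
    """Parse one \\define block into a dict with name, menu and attributes."""
    lines = block.split("\n")
    if not lines[0].startswith("\\define{") or not lines[-1].endswith("}"):
        raise ValueError("not a complete \\define block")
    lines[-1] = lines[-1][:-1]            # drop the closing brace
    name = lines[0][len("\\define{"):]
    menu, attributes = None, []
    for line in lines[1:]:
        if not line:
            continue                      # empty line: no menuname
        m = MENU.match(line)
        if m:
            menu = (m.group(1), m.group(2))
            continue
        a = ATTR.match(line)
        if a:
            attributes.append(
                (a.group(1), a.group(2), a.group(3), int(a.group(4))))
            continue
        raise ValueError("unrecognized line in \\define: " + line)
    return {"name": name, "menu": menu, "attributes": attributes}
```

The sketch does not check attribute names, bases or units against the
values the toolkit accepts.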
E. the text body itself
Text is represented by itself. n consecutive newlines in the text are
represented by n+1 newlines in the data stream. Single newlines are used
to break the stream into lines of less than 80 bytes; these are ignored
when the file is read. Earlier data stream versions required a space before
a newline if there was to be a space in the text; version 12 invents a
space before the newline if one is not there. The latter is prevented by
ending the line with a single backslash (\). If a sentence ends a line and
has more than one space after its punctuation, the additional spaces must
appear at the start of the next line. The characters backslash, left
brace, and right brace are always preceded in the text with a backslash.
There is a convention for representing non-ASCII ISO-8859 characters, but
I don't know what it is offhand.
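The newline and escape rules for version 12 can be sketched in Python as
follows. This is an illustration of the rules as stated here, not the ATK
reader: decode_text_body is a hypothetical name, style markers and embedded
objects are passed through untouched, and the non-ASCII convention is not
handled.

```python
def decode_text_body(stream):
    """Turn data-stream text back into document text (version 12 rules)."""
    out = []
    i, n = 0, len(stream)
    while i < n:
        c = stream[i]
        if c == "\\" and i + 1 < n:
            nxt = stream[i + 1]
            if nxt in "\\{}":
                out.append(nxt)       # escaped backslash or brace
                i += 2
                continue
            if nxt == "\n":
                i += 2                # backslash-newline: no space invented
                continue
            out.append(c)             # start of a style marker; left as is
            i += 1
            continue
        if c == "\n":
            j = i
            while j < n and stream[j] == "\n":
                j += 1
            run = j - i
            if run == 1:
                # A single newline only breaks the line; it stands for a
                # space unless a space is already there.
                if out and out[-1] != " ":
                    out.append(" ")
            else:
                # n+1 newlines in the stream represent n in the text.
                out.append("\n" * (run - 1))
            i = j
            continue
        out.append(c)
        i += 1
    return "".join(out)
```

For instance, the stream text "a newli", backslash, newline, "ne" decodes
to "a newline", and two consecutive newlines decode to one.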
F. styled text
If text in the body is to be displayed in a style, e.g. italic, the text is
preceded with
\internalstylename{
and followed by a closing curly brace. The internal style name is
one of the names defined either in the template or in a \define line.
G. embedded objects in text body
When an object is embedded in a text body, two items appear: the data
stream for the object and a \view line. The \begindata for the object is
always at the beginning of a line. (The previous line is terminated with
backslash if there is to be no space before the object.) The \enddata
line for the object always ends with a newline (which is not treated as a
space).
The \view line has the form:
\view{rasterview,8888,777,0,0}
In future data stream versions, other items may appear before the '}'; each
such item is preceded by a comma. The first item in the list is the textual
name of the view object to be used to display the dataobject. The second
item is the identifying integer that also appears in the \begindata for the
object. The third value is ignored. The fourth and fifth items are
usually zero; however, if non-zero they specify the desired width and height
at which to display the object.
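A \view line can be split into its items with a small Python sketch such as
the one below. The names ViewRef and parse_view_line are hypothetical, and
the handling of missing or additional items (zero padding, an 'extra'
tuple) is an assumption made for the sketch rather than documented ATK
behavior.

```python
import re
from typing import NamedTuple


class ViewRef(NamedTuple):
    view_name: str     # textual name of the view object
    object_id: int     # same integer as in the object's \begindata
    ignored: int       # third item; carries no meaning
    width: int         # 0 means no width requested
    height: int        # 0 means no height requested
    extra: tuple       # items a future data stream version might add


VIEW = re.compile(r"^\\view\s*\{([^{}]*)\}")


def parse_view_line(line):
    """Parse a line such as \\view{rasterview,8888,777,0,0}."""
    m = VIEW.match(line)
    if not m:
        raise ValueError("not a \\view line")
    items = [item.strip() for item in m.group(1).split(",")]
    if len(items) < 2:
        raise ValueError("\\view needs a view name and an object id")
    numbers = [int(item) for item in items[1:5]]
    numbers += [0] * (4 - len(numbers))   # assumed: missing items are zero
    return ViewRef(items[0], numbers[0], numbers[1], numbers[2],
                   numbers[3], tuple(items[5:]))
```

With the line shown above, the result has view_name 'rasterview',
object_id 8888, and a width and height of zero.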
H. \enddata line
Has the form
\enddata{text,99999}
that is, it is the same as the \begindata line, but has 'end' instead
of 'begin'.
____________________________________
Format of ATK raster images
The raster data object writes a standard ATK data stream beginning with
a \begindata line and ending with a \enddata line. Between these comes
a header and possibly an image body.
The first line of the header looks like this:
2 0 65536 65536 0 0 484 603
Where the values are these:
RasterVersion: '2'
This specification describes the second version of this encoding.
Options: '0'
This field may specify changes to the image before displaying it:
raster_INVERT  (1<<0)  /* exchange black and white */
raster_FLIP    (1<<1)  /* exchange top and bottom */
raster_FLOP    (1<<2)  /* exchange left and right */
raster_ROTATE  (1<<3)  /* rotate 90 clockwise */
xScale, yScale: '65536 65536'
These scale factors affect the size at which the image is printed.
The value raster_UNITSCALE (136535) will print the image at
approximately the size on the screen. The default scale of
65536 is approximately half the screen size. (It is not
exactly half screen size in an effort to simplify scaling on
300-dots-per-inch printers.)
x, y, width, height: '0 0 484 603'
It is possible for a raster object to display a portion of an
image. These fields select this portion by specifying the
index of the upper left pixel and the width and height of the
image in pixels.
In all instances so far, x and y are both zero and the width
and height specify the entire raster.
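The eight fields of the first header line can be read as in the Python
sketch below. RasterHeader, parse_raster_header and option_names are
hypothetical names; the bit values are the ones listed for the Options
field above.

```python
from typing import NamedTuple

# Option bits, as listed for the Options field.
RASTER_FLAGS = [
    ("INVERT", 1 << 0),   # exchange black and white
    ("FLIP", 1 << 1),     # exchange top and bottom
    ("FLOP", 1 << 2),     # exchange left and right
    ("ROTATE", 1 << 3),   # rotate 90 clockwise
]


class RasterHeader(NamedTuple):
    version: int
    options: int
    x_scale: int
    y_scale: int
    x: int
    y: int
    width: int
    height: int


def parse_raster_header(line):
    """Parse a first header line such as '2 0 65536 65536 0 0 484 603'."""
    fields = [int(field) for field in line.split()]
    if len(fields) != 8:
        raise ValueError("expected 8 integers in the raster header")
    return RasterHeader(*fields)


def option_names(options):
    """List the names of the option bits that are set."""
    return [name for name, bit in RASTER_FLAGS if options & bit]
```

For the line shown above, the parsed header has version 2, no options set,
and a width and height of 484 and 603.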
The second header line specifies the actual raster in one of three forms;
but only the first of these forms is actually used.
First form:
bits 10156544 484 603
RasterType: 'bits'
This form.
RasterId: '10156544'
An identifier so other raster objects can refer to this one.
Usually this is the same identifier as in the \begindata line.
Width, Height: '484 603'
Describes the width of each row and the number of rows.
This many rows follow on subsequent lines.
Second form: refer 10135624
The current data object does not have the bits, but refers to the
bits as stored in another data object (which should appear earlier
in the same data stream.) 'refer' identifies this form and
the integer is the identifying number.
Third form: file 10235498 filename path
The raster is not in the current data object, but is in a file.
'file' identifies this form. The id number '10235498' allows
this raster data to be referred to by a 'refer' form. The filename
is the full pathname of the file. Path is the element of a
"rasterpath" list against which the filename was resolved.
(This is not fully implemented. The idea is to achieve a
measure of recovery in case the file is moved.)
In the first form ('bits'), the header is followed by lines specifying the
image. There is at least one line per raster row, though some rows may take
more lines. The bits of a row are encoded in blocks of eight; a multiple
of 8 bits are specified, though trailing bits will be ignored after reading
the row. Following the last bits for a row are a space, a vertical bar (|),
and a newline. Basically, white space is to be ignored, so the bytes of the
encoding are broken into blocks of 13 or 14 bytes separated with tabs.
The bits of the row are run-length encoded by bytes. That is, a sequence
of identical bytes will be represented in only a few bytes rather than
at full length. Hexadecimal is a subset of this encoding, with a one bit
representing black and a zero bit representing white. Here is the
interpretation of each range of byte values:
control characters and space:
Ignored.
@ [ ] ^ _ ` } ~ 0x7F and all characters with high bit set:
These are errors, but at present they are ignored.
{ \:
Illegal end of line. Treat as end of row.
|:
Legal end of row. If there have not been enough
codes for the entire width, pad with white bits.
0x21 ... 0x2F (punctuation characters)
The next two bytes specify a hex value. This value is
to be repeated in the row the number of times given
by c-0x1F, where c is the input code. (That is, 0x21 means
to repeat the byte two times, 0x22 three times, and so on.)
0x30 ... 0x3F (digit or punctuation)
This is a hex digit with the value c-0x30. Two consecutive
hex digits encode one byte of the row.
A ... F a ... f
These are hex digits with values 0xA ... 0xF.
g ... z
Multiple white bytes. c-'f' bytes of white are generated into the row.
G ... Z
Multiple black bytes. c-'F' bytes of black are generated into the row.
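The table above translates fairly directly into a row decoder. The Python
sketch below is an illustration, not the ATK code: decode_row and hex_value
are hypothetical names, and it assumes that two hex digits form a byte with
the high-order digit first, that white is a zero bit, and that a row with
too many bytes is simply truncated.

```python
def hex_value(ch):
    """Value of one hex digit, or None. All of 0x30..0x3F count as digits."""
    code = ord(ch)
    if 0x30 <= code <= 0x3F:
        return code - 0x30
    if ch in "ABCDEF":
        return code - ord("A") + 10
    if ch in "abcdef":
        return code - ord("a") + 10
    return None


def decode_row(text, width):
    """Decode one run-length encoded row into (width + 7) // 8 bytes."""
    row_bytes = (width + 7) // 8
    row = bytearray()
    pending = None   # high digit of a hex pair waiting for its low digit
    repeat = 0       # when non-zero, the next hex pair is repeated
    for ch in text:
        code = ord(ch)
        if ch == "|" or ch in "{\\":
            break                          # end of row (legal or illegal)
        if code <= 0x20:
            continue                       # control characters and space
        digit = hex_value(ch)
        if digit is not None:
            if pending is None:
                pending = digit
            else:
                byte = bytes([(pending << 4) | digit])
                row.extend(byte * (repeat or 1))
                pending, repeat = None, 0
        elif 0x21 <= code <= 0x2F:
            repeat = code - 0x1F           # 0x21 means two copies, etc.
        elif "g" <= ch <= "z":
            row.extend(b"\x00" * (code - ord("f")))   # white bytes
        elif "G" <= ch <= "Z":
            row.extend(b"\xff" * (code - ord("F")))   # black bytes
        # Any other character is an error in the encoding and is ignored.
    row.extend(b"\x00" * (row_bytes - len(row)))      # pad with white
    return bytes(row[:row_bytes])
```

For a 484-pixel row, decode_row("zzzg |", 484) yields 61 white bytes: each
'z' contributes 20 and the 'g' contributes one.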
The following sample is a text data stream containing a page break object
and an embedded raster:
\begindata{text,538375988}
\textdsversion{12}
\template{default}
\define{global
}
\define{up15
menu:[Justify,Up15]
attr:[Script PreviousScriptMovement Point -15]}
This is text in the document. \italic{This is
italic.} These two lines are one paragraph.

This paragraph is preceded by two newlines, but it will be
displayed with only one blank line between it and the previous one.
When two spaces are required between words, the second
must appear at the beginning of a line. When a newli\
ne is not to be replaced with a space, it must be preceded
with backslash.
\begindata{bp,9233088}
\enddata{bp,9233088}
\view{bpv,9233088,38,0,0}
This second page has a raster on it.
\begindata{raster,10156544}
2 0 68266 68266 0 0 484 603
bits 10156544 484 603
zzzg |
zzzg |
7fZZHfeKfeOc0g |
. . .
zzzg |
\enddata{raster,10156544}
\view{rasterview,10156544,31,0,0}
\enddata{text,538375988}
-----------------------------
The only immediate comment I would add is that, if you come across a
file which purports to be an AUIS raster file, it may be either of two
things:
A) A raster datastream, as defined in Fred's document under "Format of
ATK raster images." The first line of this file would be
\begindata{raster,99999}
(with an arbitrary ID integer in place of the 99999). This would be
followed by the header and possibly the image body, and then the final
line would be
\enddata{raster,99999}
(with the same integer.)
B) A text datastream (or some other kind of datastream) containing a
raster as an embedded object and no other data. This is not the
preferred way to store a raster image, but it tends to happen every now
and then.
In this case, the raster datastream will occur, as described above,
somewhere within the larger datastream. It is legal to read in lines and
throw them away until you find a line that begins
\begindata{raster,
(The backslash will always be the first character on the line.) You then
read in the datastream until the \enddata line occurs, and ignore the
rest of the file. (You can compare the ID numbers as a consistency
check).
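The procedure for case B can be sketched in Python as follows.
extract_raster is a hypothetical name; the sketch takes the lines of the
file and returns the lines of the first raster datastream it finds,
markers included, using the ID comparison as the consistency check.

```python
def extract_raster(lines):
    """Return the lines of the first raster datastream, markers included."""
    begin = "\\begindata{raster,"
    end = "\\enddata{raster,"
    collected = []
    raster_id = None
    for line in lines:
        line = line.rstrip("\n")
        if raster_id is None:
            # Throw lines away until the raster begins.
            if line.startswith(begin):
                raster_id = line[len(begin):].split("}")[0].strip()
                collected.append(line)
        else:
            collected.append(line)
            if line.startswith(end):
                end_id = line[len(end):].split("}")[0].strip()
                if end_id != raster_id:
                    raise ValueError("raster ID mismatch")
                return collected       # the rest of the file is ignored
    raise ValueError("no complete raster datastream found")
```

The same function also works for case A, where the raster datastream
starts on the first line of the file.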