The structural features of each manuscript
Six structural features have been encoded in the corpus:
Manuscript
The transcription of each manuscript is contained between the <manuscript> and </manuscript> tags.
Two attributes form part of the opening tag: id and name:
- id is an abbreviated form of the name of the manuscript
- name is the full name of the manuscript according to its entry in the catalogue of the repository where it is kept
The opening tag of the manuscript Peniarth 4, for example, is:
<manuscript id="P4" name="Peniarth 4">
All the manuscripts in the corpus and their identifiers are listed in The corpus.
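As a minimal sketch of how the opening tag might be read programmatically, the Python below parses a small fragment with the standard library's xml.etree.ElementTree. The fragment is an illustrative stand-in rather than a real corpus file, and the function name manuscript_identity is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only; a real corpus file contains the full transcription.
SAMPLE = '<manuscript id="P4" name="Peniarth 4"></manuscript>'


def manuscript_identity(xml_text):
    """Return the (id, name) pair from the opening <manuscript> tag."""
    root = ET.fromstring(xml_text)
    if root.tag != "manuscript":
        raise ValueError("expected a <manuscript> root element")
    return root.get("id"), root.get("name")


print(manuscript_identity(SAMPLE))
```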
Page
The extent of a manuscript page or folio is encoded between the <page> and </page> tags.
For the purposes of analysis each page or folio of a manuscript needs a unique number. At various times and for various reasons, however, some manuscripts have had their pages renumbered. Others may have been incorrectly numbered. The exact numbering in each manuscript is recorded in its header file.
Pages are typically numbered sequentially, beginning with 1, 2, 3 etc. We encode such numbering in the <page> tag with the n attribute. A typical opening page tag is:
<page n="1">
Folios are also typically numbered sequentially, beginning with 1r, 1v, 2r, 2v, etc. (In this convention, 'r' represents 'recto', and 'v' stands for 'verso'.) As with pages, such numbering is also conveyed by the n attribute, e.g.
<page n="1r">
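The two numbering conventions can be handled with one small parser. The Python sketch below assumes that an n value is a run of digits optionally followed by r or v; manuscripts that have been renumbered may use other forms, which the header files record, so this is an illustration rather than a complete treatment. The names parse_page_number and sort_key are hypothetical.

```python
import re

# Assumed shape of the n value: digits, optionally followed by r or v.
PAGE_NUMBER = re.compile(r"(\d+)([rv]?)")
SIDES = {"r": "recto", "v": "verso", "": None}


def parse_page_number(n):
    """Split an n value such as "1" or "1r" into (number, side)."""
    match = PAGE_NUMBER.fullmatch(n)
    if match is None:
        raise ValueError(f"unrecognised page or folio number: {n!r}")
    return int(match.group(1)), SIDES[match.group(2)]


def sort_key(n):
    """Order numerically, with recto before verso on the same folio."""
    number, side = parse_page_number(n)
    return number, 1 if side == "verso" else 0


print(sorted(["2r", "1v", "10r", "1r"], key=sort_key))
```

Sorting on the parsed value rather than on the raw string keeps folio 10r after folio 2r.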
Column
The texts in some manuscripts are written in two columns. In such cases the extent of each column is denoted by the <col> and </col> tags. One attribute forms part of each opening tag: n, which encodes the number of a column.
The columns in some manuscripts have been numbered by someone else. In such cases we use those pre-existing numbers to identify the columns.
The columns in other manuscripts are not numbered. In such cases we have numbered the columns consecutively, beginning with 1.
A typical opening column tag is:
<col n="1">
Line
Line-breaks occur in our transcriptions in the same places as in the originals.
Lines are not numbered in the manuscripts. However, in order to identify each line of a page (or each line of a column when the page is divided into columns), we have assigned a number to each one, beginning with 1. Line numbers begin afresh on each page (or column).
A typical line tag is:
<line n="1">
Most lines have been aligned to the left by the scribe. Occasionally, however, a line of text may be written in the centre of a page or column, or on the right-hand side. In such cases the alignment is recorded by the align attribute, e.g.
<line n="23" align="right">
<line n="34" align="center">
On the rare occasions that it has been considered necessary to record that a line has been aligned to the left, this has been done by specifying align="left".
Sometimes a scribe leaves a line without any writing on it. As such lines are structurally significant, they have been tagged and numbered.
Words
Each word in the corpus has a unique address. This has been accomplished by assigning a unique number to each word on a particular line. Word-numbering begins afresh on each line. The number is encoded by the n attribute, e.g.
<w n="1">ty</w>
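One way to picture the address of a word is as the chain of n values from the manuscript down to the word itself. The Python sketch below builds such addresses under two assumptions that the text above does not confirm: that <page>, <col>, <line> and <w> are nested in that order, and that a dotted string is an acceptable way to write the address. The sample fragment and the function name word_addresses are illustrative only.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment; the nesting of page, col, line and w is an assumption.
SAMPLE = """
<manuscript id="P4" name="Peniarth 4">
  <page n="1r">
    <col n="1">
      <line n="1"><w n="1">ty</w> <w n="2">brenhin</w></line>
      <line n="2"><w n="1">gwlad</w></line>
    </col>
  </page>
</manuscript>
"""


def word_addresses(xml_text):
    """Yield (address, word) pairs for every <w> in the fragment."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        # Pages without columns hold their lines directly, so fall back to the page.
        containers = page.findall("col") or [page]
        for container in containers:
            col = container.get("n") if container.tag == "col" else None
            for line in container.findall("line"):
                for word in line.findall("w"):
                    parts = [root.get("id"), page.get("n"), col,
                             line.get("n"), word.get("n")]
                    address = ".".join(p for p in parts if p is not None)
                    yield address, "".join(word.itertext())


for address, word in word_addresses(SAMPLE):
    print(address, word)
```

Because word numbers restart on every line and line numbers restart on every page or column, a word number only becomes unique once it is combined with the numbers of the units that contain it.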
In addition to n, <w> may have a further four attributes:
- rend, if a word is written in a colour other than black
- la, if a word is written in a language other than Welsh
- type, if a word is written over two lines
- order, if words are divided in an irregular order
If a word is written in red ink, rend may have the value "red", e.g.
<w n="1" rend="red">judas</w>
la is encoded using the ISO language codes:
- cy: Welsh
- en: English
- fr: French
- la: Latin
Example:
<w n="1" lang="la">filius</w>
Medieval scribes frequently wrote the beginning of a word on one line and its end on the next, e.g. brenhin could be written:
... bren
hin ...
The "begin" and "end" values of the type attribute show that both syllables form part of the same word:
...<w n="25" type="begin">bren</w>
<w n="1" type="end">hin</w>
There is little that is unexpected in the above example: quite simply, the linguistic word brenhin is divided over two lines. Sometimes, though, encoding the relationship between parts of words which have been divided over two lines can be rather more complex.
In the following example, not only has brenhin been divided, but also gwlad:
... bren
... gwl ... hin
ad ...
In this example hin is the second half of the word brenhin and ad is the second half of the word gwlad. In order to match up the first and second halves of both words, the order of the elements of the first word is marked up as irregular:
<w type="begin" order="irregular">bren</w>
<w type="begin">gwl</w>
<w type="end" order="irregular">hin</w>
<w type="end">ad</w>
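The matching of halves can be sketched in Python. The pairing rule used here is one reading of the markup, not a rule stated by the project: each "end" part is joined to the earliest unmatched "begin" part that carries the same order value. The input is a simplified list of (text, type, order) tuples in reading order, and the function name join_split_words is hypothetical.

```python
from collections import deque


def join_split_words(parts):
    """Join "begin" and "end" parts that share the same order value.

    parts is a list of (text, type, order) tuples in reading order; type and
    order are None when the attribute is absent.
    """
    pending = {}  # order value -> queue of indexes into words
    words = []
    for text, part_type, order in parts:
        if part_type == "begin":
            pending.setdefault(order, deque()).append(len(words))
            words.append(text)
        elif part_type == "end":
            queue = pending.get(order)
            if not queue:
                raise ValueError(f"no matching beginning for {text!r}")
            words[queue.popleft()] += text
        else:
            words.append(text)
    return words


regular = [("bren", "begin", None), ("hin", "end", None)]
irregular = [
    ("bren", "begin", "irregular"),
    ("gwl", "begin", None),
    ("hin", "end", "irregular"),
    ("ad", "end", None),
]
print(join_split_words(regular))
print(join_split_words(irregular))
```

Keeping a separate queue for each order value means that hin is never attached to gwl, even though gwl is the most recent unfinished word when hin is read.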
Note on Word Division
In the medieval period it was not uncommon for one linguistic word to be written as two. For example, the expected brenhin might appear as bre nhin. In such cases, the linguistic word is marked up as one even though it includes a space:
<w>bre nhin</w>
This situation could have been dealt with in other ways. Had we, for example, been able to lemmatise the corpus, we could have given primacy to the orthographic word as we transcribed and encoded, say:
<lemma="brenin"><w>bre</w><w>nhin</w></lemma>
But lemmatisation introduces a host of other issues and was not part of the present project. We did, however, wish to produce wordlists for the corpus and the <w>bre nhin</w> encoding enabled that to be accomplished: the space in the original is represented and the complete word can also be included in the wordlist. The user may choose whether or not to see the fruits of this exercise in the display mode.
The occasional gaps encountered in sequences such as bre nhin may well have been the product of the scribe lifting his pen momentarily. It is also clear that intra-word spaces are not of uniform size: some of them could be considered half spaces. Writing what we would consider to be two words as one, however, is a regular phenomenon. It primarily occurs when a proclitic (an unaccented grammatical word) such as a possessive pronoun precedes a noun, or when a relative pronoun precedes a verb. Not unreasonably, the unaccented words are joined to the following one. Examples are vympenn and aweleis.
Were we concerned simply with representing the form of our manuscripts, the unaccented proclitic phenomenon would not have been an issue: vympenn and aweleis could have been encoded simply as:
<w>vympenn</w>
<w>aweleis</w>
But since we were also concerned to produce useful wordlists and to pave the way for lemmatisation, we needed to be able to separate the two elements in sequences such as the above. This has been accomplished with a <nospace/> tag. <nospace/> functions to separate two linguistically separate words even though there is no space between them in the source. vympenn and aweleis are thus encoded:
<w>vym<nospace/>penn</w>
<w>a<nospace/>weleis</w>
On occasion three lexical elements may be written as one, e.g. Ympaffuryf. Such sequences are also disambiguated with <nospace/>:
<w>Ym<nospace/>pa<nospace/>ffuryf</w>
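The effect of the two conventions on a wordlist can be sketched in Python: a space inside <w> is dropped, while <nospace/> is treated as a word boundary. This is an illustration of the idea under those two assumptions, not the project's own wordlist procedure, and the function name lexical_words is hypothetical.

```python
import xml.etree.ElementTree as ET


def lexical_words(w_xml):
    """Return the linguistically separate words inside one <w> element."""
    w = ET.fromstring(w_xml)
    segments = [w.text or ""]
    for child in w:
        if child.tag == "nospace":
            # <nospace/> marks a word boundary that has no space in the source.
            segments.append("")
        else:
            # Any other child (for example <c>) adds its text to the current word.
            segments[-1] += "".join(child.itertext())
        segments[-1] += child.tail or ""
    # A space inside <w> is scribal rather than lexical, so it is dropped here.
    return ["".join(segment.split()) for segment in segments if segment.strip()]


print(lexical_words("<w>bre nhin</w>"))
print(lexical_words("<w>vym<nospace/>penn</w>"))
print(lexical_words("<w>Ym<nospace/>pa<nospace/>ffuryf</w>"))
```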
There are also examples where the scribe has followed his ear and fused two words together. These have been separated in the transcription and the target form of the second word given in a <sic> tag. In Peniarth 5, for example, we have ychynt, where ych and hynt have been coalesced into one orthographical unit. This has been disambiguated and encoded as:
<w>ych<nospace/><sic: hynt>ynt</sic></w>
It is hoped that this encoding strategy will also help the novice reader who might otherwise spend fruitless hours scouring dictionaries for non-existent words.
Characters
The shape and colour of most characters are so predictable that it is not necessary or practicable to tag them. Nonetheless, every manuscript has characters which display special characteristics which need to be encoded. We have, for example, display capitals, ornate letters, and rubricated letters.
Characters which have special characteristics are marked up between <c> and </c> tags.
The special characteristics are encoded by three attributes:
- rend, which conveys a colour or rubrication
- type, which encodes all other characteristics
- linespan, which marks up the size of large initial letters
rend generally takes the values "red" or "rubric". For example, if the initial letter of hywel were written in red ink or had been rubricated, it would be encoded as:
<w><c rend="red">h</c>ywel</w>
or
<w><c rend="rubric">h</c>ywel</w>
type marks up various characteristics. It can have the following values:
- initial: a large initial capital letter
- semi-cap: a small initial capital letter
- miniature: a letter with floral decoration
- ornate: a letter with ornamentation
- medial a: a medium-sized a
- dotted y: a y with a dot in its cup
- ligature: two letters l tied together
- 6: a character resembling the number 6 whose belly is open
Examples:
<w><c type="initial">H</c>ywel</w>
<c type="ligature">ll</c>
<w>g<c type="6">w</c>r</w>
<c type="medial">a</c>
When a large initial letter occurs at the beginning of a line, it is not uncommon for it to span several lines. In such cases, the value of the linespan attribute conveys the number of lines in question, e.g.
<w><c type="initial" linespan="3">H</c>ywel</w>
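Character markup of this kind can be read back in Python without losing the plain form of the word. The sketch below returns the word text alongside each <c> element and its attributes; the function name special_characters is hypothetical, and attribute values are left as the strings found in the markup.

```python
import xml.etree.ElementTree as ET


def special_characters(w_xml):
    """Return the plain word text and a list of (text, attributes) for each <c>."""
    w = ET.fromstring(w_xml)
    plain = "".join(w.itertext())
    marked = [("".join(c.itertext()), dict(c.attrib)) for c in w.iter("c")]
    return plain, marked


plain, marked = special_characters(
    '<w><c type="initial" linespan="3">H</c>ywel</w>'
)
print(plain)
print(marked)
```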