Breudwd Welsh Prose 1350-1425

The structural features of each manuscript

Six structural features have been encoded in the corpus: manuscript, page, column, line, words and characters.

Manuscript

The transcription of each manuscript is contained between the <manuscript> and </manuscript> tags.

Two attributes form part of the opening tag: id and name.

The opening tag of the manuscript Peniarth 4, for example, is:

<manuscript id="P4" name="Peniarth 4">

All the manuscripts in the corpus and their identifiers are listed in The corpus.

Page

The extent of a manuscript page or folio is encoded between the <page> and </page> tags.

For the purposes of analysis each page or folio of a manuscript needs a unique number. At various times and for various reasons, however, some manuscripts have had their pages renumbered. Others may have been incorrectly numbered. The exact numbering in each manuscript is recorded in its header file.

Pages are typically numbered sequentially, beginning with 1, 2, 3 etc. We encode such numbering in the <page> tag with the n attribute. A typical opening page tag is:

<page n="1">

Folios are also typically numbered sequentially, beginning with 1r, 1v, 2r, 2v, etc. (In this convention, 'r' represents 'recto', and 'v' stands for 'verso'.) As with pages, such numbering is also conveyed by the n attribute, e.g.

<page n="1r">
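
The page and folio labels described above can be put into reading order with a small helper. The following is only a sketch in Python: the function name page_sort_key is hypothetical, and it assumes that every n value is either a plain number or a number followed by r or v. Manuscripts with irregular numbering, as recorded in their header files, would need extra handling.

```python
import re

def page_sort_key(n):
    """Split a label such as "12" or "3v" into (number, side) for sorting."""
    match = re.fullmatch(r"(\d+)([rv]?)", n)
    if match is None:
        # Assumption: anything else is irregular numbering we do not handle.
        raise ValueError(f"unrecognised page label: {n!r}")
    number, side = match.groups()
    # Plain page numbers get side 0; recto (1) sorts before verso (2).
    return int(number), {"": 0, "r": 1, "v": 2}[side]

labels = ["2r", "1v", "10r", "1r", "2v"]
ordered = sorted(labels, key=page_sort_key)
print(ordered)  # ['1r', '1v', '2r', '2v', '10r']
```

Sorting on the numeric part avoids the usual pitfall of string sorting, where "10r" would come before "2r".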

Column

The texts in some manuscripts are written in two columns. In such cases the extent of each column is denoted by the <col> and </col> tags. One attribute, n, forms part of each opening tag. Its value is determined as follows.

The columns in some manuscripts have been numbered by someone else. In such cases we use those pre-existing numbers to identify the columns.

The columns in other manuscripts are not numbered. In such cases we have numbered the columns consecutively, beginning with 1.

A typical opening column tag is:

<col n="1">

Line

Line-breaks occur in our transcriptions in the same places as in the originals.

Lines are not numbered in the manuscripts. However, in order to identify each line of a page (or each line of a column when the page is divided into columns), we have assigned a number to each one, beginning with 1. Line numbers begin afresh on each page (or column).

A typical line tag is:

<line n="1">

Most lines have been aligned to the left by the scribe. Occasionally, however, a line of text may be written in the centre of a page or column, or on the right-hand side. In such cases the alignment is recorded by the align attribute. For example:

<line n="23" align="right">
<line n="34" align="center">

On the rare occasions when it has been considered necessary to record that a line is aligned to the left, this has been done by specifying align="left".

Sometimes a scribe leaves a line without any writing on it. As such lines are structurally significant, they have been tagged and numbered.

Words

Each word in the corpus has a unique address. This has been accomplished by assigning a unique number to each word on a particular line. Word-numbering begins afresh on each line. The number is encoded by the n attribute, e.g.

<w n="1">ty</w>
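
As an illustration of how such an address might be assembled, the sketch below walks a small, made-up fragment with Python's standard xml.etree.ElementTree module. It assumes that <w> elements sit inside <line> elements, which in turn sit inside <col> or directly inside <page>; that nesting, the sample text and the function name word_addresses are assumptions made for the example, not a description of the corpus files.

```python
import xml.etree.ElementTree as ET

# Invented fragment for illustration only; the nesting is an assumption.
SAMPLE = """
<manuscript id="P4" name="Peniarth 4">
  <page n="1r">
    <col n="1">
      <line n="1"><w n="1">ty</w> <w n="2" rend="red">judas</w></line>
      <line n="2"><w n="1" lang="la">filius</w></line>
    </col>
  </page>
</manuscript>
"""

def word_addresses(root):
    """Yield ((manuscript, page, column, line, word), text) for every <w>."""
    ms = root.get("id")
    for page in root.iter("page"):
        # A page without columns is treated as a single unnumbered column.
        columns = page.findall("col") or [page]
        for col in columns:
            col_n = col.get("n") if col.tag == "col" else None
            for line in col.findall("line"):
                for w in line.findall("w"):
                    address = (ms, page.get("n"), col_n, line.get("n"), w.get("n"))
                    yield address, "".join(w.itertext())

addresses = list(word_addresses(ET.fromstring(SAMPLE)))
for address, text in addresses:
    print(address, text)
```

Because word numbers restart on each line and line numbers restart on each page or column, only the full tuple identifies a word uniquely.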

In addition to n, <w> may have a further four attributes: rend, lang, type and order.

If a word is written in red ink, rend may have the value "red", e.g.

<w n="1" rend="red">judas</w>

lang records the language of a word using the ISO language codes, e.g. "la" for Latin.

Example:

<w n="1" lang="la">filius</w>

Medieval scribes frequently wrote the beginning of a word on one line and its end on the next, e.g. brenhin could be written:

... bren
hin ...

The "begin" and "end" values of the type attribute show that both syllables form part of the same word:

... <w n="25" type="begin">bren</w>
<w n="1" type="end">hin</w>

There is little that is unexpected in the above example: quite simply, the linguistic word brenhin is divided over two lines. Sometimes, though, encoding the relationship between parts of words which have been divided over two lines can be rather more complex.

In the following example, not only has brenhin been divided, but also gwlad:

... bren
... gwl ... hin
ad ...

In this example hin is the second half of the word brenhin and ad is the second half of the word gwlad. In order to match up the first and second halves of both words, the order of the elements of the first word is marked up as irregular:

<w type="begin" order="irregular">bren</w>
<w type="begin">gwl</w>
<w type="end" order="irregular">hin</w>
<w type="end">ad</w>
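
One way to rejoin divided words from this markup is sketched below in Python. It rests on an assumed reading of the scheme: a half marked type="end" completes the earliest unfinished type="begin" half that carries the same order value, with a missing order attribute counted as regular. The <sample> wrapper and the function name pair_halves are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import deque

# The four halves from the example above, in a hypothetical <sample> wrapper.
SAMPLE = """
<sample>
  <w type="begin" order="irregular">bren</w>
  <w type="begin">gwl</w>
  <w type="end" order="irregular">hin</w>
  <w type="end">ad</w>
</sample>
"""

def pair_halves(root):
    """Return whole words, pairing begin/end halves that share an order value."""
    open_halves = {}  # order value -> queue of first halves awaiting an end
    words = []
    for w in root.iter("w"):
        text = "".join(w.itertext())
        key = w.get("order", "regular")
        kind = w.get("type")
        if kind == "begin":
            open_halves.setdefault(key, deque()).append(text)
        elif kind == "end" and open_halves.get(key):
            words.append(open_halves[key].popleft() + text)
        else:
            words.append(text)  # undivided word, or an end with no partner
    return words

joined = pair_halves(ET.fromstring(SAMPLE))
print(joined)  # ['brenhin', 'gwlad']
```

The same function covers the simpler case in which a word is divided over two lines with nothing irregular in between.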

Note on Word Division

In the medieval period it was not uncommon for one linguistic word to be written as two. For example, the expected brenhin might appear as bre nhin. In such cases, the linguistic word is marked up as one even though it includes a space:

<w>bre nhin</w>

This situation could have been dealt with in other ways. Had we, for example, been able to lemmatise the corpus, we could have given primacy to the orthographic word as we transcribed and encoded, say:

<lemma="brenin"><w>bre</w><w>nhin</w></lemma>

But lemmatisation introduces a host of other issues and was not part of the present project. We did, however, wish to produce wordlists for the corpus and the <w>bre nhin</w> encoding enabled that to be accomplished: the space in the original is represented and the complete word can also be included in the wordlist. The user may choose whether or not to see the fruits of this exercise in the display mode.

The occasional gaps encountered in sequences such as bre nhin may well have been the product of the scribe lifting his pen momentarily. It is also clear that intra-word spaces are not of uniform size: some of them could be considered ‘half spaces’. Writing what we would consider to be two words as one, however, is a regular phenomenon. It primarily occurs when a proclitic – an unaccented grammatical word – such as a possessive pronoun precedes a noun, or when a relative pronoun precedes a verb. Not unreasonably, the unaccented word is joined to the one that follows it. Examples are vympenn and aweleis.

Were we concerned simply with representing the form of our manuscripts, the unaccented proclitic phenomenon would not have been an issue: vympenn and aweleis could have been encoded simply as:

<w>vympenn</w>
<w>aweleis</w>

But since we were also concerned to produce useful wordlists and to pave the way for lemmatisation, we needed to be able to separate the two elements in sequences such as the above. This has been accomplished with a <nospace/> tag. <nospace/> functions to separate two linguistically separate words even though there is no space between them in the source. vympenn and aweleis are thus encoded:

<w>vym<nospace/>penn</w>
<w>a<nospace/>weleis</w>

On occasion three lexical elements may be written as one, e.g. Ympaffuryf. Such sequences are also disambiguated with <nospace/>:

<w>Ym<nospace/>pa<nospace/>ffuryf</w>
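
To show how a wordlist could be drawn from this markup, the Python sketch below splits each <w> at <nospace/> and closes up any space inside a word. It is an assumption-laden illustration rather than the project's own procedure: the <sample> wrapper and the function name linguistic_words are hypothetical, and removing the internal space is one possible way of listing the complete word.

```python
import xml.etree.ElementTree as ET

# Examples from the text above, in a hypothetical <sample> wrapper.
SAMPLE = """
<sample>
  <w>bre nhin</w>
  <w>vym<nospace/>penn</w>
  <w>a<nospace/>weleis</w>
  <w>Ym<nospace/>pa<nospace/>ffuryf</w>
</sample>
"""

def linguistic_words(w):
    """Split one <w> at each <nospace/> and drop spaces inside the parts."""
    parts = [w.text or ""]
    for child in w:
        if child.tag == "nospace":
            # Text after <nospace/> starts a new linguistic word.
            parts.append(child.tail or "")
        else:
            # Other inline markup stays inside the current word.
            parts[-1] += "".join(child.itertext()) + (child.tail or "")
    return [p.replace(" ", "") for p in parts if p]

root = ET.fromstring(SAMPLE)
wordlist = [word for w in root.iter("w") for word in linguistic_words(w)]
print(wordlist)
# ['brenhin', 'vym', 'penn', 'a', 'weleis', 'Ym', 'pa', 'ffuryf']
```

The source text itself is left untouched, so the display can still show bre nhin and vympenn as the scribe wrote them.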

There are also examples where the scribe has followed his ear and fused two words together. These have been separated in the transcription and the target form of the second word is given in a <sic> tag. In Peniarth 5, for example, we have ychynt, where ych and hynt have been coalesced into one orthographical unit. This has been disambiguated and encoded as:

<w>ych<nospace/><sic: hynt>ynt</sic></w>

It is hoped that this encoding strategy will also help the novice reader who might otherwise spend fruitless hours scouring dictionaries for non-existent words.

Characters

The shape and colour of most characters are so predictable that it is not necessary or practicable to tag them. Nonetheless, every manuscript has characters which display special characteristics which need to be encoded. We have, for example, display capitals, ornate letters, and rubricated letters.

Characters which have special characteristics are marked up between <c> and </c> tags. The special characteristics are encoded by three attributes: rend, type and linespan.

rend generally takes the values "red" or "rubric". For example, if the initial letter of hywel were written in red ink or had been rubricated, it would be encoded as:

<w><c rend="red">h</c>ywel</w>
or
<w><c rend="rubric">h</c>ywel</w>

type marks up various characteristics. Its values include "initial", "ligature", "6" and "medial".

Examples:

<w><c type="initial">H</c>ywel</w>
<c type="ligature">ll</c>
<w>g<c type="6">w</c>r</w>
<c type="medial">a</c>

When a large initial letter occurs at the beginning of a line, it is not uncommon for it to span several lines. In such cases, the value of the linespan attribute conveys the number of lines in question, e.g.

<w><c type="initial" linespan="3">H</c>ywel</w>
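
The three attributes can be read back from the markup as in the following Python sketch. The <sample> wrapper and the function name special_characters are hypothetical, and treating a missing linespan as 1 is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Examples from the text above, in a hypothetical <sample> wrapper.
SAMPLE = """
<sample>
  <w><c type="initial" linespan="3">H</c>ywel</w>
  <w><c rend="red">h</c>ywel</w>
  <w>g<c type="6">w</c>r</w>
</sample>
"""

def special_characters(root):
    """List every <c> with its rend, type and linespan values."""
    found = []
    for c in root.iter("c"):
        found.append({
            "char": c.text,
            "rend": c.get("rend"),
            "type": c.get("type"),
            # Assumption: a character without linespan occupies one line.
            "linespan": int(c.get("linespan", "1")),
        })
    return found

characters = special_characters(ET.fromstring(SAMPLE))
for entry in characters:
    print(entry)
```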