
An overview of (additional) corpora used jointly in the LCLV19 project can be found here. These materials include:

  • An electronic diplomatic/facsimile edition of nineteenth- and early twentieth-century private letters (approx. 1 million words).
  • An electronic corpus of newspapers and periodicals (approx. 1.4 million words).
  • The Icelandic Parsed Historical Corpus (IcePaHC), consisting mainly of narrative and religious prose/fiction (1 million words, approx. 100,000 words per century).

Corpora (being) developed as a part of the present PhD project:

  • Icelandic Corpus of Early Nineteenth-Century Correspondence (ICENCC). An electronic rendition of a collection of diplomatic and semi-normalised editions of private letters, produced using Google Tesseract-OCR, with Skrambi for post-correction. ICENCC is intended to extend the LCLV19 letter corpus with data up to the middle of the nineteenth century. It currently consists of 670 letters written by 26 scribes (approx. 425,000 words). The text of the letters is available on GitHub, along with a letter inventory.
  • Corpus of Reykjavík Grammar School Essays (1847-48, 1852, 1855, 1860-61, 1875, 1890). A partial, experimental XML/TEI-based version of the text (1847-48), including the teachers' corrections of grammar, punctuation and style, transcribed using the HisTEI framework (see examples/screenshots below). The transcriptions are freely available on GitHub, along with the photographed samples from 1852, 1855, 1860-61, 1875 and 1890. In addition, a sample Corpus of Corrections based on this material will soon be available in CSV format, along with references to the photographs.

Feel free to join in! Over a thousand pages on GitHub are waiting to be transcribed.


Sample screenshots of coded/transcribed data
(oXygen XML Editor, HisTEI)

I. Person list (student id)


<person xml:id="person_nth_yzd_4p">
  <persName>
    <forename>Árni</forename>
    <surname>Bjarnason</surname>
    <surname>Thorsteinsson</surname>
  </persName>
  <sex value="1"/>
  <occupation when="1896" cert="high">landfógeti í Rvík. R. og dbrm Kgk. alþm.</occupation>
  <education when="1847" cert="high">II</education>
  <birth when="1828-04-05">
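A person record of this kind could be read programmatically along the following lines. This is only a minimal sketch using Python's standard library: the excerpt above is cut off, so the sample string closes the `birth` and `person` elements itself (an assumption), it omits any TEI namespace declaration (also an assumption), and the function name `read_person` and the dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Sample record based on the excerpt. The closing of <birth> and <person>
# is an assumption made here so that the string parses.
SAMPLE = """
<person xml:id="person_nth_yzd_4p">
  <persName>
    <forename>Árni</forename>
    <surname>Bjarnason</surname>
    <surname>Thorsteinsson</surname>
  </persName>
  <sex value="1"/>
  <occupation when="1896" cert="high">landfógeti í Rvík. R. og dbrm Kgk. alþm.</occupation>
  <education when="1847" cert="high">II</education>
  <birth when="1828-04-05"/>
</person>
"""


def read_person(xml_text):
    """Return a few fields of one <person> record as a plain dict."""
    person = ET.fromstring(xml_text)
    name = person.find("persName")
    return {
        "id": person.get(XML_ID),
        "forename": name.findtext("forename"),
        # A record may carry more than one <surname>.
        "surnames": [s.text for s in name.findall("surname")],
        "birth": person.find("birth").get("when"),
        "education": person.findtext("education"),
    }


if __name__ == "__main__":
    print(read_person(SAMPLE))
```

If the actual files declare the TEI namespace, the element names in `find` and `findall` would need to be qualified accordingly.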


II. Page transcription (hand mark-up, student id)



<facsimile xml:base="transcriptions/1847-1848/"> <graphic mimeType="image/jpeg" url="IMAG0613.jpg" xml:id="image_046"/>


III. Page transcription (overwriting and correcting/underlining)


Line #1: student overwriting own text; line #3: teacher correcting by underlining.


hættulegt sé að láta þau fá of mikið vald yfir<lb break="yes"/> <del hand="#teacher" cause="fix" confidence="1" rend="underlining" status="correction" type="case-marking">sig</del>
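Mark-up of this kind lends itself to tabulating the teacher's corrections. The sketch below, using only Python's standard library, collects `del` elements attributed to `#teacher` and writes them as CSV. The wrapping `<p>` element, the function names and the column layout are assumptions for illustration; they do not describe the format of the planned Corpus of Corrections.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The excerpt wrapped in a <p> element (an assumption) so that it is well-formed.
SAMPLE = (
    '<p>hættulegt sé að láta þau fá of mikið vald yfir<lb break="yes"/> '
    '<del hand="#teacher" cause="fix" confidence="1" rend="underlining" '
    'status="correction" type="case-marking">sig</del></p>'
)

# Hypothetical column layout for the CSV output.
FIELDS = ["text", "type", "rend", "cause"]


def teacher_corrections(xml_text):
    """Collect every <del> whose hand attribute points to the teacher."""
    root = ET.fromstring(xml_text)
    rows = []
    for deletion in root.iter("del"):
        if deletion.get("hand") == "#teacher":
            rows.append({
                "text": "".join(deletion.itertext()),
                "type": deletion.get("type"),
                "rend": deletion.get("rend"),
                "cause": deletion.get("cause"),
            })
    return rows


def to_csv(rows):
    """Serialise the collected rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(teacher_corrections(SAMPLE)))
```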


IV. Page transcription (reordering by numeration in text)



 og það ógnaði<lb break="yes"/> þeim með <seg xml:id="bk01">guðs</seg> <metamark function="transposition" target="#ib01" place="above">2.</metamark> <seg xml:id="bk02">reiði</seg> <metamark function="transposition" target="#ib02" place="above">1.</metamark> <listTranspose> <transpose><ptr target="#bk02"/><ptr target="#bk01"/></transpose> </listTranspose>
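The reading order recorded in `listTranspose` can be recovered by following each `ptr` to the `seg` it names. The following is a minimal sketch with Python's standard library; the wrapping `<p>` element and the function name `intended_order` are assumptions, and the sketch looks only at the `transpose` pointers, not at the `metamark` targets.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The excerpt wrapped in a <p> element (an assumption) so that it is well-formed.
SAMPLE = (
    '<p>og það ógnaði<lb break="yes"/> þeim með '
    '<seg xml:id="bk01">guðs</seg> '
    '<metamark function="transposition" target="#ib01" place="above">2.</metamark> '
    '<seg xml:id="bk02">reiði</seg> '
    '<metamark function="transposition" target="#ib02" place="above">1.</metamark> '
    '<listTranspose> <transpose><ptr target="#bk02"/><ptr target="#bk01"/></transpose> '
    '</listTranspose></p>'
)


def intended_order(xml_text):
    """Return, for each <transpose>, the segment texts in the order its pointers give."""
    root = ET.fromstring(xml_text)
    # Map each segment id to its text.
    segments = {
        seg.get(XML_ID): "".join(seg.itertext()) for seg in root.iter("seg")
    }
    orders = []
    for transpose in root.iter("transpose"):
        # Each pointer target has the form "#id"; drop the leading "#".
        ids = [ptr.get("target").lstrip("#") for ptr in transpose.findall("ptr")]
        orders.append([segments[i] for i in ids])
    return orders


if __name__ == "__main__":
    print(intended_order(SAMPLE))
```

For the sample, the written order is guðs, reiði, and the pointers give reiði, guðs, matching the numbers 2. and 1. placed above the words.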


V. Setting up document (student drop-down menu)



VI. Further examples


A sample student essay (skólastíll) from 1855.


A correction of the generic pronoun maður: the teacher underlines it twice (see Viðarsson 2017).


It would be interesting to know how Halldór Kr. Friðriksson instructed students in writing their mother tongue, and what it was that he made a point of correcting in their work. It is in fact said that there were few things he disliked as much as the word maður being used as an indefinite pronoun, as in Danish. (Kjartan G. Ottósson 1990:96, translated from Icelandic)


VII. GitHub repository (1847-48 transcriptions; 1852, '55, '60-'61, '75, '90)

Please visit GitHub to access the repository.