


The Coruña Corpus of English Scientific Writing


The Coruña Corpus of English Scientific Writing is one of the projects currently being carried out at the University of A Coruña (Spain) by the Research Group for Multidimensional Corpus-based Studies in English (MuStE). The team is creating a corpus for the diachronic study of scientific discourse at most linguistic levels, thereby contributing to the study of the historical development of English for specific purposes. At the same time, we believe that the Coruña Corpus is an excellent tool for studying the scientific register/style at particular moments in history: it lets the researcher analyse how this "Specific English" behaves from a synchronic point of view.

The compilation of the Coruña Corpus has been and is still governed by some of the most common parameters used in Corpus Linguistics, namely, external criteria for the delimitation of dates, sampling techniques, number of words per sample, etc.

Many pilot studies have already proved the utility of the Coruña Corpus.

The UNESCO classification of the fields of Science and Technology (International Standardisation of Statistics on Science and Technology, UNESCO 1978) has provided a starting point for text and discipline selection, as we intend to include texts from all or most of these categories.

The Coruña Corpus of English Scientific Writing (CC) is, therefore, a specialised corpus covering several scientific disciplines. It is divided into subcorpora by domain or discipline. From the start of the project in 2004, it was designed to contain 10,000-word samples (at a rate of two per decade and discipline) of scientific works published between 1700 and 1900 and written directly in English by English-speaking authors.
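The sampling design described above implies a target size for each subcorpus. The short Python sketch below works through that arithmetic. It is only an illustration: the names are invented for this example, and it assumes that the period 1700 to 1900 is counted as twenty decades (1700s to 1890s), which is consistent with the figure of ca. 400,000 words per subcorpus given in the summary on this page.

```python
# Figures taken from the description above; all names are illustrative only.
FIRST_YEAR = 1700
LAST_YEAR = 1900            # assumption: exclusive bound, so the last decade starts in 1890
SAMPLES_PER_DECADE = 2      # two samples per decade and discipline
WORDS_PER_SAMPLE = 10_000   # approximate length of each sample


def decade_starts(first_year: int = FIRST_YEAR, last_year: int = LAST_YEAR) -> list[int]:
    """Return the first year of every decade in the period covered."""
    return list(range(first_year, last_year, 10))


def target_words_per_subcorpus() -> int:
    """Approximate word count of one subcorpus under the stated design."""
    return len(decade_starts()) * SAMPLES_PER_DECADE * WORDS_PER_SAMPLE


print(len(decade_starts()))          # 20 decades
print(target_words_per_subcorpus())  # 400000 words
```

Since sample lengths are approximate, the real size of any given subcorpus may differ somewhat from this target.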

In order to avoid repetitions of patterns caused by idiosyncrasies of authors, one of the principles of the CC is to include only one sample per author in the whole corpus, even when some of them were certainly prolific in different fields of knowledge.
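The one-sample-per-author principle can be checked mechanically over a list of sample metadata. The following Python sketch shows one possible way of doing so. It is not the procedure used by the compilers: the record layout (`author`, `subcorpus`, `year`) and the placeholder values are assumptions made purely for illustration.

```python
from collections import Counter


def duplicate_authors(samples: list[dict]) -> list[str]:
    """Return the authors represented by more than one sample in the whole corpus."""
    counts = Counter(sample["author"] for sample in samples)
    return sorted(author for author, n in counts.items() if n > 1)


# Hypothetical metadata records; the values are placeholders, not corpus data.
samples = [
    {"author": "Author A", "subcorpus": "CETA", "year": 1705},
    {"author": "Author B", "subcorpus": "CEPhiT", "year": 1710},
    {"author": "Author A", "subcorpus": "CHET", "year": 1712},
]

# "Author A" appears in two subcorpora, so it violates the principle.
print(duplicate_authors(samples))
```

Note that the check runs across the whole corpus rather than within each subcorpus, since the principle applies even to authors who wrote in several fields.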

All the subcorpora in the CC share the same compilation principles, structure and mark-up and have been edited in XML following TEI conventions to facilitate contrastive studies. Similarly, each is built on a multi-field indexed textual repository, which is used by the accompanying information retrieval platform, the Coruña Corpus Tool (CCT). This index allows searches on criteria drawn not only from the samples but also from the metadata files, which contain information about both authors and texts.

Compilation has always followed the same sampling criteria (see Crespo and Moskowich, 2010, CETA in the Context of the Coruña Corpus, Literary and Linguistic Computing, 25(2): 153-164). These criteria include preserving and representing certain special characters and symbols, which can be seen when using the CCT.

Subcorpora in the Coruña Corpus


CETA (Corpus of English Texts on Astronomy) came out in 2012, accompanied by a book containing both works on methodological aspects and some pilot studies.

The team has also worked on Philosophy to compile CEPhiT (Corpus of English Philosophy Texts), whose compilation is detailed here and which was released in March 2016 together with a book.

We are currently finishing CHET (Corpus of Historical English Texts) and CECheT (Corpus of English Chemistry Texts), and have started searching for suitable samples on diverse aspects of the study of languages and philology in order to compile CETeL (Corpus of English Texts on Language) in the near future. At the same time, we have been working to finish the compilation of CELiST (Corpus of English Life Sciences Texts).


The Coruña Corpus of English Scientific Writing (Coruña Corpus or CC)


Compilers: Please see each particular corpus
Project Director: Isabel Moskowich
Period: 1700-1900
Size: extracts of ca. 10,000 words at a rate of two samples per decade and discipline, thus ca. 400,000 words in each subcorpus

Availability: Please see the info for each particular corpus


Subcorpora in the CC

CETA (Corpus of English Texts on Astronomy)

CEPhiT (Corpus of English Philosophy Texts)

CHET (Corpus of Historical English Texts)

CECheT (Corpus of English Chemistry Texts)

CELiST (Corpus of English Life Sciences Texts)

CETeL (Corpus of English Texts on Language)


How to cite the Coruña Corpus

If you want to refer to the Coruña Corpus in general in your work, please cite the following works:

- Moskowich, Isabel and Crespo García, Begoña. 2007. Presenting the Coruña Corpus: A Collection of Samples for the Historical Study of English Scientific Writing. In Pérez Guerra, Javier et al. (eds.) ‘Of Varying Language and Opposing Creed’: New Insights into Late Modern English. Bern: Peter Lang. 341–357.

- Moskowich, Isabel and Parapar López, Javier. 2008. Writing Science, Compiling Science. The Coruña Corpus of English Scientific Writing. In Lorenzo Modia, María Jesús (ed.) Proceedings from the 31st AEDEAN Conference. A Coruña: Universidade da Coruña. 531–544.

- Crespo García, Begoña and Moskowich, Isabel. 2010. CETA in the Context of the Coruña Corpus. Literary and Linguistic Computing, 25(2): 153–164.

To refer to the CCT, please cite:

- Parapar López, Javier and Moskowich, Isabel. 2007. The Coruña Corpus Tool. Revista del Procesamiento de Lenguaje Natural, 39: 289–290.



