The Coruña Corpus of English Scientific Writing is one of the projects currently being carried out at the University of A Coruña (Spain) by the Research Group for Multidimensional Corpus-based Studies in English (MuStE). The team is creating a corpus that can be used for the diachronic study of scientific discourse at most linguistic levels and can thereby contribute to the study of the historical development of English for specific purposes. At the same time, we believe that the Coruña Corpus is an excellent tool for the study of the scientific register/style at particular moments in history: it offers the researcher the chance to analyse how this "Specific English" behaves from a synchronic point of view.
The compilation of the Coruña Corpus has been and is still governed by some of the most common parameters used in Corpus Linguistics, namely, external criteria for the delimitation of dates, sampling techniques, etc. Additionally, we also resort to some criteria of our own.
The UNESCO classification of the fields of Science and Technology (International Standardisation of Statistics on Science and Technology, UNESCO 1978, 1988) has provided a starting point for text and discipline selection as we intend to include texts from all or most of these categories.
The Coruña Corpus of English Scientific Writing (CC) is, therefore, a specialised corpus covering several different scientific disciplines. It is divided into subcorpora by domain or discipline. From the start of the project in 2004, it was designed to contain 10,000-word samples (at a rate of two per decade and discipline) of scientific works that were published between 1700 and 1900 and written directly in English by English-speaking authors.
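The sampling design described above implies a target size for each subcorpus, which can be checked with a short calculation. The following is a minimal sketch in Python, assuming that the period 1700-1900 is counted as twenty decades (the 1700s to the 1890s); the function and constant names are illustrative and are not part of the Coruña Corpus Tool.

```python
# Illustrative sketch of the sampling frame; all names are hypothetical
# and not part of the Coruña Corpus Tool (CCT).

SAMPLE_WORDS = 10_000        # approximate words per sample
SAMPLES_PER_DECADE = 2       # samples per decade and discipline
START_YEAR, END_YEAR = 1700, 1900


def decades(start: int, end: int) -> list[int]:
    """Return the first year of each decade from start up to (not including) end."""
    return list(range(start, end, 10))


def target_size(start: int = START_YEAR, end: int = END_YEAR) -> tuple[int, int]:
    """Return (number of samples, approximate word count) for one subcorpus."""
    n_samples = len(decades(start, end)) * SAMPLES_PER_DECADE
    return n_samples, n_samples * SAMPLE_WORDS


n_samples, n_words = target_size()
print(n_samples, n_words)  # 40 400000
```

The released subcorpora stay close to this target: CEPhiT, for instance, contains 40 samples (400,416 words) and CETA 42 samples (409,909 words).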
In order to avoid repetitions of patterns caused by idiosyncrasies of authors, one of the principles of the CC is to include only one sample per author in the whole corpus, even when some of them were certainly prolific in different fields of knowledge.
All the subcorpora in the CC share the same compilation principles, structure and mark-up and have been edited in XML following TEI conventions to facilitate contrastive studies. Similarly, they are all formed by a multi-field indexed textual repository which is used by the information retrieval platform accompanying them (Coruña Corpus Tool, CCT). This index allows searches using different criteria contained not only in the samples but also in the metadata files with information both about authors and texts.
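As a rough illustration of what selecting samples by several metadata fields at once might look like, the sketch below filters a handful of records. It is not the CCT's implementation: the record structure, the field names (`sample_id`, `subcorpus`, `year`, `author`) and the three entries are hypothetical placeholders invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical metadata record; the field names are assumptions."""
    sample_id: str
    subcorpus: str
    year: int
    author: str


# Placeholder entries for illustration only, not real corpus samples.
RECORDS = [
    SampleRecord("s1", "CETA", 1702, "Author A"),
    SampleRecord("s2", "CETA", 1855, "Author B"),
    SampleRecord("s3", "CEPhiT", 1710, "Author C"),
]


def select(records, subcorpus=None, year_from=None, year_to=None):
    """Return the records that match every criterion that is not None."""
    result = []
    for rec in records:
        if subcorpus is not None and rec.subcorpus != subcorpus:
            continue
        if year_from is not None and rec.year < year_from:
            continue
        if year_to is not None and rec.year > year_to:
            continue
        result.append(rec)
    return result


matches = select(RECORDS, subcorpus="CETA", year_from=1700, year_to=1799)
print([rec.sample_id for rec in matches])  # ['s1']
```

Criteria left as `None` are ignored, so the same function covers searches on one field or on several combined.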
Compilation has always followed the same sampling criteria (see Crespo and Moskowich, 2010, CETA in the Context of the Coruña Corpus, Literary and Linguistic Computing, 25/2: 153-164). These include preserving and representing certain special characters and symbols, which can be seen when using the CCT.
Subcorpora in the Coruña Corpus
CETA (Corpus of English Texts on Astronomy) came out in 2012, accompanied by a book containing both works on methodological aspects and some pilot studies.
The team has also been working on the subject matter of Philosophy to compile CEPhiT (Corpus of English Philosophy Texts) whose compilation is detailed here and which was released in March 2016 together with a book.
Since 2019, CETA and CEPhiT, as well as the subsequent subcorpora (so far CHET, Corpus of History English Texts; CELiST, Corpus of English Life Sciences Texts; and CECheT, Corpus of English Chemistry Texts), have been published in open access.
We are currently finishing CETeL (Corpus of English Texts on Language) and working on CETePh (Corpus of English Texts on Physics), as well as starting the search for suitable samples relating to geography for compilation in the near future.
The Coruña Corpus of English Scientific Writing (Coruña Corpus or CC)
Compilers: Please see each individual corpus
Project Director: Isabel Moskowich
Period: 1700-1900
Size: extracts of ca. 10,000 words at a rate of two samples per decade and discipline, thus ca. 400,000 words in each subcorpus
Availability: Please see the info for each individual corpus
Subcorpora in the CC
Corpus of English Texts on Astronomy
The Corpus of English Texts on Astronomy (CETA) is part of the Coruña Corpus of English Scientific Writing (CC). As a specialised corpus, CETA has been compiled for the description of English Astronomy writing between 1700 and 1900, from a synchronic and diachronic perspective.
All text files in CETA are accompanied by a metadata file with extensive information about the text sampled and its author's sociolinguistic background. Samples can be filtered according to certain parameters contained in the metadata files.
Compilers
Isabel Moskowich, Inés Lareo, Gonzalo Camiña Rioboó and Begoña Crespo.
Research Assistants
Nuria Bello Piñón, María José Esteve Ramos, Marta González Orta, Irma González Souto, Emma Lezcano González, Paula Lojo Sandino.
Compilation dates
2003-2009
Release date
2012
Size
42 text samples (409,909 words)
How to cite CETA
Moskowich, Isabel; Inés Lareo, Gonzalo Camiña Rioboó and Begoña Crespo (comps.) 2012. Corpus of English Texts on Astronomy. A Coruña: Universidade da Coruña. https://doi.org/10.17979/spudc.9788497497084
Funding
2003-2006 and 2007-2010: Autonomous Government of Galicia. Secretary for Research and Development (grant numbers PGIDIT03PXIB10402PR and PGIDIT07PXIB104160PR).
2007-2008: Research network “English Language and Literature and Identity”. Autonomous Government of Galicia (grant number 2007/000145-0).
2008-2011: Spanish Ministry of Science and Innovation (MICINN) (grant number FFI2008-01649).
2009-2010: UDC funding for Consolidated Research Groups.
Corpus of English Philosophy Texts
The Corpus of English Philosophy Texts (CEPhiT) is the second part of the Coruña Corpus of English Scientific Writing (CC). CEPhiT has been compiled for the description of English Philosophical writing between 1700 and 1900, both from a synchronic and diachronic perspective. As is characteristic of the Coruña Corpus, each text file in CEPhiT is accompanied by a metadata file containing information about the text sampled and its author's sociolinguistic background. Metadata files are also used to select the texts to work with through the Coruña Corpus Tool (CCT).
Compilers
Isabel Moskowich, Gonzalo Camiña Rioboó, Inés Lareo and Begoña Crespo.
Research Assistants
Iria Bello Viruega, María José Esteve Ramos, Paula Lojo Sandino, Leida Maria Monaco, Ana Montoya Reyes, Luis Puente-Castelo, Leticia Regueiro Naya, Estefanía Sánchez Barreiro and Sofía Zea Álvarez.
Compilation dates
2007-2012
Release date
2016
Size
40 text samples (400,416 words)
How to cite CEPhiT
Moskowich, Isabel; Camiña Rioboó, Gonzalo; Lareo, Inés and Crespo, Begoña (comps.) 2016. Corpus of English Philosophy Texts. A Coruña: Universidade da Coruña. https://doi.org/10.17979/spudc.9788497497077
Funding
2008-2011: Spanish Ministry of Science and Technology (grant number FFI2008-01649)
2009-2011: UDC funding for Consolidated Research Groups
Corpus of History English Texts
The Corpus of History English Texts (CHET) is the third part of the Coruña Corpus of English Scientific Writing (CC). It has been compiled to represent English History writing in late Modern English (1700-1900), and it can be used to describe such a tradition both from a synchronic and diachronic perspective.
As is characteristic of the Coruña Corpus, each text file in CHET is accompanied by a metadata file which provides information about the text sampled and its author's sociolinguistic background. Metadata files can also be used to select the texts to work with through the Coruña Corpus Tool.
Compilers
Isabel Moskowich, Estefanía Sánchez-Barreiro, Inés Lareo and Paula Lojo Sandino.
Research Assistants
Anabella Barsaglini-Castro, Iria Bello, Gonzalo Camiña Rioboó, Iria Domínguez, Agnieszka Kozera, Emma Lezcano, Leida Maria Monaco, Luis Puente-Castelo.
Compilation dates
2010-2018
Release date
2019
Size
40 text samples (404,311 words)
How to cite CHET
Moskowich, Isabel; Lareo, Inés; Lojo Sandino, Paula and Sánchez-Barreiro, Estefanía (comps.) 2019. Corpus of History English Texts. A Coruña: Universidade da Coruña. https://doi.org/10.17979/spudc.9788497497091
Further references
Crespo, Begoña and Moskowich, Isabel. 2015. A Corpus of History Texts (CHET) as part of the Coruña Corpus Project. In Proceedings of the international scientific conference Corpus linguistics - 2015. St Petersburg: St Petersburg State University. 14-23.
Funding
2013-2019: Spanish Ministry of Economy and Competitiveness (MINECO), National Programme for Excellence in Scientific and Technical Research (grant numbers FFI2013-42215-P and FFI2016-75599-P).
2014-2015: Research network “English Language and Literature and Identity II”. Autonomous Government of Galicia (grant number R2014/043).
Corpus of English Life Sciences Texts
The Corpus of English Life Sciences Texts (CELiST) is a subcorpus of the Coruña Corpus of English Scientific Writing (CC). It aims to represent late Modern English (1700-1900) writing on Life Sciences (biology, entomology, zoology, botany and other disciplines), in order to describe this tradition from both a synchronic and a diachronic perspective. As with all the other subcorpora in the Coruña Corpus, each text file in CELiST is accompanied by a metadata file which provides information about the text sampled and its author's sociolinguistic background. Metadata files can also be used to select the texts to work with through the Coruña Corpus Tool.
Compilers
Inés Lareo, Leida Maria Monaco, María José Esteve-Ramos and Isabel Moskowich.
Research Assistants
Iria Bello Viruega, Paula Lojo Sandino, Luis Puente-Castelo and Estefanía Sánchez-Barreiro.
Compilation dates
2006-2016
Release date
2020
Size
40 text samples (400,305 words)
How to cite CELiST
Lareo, Inés; Monaco, Leida Maria; Esteve-Ramos, María-José and Moskowich, Isabel (comps.) 2020. Corpus of English Life Sciences Texts. A Coruña: Universidade da Coruña. https://doi.org/10.17979/spudc.9788497497848
Funding
2009-2014: Autonomous Government of Galicia/Xunta de Galicia.
2009-2011: Ministerio de Ciencia y Tecnología.
Corpus of English Chemistry Texts
The Corpus of English Chemistry Texts (CECheT) is the fifth subcorpus of the Coruña Corpus of English Scientific Writing (CC). It has been compiled to represent English Chemistry and Alchemy writing in late Modern English (1700-1900), and it can be used to describe this tradition from both a synchronic and a diachronic perspective.
As with all the other subcorpora in the Coruña Corpus, each text file in CECheT is accompanied by a metadata file which provides information about the text sampled and its author's sociolinguistic background. Metadata files can also be used to select the texts to work with through the Coruña Corpus Tool.
Compilers
Isabel Moskowich, Luis Puente-Castelo and Leida Maria Monaco.
Research Assistants
Anabella Barsaglini-Castro, Iria Bello, Gonzalo Camiña Rioboó, Begoña Crespo, María José Esteve, Inés Lareo, Margarita Mele, Estefanía Sánchez Barreiro.
Compilation dates
2013-2019
Release date
2022
Size
41 text samples (402,503 words)
How to cite CECheT
Moskowich, Isabel; Puente-Castelo, Luis and Monaco, Leida Maria (comps.) 2022. Corpus of English Chemistry Texts. A Coruña: Universidade da Coruña. https://doi.org/10.17979/spudc.9788497498388
Further references
Puente-Castelo, Luis & Leida Maria Monaco. 'it is proper subserviently, to inquire into the nature of experimental chemistry': Difficulties to harmonize disciplinary particularities and compilation criteria during the selection of samples for CECheT. EPiC Series in Language and Linguistics 1, 351-360.
Manual
Both the Manual for CECheT and the introduction to the corpus are available in the file you can download from the UDC repository.
Funding
2013-2019: Spanish Ministry of Economy and Competitiveness (MINECO), National Programme for Excellence in Scientific and Technical Research (grant numbers FFI2013-42215-P and FFI2016-75599-P).
2014-2015: Research network “English Language and Literature and Identity II”. Autonomous Government of Galicia (grant number R2014/043).
2020-2023: Ministry of Science, Innovation and Universities, National Programme for Challenges in Society (grant number PID2019-105226GB-I00).
Corpus of English Texts on Language
This section is being updated. New information will be published shortly.
The Corpus of English Texts on Language (CETeL) will be a subcorpus of the Coruña Corpus of English Scientific Writing (CC). Its compilation will try to represent English writing on Linguistics, Philology and Languages in late Modern English (1700-1900), and it will be used to describe this tradition both from a synchronic and a diachronic perspective.
As with all the other subcorpora in the Coruña Corpus, each text file in CETeL will be accompanied by a metadata file providing information about the text sampled and its author's sociolinguistic background. Metadata files can also be used to select the texts to work with through the Coruña Corpus Tool.
Funding
2017-2019: Spanish Ministry of Economy and Competitiveness (MINECO), National Programme for Excellence in Scientific and Technical Research (grant number FFI2016-75599-P)
2020-2023: Ministry of Science, Innovation and Universities, National Programme for Challenges in Society (grant number PID2019-105226GB-I00)
Corpus of English Texts on Physics
This section is being updated. New information will be published shortly.
The Corpus of English Texts on Physics (CETePh) will be the seventh subcorpus of the Coruña Corpus of English Scientific Writing (CC). Its samples will aim to represent late Modern English writing on different aspects of Physics from 1700 to 1900, and will thus include texts on mechanics, hydraulics, electricity, magnetism and other domains of study in the field. CETePh will be used to describe the evolution of the language typical of this discipline as well as variation within it. As with all the other subcorpora in the Coruña Corpus, each text file in CETePh will be accompanied by a metadata file providing information about the text sampled and its author's sociolinguistic background. Metadata files can also be used to select the texts to work with through the 2019 version of the Coruña Corpus Tool.
Compilation dates
2019-
Release date
Not yet finished.
Size
ca. 400,000 words
Further references
Moskowich, Isabel. 2016. "Philosophers and Scientists from the Modern Age: Compiling the Corpus of English Philosophy Texts (CEPhiT)". In Moskowich, Isabel; Camiña Rioboó, Gonzalo; Lareo, Inés and Crespo, Begoña (eds.), ‘The Conditioned and the Unconditioned’: Late Modern English Texts on Philosophy. Amsterdam: John Benjamins. 1-23.
Sources
Compilation in progress.
Funding
2020-2023: Ministry of Science, Innovation and Universities, National Programme for Challenges in Society (grant number PID2019-105226GB-I00).
Corpus of English Geography Texts
This section is being updated. New information will be published shortly.
The Corpus of English Geography Texts (CEGeT) will be the eighth subcorpus of the Coruña Corpus of English Scientific Writing (CC). The samples it will contain aim to represent faithfully late Modern English writing on what we would now label as Geography. Since the discipline was not well delimited, our texts (from 1700 to 1900) deal with a wide variety of topics and range from textbooks to travelogues in which travellers describe the physical characteristics of particular territories. Users should not expect to find texts on certain branches of the field, such as mathematical geography, political geography, cultural geography, climatology and others, as these constitute recent developments and are the result of a more detailed taxonomy. CEGeT will be used to describe the evolution of the language typical of this discipline as well as variation within it. As with all the other subcorpora in the Coruña Corpus, each text file in CEGeT will be accompanied by a metadata file providing information about the text sampled and its author's sociolinguistic background. Metadata files can also be used to select the texts to work with through the 2019 version of the Coruña Corpus Tool.
Compilation dates
2024-
Release date
Not yet finished.
Size
ca. 400,000 words
Sources
Compilation in progress.
Funding
2023-2026: Ministry of Science, Innovation and Universities, National Programme for Challenges in Society (grant number PID2022-136500NB-I00).
Tagged Coruña Corpus in CQPWeb
A POS-tagged version of the CC can be found in CQPWeb with minimal extratextual information. Due to the characteristics of CQPWeb, users may find that word counts differ considerably from those of the complete versions.
Remember that in most cases you need a CQPWeb account to access the tagged corpora.
You may also have to ask for permission to access these particular corpora.
Corpus of English Texts on Astronomy
This corpus is a beta version of CETA. The complete version is available either in open access or from John Benjamins.
Corpus of English Philosophy Texts
This corpus is a beta version of CEPhiT. The complete version is available either in open access or from John Benjamins.
Corpus of History English Texts
This corpus has been automatically tagged at Saarland University. Some mistakes may be found due to the peculiarities of the tagging process.
Corpus of English Life Sciences Texts
This corpus is a beta version of CELiST, which has been published in open access.
How to cite the Coruña Corpus
If you want to make a general reference to the Coruña Corpus in your work, please cite the following works:
Moskowich, Isabel and Crespo, Begoña. 2007. Presenting the Coruña Corpus: A Collection of Samples for the Historical Study of English Scientific Writing. In Pérez Guerra, Javier et al. (eds.) ‘Of Varying Language and Opposing Creed’: New Insights into Late Modern English. Bern: Peter Lang. 341–357.
Moskowich, Isabel & Parapar López, Javier. 2008. Writing Science, Compiling Science. The Coruña Corpus of English Scientific Writing. In Lorenzo Modia, María Jesús (ed.) Proceedings from the 31st AEDEAN Conference. A Coruña: Universidade da Coruña. 531–544.
The Coruña Corpus Tool was first released in 2012, when CETA came out (on CD-ROM). Since then, it has often been updated and improved. The two most relevant changes occurred in 2019 and 2020, in order to make the application lighter and able to deal with chemical formulae in the search option, as both CELiST and CECheT may require this utility. The version available in open access will always be the latest one.