Updated by Anabella Barsaglini-Castro

Coruña Corpus

The Coruña Corpus of English Scientific Writing

The Coruña Corpus of English Scientific Writing is one of the projects currently being carried out at the University of A Coruña (Spain) by the Research Group for Multidimensional Corpus-based Studies in English (MuStE). The team is in the process of creating a corpus that can be used for the diachronic study of scientific discourse at most linguistic levels and thereby contribute to the study of the historical development of English for specific purposes. At the same time, we believe that the Coruña Corpus is an excellent tool for the study of the scientific register/style at particular moments in history: it offers the researcher the chance to analyse how this "Specific English" behaves from a synchronic point of view.

The compilation of the Coruña Corpus has been and is still governed by some of the most common parameters used in Corpus Linguistics, namely, external criteria for the delimitation of dates, sampling techniques, etc. Additionally, we also resort to some criteria of our own.

The UNESCO classification of the fields of Science and Technology (International Standardisation of Statistics on Science and Technology, UNESCO 1978, 1988) has provided a starting point for text and discipline selection as we intend to include texts from all or most of these categories.

The Coruña Corpus of English Scientific Writing (CC) is, therefore, a specialised corpus covering several different scientific disciplines. It is divided into subcorpora according to domain or discipline. Since the start of the project in 2004, it has been designed to contain 10,000-word samples (at a rate of two per decade and discipline) of scientific works that were published between 1700 and 1900 and written directly in English by English-speaking authors.
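The sampling design described above implies a target size for each subcorpus, which can be checked with a short calculation. The following Python sketch is only an illustration: the function and parameter names are hypothetical and are not part of the Coruña Corpus Tool.

```python
def target_size(start_year=1700, end_year=1900,
                samples_per_decade=2, words_per_sample=10_000):
    """Return (number of samples, approximate word count) for one discipline."""
    decades = (end_year - start_year) // 10   # 20 decades between 1700 and 1900
    samples = decades * samples_per_decade    # two samples per decade
    return samples, samples * words_per_sample

samples, words = target_size()
print(samples, words)  # 40 400000
```

Released subcorpora come close to these figures; small deviations are to be expected because samples are only approximately 10,000 words long.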

In order to avoid repetitions of patterns caused by idiosyncrasies of authors, one of the principles of the CC is to include only one sample per author in the whole corpus, even when some of them were certainly prolific in different fields of knowledge.
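As an illustration of the one-sample-per-author principle, the sketch below flags authors who appear more than once in a list of candidate samples. The record layout and the author names are invented placeholders, not actual CC data or tooling.

```python
from collections import Counter

def repeated_authors(samples):
    """Return, sorted by name, the authors that occur in more than one sample."""
    counts = Counter(author for author, _title in samples)
    return sorted(author for author, n in counts.items() if n > 1)

# Invented placeholder records: (author, title)
candidates = [
    ("Author A", "Text 1"),
    ("Author B", "Text 2"),
    ("Author A", "Text 3"),
]
print(repeated_authors(candidates))  # ['Author A']
```

An empty result would mean that the candidate list respects the principle across the whole corpus.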

All the subcorpora in the CC share the same compilation principles, structure and mark-up and have been edited in XML following TEI conventions to facilitate contrastive studies. Similarly, they are all formed by a multi-field indexed textual repository which is used by the information retrieval platform accompanying them (Coruña Corpus Tool, CCT). This index allows searches using different criteria contained not only in the samples but also in the metadata files with information both about authors and texts.
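The idea of selecting samples through their metadata can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the element names (record, discipline, year, sex) and file names are invented for the example, and the actual CC metadata schema and the internals of the CCT may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata records; the real CC metadata schema may differ.
METADATA = """
<records>
  <record file="sample01.xml"><discipline>Astronomy</discipline><year>1705</year><sex>male</sex></record>
  <record file="sample02.xml"><discipline>Astronomy</discipline><year>1852</year><sex>female</sex></record>
  <record file="sample03.xml"><discipline>Philosophy</discipline><year>1790</year><sex>male</sex></record>
</records>
"""

def select_files(xml_text, **criteria):
    """Return the file names of records whose child elements match all criteria."""
    root = ET.fromstring(xml_text)
    selected = []
    for record in root.findall("record"):
        # findtext returns the text of the first matching child, or None
        if all(record.findtext(field) == value for field, value in criteria.items()):
            selected.append(record.get("file"))
    return selected

print(select_files(METADATA, discipline="Astronomy", sex="female"))  # ['sample02.xml']
```

Keeping metadata in separate structured records, as the CC does, is what makes this kind of filtering possible without touching the text samples themselves.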

Compilation has always followed the same sampling criteria (see Crespo and Moskowich, 2010, CETA in the Context of the Coruña Corpus, Literary and Linguistic Computing, 25/2: 153-164). These criteria include preserving and representing certain special characters and symbols, which can be seen when using the CCT.

Subcorpora in the Coruña Corpus

CETA (Corpus of English Texts on Astronomy) came out in 2012, accompanied by a book containing both works on methodological aspects and some pilot studies.

The team has also been working on the subject matter of Philosophy to compile CEPhiT (Corpus of English Philosophy Texts) whose compilation is detailed here and which was released in March 2016 together with a book.

Since 2019, both CETA and CEPhiT, as well as the subsequent subcorpora (CHET, Corpus of History English Texts; CELiST, Corpus of English Life Sciences Texts; and CECheT, Corpus of English Chemistry Texts, so far), have been published in open access.

We are currently finishing CETeL (Corpus of English Texts on Language) and working on CETePh (Corpus of English Texts on Physics), as well as starting the search for suitable samples relating to geography for compilation in the near future.


The Coruña Corpus of English Scientific Writing (Coruña Corpus or CC)

  • Compilers: Please see each individual corpus
  • Project Director: Isabel Moskowich
  • Period: 1700-1900
  • Size: extracts of ca. 10,000 words at a rate of two samples per decade and discipline, thus ca. 400,000 words in each subcorpus
  • Availability: Please see the info for each individual corpus

Subcorpora in the CC

CETA in open access

Corpus of English Texts on Astronomy

The Corpus of English Texts on Astronomy (CETA) is part of the Coruña Corpus of English Scientific Writing (CC). As a specialised corpus, CETA has been compiled for the description of English Astronomy writing between 1700 and 1900, from a synchronic and diachronic perspective. All text files in CETA are accompanied by a metadata file with extensive information about the text sampled and its author's sociolinguistic background. Samples can be filtered according to certain parameters contained in the metadata files.

Compilers

Isabel Moskowich, Inés Lareo, Gonzalo Camiña Rioboó and Begoña Crespo.

Research Assistants

Nuria Bello Piñón, María José Esteve Ramos, Marta González Orta, Irma González Souto, Emma Lezcano González, Paula Lojo Sandino.

Compilation dates

2003-2009

Release date

2012

Size

42 text samples (409,909 words)

How to cite CETA

Moskowich, Isabel; Inés Lareo, Gonzalo Camiña Rioboó and Begoña Crespo (comps.) 2012. Corpus of English Texts on Astronomy. A Coruña: Universidade da Coruña.
https://doi.org/10.17979/spudc.9788497497084

Further references

Manual

On CD, including the Coruña Corpus Tool (CCT), with some special characteristics. The manual is also contained in the open-access corpus.

Sources

Please click HERE

CETA in open access

Funding

  • 2003-2006 and 2007-2010: Autonomous Government of Galicia. Secretary for Research and Development (grant numbers PGIDIT03PXIB10402PR and PGIDIT07PXIB104160PR).
  • 2007-2008: Research network “English Language and Literature and Identity”. Autonomous Government of Galicia (grant number 2007/000145-0).
  • 2008-2011: Spanish Ministry of Science and Innovation (MICINN) (grant number FFI2008-01649).
  • 2009-2010: UDC funding for Consolidated Research Groups.

CEPhiT in open access

Corpus of English Philosophy Texts

The Corpus of English Philosophy Texts (CEPhiT) is the second part of the Coruña Corpus of English Scientific Writing (CC). CEPhiT has been compiled for the description of English Philosophical writing between 1700 and 1900, both from a synchronic and diachronic perspective. As is characteristic of the Coruña Corpus, each text file in CEPhiT is accompanied by a metadata file containing information about the text sampled and its author's sociolinguistic background. Metadata files are also used to select the texts to work with through the Coruña Corpus Tool (CCT).

Compilers

Isabel Moskowich, Gonzalo Camiña Rioboó, Inés Lareo and Begoña Crespo.

Research Assistants

Iria Bello Viruega, María José Esteve Ramos, Paula Lojo Sandino, Leida Maria Monaco, Ana Montoya Reyes, Luis Puente-Castelo, Leticia Regueiro Naya, Estefanía Sánchez Barreiro and Sofía Zea Álvarez.

Compilation dates

2007-2012

Release date

2016

Size

40 text samples (400,416 words)

How to cite CEPhiT

Moskowich, Isabel; Camiña Rioboó, Gonzalo; Lareo, Inés and Crespo, Begoña (comps.) 2016. Corpus of English Philosophy Texts. A Coruña: Universidade da Coruña.
https://doi.org/10.17979/spudc.9788497497077

Further references

Manual

On CD, including the Coruña Corpus Tool (CCT), with some special characteristics. The manual is also contained in the open-access corpus.

Sources

Please click HERE

CEPhiT in open access

Funding

  • 2008-2011: Spanish Ministry of Science and Technology (grant number FFI2008-01649)
  • 2009-2011: UDC funding for Consolidated Research Groups

CHET in open access

Corpus of History English Texts

The Corpus of History English Texts (CHET) is the third part of the Coruña Corpus of English Scientific Writing (CC). It has been compiled to represent English History writing in late Modern English (1700-1900), and it can be used to describe this tradition from both a synchronic and a diachronic perspective. As is characteristic of the Coruña Corpus, each text file in CHET is accompanied by a metadata file which provides information about the text sampled and its author's sociolinguistic background. Metadata files can also be used to select the texts to work with through the Coruña Corpus Tool.

Compilers

Isabel Moskowich, Estefanía Sánchez-Barreiro, Inés Lareo and Paula Lojo Sandino.

Research Assistants

Anabella Barsaglini-Castro, Iria Bello, Gonzalo Camiña Rioboó, Iria Domínguez, Agnieszka Kozera, Emma Lezcano, Leida Maria Monaco, Luis Puente-Castelo.

Compilation dates

2010-2018

Release date

2019

Size

40 text samples (404,311 words)

How to cite CHET

Moskowich, Isabel; Lareo, Inés; Lojo Sandino, Paula and Sánchez-Barreiro, Estefanía (comps.) 2019. Corpus of History English Texts. A Coruña: Universidade da Coruña.
https://doi.org/10.17979/spudc.9788497497091

Further references

Manual

Both the Manual for CHET and the introduction to the corpus are available in the file you can download from the UDC repository.

Sources

Please click HERE

CHET in open access

Funding

  • 2013-2019: Spanish Ministry of Economy and Competitiveness (MINECO), National Programme for Excellence in Scientific and Technical Research (grant numbers FFI2013-42215-P and FFI2016-75599-P).
  • 2014-2015: Research network “English Language and Literature and Identity II”. Autonomous Government of Galicia (grant number R2014/043).

CELiST in open access

Corpus of English Life Sciences Texts

The Corpus of English Life Sciences Texts (CELiST) is a subcorpus of the Coruña Corpus of English Scientific Writing (CC). It has been compiled to represent late Modern English (1700-1900) writing on Life Sciences (biology, entomology, zoology, botany and other disciplines), in order to describe this tradition from both a synchronic and a diachronic perspective. As with all the other subcorpora in the Coruña Corpus, each text file in CELiST is accompanied by a metadata file which provides information about the text sampled and its author's sociolinguistic background. Metadata files can also be used to select the texts to work with through the Coruña Corpus Tool.

Compilers

Inés Lareo, Leida Maria Monaco, María José Esteve-Ramos and Isabel Moskowich.

Research Assistants

Iria Bello Viruega, Paula Lojo Sandino, Luis Puente-Castelo and Estefanía Sánchez-Barreiro.

Compilation dates

2006-2016

Release date

2020

Size

40 text samples (400,305 words)

How to cite CELiST

Lareo, Inés; Monaco, Leida Maria; Esteve-Ramos, María-José and Moskowich, Isabel (comps.) 2020. Corpus of English Life Sciences Texts. A Coruña: Universidade da Coruña.
https://doi.org/10.17979/spudc.9788497497848

Further references

Manual

Both the Manual for CELiST and the introduction to the corpus are available in the file you can download from the UDC repository.

Sources

Please click HERE

CELiST in open access

Funding

  • 2006-2009: Provincial Government of A Coruña.
  • 2009-2014: Autonomous Government of Galicia/Xunta de Galicia.
  • 2009-2011: Ministerio de Ciencia y Tecnología.

CECheT in open access

Corpus of English Chemistry Texts

The Corpus of English Chemistry Texts (CECheT) is the fifth subcorpus of the Coruña Corpus of English Scientific Writing (CC). It has been compiled to represent English Chemistry and Alchemy writing in late Modern English (1700-1900), and it can be used to describe this tradition from both a synchronic and a diachronic perspective. As with all the other subcorpora in the Coruña Corpus, each text file in CECheT is accompanied by a metadata file which provides information about the text sampled and its author's sociolinguistic background. Metadata files can also be used to select the texts to work with through the Coruña Corpus Tool.

Compilers

Isabel Moskowich, Luis Puente-Castelo and Leida Maria Monaco.

Research Assistants

Anabella Barsaglini-Castro, Iria Bello, Gonzalo Camiña Rioboó, Begoña Crespo, María José Esteve, Inés Lareo, Margarita Mele, Estefanía Sánchez Barreiro.

Compilation dates

2013-2019

Release date

2022

Size

41 text samples (402,503 words)

How to cite CECheT

Moskowich, Isabel; Puente-Castelo, Luis and Monaco, Leida Maria (comps.) 2022. Corpus of English Chemistry Texts. A Coruña: Universidade da Coruña.
https://doi.org/10.17979/spudc.9788497498388

Further references

  • Puente-Castelo, Luis & Leida Maria Monaco. 'it is proper subserviently, to inquire into the nature of experimental chemistry': Difficulties to harmonize disciplinary particularities and compilation criteria during the selection of samples for CECheT. EPiC Series in Language and Linguistics 1, 351-360.

Manual

Both the Manual for CECheT and the introduction to the corpus are available in the file you can download from the UDC repository.

Sources

Please click HERE

CECheT in open access

Funding

  • 2013-2019: Spanish Ministry of Economy and Competitiveness (MINECO), National Programme for Excellence in Scientific and Technical Research (grant numbers FFI2013-42215-P and FFI2016-75599-P).
  • 2014-2015: Research network “English Language and Literature and Identity II”. Autonomous Government of Galicia (grant number R2014/043).
  • 2020-2023: Ministry of Science, Innovation and Universities, National Programme for Challenges in Society (grant number PID2019-105226GB-I00).

Corpus of English Texts on Language

This section is being updated. New information will be published shortly.

The Corpus of English Texts on Language (CETeL) will be a subcorpus of the Coruña Corpus of English Scientific Writing (CC). It is being compiled to represent English writing on Linguistics, Philology and Languages in late Modern English (1700-1900), and it will be used to describe this tradition from both a synchronic and a diachronic perspective. As with all the other subcorpora in the Coruña Corpus, each text file in CETeL will be accompanied by a metadata file providing information about the text sampled and its author's sociolinguistic background. Metadata files can also be used to select the texts to work with through the Coruña Corpus Tool.

Compilation dates

2017-

Release date

Not yet finished.

Size

(ca. 400,000 words)

Further references

Sources

Compilation in progress

Funding

  • 2017-2019: Spanish Ministry of Economy and Competitiveness (MINECO), National Programme for Excellence in Scientific and Technical Research (grant number FFI2016-75599-P)
  • 2020-2023: Ministry of Science, Innovation and Universities, National Programme for Challenges in Society (grant number PID2019-105226GB-I00)

Corpus of English Texts on Physics

This section is being updated. New information will be published shortly.

The Corpus of English Texts on Physics (CETePh) will be the seventh subcorpus of the Coruña Corpus of English Scientific Writing (CC). Its samples are intended to represent late Modern English writing on different aspects of Physics from 1700 to 1900, and thus include texts on mechanics, hydraulics, electricity, magnetism and other domains of study in the field. CETePh will be used to describe the evolution of the language typical of this discipline as well as variation within it. As with all the other subcorpora in the Coruña Corpus, each text file in CETePh will be accompanied by a metadata file providing information about the text sampled and its author's sociolinguistic background. Metadata files can also be used to select the texts to work with through the 2019 version of the Coruña Corpus Tool.

Compilation dates

2019-

Release date

Not yet finished.

Size

(ca. 400,000 words)

Further references

  • Moskowich, Isabel. 2016. "Philosophers and Scientists from the Modern Age: Compiling the Corpus of English Philosophy Texts (CEPhiT)". In Moskowich, Isabel; Camiña Rioboó, Gonzalo; Lareo, Inés and Crespo, Begoña (eds.), ‘The Conditioned and the Unconditioned’: Late Modern English Texts on Philosophy. Amsterdam: John Benjamins. 1-23.

Sources

Compilation in progress.

Funding

  • 2020-2023: Ministry of Science, Innovation and Universities, National Programme for Challenges in Society (grant number PID2019-105226GB-I00).

Corpus of English Geography Texts

This section is being updated. New information will be published shortly.

The Corpus of English Geography Texts (CEGeT) will be the eighth subcorpus of the Coruña Corpus of English Scientific Writing (CC). Its samples aim to represent faithfully late Modern English writing on what we would now label as Geography. Since the discipline was not well delimited, our texts (from 1700 to 1900) deal with a wide variety of topics and range from textbooks to travelogues in which travellers describe the physical characteristics of particular territories. Users should not expect to find texts on certain branches of the field, such as mathematical geography, political geography, cultural geography, climatology and others, as these constitute recent developments and are the result of a more detailed taxonomy. CEGeT will be used to describe the evolution of the language typical of this discipline as well as variation within it. As with all the other subcorpora in the Coruña Corpus, each text file in CEGeT will be accompanied by a metadata file providing information about the text sampled and its author's sociolinguistic background. Metadata files can also be used to select the texts to work with through the 2019 version of the Coruña Corpus Tool.

Compilation dates

2024-

Release date

Not yet finished.

Size

(ca. 400,000 words)

Sources

Compilation in progress.

Funding

  • 2023-2026: Ministry of Science, Innovation and Universities, National Programme for Challenges in Society (grant number PID2022-136500NB-I00).

Tagged Coruña Corpus in CQPWeb

A POS-tagged version of the CC can be found in CQPWeb with minimal extratextual information. Due to the characteristics of CQPWeb, users may also find that word counts differ considerably from those of the complete versions.

Remember that in most cases you must have a CQPWeb account to gain access. If you do not know how to create an account, click here.

You may also have to ask for permission to access these particular corpora.

CQPWeb at Lancaster University

Corpus of English Texts on Astronomy

This corpus is a beta version of CETA. The complete version is available either in open access or from John Benjamins.

Corpus of English Philosophy Texts

This corpus is a beta version of CEPhiT. The complete version is available either in open access or from John Benjamins.

Corpus of History English Texts

This corpus has been automatically tagged at Saarland University. Some mistakes may be found due to the peculiarities of the tagging process.

Corpus of English Life Sciences Texts

This corpus is a beta version of CELiST, published in open access here.

How to cite the Coruña Corpus

If you want to make a general reference to the Coruña Corpus in your work, please cite the following works:

  • Moskowich, Isabel and Crespo, Begoña. 2007. Presenting the Coruña Corpus: A Collection of Samples for the Historical Study of English Scientific Writing. In Pérez Guerra, Javier et al. (eds.) ‘Of Varying Language and Opposing Creed’: New Insights into Late Modern English. Bern: Peter Lang. 341–357.
  • Moskowich, Isabel & Parapar López, Javier. 2008. Writing Science, Compiling Science. The Coruña Corpus of English Scientific Writing. In Lorenzo Modia, María Jesús (ed.) Proceedings from the 31st AEDEAN Conference. A Coruña: Universidade da Coruña. 531–544.
  • Crespo, Begoña & Isabel Moskowich. 2010. “CETA in the Context of the Coruña Corpus”. Literary and Linguistic Computing, 25(2): 153–164.
  • Crespo, Begoña & Isabel Moskowich. 2020. Astronomy, Philosophy, Life Sciences and History Texts: Setting the Scene for the Study of Modern Scientific Writing. English Studies.

CCT latest version

The Coruña Corpus Tool was first released in 2012, when CETA came out (CD-ROM). Since then, it has often been updated and improved. The two most relevant changes were made in 2019 and 2020 in order to make the application lighter and able to deal with chemical formulae in the search option, as both CELiST and CECheT may require this utility. The version to be found in open access will always be the latest one.

For reference to the CCT

Barsaglini-Castro, Anabella and Valcarce, Daniel. 2020. The Coruña Corpus Tool: Ten Years On. Revista de Procesamiento del Lenguaje Natural, 64: 13-19.

Parapar López, Javier and Moskowich, Isabel. 2007. The Coruña Corpus Tool. Revista de Procesamiento del Lenguaje Natural, 39: 289–290.