


The Coruña Corpus of English Scientific Writing

The Coruña Corpus: A Collection of Samples for the Historical Study of English Scientific Writing is one of the projects currently being carried out at the University of A Coruña (Spain) by the Research Group for Multidimensional Corpus-based Studies in English (MuStE). The team is creating a corpus for the diachronic study of scientific discourse at most linguistic levels, thereby contributing to the study of the historical development of English for specific purposes. We also believe that the Coruña Corpus will be an excellent tool for studying the scientific register or style at particular moments in history: it will give researchers the chance to analyse how this "Specific English" behaves from a synchronic point of view.

The compilation of the Coruña Corpus has been, and still is, governed by some of the most common parameters used in Corpus Linguistics, namely external criteria for the delimitation of dates, sampling techniques, number of words per sample, etc.

Many pilot studies have already demonstrated the usefulness of the Coruña Corpus.

The UNESCO classification of Sciences has provided a starting point for text and discipline selection as we intend to include texts from all or most of these categories:

Fields of Science and Technology (International Standardisation of Statistics on Science and Technology, UNESCO 1978).

I. Natural Sciences.
Astronomy, bacteriology, biochemistry, biology, botany, chemistry, entomology, geology, geophysics, mathematics, meteorology, mineralogy, computing, physical geography, physics, zoology and other allied subjects.

II. Engineering and Technology.
Engineering sciences such as chemical, civil, electrical and mechanical engineering and their specialised subdivisions; forest products; applied sciences such as geodesy, industrial chemistry, etc.; architecture; the science and technology of food production; specialised technologies or interdisciplinary fields, e.g. systems analysis, metallurgy, mining, textile technology and other allied subjects.

III. Medical Sciences.
Anatomy, stomatology, basic medicine, paediatrics, obstetrics, optometry, osteopathy, pharmacy, physiotherapy, public health services, technical health assistance and other allied subjects.

IV. Agricultural Sciences.
Agronomy, zootechnics, fisheries, forestry, horticulture, veterinary medicine and other allied subjects.

V. Social Sciences.
Anthropology (social and cultural) and ethnology, demography, geography (human, economic and social), law, linguistics, management, political sciences, psychology, sociology, organisation and methods, miscellaneous social sciences and interdisciplinary, methodological and historical S&T activities relating to subjects in this group.

Physical anthropology, physical geography and psychophysiology should normally be classified with the natural sciences.

VI. Humanities.
Arts (history of art and art criticism, excluding artistic "research"), ancient and modern languages and literatures, philosophy (including the history of science and technology), prehistory and history, together with auxiliary historical disciplines such as archaeology, numismatics, palaeography, genealogy, etc., religion, other subjects and humanistic branches as well as other methodological and historical S&T activities relating to the subjects in this group.


We have just finished compiling CETA (Corpus of English Texts on Astronomy), in which we have gathered samples of ca. 10,000 words from the eighteenth and nineteenth centuries. We have also worked on Philosophy, to compile CEPhiT (Corpus of English Philosophy Texts), and on Life Sciences, for CELiST (Corpus of English Life Sciences Texts).

We are currently compiling CHET (Corpus of Historical English Texts) and CECheT (Corpus of English Chemistry Texts).

All corpora in the Coruña Corpus (CC) share a common structure and mark-up to facilitate contrastive studies, and all their text files are accompanied by metadata files containing information about the author and the text itself.

The Coruña Corpus Tool

The Coruña Corpus Tool (CCT) has been developed by the Information Retrieval Lab in collaboration with the MuStE Group of the University of A Coruña. The application arose from the MuStE Group's need for a system to manage and exploit its linguistic corpus. Its objective is to help linguists extract and condense information that is valuable for their research. However, the application is not tied to the Coruña Corpus: it supports any XML-formatted corpus and could therefore be widely used.

A non-exhaustive list of CCT functionalities:
a) Linguistic corpus management, covering not only the documents as text but also author information and styled document rendering.




b) Processing and validation of TEI-encoded documents, with support for non-standard characters. The tool reports format errors so that linguists can correct them.




c) Basic single-term search within a document and across the collection.

d) Concordance generation (key word in context) for every occurrence of a term, with its location in the document.

e) Prefix, suffix and regular-expression search, which is very useful for linguistic work.

f) Phrase search with a specified distance between terms, in order to find linguistic structures.




g) Generation of type and token lists at document and collection level, to allow statistical study of term occurrences.
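
To make some of the functions listed above more concrete, the following Python sketch shows one simple way to produce a key-word-in-context concordance, a prefix/suffix/regular-expression search, a proximity search and a type and token list. It is only an illustration and is not the CCT's own code: the function names (tokenize, kwic, near, type_token_list), the naive tokenisation rule and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a deliberately naive rule (runs of a-z, optional apostrophe part).
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def kwic(tokens, pattern, width=3):
    """Key word in context: return (position, left, keyword, right) for every
    token that fully matches the regular expression `pattern`.

    A prefix search is a pattern such as r"observ.*",
    a suffix search is a pattern such as r".*tions".
    """
    rx = re.compile(pattern)
    hits = []
    for i, tok in enumerate(tokens):
        if rx.fullmatch(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((i, left, tok, right))
    return hits


def near(tokens, first, second, max_distance):
    """Return (i, j) position pairs where `second` follows `first`
    within at most `max_distance` tokens."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == first:
            window = tokens[i + 1:i + 1 + max_distance]
            for offset, other in enumerate(window, start=1):
                if other == second:
                    hits.append((i, i + offset))
    return hits


def type_token_list(tokens):
    """Return (type, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()


# Hypothetical sample sentence, invented for this illustration only.
sample = ("The comet was observed near the sun. "
          "The observations of the comet were repeated.")
tokens = tokenize(sample)

for position, left, keyword, right in kwic(tokens, r"observ.*", width=2):
    print(f"{position:>3}  {left:>12} [{keyword}] {right}")

print(near(tokens, "the", "comet", max_distance=2))
print(len(tokens), "tokens,", len(set(tokens)), "types")
print(type_token_list(tokens)[:3])
```

A real corpus tool would work on the text extracted from the XML files rather than on a plain string, and would normally use an index instead of scanning the token list on every query; the linear scan is kept here only because it makes each operation easy to follow.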






