Two corpora were processed:
- Biodiversa
- WOS (Web of Science)
2.2.1.1. Biodiversa
I preprocessed the corpus to extract the abstracts from the Projects.csv file of June 2013 and to translate into English the titles and abstracts of the projects written in other languages. This CSV file contains 6508 project abstracts, most of which are in English or French: 5851 projects are in English, 539 in French, 99 in Swedish, 5 in German, 5 in Norwegian, 3 in Spanish, 2 in Portuguese and 1 in Vietnamese. 846 projects have no abstract or an unusable one (such as « See lead proposal. », « SUMMARY », « see main form » or « BLA BLA BLA »). I also found that some projects share the same abstract.
My script takes the CSV file path as a parameter, extracts the titles and abstracts of the projects, translates them into English and applies further preprocessing steps such as:
- converting the text to lowercase
- removing HTML tags, unnecessary spaces, brackets and line breaks
- replacing « / » with « , », « & » with « and », « ; » with « : » and tabs with spaces
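The cleaning steps listed above can be sketched as follows. This is a minimal Python illustration, not the R script itself: the function name `normalize_abstract` is hypothetical, "brackets" is assumed to mean the characters (), [] and {} (their content is kept), and « & » is padded with spaces so that neighbouring words do not merge.

```python
import re


def normalize_abstract(text: str) -> str:
    """Apply the cleaning steps from the list above to one abstract."""
    text = text.lower()
    # Remove HTML tags first, so that the "/" of closing tags is not rewritten.
    text = re.sub(r"<[^>]+>", " ", text)
    # Assumption: only the bracket characters are dropped, not their content.
    text = re.sub(r"[()\[\]{}]", " ", text)
    text = text.replace("/", ",").replace(";", ":")
    # Assumption: padded with spaces so that "a&b" becomes "a and b".
    text = text.replace("&", " and ")
    # Tabs, line breaks and repeated spaces all collapse to a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

For instance, `normalize_abstract("<p>Forests\t& Rivers;\n(pilot) study/survey</p>")` returns `"forests and rivers: pilot study,survey"`.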
Next, it creates a directory in which each file corresponds to one project and contains that project's abstract. Each txt file is named after the title and the identifier of the corresponding project. If the title is too long to be used as a file name, only its first 200 characters are kept.
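A possible way to build such file names is sketched below in Python. The 200-character limit comes from the text; everything else is an assumption made for illustration: the function names, the underscore between title and identifier, and the replacement of characters that are unsafe in file names.

```python
import re
from pathlib import Path

MAX_TITLE_CHARS = 200  # limit stated in the text


def abstract_filename(title: str, project_id: str) -> str:
    """Build '<title>_<id>.txt' (separator and sanitizing are assumptions)."""
    safe_title = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    return f"{safe_title[:MAX_TITLE_CHARS]}_{project_id}.txt"


def write_abstract(out_dir: Path, title: str, project_id: str, abstract: str) -> Path:
    """Write one abstract to its own file and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / abstract_filename(title, project_id)
    path.write_text(abstract, encoding="utf-8")
    return path
```

Appending the identifier keeps file names distinct even when two truncated titles are identical.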
The script is run as follows: ./pretraitement.R file.csv
The script produces a directory containing 6507 files and displays the number of abstracts that cannot be analysed with Unitex, that is, the number of projects that have no abstract or whose abstract is shorter than 20 characters.
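The usability check described above amounts to a simple length test. The Python sketch below uses the 20-character threshold from the text; the function names are hypothetical, and ignoring surrounding whitespace is an assumption.

```python
from typing import Iterable, Optional

MIN_ABSTRACT_CHARS = 20  # threshold stated in the text


def is_usable(abstract: Optional[str]) -> bool:
    """True if the abstract exists and has at least 20 characters."""
    return abstract is not None and len(abstract.strip()) >= MIN_ABSTRACT_CHARS


def count_unusable(abstracts: Iterable[Optional[str]]) -> int:
    """Number of abstracts that are missing or too short."""
    return sum(1 for abstract in abstracts if not is_usable(abstract))
```

Placeholder texts such as « See lead proposal. » (18 characters) fall below this threshold.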
At first, I used Perl to parse the CSV file line by line. I then noticed that the file was poorly formatted for this approach: some project abstracts contain line breaks, which led to several extraction errors:
- no abstract extracted for a project that has one in the CSV file
- incomplete abstracts extracted
- extracted titles that do not correspond to any project title
To solve this problem, I switched languages and used R, because its read.csv function takes into account not only line breaks and field separators but also quotation marks.
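The difference between the two approaches can be reproduced on a small example. The sketch below uses Python's csv module as a stand-in for R's read.csv; the sample data and function names are invented for illustration.

```python
import csv
import io

# Invented sample: the abstract of project 1 contains a line break inside quotes.
SAMPLE = (
    "id,title,abstract\n"
    '1,"Soil study","First line.\nSecond line."\n'
    '2,"Bees","Short."\n'
)


def parse_line_by_line(raw: str) -> list:
    """Naive parsing: every physical line is treated as one record."""
    return [line.split(",") for line in raw.splitlines()]


def parse_with_quotes(raw: str) -> list:
    """Quote-aware parsing: a quoted field may span several lines."""
    return list(csv.reader(io.StringIO(raw, newline="")))
```

Here the naive parser returns four rows, with the abstract of project 1 cut in two, whereas the quote-aware parser returns the expected three rows (header plus two projects).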
For the translation, I used a command-line translator. It can detect the language of a text and translate it into English. It is called as follows:
- « trans -id text » for language detection
- « trans {=en} -b text » for translation into English
The CSV fields concerned by the translation are the title, abstract and keywords of each project. My translation function puts quotation marks around the text so that it is not translated word by word. It then calls the language detection function and translates the text only if it is not in English. Because the abstracts are too long to be translated in one piece by the translator, they are split into several parts, which are translated separately and then reassembled.
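The split, translate and reassemble logic could look like the Python sketch below. It is an illustration under several assumptions: the 1000-character limit is hypothetical (the text gives no figure), splitting on sentence boundaries is one possible choice, and language detection is left to a caller-supplied function because the output format of « trans -id » is not described here. Only the « trans {=en} -b » command is taken from the text.

```python
import re
import subprocess
from typing import Callable, List

MAX_CHARS = 1000  # hypothetical limit; the text does not give one


def split_into_chunks(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as one oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def run_trans(chunk: str) -> str:
    """Translate one chunk with the command quoted in the text."""
    # The chunk is passed as a single argument, so it is not split into words.
    result = subprocess.run(
        ["trans", "{=en}", "-b", chunk],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def translate_if_needed(
    text: str,
    is_english: Callable[[str], bool],
    translate: Callable[[str], str] = run_trans,
    max_chars: int = MAX_CHARS,
) -> str:
    """Return the text unchanged if it is English, else translate it by parts."""
    if is_english(text):
        return text
    return " ".join(translate(chunk) for chunk in split_into_chunks(text, max_chars))
```

Passing the translator as a parameter makes it possible to try the chunking with a dummy function, without calling the external tool.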
2.2.1.2. WOS
The WOS corpus was extracted with the following query:
This query retrieves the projects matching Elise Tancoigne's query, excluding those that do not relate to temperate agriculture.
We saved the project data in ISI format. The ISI format has the following fields:
FN | File type |
VR | File format version number |
PT | Publication type (e.g., book, journal, book in series) |
AU | Author(s) |
TI | Article title |
SO | Full source title |
LA | Language |
DT | Document type |
NR | Cited reference count |
SN | ISSN |
PU | Publisher |
C1 | Research addresses |
DE | Author keywords |
ID | KeyWords Plus |
AB | Abstract |
CR | Cited references |
TC | Times cited |
BP | Beginning page |
EP | Ending page |
PG | Page count |
JI | ISO source title abbreviation |
SE | Book series title |
BS | Book series subtitle |
PY | Publication year |
PD | Publication date |
VL | Volume |
IS | Issue |
PN | Part number |
SU | Supplement |
SI | Special issue |
GA | ISI document delivery number |
PI | Publisher city |
WP | Publisher web address |
RP | Reprint address |
CP | Cited patent |
J9 | 29-character source title abbreviation |
PA | Publisher address |
UT | ISI unique article identifier |
ER | End of record |
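To illustrate how these tags are laid out in an export file, here is a minimal Python reader. It is not the Sci2 conversion and rests on assumptions about the layout: each field starts with its two-letter tag followed by a space, continuation lines are indented, and FN and VR describe the file rather than a record. The function name and the sample records are invented.

```python
from typing import Dict, List

# Invented sample in the tagged layout described above.
SAMPLE = """FN Example Export
VR 1.0
PT J
AU Doe, J
   Roe, R
TI A hypothetical title
   continued on a second line
AB Example abstract.
ER

PT J
AU Poe, E
TI Another title
ER
"""


def parse_isi(text: str) -> List[Dict[str, List[str]]]:
    """Return one dict per record, mapping each tag to its list of lines."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("  "):  # continuation of the previous field
            if tag is not None:
                current[tag].append(line.strip())
            continue
        tag, value = line[:2], line[3:].strip()
        if tag == "ER":  # end of record
            records.append(current)
            current, tag = {}, None
        elif tag in ("FN", "VR"):  # file-level tags, skipped (assumption)
            tag = None
        else:
            current.setdefault(tag, []).append(value)
    return records
```

A record without an AB tag, like the second one in the sample, corresponds to a project with no abstract.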
The Sci2 software was used to convert the data exported from WOS into CSV format.
We then applied to this corpus the same preprocessing script as for the Biodiversa corpus, without the translation step.
We found that 58 projects have no abstract.