The SiBol/Port Corpus Linguistics Project

The SiBol/Port project (originally SiBol) was set up in 2005 by a group of English linguistics researchers from the Universities of Siena and Bologna in Italy; “SiBol” is a portmanteau of the two university names. The founding members were Alison Duguid, Anna Marchi, John Morley and Charlotte Taylor (Siena) and Alan Partington and Caroline Clark (Bologna).
The aim of the project is to study very recent developments in English language usage, and also changes in social, cultural and political attitudes over recent times, as reflected in language. The term Modern-Diachronic Corpus-assisted Discourse Studies (MD-CADS) was coined to denote this kind of study in general, which can be considered a form of Corpus-assisted Discourse Studies (CADS).
The group already possessed a corpus containing the complete output of the Guardian, Times and Telegraph, together with the Sunday Times and Sunday Telegraph, for 1993. In 2006 it compiled a sister corpus containing the complete set of articles from the same newspapers (plus the Guardian’s sister paper, the Observer) from the previous year, 2005. In 2011 one of the members, Taylor, now at Portsmouth University, compiled a third corpus containing the output of the Guardian, Times and Telegraph for 2010. The corpora were converted into XML format and marked up according to TEI guidelines by Marchi. An expanded and internationalised version, consisting of 12 newspaper titles, is nearing completion (see below).

By combining automated statistical analyses with more traditional close reading of texts, the group is able to compare and contrast the three sets of language data, and it has produced a number of publications reporting its findings (see below).
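As an illustration of what an automated statistical comparison between two corpora can look like, the sketch below computes a log-likelihood (G2) keyness score for each word across two token lists. It is not taken from the project and is not claimed to be the group's own procedure: the function names `log_likelihood` and `keywords` are hypothetical, and the choice of log-likelihood as the statistic is an assumption made for the example.

```python
import math
from collections import Counter


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood (G2) for one word observed in two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total number of tokens in corpus A and corpus B.
    """
    total = freq_a + freq_b
    # Expected frequencies if the word were spread evenly by corpus size.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def keywords(tokens_a, tokens_b, top=5):
    """Rank words by how strongly their frequency differs between corpora."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    scored = [
        (word, log_likelihood(counts_a[word], size_a, counts_b[word], size_b))
        for word in set(counts_a) | set(counts_b)
    ]
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top]
```

A high score only flags a word whose relative frequency differs between the two data sets; it does not say why. The flagged items would then be examined in context, for example through concordance lines, which is where close reading comes in.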

The SiBol/Port suite of corpora therefore currently consists of:

  • SiBol 93, containing the entire output of the Guardian, Times, Telegraph, Sunday Times and Sunday Telegraph for 1993.
  • SiBol 05, containing the entire output of the Guardian, Times, Telegraph, Observer, Sunday Times and Sunday Telegraph for 2005.
  • Port 2010, containing the entire output of the Guardian, Times and Telegraph for 2010.
  • SiBol 13, containing the entire output of the Guardian, Times, Telegraph, Daily Mail, Daily Mirror, Times of India, New York Times, Washington Times, South China Morning Post, Daily News Egypt, Gulf News (UAE) and This Day Lagos.

The first three have been edited by the Sketch Engine team, using Jan Pomikálek’s de-duplication tool, and are now available to the wider research community through the Sketch Engine interface.

In 2010, a special issue of the journal Corpora was dedicated to outlining and exemplifying the methodology of MD-CADS; all the articles in that issue made use of SiBol 93 and SiBol 05.

MD-CADS research using the SiBol/Port corpora:

  • Partington, A. and Duguid, A. 2008. Modern diachronic corpus-assisted discourse studies (MD-CADS). In M. Bertuccelli Papa and S. Bruti (eds) Threads in the complex fabric of Language. Pisa: Felici editore, pp. 5-19.
  • Taylor, C. 2011. Searching for similarity: The representation of boy/s and girl/s in the UK press in 1993, 2005, 2010. Corpus Linguistics 2011. University of Birmingham, 20-22 July.
  • Partington, A. 2015. Corpus-assisted comparative case studies of representations of the Arab world. In P. Baker (ed) Corpora and Discourse Studies. London: Palgrave.