Multimodal Monolingual Comparable Corpus


This is the first public release of LINA-PSLMC 1.0, the Paragraph Similarity List for the Multimodal Monolingual Comparable Corpus, defined by the Laboratoire d’Informatique de Nantes-Atlantique (LINA), Université de Nantes, France, as part of the DEPART project (Documents Ecrits et Paroles – Reconnaissance et Traduction). The lists of similar paragraphs are released under the LGPLLR, a version of the LGPL adapted to linguistic resources; see LICENSE for details. The paragraphs come from two modalities: transcribed texts and written texts. The corpus and its alignments are described in the thesis Multimodal Monolingual Comparable Corpus Alignment, written by Prajol Shrestha in 2013. Whether two paragraphs are similar is decided according to the similarity definition given in Chapter 2 of the thesis, and the construction of the corpus is explained in Chapter 5.

The written part of the corpus

The written part is drawn from the North American News Text Corpus, one of the corpora distributed by the Linguistic Data Consortium (LDC), under catalog number LDC95T21. Twelve articles were selected from this corpus, all on a single topic: the death of Diana. Paragraphs were segmented using the natural segmentation mark <p> present in the corpus, and paragraph counting starts from 0. The document IDs (DOCID) of the selected articles are:

  • latwp970901.0002
  • latwp970901.0009
  • latwp970901.0015
  • latwp970901.0017
  • latwp970901.0024
  • latwp970901.0030
  • latwp970902.0113
  • latwp970902.0120
  • latwp970902.0121
  • nyt970901.0079
  • nyt970901.0067
  • nyt970902.0120
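
The paragraph numbering used for the written texts can be illustrated with a short Python sketch. This is not the procedure used to build the corpus. It assumes that the body of an article has already been extracted, that each paragraph begins with a <p> mark, and that empty paragraphs are skipped. The function name split_paragraphs and the sample text are hypothetical.

```python
import re


def split_paragraphs(text_body: str) -> list[str]:
    """Split an article body on <p> marks and return the paragraphs in order.

    Assumption: every paragraph starts with a <p> mark; any text that
    precedes the first mark is ignored, and empty paragraphs are dropped.
    """
    pieces = re.split(r"<p>", text_body, flags=re.IGNORECASE)
    paragraphs = []
    for piece in pieces[1:]:  # pieces[0] is whatever came before the first <p>
        # Remove any leftover tags (for example a closing </p>).
        cleaned = re.sub(r"<[^>]+>", " ", piece)
        # Collapse line breaks and repeated spaces into single spaces.
        cleaned = " ".join(cleaned.split())
        if cleaned:
            paragraphs.append(cleaned)
    return paragraphs


# Hypothetical article body, not taken from the corpus.
sample = """<p>
First paragraph of a hypothetical article.
<p>
Second paragraph,
wrapped over two lines.
"""

# enumerate() starts at 0, matching the zero-based paragraph counting.
for index, paragraph in enumerate(split_paragraphs(sample)):
    print(index, paragraph)
```

Whether empty paragraphs should be dropped or counted affects the indices, so that choice would need to be checked against the released lists.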

The transcript part of the corpus

The transcript part consists of 7 transcripts from two different sources: 3 were selected from the transcript corpus LDC98T28 and the other 4 were manually transcribed. LDC98T28 is the 1997 English Broadcast News Transcripts (HUB4) corpus, one of the corpora distributed by the Linguistic Data Consortium (LDC). The document IDs (DOCID) of the three LDC transcripts are:

  • eo970830.sgml
  • eo970903.sgml
  • eo970911.xml

The four other transcripts were manually transcribed from ABC and CNN programs. They are listed below with their DOCID, program name, and date:

DOCID      Name                   Date
ABC.xml    ABC Report             17-12-06
cnn1.xml   CNN Breaking news      08-31-97
cnn2.xml   CNN American morning   16-12-06
cnn3.xml   CNN Breaking news      18-12-03

Each of these transcripts is related to a single topic, the death of Diana, and was manually segmented. The segmentation of the manually transcribed texts can be found in the Transcripts folder of the corpus. The LDC transcripts were also segmented manually, but the result differs little from the natural segmentation given by the <turn> tags present in the texts. Segment counting starts from 0.
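
For the LDC transcripts, the natural <turn> segmentation can be approximated with the Python sketch below. It only reproduces the tag-based split, not the manual segmentation shipped with the corpus. It assumes that each turn is enclosed in <turn ...> and </turn> tags and that any inline tags inside a turn can be discarded. The function name split_turns, the attribute names, and the sample text are hypothetical.

```python
import re

# Matches one <turn ...> ... </turn> block, including its attributes.
TURN_RE = re.compile(r"<turn\b[^>]*>(.*?)</turn>", re.IGNORECASE | re.DOTALL)


def split_turns(transcript: str) -> list[str]:
    """Return the text of each turn, in document order."""
    turns = []
    for match in TURN_RE.finditer(transcript):
        # Drop inline tags inside the turn, keeping only the spoken text.
        text = re.sub(r"<[^>]+>", " ", match.group(1))
        turns.append(" ".join(text.split()))
    return turns


# Hypothetical transcript fragment, not taken from the corpus.
sample = """<turn speaker="A" startTime="0.0" endTime="2.5">
<time sec="0.0">
good evening
</turn>
<turn speaker="B" startTime="2.5" endTime="4.0">
thank you
</turn>"""

# Segment indices start at 0, as in the corpus description.
for index, turn in enumerate(split_turns(sample)):
    print(index, turn)
```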

To contact, email us at :

Download

Copyright: LS2N 2017