Multimodal Monolingual Comparable Corpus

This is the first public release of LINA-PSLMC 1.0, the Paragraph Similarity List for the Multimodal Monolingual Comparable Corpus, defined by the Laboratoire d’Informatique de Nantes-Atlantique (LINA), Université de Nantes, France, as part of the DEPART project (Documents Ecrits et Paroles – Reconnaissance et Traduction). The lists of similar paragraphs are released under the LGPLLR, a version of the LGPL adapted to linguistic resources; see LICENSE for details. The paragraphs come from two modalities: transcribed texts and written texts. The corpus and its alignments are described in the thesis Multimodal Monolingual Comparable Corpus Alignment by Prajol Shrestha (2013). Whether two paragraphs are similar is decided according to the similarity definition given in Chapter 2 of the thesis, and the construction of the corpus is explained in Chapter 5.

The written part of the corpus

The written part is a subset of the North American News Text Corpus, distributed by the Linguistic Data Consortium (LDC) under catalog number LDC95T21. Twelve articles were selected from this corpus, all on a single topic: the death of Diana. Paragraphs were segmented using the <p> marks already present in the corpus, and paragraph numbering starts at 0. The document IDs (DOCID) of the selected articles are:

latwp970901.0002
latwp970901.0009
latwp970901.0015
latwp970901.0017
latwp970901.0024
latwp970901.0030
latwp970902.0113
latwp970902.0120
latwp970902.0121
nyt970901.0079
nyt970901.0067
nyt970902.0120
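
The <p>-based segmentation with 0-based numbering described above can be sketched as follows. This is only an illustration, not the tool used to build the corpus: the function name split_paragraphs and the sample text are made up, and it assumes the article body is available as a string in which each paragraph is introduced by a <p> mark.

```python
import re


def split_paragraphs(text_body: str) -> list[str]:
    """Split an article body on <p> marks (hypothetical helper).

    Anything before the first <p> is discarded, so that index 0 is the
    first marked paragraph, matching the 0-based counting of the lists.
    """
    parts = re.split(r"<p>", text_body, flags=re.IGNORECASE)[1:]
    # Collapse internal whitespace and drop empty segments.
    return [" ".join(part.split()) for part in parts if part.strip()]


# Made-up sample text, not taken from the corpus.
sample = """<p>
First paragraph of the article.
<p>
Second paragraph,
continued on another line.
"""

paragraphs = split_paragraphs(sample)
for index, paragraph in enumerate(paragraphs):
    print(index, paragraph)
```

With numbering of this kind, a paragraph in the similarity lists can be identified by its DOCID together with its 0-based index.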

The transcript part of the corpus

The transcript part consists of 7 transcripts from two sources: 3 were selected from the transcript corpus LDC98T28, and the other 4 were transcribed manually. LDC98T28 is the 1997 English Broadcast News Transcripts (HUB4) corpus, distributed by the Linguistic Data Consortium (LDC). The document IDs (DOCID) of the three LDC transcripts are:

eo970830.sgml
eo970903.sgml
eo970911.xml

The four other transcripts were manually transcribed from ABC and CNN programs. They are listed below with their DOCID, program name, and date:

DOCID	Name	Date
ABC.xml	ABC Report	17-12-06
cnn1.xml	CNN Breaking news	08-31-97
cnn2.xml	CNN American morning	16-12-06
cnn3.xml	CNN Breaking news	18-12-03

Each of these transcripts relates to a single topic, the death of Diana, and was manually segmented. The segmentation of the manually transcribed texts can be found in the Transcripts folder of the corpus. The LDC transcripts were also segmented manually, but the result differs little from the natural segmentation given by the <turn> tags present in the texts. Segment numbering starts at 0.
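
For comparison, the natural <turn> segmentation mentioned above could be approximated along these lines. This is a rough sketch, not the manual segmentation shipped with the corpus: the function name split_turns, the attribute names, and the sample transcript are all illustrative assumptions, and the code assumes that every <turn ...> tag has a matching </turn>.

```python
import re

# Matches one <turn ...> ... </turn> element; attributes are ignored.
TURN_RE = re.compile(r"<turn\b[^>]*>(.*?)</turn>", re.IGNORECASE | re.DOTALL)
# Matches any remaining inline tag inside a turn.
TAG_RE = re.compile(r"<[^>]+>")


def split_turns(transcript: str) -> list[str]:
    """Return the text of each turn, in order (hypothetical helper)."""
    segments = []
    for match in TURN_RE.finditer(transcript):
        text = TAG_RE.sub(" ", match.group(1))  # strip inline tags
        text = " ".join(text.split())           # normalise whitespace
        if text:
            segments.append(text)
    return segments


# Made-up sample; attribute names and content are assumptions.
sample = """<turn speaker="anchor" startTime="0.0" endTime="4.2">
<time sec="0.0"> good evening
</turn>
<turn speaker="reporter" startTime="4.2" endTime="9.8">
thank you
</turn>"""

for index, segment in enumerate(split_turns(sample)):
    print(index, segment)
```

Because the released segmentation was done by hand, indices produced this way may not match the segment numbers used in the similarity lists exactly.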

To contact, email us at :

