Multimodal Monolingual Comparable Corpus
This is the first public release of LINA-PSLMC 1.0, the Paragraph Similarity List for the Multimodal Monolingual Comparable Corpus, defined by the Laboratoire d’Informatique de Nantes-Atlantique (LINA), Université de Nantes, France, as part of the DEPART project (Documents Ecrits et Paroles – Reconnaissance et Traduction). These lists of similar paragraphs are released under the LGPLLR, a version of the LGPL adapted to linguistic resources. See LICENSE for details. The lists cover paragraphs from two modalities: transcribed texts and written texts. The multimodal monolingual corpus and its alignments are explained in the thesis Multimodal Monolingual Comparable Corpus Alignment, written by Prajol Shrestha in 2013. Whether two paragraphs are similar is decided according to the similarity definition given in Chapter 2 of the thesis, and the building of the corpus is explained in Chapter 5.
The written part of the corpus
The written part is drawn from the North American News Text
Corpus, one of the many corpora distributed by the Linguistic Data
Consortium (LDC). The LDC catalog number for this corpus is LDC95T21.
Twelve articles were selected from the North American News Text Corpus,
each related to a single topic, the death of Diana. The paragraphs were
segmented using the natural segmentation mark `<p>` present in the
corpus, and paragraph counting starts from 0. The document IDs (DOCID)
of these articles are:
- latwp970901.0002
- latwp970901.0009
- latwp970901.0015
- latwp970901.0017
- latwp970901.0024
- latwp970901.0030
- latwp970902.0113
- latwp970902.0120
- latwp970902.0121
- nyt970901.0079
- nyt970901.0067
- nyt970902.0120
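The zero-based paragraph numbering described above can be illustrated with a minimal Python sketch. It assumes that each paragraph in an article body is introduced by a `<p>` mark and runs until the next mark or the end of the text. The function name `split_paragraphs` and the sample text are hypothetical and are not part of the release.

```python
import re


def split_paragraphs(doc_text):
    """Return the paragraphs of one article body, in document order.

    Assumption: every paragraph starts at a <p> mark; anything before
    the first mark is not a paragraph and is ignored.
    """
    parts = re.split(r"<p>", doc_text, flags=re.IGNORECASE)
    # Collapse line breaks and repeated spaces inside each paragraph.
    paragraphs = [" ".join(part.split()) for part in parts[1:]]
    # Drop empty pieces, e.g. two consecutive <p> marks.
    return [p for p in paragraphs if p]


# Hypothetical sample text, not taken from the corpus.
sample = (
    "<p>\nFirst hypothetical paragraph,\nwrapped over two lines.\n"
    "<p>\nSecond hypothetical paragraph.\n"
)

# enumerate() starts at 0, matching the numbering used in the lists.
for index, paragraph in enumerate(split_paragraphs(sample)):
    print(index, paragraph)
```

With this convention, the first paragraph of an article has index 0.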
The transcript part of the corpus
The transcript part consists of 7 transcripts from two different sources: 3 were selected from the transcript corpus LDC98T28, and the other 4 were manually transcribed. LDC98T28 is the 1997 English Broadcast News Transcripts (HUB4) corpus, one of the many corpora distributed by the Linguistic Data Consortium (LDC). The document IDs (DOCID) of the three LDC transcripts are:
- eo970830.sgml
- eo970903.sgml
- eo970911.xml
The four other transcripts were manually transcribed from ABC and CNN programs, listed below with their DOCID, program name, and date:
DOCID | Name | Date |
---|---|---|
ABC.xml | ABC Report | 17-12-06 |
cnn1.xml | CNN Breaking news | 08-31-97 |
cnn2.xml | CNN American morning | 16-12-06 |
cnn3.xml | CNN Breaking news | 18-12-03 |
Each of these transcripts is related to a single topic, the death
of Diana, and was manually divided into segments. The segmentation of
the manually transcribed texts can be found in the folder Transcripts
of the corpus. The LDC transcripts were also segmented manually, but
the result is not very different from the natural segmentation given
by the `<turn>` tags present in the texts. Segment counting starts from 0.
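As a rough illustration of the natural segmentation mentioned above, the following Python sketch splits a transcript on its `<turn>` tags and numbers the segments from 0. It only approximates the manual segmentation used in this release. It assumes that each turn is closed by a matching `</turn>` tag and that any markup inside a turn can be discarded. The function name `split_turns` and the sample text are hypothetical.

```python
import re

# Assumption: turns look like <turn ...attributes...> text </turn>.
TURN_RE = re.compile(r"<turn\b[^>]*>(.*?)</turn>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def split_turns(transcript_text):
    """Return the text of each <turn> element, in document order."""
    segments = []
    for match in TURN_RE.finditer(transcript_text):
        # Remove any markup nested inside the turn, then tidy whitespace.
        text = TAG_RE.sub(" ", match.group(1))
        text = " ".join(text.split())
        if text:
            segments.append(text)
    return segments


# Hypothetical sample text, not taken from the corpus.
sample = (
    '<turn speaker="A">Hello there.</turn>\n'
    '<turn speaker="B">Good <time sec="1.0"> evening.</turn>'
)

# Segment indices start at 0, as in the similarity lists.
for index, segment in enumerate(split_turns(sample)):
    print(index, segment)
```

Because the released segmentation was done by hand, segment boundaries produced this way may differ from the ones referenced in the similarity lists.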
To contact us, email us at: