Internship proposal - 2025

Generating rhetorically coherent summaries of long legal documents

Level: Master

Period: January 2025 (flexible), about 6 months

Location: Nantes University, Laboratory of Digital Sciences of Nantes, Natural Language Processing Team [1]
Stipend: ~600 euros
Supervisors: Laura Monceaux, Anas Belfathi and Nicolas Hernandez

Context

As the highest court in the United States, dealing with constitutional issues and federal law, the Supreme Court (SCOTUS) defines a model of society whose impact extends beyond the borders of the United States (see the recent decisions limiting the EPA’s regulation of carbon emissions, for example). It is crucial that international legal practitioners who do not speak American English as their first language are able to access and understand these decisions.
Nevertheless, SCOTUS opinions are notoriously long and use specialised language, making them laborious to read and understand.
The Lexhnology project, supported by the French National Research Agency (Agence Nationale de la Recherche) under the reference ANR-22-CE38-0004 [2], proposes to exploit the rhetorical structure of the texts to facilitate access to them.

[1] http://taln.ls2n.fr
[2] https://lexhnology.hypotheses.org/

Subject

In this internship, we propose to work on the summarisation task. The current state of the art in automatic summarisation treats the task as a generative one, carried out by large language models (LLMs), with ‘divide and conquer’ strategies applied upstream when the text is too long to process in one pass.

The scientific questions concern the efficient modelling of long documents, the use of LLMs for extracting important parts and generating summaries, and the evaluation of the generated summaries.

The internship will focus on studying the use of rhetorical analyses in the process of extracting relevant text parts and generating coherent summaries. Depending on the candidate’s progress and interests, the following aspects can be explored: measuring the impact of segmentation strategies, analysing the impact of the complexity of a case, taking into account multiple modalities (oral argument and written opinion), and considering the criteria for evaluating the generated summaries.

References

* Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs. (Parmar et al., EMNLP 2024)
* PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation. (Leiter and Eger, EMNLP 2024)
* LOCOST: State-Space Models for Long Document Abstractive Summarization. (Le Bronnec et al., EACL 2024)
* Legal Case Document Summarization: Extractive and Abstractive Methods and their Evaluation. (Shukla et al., AACL 2022)
* Toward Unifying Text Segmentation and Long Document Summarization. (Cho et al., EMNLP 2022)
* A Divide-and-Conquer Approach to the Summarization of Long Documents. (Gidiotis and Tsoumakas, TASLP 2020)

Application

We are looking for applications from students preparing a Master’s degree or equivalent with solid skills (and ideally experience) in Natural Language Processing, Machine Learning and Deep Learning. Excellent spoken and written English is also essential.

To apply for this position, please send an email with your curriculum vitae, a document with your academic results, and a few words explaining your interest in this project to all three supervisors: Laura Monceaux, Anas Belfathi and Nicolas Hernandez (firstname.lastname@univ-nantes.fr).