Journal article

Toward a More Global and Coherent Segmentation of Texts

Abstract:

Automatic text segmentation consists of identifying the most important thematic breaks in a document in order to cut it into homogeneous passages. The task has motivated a large amount of research. We focus here on statistical approaches that rely on an analysis of the distribution of words in the text. Usually, texts are segmented sequentially on the basis of very local clues. However, such an approach prevents the text from being considered globally, particularly with respect to the degree of granularity at which it expresses the different topics it addresses. We therefore propose two new segmentation algorithms, ClassStruggle and SegGen, which use criteria that give a global view of the text. ClassStruggle is based on an initial clustering of the sentences of the text, which allows similarities to be considered within a group rather than individually. It relies on the distribution of the occurrences of the members of each class to segment the text. SegGen uses a genetic algorithm to evaluate potential segmentations of the whole text. It attempts to find a segmentation that optimizes two criteria: maximizing the internal cohesion of the segments and minimizing the similarity between adjacent ones. According to experimental results, both approaches appear to be very competitive with existing methods.
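The two criteria named for SegGen (internal cohesion of segments, similarity between adjacent segments) can be illustrated with a small scoring sketch. This is not the authors' implementation: the abstract does not specify the similarity measure or the scoring formulas, so the bag-of-words cosine similarity, the averaging scheme, the choice to score a one-sentence segment as 0.0, and all function names below are assumptions made for illustration. The genetic search itself is omitted; the code only scores a given candidate segmentation.

```python
import math
from collections import Counter
from itertools import combinations


def bag_of_words(sentence):
    """Word counts for one sentence (naive lowercase whitespace tokenizer, an assumption)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_segments(vectors, boundaries):
    """Cut the list of sentence vectors; a boundary i means a break before sentence i."""
    cuts = [0] + sorted(boundaries) + [len(vectors)]
    return [vectors[start:end] for start, end in zip(cuts, cuts[1:])]


def internal_cohesion(segments):
    """Mean pairwise sentence similarity inside each segment, averaged over segments.

    Assumption: a one-sentence segment scores 0.0 so that tiny segments are not rewarded.
    """
    scores = []
    for seg in segments:
        pairs = list(combinations(seg, 2))
        scores.append(sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0)
    return sum(scores) / len(scores)


def adjacent_similarity(segments):
    """Mean similarity between the merged word counts of consecutive segments."""
    merged = [sum(seg, Counter()) for seg in segments]
    pairs = list(zip(merged, merged[1:]))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


def evaluate(sentences, boundaries):
    """Return (cohesion, adjacent similarity): the first is to be maximized, the second minimized."""
    segments = split_segments([bag_of_words(s) for s in sentences], boundaries)
    return internal_cohesion(segments), adjacent_similarity(segments)


# Toy document with an obvious topic shift after the second sentence.
sentences = [
    "cats purr softly",
    "cats sleep softly",
    "stocks fell sharply",
    "stocks rose sharply",
]
good = evaluate(sentences, [2])  # break at the topic shift
bad = evaluate(sentences, [1])   # break in the middle of the first topic
```

On this toy input, the segmentation that breaks at the topic shift scores higher on cohesion and lower on adjacent similarity than the misplaced one. A search procedure such as a genetic algorithm would compare many candidate boundary sets on these two objectives rather than just two.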

Document type:
Journal article
Contributor: Okina Université d'Angers
Submitted on: Wednesday, June 9, 2021 - 14:58:01
Last modified on: Thursday, June 10, 2021 - 03:39:58




Sylvain Lamprier, Tassadit Amghar, Bernard Levrat, Frédéric Saubion. Toward a More Global and Coherent Segmentation of Texts. Applied Artificial Intelligence, Taylor & Francis, 2008, 22 (3), pp. 208-234. ⟨10.1080/08839510701881391⟩. ⟨hal-03255362⟩


