Multi-Source Document Summarisation for Coverage
Indiketiya Hewage, Sachini (2025)
Master's Programme in Computing Sciences and Electrical Engineering
Faculty of Information Technology and Communication Sciences
This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Date of approval
2025-11-26
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tuni-2025112510898
Abstract
This thesis addresses the limitations of current multi-source document summarisation systems, which often prioritise frequently repeated information while overlooking contextually important but less common facts. While aiming for broad coverage, these systems frequently produce summaries with substantial phrasing redundancy. In addition, standard evaluation metrics such as ROUGE and BLEU depend on golden summaries and penalise paraphrasing, making them inadequate for assessing semantic coverage. To overcome these challenges, this work introduces a modular, ablation-ready framework designed to maximise semantic coverage, minimise redundancy and provide a reference-free evaluation metric for quantitative assessment.
The framework operates in four stages: (i) linguistic preprocessing incorporating co-reference resolution and Named Entity Recognition (NER) tagging to normalise entities and preserve factual anchors; (ii) summary-basis generation using paragraph pairing or paragraph/sentence clustering strategies; (iii) LLM-based abstractive summarisation; and (iv) evaluation via Coverage, Overlap, and Named Entity Retention metrics. An ablation study was conducted across three preprocessing levels (Basic, Coref, Coref+NER) and three summary-basis generation methods (paragraph pairing, paragraph clustering, and sentence clustering), yielding nine configurations per instance.
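The nine-configuration ablation grid described above is simply the cross product of the two factors. A minimal sketch (the label strings below follow the abstract's wording; the thesis's actual configuration naming may differ):

```python
from itertools import product

# The two ablation factors named in the abstract.
preprocessing_levels = ["Basic", "Coref", "Coref+NER"]
basis_methods = ["paragraph pairing", "paragraph clustering", "sentence clustering"]

# Crossing 3 preprocessing levels with 3 basis methods yields 9 configurations.
configurations = list(product(preprocessing_levels, basis_methods))
print(len(configurations))  # 9
```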
Quantitative and qualitative analyses revealed that the Coref+NER with paragraph pairing configuration achieved the highest performance, significantly improving Coverage and Named Entity Retention while maintaining low Overlap. Co-reference resolution enhanced contextual continuity, while NER reinforced factual salience. The paragraph pairing mechanism, which organises paragraphs into Base, Counterpart, and Auxiliary units, guided the LLM to produce comprehensive summaries. Statistical tests confirmed that the improvements were systematic rather than incidental. The intrinsic metrics effectively quantified coverage, redundancy and factual retention without requiring golden summaries. Collectively, these results demonstrate that the proposed framework advances multi-source summarisation by improving factual coverage while remaining transparent and extensible for future research.
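The abstract does not reproduce the exact metric definitions, so the following is only an illustrative token-set sketch of what reference-free Coverage, Overlap, and Named Entity Retention could compute; the thesis presumably uses richer semantic representations (e.g. embeddings and an NER pipeline) rather than bag-of-words matching, and all function names here are hypothetical:

```python
def tokens(text):
    """Lowercased word tokens; a crude stand-in for semantic units."""
    return {w.strip(".,;:()").lower() for w in text.split() if w.strip(".,;:()")}

def coverage(summary, sources):
    """Reference-free Coverage: fraction of source tokens recalled by the summary."""
    src = set().union(*(tokens(s) for s in sources))
    return len(src & tokens(summary)) / len(src)

def overlap(summary_sentences):
    """Redundancy proxy: mean pairwise Jaccard overlap between summary sentences."""
    pairs = [(a, b) for i, a in enumerate(summary_sentences)
             for b in summary_sentences[i + 1:]]
    if not pairs:
        return 0.0
    return sum(len(tokens(a) & tokens(b)) / max(len(tokens(a) | tokens(b)), 1)
               for a, b in pairs) / len(pairs)

def entity_retention(summary, source_entities):
    """Share of source named entities that survive into the summary."""
    summ = tokens(summary)
    kept = sum(1 for e in source_entities if tokens(e) <= summ)
    return kept / len(source_entities)
```

Under these definitions, a good configuration maximises coverage and entity retention while keeping the pairwise overlap score low, which mirrors the trade-off the abstract reports for Coref+NER with paragraph pairing.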