
INCEPT: A Framework for Duplicate Posts Classification with Combined Text Representations

Skenderi, Erjon; Huhtamäki, Jukka; Laaksonen, Salla Maaria; Stefanidis, Kostas (2024-08-16)

 

ACM TRANSACTIONS ON THE WEB
40
doi:10.1145/3677322
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:tuni-202409108639

Description

Peer reviewed
Abstract
Dealing with many of the problems related to the quality of textual content online involves identifying similar content. Algorithmic solutions for duplicate content classification typically rely on text vector representation, which maps textual information into a set of features. Ideally, this representation would capture all aspects of the underlying text, including length, word frequencies, syntax, and semantics. While recent advancements in text representation have led to improved performance, a comprehensive approach that explicitly incorporates all text features has not yet been proposed. In this study, we present the INCEPT framework that utilizes multiple representation methods to detect duplicate text pairs, taking advantage of their individual strengths. The core of our approach involves using a stacking ensemble of pairwise vector distance measurements that are computed from multiple text representation methods. A stacking classifier then utilizes these distance scores as input and learns to identify duplicate posts. We assess the proposed framework's effectiveness in identifying duplicate posts in an online Question and Answer platform. By combining several text representation methods, INCEPT performs well in the duplicate posts classification task. Our experiments demonstrate that specific framework configurations outperform the accuracy scores obtained from individual text representation methods. Therefore, we also infer that no single text representation method can independently capture a text's features.
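The pipeline described in the abstract (several text representations, one pairwise distance per representation, and a classifier trained on those distance scores) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the configuration used in the paper: the three toy representations (word counts, character trigrams, word length), the distance functions, the hand-written logistic regression standing in for the stacking classifier, and the made-up question pairs in `TRAIN_PAIRS`.

```python
import math
from collections import Counter


def word_counts(text):
    """Toy representation 1: bag of words (token -> frequency)."""
    return Counter(text.lower().split())


def char_trigrams(text):
    """Toy representation 2: set of character trigrams."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}


def cosine_distance(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b) if norm_a and norm_b else 1.0


def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def length_distance(x, y):
    """Toy representation 3: relative difference in word count."""
    lx, ly = len(x.split()), len(y.split())
    return abs(lx - ly) / max(lx, ly, 1)


def distance_features(x, y):
    """One distance score per representation; this is the meta-classifier input."""
    return [
        cosine_distance(word_counts(x), word_counts(y)),
        jaccard_distance(char_trigrams(x), char_trigrams(y)),
        length_distance(x, y),
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_meta_classifier(features, labels, epochs=500, lr=0.5):
    """Logistic regression fitted with plain stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict_duplicate(weights, bias, x, y):
    feats = distance_features(x, y)
    return sigmoid(sum(w * v for w, v in zip(weights, feats)) + bias) >= 0.5


# Hypothetical labelled pairs (1 = duplicate, 0 = not duplicate).
TRAIN_PAIRS = [
    ("how do I reverse a list in python", "how to reverse a python list", 1),
    ("how can I merge two dictionaries", "how to merge two dictionaries in python", 1),
    ("what is a null pointer exception", "what does null pointer exception mean", 1),
    ("how do I reverse a list in python", "what is a null pointer exception", 0),
    ("how can I merge two dictionaries", "why is my css grid not centered", 0),
    ("what is a null pointer exception", "how to install numpy on windows", 0),
]

WEIGHTS, BIAS = train_meta_classifier(
    [distance_features(x, y) for x, y, _ in TRAIN_PAIRS],
    [label for _, _, label in TRAIN_PAIRS],
)
```

Calling `predict_duplicate(WEIGHTS, BIAS, post_a, post_b)` then returns a boolean for a new pair. The point of the design is that the classifier never sees the text vectors themselves, only one distance score per representation, so representations that capture different aspects of a text (frequency, surface form, length) can each contribute a feature.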
Collections
  • TUNICRIS publications [24216]
Kalevantie 5
P.O. Box 617
33014 Tampere University
oa[@]tuni.fi | Privacy | Accessibility statement
 

 
