Students of the Faculty of Data Science present a scientific paper at a Rank A international scientific conference

Paper: "ViLexNorm: A Lexical Normalization Corpus for Vietnamese Social Media Text"

Paper Link: https://arxiv.org/abs/2401.16403

Students involved:

  • Nguyen Thanh Nhi - KHDL2021 - Co-first author
  • Le Thanh Phong - KHDL2021 - Co-first author

Supervisor:

  • Mr. Nguyen Van Kiet

Abstract:

Lexical normalization, a fundamental task in Natural Language Processing (NLP), involves the transformation of words into their canonical forms. This process has been proven to benefit various downstream NLP tasks greatly. In this work, we introduce ViLexNorm, the first-ever corpus developed for the Vietnamese lexical normalization task. The corpus comprises over 10,000 pairs of sentences meticulously annotated by human annotators, sourced from public comments on Vietnam's most popular social media platforms. Various methods were used to evaluate our corpus, and the best-performing system achieved a result of 57.74% using the Error Reduction Rate (ERR) metric (van der Goot, 2019a) with the Leave-As-Is (LAI) baseline. For extrinsic evaluation, employing the model trained on ViLexNorm demonstrates the positive impact of the Vietnamese lexical normalization task on other NLP tasks. Our corpus is publicly available exclusively for research purposes.
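The ERR figure quoted in the abstract measures how much of the error left by the Leave-As-Is (LAI) baseline a system removes. The snippet below is a minimal sketch, assuming the accuracy-based formulation of ERR commonly used in lexical normalization work: ERR = (system accuracy - LAI accuracy) / (1 - LAI accuracy). The function names and the toy tokens are illustrative assumptions and are not taken from the paper's code or from the ViLexNorm corpus.

```python
def accuracy(predicted, gold):
    """Word-level accuracy: share of tokens that match the gold normalization."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one token")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def error_reduction_rate(raw, predicted, gold):
    """ERR = (acc_system - acc_LAI) / (1 - acc_LAI), under the assumption above."""
    # Leave-As-Is baseline: output the raw tokens unchanged.
    acc_lai = accuracy(raw, gold)
    acc_sys = accuracy(predicted, gold)
    if acc_lai == 1.0:
        raise ValueError("ERR is undefined when no token needs normalizing")
    return (acc_sys - acc_lai) / (1.0 - acc_lai)


# Made-up toy tokens for illustration only (not from ViLexNorm).
raw = ["ko", "bik", "gì", "hết"]
gold = ["không", "biết", "gì", "hết"]
predicted = ["không", "bik", "gì", "hết"]

print(error_reduction_rate(raw, predicted, gold))
```

Under this formulation, a system that copies its input scores 0, a perfect system scores 1, and a system that makes more mistakes than it fixes scores below 0.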

"We would like to express our sincere gratitude to Mr. Nguyen Van Kiet, M.Sc., for devoting so much time and enthusiasm to guiding the group throughout the research and paper publication process. We also thank Mr. Luu Thanh Son, M.Sc., for supporting the group during the experimentation phase. Additionally, we extend our thanks to the lecturers and assistants of the Faculty of Information Science and Engineering for their valuable insights during the execution of this project."

The 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2024), ranked A, is the flagship European conference dedicated to European and international researchers, covering a wide spectrum of research in Computational Linguistics and Natural Language Processing.

Detailed Information: https://www.facebook.com/UIT.Fanpage/posts/pfbid02qcuaXwc8yjKigg7AZ6QBSq...

Hạ Băng - Media Collaborator, University of Information Technology

English version: Phan Huy Hoang