Congratulations to the UIT Data Science students whose paper has been accepted at a top international conference in Natural Language Processing and Artificial Intelligence (ranked A* according to CORE2023, with an H-index of 176).

Title: "ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing"

Paper Link (preprint): https://arxiv.org/abs/2310.11166

Pre-trained LM Code: https://huggingface.co/uitnlp/visobert

Student Contributors:

  • Nguyễn Quốc Nam – 20520644 – Primary Author, Data Science 2020
  • Phan Châu Thắng – 20520929 – Co-author, Data Science 2020

Supervisors:

  • MSc. Nguyễn Văn Kiệt
  • MSc. Nguyễn Đức Vũ

Summary of the Paper:

Resource-rich languages such as English and Chinese have seen significant development of language models for natural language processing (NLP) tasks. Vietnamese, currently the eighth most used language on the Internet with around 85 million users worldwide, has not: despite the large amount of Vietnamese data available online, Vietnamese NLP research remains limited. Several models, such as PhoBERT, ViBERT, and vELECTRA, have been proposed and perform well on general Vietnamese NLP tasks, including POS tagging and named entity recognition. However, these models still fall short when processing and analyzing Vietnamese social media text.

In this paper, the authors therefore propose ViSoBERT, the first language model for Vietnamese social media text. It is based on the XLM-R architecture and pre-trained on a large, high-quality, and diverse dataset of Vietnamese social media text. The authors evaluate the model on five important NLP tasks on Vietnamese social media text: emotion recognition, hate speech detection, sentiment analysis, spam review detection, and hate speech span detection.

To gain a deeper understanding of the model, the team analyzes experimental results across masking rates, examines social media features including icons, teencode, and punctuation marks, and applies feature-based extraction for task-specific models. Their experiments show that ViSoBERT, with fewer parameters, outperforms previous state-of-the-art models on several Vietnamese social media text tasks.
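For readers unfamiliar with the masking rate mentioned in the summary: XLM-R-style models are pre-trained with masked language modeling, where a fraction of the input tokens is hidden and the model learns to predict them. The masking rate is that fraction. The sketch below only illustrates the idea and is not the authors' pre-training code. It assumes whitespace tokens instead of subword tokens, a simple replace-with-mask rule, and a made-up example sentence; the function name `mask_tokens` is hypothetical.

```python
import random

# Assumed placeholder string; XLM-R-style tokenizers use "<mask>".
MASK_TOKEN = "<mask>"


def mask_tokens(tokens, masking_rate, seed=0):
    """Replace about `masking_rate` of the tokens with MASK_TOKEN.

    Returns the masked copy and the sorted list of masked positions.
    Simplification: real pre-training works on subword tokens and may
    also keep or randomly replace some of the selected tokens.
    """
    if not 0.0 <= masking_rate <= 1.0:
        raise ValueError("masking_rate must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    n_to_mask = round(len(tokens) * masking_rate)
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK_TOKEN
    return masked, positions


# Made-up social media style sentence with teencode ("ko", "mn").
tokens = "hôm nay trời đẹp quá đi chơi ko mn ơi".split()

for rate in (0.1, 0.3, 0.5):
    masked, positions = mask_tokens(tokens, rate)
    print(f"rate={rate:.0%}: {' '.join(masked)}")
```

A higher rate hides more context per sentence, which makes the prediction task harder; the paper's analysis looks at how this choice affects downstream results.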

The students express their deep gratitude to MSc. Nguyễn Văn Kiệt and MSc. Nguyễn Đức Vũ for the time and effort they devoted to guiding the work and pointing out its limitations throughout the research and publication of this international scientific paper.

Note: Empirical Methods in Natural Language Processing (EMNLP) is one of the top international conferences in natural language processing and artificial intelligence (Ranked A* according to CORE2023) with an H-index of 176.

Further information:  https://www.facebook.com/UIT.Fanpage/posts/pfbid0bikaeBJkqHcD8Yc1h2ifxXr...

Hải Băng - Collaborative Communication Officer, University of Information Technology.

Translated by Ngoc Diem