Title: "A Text-based Approach For Link Prediction on Wikipedia Articles"
Authors:
Tran Hoàng Anh – 20521079 – KHDL2020 – Main Author
Nguyen Minh Tam – 20520748 – KHDL2020 – Co-Author
Supervisor: MSc. Luu Thanh Son
Abstract:
Wikipedia is the largest encyclopedia, and its articles are bound together by hyperlinks. By predicting future links between articles, we can enhance the navigability and discoverability of the network and provide users with more relevant and informative articles through those links. The DSAA 2023 Competition focuses on the link prediction task applied to Wikipedia articles. In this challenge, we are given a sparsified subgraph of the Wikipedia network, and our target is to predict whether a link exists between two Wikipedia pages u and v. In particular, we are given a ground-truth file that contains pairs of nodes corresponding to positive or negative samples. If an edge exists between two nodes, the corresponding label is set to 1; otherwise, the label is 0. However, if a pair of nodes is not reported in the file, this does not imply that there is no edge between them. Some of these missing pairs of nodes appear in the test file, and we have to predict whether there is a link between them.
In this paper, we present our approach and solutions for this challenge. Our approach is text-based: we use Part-of-Speech (POS) tagging to extract features from the text. Before running prediction models, we first analyzed and visualized the data to better understand the dataset. Next, we embedded the nodes by applying POS tagging, and we conducted a statistical t-test to select the tags. Finally, we ran classification models on the embedded dataset. Most of the models we used are classical Machine Learning models, which keeps our approach efficient. Our method achieved 0.99999 on both the public and private test sets and placed 7th in the competition.
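To make the embedding and tag-selection steps described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each node is already represented by a list of POS tags (the toy `node_tags` data is hard-coded rather than produced by a tagger), builds a pair feature as the absolute difference of tag frequencies, and keeps the tags whose Welch t-statistic between positive and negative pairs exceeds a cutoff. The names `tag_frequencies`, `pair_features`, `welch_t`, `select_tags`, the pair-feature definition, and the cutoff of 2.0 are all hypothetical choices made for illustration; the abstract does not specify them, and a full t-test would compare a p-value against a significance level instead.

```python
from collections import Counter
from math import copysign, inf, sqrt
from statistics import mean, variance

# Hypothetical toy data: POS tags of each node's text (would come from a POS tagger).
node_tags = {
    "a": ["NOUN", "NOUN", "VERB", "ADJ"],
    "b": ["NOUN", "NOUN", "VERB", "ADJ", "ADJ"],
    "c": ["VERB", "VERB", "VERB", "ADV", "NOUN"],
    "d": ["VERB", "VERB", "ADV", "ADV", "NOUN"],
}
# Labelled node pairs in the style of the ground-truth file: (u, v, label).
pairs = [("a", "b", 1), ("c", "d", 1),
         ("a", "c", 0), ("a", "d", 0), ("b", "c", 0), ("b", "d", 0)]


def tag_frequencies(tags, vocabulary):
    """Embed one node as the relative frequency of each tag in `vocabulary`."""
    counts = Counter(tags)
    total = len(tags) or 1  # avoid division by zero for empty texts
    return [counts[tag] / total for tag in vocabulary]


def pair_features(u, v, embeddings):
    """Assumed pair feature: absolute difference of the two node embeddings."""
    return [abs(x - y) for x, y in zip(embeddings[u], embeddings[v])]


def welch_t(xs, ys):
    """Welch's t-statistic for two samples (each needs at least two values)."""
    diff = mean(xs) - mean(ys)
    std_err = sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    if std_err == 0:
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / std_err


def select_tags(pairs, embeddings, vocabulary, cutoff=2.0):
    """Keep tags whose feature differs between linked and unlinked pairs."""
    positive = [pair_features(u, v, embeddings) for u, v, y in pairs if y == 1]
    negative = [pair_features(u, v, embeddings) for u, v, y in pairs if y == 0]
    selected = []
    for i, tag in enumerate(vocabulary):
        t = welch_t([row[i] for row in positive], [row[i] for row in negative])
        if abs(t) >= cutoff:
            selected.append(tag)
    return selected


vocabulary = sorted({tag for tags in node_tags.values() for tag in tags})
embeddings = {node: tag_frequencies(tags, vocabulary)
              for node, tags in node_tags.items()}
selected = select_tags(pairs, embeddings, vocabulary)
print(selected)
```

On this toy data the sketch keeps `ADJ` and `NOUN` and drops `ADV` and `VERB`. The selected columns of `pair_features` could then be passed to any classical classifier, which is the final step the abstract describes.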
"We express our gratitude to Mr. Lưu Thanh Sơn for accompanying and guiding our team throughout our participation in the competition and the publication of this international scientific paper."
The 10th International Conference on Data Science and Advanced Analytics (DSAA) highlights the strong interdisciplinary synthesis between statistics, computer science, and information/intelligence science, as well as the interaction between academic and business domains in data science and analytics. DSAA sets high standards for its organizing committee, keynote speeches, main conference submissions, and special sessions, and maintains a competitive acceptance rate for papers. DSAA has been widely recognized, for example by Google Metrics and the China Computer Foundation, as a leading annual specialized meeting in data science and analytics. DSAA 2023 provides a premier forum that brings together researchers, practitioners, government, developers, and users of big data solutions to exchange the latest theoretical developments in data science and best practices for various applications. DSAA 2023 invites submissions describing innovative research on all aspects of data science and advanced analytics, as well as papers focusing on applications that make significant, original, and reproducible contributions to improving the practice of data science and analytics in real-world situations.
For more details, visit: https://www.facebook.com/UIT.Fanpage/posts/pfbid032wBfBMLpsoZkqZN...
Hai Bang - Media Collaborator, University of Information Technology
Nhat Hien - Translation Collaborator, University of Information Technology