Congratulations to the UIT student team whose research paper was accepted for publication at the international conference PACLIC 38

Sun, 12/01/2025 - 22:43

SDG9

Paper: “A study of Vietnamese readability assessing through semantic and statistical features”

Paper link: https://arxiv.org/abs/2411.04756

Student authors

- Lê Tuấn Hưng - KHDL2021 - Lead author

- Tô Trường Long - KHDL2021 - Co-author

- Nguyễn Trọng Mạnh - KHDL2021 - Co-author

Advisors: Dr. Đỗ Trọng Hợp, Dr. Nguyễn Thị Quyên

Paper abstract:

Determining the difficulty of a text involves assessing various textual features that may impact the reader's text comprehension, yet current research in Vietnamese has only focused on statistical features. This paper introduces a new approach that integrates statistical and semantic approaches to assessing text readability. Our research utilized three distinct datasets: the Vietnamese Text Readability Dataset (ViRead), OneStopEnglish, and RACE, with the latter two translated into Vietnamese. Advanced semantic analysis methods were employed for the semantic aspect using state-of-the-art language models such as PhoBERT, ViDeBERTa, and ViBERT. In addition, statistical methods were incorporated to extract syntactic and lexical features of the text. We conducted experiments using various machine learning models, including Support Vector Machine (SVM), Random Forest, and Extra Trees and evaluated their performance using accuracy and F1 score metrics. Our results indicate that a joint approach that combines semantic and statistical features significantly enhances the accuracy of readability classification compared to using each method in isolation. The current study emphasizes the importance of considering both statistical and semantic aspects for a more accurate assessment of text difficulty in Vietnamese. This contribution to the field provides insights into the adaptability of advanced language models in the context of Vietnamese text readability. It lays the groundwork for future research in this area.
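To make the joint approach described in the abstract more concrete, here is a minimal sketch in Python. The abstract does not list the exact statistical features used, so the three statistics below (average sentence length, average token length, type-token ratio) are illustrative assumptions, and the function names `statistical_features` and `joint_features` are hypothetical. The semantic embedding is passed in as a plain list of numbers; in the paper it would come from a language model such as PhoBERT, which is not loaded here. The classifier stage (SVM, Random Forest, Extra Trees) is also left out.

```python
import re
from typing import List, Sequence


def statistical_features(text: str) -> List[float]:
    """Compute three simple surface statistics for a passage (illustrative only)."""
    # Split into sentences on '.', '!' or '?' and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Plain whitespace tokenization; proper Vietnamese word segmentation
    # is deliberately not attempted in this sketch.
    tokens = [t.strip(".,!?;:\"'()") for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    if not tokens or not sentences:
        return [0.0, 0.0, 0.0]
    avg_sentence_len = len(tokens) / len(sentences)
    avg_token_len = sum(len(t) for t in tokens) / len(tokens)
    type_token_ratio = len(set(tokens)) / len(tokens)
    return [avg_sentence_len, avg_token_len, type_token_ratio]


def joint_features(text: str, embedding: Sequence[float]) -> List[float]:
    """Concatenate a (precomputed) semantic embedding with the statistics."""
    return list(embedding) + statistical_features(text)
```

The resulting vector could then be passed to any standard classifier; how the authors actually combine and weight the two feature groups is described in the paper itself.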

“The team would like to sincerely thank our advisors Đỗ Trọng Hợp and Nguyễn Thị Quyên for their dedicated guidance, support, and valuable feedback throughout the completion of this paper. Thanks to their direction and deep professional expertise, the team was able to complete the research effectively and achieve the best possible results.”

Conference information:

The 38th Pacific Asia Conference on Language, Information and Computation (PACLIC 38) is a reputable rank C international conference in the field of theoretical analysis and processing of natural language. Since 1982, the PACLIC conference series has provided a forum for researchers from different fields to share and discuss progress in scientific research, development, and applications on topics related to the study of language. In 2024, the main conference of PACLIC 38 took place on December 7-9 at Tokyo University of Foreign Studies.

For details, see: https://www.facebook.com/share/p/18HcJzFf53/

Đông Xanh - Communications collaborator, University of Information Technology