Cuneiform Text Dialect Identification Using Machine Learning Algorithms and Natural Language Processing (NLP)

Authors

  • Elaf A. Saeed, Al-Nahrain University
  • Ammar D. Jasim
  • Munther A. Abdul Malik

DOI:

https://doi.org/10.31987/ijict.7.2.265

Keywords:

Cuneiform, unigram, CLI, Over-sampling, SVM, RF, DT, KNN, DNN.

Abstract

Identifying the languages inscribed in cuneiform symbols is challenging because resources are scarce and tokenization is difficult. Seven languages and dialects written in cuneiform need to be distinguished: Sumerian and six dialects of Akkadian (Old Babylonian, Middle Babylonian Peripheral, Standard Babylonian, Neo-Babylonian, Late Babylonian, and Neo-Assyrian). This problem is addressed by the Cuneiform Language Identification (CLI) task in VarDial 2019. This paper presents ten machine learning algorithms drawn from four types of learning: supervised, ensemble, instance-based, and Artificial Neural Network. They include the Support Vector Machine (SVM), Naïve Bayes (NB), Logistic Regression (LR), and Decision Tree (DT) algorithms within supervised learning; the K-Nearest Neighbors (KNN) algorithm within instance-based learning; and the Random Forest (RF), Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), and Gradient Boosting (GB) algorithms within ensemble learning. In addition, the n-gram method from natural language processing is used to identify the cuneiform dialect. The best result belongs to an ensemble of Random Forest classifiers working on character-level features, with a macro-averaged F1 score of 96%, and the best outcome for the n-gram algorithm is 0.82% with di-grams.
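
The character-level approach described in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn and is not the authors' implementation: the toy strings, the two labels, the helper name build_classifier, and all hyperparameters are hypothetical placeholders chosen only to show how character unigram counts can feed a Random Forest classifier that is then scored with macro-averaged F1.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: Latin letters stand in for cuneiform signs.
# Real CLI lines would be Unicode cuneiform strings with seven labels.
train_texts = ["aab aba", "abb aab", "baa aab", "ccd cdc", "cdd dcc", "dcc ccd"]
train_labels = ["Sumerian"] * 3 + ["Neo-Assyrian"] * 3


def build_classifier(ngram_size=1, seed=0):
    """Character n-gram counts followed by a Random Forest (assumed settings)."""
    return make_pipeline(
        # analyzer="char" counts individual characters, so no tokenizer is needed.
        CountVectorizer(analyzer="char", ngram_range=(ngram_size, ngram_size)),
        RandomForestClassifier(n_estimators=100, random_state=seed),
    )


model = build_classifier(ngram_size=1)
model.fit(train_texts, train_labels)

test_texts = ["aba baa", "dcd ccd"]
test_labels = ["Sumerian", "Neo-Assyrian"]
predicted = model.predict(test_texts)

# Macro averaging computes F1 per class and then takes the unweighted mean,
# so rare dialects count as much as frequent ones.
macro_f1 = f1_score(test_labels, predicted, average="macro")
print(list(predicted), macro_f1)
```

Calling build_classifier(ngram_size=2) switches the same pipeline to character di-gram features. Working at the character level sidesteps the tokenization issue the abstract mentions, since no word boundaries have to be found.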

Published

2024-09-01

How to Cite

Cuneiform Text Dialect Identification Using Machine Learning Algorithms and Natural Language Processing (NLP). (2024). Iraqi Journal of Information and Communication Technology, 7(2), 26-40. https://doi.org/10.31987/ijict.7.2.265
