TechRxiv

Pars-HaO: Hate and Offensive Language Detection on Persian Tweets Using Machine Learning and Deep Learning

Version 3 2023-12-05, 04:16
Version 2 2023-11-22, 04:49
Version 1 2023-09-21, 21:27
preprint
posted on 2023-12-05, 04:16 authored by Mohammad Karami Sheykhlan, Saleh Kheiri Abdoljabbar, Jaber Karimpour

As social networks continue to grow in popularity, there is an urgent need to detect offensive language and hate speech automatically. While research and datasets for this task are plentiful in English, both remain scarce for Persian text. To help fill this gap, this article introduces Pars-HaO, a 3-class dataset of 8013 tweets. We collected the dataset by combining two strategies: gathering comments from pages that are more exposed to hate speech, and keyword-based search. Three annotators then labeled the tweets. As baselines, we used a Convolutional Neural Network (CNN) model alongside four widely recognized machine learning models: Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), and Decision Tree (DT). We then compared these baselines with Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), and CNN models, each trained on the output of the last hidden state of Bidirectional Encoder Representations from Transformers (BERT). Experimental results on the Pars-HaO dataset showed that BERT combined with BiLSTM performed best, achieving a macro F1-score of 70%.
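For readers unfamiliar with the reported metric, the following is a minimal sketch of how a macro F1-score can be computed for a 3-class problem: the F1 of each class is computed separately and the three values are averaged without weighting by class size. The label names and the toy predictions are hypothetical, chosen only for illustration; they are not taken from Pars-HaO, and this is not the authors' evaluation code.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Return the unweighted mean of the per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Hypothetical label names and toy data (assumptions, not from the dataset).
LABELS = ["hate", "offensive", "neither"]
y_true = ["hate", "offensive", "neither", "neither", "offensive", "hate"]
y_pred = ["hate", "neither", "neither", "neither", "offensive", "offensive"]

score = macro_f1(y_true, y_pred, LABELS)
print(f"macro F1 = {score:.3f}")
```

Because every class contributes equally to the average, macro F1 penalizes a model that performs well only on the most frequent class, which is presumably why it suits a dataset where hateful tweets may be less common than neutral ones.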
 

History

Email Address of Submitting Author

mohammadkaramisheykhlan@gmail.com

Submitting Author's Institution

University of Mohaghegh Ardabili

Submitting Author's Country

  • Iran
