Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi

Velankar, Abhishek; Patil, Hrushikesh; Joshi, Raviraj

doi:10.1007/978-3-031-20650-4_10

arXiv:2204.08669 (cs)

[Submitted on 19 Apr 2022]

Abstract: Transformers are the most prominent architectures for a wide range of Natural Language Processing tasks. These models are pre-trained on large text corpora and are intended to deliver state-of-the-art results on tasks such as text classification. In this work, we conduct a comparative study of monolingual and multilingual BERT models. We focus on the Marathi language and evaluate the models on datasets for hate speech detection, sentiment analysis, and simple text classification in Marathi. We use standard multilingual models such as mBERT, indicBERT, and xlm-RoBERTa and compare them with MahaBERT, MahaALBERT, and MahaRoBERTa, the monolingual models for Marathi. We further show that the Marathi monolingual models outperform the multilingual BERT variants on five different downstream fine-tuning experiments. We also evaluate sentence embeddings from these models by freezing the BERT encoder layers. We show that monolingual MahaBERT-based models provide richer representations than sentence embeddings from their multilingual counterparts. However, we observe that these embeddings are not generic enough and do not work well on out-of-domain social media datasets. We consider two Marathi hate speech datasets, L3Cube-MahaHate and HASOC-2021; a Marathi sentiment classification dataset, L3Cube-MahaSent; and the Marathi Headline and Articles classification datasets.

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2204.08669 [cs.CL]
	(or arXiv:2204.08669v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2204.08669
Related DOI:	https://doi.org/10.1007/978-3-031-20650-4_10

Submission history

From: Raviraj Joshi [view email]
[v1] Tue, 19 Apr 2022 05:07:58 UTC (728 KB)
