ALUE: Arabic Language Understanding Evaluation
Haitham Seelawi Ibraheem Tuffaha Mahmoud Gzawi Wael Farhan
Bashar Talafha Riham Badawi Zyad Sober Oday Al-Dweik
Abed Alhakim Freihat Hussein T. Al-Natsheh
Mawdoo3 Ltd, Amman, Jordan
{haitham.selawi,ibraheem.tuffaha,mahmoud.gzawi,wael.farhan,
bashar.talafha,riham.badawi,zyad.sober,oday.aldweik,
abedalhakim.freihat,h.natsheh}
@mawdoo3.com
Abstract
The emergence of Multi-task learning (MTL)
models in recent years has helped push the
state of the art in Natural Language Understanding (NLU). We strongly believe that
many NLU problems in Arabic are especially
poised to reap the benefits of such models. To
this end, we propose the Arabic Language Understanding Evaluation Benchmark (ALUE),
based on 8 carefully selected and previously
published tasks. For five of these, we provide
new privately held evaluation datasets to ensure the fairness and validity of our benchmark.
We also provide a diagnostic dataset to help
researchers probe the inner workings of their
models. Our initial experiments show that
MTL models outperform their singly trained
counterparts on most tasks. But in order to entice participation from the wider community,
we stick to publishing singly trained baselines
only. Nonetheless, our analysis reveals that
there is plenty of room for improvement in
Arabic NLU. We hope that ALUE will play
a part in helping our community realize some
of these improvements. Interested researchers
are invited to submit their results to our online,
and publicly accessible leaderboard.
1 Introduction
Historically, research into the wide spectrum of
problems in Natural Language Processing (NLP)
and Understanding (NLU) has been highly compartmentalized, with each line of research attempting to tackle every single problem on its own, irrespective of the rest. However, in recent years, the
view has been shifting towards re-examining the
whole field of NLP under a multitasking lens. This
has manifested itself in the development of MultiTask Learning (MTL) models, which are trained to
optimize multiple losses, each for a different task,
simultaneously.
This shift in paradigm was brought about by
a confluence of various elements from the wide
landscape of NLP research. For one, most core
NLP tasks have been researched extensively with
a significant slowdown in improvements under the banner of "singlism". Another is the recent advances in contextual word embeddings, which were
brought about in turn by the advent of a whole new
class of neural network architectures, namely, the
transformer, as described in the seminal paper of
Vaswani et al. (2017).
We believe that the community of Arabic NLP is
particularly poised to reap significant benefits from
adopting this shift in paradigm. To this end, we
seek to present a collection of 8 different Arabic-specific tasks, as part of a collective benchmark
which we refer to as the Arabic Language Understanding Evaluation benchmark (ALUE). Similar
to the General Language Understanding Evaluation
(GLUE) benchmark (Wang et al., 2019b), the tasks
we present in ALUE, are already available online
and have featured in previous competitive workshops. To ensure fair use of the benchmark, we
provide our privately held evaluation datasets for
five of these tasks, in which we follow the respective original authors’ annotation processes to a tee.
In addition, we present an expert-constructed diagnostic dataset, to help researchers probe the inner
workings of their models, and diagnose their shortcomings against a wide range of linguistic phenomena. We also present an automated and publicly
published leaderboard on the internet1, open to any
researcher to directly submit the results of their
models to. Last but not least, we include baseline
results for several publicly available pre-trained
Arabic models.
This paper aims to introduce our work to the Arabic NLP community. We hope it will provide an impetus that aids the research and development efforts related to Arabic MTL approaches, and leads to wider collaboration as well as healthy competition. In Section 2, we discuss related work, both from the point of view of MTL models and datasets. In Section 3, we discuss the tasks comprising the ALUE benchmark, and their respective datasets. Section 4 focuses on the diagnostic dataset, the way it was constructed, and the rationale behind it. An overview of our selected baselines can be found in Section 5. Results and discussions of our work can be found in Section 6, followed by our conclusions in Section 7.
1 www.alue.org
Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 173–184, Kyiv, Ukraine (Virtual), April 19, 2021.
2 Related Work
The idea of MTL is relatively new in NLP. One
of the earliest oft-cited works in this regard was
by Collobert et al. (2011), in which they trained a
multi-layered neural model on different sentence
tagging tasks, with a common sentence encoder
shared between them all, achieving solid performance on all of the tasks in the process.
The shift in paradigm towards MTL requires a shift in terms of benchmarking as well. The General Language Understanding Evaluation (GLUE) benchmark (Wang
et al., 2019b) is one of the most widely used benchmarks for comparing MTL models in the English
language. It consists of nine tasks that are based on
previously published benchmarks/datasets, mainly
focusing on NLU. A subsequent iteration of the
benchmark, named SuperGLUE (Wang et al., 2019a),
extends the scope of focus of the original benchmark to more challenging tasks, including question
answering and co-reference resolution, while including a human performance baseline. For both
benchmarks, the organizers provide an automated
leaderboard that serves to compare and showcase
the latest advancements in the field.
One of the earliest successful applications of MTL in Arabic NLP was by Zalmout and Habash (2019). By using MTL in an adversarial learning setting, they reported state-of-the-art results in cross-dialectal morphological tagging. This was mainly
achieved by learning dialect invariant features between MSA (high resource dialect), and Egyptian
Arabic (low resource dialect). This, they argue,
helps in knowledge-transfer from the former to the
latter, thereby sidestepping the issue of resource
scarcity that plagues many Arabic variants. They
also note that the gain from such a knowledge-transfer approach is more significant the smaller the
datasets are.
Another paper by Baniata et al. (2018) employs
the idea of MTL in the context of Neural Machine
Translation. For translation from dialectical Arabic
to Modern Standard Arabic (MSA), the authors use
a sequence-to-sequence architecture, where each
source language has its own encoder. However, for
the target language, only a single shared decoder
is used. Using this setup, they report better results and are able to make efficient use of smaller datasets.
Freihat et al. (2018) manually curated an Arabic
corpus of 2 million words that was simultaneously
annotated for POS, NER, and segmentation. Using
an MTL model trained to perform the three aforementioned tasks, the authors were able to achieve
state of the art results on said tasks, and show that
such a model can greatly simplify and enhance
downstream tasks, such as lemmatization.
3 Tasks and Datasets
ALUE incorporates a total of 8 tasks covering a
wide spectrum of Arabic dialects and NLP/NLU
problems. Below, we provide a brief description
of each task: the nature of the problem, its original
workshop, and the evaluation metrics used. If the
task is one of the five we have provided our own
private dataset for, then we will also discuss the
annotation process we followed to generate said
private dataset.
It is worth mentioning that some of these tasks
were subtasks in their respective workshops, such
as the Emotion Classification and Sentiment Intensity Regression subtasks from SemEval-2018
Task 1, and the Offensive and Hate Speech subtasks from OSACT4 Shared Task on Offensive
Language Detection. However, for the purposes
of ALUE, these will be treated as independent tasks
rather than subtasks. Nevertheless, in the discussion below, we are going to list them under the
name of the original workshop task they featured
in first.
3.1 IDAT@FIRE2019 Irony Detection Task
(FID)
The shared task of Irony Detection in Arabic
Tweets (Ghanem et al., 2019) is based on a dataset
of around 5,000 tweets. Each tweet is labeled with
a "1" when it is ironic; that is, when it holds satire, parody, or sarcasm, or when the intended meaning is the contrary of the literal one. A label of "0" is given otherwise.
This task will be evaluated using the F1 score.
3.2 MADAR Shared Task Subtask 1 (MDD)
This task is based on the MADAR Shared Task
on Arabic Fine-Grained Dialect Identification
(Bouamor et al., 2019). Each sentence is exclusively classified into one of 25 labels, corresponding to one city out of 25 predefined Arab
cities. A 26th label is added for MSA. The data is
sourced from the Basic Traveling Expression Corpus (Takezawa et al., 2007), with the same 2,000
sentences translated to the spoken dialect in each
of the cities and MSA (Corpus-26). The metric of
choice for this task is the F1-score.
3.3 NSURL-2019 Shared Task 8 (MQ2Q)
The Semantic Question Similarity in Arabic task
(Seelawi et al., 2019) was presented in the First
International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2019). In this
task, a pair of questions is assumed to be semantically similar if they have the same exact answer
and meaning, which is denoted with a label of ”1”.
A label of ”0” is given otherwise. For this task,
we develop a new evaluation dataset. We start by
clustering a dataset of Arabic article titles into clusters of similar semantic meaning. From these, we
select headlines that have a question format. Then
by pairing questions from similar clusters, we obtain 29,254 similar question pairs. Non-similar
question pairs were generated by pairing questions
from clusters that are close but not similar in semantic meaning. This is done to ensure that the
resultant dataset is challenging. We then select
4,000 of these pairs, with an equal representation
of ”0”s and ”1”s. A final round of human validation on those question pairs is conducted to ensure
the quality of the resulting dataset. The evaluation
of this task is performed using an F1-score.
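The pairing strategy above can be sketched as follows. This is an illustrative reconstruction, not our released pipeline: it assumes the clustering has already been done, and the names `build_question_pairs`, `clusters`, and `neighbours` are ours. Questions within a cluster become similar pairs ("1"), while questions drawn from a close-but-distinct neighbouring cluster become challenging non-similar pairs ("0").

```python
from itertools import combinations

def build_question_pairs(clusters, neighbours):
    """Pair questions within a cluster as similar ("1"), and across
    semantically close but distinct clusters as non-similar ("0")."""
    pairs = []
    # Similar pairs: every combination of questions inside one cluster.
    for cid, questions in clusters.items():
        for q1, q2 in combinations(questions, 2):
            pairs.append((q1, q2, "1"))
    # Hard negatives: pair questions across neighbouring clusters only,
    # so that the "0" examples remain challenging.
    for cid, near_cid in neighbours.items():
        for q1 in clusters[cid]:
            for q2 in clusters[near_cid]:
                pairs.append((q1, q2, "0"))
    return pairs
```

In practice, a final human-validation pass (as described above) would still be needed on the sampled pairs.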
3.4 OSACT4 Shared Task on Offensive
Language Detection (OOLD & OHSD)
The Offensive Language Detection shared task
(Mubarak et al., 2020) is based on a dataset that contains a total of 10,000 tweets, with 2 subtasks. The
first is the Offensive Task (OOLD) where a tweet
is labeled offensive if it contains inappropriate language or implies insults or attacks against other
people, and not offensive otherwise. The second is
the HateSpeech Task (OHSD) in which offensive
tweets from the subtask above are also considered
hate speech if they are attacking a certain group
based on nationality, ethnicity, gender, political or
sports affiliation, religious belief, or other related
characteristics. Otherwise, they are labeled as not
hate speech.
For both of these tasks, we develop our own evaluation dataset, using the Abusive Language Detection on Arabic Social Media corpus (Al Jazeera)
(Mubarak et al., 2017), which contains 32,000 comments. We refine this corpus with 8 multi-labeled
fine-grained classes, namely: toxic, insult, threat,
identity hate, sexual, racial, blasphemy, and politically incorrect. Each label of these is denoted with
either ”1” if the class applies, or ”0” otherwise.
We then annotate these comments using the following guidelines: (i) a comment is labeled offensive if 3 or more of the classes insult, toxic, threat, identity hate, and/or sexual are present; (ii) it is labeled hate speech if the same previous conditions are satisfied, with the additional requirement of the racial class having a label of "1" too; (iii) comments with "0" values across all the classes are labeled as neither offensive nor hate speech; (iv) anything else that fails to satisfy any of the previous conditions is discarded.
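A literal reading of guidelines (i)–(iv) can be sketched as the following mapping (the function and label names are illustrative, not taken from our annotation tooling):

```python
# The five classes that count towards offensiveness under guideline (i).
OFFENSE_CLASSES = ("insult", "toxic", "threat", "identity_hate", "sexual")

def map_fine_grained(labels):
    """Map the 8 binary fine-grained classes to the task labels,
    following guidelines (i)-(iv). Returns None for discarded comments."""
    hits = sum(labels[c] for c in OFFENSE_CLASSES)
    if hits >= 3 and labels["racial"] == 1:       # (ii) hate speech
        return ("offensive", "hate_speech")
    if hits >= 3:                                  # (i) offensive only
        return ("offensive", "not_hate_speech")
    if all(v == 0 for v in labels.values()):       # (iii) clean comment
        return ("not_offensive", "not_hate_speech")
    return None                                    # (iv) discard
```

Note that, per guideline (ii), every hate-speech comment is also offensive, which mirrors the nesting of the two subtasks.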
We select 1,000 sentences from the resultant
dataset, with special care to achieve a distribution similar to that of the original one. Finally, a round of
human validation is conducted to ensure the quality of the overall evaluation dataset. Both of these
tasks are evaluated using the F1-score.
3.5 SemEval-2018 Task 1 - Affect in Tweets
(SVREG & SEC)
The Affect in Tweets dataset (Mohammad et al.,
2018) was introduced in the 2018 SemEval workshop. The task consists of five subtasks. For our
purposes, we will only include two of these. The
first is the Emotion Classification task (SEC) in
which a tweet is classified using one or more of
eleven possible labels that best capture the emotions expressed by it. These labels are anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, and trust. The second is
the Sentiment Intensity Regression task (SVREG)
in which participants are expected to predict the
”valence” of a given tweet, using a real-valued score
between ”0” and ”1”, with ”0” representing the
most negative sentiment possible, while ”1” being
the most positive sentiment possible.
For both of these tasks, we develop privately
held evaluation datasets. For SEC, our annotation process follows the convention described in
SemEval-2018 (Mohammad et al., 2018). First,
we select certain keywords, which collectively capture the 11 emotions. This is accomplished using
various morphological forms and sub-phrases. A
total of 18,000 tweets are then crawled using said
keywords, each of which is subsequently labeled
by four experienced annotators. A given emotion
is then labeled as present with a ”1”, if two or
more annotators agree. Otherwise, the emotion is
labeled with a ”0”. 1,000 tweets are selected in a
manner that resembles the distribution captured by
the original dataset.
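The agreement rule above (an emotion counts as present when at least two of the four annotators select it) can be sketched as a simple aggregation; the function name is illustrative:

```python
def aggregate_emotions(annotations, threshold=2):
    """Mark an emotion as present ("1") when at least `threshold`
    annotators selected it, and absent ("0") otherwise.

    `annotations` is a list of per-annotator dicts mapping each
    emotion name to a 0/1 vote."""
    emotions = annotations[0].keys()
    return {e: int(sum(a[e] for a in annotations) >= threshold)
            for e in emotions}
```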
Our SVREG evaluation dataset is based on the
same 1,000 tweets selected for SEC. It is annotated
using the Best-Worst Scaling (BWS) annotation
methodology as described by Kiritchenko and Mohammad (2016). We combine these tweets and
group them in tuples of four tweets each, according
to the following set of rules: (i) No two 4-tuples
should have the same four tweets. (ii) No duplicated tweets should exist within the same 4-tuple.
(iii) All the tweets should have equal representation during the annotation process. As noted by the
original authors, somewhere between 1,500 and
2,000 BWS questions should be sufficient to obtain
reliable scores. Each of our tweets is present in 8
different 4-tuples, making a total of 2,000 BWS
questions. Those tuples are then presented to the
annotators who pick the tweet with the highest sentiment, as well as the lowest sentiment out of a
given 4-tuple. Each 4-tuple is annotated by two
different annotators.
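One way to satisfy the three tuple-generation rules (with 1,000 tweets, 8 appearances each, and tuples of size 4, this yields exactly 2,000 tuples) is rejection sampling over shuffled rounds. This is our own sketch of the procedure, not the authors' original script:

```python
import random

def make_bws_tuples(tweet_ids, appearances=8, size=4, seed=0):
    """Generate BWS tuples so that (i) no two tuples are identical,
    (ii) no tweet repeats within a tuple, and (iii) every tweet
    appears exactly `appearances` times. len(tweet_ids) must be
    divisible by `size`."""
    rng = random.Random(seed)
    seen, tuples = set(), []
    for _ in range(appearances):
        while True:
            pool = list(tweet_ids)
            rng.shuffle(pool)
            # Chunking a permutation guarantees no duplicates inside a tuple.
            batch = [tuple(sorted(pool[i:i + size]))
                     for i in range(0, len(pool), size)]
            if len(set(batch)) == len(batch) and seen.isdisjoint(batch):
                break  # retry the round on the (unlikely) tuple collision
        seen.update(batch)
        tuples.extend(batch)
    return tuples
```

Each round contributes one appearance per tweet, so equal representation falls out of the construction rather than needing to be checked.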
Regression values are then obtained using the
equation below:
Vi = 0.5 + 0.5 × (Bi − Wi) / Ti
where Vi is the regression value for tweet i, Bi is the number of times tweet i was voted as having the highest sentiment in a 4-tuple, Wi is the number of times it was voted as having the lowest sentiment, and Ti is the number of times the tweet appears throughout all of the 4-tuples. It is worth noting that the Vi value obtained using the above equation will be between 0 and 1, inclusive.
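The scoring equation translates directly into code. Since Bi − Wi ranges over [−Ti, Ti], the result is guaranteed to lie in [0, 1]:

```python
def bws_score(best_count, worst_count, total_count):
    """Vi = 0.5 + 0.5 * (Bi - Wi) / Ti: map a tweet's best/worst
    vote counts into a sentiment intensity in the closed interval [0, 1]."""
    return 0.5 + 0.5 * (best_count - worst_count) / total_count
```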
For evaluation, we use the Jaccard similarity
score and the Pearson correlation coefficient for
SEC and SVREG, respectively.
3.6 Cross-lingual Sentence Representations
(XNLI)
This task is based on a dataset that was first presented by Conneau et al. (2018). It contains 7,500
textual entailment sentence pairs, each representing
a hypothesis and a premise. These sentence pairs
are labeled with one of the following logical relationship labels: entailment, contradiction,
or neutral. The data was originally labeled in the
English language and then translated into 15 other
languages including Arabic. It is split into 2,500
for development, and 5,000 for testing. The training data is meant to be the Multi-Genre Natural
Language Inference (MultiNLI) corpus (Williams
et al., 2018), which is available only in English.
However, for the purposes of ALUE, we consider
the 5,000 test dataset to be the training dataset, and
the 2,500 development dataset to be the test dataset.
The evaluation metric for this task is accuracy.
3.7 How to Participate
Similar to GLUE, researchers interested in submitting their results to our leaderboard need to download the data and run their models using only the training and, where available, the validation datasets for each and every task. Once the results are obtained, and proper formatting and naming are adhered to, they can be submitted to our website. The leaderboard shows each accepted team's submission performance per task. The final ranking is based on
the unweighted average across all the tasks, not
including the diagnostic dataset. Appendix B can
be consulted for more details on the technology
stack powering our leaderboard.
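The final ranking rule is a one-liner: an unweighted mean over the eight task scores, with the diagnostic score excluded. A minimal sketch (the key name "DIAG" is our own convention here):

```python
def leaderboard_score(task_scores):
    """Final ranking score: unweighted mean over the ALUE tasks.
    The diagnostic score, if present, is excluded from the average."""
    tasks = {k: v for k, v in task_scores.items() if k != "DIAG"}
    return sum(tasks.values()) / len(tasks)
```

Applied to the ArabicBERT row of Table 1, this reproduces its reported average of 67.12.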
4 Diagnostics
Understanding a sentence depends on the capacity
of the model to capture multiple underlying linguistic representations; changes in the logical, semantic, syntactic, and/or morphological features of a
sentence can alter its meaning. These alterations
can be described in terms of logical relationships
(Bos and Markert, 2005). The Single-Task Training Benchmark by Wang et al. (2019b) shows varying weaknesses between different models when
it comes to capturing the linguistic phenomena
involved. For example, double negation is especially difficult for Contextualized Word Vectors
such as CoVe (McCann et al., 2017). Nonetheless, this seems to be somewhat ameliorated by
deep learning-based contextualized word representations such as ELMo (Peters et al., 2018). Yet, ELMo seems to struggle with Morphological Negation when compared to CoVe.
The use of a manually annotated evaluation
dataset that captures the diversity of such linguistic
phenomena can help model designers better understand their models’ generalization behavior, and
work on improving them in the process. For this
purpose, we create the ALUE diagnostic dataset;
a high-quality, hand-labeled evaluation dataset,
inspired by its GLUE counterpart (Wang et al.,
2019b). It is composed of around 1,100 Arabic
pairs of hypothesis and premise sentences. Each
of these is labeled with their respective entailment
relationship (entailment, contradiction, or neutral)
and tagged with one or more categories representing
the host of linguistic phenomena they represent.
For the purposes of the evaluation of a given
model’s performance on the diagnostic dataset, we
opted to follow GLUE’s lead, and use the R3 metric
(Gorodkin, 2004), for similar reasons (i.e. unbalanced class distribution in the diagnostic dataset
labels). We also believe that the R3 metric readily helps with the investigation of systematic errors.
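The R3 metric is the 3-class instance of Gorodkin's (2004) K-category correlation coefficient RK, which reduces to the Matthews correlation for K = 2. A sketch of the general formula, using per-class true and predicted counts:

```python
import numpy as np

def r_k(y_true, y_pred):
    """K-category correlation coefficient of Gorodkin (2004).
    Robust to class imbalance; equals Matthews correlation for K = 2."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    s = len(y_true)                        # total number of samples
    c = int(np.sum(y_true == y_pred))      # correctly classified samples
    t = np.array([np.sum(y_true == k) for k in classes])  # true counts
    p = np.array([np.sum(y_pred == k) for k in classes])  # predicted counts
    denom = np.sqrt((s**2 - np.sum(p**2)) * (s**2 - np.sum(t**2)))
    return (c * s - np.sum(t * p)) / denom if denom else 0.0
```

A perfect prediction yields 1.0, while a prediction carrying no information about the labels yields a value around 0.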
4.1 Annotation Process
We begin with the set of categories introduced in
GLUE (Wang et al., 2019b)3 . Each class describes
a linguistic phenomenon that is important for NLU
models to capture. Given that many of them are
well described in the linguistics literature, they
can be matched using syntactic/semantic patterns.
Next, we look for sentences expressing these
patterns in a syntactically tagged Arabic corpus,
with the help of WordNet relations. From these
sentences, we construct sentence pairs, mostly by
modifying the sub-phrases of a premise to produce
a hypothesis reflecting the same linguistic phenomena as in GLUE. For this purpose, we use three main
sources, namely: Arabic Wikipedia, UN Multilingual Corpus V1.0 (Ziemski et al., 2016) and
the corpus from the Arabic Linguistic Tool (ALP)
(Freihat et al., 2018). This latter source is composed of syntactically annotated texts of various
genres (e.g. news items, prose, literature, dialogues
and TV podcasts) to ensure high coverage of spoken and written MSA. In addition, we manually
translate the 330 artificially created examples from
GLUE, which mostly express complex linguistic
phenomena, and include them as part of our diagnostic dataset.
It must be noted that languages do not always
describe similar linguistic phenomena in the same
manner. This can lead to different entailments for
equivalent translations. For instance, the agent of a
verb in a passive construction is often mentioned in
English, and when combined with Negation the entailment always yields a contradiction label: John
didn’t break the vase vs The vase was broken by
John. The Arabic language tends to hide the agent
of the verb in passive constructions which leads to
a neutral label4 : John didn’t break the vase vs The
vase was broken. With this in mind, during the annotation process, we took care of such peculiarities
and, hence, ended up adding other categories. See
appendix C.1 for more details.
5 Baselines
All of our baselines build on publicly released pretrained word embeddings or models. These were
carefully selected to represent a temporal cross-section of progress in Arabic NLP during the past
few years; we start with fixed word-embeddings
(i.e. AraVec, Soliman et al., 2017 and Fasttext,
Joulin et al., 2017), and end with masked language
models (i.e. Arabic-BERT, Safaya et al., 2020 and
Multilingual-BERT, Turc et al., 2019), by way of
sentence representations (i.e. Large Multilingual
Universal Sentence Encoder, Yang et al., 2019)
and the early starts of contextual embeddings (i.e.
ELMoForManyLangs, Che et al., 2018; Fares et al.,
2017).
For BERT-based models, we use Huggingface
(Wolf et al., 2019) for implementation, whereas for
all other models, we use TensorFlow 2.0 (Abadi
et al., 2015). However, since the ELMoForManyLangs model is implemented in PyTorch (Paszke et al., 2019), we decided to use its contextual embeddings without fine-tuning; otherwise, we would have needed to implement it in PyTorch as well. This would have effectively required us to use
three different deep-learning frameworks to implement all of our baselines. This, we reckon, would
make the process of reproduction of our baselines
much more complex for other researchers; as a
matter of fact, as we trawled through the literature
in search of Arabic pre-trained and publicly released models, we could not help but notice the dearth of such models, especially ones that capitalize on the latest advancements in NLP.
3 The full details of the annotation process for the diagnostics data are too long to be included here. Therefore we opted to provide them on our website.
4 The logical inference is not centered on the agent of the verb, but the whole event.
We hope that our contributions through ALUE will
help in this regard, by being a part of a conducive
environment for the Arabic NLP community, to
develop and publicly release state of the art MTL
models.
We also note the lack of large corpora dedicated
in full to Arabic and its variants. The majority of
our selected baselines are pre-trained on dumps of
Arabic Wikipedia and the Common Crawl (both of
which are predominantly MSA in nature), while a
few of them are trained on unreleased crawls with
larger coverage of the various dialects of Arabic
(i.e. ArabicBERT and AraVec). This makes our
baselines a little harder to compare directly, but
we hope that this highlights the peculiar challenges
that Arabic NLP faces in this regard.
For all of our models, we embed our sentences
into fixed-size vectors which in turn are fed into
a feed-forward network that produces the final
prediction. For the Universal Sentence Encoder
(USE), this is readily achieved. However, for our
AraVec, Fasttext and ELMo-based models, in order to achieve the same step, we first consume the
word embeddings using a BiLSTM. For the BERT
based models, this is achieved somewhat indirectly,
via the [CLS] token, which serves as a surrogate
for sentence embeddings. More details on the exact architecture for each model per task, can be
obtained via Appendix A and the GitHub repo where
we release all our code5 . The parameters used for
each model might differ slightly, as we attempted
to bring out the best performance possible from
each to make our comparisons between them a little more fair.
6 Results and Discussion
Several key points in our work are worth analyzing,
namely: (i) the baseline scores and the comparison of the different approaches, (ii) the analysis
of the performance of our baseline models on the
diagnostic dataset, and (iii) the comparative analysis between private and public evaluation datasets.
This section goes through each of these key points
in the same order as presented above.
6.1 Benchmark Results
Each one of our models was trained in a single-task fashion. To ensure reproducibility, we used a fixed random seed. While we strongly advocate for
MTL models, we strategically eschewed training
any such model for our baselines. This is mainly
because our initial experiments show that it significantly boosts the performance on our benchmark,
and as such, with the end goal of encouraging researchers to submit their results to ALUE, we decided to keep our baselines competitive enough to
entice participation from the wider community, but
simple enough to surpass. At the end of the day,
we believe that baselines, as their name suggests,
belong at the base of a leaderboard. The results
obtained for each of our baselines are outlined in
Table 1.
We note that, expectedly, BERT-based models
outperform all others by a large margin, with ArabicBERT’s performance being the best across all
tasks. This can be attributed in part, to the fact
that it is trained on a large corpus composed of
MSA as well as dialectical variants of Arabic. The
importance of this factor is evident in tasks that
are heavily skewed towards dialectical Arabic (i.e.
OOLD and OHSD), where a simple model using
Twitter-based word representations such as AraVec
(i.e. which has a heavy representation of dialectical words) outclassed Multi-lingual BERT, which
was only trained on MSA. This strongly highlights
the importance of pre-training data that covers the
wide spectrum of Arabic variants in this time and
age.
Interestingly, the difference in performance between Multi-BERT cased and uncased is somewhat
negligible across all tasks except for those that require strong syntactical performance (i.e. MQ2Q
and XNLI). This indicates that orthographic normalization in Arabic might impede a model’s ability to achieve good syntactical modeling. We also
suspect that the cased Multi-BERT model is indirectly benefiting from preserving the case for the
other languages in the shared WordPiece vocabulary space it learns.
In a similar vein, the USE model performs very
competitively on XNLI and MQ2Q. This is due
to the fact that Natural Language Inference (NLI)
is part of the pre-training method for said model.
5 We release the code for our baselines publicly for reproducibility at the following GitHub repo: https://github.com/hseelawi/alue baselines
Model            FID    MDD    MQ2Q   OOLD   OHSD   SVREG  SEC    XNLI   Avg    DIAG
ArabicBERT       82.18  59.66  85.69  89.47  78.72  55.12  25.13  60.96  67.12  19.60
ML-BERT Cased    81.61  61.26  83.24  80.33  70.54  33.85  14.02  63.09  60.99  19.00
ML-BERT Uncased  81.01  57.98  75.79  79.85  70.64  32.01  13.81  57.91  58.63  15.10
USE              76.90  23.40  76.50  76.30  68.20  36.50  14.80  57.10  53.71  13.90
ML-ELMo          77.00  52.10  70.50  71.60  62.88  24.90  14.40  50.00  52.92  09.60
AraVec           76.70  48.40  62.60  85.60  73.80  32.20  18.05  47.70  55.63  10.00
FastText         77.10  50.80  66.80  79.70  60.40  37.00  15.30  52.70  54.98  03.50
Table 1: Evaluation scores for our baseline models on the various ALUE tasks, with Pearson Correlation and
Jaccard Index scores for SVREG and SEC tasks respectively, Matthews Correlation Coefficient for the Diagnostic
dataset (DIAG), Accuracy for XNLI, and F1-score for the rest. Note that DIAG is not included in the average as it
is not designed for direct model comparison.
Nonetheless, the model’s very poor performance
on dialect detection (i.e. MDD) reveals the inherent issues that sentence embedding models face in
tasks where lexical information is important, as it
tends to be discarded, in the process of embedding
the full sentence into a single fixed-size vector. For
such tasks, we can make the observation that models which use subword embeddings or some form
of morphology-based word tokenization tend to
perform well, for the exact opposite reason, even
for those that have been trained on MSA only (i.e.
Multi-BERT).
The benefits of using subwords can be generalized to dialect-heavy tasks too, as can be seen in
the case of the Multi-BERT and Fasttext models.
This might be explained by the fact that subwords
make better use of cognates across the different
forms of Arabic. Additionally, there is no denying
that the use of subwords mitigates the effects of
non-standard orthography, amongst and across the
various dialects. Nonetheless, their contributions
might become less pronounced when dialectical
Arabic is strongly present in the pre-training corpus. This is very evident in the case of the AraVec
model, which, as alluded to above, outperforms all the models except ArabicBERT on three out of five of those tasks (i.e. SEC, OOLD, and OHSD); and even then, ArabicBERT is itself heavily pre-trained on dialectical data.
6.2 Diagnostic Results Analysis
We report the performance of our models on the diagnostic dataset in Table 2.
For coarse-grained categories, ArabicBERT outperforms, on average, all other models with a considerable difference, although the overall performance across all models is low. The highest scores fall under the Predicate-Argument Structure category, with an average of 15, whereas the scores on Lexical Semantics seem to be the lowest, with an average of 8.1. Interestingly, both versions of MultiBERT outperform ArabicBERT in World Knowledge, which depends on extra-linguistic information. This, perhaps, has something to do with the fact that they were trained on Wikipedia dumps of many languages. USE results also seem to be high in World Knowledge compared to other coarse-grained categories.
For fine-grained categories, here again we can see the strong correlation between USE and MultiBERT performance on Named Entities, which is perhaps underpinned by the same factors as their solid performance on World Knowledge. All of our models seem to exhibit very poor performance on Double Negation and Conditionals. This is probably due to the very rich diversity of tool-words used in Arabic to describe such phenomena, which makes it difficult for the models to make adequate generalizations. Of note is the fact that FastText seems to work especially well on Restrictivity, where all other models seem to struggle.
6.3 Private vs Public Set Analysis
Here we perform a comparative analysis between
our private evaluation datasets and the original
ones, to provide a better understanding of the involved baselines and datasets. First, we compute a
correlation score between both, private and public
results as displayed in Table 3. As expected, all
of these scores are positively correlated but some
datasets are more so than others. For instance, the
strong correlation between the public and private
evaluation datasets for MQ2Q can be explained by
the fact that the only difference is in the diversity
of the topics covered. On the other hand, while
both the public and private evaluation datasets for
SVREG and SEC are from the same source (i.e.
tweets), they were collected at different points in
Model            All    LS   PAS   L     K     Quant  2N    Cond  Rest  Nom  NE
ArabicBert       19.6   15   28    13.1  15    48     -17   -06   00    15   10
ML-BERT cased    19     17   20    12    19    45     -20   16    10    34   23
ML-BERT uncased  15.1   07   15    13.5  19    36     -12   00    -12   10   06
USE              13.9   07   14    10    15    34     -30   00    -01   20   22
Elmo             9.6    14   10    09    10    08     -26   -14   00    14   00
AraVec           10     00   10    09    06    07     -37   -10   -24   10   10
FastText         3.5    -02  08    02    -03   14     -37   -14   30    09   -06
AVG                     8.1  15    9.8   11.6
Table 2: The R3 results of our different baseline models on the diagnostic dataset. Scores are scaled by 100. The
”All” score is the average of coarse-grained categories. Abbreviations are: Lexical Semantics (LS), PredicateArgument Structure (PAS), Logic (L), World knowledge (k), Quantifiers (Quant), Double Negation (2N), Conditional (Cond), Restrictivity (Restr), Nominalization (Nom) and Named entities (NE). Here we only report results
on the fine-grained classes that we find to be the most interesting.
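The R3 score reported in Table 2 is the K-category correlation coefficient of Gorodkin (2004), applied to the K = 3 entailment classes. A minimal sketch of how such a score can be computed from gold and predicted labels is given below; the example labels are illustrative, not taken from the dataset.

```python
import math

def rk_correlation(y_true, y_pred):
    """K-category correlation coefficient (Gorodkin, 2004).
    For K = 3 entailment classes this is the R3 score of Table 2."""
    classes = sorted(set(y_true) | set(y_pred))
    s = len(y_true)                                   # number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly classified
    t_k = [y_true.count(k) for k in classes]          # gold count per class
    p_k = [y_pred.count(k) for k in classes]          # predicted count per class
    num = c * s - sum(t * p for t, p in zip(t_k, p_k))
    den = math.sqrt((s * s - sum(p * p for p in p_k)) *
                    (s * s - sum(t * t for t in t_k)))
    return num / den if den else 0.0

gold = ["entailment", "entailment", "neutral", "neutral",
        "contradiction", "contradiction"]
pred = ["entailment", "neutral", "neutral", "neutral",
        "contradiction", "entailment"]
print(round(rk_correlation(gold, pred), 4))  # 0.5222
```

Like the binary Matthews coefficient, R3 is 1 for perfect agreement, 0 for chance-level agreement, and can go negative, which is why several fine-grained cells in Table 2 are below zero.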
                       MQ2Q             SVREG              SEC              OOLD             OHSD
Model              Private Public   Private Public   Private Public   Private Public   Private Public
ArabicBERT          0.8569 0.9523   0.5512 0.8376   0.2513 0.5422   0.8947 0.9583   0.7872 0.9820
MultiBERT-cased     0.8324 0.9573   0.3385 0.7261   0.1402 0.4802   0.8033 0.9439   0.7054 0.9715
MultiBERT-uncased   0.7579 0.9389   0.3201 0.7274   0.1381 0.4739   0.7985 0.9453   0.7064 0.9772
USE                 0.7650 0.8451   0.3650 0.7224   0.1480 0.2705   0.7630 0.7264   0.6820 0.5501
ELMo                0.7050 0.9018   0.2490 0.5550   0.1440 0.2850   0.7160 0.6119   0.6288 0.4930
FastText            0.6680 0.8869   0.3700 0.6480   0.1530 0.3221   0.7970 0.7059   0.6040 0.4770
AraVec              0.6260 0.8096   0.3220 0.6402   0.1805 0.3552   0.8560 0.7439   0.7380 0.6145
Correlation             0.7757          0.8492          0.5012          0.6229          0.7151

Table 3: The evaluation scores of our baseline models on both the private and public evaluation datasets. The last row shows the Pearson correlation coefficient between the two sets across all the models for the corresponding task.
time. For SVREG, the scores on both evaluation datasets are highly correlated, yet there is a drastic gap between the results. This can be explained by the comparative nature of the BWS annotation process: each tweet is labeled with a floating-point number that indicates how its sentiment compares to the average, or norm, of the entire dataset. From the results, we can deduce that the overall sentiment of the public evaluation dataset is more positive than ours. The results on the SEC datasets are the least correlated. This is because both versions of MultiBERT (cased and uncased) achieve high scores on the original evaluation dataset, yet, curiously, they score the lowest on ours. Interestingly, in the case of OOLD and OHSD, the public and private evaluation datasets have a strong positive correlation despite being collected from two different sources: the original datasets are from Twitter, while ours are from Aljazeera comments.
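The correlation reported in the last row of Table 3 is a plain Pearson coefficient over the seven per-model score pairs. A minimal sketch, using the MQ2Q column as an example (scores copied from Table 3):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# MQ2Q scores per baseline (Table 3), private vs public evaluation sets.
private = [0.8569, 0.8324, 0.7579, 0.7650, 0.7050, 0.6680, 0.6260]
public  = [0.9523, 0.9573, 0.9389, 0.8451, 0.9018, 0.8869, 0.8096]
print(round(pearson(private, public), 4))  # ≈ 0.7757, matching Table 3
```

The same computation over the SVREG, SEC, OOLD, and OHSD columns yields the remaining values in the Correlation row.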
7 Conclusion
In this paper, we introduce the ALUE benchmark, with the purpose of providing a platform for researchers interested in pushing the state of the art in Arabic NLU. It consists of 8 previously published tasks, with 5 privately curated evaluation datasets to ensure the validity of the leaderboard. We evaluated the soundness of these 5 evaluation sets, finding a positive correlation between the ones we developed and the original ones. In addition, we built a novel diagnostic dataset that helps analyze the results of models against a comprehensive range of linguistic phenomena. Our initial experiments show that MTL approaches outperform their single-model-per-task counterparts, but to keep our leaderboard attractive for participation, we decided to use only single-task models as our baselines. Our BERT baselines outperform all other models, especially when the pre-training data is not confined to MSA-dominant corpora but contains dialectal varieties of Arabic as well. On the diagnostic dataset, we found that our baselines struggle to capture many of the linguistic phenomena it represents, which suggests that there is plenty of room for improvement in the state of the art for Arabic NLU. We hope that ALUE will be an integral part of the efforts to push said state of the art in the coming few years.
8 Acknowledgements
We would like to express our gratitude to our partners in this project, the MIND Lab at the American University of Beirut (AUB) and the CAMeL Lab at New York University Abu Dhabi. Their outstanding support and ongoing guidance were invaluable to the work presented in this paper. We would also like to thank the team members of the Mawdoo3 AI Data Annotation and Linguistics departments for their contributions in annotating the privately held evaluation datasets, namely: Manar Salous, Yasmin Al Momani, Othman Abu Saa', Mohammad Saleh, and Mariam Arnaout. In addition, we thank Abdallah Abu Sham and Wael Gaith from Mawdoo3 for their help in developing and deploying our leaderboard website. Finally, special thanks go to Mawdoo3 Ltd. for supporting our research efforts and for making the evaluation datasets publicly available for further research on Arabic NLP/NLU.
References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Laith H. Baniata, Seyoung Park, and Seong-Bae Park. 2018. A neural machine translation model for Arabic dialects that utilizes multitask learning (MTL). Computational Intelligence and Neuroscience, 2018.

Johan Bos and Katja Markert. 2005. Recognising textual entailment with logical inference. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 628–635, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Houda Bouamor, Sabit Hassan, and Nizar Habash. 2019. The MADAR shared task on Arabic fine-grained dialect identification. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 199–207.

Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 55–64, Brussels, Belgium. Association for Computational Linguistics.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Murhaf Fares, Andrey Kutuzov, Stephan Oepen, and Erik Velldal. 2017. Word vectors, reuse, and replicability: Towards a community repository of large-text resources. In Proceedings of the 21st Nordic Conference on Computational Linguistics, pages 271–276, Gothenburg, Sweden. Association for Computational Linguistics.

Abed Alhakim Freihat, Gabor Bella, Hamdy Mubarak, and Fausto Giunchiglia. 2018. A single-model approach for Arabic segmentation, POS tagging, and named entity recognition. In 2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP), pages 1–8. IEEE.

Bilal Ghanem, Jihen Karoui, Farah Benamara, Véronique Moriceau, and Paolo Rosso. 2019. IDAT at FIRE2019: Overview of the track on irony detection in Arabic tweets. In Proceedings of the 11th Forum for Information Retrieval Evaluation, pages 10–13.

Jan Gorodkin. 2004. Comparing two K-category assignments by a K-category correlation coefficient. Computational Biology and Chemistry, 28(5-6):367–374.

Armand Joulin, Édouard Grave, Piotr Bojanowski, and Tomáš Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431.

Svetlana Kiritchenko and Saif M. Mohammad. 2016. Capturing reliable fine-grained sentiment associations by crowdsourcing and best–worst scaling. In Proceedings of NAACL-HLT, pages 811–817.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 Task 1: Affect in tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 1–17.

Hamdy Mubarak, Kareem Darwish, and Walid Magdy. 2017. Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pages 52–56.

Hamdy Mubarak, Kareem Darwish, Walid Magdy, Tamer Elsayed, and Hend Al-Khalifa. 2020. Overview of OSACT4 Arabic offensive language detection shared task. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pages 48–52.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.

Ali Safaya, Moutasem Abdullatif, and Deniz Yuret. 2020. KUISAIL at SemEval-2020 Task 12: BERT-CNN for offensive speech identification in social media. In Proceedings of the International Workshop on Semantic Evaluation (SemEval).

Haitham Seelawi, Ahmad Mustafa, Wael Farhan, and Hussein T. Al-Natsheh. 2019. NSURL-2019 Task 8: Semantic question similarity in Arabic. In Proceedings of the First Workshop on NLP Solutions for Under Resourced Languages, NSURL '19, Trento, Italy.

Abu Bakr Soliman, Kareem Eissa, and Samhaa R. El-Beltagy. 2017. AraVec: A set of Arabic word embedding models for use in Arabic NLP. Procedia Computer Science, 117:256–265.

Toshiyuki Takezawa, Genichiro Kikui, Masahide Mizushima, and Eiichiro Sumita. 2007. Multilingual spoken language corpus development for communication research. International Journal of Computational Linguistics & Chinese Language Processing, 12(3):303–324.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962v2.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3261–3275.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, et al. 2019. Multilingual universal sentence encoder for semantic retrieval. arXiv preprint arXiv:1907.04307.

Nasser Zalmout and Nizar Habash. 2019. Adversarial multitask learning for joint multi-feature and multi-dialect morphological modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1775–1786, Florence, Italy. Association for Computational Linguistics.

Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations parallel corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3530–3534.
A Additional Benchmark Details

Each of our baselines is adapted to each ALUE task as needed. As such, extra trainable layers are added to each model to consume its outputs, and these layers in turn are trained to make task-specific predictions. Please consult the subsections below, in addition to the publicly available GitHub repo6, for exact details on the fine-tuning process for each model.

A.1 BERT
All of our BERT models' weights are fine-tuned per task, using the [CLS] token, along with a dropout layer followed by a linear output layer that is suitable for each task.

A.2 USE
The produced sentence representations are fed into a feed-forward network that is designed and trained independently for each task.

A.3 ELMo
Given that the size of the embeddings produced by this model is 1024, we use a BiLSTM for each task with a hidden size of 1024 in both directions. The hidden states of the last token are then fed to the appropriate feed-forward network for a given task. No fine-tuning of the ELMo model itself is done.

A.4 AraVec
We use the unigram skip-gram model with a vector size of 300, trained on 66.9M Arabic tweets. This is mainly because skip-gram embeddings provide better representations for less frequent words compared to continuous bag of words. The embedding of each token in a given sentence is fed into a BiLSTM and a feed-forward network, which are trained on each task separately, similar to ELMo.

A.5 FastText
These are pre-trained word embeddings that use subword information to avoid out-of-vocabulary issues. The Arabic model is mainly trained on Arabic Wikipedia dumps and Arabic content from the Common Crawl. The embeddings come in a fixed size of 300. We use them to train a BiLSTM and a feed-forward network for each task, similar to ELMo and AraVec.

B Benchmark Website Details

Our online leaderboard is powered by CodaLab7, but is self-hosted on our own servers. Many high-caliber Arabic NLP/NLU workshops are typically hosted on CodaLab, and as such many researchers in the field are already familiar with its interface. This was a deciding factor in selecting it to power our leaderboard.

C Additional Diagnostics Details

The diagnostic dataset is composed of sentence pairs. Each pair consists of a premise and a hypothesis sentence, and is tagged with an entailment label (entailment, neutral, or contradiction), in addition to the linguistic phenomena involved in the relationship. Linguistic phenomena tags are divided into 4 coarse-grained categories: Lexical Semantics (LS), Predicate-Argument Structure (PAS), Logic (L), and World Knowledge (K). When applicable, coarse-grained categories are tagged with fine-grained categories. Each sentence pair is followed by its inverse form to establish a confusion of the entailment.

C.1 Diagnostics Categories
The original GLUE diagnostic dataset uses 4 coarse-grained categories, further divided into 33 fine-grained categories. For an explanation of these phenomena, please consult the related sections in the original GLUE paper by Wang et al. (2019b). In addition to these, we added the following fine-grained categories, some of which are specific to Arabic:

Lexical Implication: a verb Y is entailed by X if by doing X you must be doing Y:
• Saeed is snoring entails but is not entailed by Saeed is asleep.

Topicalization: a syntactic movement where an argument is moved to the beginning of the sentence to put emphasis on it:
• The city of Amman is located in Jordan entails and is entailed by The city of Amman, it is located in Jordan.

Reciprocity: an alternation resulting in the realization of the object as a part of a conjoined subject:
• Reptiles fight each other by biting and scratching entails and is entailed by Reptiles fight (each other) by biting and scratching.

Causative/Inchoative: an alternation that expresses a change of state, leading to the omission of the agent. This case is marked by the addition of an inchoative n morpheme in Arabic:
• Saeed broke the vase entails but is not entailed by The vase broke.

Adjectivation: use of relational adjectives instead of the entity they describe, e.g. China/Chinese:
• These technologies will help strengthen the Chinese social security system entails and is entailed by These technologies will help strengthen the social security system of China.

6 https://github.com/hseelawi/alue_baselines
7 https://codalab.org/
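As a rough illustration of the per-task heads described in Appendix A for the static-embedding baselines (ELMo, AraVec, FastText), the sketch below feeds pre-computed token embeddings through a BiLSTM and a feed-forward classifier. The hidden size, class count, and layer layout here are our own illustrative assumptions, not the exact configuration used for the baselines.

```python
import torch
import torch.nn as nn

class BiLSTMTaskHead(nn.Module):
    """A per-task head over frozen, pre-computed token embeddings:
    a BiLSTM followed by a feed-forward classifier (cf. Appendix A.3-A.5).
    Sizes are illustrative; e.g. the paper uses a hidden size of 1024 for ELMo."""

    def __init__(self, emb_dim=300, hidden=256, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.ff = nn.Sequential(
            nn.Linear(2 * hidden, hidden),  # forward + backward states
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, embeddings):          # (batch, seq_len, emb_dim)
        outputs, _ = self.lstm(embeddings)  # (batch, seq_len, 2 * hidden)
        last = outputs[:, -1, :]            # hidden states at the last token
        return self.ff(last)                # (batch, num_classes)

# Example: a batch of 2 sentences of 7 tokens with 300-d embeddings.
head = BiLSTMTaskHead()
logits = head(torch.randn(2, 7, 300))
print(logits.shape)  # torch.Size([2, 3])
```

Only this head would be trained per task; the underlying embedding model stays frozen, as stated for ELMo in Appendix A.3.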