3. Proposed Approach
In every language, letters are the smallest components. Letters compose
words, and words in turn compose phrases and sentences. In sentiment analysis
(SA) research, however, posts and feedback are usually short and, in most
cases, contain only a few sentences or phrases. Tweets, for example, are short
personal viewpoints. For such cases, sentence-level SA can be very useful.
Sentence structure is made up of different components, such as parts of speech
(POS), negations, modifiers, clauses, and context. Our approach is to use a
sentiment lexicon to compute the sentiment value of each component. We then
combine these values to obtain the sentiment polarity of the entire sentence.
The goal of the lexicon-based approach is to label each tweet with a suitable
sentiment label: positive, negative, or neutral.
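The combination step can be illustrated with a minimal sketch. It assumes that "combining" means summing the per-token lexicon values and taking the sign of the total; the function name and the toy lexicon entries are hypothetical and only serve the illustration.

```python
def sentence_polarity(tokens, lexicon):
    """Sum the lexicon value of every token and map the total to a label.

    Tokens missing from the lexicon are treated as neutral (value 0),
    which is an assumption of this sketch.
    """
    total = sum(lexicon.get(tok, 0) for tok in tokens)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


# Toy lexicon with hypothetical entries: 1 positive, -1 negative, 0 neutral.
toy_lexicon = {"good": 1, "excellent": 1, "bad": -1, "market": 0}
label = sentence_polarity(["the", "market", "is", "good"], toy_lexicon)
```

Summation is only one possible way to combine values; other schemes (for example, weighting by position or handling negation) would replace the `sum` line.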
3.1 Data Collection
The first step is to collect tweets through the Streaming API provided by
Twitter. This step was accomplished in the Python programming language with
the tweepy library. The tweets concerned consumer confidence in the economic
situation in Sudan. All the data was in Arabic and was collected under the
keywords: ………………………………
The following subsection describes text pre-processing and filtering, which
are necessary for user comments because ordinary people post them directly,
without any review by social network administrators or owners. As a result,
comments and user feedback are unstructured. Further, they may contain a
mixture of words and characters and may also have spelling mistakes. In
addition, since tweets are our focus here, we should pay attention to their
special properties and symbols, such as #, @, URLs, and usernames. By removing
usernames, hyperlinks, URLs, repeated letters, etc., we can reduce the number
of features by about 54% (Go et al., 2009).
The removal of usernames had the largest impact on the reduction, which is
plausible, since the uniqueness of usernames creates many new features.
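A minimal sketch of this kind of tweet filtering is given here, using only the standard `re` module. The regular expressions, the function name, and the choice to squeeze runs of three or more identical characters down to two are assumptions made for illustration, not the exact rules used in this work.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # hyperlinks
MENTION_RE = re.compile(r"@\w+")                # @usernames
REPEAT_RE = re.compile(r"(.)\1{2,}")            # same character 3+ times in a row


def clean_tweet(text):
    """Strip URLs, usernames and '#', then squeeze repeated characters."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")          # keep the hashtag word, drop the symbol
    text = REPEAT_RE.sub(r"\1\1", text)    # "sooooo" -> "soo"
    return " ".join(text.split())          # normalize whitespace


cleaned = clean_tweet("@user sooooo good http://t.co/x #economy")
```

Because `\w` is Unicode-aware for Python strings, the same patterns apply to Arabic usernames and text.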
3.2 Text Pre-processing
Correcting simple typing errors, such as misspellings and repeated letters,
is an important part of this work. Dictionaries are used for this task. Before
correction, the tweets were filtered so that unrelated data was deleted, and
they were then tokenized.
Similarly, abbreviations and acronyms are replaced with words from a
predefined dictionary. The following three types of errors, whether
intentional or unintentional, occur in typing and writing:
1- Repetition of vowels, such as ????????????????????. In other words, the
pre-processing tools should be able to handle the overuse of characters
properly. Note that while this task is typical in natural language processing
(NLP), it is of particular interest to our work since, in tweets, repetition
may be used to stress or highlight some words or feelings.
2- Keyboard mistyping, which may have several causes:
a- Space bar issues, which might be due to missing spaces as in ????????, which should be ???? ????, or
extra spaces as in ?? ??, which should be ????.
b- Keyboard proximity, where words such as ??? might be written as ??? since
? and ? are next to each other on the keyboard.
c- Similarity between characters, which causes confusion, such as ???? and ????.
3- Phonetic errors, which are errors based on the sounds of words:
a- Homophones: two words that sound the same, such as ??? and ???.
b- Spoonerisms: switching two letters, as in ??? and ???.
A specialized tool was developed to address all of these error types. The
tool exploits an Arabic dictionary to make sure that the resulting text is
written correctly.
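The following is a rough sketch of how dictionary-driven correction of repeated letters and space bar issues might look. It is not the tool developed in this work: the function names are hypothetical, and a toy Latin-script vocabulary stands in for the Arabic dictionary purely to keep the example readable.

```python
import re


def fix_repeats(token, vocab):
    """Squeeze runs of a repeated letter to two, then to one, until a
    dictionary word is found; otherwise return the token unchanged."""
    if token in vocab:
        return token
    for keep in (r"\1\1", r"\1"):
        candidate = re.sub(r"(.)\1+", keep, token)
        if candidate in vocab:
            return candidate
    return token


def fix_missing_space(token, vocab):
    """Split a token into two dictionary words (first valid split wins)."""
    if token in vocab:
        return [token]
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in vocab and right in vocab:
            return [left, right]
    return [token]


def fix_extra_space(tokens, vocab):
    """Merge two adjacent tokens when only their joined form is a word."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            joined = tokens[i] + tokens[i + 1]
            both_known = tokens[i] in vocab and tokens[i + 1] in vocab
            if joined in vocab and not both_known:
                out.append(joined)
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out


toy_vocab = {"good", "price", "market"}  # stand-in for the Arabic dictionary
```

Keyboard-proximity and phonetic errors would need an additional candidate generator (for example, a map of neighboring keys), which this sketch leaves out.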
Another feature reduction strategy we applied is to remove irrelevant
content, namely stop words. Stop words generally fall within the neutral
polarity and hence do not add any value to the polarity decision. Stop words
are usually filtered out during the pre-processing of text based on their
usefulness for the target application. In SA, filtering stop words reduces the
number of words for which the sentiment value must be computed, which saves
space and time. Examples of stop words in Arabic include ????, ????, and ???.
The main problem with stop words is the lack of a definitive list that all
systems can use. For Arabic, Khoja introduced a list of 168 stop words (Khoja
and Garside, 1999), which has been used repeatedly in the literature. A larger
list was created later by Chen and Gey; it was translated from an English list
and contained 1,131 stop words. In our system, we use an online list of stop
words.
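Stop-word filtering itself is a simple membership test, as the short sketch below shows. The five Arabic function words in `sample_stop_words` are an illustrative sample chosen for this example, not the online list used in our system.

```python
def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]


# Small illustrative sample of common Arabic prepositions (an assumption).
sample_stop_words = {"في", "من", "على", "إلى", "عن"}
tokens = ["الأسعار", "في", "السوق", "مرتفعة"]
filtered = remove_stop_words(tokens, sample_stop_words)
```

Storing the list as a set keeps each lookup constant-time, which matters when the list grows to a thousand entries or more.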
A negation word may change the polarity of a sentence. For example, in the
phrase ‘?? ????’ the word ‘????’ has a positive emotional value, but the
negation particle ‘??’ makes the whole sentence a negative one. In fact,
negation is so important that it requires special attention. Its importance
stems from the potentially strong influence it has on the sentiment
orientation of the sentences in which it occurs. A simple sentence filled with
words of strong positive sentiment can easily be turned into a negative
sentence by the mere introduction of a negation particle. On the other hand,
negation can sometimes have only a limited effect on the words immediately
following the negation particle. Moreover, different negation particles vary
in their influence.
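One simple way to model negation in a lexicon-based scorer is to flip the sign of the sentiment words that fall within a fixed window after a negation word. The sketch below assumes exactly that; the window size, the function name, and the English toy entries are hypothetical, and the varying strength of different negation words is not modeled.

```python
def score_with_negation(tokens, lexicon, negators, window=1):
    """Sum lexicon values, inverting those within `window` tokens
    after a negation word."""
    total = 0
    flips_left = 0
    for tok in tokens:
        if tok in negators:
            flips_left = window      # open a new negation scope
            continue
        value = lexicon.get(tok, 0)
        if flips_left > 0:
            value = -value           # token is inside the negation scope
            flips_left -= 1
        total += value
    return total


toy_lexicon = {"good": 1, "bad": -1}
toy_negators = {"not"}
score = score_with_negation(["not", "good"], toy_lexicon, toy_negators)
```

A larger `window` extends the reach of a negation word, and per-word weights could replace the plain sign flip if different negation words should differ in influence.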
Stemming is the process of reducing words to their base forms. Unlike
English, Arabic does not rely solely on structured ways of derivation; hence,
the stemming process is complicated by the language’s complex morphological
structure. Words in Arabic are derived from roots as the bases, which, in most
cases, consist of three or four letters. To derive a new word, different
prefixes and suffixes are usually used with the root letters to generate a new
word form and meaning. We use a simple stemmer that does not deal with
patterns or infixes. Instead, it mainly removes prefixes and/or suffixes. It
removes the letter ‘?’ from the beginning of the word only if the resulting
word is more than three letters. It also removes the definite articles ??/?/?
from the beginning of the word only if the resulting word is two or more
letters. As for suffixes, it removes ?, ?, ?, ??, ??, ?? from the end of the
word (longer first) only if the resulting word is two letters or more.
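The three stripping rules can be sketched as follows. The length thresholds follow the description above, but the affix lists are illustrative assumptions (a leading conjunction, one definite-article form, and a few common suffixes) and are not the lists used in this work.

```python
# Illustrative affix lists (assumptions, not the lists used in this work).
CONJUNCTION = "و"
ARTICLES = ("ال",)
SUFFIXES = ("ها", "ات", "ون", "ه", "ة", "ي")  # longer suffixes listed first


def light_stem(word):
    """Strip at most one conjunction, one article and one suffix."""
    # Leading conjunction: only if more than three letters remain.
    if word.startswith(CONJUNCTION) and len(word) - 1 > 3:
        word = word[1:]
    # Definite article: only if at least two letters remain.
    for art in ARTICLES:
        if word.startswith(art) and len(word) - len(art) >= 2:
            word = word[len(art):]
            break
    # Suffix (longer first): only if at least two letters remain.
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 2:
            word = word[:-len(suf)]
            break
    return word


stem = light_stem("والسوق")
```

The length guards are what keep short words intact: a three-letter word that merely begins with the conjunction letter is returned unchanged.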
A sentiment lexicon is a special lexicon that contains a set of words along
with their sentiment polarity. In our lexicon, the sentiment value is 1 for
positive, -1 for negative, and 0 for neutral.
The automatic construction of sentiment lexicons is a very challenging task
that affects the success of lexicon-based SA approaches. The challenges come
from the complexity of the Arabic language and the huge number of words to be
considered. Moreover, determining the sentiment values of many words can be
very difficult for reasons such as the different meanings and connotations
(and thus the possibly different sentiment value) of each word depending on
the context and the cultural background of the person posting the tweet.
The general challenges of building a lexicon can be similar to those of
building a parser in general, where the size of the input or evaluated data is
large.
A sentiment lexicon of Arabic words is built and evaluated in this work.
Thorough analysis and experimentation are conducted to evaluate the quality of
the developed lexicon. Since the NLP literature for the Arabic language lacks
many of the fundamental tools and resources to help in building a sentiment
lexicon, we resort to benefiting from an English sentiment lexicon. The
process of building our sentiment lexicon is divided into three steps: collect
Arabic stems from tweets as a testing lexicon, translate an online English
sentiment lexicon to Arabic, and use its values to determine the polarity of
the Arabic stems.
All languages have huge numbers of words, which means that the task of
collecting them all is very challenging and tedious. To do so for the Arabic
language, we start by taking the stems from a famous publicly available
dataset to form the basis of our lexicon. We then use our own tool to process
these articles and extract distinct Arabic stems from them. In the second
phase, we exploit a free online machine translation service to translate the
collected stems. After omitting the repeated stems, we are left with about
5230 Arabic/Sudanese terms in our lexicon. The Sudanese terms come from the
testing tweets.
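The lexicon-building steps can be sketched roughly as below. The `translate` argument is a hypothetical callable standing in for the machine translation service, and the toy stems, translations, and English polarity values are made up for illustration. Assigning 0 (neutral) to stems whose translation is not found in the English lexicon is also an assumption of this sketch.

```python
def build_lexicon(arabic_stems, translate, english_lexicon):
    """Map each distinct Arabic stem to the polarity of its translation."""
    lexicon = {}
    for stem in arabic_stems:
        if stem in lexicon:              # omit repeated stems
            continue
        english = translate(stem)        # hypothetical translation step
        lexicon[stem] = english_lexicon.get(english, 0)
    return lexicon


# Toy data: a dictionary lookup replaces the online translation service.
toy_translations = {"جيد": "good", "سيئ": "bad", "سوق": "market"}
toy_english_lexicon = {"good": 1, "bad": -1}
stems = ["جيد", "سيئ", "جيد", "سوق"]
arabic_lexicon = build_lexicon(stems, toy_translations.get, toy_english_lexicon)
```

In practice the translated entries would still need manual review, since a word's connotation can change across languages and dialects.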