
3. Proposed approach

As in all languages, letters are the smallest components. Letters compose
words, and phrases and sentences are in turn composed of words. In sentiment
analysis (SA) research, however, posts and feedback are usually not long and,
in most cases, contain only a few sentences or phrases. Tweets, for example,
are short personal viewpoints. For such cases, sentence-level SA can be very
useful. Sentence structure is made up of different components, such as parts
of speech (POS), negations, modifiers, clauses and context. Our approach is
to use a sentiment lexicon to compute the sentiment value of each component. We
then combine these values to obtain the sentiment polarity of the entire
sentence. The goal of the lexicon-based approach is to label each tweet with
a suitable sentiment label: positive, negative or neutral.
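
The combination step can be illustrated with a minimal sketch. It assumes the
simplest possible rule, summing the lexicon value of every token, and the
lexicon entries and the name label_sentence are hypothetical, chosen only for
illustration:

```python
# Toy lexicon (hypothetical entries): 1 = positive, -1 = negative, 0 = neutral.
TOY_LEXICON = {"جيد": 1, "ممتاز": 1, "ضعيف": -1, "السوق": 0}


def label_sentence(tokens, lexicon):
    """Sum the polarity of every known token and map the total to a label."""
    # Tokens missing from the lexicon are treated as neutral (value 0).
    total = sum(lexicon.get(token, 0) for token in tokens)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


print(label_sentence(["الاقتصاد", "جيد"], TOY_LEXICON))
print(label_sentence(["الاقتصاد", "ضعيف"], TOY_LEXICON))
```

A plain sum ignores negations and modifiers, which is why those components
need the separate treatment described in the following subsections.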




3.1 Data collection

The first step is to collect tweets through the Streaming API provided by
Twitter. This step was accomplished in the Python programming language
with the tweepy library. The tweets were about consumer confidence in the
economic situation in Sudan. All the data was in Arabic, collected under the
keywords: ………………………………

The following subsection describes text pre-processing and
filtering, which are necessary when dealing with comments because ordinary
people post comments directly, without any review by social network
administrators or owners. That is why comments and users' feedback are
unstructured. Further, they may contain a mixture of words and characters and
may also have spelling mistakes. In addition, since tweets are our focus here,
we should pay attention to their special properties and symbols, such as
#, @, URLs and usernames. By removing usernames, hyperlinks, URLs,
repeated letters, etc., we can reduce the number of features by about 54%
(Go et al., 2009).

The removal of the usernames had the largest impact on the
reduction, which is plausible, since the uniqueness of the usernames creates
many new features.
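
A minimal sketch of this kind of filtering, using Python's standard re module.
The function name clean_tweet and the exact patterns are assumptions made for
illustration, not the filters used in this work:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # hyperlinks and URLs
MENTION_RE = re.compile(r"@\w+")                # usernames
REPEAT_RE = re.compile(r"(.)\1{2,}")            # a character repeated 3+ times


def clean_tweet(text):
    """Strip URLs, usernames and '#' signs, and collapse repeated letters."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")        # keep the hashtag word, drop the sign
    text = REPEAT_RE.sub(r"\1", text)    # a run of 3+ copies becomes one copy
    return " ".join(text.split())        # normalise whitespace


print(clean_tweet("@user1 الاقتصاد جمييييل http://t.co/abc #السودان"))
```

Collapsing every long run to a single character is a deliberate
simplification; it can damage words that legitimately contain a doubled
letter, so a dictionary check is the safer choice for the correction step.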

3.2 Text Pre-processing

This work focuses on correcting simple typing errors, such as
misspellings and repeated letters. For this task, dictionaries are
used. Before correction, the tweets were filtered, with all
unrelated data deleted, and then tokenized.

Similarly, abbreviations and acronyms are replaced with words from a
predefined dictionary. The following three types of errors, whether intentional
or unintentional, occur in typing and writing.

1- Repetition of vowels, such as ????????????????????. In other words, the
pre-processing tools should be able to handle the overuse of characters
properly. Note that while this task is typical in natural language processing
(NLP), it is of particular interest to our work since, in tweets, repetition
may be used to stress or highlight some words or feelings.

2- Keyboard mistyping, which may have several causes:

a- Space bar issues, which might be due to missing spaces, as in ????????,
which should be ???? ????, or extra spaces, as in ?? ??, which should be ????

b- Keyboard proximity, where words such as ??? might be written as ??? since
? and ? are next to each other on the keyboard

c- Similarity between characters, which causes confusion, such as ???? and
????.

3- Phonetic errors, which are errors based on the sounds of a word:

a- Homophones: two words that sound the same, such as ??? and ???

b- Spoonerisms: switching two letters, as in ??? and ???

A specialized tool was developed to address all of these error
types. The tool exploited an Arabic dictionary to validate the resulting text.
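
How a dictionary can drive such corrections is sketched below for three of the
error types: repeated letters, missing spaces and two switched letters. This
is not the tool developed in this work. The three-word DICTIONARY and the
names collapse_runs and correct_token are hypothetical, and extra spaces,
keyboard proximity and homophones are not handled:

```python
import re

# Hypothetical mini-dictionary standing in for a full Arabic word list.
DICTIONARY = {"جميل", "جدا", "الله"}


def collapse_runs(word, keep):
    """Shorten every run of a repeated character to `keep` copies."""
    return re.sub(r"(.)\1+", lambda m: m.group(1) * keep, word)


def correct_token(token, dictionary):
    """Return a list of corrected words, or [token] if no fix is found."""
    if token in dictionary:
        return [token]
    # 1) Repeated letters: try keeping two copies of each run, then one.
    for keep in (2, 1):
        candidate = collapse_runs(token, keep)
        if candidate in dictionary:
            return [candidate]
    # 2) Missing space: split the token into two dictionary words.
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in dictionary and right in dictionary:
            return [left, right]
    # 3) Switched letters: swap each pair of adjacent characters.
    for i in range(len(token) - 1):
        swapped = token[:i] + token[i + 1] + token[i] + token[i + 2:]
        if swapped in dictionary:
            return [swapped]
    return [token]


print(correct_token("جمييييل", DICTIONARY))
print(correct_token("جميلجدا", DICTIONARY))
```

Trying two copies before one keeps words with a legitimate doubled letter
intact, which a blanket collapse to one copy would break.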

Another feature reduction strategy we applied is to remove
irrelevant content, namely stop words. Stop
words typically fall within the neutral polarity and hence add no value to
the polarity decision. Stop words are usually filtered out during
text pre-processing based on their usefulness for the target application. In
SA, filtering stop words reduces the number of words for which the
sentiment value must be computed, which saves space and time. Examples of
stop words in Arabic include ????, ????, ???, etc.

The main problem with stop words is the lack of a
definitive list that all systems can use.

For Arabic, Khoja introduced a list of 168 stop
words (Khoja and Garside, 1999), which has been used repeatedly in the
literature. A larger list of 1,131 stop words was created later by Chen and
Gey by translating an English list.

In our system, we use an online
list of stop words.
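
Stop-word filtering itself is a simple membership test. In the sketch below
the five-word STOP_WORDS set is a hypothetical stand-in for the online list,
and remove_stop_words is a name chosen for illustration:

```python
# Hypothetical stand-in for the online stop-word list (a few prepositions).
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


print(remove_stop_words(["الأسعار", "في", "السوق", "مرتفعة"]))
```

Holding the list in a set keeps each lookup constant-time, so the cost of
filtering grows only with the number of tokens.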

A negation word may change the sentiment of a sentence. For example, in the
phrase ‘?? ????’ the word ‘????’ has a positive emotional value, but the
negation article ‘??’ makes
the whole sentence a negative one. In fact, negation, in general, is so
important that it requires special attention. Its importance stems from the
potentially strong influence it has on the sentiment orientation of the
sentences in which it occurs. A simple sentence filled with words of strong
positive sentiment can easily be turned into a negative sentence by the mere
introduction of a negation article. On the other hand, negation sometimes
has only a limited effect on the words immediately following the negation
article. Moreover, different negation articles have varying influence.
3.3 Stemming

Stemming is the process of reducing words to their base forms. Unlike English,
Arabic does not rely solely on structured ways of derivation; hence, the
stemming process in Arabic is complicated by the language's complex
morphological structure. Words in Arabic are derived from roots as the bases,
which, in most cases, consist of three or four letters. To derive a new word,
different prefixes and suffixes are usually used with the root letters to
generate a new word form and meaning. We use a simple stemmer that does not
deal with patterns or infixes. Instead, it mainly removes prefixes and/or
suffixes. For example, it removes the letter ‘?’ from the beginning of the
word only if the resulting word has more than three letters. It also removes
the definite articles ??/?/? from the beginning of the word only if the
resulting word is two or more letters long. As for suffixes, it removes
?, ?, ?, ??, ??, ?? from the end of the word (longer first) only if the
resulting word is two letters or more.
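
The length rules above can be sketched as follows. The affixes themselves are
assumptions: the prefix letter is taken to be the conjunction "و", only the
standard definite article "ال" is included, and the six suffixes are common
Arabic endings chosen for illustration. The name light_stem is hypothetical:

```python
CONJUNCTION_PREFIX = "و"                        # assumed prefix letter
DEFINITE_ARTICLES = ("ال",)                     # assumed; only the main form
SUFFIXES = ("ات", "ون", "ين", "ة", "ه", "ي")    # assumed suffix list


def light_stem(word):
    """Strip one prefix letter, one article and one suffix, subject to length rules."""
    # Prefix letter: remove only if more than three letters remain.
    if word.startswith(CONJUNCTION_PREFIX) and len(word) - 1 > 3:
        word = word[1:]
    # Definite article: remove only if two or more letters remain.
    for article in sorted(DEFINITE_ARTICLES, key=len, reverse=True):
        if word.startswith(article) and len(word) - len(article) >= 2:
            word = word[len(article):]
            break
    # Suffix: try longer suffixes first; remove only if two or more letters remain.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[:-len(suffix)]
            break
    return word


print(light_stem("والكتاب"))
print(light_stem("معلمون"))
```

The length guards are what keep short words intact: a three-letter word that
happens to start with the prefix letter is left unchanged.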

3.4 Sentiment lexicon

A sentiment lexicon is a special lexicon that contains a set of words along
with their sentiment polarity. In our lexicon, the sentiment takes the value 1
for positive, -1 for negative and 0 for neutral.

The automatic construction of sentiment lexicons is a very challenging task
that affects the success of lexicon-based SA approaches. The challenges come
from the complexity of the Arabic language and the huge number of words to be
considered. Moreover, determining the sentiment values of many words can be
very difficult for reasons such as the different meanings and connotations
(and thus the possibly different sentiment value) of each word depending on
the context and the cultural background of the person posting the tweet.

The general challenges of building a lexicon can be similar to those of
building a parser in general, where the size of the input or evaluated data is
large.

A sentiment lexicon of Arabic words is built and evaluated in this work.
Thorough analysis and experimentation are conducted to evaluate the quality of
the developed lexicon. Since the NLP literature for the Arabic language lacks
many of the fundamental tools and resources needed to build a sentiment
lexicon, we resort to benefiting from an English sentiment lexicon. The
process of building our sentiment lexicon is divided into three steps: collect
Arabic stems from tweets as a testing lexicon, take online sentiment lexicons
and translate them into Arabic, and use their values to determine the Arabic
values.

All languages have huge numbers of words, which means that the task of
collecting them all is very challenging and tedious. To do so for the Arabic
language, we start by taking the stems from a well-known publicly available
dataset to form the basis of our lexicon. We then use our own tool to process
these articles and extract distinct Arabic stems from them. In the second
phase, we exploit the free online machine translation service by Google to
translate the collected stems. After omitting the repeated ones, we are left
with about 5230 Arabic/Sudanese terms in our lexicon. The Sudanese
terms come from the testing tweets.
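
The translate-and-look-up idea can be sketched as below. No real translation
service is called: translate_stem is a hypothetical stand-in backed by a small
table, and the English lexicon entries and the name build_arabic_lexicon are
likewise assumptions made for illustration:

```python
# Toy English sentiment lexicon: 1 = positive, -1 = negative, 0 = neutral.
ENGLISH_LEXICON = {"good": 1, "weak": -1, "market": 0}

# Hypothetical translation table standing in for a machine-translation service.
TOY_TRANSLATIONS = {"جيد": "good", "ضعيف": "weak", "سوق": "market"}


def translate_stem(stem):
    """Return the English translation of a stem, or None if unknown."""
    return TOY_TRANSLATIONS.get(stem)


def build_arabic_lexicon(stems, english_lexicon, translate=translate_stem):
    """Map each distinct Arabic stem to the polarity of its English translation."""
    lexicon = {}
    for stem in stems:
        if stem in lexicon:
            continue  # omit repeated stems
        english = translate(stem)
        if english is None:
            continue  # no translation available
        polarity = english_lexicon.get(english.lower())
        if polarity is not None:
            lexicon[stem] = polarity
    return lexicon


print(build_arabic_lexicon(["جيد", "ضعيف", "جيد", "سوق"], ENGLISH_LEXICON))
```

Stems without a translation or without an English lexicon entry are skipped
here; in practice such terms, including dialect-specific ones, would need
manual labelling.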

