The proposed system collects
data from the Twitter social networking site and processes it using NLP techniques.
We use two approaches: sentiment mining and data mining. Sentiment
mining is used for unstructured, real-time data, while data mining is used
for structured, historical data.
The system consists of the following modules:
1) Data collection
2) Sentiment mining
3) Data mining
4) Output classification
Data collection module
The tweets are fetched using the
Twitter API. The API provides a user-friendly programming interface through
which tweets are downloaded in a tweet-object format. This object format helps
in extracting specific tweet attributes such as
user name, location, time, re-tweet count, etc. Once the data is fetched, the gathered
data is pre-processed to extract features.
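As a minimal sketch of this step, the snippet below assumes each tweet object has already been parsed from the API's JSON response into a Python dict (the sample tweet and its field names follow the Twitter JSON layout, but are illustrative):

```python
# Pull out the specific tweet attributes used by the later modules,
# assuming the tweet object is a parsed JSON dict.
def extract_attributes(tweet):
    return {
        "user_name": tweet.get("user", {}).get("screen_name"),
        "location": tweet.get("user", {}).get("location"),
        "time": tweet.get("created_at"),
        "retweet_count": tweet.get("retweet_count", 0),
        "text": tweet.get("text", ""),
    }

# Illustrative tweet object, not real data.
sample = {
    "user": {"screen_name": "alice", "location": "Pune"},
    "created_at": "Mon May 06 20:01:29 +0000 2019",
    "retweet_count": 3,
    "text": "Example tweet text",
}
print(extract_attributes(sample))
```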
Sentiment mining
The sentiment mining system identifies the sentiment of a tweet without knowing its previous
background. Before applying the algorithm, data pre-processing is required. The data
undergoes the following processes:
1) Stop word removal
2) Repeated letters removal
3) Noise data removal
4) Parsing and tokenization
1) Stop word removal: Stop
words are words that generally do not carry any useful information but are added to make the
sentence grammatical. For example, prepositions like on, in, to, above,
articles like a, the, an, and question words like who, what, where, how
generally do not add any information to the content, yet they occur in large
numbers in a sentence. These
words are therefore removed from a sentence before applying the algorithm.
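The step above can be sketched as follows; the stop-word set here is a small sample, not the full list a real system would use:

```python
# Small illustrative stop-word list drawn from the examples in the text.
STOP_WORDS = {"a", "an", "the", "on", "in", "to", "above",
              "who", "what", "where", "how"}

def remove_stop_words(tokens):
    # Keep only tokens that are not stop words (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the thief broke in to a house".split()))
# ['thief', 'broke', 'house']
```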
2) Repeated letters removal: People tend to show
their emotional state by repeating letters in words in their tweets, like
'happpppyyyyy'. In English, a word contains a letter repeated at most twice
consecutively. If a letter is repeated more than twice consecutively, the number of
its occurrences is reduced to two. Thus 'happpppy' becomes 'happy'.
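This rule can be implemented with a single regular expression, as in this sketch:

```python
import re

# Any run of three or more identical consecutive characters is cut down
# to exactly two, matching the rule described above.
def reduce_repeated_letters(text):
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

print(reduce_repeated_letters("happpppy"))  # happy
```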
3) Noise data removal: By
noise data we mean the unwanted
data in the tweets, such as URLs, hashed words, user names, etc. The URLs
present in the tweets are removed.
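One way to strip this noise is with a few regular expressions; the patterns below are a sketch covering the three noise types named above:

```python
import re

def remove_noise(text):
    text = re.sub(r"https?://\S+", "", text)  # URLs
    text = re.sub(r"#\w+", "", text)          # hashed words (hashtags)
    text = re.sub(r"@\w+", "", text)          # user names (mentions)
    return " ".join(text.split())             # collapse leftover whitespace

print(remove_noise("Robbery reported @user near station #crime http://t.co/xyz"))
```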
4) Parsing and tokenization: Once the data is
cleansed, parsing and tokenization are done. Tokenization breaks
a stream of text into tokens, usually by looking for whitespace, and helps in
identifying the part of speech of each word in a sentence. A parser
then takes the stream of tokens as its input.
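The whitespace tokenization described above reduces to a one-line sketch (a real system might instead use an NLP toolkit's tokenizer, which also handles punctuation):

```python
# Break a stream of text into tokens by splitting on whitespace.
def tokenize(text):
    return text.split()

print(tokenize("robbery reported near station"))
# ['robbery', 'reported', 'near', 'station']
```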
Data mining
Data mining is used to find the
intensity of crime using the Naive Bayes algorithm. Before applying the algorithm, the data
undergoes the following steps:
1) Feature extraction
2) Normalization
3) Data training
1) Feature extraction: After the tweets are downloaded, specific tweet attributes
such as user name, location, time, re-tweet
count, etc. are extracted. All the extracted features are stored in a database.
2) Normalization: The extracted
tweets are converted into a normal form for easy use and access. In normalization,
each attribute is given an index, and the attributes are subsequently referenced by these indices.
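The indexing described above can be sketched as a simple mapping from each distinct attribute value to an integer (the location values here are illustrative):

```python
# Assign each distinct attribute value an integer index, in order of
# first appearance, so later stages can refer to values compactly.
def build_index(values):
    index = {}
    for v in values:
        if v not in index:
            index[v] = len(index)
    return index

locations = ["Pune", "Mumbai", "Pune", "Delhi"]
loc_index = build_index(locations)
print(loc_index)                          # {'Pune': 0, 'Mumbai': 1, 'Delhi': 2}
print([loc_index[v] for v in locations])  # [0, 1, 0, 2]
```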
3) Data training and trained module:
Algorithms learn from data. They
find relationships, make decisions, and evaluate their confidence from the
training data they are given, and the better the training data is, the better
the model performs.
Training is applied to the data
set. New data is then input to the trained module, which predicts the output intensity.
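A minimal sketch of training and prediction with Naive Bayes follows. This is a bare multinomial Naive Bayes with Laplace smoothing on made-up example tweets; the paper's actual training data and features are not shown in this section.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        best, best_score = None, float("-inf")
        total = sum(self.class_counts.values())
        for label in self.class_counts:
            # Log prior plus log likelihood of each token under this class.
            score = math.log(self.class_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for t in tokens:
                score += math.log((self.word_counts[label][t] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Illustrative training set, not the system's real data.
docs = [["robbery", "reported"], ["stolen", "car"],
        ["sunny", "day"], ["good", "morning"]]
labels = ["crime", "crime", "no_crime", "no_crime"]
model = NaiveBayes().fit(docs, labels)
print(model.predict(["robbery", "car"]))  # crime
```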
4) Output classification
If a tweet is related to crime, it is
divided into the type of crime. The system mainly uses the following types of crime:
1) Crime against person
2) Crime against property
3) Crime against country
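As a hypothetical sketch of this final step, a keyword lookup could map a tweet onto one of the three types; the keyword sets below are purely illustrative and are not the system's actual features:

```python
# Illustrative keyword sets for each crime type (assumed, not from the paper).
CRIME_TYPES = {
    "crime against person": {"murder", "assault", "kidnap"},
    "crime against property": {"robbery", "theft", "burglary"},
    "crime against country": {"terrorism", "sedition"},
}

def crime_type(tokens):
    # Return the first crime type whose keywords overlap the tweet tokens.
    for ctype, keywords in CRIME_TYPES.items():
        if keywords & set(tokens):
            return ctype
    return "unclassified"

print(crime_type(["robbery", "reported", "near", "station"]))
# crime against property
```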
RESULTS AND DISCUSSION
We chose the negation algorithm as our main classifier, and the
results are based on those experiments. To determine the accuracy of the
system, we worked on a random sample of 1000 tweets, of which 60% were
non-crime and the rest were crime. The classes for these tweets were already known; of
those 1000 tweets, 93-95% were classified without mistake.