PROPOSED SYSTEM

The proposed system collects data from the Twitter social networking site and processes it using NLP techniques. We use two approaches: sentiment mining and data mining. Sentiment mining is used for unstructured, real-time data, while data mining is used for structured, historical data. The system consists of the following modules:

1) Data collection module
2) Sentiment mining
3) Data mining
4) Output classification

Data collection module: The tweets are fetched using the Twitter API. The API provides a user-friendly programming interface through which tweets are downloaded in tweetObject format.
This object format helps in extracting specific tweet attributes such as user name, location, time, and retweet count. Once the data is fetched, the gathered data is pre-processed to extract features.

Sentiment mining: The sentiment mining system identifies the sentiment of a tweet without knowing its previous background. Before applying the algorithm, data pre-processing is required. The data undergoes the following processes:

1) Stop word removal
2) Repeated letters removal
3) Noise data removal
4) Parsing and tokenization

1) Stop word removal: Stop words are words that generally carry no useful information but are added to make a sentence grammatical. Examples are prepositions (on, in, to, above), articles (a, the, an), and question words (who, what, where, how). They add little information to the content but occur in large numbers in a sentence, so they are removed from a sentence before applying the algorithm.

2) Repeated letters removal: People tend to show their emotional state by repeating the letters of words in tweets, as in 'happpppyyyyy'. In English, a letter is repeated at most twice in a row within a word.
If a letter is repeated more than twice consecutively, the number of its occurrences is reduced to two. Thus 'happpppy' becomes 'happy'.

3) Noise data removal: By noise data we mean the unwanted data in the tweets, such as URLs, hashtagged words, and names. The URLs present in the tweets are removed.
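The three cleaning steps described so far can be sketched as follows. This is a minimal illustration, not the system's actual code: the function names, the regular expressions, and the short stop word list (taken from the examples in the text) are assumptions, and a plain whitespace split stands in for a real tokenizer.

```python
import re

# Illustrative stop word list built from the examples in the text;
# a real system would use a much longer list.
STOP_WORDS = {"on", "in", "to", "above", "a", "the", "an",
              "who", "what", "where", "how"}

# Assumed patterns: URLs starting with http(s):// or www., and any
# letter repeated three or more times in a row.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
REPEAT_PATTERN = re.compile(r"([A-Za-z])\1{2,}")


def remove_urls(text):
    """Noise removal: drop URLs from the tweet text."""
    return URL_PATTERN.sub("", text)


def squeeze_repeats(text):
    """Reduce any letter repeated more than twice in a row to two."""
    return REPEAT_PATTERN.sub(r"\1\1", text)


def remove_stop_words(text):
    """Drop stop words, comparing case-insensitively."""
    kept = [w for w in text.split() if w.lower() not in STOP_WORDS]
    return " ".join(kept)


def clean_tweet(text):
    """Apply URL removal, repeated letter reduction, then stop word removal."""
    return remove_stop_words(squeeze_repeats(remove_urls(text)))


print(clean_tweet("I am sooooo happpppy in the park http://example.com/x"))
# prints: I am soo happy park
```

URLs are removed first so that the repeated letter rule never touches a "www" prefix. Note that 'sooooo' becomes 'soo' rather than 'so': the rule only caps repetitions at two and does not try to recover the dictionary spelling.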
4) Parsing and tokenization: Once the data is cleansed, parsing and tokenization are done. Tokenization helps in identifying the parts of a sentence and the words within it. It breaks a stream of text into tokens, usually by looking for whitespace. A parser then takes the stream of tokens.

Data mining: Data mining is used to find the intensity of crime using the Naive Bayes algorithm.
Before applying the algorithm, the data is pre-processed in the following steps:

1) Feature extraction
2) Normalization
3) Data training and trained module

Feature extraction: After the tweets are downloaded, specific tweet attributes such as user name, location, time, and retweet count are extracted. All the extracted features are stored in a database.

Normalization: The extracted tweets are converted into a normal form for easy use and access. In normalization, every attribute is given an index, and the attributes are subsequently used as indexes.

Data training and trained module: Algorithms learn from data. They find relationships, make decisions, and evaluate their confidence from the training data they are given. The better the training data is, the better the model performs.
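To illustrate what learning from training data means for a Naive Bayes classifier, the sketch below trains a small multinomial model with Laplace smoothing on a handful of tokenized examples. It is only a sketch under stated assumptions: the token-based features, the "crime" / "no crime" labels, the toy training set, and the function names are all hypothetical and are not taken from the system described here.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """Count labels and per-label tokens from (tokens, label) pairs."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in samples:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, token_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest log posterior."""
    label_counts, token_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Log prior of the label.
        score = math.log(count / total)
        # Laplace smoothing: add one to every token count.
        denom = sum(token_counts[label].values()) + len(vocabulary)
        for tok in tokens:
            if tok in vocabulary:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data, for illustration only.
training = [
    (["robbery", "bank", "gun"], "crime"),
    (["stolen", "car", "police"], "crime"),
    (["sunny", "day", "park"], "no crime"),
    (["great", "coffee", "morning"], "no crime"),
]
model = train_naive_bayes(training)
print(predict(model, ["police", "robbery"]))  # prints: crime
print(predict(model, ["sunny", "coffee"]))    # prints: no crime
```

Working in log space avoids numerical underflow when many token probabilities are multiplied, and the smoothing keeps a token that is missing from one class from forcing that class's probability to zero.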
Training is applied to the data set. New data is then input to the trained module, which predicts the output intensity of the data.

4) Output classification: If a tweet is related to crime, it is assigned to a type of crime. The system mainly uses four types of crime:

1) Crime against person
2) Crime against property
3) Crime against country
4) Other

RESULTS AND DISCUSSION

We chose the negation algorithm as our main classifier, and the results are based on those experiments. To determine the accuracy of the system, we worked on a random sample of 1000 tweets, of which 60% were no crime and the rest were crime. The classes for these were already known. Of those 1000 tweets, 93-95% were classified without mistake.

Confusion Matrix

            crime     No crime
crime       37%       4%
No crime    3%        56%

Accuracy    Recall    Precision
93%         90.24%    92.50%
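The reported accuracy, recall, and precision can be reproduced from the confusion matrix. The sketch below assumes that rows are the actual classes and columns the predicted classes, with crime as the positive class; this reading is an assumption, but it is the one under which the matrix yields the reported recall and precision. The function name is illustrative.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, recall and precision from confusion matrix cells."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total   # share of all tweets classified correctly
    recall = tp / (tp + fn)        # share of actual crime tweets that were found
    precision = tp / (tp + fp)     # share of predicted crime tweets that were right
    return accuracy, recall, precision


# Cells taken from the confusion matrix (in percent of the sample):
# actual crime    -> predicted crime 37, predicted no crime 4
# actual no crime -> predicted crime 3,  predicted no crime 56
accuracy, recall, precision = classification_metrics(tp=37, fn=4, fp=3, tn=56)
print(f"accuracy={accuracy:.2%} recall={recall:.2%} precision={precision:.2%}")
# prints: accuracy=93.00% recall=90.24% precision=92.50%
```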