Thursday, 1 March 2012

Sentiment analysis with Weka

With the ever-increasing growth of online social networking, text mining and social analytics are hot topics in predictive analytics. The standard approach to learning a document classifier is to convert unstructured text documents into the so-called bag-of-words representation and then apply a standard propositional learning scheme to the result. Essentially this means splitting documents into their constituent words, building a dictionary for the corpus, and then converting each document into a fixed-length vector of either binary word presence/absence indicators or word frequency counts. In general this involves two passes over the data (especially if further transformations such as TF-IDF are to be applied): one to build the dictionary, and a second to convert the text to vectors. After that, a classifier can be learned.
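The two passes can be sketched in plain Java. This is an illustrative toy version only, not Weka's actual StringToWordVector implementation (which also handles tokenization options, TF-IDF, stemming and so on):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy two-pass bag-of-words conversion (word frequency counts).
class BagOfWords {
    // Pass 1: build a dictionary mapping each distinct word to a column index.
    static Map<String, Integer> buildDictionary(List<String> docs) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String doc : docs) {
            for (String word : doc.toLowerCase().split("\\s+")) {
                dict.putIfAbsent(word, dict.size());
            }
        }
        return dict;
    }

    // Pass 2: convert a document into a fixed-length word-frequency vector.
    static int[] toVector(String doc, Map<String, Integer> dict) {
        int[] vec = new int[dict.size()];
        for (String word : doc.toLowerCase().split("\\s+")) {
            Integer idx = dict.get(word);
            if (idx != null) vec[idx]++;   // words not in the dictionary are ignored
        }
        return vec;
    }
}
```

The dictionary pass must see the whole corpus before vector lengths are known, which is exactly why the standard approach needs two passes.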

Certain types of classifiers lend themselves naturally to incremental streaming scenarios and can perform the tokenization of text and the construction of the model in a single pass. Naive Bayes multinomial is one such algorithm; linear support vector machines and logistic regression learned via stochastic gradient descent (SGD) are others. These methods have the advantage of being "any-time" algorithms, i.e. they can produce a prediction at any stage of the learning process. Furthermore, they scale linearly with the amount of data and can be considered "Big Data" methods.
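As a concrete illustration of the single-pass, any-time idea, here is a minimal logistic regression learner updated by SGD. It sketches the general technique only; it is not the exact update performed by Weka's SGDText, which additionally handles tokenization, regularization and dictionary pruning:

```java
// Logistic regression trained by stochastic gradient descent: one weight
// update per example, so the model can produce a prediction at any point
// during the single pass over the data.
class SgdLogistic {
    final double[] w;    // one weight per dictionary entry
    double bias;
    final double lr;     // learning rate

    SgdLogistic(int numFeatures, double learningRate) {
        w = new double[numFeatures];
        lr = learningRate;
    }

    // P(class = 1 | x), via the logistic (sigmoid) function.
    double predictProb(double[] x) {
        double z = bias;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // y is 0 or 1; the gradient of the log loss w.r.t. w is (p - y) * x.
    void update(double[] x, int y) {
        double err = predictProb(x) - y;
        for (int i = 0; i < w.length; i++) w[i] -= lr * err * x[i];
        bias -= lr * err;
    }
}
```

Because each example is touched once and then discarded, memory usage is bounded by the model size, not the data size.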

Text classification is a supervised learning task, which means that each training document needs a category or "class" label provided by a "teacher". Manually labeling training data is a labor-intensive process, and typical training sets are not huge. This seems to preclude the need for big data methods. Enter Twitter's endless data stream and the prediction of sentiment. The limited size of tweets encourages the use of emoticons as a compact way of indicating the tweeter's mood, and these can be used to automate the labeling of training examples [1].
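The automatic labeling idea can be sketched as follows. The emoticon lists here are illustrative placeholders, not the exact ones used in [1]:

```java
// Automatic labeling of tweets via emoticons: a tweet containing a
// positive emoticon becomes a "positive" training example, one containing
// a negative emoticon becomes "negative", and tweets with no emoticon
// (or conflicting ones) are left unlabeled.
class EmoticonLabeler {
    static final String[] POS = {":)", ":-)", ":D", "=)"};
    static final String[] NEG = {":(", ":-(", "=("};

    static boolean containsAny(String tweet, String[] emoticons) {
        for (String e : emoticons) {
            if (tweet.contains(e)) return true;
        }
        return false;
    }

    // Returns "positive", "negative", or null (unlabeled).
    static String label(String tweet) {
        boolean pos = containsAny(tweet, POS);
        boolean neg = containsAny(tweet, NEG);
        if (pos == neg) return null;   // no emoticons, or conflicting ones
        return pos ? "positive" : "negative";
    }
}
```

In the flow below, this role is played by the SubstringLabeler step, configured with substring-matching rules rather than hard-coded lists.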

So how can this be implemented in Weka [2]? Some new text processing components for Weka's Knowledge Flow and the addition of NaiveBayesMultinomialText and SGDText for learning models directly from string attributes make it fairly simple.


This example Knowledge Flow process incrementally reads a file containing some 850K tweets. However, using the Groovy Scripting step with a little custom code, along with the new JsonFieldExtractor step, it would be straightforward to connect directly to the Twitter streaming service and process tweets in real time. The SGDText classifier component performs tokenization, stemming, stopword removal, dictionary pruning and the learning of a linear logistic regression model, all incrementally. Evaluation is performed by interleaved testing and training, i.e. a prediction is produced for each incoming instance before it is incorporated into the model. For the purposes of evaluation, this example flow discards all tweets that don't contain any emoticons, which means most of the data is discarded. If evaluation were not performed, all tweets could be scored, with only the labeled ones being used to train the model.
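Interleaved (sometimes called prequential) test-then-train evaluation can be sketched as follows. The `Learner` interface here is a stand-in for an updateable classifier, not an actual Weka API:

```java
import java.util.List;

// Interleaved test-then-train evaluation: each labeled instance is first
// used to test the current model, and only afterwards used to update it.
class Prequential {
    interface Learner {
        String predict(String text);
        void train(String text, String label);
    }

    // Returns the running accuracy over the labeled part of the stream.
    // Each element of the stream is {text, label}; label may be null.
    static double evaluate(Learner learner, List<String[]> stream) {
        int correct = 0, seen = 0;
        for (String[] ex : stream) {
            if (ex[1] == null) continue;            // unlabeled: skipped here
            if (ex[1].equals(learner.predict(ex[0]))) correct++;
            seen++;
            learner.train(ex[0], ex[1]);            // train only after testing
        }
        return seen == 0 ? 0.0 : (double) correct / seen;
    }
}
```

Because every prediction is made before the corresponding instance is learned from, the running accuracy is an honest estimate of performance on unseen data.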

SGDText is included in Weka 3.7.5. The SubstringReplacer, SubstringLabeler and NaiveBayesMultinomialText classifier (not shown in the screenshot above) will be included with Weka 3.7.6 (due out April/May 2012). In the meantime, interested folks can grab a nightly snapshot of the developer version of Weka.

Options for SGDText

References
[1] Albert Bifet and Eibe Frank. Sentiment knowledge discovery in Twitter streaming data. In Proc 13th International Conference on Discovery Science, Canberra, Australia, pages 1-15. Springer, 2010.

[2] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington, MA, 3 edition, 2011.

25 comments:

  1. Hi, is the labelled twitter data available?
    Thanks..

  2. Hi Jens,

    You can grab the labelled data in ARFF format from:

    https://docs.google.com/open?id=0B1pvkpCwTsiSd1pyTFZkdWVRdEs5Q1NiQW1mRmF1Zw

    Cheers,
    Mark.

  3. Hi Mark,
We are final-year students doing research on Twitter sentiment analysis for one of our course modules. We are planning to first separate tweets into polar or neutral, and then into positive and negative.

We need some assistance on how to classify in Weka and how to create the data set. Hope you will guide us on this matter ...

    Cheers,
    Pri

  4. Hi, will you be making the layout available for download by any chance?

  5. Sure. You can grab the flow layout from:

    https://docs.google.com/open?id=0B1pvkpCwTsiSTHdfWHJ1ZHpJWms

    Cheers,
    Mark.

After downloading the ARFF file and the layout file, I started to run the application, but no results were shown in the "TextViewer" step. Did I do something wrong when processing it?

  7. Hi Ken,

    The data that I posted earlier was the labelled data (i.e. it had passed through the part of the flow that does the automatic labeling via substring matching). The flow is designed to work with the original 850K instance unlabeled data. You can get this from:

    https://docs.google.com/open?id=0B1pvkpCwTsiSV3dGX2huaHRpdHc

    You'll have to change the path in the ArffLoader to point to wherever you downloaded this file to.

    Cheers,
    Mark.

    Replies
    1. Thanks Mark,

Thanks for providing the original data set; I ran it successfully. I have also crawled some Twitter data and converted it to CSV format, but it fails to load in the ARFF viewer with the error "java.io.IOException: wrong number of values. Read 2, expected 1, read Token[EOL], line 18". How can I convert the CSV to ARFF format without getting this error?

      Ken

How do you configure the various components? Does running the .kfml file require the Weka API, or can it be done through the graphical interface?

Can we export the model file created after classification in PMML format?

  9. Weka can consume certain PMML models, but doesn't have an export facility yet. This is on the roadmap for a future release.

    Cheers,
    Mark.

  10. Hi Mark,

I have just started using Weka, and I want to use it for a simple text classification problem with features such as unigrams, word counts and positions, using a naive Bayes classifier. I have found some information online and in the documentation, but I am finding it difficult to understand the difference between a unigram feature set and the bag-of-words representation in ARFF files.

Suppose I have an input ARFF file such as the one below:

    @relation text_files

    @attribute review string
    @attribute sentiment {dummy,negative, positive}

    @data
    "this is some text", positive
    "this is some more text", positive
    "different stuff", negative

After applying the StringToWordVector and Reorder filters, the output ARFF will be:


    @relation 'bagOfWords'

    @attribute different numeric
    @attribute is numeric
    @attribute more numeric
    @attribute some numeric
    @attribute stuff numeric
    @attribute text numeric
    @attribute this numeric
    @attribute sentiment {dummy,negative,positive}

    @data

    {1 1,3 1,5 1,6 1,7 positive}
    {1 1,2 1,3 1,5 1,6 1,7 positive}
    {0 1,4 1,7 negative}

Suppose I want to train the classifier with the unigram count in each class, i.e. something like the following ARFF file:

    @relation 'bagOfWords WordCount'

    @attribute unigram string
    @attribute count numeric
    @attribute sentiment {dummy,negative,positive}

    @data
    "this",2,positive
    "is",2,positive
    "some",2,positive
    "text",2,positive
    "more",1,positive
    "different",1,negative
    "stuff",1,negative

I understand the third representation clearly, and extending the feature set later (e.g. with unigram positions) seems relatively easy. But my doubt is: is this the correct way of representing the data in ARFF for the classifier?

With the API, I gather that StringToWordVector lets me set options such as -C and -T for word counts and term frequencies.

Suppose I want to include other features in the bag-of-words ARFF file; how can I do that?


    Thank you in anticipation.

    Nirmala

    Replies
Hi Nirmala,

      Your original input ARFF file (the one with the string attribute and class label) can have other features that you compute elsewhere. The StringToWordVector filter will only process the string attributes - other features will be left untouched.

      Cheers,
      Mark.

How can I create an ARFF file from current Twitter data?

    Replies
    1. The original 850K tweet file in ARFF format is available for download - see the link in the earlier comments.

      Cheers,
      Mark.

  12. Hi

I don't understand the configuration of the SubstringLabeler and SubstringReplacer steps. I would be very grateful if you could help me.

    Thanks

    Ana

    Replies
    1. If you download the example Knowledge Flow layout file you can take a look at the configuration I used for processing the tweets.

      Cheers,
      Mark.

How can I incorporate the layout into Weka to configure the various components? Do I have to run it with Eclipse?

  14. Just launch the Knowledge Flow GUI from Weka's GUIChooser application. Alternatively, you can execute the flow layout from the command line or a script by invoking weka.gui.beans.FlowRunner.
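For example, assuming weka.jar is on the classpath and the layout was saved as sentiment.kfml (both file names are placeholders for wherever your copies live):

```shell
# Execute a saved Knowledge Flow layout headlessly from the command line.
java -cp weka.jar weka.gui.beans.FlowRunner sentiment.kfml
```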

    Cheers,
    Mark.

  15. This comment has been removed by the author.

  16. This comment has been removed by the author.

It's alright, I ran it from the command line. My project is to do the same work but for statuses written in Arabic. Is it possible to do that with this template, and if so, what do I have to change?

Great article here.
Could you please provide a simple example of streaming Twitter data with Groovy + JsonFieldExtractor?
