Sentiment classification is one of the most popular text classification tasks, used daily by many companies. In the age of the Internet and online life, brands and web portals want to classify text streams in real time so they can suggest relevant products to users or direct them to specific places on their platforms. This is where Online Learning comes in. In this blog post, I will explain what Online Learning is and why it suits this kind of classification, using Vowpal Wabbit in Python as an example.
Let me start by explaining the concept of Online Learning, which is best understood as the opposite of Batch Learning. What is the difference between them? A graphic helps to illustrate how each one works:
As you can see in the diagram, in Batch Learning (1) you first need to store a dataset of many instances and then train a model once (of course, you can retrain it later when you update your dataset). Online Learning (2), on the other hand, collects and processes only one training example at a time, sequentially. It sets up an initial prediction model and updates its parameters at each step for use in future predictions. For more details, take a look at this article written by Ajitesh Kumar.
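To make the "one example at a time" idea concrete, here is a minimal sketch of an online update loop in plain Python. It is not how Vowpal Wabbit is implemented; the tiny logistic model, the function names, the learning rate and the toy stream are all assumptions chosen for illustration.

```python
import math


def predict_proba(weights, bias, features):
    """Logistic prediction for one example (features is a list of floats)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def online_update(weights, bias, features, label, lr=0.5):
    """One online step: predict, compare with the label (0 or 1), adjust."""
    error = predict_proba(weights, bias, features) - label
    new_weights = [w - lr * error * x for w, x in zip(weights, features)]
    return new_weights, bias - lr * error


# A toy stream of (features, label) pairs; in a real system these arrive over time.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 50

weights, bias = [0.0, 0.0], 0.0
for features, label in stream:
    # Only the current example is needed; nothing is stored for later retraining.
    weights, bias = online_update(weights, bias, features, label)
```

The model can answer prediction requests at any point in the loop, which is what makes this style of training attractive for live text streams. A batch learner would instead wait for the whole `stream` list before fitting anything.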
Vowpal Wabbit provides fast and efficient online machine learning techniques for reinforcement learning, supervised learning, and more. It is driven by its community, by research, and by many proven algorithms. Notably, the main contributor to this library is Microsoft Research.
I used Stanford’s Large Movie Review Dataset for the experiment. It includes a training set and a test set, each containing 25,000 movie reviews downloaded from IMDb. Both sets contain equal numbers of ‘positive’ and ‘negative’ labels, indicating the two sentiment polarities.
Finally, we can focus on the implementation. We will need the following libraries:
from sklearn.datasets import load_files
The os module provides functions for interacting with the operating system, and we will use it to access our dataset. The re library handles regular expressions, and we will use it to convert the dataset to a format supported by Vowpal Wabbit. The other two libraries need no introduction. 😉
As mentioned, we need a method for converting our dataset to a particular format. For a binary classification problem, it looks like this:
[Label] |[Namespace] [Feature]
Label is a number; in our binary classification case it is either 1 or -1. Feature is simply the text whose sentiment we want to determine. Namespaces create separate feature spaces, which is particularly useful when an example has several groups of features.
The conversion method is shown below:
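A minimal sketch of such a conversion could look like the following. The function name `to_vw_format`, the namespace name `text` and the exact cleaning rules are assumptions made for illustration, not necessarily what the original listing used. The cleaning matters because `|` and `:` have special meaning in the Vowpal Wabbit text format.

```python
import re


def to_vw_format(text, label=None):
    """Build one VW input line: '<label> |<namespace> <features>'.

    `label` is 1 or -1 for training; leave it as None for an unlabeled
    example that is only used for prediction.
    """
    # Replace punctuation (including the reserved '|' and ':') with spaces.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    # Collapse whitespace and newlines so each example stays on one line.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    prefix = "" if label is None else f"{label} "
    return f"{prefix}|text {cleaned}"


print(to_vw_format("Great movie: loved it!", 1))  # 1 |text great movie loved it
print(to_vw_format("Boring | slow", -1))          # -1 |text boring slow
```

Passing no label produces a line that starts directly with the namespace, which is the form used later when asking the model for a prediction.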
Now we can prepare our data using that method:
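As a rough sketch of the loading step, the snippet below reads reviews from a folder-per-class layout using only the standard library, as a stand-in for scikit-learn's `load_files`. The helper names and the assumption that each split directory contains `pos` and `neg` subfolders of `.txt` files are illustrative; adjust them to match your copy of the dataset.

```python
import os
import random


def iter_labeled_reviews(split_dir):
    """Yield (label, text) pairs from `<split_dir>/pos` and `<split_dir>/neg`."""
    for folder, label in (("pos", 1), ("neg", -1)):
        folder_path = os.path.join(split_dir, folder)
        for name in sorted(os.listdir(folder_path)):
            if not name.endswith(".txt"):
                continue
            path = os.path.join(folder_path, name)
            with open(path, encoding="utf-8") as handle:
                yield label, handle.read()


def load_shuffled(split_dir, seed=0):
    """Load every review, then shuffle so the two classes are interleaved."""
    reviews = list(iter_labeled_reviews(split_dir))
    random.Random(seed).shuffle(reviews)
    return reviews
```

Shuffling is worth the extra step here: an online learner sees examples in order, so feeding it all positive reviews followed by all negative ones would bias the final updates toward one class.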
The implementation should be clear enough without a line-by-line explanation. An example of the resulting dataset in the ‘vw’ format looks like this:
The next step in our code handles the learning process. As pointed out earlier, learning happens in real time here, with the model updated as each successive example is loaded from the dataset:
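The learning loop can be sketched roughly as follows. This assumes version 9 or later of the `vowpalwabbit` Python package, which exposes a `Workspace` class (older releases used `pyvw.vw` instead). The helper names are hypothetical, and the model uses Vowpal Wabbit's default settings rather than any tuned configuration.

```python
def build_model():
    """Create a Vowpal Wabbit model with default settings, without console output."""
    # Imported here so train_online can be exercised without the package installed.
    from vowpalwabbit import Workspace

    return Workspace("--quiet")


def train_online(model, vw_lines):
    """Feed VW-formatted examples to the model one by one; return how many were seen."""
    seen = 0
    for line in vw_lines:
        model.learn(line)  # one incremental update per example
        seen += 1
    return seen
```

With the package installed, `train_online(build_model(), ["1 |text great film", "-1 |text dull film"])` would perform two updates. Because `vw_lines` can be any iterable, the same function works on a generator that produces examples as they arrive.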
Finally, we are able to proceed with checking how our model deals with the prediction:
I hardcoded the example text for testing purposes; in a real system, of course, this would be done in a more ‘elegant’ way. For the given example, we got a value of -0.4. Because the value is below zero, and so closer to the -1 label than to 1, we can interpret it to mean that the text has a negative sentiment.
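The interpretation step can be written down as a small helper. This is a sketch under the assumption that the model was trained with labels 1 and -1, so zero is the natural decision boundary. The function names and the `text` namespace are illustrative, and `model` is any object with a `predict` method that accepts a VW-formatted string.

```python
def sentiment_from_score(score, threshold=0.0):
    """Map a raw score to a sentiment, given training labels of 1 and -1."""
    return "positive" if score > threshold else "negative"


def classify(model, cleaned_text):
    """Score an unlabeled example; the text must not contain '|' or ':'."""
    score = model.predict(f"|text {cleaned_text}")
    return score, sentiment_from_score(score)


print(sentiment_from_score(-0.4))  # negative
```

The distance from zero can also serve as a rough indication of how strongly the model leans one way, although it is not a calibrated probability.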
I wanted to introduce you to this library, which has been gaining popularity recently. Vowpal Wabbit is often used for sentiment classification, both binary and multiclass. It is also employed for forecasting and for various types of optimisation methods.