I had an idea yesterday. I want to build a quick app that displays the relative sentiment (positive vs. negative) of a firehose of content in real time. Think of it as a cross between the work Rob Hawkes did with Twitter sentiment analysis and the WordPress.com live activity chart.
My initial instinct was to follow search traffic from Google and the like, plotting the sentiment of both the search terms and the top results returned for a query. This would give a good sense of the sentiment of content being consumed by the public. Unfortunately, there isn’t a firehose of this kind available.
I’d have to build my own.
Likely a Chrome extension that tracks search data and submits it to a central API somewhere. Not impossible, just more than I want to tackle right now. So, keeping that eventuality in mind, I could do the same thing Rob did and consume the Twitter firehose instead. That gives me a sense of the sentiment of content being produced rather than consumed, but it could still be an interesting project.
I see various components here:
- An application that grabs data from the firehose, extracts the parts I care about (text + location data) and feeds it into a queue
- A queue handler – either RabbitMQ or Gearman, not sure which I want to use yet
- Several workers that grab the data out of the queue, process for sentiment, then pass the location + keywords + sentiment score along to another application
- An application for consuming the worker data and displaying points on a chart
The first application should take the incoming stream, filter it, and pass the data directly into the queue. The queue can be a cluster of handlers rather than a single endpoint – both RabbitMQ and Gearman support this. The workers can be isolated and, thanks to libraries for both tools, written in any language. The final application should really be two components – a static site that serves the initial charts and assets, and a pipe, like the first application, that streams incoming worker data out to websockets.
All of this could live on a single server (and I’ll likely construct it that way) thanks to Docker Compose.
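As a rough sketch of how that single-server layout might hang together in a docker-compose.yml (the service names, images, and build paths here are placeholders, not final choices):

```yaml
# docker-compose.yml (sketch): one container per component, wired together with links
stream:            # application #1: firehose -> queue
  build: ./stream
  links:
    - queue
queue:             # queue handler (RabbitMQ here, but it could just as easily be Gearman)
  image: rabbitmq:3
worker:            # sentiment workers; scale out with `docker-compose scale worker=4`
  build: ./worker
  links:
    - queue
    - web
web:               # presentation app: static assets + websockets
  build: ./web
  ports:
    - "8080:8080"
```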
Streaming Application
Application #1 will be a simple stream – a single application that subscribes to the firehose, manipulates the incoming data, and pipes it directly into the queue. It will be a single Docker container linked to the queue handler.
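A minimal sketch of what that container could run, assuming Node with TypeScript, RabbitMQ (via the amqplib client) as the queue, and a placeholder standing in for the real firehose client:

```typescript
import * as amqp from "amqplib";
import { EventEmitter } from "events";

// The slimmed-down payload the workers care about: text + location.
interface TweetPayload {
  text: string;
  coordinates: [number, number] | null;
}

// Placeholder for the real firehose client; a production version would wrap
// the Twitter streaming API and emit "tweet" events with raw tweet objects.
function connectToFirehose(): EventEmitter {
  return new EventEmitter();
}

async function main() {
  // "queue" is the linked hostname of the queue handler container.
  const conn = await amqp.connect("amqp://queue");
  const channel = await conn.createChannel();
  await channel.assertQueue("tweets", { durable: true });

  const firehose = connectToFirehose();
  firehose.on("tweet", (tweet: any) => {
    // Strip the tweet down to the fields we care about before queueing it.
    const payload: TweetPayload = {
      text: tweet.text,
      coordinates: tweet.coordinates ? tweet.coordinates.coordinates : null,
    };
    channel.sendToQueue("tweets", Buffer.from(JSON.stringify(payload)));
  });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```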
Queue Handler
The queue servers will be multiple Docker containers, each internally identical and exposed to the streaming application as linked hostnames. I think I’ll start with one container and see how things fare.
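If the queue does grow beyond a single container, the streaming application only needs to know the linked hostnames. Something like this hypothetical helper (the hostname list is an assumption) could walk the list until a connection succeeds:

```typescript
import * as amqp from "amqplib";

// Hypothetical linked hostnames for the queue containers; with a single
// container this list collapses to ["queue"].
const QUEUE_HOSTS = ["queue_1", "queue_2"];

// Try each queue host in turn and return the first connection that succeeds.
async function connectToAnyQueue() {
  for (const host of QUEUE_HOSTS) {
    try {
      return await amqp.connect(`amqp://${host}`);
    } catch (err) {
      // Fall through and try the next host.
    }
  }
  throw new Error("no queue host reachable");
}
```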
Workers
The workers will each also live in individual Docker containers and link internally to the queue handlers and to the final presentation application. They’ll merely take data from one (the queue), process it, and dump it to the other.
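Here is what a single worker might look like in the same TypeScript setup, with a deliberately naive word-list scorer standing in for a real sentiment library and a hypothetical /ingest route on the presentation app (this assumes Node 18+ for the built-in fetch):

```typescript
import * as amqp from "amqplib";

// Trivial stand-in for a real sentiment library: count hits against tiny
// positive/negative word lists and return a score normalized by length.
const POSITIVE = ["good", "great", "love", "happy"];
const NEGATIVE = ["bad", "awful", "hate", "sad"];

function scoreSentiment(text: string): number {
  const words = text.toLowerCase().split(/\s+/);
  let score = 0;
  for (const word of words) {
    if (POSITIVE.includes(word)) score += 1;
    if (NEGATIVE.includes(word)) score -= 1;
  }
  return words.length ? score / words.length : 0;
}

async function main() {
  // "queue" is the linked hostname of the queue handler container.
  const conn = await amqp.connect("amqp://queue");
  const channel = await conn.createChannel();
  await channel.assertQueue("tweets", { durable: true });

  channel.consume("tweets", async (msg) => {
    if (!msg) return;
    const { text, coordinates } = JSON.parse(msg.content.toString());

    // Forward location + text + sentiment to the presentation app
    // ("web" is its linked hostname; the /ingest route is an assumption).
    await fetch("http://web:8080/ingest", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ coordinates, text, sentiment: scoreSentiment(text) }),
    });

    channel.ack(msg);
  });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```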
Presentation Application
This component is the trickiest. First, it needs a static HTML presentation that pulls in the assets (styling + JS) to build out the page. Then it needs to accept websocket connections from the front end. Node has been shown to handle up to 1M concurrent websocket connections, so I’m not concerned there. The presentation app will likely be a single Node application that responds to GET requests for assets, maintains socket connections, and streams incoming POSTs directly out to the sockets.
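A sketch of that Node application, assuming Express for the static assets and the ws library for the socket side; the /ingest route matches the hypothetical one the workers POST to above:

```typescript
import express from "express";
import { createServer } from "http";
import { WebSocketServer, WebSocket } from "ws";

const app = express();
app.use(express.json());
app.use(express.static("public")); // static HTML + styling + JS

const server = createServer(app);
const wss = new WebSocketServer({ server });

// Workers POST processed points here; each point is fanned out to every
// connected browser over its websocket.
app.post("/ingest", (req, res) => {
  const point = JSON.stringify(req.body);
  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) {
      client.send(point);
    }
  }
  res.sendStatus(202);
});

server.listen(8080, () => console.log("presentation app listening on :8080"));
```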
Ultimate Goals
First, I want to play with streaming data. Second, I want to experiment with workers implemented in different languages. I’ll probably start with PHP since I’m familiar with it, but I think gauging the performance differences between PHP, Java, and Scala could be fun.
Secondary Goals
I think building out such a system that could (eventually) be tied into search analysis could be useful. Comparing content produced (e.g. tweets, feeds) with content consumed (e.g. search results for specific terms) could be interesting.