posted on Monday, July 31, 2006 5:45 PM by Jonathan Hodgson

Tag Clouds and spotting trends

I've noticed a few blogs are displaying 'tag clouds' so people can at a glance get a general feeling of what the content is about.

ZoomClouds is a free service that takes an RSS feed and builds a clickable set of terms sized by frequency, for example for CNN Top Stories or Slashdot. Other products like Taglocity aim to do the same for emails in Outlook.

It is a similar trend-spotting technique to NewsMap, although we are still some way from trading off the news.

So I wondered how difficult it would be to write a tag cloud service. Given a set of content, the two main functions are generating a list of tags with their frequencies and rendering each tag at a font size based on its frequency.

Although there are some examples of the font distribution algorithms and a good Cloud control for ASP.NET, the difficult part is the automatic generation of keyword tags.
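To show how small the easy half is, here is a minimal sketch in Python of the two functions: counting single-word tags and scaling font size linearly by frequency. The stop-word list, the function names and the 10 to 32 pixel range are my own assumptions for illustration, not taken from ZoomClouds or the ASP.NET Cloud control.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "at", "as"}

def tag_frequencies(text, top_n=20):
    """Count each non-stop word of three or more letters and keep the top_n."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(top_n)

def font_sizes(tags, min_px=10, max_px=32):
    """Map each tag's frequency linearly onto a size between min_px and max_px."""
    if not tags:
        return {}
    freqs = [freq for _, freq in tags]
    lo, hi = min(freqs), max(freqs)
    spread = (hi - lo) or 1  # avoid dividing by zero when every tag has the same count
    return {
        tag: round(min_px + (freq - lo) * (max_px - min_px) / spread)
        for tag, freq in tags
    }

if __name__ == "__main__":
    sample = "Blair speaks on election. Blair faces election critics. Blair responds."
    for tag, size in font_sizes(tag_frequencies(sample)).items():
        print(f'<span style="font-size:{size}px">{tag}</span>')
```

A linear scale is only one option; a logarithmic scale may look better when a handful of tags dominate the feed.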

Using the BBC News RSS feed, simple word parsing didn't give results as good as the same input to ZoomClouds - their cloud service knows how to break and combine words into phrases, i.e. Tony Blair as one tag.
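One crude way to get closer to phrase tags is to treat a run of adjacent capitalized words as a single tag. The sketch below is only a heuristic of my own for illustration; I don't know how ZoomClouds actually combines words, and the function names are hypothetical.

```python
import re
from collections import Counter

# Two or more adjacent capitalized words, e.g. "Tony Blair".
PHRASE_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")

def capitalized_phrases(text):
    """Return every run of two or more adjacent capitalized words in text."""
    return PHRASE_PATTERN.findall(text)

def phrase_tags(headlines):
    """Count capitalized phrases across a list of headline strings."""
    counts = Counter()
    for headline in headlines:
        counts.update(capitalized_phrases(headline))
    return counts

if __name__ == "__main__":
    headlines = [
        "Tony Blair meets ministers",
        "Critics question Tony Blair on policy",
    ]
    print(phrase_tags(headlines).most_common())
```

The obvious weakness is that a capitalized word at the start of a sentence gets glued onto a following name, and lower-case phrases are missed entirely, which is part of why this problem is harder than it looks.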

So the next idea was natural language parsing, and Richard Northedge's CodeProject article on a C# port of OpenNLP looked promising. However, the OpenNLP library wasn't mature enough: it took huge amounts of CPU and memory to parse even a simple string, and unfortunately not that successfully - coming up with Tony as a verb and Blair as a noun.

A bit more research later and I came across the Yahoo! Search Web Services and in particular the Term Extraction API. This allows you to give it a sentence like "Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration." and it will generate extracted terms in order of importance, e.g. italian sculptors, virgin mary, painters, renaissance and inspiration. Although this works well, it didn't fit my model of not using external services.

I did look at using DotLucene, the open-source search engine, but again its phrase recognition didn't work as well as needed.

Being realistic, automatic keyword analysis of text is better left to commercial products like Autonomy or Inxight's SmartDiscovery Extraction Server. The Inxight ThingFinder product has advanced extraction, categorization, variant identification and grouping.

So I'm leaving this mini-project for the moment, unless someone out there has written something they want to share!

Comments

# More natural language processing using GATE

Monday, August 28, 2006 12:39 PM by Anonymous
After my previous entry about natural language processing, I came across a project by the University...

# tag clouds, zoom clouds, MO clouds

Tuesday, September 05, 2006 6:52 AM by Anonymous
Jonathan very kindly mailed me and pointed me at ZoomClouds which would have done exactly...