Monday, July 31, 2006 - Posts

Tag Clouds and spotting trends

I've noticed a few blogs are displaying 'tag clouds' so people can at a glance get a general feeling of what the content is about.

ZoomClouds is a free service that takes an RSS feed and builds a clickable set of terms sized by frequency, for example CNN Top Stories or Slashdot. Other products like Taglocity aim to do it for emails in Outlook.

It is a similar trend-spotting technique to NewsMap, although we are still some way from trading off the news.

So I wondered how difficult it would be to write a tag cloud service. The two main functions would be to generate a list of tags and their frequencies from a given set of content, and to render each tag at a font size based on that frequency.
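Those two functions can be sketched in a few lines of Python. This is only a minimal illustration: the stop-word list, the pixel range and the linear scaling are my own assumptions, not taken from ZoomClouds or any of the font distribution examples.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real service would need a much longer one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def tag_frequencies(text, top_n=20):
    """Count how often each non-stop word appears and keep the top_n."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(top_n)

def font_sizes(tags, min_px=10, max_px=32):
    """Map each tag's count onto a font size by linear interpolation."""
    if not tags:
        return {}
    counts = [count for _, count in tags]
    lowest, highest = min(counts), max(counts)
    span = highest - lowest or 1  # avoid dividing by zero when all counts match
    return {
        tag: round(min_px + (count - lowest) * (max_px - min_px) / span)
        for tag, count in tags
    }

if __name__ == "__main__":
    sample = "blair blair blair iraq iraq budget"
    print(font_sizes(tag_frequencies(sample)))
```

Linear scaling is the simplest choice; when one tag dominates the feed, a logarithmic scale would probably give a more readable cloud.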

Although there are some examples of the font distribution algorithms and a good Cloud control for ASP.NET, the difficult part is the automatic generation of keyword tags.

Using the BBC News RSS feed, simple word parsing didn't give results as good as the same input to ZoomClouds - their cloud service knows how to break and combine words into phrases, e.g. Tony Blair as one tag.
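One crude way to get closer to phrase-level tags is to treat a run of consecutive capitalised words as a single candidate. The sketch below is purely my own illustration of that heuristic; I have no idea how ZoomClouds actually does it, and the function names are made up for the example.

```python
import re
from collections import Counter

# A run of capitalised words, e.g. "Tony Blair", is kept together as one tag.
CAPITALISED_RUN = re.compile(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")

def candidate_phrases(text):
    """Return each run of consecutive capitalised words as one candidate tag."""
    return CAPITALISED_RUN.findall(text)

def phrase_frequencies(headlines):
    """Count candidate phrases across a list of headline strings."""
    counts = Counter()
    for headline in headlines:
        counts.update(candidate_phrases(headline))
    return counts

if __name__ == "__main__":
    headlines = [
        "Tony Blair meets ministers",
        "Ministers back Tony Blair on budget",
    ]
    print(phrase_frequencies(headlines).most_common())
```

The obvious weakness is that the first word of every sentence is capitalised too, so "Ministers" shows up as a tag here. It also does nothing for lower-case phrases, which is where proper natural language parsing would be needed.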

So the next idea was natural language parsing, and Richard Northedge's CodeProject article on a C# port of OpenNLP looked promising. However, the OpenNLP library wasn't mature enough: it took huge amounts of CPU and memory to parse even a simple string, and unfortunately not that successfully - coming up with Tony as a verb and Blair as a noun.

A bit more research later and I came across the Yahoo! Search Web Services and in particular the Term Extraction API. This allows you to give it a sentence like "Italian sculptors and painters of the renaissance favored the Virgin Mary for inspiration." and it will generate extracted terms in order of importance, e.g. italian sculptors, virgin mary, painters, renaissance and inspiration. Although this works well, it didn't fit my model of not using external services.

I did look at using DotLucene, the open source search engine, but again its phrase recognition didn't work as well as needed.

Being realistic, automatic keyword analysis of text is better left to commercial products like Autonomy or Inxight's SmartDiscovery Extraction Server. The Inxight ThingFinder product has advanced extraction, categorization, variant identification and grouping.

So I'm leaving this mini-project for the moment, unless someone out there has written something they want to share!

with 2 Comments

Longhorn networking enhancements

The Microsoft.com operations team's blog is definitely worth a read. This post details some of the testing they have been doing on the TCP/IP enhancements in Windows Vista/Longhorn and, if the figures can be believed, they make for some serious reading:

We set up one server in Bothell, WA and one in Santa Clara, CA (~22ms round-trip latency) and let the Devs have at testing with TTCP.

Now, TTCP pushes the limits of the stack, CPU, bus, network, etc, but that doesn't reflect the normal file transfers that happen as part of doing real work.  Since those file transfers create some of the more challenging scenarios for us, we put two new servers in WA and two in CA, all with GigE NICs.  Each data center has one W2K3 server and one Longhorn server.

From there we set up two Robocopy jobs to pull 20 1GB files from the servers in CA and drop them onto the servers in WA.  One job was run with W2K3 at each end and another was run with Longhorn.  All servers are the same HP DL385 Dual Core machines with 16GB RAM and GigE network uplinks.  Results:

  • Pull with W2K3 at both ends (CA and WA) :  ~12Mb/s (includes SMB and TCP/IP tweaks)
  • Pull with Longhorn at both ends:  >400Mb/s (default config...no tweaks)
  • Pull of same 1GB files between two Longhorn boxes on same VLAN:  502Mb/s

So, I know, you're thinking, but I don't move a bunch of 1GB test files back and forth all day, I pull web logs from remote servers back to a central location for processing and that takes a significant amount of time. We thought the same thing so, for a real-world sample of something we do regularly we pulled a single hourly web log file (199 MB) from a www.microsoft.com server in CA back to a couple servers in the WA data center.  The WWW server in CA is a W2K3 box with GigE and we pulled the file across the wire with a W2K3 and Longhorn server in WA.  For a good view into the future we also put the file on a Longhorn server in CA and pulled from the same Longhorn server in WA.  Results: (represented in terms of time because when you get up to make a sandwich between file copies, this is how long you have):

  • Pull from W2K3 in CA to W2K3 in WA:  ~2:12
  • Pull from W2K3 in CA to Longhorn in WA:  ~0:12
  • Pull from Longhorn in CA to Longhorn in WA:  ~0:04 (not much sandwich time)
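As a rough sanity check, the quoted copy times for the 199 MB log file can be turned back into effective throughput. The sketch below is my own arithmetic, not part of the Microsoft.com post; it takes 1 MB as 8 Mb and treats the approximate times as exact.

```python
LOG_FILE_MB = 199  # size of the hourly web log quoted above

def throughput_mbps(size_mb, minutes, seconds):
    """Effective throughput in megabits per second (1 MB taken as 8 Mb)."""
    elapsed_seconds = minutes * 60 + seconds
    return size_mb * 8 / elapsed_seconds

# (description, minutes, seconds) taken from the quoted results
RESULTS = [
    ("W2K3 to W2K3", 2, 12),
    ("W2K3 to Longhorn", 0, 12),
    ("Longhorn to Longhorn", 0, 4),
]

if __name__ == "__main__":
    for label, minutes, seconds in RESULTS:
        rate = throughput_mbps(LOG_FILE_MB, minutes, seconds)
        print(f"{label}: {rate:.1f} Mb/s")
    # W2K3 to W2K3: 12.1 Mb/s
    # W2K3 to Longhorn: 132.7 Mb/s
    # Longhorn to Longhorn: 398.0 Mb/s
```

The first and last figures line up with the ~12Mb/s and roughly 400Mb/s seen in the Robocopy test, so the two sets of results look consistent with each other.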

Maybe, combined with the hot-patching feature, which should let all non-kernel updates occur without the need for a system reboot, Longhorn Server might be more than the standard upgrade needed to run whatever new Microsoft product is tied to it.

That is, if either product ever ships: looking at the charts Robert McLaws put together of Windows Vista bug counts over time, things still need some attention.

with 0 Comments

Giving better presentations

Darren Strange, the Microsoft Office Product Manager in the UK, has written up some great tips on giving better presentations, including:

  • Know your core material
  • Make friends with the stage
  • Chat with the audience beforehand
  • Speak from a place of passion
  • Pre-empt the questions
  • Tell stories, do demos
  • Make one main point
  • Stand up and take notice

I must update the DeveloperDeveloperDeveloper! Day speaker tips to include these.

with 0 Comments

Handy translations list and MSDN Library for free

Translating applications and websites is a time-consuming process, and even with professional translators you aren't always guaranteed they'll agree on terminology.

So this list of common technical terms from Microsoft is a great resource. Whether it is broadband (in French: haut débit) or default printer (in Romanian: imprimantă implicită), it should save you some time.

Also now available for free download is the MSDN Library, although I must say I tend to use www.google.com over F1 these days - also I was surprised at some of these negative comments.

with 0 Comments

Microsoft Research making it into products

Back in 2002, Microsoft said it would spend $5.3 billion on R&D over the coming year, with a large part of that going to Microsoft Research.

But has that paid off? Some have looked at the R&D investment by major technology companies since 2000, especially in comparison with market capitalization during the same period.

So it was with interest I read Alain's post "Microsoft Research - It does pay off!" and especially the link to the products Microsoft Research has contributed to:

Windows XP

  • ClearType display technology allows a crisper, higher-resolution display of text on ordinary LCD screens.
  • IPv6 is an implementation of the Internet Protocol version 6 that is fully supported in the shipping version of the operating system.
  • Source code analysis tool advancements allow developers to find more subtle and complex bugs.
  • Performance optimization tool advancements optimize the load time, memory requirements and overall performance of the operating system. 

Xbox

  • True Skill, developed by the Machine Learning and Perception group, is a new ranking and matchmaking system that uses a mathematical model of uncertainty to address weaknesses in existing ranking systems. Bayesian analysis enables the True Skill ranking systems to identify player skill with great speed, to the extent that a new player joining a league consisting of a million players can be ranked accurately in fewer than 20 games.
  • IP network probing. Xbox Live provides online gaming and uses Microsoft Research technology to help ensure that gamers get the best online experience. This technology measures the connection quality between players, pairing them with others who have similar connection speeds, which ensures a more equal gaming experience.
  • Graphics. Xbox focuses on very realistic images and uses Microsoft Research graphics technology specifically for modeling animal fur.
  • Audio codecs. Just as in Windows, similar audio compression technologies are provided by Microsoft Research.

Visual Studio 2005 / .Net Framework 2.0

  • Generics is an extension to the .NET Common Intermediate Language that enables object-oriented code to be annotated with parameters that indicate how the code can be reused in different ways. It lets developers write more of their code in a way that is more reliable (has stronger static checking) without sacrificing efficiency or code flexibility. Generics metadata is understood by C#, Visual Basic, C++, and other .NET language compilers.

 

Let's hope more ideas move from the labs into real shipping products, including Photosynth - check out these videos.

with 0 Comments

SOX? BASEL II? Regulatory Compliance Demystified

Anyone working in IT for finance companies can't have missed the changes in the last few years from Sarbanes-Oxley and other regulatory compliance.

But often the developers don't really get a clear picture of why and what it means for them, so this article "Regulatory Compliance Demystified: An Introduction to Compliance for Developers" on MSDN aims to explain those points.

The major acts get a summary of the legislation and the process steps required, i.e. confidentiality, availability, integrity, access controls, auditing, logging and change management. Well worth a read.

 

In a similar vein, as more and more companies use SharePoint for document and project sharing, new features such as Auditing in MOSS 2007 are a very welcome addition, including the programmatic access via SPAudit. Also there is a whitepaper on Excel 2007 regulatory compliance and a PwC whitepaper on spreadsheets in general.

Don't forget other presentations from the Microsoft Financial Developers Conference are online.

with 0 Comments