Monday, July 31, 2006

Text mining on the horizon

During my undergrad days, I took a database class that only scratched the surface of data mining. Even so, the techniques I learned made me think text could be mined the same way.

Researchers at UC-Irvine, according to ZDNet, have done it, with "a relatively new method named topic mining."

From the press release:

Performing what a team of dedicated and bleary-eyed newspaper librarians would need months to do, scientists at UC Irvine have used an up-and-coming technology to complete in hours a complex topic analysis of 330,000 stories published primarily by The New York Times.

And here's how it works:

Text mining allows a computer to extract useful information from unstructured text. Until recently, text mining required a great deal of preparation before documents could be analyzed in a meaningful way. A new text-mining technique called "topic modeling" - which UCI scientists used in their New York Times experiment - looks for patterns of words that tend to occur together in documents, then automatically categorizes those words into topics - all with minimal human effort.
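To make "patterns of words that tend to occur together" a bit more concrete, here is a toy sketch in Python. The press release doesn't say which algorithm the UCI team used, so this is an assumption on my part: it uses latent Dirichlet allocation (LDA) fitted with collapsed Gibbs sampling, one common way to do topic modeling. The function names (`fit_topics`, `top_words`), the parameter values, and the four tiny "documents" are all made up for illustration and are nothing like the scale of the real experiment.

```python
import random
from collections import Counter


def fit_topics(docs, num_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA via collapsed Gibbs sampling (illustrative, not the UCI code).

    docs is a list of token lists. Returns (topic_word, doc_topic) counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Start by assigning every token to a random topic, then tally the counts.
    assign = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]          # tokens per topic, per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # total tokens per topic
    for d, doc in enumerate(docs):
        for w, z in zip(doc, assign[d]):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take this token out of the counts...
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # ...then resample its topic: a topic is favored when it is
                # already common in this document and already uses this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word, doc_topic


def top_words(topic_word, n=3):
    """Return up to n most frequent words for each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Hypothetical mini-corpus: two politics-like and two sports-like "stories".
docs = [
    "election senate vote campaign".split(),
    "senate vote election bill".split(),
    "baseball pitcher inning game".split(),
    "game inning baseball season".split(),
]
topic_word, doc_topic = fit_topics(docs)
for k, words in enumerate(top_words(topic_word)):
    print(f"topic {k}: {words}")
```

Nobody tells the program what the topics are. It only sees which words share documents, and the sampler nudges words that co-occur toward the same topic. With a corpus this small the result depends on the random seed, so treat the printed topics as a demonstration of the mechanics, not a reliable result.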

There's also a link to software that researchers can experiment with.

