25.6.09

my job, or something

What's in a name? A
Keyword by any other
Name would smell as sweet.

Hard to believe that I've been here for a month. In that time, my research has actually made some progress! (In spite of what it may look like from my blog activity, I have actually been working, you know.) Just yesterday I made the major breakthrough that means that the rest of the summer will tend towards the downhill. My supervisor gave me a pep talk a few days ago about how "research pace picks up towards the end," and I guess that's true.

I have a better idea about what exactly it is that I'm doing now, too, which is, um, good. My supervisor's research is in the area of keyphrase extraction, and I'm doing an extension of that: the technical title of my summer project falls somewhere in the topic of "unsupervised back-of-the-book indexing." Basically, I have to write a programme that is intelligent enough to read a book and decide which words belong in its index. This also involves word sense disambiguation and relating terms to each other (that is, I shouldn't index the word "sausage" separately from the word "bratwurst," but should instead realize that a bratwurst is a kind of sausage, an "instanceof" relation, and display something to that effect).
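To make the sausage/bratwurst idea a bit more concrete, here is a toy sketch in Python. It is not my actual project code: the `STOPWORDS` set, the hand-written `IS_A` table (standing in for a real lexical resource), and the "appears at least twice" rule are all assumptions made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "was", "and", "another", "of", "in"}

# Hypothetical hand-made "instanceof" table: specific term -> broader term.
IS_A = {"bratwurst": "sausage"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_terms(text, min_count=2):
    """Keep non-stopwords that occur at least min_count times (a crude score)."""
    counts = Counter(w for w in tokenize(text) if w not in STOPWORDS)
    return {w for w, c in counts.items() if c >= min_count}


def build_index(pages, min_count=2):
    """Map each heading to its sub-terms and the pages they occur on.

    pages is a list of page texts; page numbers start at 1.
    """
    terms = candidate_terms(" ".join(pages), min_count)
    index = defaultdict(lambda: defaultdict(set))
    for page_no, page in enumerate(pages, start=1):
        for word in set(tokenize(page)):
            if word in terms:
                heading = IS_A.get(word, word)  # file bratwurst under sausage
                index[heading][word].add(page_no)
    return {h: {w: sorted(p) for w, p in subs.items()}
            for h, subs in index.items()}
```

Calling `build_index` on a list of page strings gives back a nested dictionary, so "bratwurst" shows up as a sub-entry of "sausage" instead of getting a heading of its own. The real problem is deciding that relation automatically, which is the hard part this sketch skips.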

UIMA is an interesting framework. I can't say that I approve of its reliance on XML files, but a lot of the ideas that underlie the actual coding seem sound. There are three types of processors: readers (they gather the initial data about a document or a collection of documents), annotators (they parse the document text and overlay markings that indicate whether specific phrases are Named Entities or Noun Phrases or what have you), and consumers (they aren't allowed to add information, but they read it and can output statistics, like the precision and recall of a given run of the annotators). Having an XML file expressly for the purpose of pointing to a Java file, though, bothers me a bit.
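The reader / annotator / consumer split can be sketched in a few lines of plain Python. To be clear, this is not the UIMA API (UIMA is Java and wires its components together through XML descriptors); the class and function names below are invented for illustration, and the "capitalized word means Named Entity" rule is a deliberately naive placeholder.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    begin: int   # character offset where the span starts
    end: int     # character offset just past the span
    label: str   # e.g. "NamedEntity"


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


def read_documents(texts):
    """Reader: gather the initial data by wrapping raw strings in Documents."""
    return [Document(text=t) for t in texts]


def annotate_capitalized(doc):
    """Annotator: overlay a marking on every capitalized word (naive rule)."""
    for m in re.finditer(r"\b[A-Z][a-z]+\b", doc.text):
        doc.annotations.append(Annotation(m.start(), m.end(), "NamedEntity"))


def precision_recall(doc, gold_spans):
    """Consumer: only reads annotations and reports statistics."""
    predicted = {(a.begin, a.end) for a in doc.annotations}
    gold = set(gold_spans)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

A run would go: `docs = read_documents([...])`, then `annotate_capitalized(doc)` for each document, then `precision_recall(doc, gold_spans)` against hand-labelled spans. The consumer never touches `doc.annotations` except to read it, which is the same restriction described above.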

Anyway, I'm excited to get to the real science of my project. Not much work has been done in this area; Torsten found only three papers, and all three were by the same two authors. So Csomai and Mihalcea and Valkyrie will be the only ones to have their names on anything. What a strange combination of syllables. :)
