Term Weighting in Short Documents for Document Categorization, Keyword Extraction and Query Expansion

Lecturer : 
Event type: Doctoral dissertation
Mika Timonen
Docent Timo Honkela (Aalto University)
Hannu Toivonen
Event time: 2013-01-25 12:00 to 15:00
University of Helsinki Main Building, Auditorium XIII

This thesis focuses on term weighting in short documents. I propose weighting approaches for assessing the importance of terms for three tasks: (1) document categorization, which aims to classify documents such as tweets into categories, (2) keyword extraction, which aims to identify and extract the most important words of a document, and (3) keyword association modeling, which aims to identify links between keywords and use them for query expansion.

The main finding of this study is that existing term weighting approaches struggle with short documents. The novel algorithms proposed in this thesis produce promising results for both keyword extraction and document categorization. In addition, I show that keyword weighting combined with query expansion produces better search results, especially when the original search terms would not produce any results.

Last updated on 9 Jan 2013 by Pirjo Moen - Page created on 9 Jan 2013 by Pirjo Moen