Filter Keywords and Majority Class Strategies for Company Name Disambiguation in Twitter

  • Damiano Spina
  • Enrique Amigó
  • Julio Gonzalo
CLEF 2011 Conference on Multilingual and Multimodal Information Access Evaluation, 2011

Monitoring the online reputation of a company starts by retrieving all (fresh) information where the company is mentioned. A major problem in this context is that company names are often ambiguous (apple may refer to the company, the fruit, the singer, etc.). The problem is particularly hard in microblogging, where there is little context to disambiguate: this was the task addressed in the WePS-3 CLEF lab exercise in 2010. This paper introduces a novel fingerprint representation technique to visualize and compare system results for the task. We apply this technique to the systems that originally participated in WePS-3, and then use it to explore the usefulness of two signals for addressing the problem: filter keywords (those whose presence in a tweet reliably signals either the positive or the negative class) and the majority class (whether positive or negative tweets are predominant for a given company name in a tweet stream). Our study shows that both are key signals for solving the task. We also find that, remarkably, the vocabulary associated with a company on the Web does not seem to match the vocabulary used in Twitter streams: even a manual extraction of filter keywords from web pages has substantially lower recall than an oracle selection of the best terms from the Twitter stream.
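
The two signals described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the "related"/"unrelated" labels, the whitespace tokenization, and the example keyword sets are all assumptions made for illustration only.

```python
from collections import Counter


def majority_class(labels):
    """Return the most frequent label in a company's tweet stream.

    Assumes a non-empty list of labels such as "related" / "unrelated".
    """
    return Counter(labels).most_common(1)[0][0]


def classify(tweet, positive_keywords, negative_keywords, fallback):
    """Label a tweet using filter keywords, else fall back to a default.

    A positive filter keyword marks the tweet as "related" to the company,
    a negative one as "unrelated". If both kinds appear, this sketch
    arbitrarily lets the positive keyword win. If none appears, the
    fallback label (e.g. the majority class) is returned.
    """
    tokens = set(tweet.lower().split())  # naive whitespace tokenization
    if tokens & positive_keywords:
        return "related"
    if tokens & negative_keywords:
        return "unrelated"
    return fallback


# Hypothetical data for the ambiguous name "apple".
positive = {"iphone", "ipad"}
negative = {"pie", "juice"}
fallback = majority_class(["related", "unrelated", "related"])

for text in ["new apple iphone launch", "baking apple pie", "apple is great"]:
    print(text, "->", classify(text, positive, negative, fallback))
```

In this sketch the majority class only decides tweets that contain no filter keyword, which is one simple way to combine the two signals.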

@InProceedings{spina2011filter,
author="Spina, Damiano
and Amig{\'o}, Enrique
and Gonzalo, Julio",
title="Filter Keywords and Majority Class Strategies for Company Name Disambiguation in Twitter",
booktitle="Multilingual and Multimodal Information Access Evaluation",
year="2011",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="50--61",
isbn="978-3-642-23708-9"
}