Addressing diverse corpora with cluster-based term weighting

Full APA Reference

Organisciak, P. (2013). Addressing diverse corpora with cluster-based term weighting. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 163-166). doi:10.1145/2467696.2467740

Publication Abstract

Highly heterogeneous collections present difficulties to term weighting models that are informed by corpus-level frequencies. Collections which span multiple languages or large time periods do not provide realistic statistics on which words are interesting to a system. This paper presents a case where diverse corpora can frustrate term weighting and proposes a modification that weighs documents according to their class or cluster within the collection. In cases of diverse corpora, the proposed modification better represents the intuitions behind corpus-level document frequencies.
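
The abstract does not give the paper's weighting formula, so the following is only a minimal sketch of the general idea: compute document frequencies within each class or cluster rather than over the whole collection. The smoothed IDF variant, the function names `idf` and `cluster_idf`, the toy documents, and the language labels are all assumptions made for illustration, not the author's method.

```python
import math
from collections import Counter, defaultdict


def idf(docs):
    """Smoothed inverse document frequency over tokenized documents.

    Assumed formula for illustration: log((N + 1) / (df + 1)) + 1.
    """
    n = len(docs)
    # Count each term once per document it appears in.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((n + 1) / (count + 1)) + 1.0
            for term, count in df.items()}


def cluster_idf(docs, labels):
    """Compute a separate IDF table for each cluster label (hypothetical helper)."""
    groups = defaultdict(list)
    for doc, label in zip(docs, labels):
        groups[label].append(doc)
    return {label: idf(group) for label, group in groups.items()}


# Toy mixed-language collection (invented data).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "bird", "sang"],
    ["le", "chat", "dort"],
    ["le", "chien", "court"],
    ["le", "oiseau", "chante"],
]
labels = ["en", "en", "en", "fr", "fr", "fr"]

corpus_weights = idf(docs)                 # statistics over the whole collection
cluster_weights = cluster_idf(docs, labels)  # statistics within each cluster

print(corpus_weights["the"], cluster_weights["en"]["the"])
```

In this toy collection, "the" appears in only half of all documents, so corpus-level statistics treat it as moderately informative. Within the English cluster it appears in every document and receives the lowest possible weight, which is closer to the intuition that IDF is meant to capture.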

See Also URL