Data as of 10 December 2024
Verlag Dr. Hut GmbH, Sternstr. 18, 80538 München, Tel: 0175 / 9263392, Mon - Fri, 9 am - 12 noon
ISBN 978-3-8439-1774-2, series: Informatik (Computer Science)
Dominique Ziegelmayer: Character n-gram-based sentiment analysis
199 pages, dissertation, Universität Köln (2014), softcover, A5
With the growing availability and popularity of user-generated content, automatic analysis and aggregation of such information become increasingly important. Sentiment polarity classification, one of the main tasks in sentiment analysis, aims to analyze and classify documents according to the opinions stated therein. Existing work has mainly focused on standard machine learning techniques. Here, we investigate a novel approach that has proven successful in conventional text classification tasks such as authorship attribution or topic categorization.
This thesis examines classifiers based on adaptive statistical data compression models or, more generally, on statistics about variable- or fixed-length character sequences, i.e. character n-grams. We define a classifier using the prediction by partial matching (PPM) compression algorithm and introduce the p2-Measure as a simple abstraction of PPM, motivated by information theory. When coupled with feature weighting and feature selection schemes, the p2-Measure consistently outperforms the far more sophisticated SVM.
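The general idea of classifying by character n-gram statistics can be illustrated with a minimal sketch. It is not the PPM classifier or the p2-Measure of the thesis, whose definitions are not given in this abstract. As a stand-in, it assumes a much simpler scheme: each class gets a table of character trigram counts, and a document is assigned to the class under which its trigrams have the shortest approximate code length (cross-entropy with add-one smoothing). The names `char_ngrams`, `NgramModel`, `classify`, and the toy training texts are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramModel:
    """Character n-gram counts for one class (illustrative stand-in only)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = Counter()
        self.total = 0

    def train(self, text):
        grams = char_ngrams(text, self.n)
        self.counts.update(grams)
        self.total += len(grams)

    def bits(self, text, vocab_size):
        """Approximate code length of text in bits, with add-one smoothing."""
        return sum(
            -math.log2((self.counts[g] + 1) / (self.total + vocab_size))
            for g in char_ngrams(text, self.n)
        )


def classify(text, models):
    """Pick the label whose model encodes text in the fewest bits."""
    # Shared vocabulary size: distinct n-grams seen in any class,
    # plus one slot reserved for unseen n-grams.
    vocab = set()
    for model in models.values():
        vocab.update(model.counts)
    vocab_size = len(vocab) + 1
    return min(models, key=lambda label: models[label].bits(text, vocab_size))


# Toy training data (assumed, not taken from the thesis).
models = {"pos": NgramModel(), "neg": NgramModel()}
models["pos"].train("great movie, loved it, wonderful acting")
models["neg"].train("terrible movie, hated it, awful acting")

print(classify("loved the wonderful story", models))
print(classify("hated the awful story", models))
```

Because the features are character sequences rather than words, no tokenizer or language-specific preprocessing is required, which is one plausible reason such approaches are considered for noisy or foreign language text.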
In the course of this work, we analyze the advantages of the p2-Measure and character n-gram-based approaches in detail. Besides the transfer performance between different source and target domains, i.e. cross-domain sentiment analysis, we are also interested in the potential benefits of our method on foreign language datasets. Moreover, we investigate to what extent the p2-Measure can be used to determine not only the polarity but also the strength and even the original rating of a document. Altogether, our results show that the p2-Measure is a serious alternative to the standard word-based approach and that it is especially suitable for noisy or foreign language datasets.