A Simple Use of SentiWordNet

Date: June 9, 2015

# Example line:

# POS ID PosS NegS SynsetTerm#sensenumber Desc

# a 00009618 0.5 0.25 spartan#4 austere#3 ascetical#2 describe

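In SentiWordNet, each data line looks like the example above. The first field is the part of speech; the second is the synset ID; the third and fourth are the positive and negative scores; the fifth is a space-separated string of terms in the form word#sensenumber word#sensenumber, where the number after # is the rank of that sense for the word, not a sentiment score. All the words listed in the fifth field are synonyms, and the sixth field describes the meaning they share.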
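As a rough illustration of this field layout, the sketch below splits one line into its six fields. The helper name `parse_line` is made up for this example, the file is assumed to be tab-separated (as the script further down assumes), and the gloss text is just the placeholder from the example line.

```python
def parse_line(line):
    """Split one tab-separated SentiWordNet line into named fields (hypothetical helper)."""
    pos, synset_id, pos_score, neg_score, terms, gloss = line.rstrip("\n").split("\t")
    # Each term has the form word#sensenumber.
    senses = []
    for term in terms.split(" "):
        word, sense_number = term.rsplit("#", 1)
        senses.append((word, int(sense_number)))
    return {
        "pos": pos,
        "id": synset_id,
        "pos_score": float(pos_score),
        "neg_score": float(neg_score),
        "terms": senses,
        "gloss": gloss,
    }


# Sample line from the article; the gloss is a placeholder, not the real description.
sample = "a\t00009618\t0.5\t0.25\tspartan#4 austere#3 ascetical#2\tdescribe"
record = parse_line(sample)
print(record["terms"])                             # [('spartan', 4), ('austere', 3), ('ascetical', 2)]
print(record["pos_score"] - record["neg_score"])   # 0.25
```

The difference PosS - NegS printed on the last line is the per-synset score that the script later combines across senses.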

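A word usually has many senses. Take 'good': as a noun it has 4 senses (that is, it appears on 4 lines; the same applies below), as an adjective 21, and as an adverb 2. To judge the sentiment of 'good', we do not first work out which sense is meant and then use that one line. Instead, we combine all the entries for 'good' with the given part of speech into a single weighted average and let that value stand for the word's sentiment score, which lies between -1 and 1. A positive score means positive sentiment, a negative score means negative sentiment, and 0.0 means neutral.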

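The exact formula, matching the code below, is as follows, where $n$ is the number of senses of the word, $r_i$ is the sense number of the $i$-th sense, and $PosS_i$ and $NegS_i$ are the positive and negative scores of the synset containing that sense:

$$score = \sum_{i=1}^{n} \frac{PosS_i - NegS_i}{r_i}$$

$$sum = \sum_{i=1}^{n} \frac{1}{r_i}$$

The final score is $score / sum$.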
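The weighting can be checked on a toy input. The sketch below is a standalone restatement of the averaging step in the script further down, not a separate method: each sense contributes its synset score (PosS - NegS) divided by its sense number, and the total is normalized by the sum of the reciprocal sense numbers. The function name `weighted_score` and the three sample values are invented for illustration.

```python
def weighted_score(senses):
    """senses: list of (sense_number, PosS - NegS) pairs for one word#POS key."""
    score = sum(s / r for r, s in senses)         # lower sense numbers weigh more
    weight_sum = sum(1.0 / r for r, s in senses)  # normalizing constant
    return score / weight_sum


# Made-up values: sense 1 scores 0.5, sense 2 scores -0.25, sense 4 scores 0.0
example = [(1, 0.5), (2, -0.25), (4, 0.0)]
print(weighted_score(example))   # (0.5 - 0.125 + 0.0) / (1 + 0.5 + 0.25), roughly 0.214
```

Because the weights are 1/r, the first sense dominates: here the result stays positive even though the second sense is negative.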

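To use it, just pass in a word and its part of speech. This returns a sentiment score for most sentiment words, from which the polarity can be determined.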
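Turning a score into a polarity label could look like the sketch below. The function name `polarity`, the "unknown" label, and the sample scores are assumptions for illustration. The `None` case reflects that a dictionary lookup with `.get`, as used in the script's `getscore`, returns `None` for a word#POS key that is not present.

```python
def polarity(score):
    """Map a score in [-1, 1] to a label; None means the word#POS key was not found."""
    if score is None:
        return "unknown"
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical scores standing in for real lookups
scores = {"good#a": 0.6, "bad#a": -0.6, "table#n": 0.0}
for key in ["good#a", "bad#a", "table#n", "xyzzy#n"]:
    print(key, polarity(scores.get(key)))
```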

# author:kou
# date: 2014-03-14

from __future__ import division

class SentiWordNet(object):
    def __init__(self, netpath):
        self.netpath = netpath     # path to the SentiWordNet text file
        self.dictionary = {}       # "word#POS" -> weighted average score
    
    def infoextract(self):
        tempdict = {}   # "word#POS" -> list of [sense number, synset score]
        try:
            f = open(self.netpath, "r")
        except IOError:
            print("failed to open file!")
            exit()
        print('start extracting.......')

        # Example line:
        # POS     ID     PosS  NegS SynsetTerm#sensenumber Desc
        # a   00009618  0.5    0.25  spartan#4 austere#3 ascetical#2  ……

        for sor in f:
            # Skip comment lines and blank lines
            if not sor.strip() or sor.strip().startswith("#"):
                continue
            data = sor.split("\t")
            if len(data) != 6:
                print('invalid data')
                break
            wordTypeMarker = data[0]
            synsetScore = float(data[2]) - float(data[3])   # synset score = PosS - NegS
            synTermsSplit = data[4].split(" ")    # word#sensenumber
            for w in synTermsSplit:
                synTermAndRank = w.split("#")
                synTerm = synTermAndRank[0] + "#" + wordTypeMarker    # word#POS
                synTermRank = int(synTermAndRank[1])
                tempdict.setdefault(synTerm, []).append([synTermRank, synsetScore])
        f.close()

        # Weighted average over all senses of each word#POS key
        for key in tempdict:
            score = 0.0
            ssum = 0.0
            for wordlist in tempdict[key]:
                score += wordlist[1] / wordlist[0]
                ssum += 1.0 / wordlist[0]
            score /= ssum
            self.dictionary[key] = score
    
    def getscore(self, word, pos):
        # Returns None if word#POS is not in the dictionary
        return self.dictionary.get(word + "#" + pos)
        
            
            
                
if __name__ == '__main__':
    netpath = "C:\\Users\\Administrator\\Desktop\\SentiWordNet.txt"
    swn = SentiWordNet(netpath)
    swn.infoextract()
    print("good#a " + str(swn.getscore('good', 'a')))
    print("bad#a " + str(swn.getscore('bad', 'a')))
    print("blue#a " + str(swn.getscore('blue', 'a')))
    print("blue#n " + str(swn.getscore('blue', 'n')))