Third Implementation of Logistic Regression (C++): Interface

Date: 2015-06-08

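I was watching a replay of episode 3 of 《我是歌手》 (I Am a Singer), but Zhang Yu's singing was so bad that I came back to write some blog instead.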

1. Motivation

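The first two implementations were limited: the first to 0-1 input features, the second to real-valued features. I have not tried mixing the two, and I don't plan to this time either. Both earlier implementations handled binary classification, whereas the problems met most often in real life are multi-class. There are two ways to turn a binary classifier into a multi-class one: one vs all, or one vs one. The former builds K classifiers for K classes, each separating the current class from all the others. The latter needs K(K-1)/2 classifiers, one for each pair of classes. In engineering practice, I believe few people go to the trouble of choosing the latter.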

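Over the past couple of days I have been rereading Li Hang's 《统计学习方法》 (Statistical Learning Methods), going through the chapter on logistic regression. The book briefly mentions the formula for logistic regression on multi-class problems. So my hands got itchy, and I set out to implement one.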

2. Principle

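For the binary LR model, the sigmoid function is used at the end to compute the probability of the current class. The sigmoid formula is: f(x) = 1.0 / (1.0 + exp(-x)). In actual computation, a variant is often used for numerical reasons: f(x) = exp(x) / (1.0 + exp(x)). The two forms are algebraically identical. The variant avoids calling exp on a large positive number when x is very negative, while the original form is the safer one when x is large and positive.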
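As a minimal sketch of how the two sigmoid forms can be combined, the function below picks whichever form keeps the argument of exp non-positive. The name StableSigmoid is made up for this illustration and is not part of the LogisticRegression class.

```cpp
#include <cmath>

// Numerically stable sigmoid (hypothetical helper, not from the original code).
// It always calls exp() on a non-positive argument, so exp() cannot overflow.
double StableSigmoid (double x)
{
	if (x >= 0.0)
	{
		// original form: exp(-x) lies in (0, 1]
		return 1.0 / (1.0 + std::exp(-x));
	}
	// variant form: exp(x) lies in [0, 1)
	const double dExp = std::exp(x);
	return dExp / (1.0 + dExp);
}
```

With this split, very large inputs of either sign saturate to 1 or 0 instead of producing inf / inf.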

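For multi-class problems, the probability function given in Li Hang's book is: f_i(x) = exp(x_i) / (1.0 + sum_j (exp(x_j)) ). Bear with the notation: anything with '_i' or '_j' is a subscript. Here j runs over the K-1 non-default classes, and the remaining (default) class gets probability 1.0 / (1.0 + sum_j (exp(x_j)) ).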
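The multi-class formula above can be sketched as follows. This is only an illustration: the function name CalcClassProbs, the convention that the default class is stored last, and the shift by the largest score (a common trick to keep exp from overflowing) are assumptions, not details taken from the original implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// ScoreVec holds the K-1 linear scores x_i of the non-default classes.
// Returns K probabilities; the last entry is the default class, whose score
// is implicitly 0 (this is the "1.0" term in the denominator).
std::vector<double> CalcClassProbs (const std::vector<double> & ScoreVec)
{
	// shift every score by the maximum (at least 0, the default class score)
	double dMax = 0.0;
	for (std::size_t i = 0; i < ScoreVec.size(); i++)
		dMax = std::max(dMax, ScoreVec[i]);

	std::vector<double> ProbVec;
	double dSum = std::exp(-dMax);		// shifted "1.0" term
	for (std::size_t i = 0; i < ScoreVec.size(); i++)
	{
		ProbVec.push_back(std::exp(ScoreVec[i] - dMax));
		dSum += ProbVec.back();
	}
	ProbVec.push_back(std::exp(-dMax));	// default class
	for (std::size_t i = 0; i < ProbVec.size(); i++)
		ProbVec[i] /= dSum;
	return ProbVec;
}
```

With a single score (K = 2), the first returned probability reduces to the binary sigmoid of that score.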

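I have not derived the SGD weight update formula in detail. Judging from the implementation, only small changes to the earlier code are needed.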

3. Parameters

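In binary classification, each feature has one weight, and the weights form a weight vector. In K-class classification, each feature has one weight in every class, so the (feature, class) weights form a weight matrix. With N features, the matrix as just described has size K * N. However, since 1 * N is enough to handle the binary problem, (K-1) * N is enough to handle the K-class problem, as long as one extra default class is added. I'm not sure I have explained that clearly, so here is the code: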

private:
	// the number of target classes
	int iClassNum;
	// the number of features
	int iFeatureNum;
	// the theta matrix of iFeatureNum * (iClassNum - 1)
	// note: for binary class, we need only 1 vector of theta; for multi-class,
	// iFeatureNum * (iClassNum - 1) is always enough
	vector< vector<double> > ThetaMatrix;
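To make the default-class idea concrete, here is a small sketch of how such a matrix might be allocated and queried. The helper names MakeThetaMatrix and LinearScore, the dense feature vector, and the choice of one row per non-default class are assumptions for illustration; the real InitThetaMatrix may lay things out differently.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector< std::vector<double> > Matrix;

// Allocate a zero-filled matrix with (iClassNum - 1) rows and iFeatureNum
// columns. Returns an empty matrix for invalid sizes.
Matrix MakeThetaMatrix (int iClassNum, int iFeatureNum)
{
	if (iClassNum < 2 || iFeatureNum < 1)
		return Matrix();
	return Matrix(iClassNum - 1, std::vector<double>(iFeatureNum, 0.0));
}

// Linear score of a dense feature vector for one class.
// Any class index without a row (in particular the default class,
// index iClassNum - 1) gets the fixed score 0.
double LinearScore (const Matrix & Theta, const std::vector<double> & FeaVec, int iClassIndex)
{
	if (iClassIndex < 0 || iClassIndex >= static_cast<int>(Theta.size()))
		return 0.0;
	double dScore = 0.0;
	for (std::size_t j = 0; j < FeaVec.size() && j < Theta[iClassIndex].size(); j++)
		dScore += Theta[iClassIndex][j] * FeaVec[j];
	return dScore;
}
```

For iClassNum = 2 this yields a single weight row, which is exactly the binary case.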

4. Overall interface

The overall function interface is defined in the file LogisticRegression.h, as follows:

/***********************************************************************************
* Logistic Regression classifier version 0.03
* Implemented by Jinghui Xiao ([email protected] or [email protected])
* Last updated on 2014-1-17
***********************************************************************************/

#pragma once

#include <vector>
#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>
#include <algorithm>
#include <cmath>

using namespace std;

// The representation for a feature and its value, init with '-1'
class FeaValNode
{
public:
	int iFeatureId;
	double dValue;

	FeaValNode (void);
	~FeaValNode (void);
};

// The representation for a sample
class Sample
{
public:
	// the class index of a sample (0-1 value in the binary case), init with '-1'
	int iClass;
	vector<FeaValNode> FeaValNodeVec;

	Sample (void);
	~Sample (void);
};

// a tiny float number used for smoothing when scaling the input samples
#define SMOOTHFATOR 1e-100

// The logistic regression classifier for MULTI-classes
class LogisticRegression
{
public:
	LogisticRegression(void);
	~LogisticRegression(void);

	// scale all of the sample values and put the result into txt
	bool ScaleAllSampleValTxt (const char * sFileIn, int iFeatureNum, const char * sFileOut);
	// train by SGD on the sample file
	bool TrainSGDOnSampleFile (
				const char * sFileName, int iClassNum, int iFeatureNum,		// about the samples
				double dLearningRate,										// about the learning 
				int iMaxLoop, double dMinImproveRatio						// about the stop criteria
				);
	// train by SGD on the sample file, decreasing dLearningRate during loop
	bool TrainSGDOnSampleFileEx (
				const char * sFileName, int iClassNum, int iFeatureNum,		// about the samples
				double dLearningRate,										// about the learning 
				int iMaxLoop, double dMinImproveRatio						// about the stop criteria
				);
	// save the model to txt file: the theta matrix with its size
	bool SaveLRModelTxt (const char * sFileName);
	// load the model from txt file: the theta matrix with its size
	bool LoadLRModelTxt (const char * sFileName);
	// load the samples from file, predict by the LR model
	bool PredictOnSampleFile (const char * sFileIn, const char * sFileOut, const char * sFileLog);

	// just for test
	void Test (void);

private:
	// read a sample from a line, return false if fail
	bool ReadSampleFrmLine (string & sLine, Sample & theSample);
	// load all of the samples into sample vector, this is for scale samples
	bool LoadAllSamples (const char * sFileName, vector<Sample> & SampleVec);
	// initialize the theta matrix with iClassNum and iFeatureNum
	bool InitThetaMatrix (int iClassNum, int iFeatureNum);
	// calculate the model function output for iClassIndex by feature vector
	double CalcFuncOutByFeaVec (vector<FeaValNode> & FeaValNodeVec, int iClassIndex);
	// calculate the model function output for all the classes, and return the class index with max probability
	int CalcFuncOutByFeaVecForAllClass (vector<FeaValNode> & FeaValNodeVec, vector<double> & ClassProbVec);
	// calculate the gradient and update the theta matrix, it returns the cost
	double UpdateThetaMatrix (Sample & theSample, vector<double> & ClassProbVec, double dLearningRate);
	// predict the class for one single sample
	int PredictOneSample (Sample & theSample);

private:
	// the number of target classes
	int iClassNum;
	// the number of features
	int iFeatureNum;
	// the theta matrix of iFeatureNum * (iClassNum - 1)
	// note: for binary class, we need only 1 vector of theta; for multi-class,
	// iFeatureNum * (iClassNum - 1) is always enough
	vector< vector<double> > ThetaMatrix;
};

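A sample scaling function was added to process the training and test samples. Another SGD training function was added as well; the difference is that its learning rate decays gradually over the iterations. The functions are implemented in LogisticRegression.cpp; see the follow-up post.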

Please credit the source when reposting: http://blog.csdn.net/xceman1997/article/details/18426073
