In this exercise, we will use support vector machines (SVMs) to build a spam classifier. We start with some simple 2D datasets to see how SVMs work. Then we perform some preprocessing on a set of raw emails and use an SVM to build a classifier on the processed emails that determines whether they are spam.
The first thing we will do is look at a simple two-dimensional dataset and see how a linear SVM behaves on it for different values of C (a parameter that plays a role similar to the regularization term in linear/logistic regression).
Example 1: A linear SVM on linearly separable data

1.1 Loading the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
warnings.simplefilter("ignore")

We will visualize the data as a scatter plot, with the class label indicated by the marker (+ for the positive class, o for the negative class).
data1 = pd.read_csv('data/svmdata1.csv')
data1.head()

       X1      X2  y
0  1.9643  4.5957  1
1  2.2753  3.8589  1
2  2.9781  4.5651  1
3  2.9320  3.5519  1
4  3.5772  2.8560  1

positive = data1[data1['y'].isin([1])]
negative = data1[data1['y'].isin([0])]
negative

(output: the 30 rows of data1 with y = 0)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(positive['X1'], positive['X2'], s=50, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=50, marker='o', label='Negative')
ax.legend()
plt.show()

Note that there is one outlying positive example that sits apart from the others. The classes are still linearly separable, but only by a very tight margin. We will train a linear SVM to learn the class boundary. This exercise does not ask us to implement an SVM from scratch, so we use scikit-learn.
1.2 Preparing the training data

Here we do not hold out a test set. We train on all the data, then inspect how confidently each point is assigned to its class after training.
X_train = data1[["X1", "X2"]].values
y_train = data1["y"].values

1.3 Instantiating a linear SVM

# Build the first SVM object, with C=1
from sklearn import svm
svc1 = svm.LinearSVC(C=1, loss="hinge", max_iter=1000)
svc1.fit(X_train, y_train)
svc1.score(X_train, y_train)

0.9803921568627451

from sklearn.model_selection import cross_val_score
cross_val_score(svc1, X_train, y_train, cv=5).mean()

0.9800000000000001

Let's see what happens with a larger value of C.
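Before jumping to a single large value, a quick sweep over several C settings shows the trend. This is a minimal sketch reusing the imports above; the candidate list is our own choice, not part of the exercise:

# Sweep C and watch the cross-validated accuracy
# (max_iter raised to reduce convergence warnings for small C)
for C in [0.01, 0.1, 1, 10, 100]:
    clf = svm.LinearSVC(C=C, loss="hinge", max_iter=10000)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f"C={C:>6}: mean CV accuracy = {scores.mean():.4f}")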
# Build the second SVM object, with C=100
svc2 = svm.LinearSVC(C=100, loss="hinge", max_iter=1000)
svc2.fit(X_train, y_train)
svc2.score(X_train, y_train)

0.9411764705882353

from sklearn.model_selection import cross_val_score
cross_val_score(svc2, X_train, y_train, cv=5).mean()

0.96

X_train.shape

(51, 2)

svc1.decision_function(X_train).shape

(51,)

# Store the decision-function values of both SVMs as new columns
data1["SV1 decision function"] = svc1.decision_function(X_train)
data1["SV2 decision function"] = svc2.decision_function(X_train)
data1

(output: data1 with the two new decision-function columns for all 51 points)

We now use the decision-function value of each point as its color, so we can read off its confidence and compare the results produced by the two SVMs.
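Under the hood, a linear SVM's decision_function is just w·x + b, so these confidence values can be reproduced from the fitted coefficients. A quick sanity check (our addition, not part of the exercise):

import numpy as np

# For a binary LinearSVC, coef_ has shape (1, n_features)
manual = X_train @ svc1.coef_[0] + svc1.intercept_[0]
print(np.allclose(manual, svc1.decision_function(X_train)))  # True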
1.4 Visualization and analysis

# Plot both sets of decision-function values side by side
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.scatter(data1["X1"], data1["X2"], marker="s", c=data1["SV1 decision function"], cmap='seismic')
plt.title("SVC1")
plt.subplot(1, 2, 2)
plt.scatter(data1["X1"], data1["X2"], marker="x", c=data1["SV2 decision function"], cmap='seismic')
plt.title("SVC2")
plt.show()

Example 2: Kernel SVMs

Now we move from linear SVMs to SVMs that can perform non-linear classification using kernels. Our first task is to implement a Gaussian kernel function. Although scikit-learn has a built-in Gaussian kernel, we will implement one from scratch so that the mechanics are clearer.
2.1 Loading the dataset

data2 = pd.read_csv('data/svmdata2.csv')
data2

(output: 863 rows × 3 columns: X1, X2, y)
# Visualize the data points
positive = data2[data2['y'].isin([1])]
negative = data2[data2['y'].isin([0])]
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(positive['X1'], positive['X2'], s=50, marker='x', label='Positive')
ax.scatter(negative['X1'], negative['X2'], s=50, marker='o', label='Negative')
ax.legend()
plt.show()

2.2 Defining the Gaussian kernel function

def gaussian(x1, x2, sigma):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * (sigma ** 2)))

x1 = np.arange(1, 5)
x2 = np.arange(6, 10)
gaussian(x1, x2, 2)

3.726653172078671e-06

x1 = np.array([1.0, 2.0, 1.0])
x2 = np.array([0.0, 4.0, -1.0])
sigma = 2
gaussian(x1, x2, 2)

0.32465246735834974

X2_train = data2[["X1", "X2"]].values
y2_train = data2["y"].values
X2_train, y2_train

(output: the 863 × 2 feature array and the 863-element 0/1 label array)

This result matches the expected value from the exercise. Next, we examine another dataset, this time with a non-linear decision boundary.
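As a cross-check on the hand-written kernel (our addition, not part of the exercise), scikit-learn's built-in rbf_kernel should agree with gaussian once we convert σ into its gamma parameterization, gamma = 1/(2σ²):

from sklearn.metrics.pairwise import rbf_kernel

sigma = 2
gamma = 1 / (2 * sigma ** 2)
a = np.array([[1.0, 2.0, 1.0]])
b = np.array([[0.0, 4.0, -1.0]])
# rbf_kernel computes exp(-gamma * ||a - b||^2) for each pair of rows
print(rbf_kernel(a, b, gamma=gamma))  # [[0.32465247]], matching gaussian(x1, x2, 2)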
For this dataset we build an SVM classifier with the built-in RBF kernel and check its accuracy on the training data. To visualize the decision boundary, this time we shade each point by the model's predicted class probability. The result shows that most points are classified correctly.
2.3 Creating a non-linear SVM

import sklearn.svm as svm
nl_svc = svm.SVC(C=100, gamma=10, probability=True)
nl_svc.fit(X2_train, y2_train)

SVC(C=100, gamma=10, probability=True)

nl_svc.score(X2_train, y2_train)

0.9698725376593279

2.4 Visualizing the class assignments

# Color each sample by its predicted probability of belonging to the positive class
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
positive = data2[data2['y'].isin([1])]
negative = data2[data2['y'].isin([0])]
plt.scatter(positive['X1'], positive['X2'], s=50, marker='x', label='Positive')
plt.scatter(negative['X1'], negative['X2'], s=50, marker='o', label='Negative')
plt.legend()
plt.subplot(1, 2, 2)
data2["probability"] = nl_svc.predict_proba(data2[["X1", "X2"]])[:, 1]
plt.scatter(data2["X1"], data2["X2"], s=30, c=data2["probability"], cmap="Reds")
plt.show()
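The hand-written gaussian from section 2.2 can also be plugged into scikit-learn directly via kernel='precomputed', which accepts a Gram matrix instead of raw features. A minimal sketch (our addition; the sigma value and the random subsample are arbitrary choices to keep the Python double loop cheap):

# Subsample so the O(n^2) pure-Python kernel loop stays fast
rng = np.random.default_rng(0)
idx = rng.choice(len(X2_train), size=200, replace=False)
Xs_, ys_ = X2_train[idx], y2_train[idx]

# Gram matrix K[i, j] = gaussian(x_i, x_j, sigma)
K = np.array([[gaussian(a, b, 0.1) for b in Xs_] for a in Xs_])

svc_pre = svm.SVC(C=100, kernel='precomputed')
svc_pre.fit(K, ys_)
# At predict time, pass kernels between the new points (rows) and the training points (columns)
print(svc_pre.score(K, ys_))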
For the third dataset we are given both a training set and a validation set, and we want to find the best hyperparameters for an SVM model based on validation-set performance. Although scikit-learn's built-in grid search could do this, in the spirit of the exercise we implement a simple grid search from scratch.

Example 3: Choosing the optimal C and gamma

3.1 Loading the data

# Read the file containing the training set
data3 = pd.read_csv('data/svmdata3.csv')
# Read the file containing the validation set
data3val = pd.read_csv('data/svmdata3val.csv')

data3

(output: 211 rows × 3 columns: X1, X2, y)
data3val

(output: 200 rows × 4 columns: X1, X2, yval, y)
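As an aside (our addition, not part of the exercise): scikit-learn can also respect a fixed train/validation split like this one. PredefinedSplit assigns each sample a fold index, with -1 meaning "always in training", so GridSearchCV scores candidates on exactly our validation set instead of cross-validating. A sketch reusing data3 and data3val, with a reduced grid for brevity, as an alternative to the manual search in 3.2 below:

import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X_all = np.vstack([data3[['X1', 'X2']].values, data3val[['X1', 'X2']].values])
y_all = np.concatenate([data3['y'].values, data3val['yval'].values])
# -1 = always in the training fold; 0 = the single validation fold
test_fold = np.concatenate([-np.ones(len(data3)), np.zeros(len(data3val))])

gs = GridSearchCV(svm.SVC(),
                  {'C': [0.3, 1, 3], 'gamma': [10, 30, 100]},
                  cv=PredefinedSplit(test_fold))
gs.fit(X_all, y_all)
print(gs.best_params_)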
X = data3[['X1', 'X2']].values
Xval = data3val[['X1', 'X2']].values
y = data3['y'].values
yval = data3val['yval'].values

3.2 Model selection using the given validation set

C_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]
gamma_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]

best_score = 0
best_params = {'C': None, 'gamma': None}
for C in C_values:
    for gamma in gamma_values:
        svc = svm.SVC(C=C, gamma=gamma)
        svc.fit(X, y)
        score = svc.score(Xval, yval)
        if score > best_score:
            best_score = score
            best_params['C'] = C
            best_params['gamma'] = gamma

best_score, best_params

(0.965, {'C': 0.3, 'gamma': 100})

For comparison, scikit-learn's GridSearchCV cross-validates on the training set rather than using our fixed validation set, so it can land on different hyperparameters:

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

parameters = {'gamma': [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
              'C': [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(X, y)
# sorted(clf.cv_results_.keys())
max_index = np.argmax(clf.cv_results_['mean_test_score'])
clf.cv_results_["params"][max_index]

{'C': 30, 'gamma': 3}

Example 4: Drawing decision boundaries on the iris dataset

4.1 Loading the iris dataset (using petal length and petal width as features)

from sklearn.svm import SVC
from sklearn import datasets
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]
setosa_or_versicolor = (y == 0) | (y == 1)
X = X[setosa_or_versicolor]
y = y[setosa_or_versicolor]

# SVM classifier model
svm_clf = SVC(kernel="linear", C=5)
svm_clf.fit(X, y)

SVC(C=5, kernel='linear')

np.max(X[:, 0])

5.1

4.2 Visualizing a few arbitrary decision boundaries

# Bad models
x0 = np.linspace(0, 5.5, 200)
pred_1 = 5 * x0 - 20
pred_2 = x0 - 1.8
pred_3 = 0.1 * x0 + 0.5

# Overlay the arbitrary decision boundaries on the data
plt.figure(figsize=(6, 4))
plt.plot(x0, pred_1, "g--", linewidth=2)
plt.plot(x0, pred_2, "r--", linewidth=2)
plt.plot(x0, pred_3, "b--", linewidth=2)
plt.scatter(X[:, 0][y == 0], X[:, 1][y == 0], marker="s")
plt.scatter(X[:, 0][y == 1], X[:, 1][y == 1], marker="*")
plt.axis([0, 5.5, 0, 2])
plt.show()

4.3 Inspecting the fitted model's parameters

svm_clf.coef_[0]

array([1.29411744, 0.82352928])

svm_clf.intercept_[0]

-3.7882347112962464

svm_clf.support_vectors_

array([[1.9, 0.4],
       [3. , 1.1]])

np.max(X[:, 0]), np.min(X[:, 0])

(5.1, 1.0)

4.4 Visualizing the maximum-margin decision boundary

def plot_svc_decision_boundary(svm_clf, xmin, xmax):
    w = svm_clf.coef_[0]
    b = svm_clf.intercept_[0]
    # At the decision boundary, w0*x0 + w1*x1 + b = 0
    # => x1 = -w0/w1 * x0 - b/w1
    x0 = np.linspace(xmin, xmax, 200)
    decision_boundary = -w[0] / w[1] * x0 - b / w[1]
    # The gutters satisfy w0*x0 + w1*x1 + b = +/-1, so their vertical offset is 1/w1
    margin = 1 / w[1]
    gutter_up = decision_boundary + margin
    gutter_down = decision_boundary - margin
    svs = svm_clf.support_vectors_
    plt.scatter(svs[:, 0], svs[:, 1], s=180, facecolors='#FFAAAA')
    plt.plot(x0, decision_boundary, "k-", linewidth=2)
    plt.plot(x0, gutter_up, "k--", linewidth=2)
    plt.plot(x0, gutter_down, "k--", linewidth=2)

plt.figure(figsize=(6, 4))
plot_svc_decision_boundary(svm_clf, 0, 5.5)
plt.plot(X[:, 0][y == 1], X[:, 1][y == 1], "bs")
plt.plot(X[:, 0][y == 0], X[:, 1][y == 0], "yo")
plt.xlabel("Petal length", fontsize=14)
plt.axis([0, 5.5, 0, 2])
plt.show()

Example 5: Should features be standardized?
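Short answer: yes. The SVM margin is measured in raw feature units, so a feature with a much larger scale dominates the boundary, as the unscaled plot below shows. In practice the idiomatic fix is to couple the scaler and the classifier in a Pipeline; a minimal sketch of that pattern (our addition, not part of the original exercise):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The scaler is fit on the training data and re-applied automatically at predict time
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=100))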
5.1 Decision boundary with raw features

# Prepare the data
Xs = np.array([[1, 50], [5, 20], [3, 80], [5, 60]]).astype(np.float64)
ys = np.array([0, 0, 1, 1])

# Instantiate and fit the model
svm_clf = SVC(kernel="linear", C=100)
svm_clf.fit(Xs, ys)

# Plot
plt.figure(figsize=(6, 4))
plt.plot(Xs[:, 0][ys == 1], Xs[:, 1][ys == 1], "bo")
plt.plot(Xs[:, 0][ys == 0], Xs[:, 1][ys == 0], "ms")
plot_svc_decision_boundary(svm_clf, 0, 6)
plt.xlabel("$x_0$", fontsize=20)
plt.ylabel("$x_1$ ", fontsize=20, rotation=0)
plt.title("Unscaled", fontsize=16)
plt.axis([0, 6, 0, 90])

(0.0, 6.0, 0.0, 90.0)

5.2 Decision boundary with standardized features

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(Xs)
svm_clf.fit(X_scaled, ys)

plt.plot(X_scaled[:, 0][ys == 1], X_scaled[:, 1][ys == 1], "bo")  # fixed: was X_scaled[: 1]
plt.plot(X_scaled[:, 0][ys == 0], X_scaled[:, 1][ys == 0], "ms")
plot_svc_decision_boundary(svm_clf, -2, 2)
plt.xlabel("$x_0$", fontsize=20)
plt.title("Scaled", fontsize=16)
plt.axis([-2, 2, -2, 2])
plt.show()

Example 6: The effect of outliers on a hard-margin SVM

# Back to the iris dataset (keeping only the two linearly separable classes, as in Example 4)
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]
setosa_or_versicolor = (y == 0) | (y == 1)
X = X[setosa_or_versicolor]
y = y[setosa_or_versicolor]

X_outliers = np.array([[3.4, 1.3], [3.2, 0.8]])
y_outliers = np.array([0, 0])
Xo1 = np.concatenate([X, X_outliers[:1]], axis=0)
yo1 = np.concatenate([y, y_outliers[:1]], axis=0)
Xo2 = np.concatenate([X, X_outliers[1:]], axis=0)
yo2 = np.concatenate([y, y_outliers[1:]], axis=0)

svm_clf1 = SVC(kernel="linear", C=10**9)
svm_clf1.fit(Xo1, yo1)

plt.figure(figsize=(12, 4))
plt.subplot(121)
plt.plot(Xo1[:, 0][yo1 == 1], Xo1[:, 1][yo1 == 1], "bs")
plt.plot(Xo1[:, 0][yo1 == 0], Xo1[:, 1][yo1 == 0], "yo")
plt.text(0.3, 1.0, "Impossible!", fontsize=24, color="red")
plot_svc_decision_boundary(svm_clf1, 0, 5.5)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.annotate("Outlier",
             xy=(X_outliers[0][0], X_outliers[0][1]),
             xytext=(2.5, 1.7),
             ha="center",
             arrowprops=dict(facecolor='black', shrink=0.1),
             fontsize=16)
plt.axis([0, 5.5, 0, 2])

svm_clf2 = SVC(kernel="linear", C=10**9)
svm_clf2.fit(Xo2, yo2)

plt.subplot(122)
plt.plot(Xo2[:, 0][yo2 == 1], Xo2[:, 1][yo2 == 1], "bs")
plt.plot(Xo2[:, 0][yo2 == 0], Xo2[:, 1][yo2 == 0], "yo")
plot_svc_decision_boundary(svm_clf2, 0, 5.5)
plt.xlabel("Petal length", fontsize=14)
plt.annotate("Outlier",
             xy=(X_outliers[1][0], X_outliers[1][1]),
             xytext=(3.2, 0.08),
             ha="center",
             arrowprops=dict(facecolor='black', shrink=0.1),
             fontsize=16)
plt.axis([0, 5.5, 0, 2])
plt.show()

Example 7: Decision boundaries for non-linearly separable data

7.1 Creating a new dataset

from sklearn.pipeline import Pipeline
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

np.min(X[:, 0]), np.max(X[:, 0])

(-1.2720155884887554, 2.4093807207967215)

np.min(X[:, 1]), np.max(X[:, 1])

(-0.6491427462708279, 1.2711135917248466)

# A small detour: how np.meshgrid, ravel, and np.c_ combine to build a grid of points
# (note this overwrites the moons X; it is regenerated in 7.4)
x0s = np.linspace(2, 15, 2)
x1s = np.linspace(3, 12, 2)
x0, x1 = np.meshgrid(x0s, x1s)
x0s, x1s, x0, x1

(array([ 2., 15.]),
 array([ 3., 12.]),
 array([[ 2., 15.], [ 2., 15.]]),
 array([[ 3.,  3.], [12., 12.]]))

x1.ravel()

array([ 3.,  3., 12., 12.])

x0.ravel()

array([ 2., 15.,  2., 15.])

X = np.c_[x0.ravel(), x1.ravel()]
X.shape, X

((4, 2), array([[ 2.,  3.], [15.,  3.], [ 2., 12.], [15., 12.]]))

7.2 Plotting the predictions as contours

def plot_predictions(clf, axes):
    x0s = np.linspace(axes[0], axes[1], 100)
    x1s = np.linspace(axes[2], axes[3], 100)
    x0, x1 = np.meshgrid(x0s, x1s)
    X = np.c_[x0.ravel(), x1.ravel()]
    y_pred = clf.predict(X).reshape(x0.shape)
    y_decision = clf.decision_function(X).reshape(x0.shape)
    plt.contourf(x0, x1, y_pred, cmap=plt.cm.brg, alpha=0.2)
    plt.contourf(x0, x1, y_decision, cmap=plt.cm.brg, alpha=0.1)

7.3 Plotting the original data

def plot_dataset(X, y, axes):
    plt.plot(X[:, 0][y == 0], X[:, 1][y == 0], "bs")
    plt.plot(X[:, 0][y == 1], X[:, 1][y == 1], "g^")
    plt.axis(axes)
    plt.grid(True, which='both')
    plt.xlabel(r"$x_1$", fontsize=20)
    plt.ylabel(r"$x_2$", fontsize=20, rotation=0)

7.4 Plotting the boundaries for different gamma and C values

from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

gamma1, gamma2 = 0.1, 5
C1, C2 = 0.001, 1000
hyperparams = (gamma1, C1), (gamma1, C2), (gamma2, C1), (gamma2, C2)

svm_clfs = []
for gamma, C in hyperparams:
    rbf_kernel_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="rbf", gamma=gamma, C=C))
    ])
    rbf_kernel_svm_clf.fit(X, y)
    svm_clfs.append(rbf_kernel_svm_clf)

plt.figure(figsize=(6, 4))
for i, svm_clf in enumerate(svm_clfs):
    plt.subplot(221 + i)
    plot_predictions(svm_clf, [-1.5, 2.5, -1, 1.5])
    plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
    gamma, C = hyperparams[i]
    plt.title(r"$\gamma = {}, C = {}$".format(gamma, C), fontsize=12)
plt.show()

Example 8*: A hand-written SVM

8.1 Creating the data

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

# data
def create_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
    data = np.array(df.iloc[:100, [0, 1, -1]])
    for i in range(len(data)):
        if data[i, -1] == 0:
            data[i, -1] = -1
    return data[:, :2], data[:, -1]

X, y = create_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

plt.scatter(X[:50, 0], X[:50, 1], label='0')
plt.scatter(X[50:, 0], X[50:, 1], label='1')
plt.legend()

<matplotlib.legend.Legend at 0x1c516838670>

8.2 Defining the SVM class

class SVM:
    def __init__(self, max_iter=100, kernel='linear'):
        self.max_iter = max_iter
        self._kernel = kernel

    def init_args(self, features, labels):
        self.m, self.n = features.shape
        self.X = features
        self.Y = labels
        self.b = 0.0
        self.alpha = np.ones(self.m)
        # Keep the Ei values in a list
        self.E = [self._E(i) for i in range(self.m)]
        # Slack-variable penalty
        self.C = 1.0

    def _KKT(self, i):
        y_g = self._g(i) * self.Y[i]
        if self.alpha[i] == 0:
            return y_g >= 1
        elif 0 < self.alpha[i] < self.C:
            return y_g == 1
        else:
            return y_g <= 1

    # g(x): the prediction for input X[i]
    def _g(self, i):
        r = self.b
        for j in range(self.m):
            r += self.alpha[j] * self.Y[j] * self.kernel(self.X[i], self.X[j])
        return r

    # Kernel function
    def kernel(self, x1, x2):
        if self._kernel == 'linear':
            return sum([x1[k] * x2[k] for k in range(self.n)])
        elif self._kernel == 'poly':
            return (sum([x1[k] * x2[k] for k in range(self.n)]) + 1) ** 2
        return 0

    # E(x): the difference between the prediction g(x) and y
    def _E(self, i):
        return self._g(i) - self.Y[i]

    def _init_alpha(self):
        # The outer loop first sweeps the samples with 0 < a < C and checks the KKT conditions
        index_list = [i for i in range(self.m) if 0 < self.alpha[i] < self.C]
        # Otherwise, sweep the whole training set
        non_satisfy_list = [i for i in range(self.m) if i not in index_list]
        index_list.extend(non_satisfy_list)
        for i in index_list:
            if self._KKT(i):
                continue
            E1 = self.E[i]
            # If E1 is positive, choose the j with the smallest E; if negative, the largest
            if E1 >= 0:
                j = min(range(self.m), key=lambda x: self.E[x])
            else:
                j = max(range(self.m), key=lambda x: self.E[x])
            return i, j

    def _compare(self, _alpha, L, H):
        if _alpha > H:
            return H
        elif _alpha < L:
            return L
        else:
            return _alpha

    def fit(self, features, labels):
        self.init_args(features, labels)
        for t in range(self.max_iter):
            # train
            i1, i2 = self._init_alpha()
            # Bounds
            if self.Y[i1] == self.Y[i2]:
                L = max(0, self.alpha[i1] + self.alpha[i2] - self.C)
                H = min(self.C, self.alpha[i1] + self.alpha[i2])
            else:
                L = max(0, self.alpha[i2] - self.alpha[i1])
                H = min(self.C, self.C + self.alpha[i2] - self.alpha[i1])

            E1 = self.E[i1]
            E2 = self.E[i2]
            # eta = K11 + K22 - 2*K12
            eta = self.kernel(self.X[i1], self.X[i1]) + self.kernel(self.X[i2], self.X[i2]) \
                - 2 * self.kernel(self.X[i1], self.X[i2])
            if eta <= 0:
                # print('eta <= 0')
                continue

            # Corrected to E1 - E2, following pp. 130-131 of the reference book
            alpha2_new_unc = self.alpha[i2] + self.Y[i2] * (E1 - E2) / eta
            alpha2_new = self._compare(alpha2_new_unc, L, H)
            alpha1_new = self.alpha[i1] + self.Y[i1] * self.Y[i2] * (self.alpha[i2] - alpha2_new)

            b1_new = -E1 - self.Y[i1] * self.kernel(self.X[i1], self.X[i1]) * (alpha1_new - self.alpha[i1]) \
                - self.Y[i2] * self.kernel(self.X[i2], self.X[i1]) * (alpha2_new - self.alpha[i2]) + self.b
            b2_new = -E2 - self.Y[i1] * self.kernel(self.X[i1], self.X[i2]) * (alpha1_new - self.alpha[i1]) \
                - self.Y[i2] * self.kernel(self.X[i2], self.X[i2]) * (alpha2_new - self.alpha[i2]) + self.b

            if 0 < alpha1_new < self.C:
                b_new = b1_new
            elif 0 < alpha2_new < self.C:
                b_new = b2_new
            else:
                # Take the midpoint
                b_new = (b1_new + b2_new) / 2

            # Update the parameters
            self.alpha[i1] = alpha1_new
            self.alpha[i2] = alpha2_new
            self.b = b_new
            self.E[i1] = self._E(i1)
            self.E[i2] = self._E(i2)
        return 'train done!'

    def predict(self, data):
        r = self.b
        for i in range(self.m):
            r += self.alpha[i] * self.Y[i] * self.kernel(data, self.X[i])
        return 1 if r > 0 else -1

    def score(self, X_test, y_test):
        right_count = 0
        for i in range(len(X_test)):
            result = self.predict(X_test[i])
            if result == y_test[i]:
                right_count += 1
        return right_count / len(X_test)

    def _weight(self):
        # linear model
        yx = self.Y.reshape(-1, 1) * self.X
        self.w = np.dot(yx.T, self.alpha)
        return self.w

8.3 Instantiating and fitting the SVM

svm = SVM(max_iter=100)
svm.fit(X_train, y_train)

'train done!'

8.4 Scoring the SVM

svm.score(X_test, y_test)

0.72
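As a sanity check (our addition, not part of the exercise), the same train/test split can be fed to scikit-learn's SVC. The hand-written SMO above runs few iterations with simple heuristics, so it is normal for its accuracy to trail the library implementation:

from sklearn.svm import SVC

ref = SVC(kernel='linear', C=1.0)
ref.fit(X_train, y_train)
print("sklearn SVC accuracy:", ref.score(X_test, y_test))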
Lab 1: Using the dataset below, classify with both a linear SVM and a kernel SVM, and draw the decision boundary for the linear kernel

1 Getting the data

from sklearn.svm import SVC
from sklearn import datasets
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]
X, y

(output: the 150 × 2 array of petal lengths and widths, and the 150-element target array with classes 0, 1, 2)

X_train = X[(y == 1) | (y == 2)]
y_train = y[(y == 1) | (y == 2)]
y_train

(output: 100 labels, the first 50 equal to 1 and the last 50 equal to 2)

2 Visualizing the data

plt.scatter(X_train[:50, 0], X_train[:50, 1], marker='x', label='Positive')
plt.scatter(X_train[50:, 0], X_train[50:, 1], marker='o', label='Negative')
plt.legend()

<matplotlib.legend.Legend at 0x1c515115610>

3 Fitting a linear SVM

from sklearn.svm import SVC
svm_clf = SVC(kernel="linear", C=10, max_iter=1000)
svm_clf.fit(X_train, y_train)

SVC(C=10, kernel='linear', max_iter=1000)

svm_clf.score(X_train, y_train)

0.95

4 Trying a kernel SVM

import sklearn.svm as svm
nl_svc = svm.SVC(C=1, gamma=1, probability=True)
nl_svc.fit(X_train, y_train)
nl_svc.score(X_train, y_train)

0.95

5 Drawing the linear SVM's decision boundary

def plot_svc_decision_boundary(svm_clf, xmin, xmax):
    w = svm_clf.coef_[0]
    b = svm_clf.intercept_[0]
    # At the decision boundary, w0*x0 + w1*x1 + b = 0
    # => x1 = -w0/w1 * x0 - b/w1
    x0 = np.linspace(xmin, xmax, 200)
    decision_boundary = -w[0] / w[1] * x0 - b / w[1]
    # The gutters satisfy w0*x0 + w1*x1 + b = +/-1, so their vertical offset is 1/w1
    margin = 1 / w[1]
    gutter_up = decision_boundary + margin
    gutter_down = decision_boundary - margin
    svs = svm_clf.support_vectors_
    plt.scatter(svs[:, 0], svs[:, 1], s=180, facecolors='#FFAAAA')
    plt.plot(x0, decision_boundary, "k-", linewidth=2)
    plt.plot(x0, gutter_up, "k--", linewidth=2)
    plt.plot(x0, gutter_down, "k--", linewidth=2)

np.min(X_train[:, 0]), np.max(X_train[:, 0])

(3.0, 6.9)

plt.figure(figsize=(6, 4))
plot_svc_decision_boundary(svm_clf, 3, 7)
plt.plot(X[:, 0][y == 1], X[:, 1][y == 1], "bs")
plt.plot(X[:, 0][y == 2], X[:, 1][y == 2], "yo")
plt.xlabel("Petal length", fontsize=14)
plt.axis([3, 7, 0, 2])
plt.show()

6 Drawing the non-linear decision boundary

def plot_predictions(clf, axes):
    x0s = np.linspace(axes[0], axes[1], 100)
    x1s = np.linspace(axes[2], axes[3], 100)
    x0, x1 = np.meshgrid(x0s, x1s)
    X = np.c_[x0.ravel(), x1.ravel()]
    y_pred = clf.predict(X).reshape(x0.shape)
    y_decision = clf.decision_function(X).reshape(x0.shape)
    plt.contourf(x0, x1, y_pred, cmap=plt.cm.brg, alpha=0.2)
    plt.contourf(x0, x1, y_decision, cmap=plt.cm.brg, alpha=0.1)

def plot_dataset(X, y, axes):
    plt.plot(X[:, 0][y == 1], X[:, 1][y == 1], "bs")
    plt.plot(X[:, 0][y == 2], X[:, 1][y == 2], "g^")
    plt.axis(axes)
    plt.grid(True, which='both')
    plt.xlabel(r"$x_1$", fontsize=20)
    plt.ylabel(r"$x_2$", fontsize=20, rotation=0)

np.min(X_train[:, 0]), np.max(X_train[:, 0])

(3.0, 6.9)

np.min(X_train[:, 1]), np.max(X_train[:, 1])

(1.0, 2.5)

plot_predictions(nl_svc, [2.5, 7, 1, 3])
plot_dataset(X, y, [2.5, 7, 1, 3])
plt.show()
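Training accuracy alone can flatter the models above. A quick cross-validated estimate (a sketch we added, not required by the lab) gives a fairer comparison between the linear and RBF classifiers:

from sklearn.model_selection import cross_val_score

for name, clf in [("linear", svm_clf), ("rbf", nl_svc)]:
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")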