<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Trueno&#39;s Blog</title>
  
  
  <link href="/atom.xml" rel="self"/>
  
  <link href="http://yoursite.com/"/>
  <updated>2018-10-07T05:22:42.516Z</updated>
  <id>http://yoursite.com/</id>
  
  <author>
    <name>Trueno</name>
    
  </author>
  
  <generator uri="http://hexo.io/">Hexo</generator>
  
  <entry>
    <title>LongestPalindromicSubstring</title>
    <link href="http://yoursite.com/2018/08/19/LongestPalindromicSubstring/"/>
    <id>http://yoursite.com/2018/08/19/LongestPalindromicSubstring/</id>
    <published>2018-08-19T14:57:23.000Z</published>
    <updated>2018-10-07T05:22:42.516Z</updated>
    
    <content type="html"><![CDATA[<p>This post shows how to find the longest palindromic substring using dynamic programming and Manacher's algorithm. See the problem on LeetCode: <em><a href="https://leetcode.com/problems/longest-palindromic-substring/description/" target="_blank" rel="noopener">link</a></em></p><h2 id="动态规划"><a href="#动态规划" class="headerlink" title="动态规划"></a>Dynamic Programming</h2><h3 id="问题分析"><a href="#问题分析" class="headerlink" title="问题分析"></a>Problem Analysis</h3><p>Finding the longest palindromic substring can be modeled as a multistage decision process and solved with dynamic programming. For example, to decide whether the string S = “ababa” is a palindrome, suppose we already know that its inner substring s = “bab” is a palindrome. Then we only need to compare the first and last characters of S. If they are equal, S is a palindrome, given that s is one; if they differ, S is not a palindrome.</p><a id="more"></a><h3 id="多阶段决策过程抽象"><a href="#多阶段决策过程抽象" class="headerlink" title="多阶段决策过程抽象"></a>Formulating the Multistage Decision Process</h3><p>Based on this analysis, deciding whether the substring S[i:j] (both ends inclusive) is a palindrome takes two steps. First check whether the inner substring S[i+1:j-1] is a palindrome; if it is not, S[i:j] is not a palindrome either. If the inner substring is a palindrome and S[i] equals S[j], then S[i:j] is a palindrome. We can therefore define a two-dimensional array P that records the state of every substring S[i:j], where $i \leq j$:</p><script type="math/tex; mode=display">P(i,j)=\begin{cases}\text{true}, & \text{if } S[i:j] \text{ is a palindrome}\\\text{false}, & \text{otherwise}\end{cases}</script><p>With this array, the state transition equation is:</p><script type="math/tex; mode=display">P(i,j)=\bigl(P(i+1,j-1) \land S[i] = S[j]\bigr)</script><p>That is, $P(i,j)$ is true exactly when $P(i+1,j-1)$ is true (S[i+1:j-1] is a palindrome) and the characters S[i] and S[j] are equal.<br>The equation refers to $P(i+1,j-1)$, which is only defined when $ i + 1 \leq j - 1 $, so the cases $ i = j $ and $ i + 1 = j $ must be initialized first: $ P(i,i) $ is always true, and $ P(i, i + 1) $ is true exactly when S[i] equals S[i+1].</p><h3 id="python实现"><a href="#python实现" class="headerlink" title="python实现"></a>Python Implementation</h3><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">Solution</span>:</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">longestPalindrome</span><span class="params">(self, s)</span>:</span></span><br><span class="line">        <span class="string">"""</span></span><br><span class="line"><span class="string">        :type s: str</span></span><br><span class="line"><span class="string">        :rtype: str</span></span><br><span class="line"><span class="string">        """</span></span><br><span class="line">        <span class="comment"># dynamic programming approach</span></span><br><span class="line">        DPMa = [[<span class="number">0</span> <span class="keyword">for</span> i <span class="keyword">in</span> range(len(s))] <span class="keyword">for</span> j <span class="keyword">in</span> range(len(s))]  <span class="comment"># 2-D table of DP states, DPMa[i][j] == 1 means s[i..j] is a palindrome</span></span><br><span class="line"></span><br><span class="line">        LenPa = <span class="number">1</span></span><br><span class="line">        iMax = jMax = <span class="number">0</span></span><br><span class="line">        <span class="keyword">for</span> i <span class="keyword">in</span> range(len(s)):</span><br><span class="line">            DPMa[i][i] = <span class="number">1</span></span><br><span class="line">            <span class="keyword">if</span> i + <span class="number">1</span> &lt; len(s) <span class="keyword">and</span> s[i] == s[i + <span class="number">1</span>]:<span 
class="comment"># two equal adjacent characters form a palindrome of length 2</span></span><br><span class="line">                DPMa[i][i + <span class="number">1</span>] = <span class="number">1</span></span><br><span class="line">                LenPa = <span class="number">2</span></span><br><span class="line">                iMax = i</span><br><span class="line">                jMax = i + <span class="number">1</span></span><br><span class="line">        <span class="keyword">for</span> k <span class="keyword">in</span> range(<span class="number">2</span>, len(s)):</span><br><span class="line">            <span class="keyword">for</span> i <span class="keyword">in</span> range(<span class="number">0</span>, len(s) - k):</span><br><span class="line">                j = i + k</span><br><span class="line">                <span class="keyword">if</span> DPMa[i + <span class="number">1</span>][j - <span class="number">1</span>] <span class="keyword">and</span> s[i] == s[j]:</span><br><span class="line">                    DPMa[i][j] = <span class="number">1</span></span><br><span class="line">                    <span class="keyword">if</span> k + <span class="number">1</span> &gt; LenPa:</span><br><span class="line">                        LenPa = k + <span class="number">1</span></span><br><span class="line">                        iMax = i</span><br><span class="line">                        jMax = j</span><br><span class="line"></span><br><span class="line">        <span class="keyword">return</span> s[iMax:jMax + <span class="number">1</span>]</span><br></pre></td></tr></table></figure><h3 id="时间复杂度与空间复杂度"><a href="#时间复杂度与空间复杂度" class="headerlink" title="时间复杂度与空间复杂度"></a>Time and Space Complexity</h3><p><em>Time complexity: </em>The dynamic programming pass is a doubly nested loop, so the time complexity is $O(n^2)$.<br><em>Space complexity: </em>The two-dimensional array stores the state of every substring, so the space complexity is $O(n^2)$.</p><h2 id="Manacher-算法"><a href="#Manacher-算法" class="headerlink" title="Manacher 算法"></a>Manacher's Algorithm</h2><p>The dynamic programming solution takes $O(n^2)$ time, but the problem also has an $O(n)$ solution known as Manacher's algorithm. The rest of this section is a translated and reorganized version of the LeetCode article on the algorithm (<em><a href="https://articles.leetcode.com/longest-palindromic-substring-part-ii/" target="_blank" rel="noopener">original article</a></em>).</p><h3 id="Note"><a href="#Note" class="headerlink" title="Note"></a>Note</h3><p>This is Part II of “Longest Palindromic Substring”. In this post we look at Manacher's algorithm, which finds the longest palindromic substring in linear time. The earlier LeetCode solution article gave four different methods; the simplest runs in $O(n^2)$ time and uses constant space. Manacher's algorithm runs in O(n) time and uses O(n) space.</p><h3 id="Hint"><a href="#Hint" class="headerlink" title="Hint"></a>Hint</h3><p>Consider the worst cases for the earlier approaches: inputs in which many palindromes overlap, such as “aaaaaaaa” and “cabcbabcbabcba”. In these cases we can exploit the symmetry of palindromes to avoid unnecessary recomputation.</p><h3 id="An-O-N-Solution-Manacher’s-Algorithm"><a href="#An-O-N-Solution-Manacher’s-Algorithm" class="headerlink" title="An O(N) Solution (Manacher’s Algorithm):"></a>An O(N) Solution (Manacher’s Algorithm):</h3><p>First we preprocess the input string S by inserting the special character '#' at both ends and between every pair of adjacent characters.</p><blockquote><p>Example: S = “abaaba” becomes T = “#a#b#a#a#b#a#”.<br>S is the original input string and T is the preprocessed string.<br>$S_i$ denotes the character at position i of S, and $T_i$ the character at position i of T.</p></blockquote><p>After this transformation every input can be handled uniformly: whether the original length is even or odd, the transformed string always has odd length, and the relative order of the original characters is unchanged.</p><p>To find the longest palindromic substring, we expand around each $T_i$ so that $T_{i-d} \ldots T_{i+d}$ forms a palindrome. You should see immediately that d is the length of the palindrome centered at $T_i$, measured in the original string (that is, not counting the '#' characters).</p><p>We store intermediate results in an array P, where $P[i]$ is the length of the palindrome centered at $T_i$. Once P is filled in, the center of the longest palindromic substring is the index of the maximum element of P.<br>For the example above, filling in P by hand from left to right gives:</p><blockquote><p>T = # a # b # a # a # b # a #<br>P = 0 1 0 3 0 1 6 1 0 3 0 1 0</p></blockquote><p>P[6] = 6 is the maximum, and the substring it corresponds to is “abaaba”, which is indeed the longest palindromic substring of the input.</p><p>If we draw a vertical line through the center of the palindrome “abaaba”, the values of P are symmetric about that line. This is not special to this string: the same symmetry appears around the center of any palindrome.</p><p>With this property in mind, consider a more complicated string with overlapping palindromes, S = “babcbabcbaccba”.</p><blockquote><p><img src="/2018/08/19/LongestPalindromicSubstring/img1.png" 
title="字符串S分析"></p><p>The figure above shows T, the string S after the special characters are inserted. Assume the array P has been partially filled in. The solid vertical line marks the center C of the palindrome “abcbabcba”, the two dashed vertical lines mark its boundaries, and L and R are the indices of its left and right edges. The index mirrored from i about C is written $i'$. The question is how to compute $P[i]$ efficiently.</p></blockquote><p>Suppose we now need $P[13]$. We first look at the value at the mirror position $i'$ with respect to the palindrome centered at C. For $i=13$, the mirror position is $i'=9$.</p><blockquote><p><img src="/2018/08/19/LongestPalindromicSubstring/img2.png" title="字符串S分析"></p><p>The two solid green lines show the regions covered by the palindromes centered at $i$ and $i'$. The mirror position of i is $i'$, and $P[i']=P[9]=1$. By the symmetry of the palindrome around its center, it is clear that $P[i]=P[13]$ must also be 1.</p></blockquote><p>As we just saw, $P[i]=P[i']=1$ follows from the symmetry of the palindrome. In fact, the three elements immediately after the center all obey the symmetry: $P[12]=P[10]=0$, $P[13]=P[9]=1$, $P[14]=P[8]=0$.</p><blockquote><p><img src="/2018/08/19/LongestPalindromicSubstring/img3.png" title="字符串S分析"></p><p>Now consider i = 15, whose mirror position about C is $i'=7$. Is $P[15]=P[7]=7$?</p></blockquote><p>So what is $P[i]$ at index 15? If we simply followed the symmetry, $P[i]$ would equal $P[i']$, which is 7. The figure shows that this is wrong. Expanding around $T_{15}$ yields the palindrome “a#b#c#b#a”, which is shorter than what the symmetry suggests. We need to understand why.</p><blockquote><p><img src="/2018/08/19/LongestPalindromicSubstring/img4.png" title="字符串S分析"></p><p>In the figure above, the solid green lines mark the regions that must match on both sides because of the symmetry around C. The solid red lines mark the regions that may not match, since they fall outside the palindrome centered at C. The dashed green lines mark the part that crosses the center.</p></blockquote><p>The two substrings centered at $i'=7$ and $i=15$ match perfectly within the solid green regions, and the part that crosses the center (dashed green) is symmetric as well. Note, however, that $P[i']=7$ and that this palindrome extends past the left edge L of the palindrome centered at C (into the solid red region), so that part is no longer covered by the symmetry around C. For $i=15$, all we know is that $P[i] \geq 5$; to find the exact value of $P[i]$ we have to match characters beyond the right edge R. In this case $T_{21} \neq T_1$, so we conclude that $P[i]=5$.</p><p>Key point of the algorithm:</p><script type="math/tex; mode=display">\begin{cases}P[i] \leftarrow P[i'], & \text{if } P[i'] \leq R - i\\P[i] \geq R - i, & \text{otherwise}\end{cases}</script><p><em>Note: the original article gives the else branch as $P[i] \geq P[i']$, which is incorrect. It is corrected here to $P[i] \geq R - i$: in that case the palindrome centered at i reaches at least the current right edge R, and we must expand past R to find its exact length.</em></p><p>These steps are the core of Manacher's algorithm. What remains is to decide when to move the center C and its right edge R to the right, which is easy:</p><blockquote><p>If the palindrome centered at i expands past the current right edge R, we update C to i (the center of the new palindrome) and extend R to the new palindrome's right edge.</p></blockquote><p>In each step there are two possibilities. If $P[i'] \leq R - i$, we set $P[i] = P[i']$, which takes a single step. Otherwise we try to expand past the right edge R and, if that succeeds, move the center of the palindrome to i. Extending R (the inner loop) takes at most N steps in total, where N is the length of the input string, and positioning and testing each center takes N steps in total as well. The algorithm therefore finishes in at most $2*N$ steps, so its time complexity is O(n).</p><h3 id="python实现-1"><a href="#python实现-1" class="headerlink" title="python实现"></a>Python Implementation</h3><p>The original article gives a Java implementation; the Python version below is based on it.</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">Solution</span>:</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">longestPalindrome</span><span class="params">(self, s)</span>:</span></span><br><span class="line">        s = Solution.PreProcess(s)  <span class="comment"># preprocess the input</span></span><br><span class="line">        P = [<span class="number">0</span>] * (len(s))  <span class="comment"># P[i] is the palindrome length centered at s[i]</span></span><br><span class="line">        C = 
<span class="number">0</span>   <span class="comment"># center of the current palindrome, initially 0</span></span><br><span class="line">        R = <span class="number">0</span>   <span class="comment"># right edge of the current palindrome, initially 0</span></span><br><span class="line">        <span class="keyword">for</span> i <span class="keyword">in</span> range(<span class="number">1</span>, len(s) - <span class="number">1</span>):</span><br><span class="line">            iMirror = <span class="number">2</span>*C - i</span><br><span class="line">            <span class="keyword">if</span> iMirror &lt; <span class="number">0</span>:</span><br><span class="line">                iMirror = <span class="number">0</span></span><br><span class="line">            <span class="comment"># start from the value implied by the symmetry around C, capped at R - i</span></span><br><span class="line">            P[i] = min(R - i, P[iMirror]) <span class="keyword">if</span> R &gt; i <span class="keyword">else</span> <span class="number">0</span></span><br><span class="line">            <span class="comment"># try to expand the palindrome centered at i; the sentinels ^ and $ stop the loop</span></span><br><span class="line">            <span class="keyword">while</span> (s[i - <span class="number">1</span> - P[i]] == s[i + <span class="number">1</span> + P[i]]):</span><br><span class="line">                P[i] += <span class="number">1</span></span><br><span class="line">            <span class="comment"># if the palindrome centered at i passes R, update C and R</span></span><br><span class="line">            <span class="keyword">if</span> (i + P[i] &gt; R):</span><br><span class="line">                R = i + P[i]</span><br><span class="line">                C = i</span><br><span class="line"></span><br><span class="line">        MaxLen = max(P)</span><br><span class="line">        index = P.index(MaxLen)</span><br><span class="line">        <span class="keyword">return</span> s[index - MaxLen: index + MaxLen + <span class="number">1</span>].replace(<span class="string">'#'</span>, <span class="string">''</span>)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">    <span class="meta">@staticmethod</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">PreProcess</span><span class="params">(s)</span>:</span></span><br><span class="line">       <span class="string">"""</span></span><br><span class="line"><span class="string">       Preprocessing: add the sentinels ^ and $ at the two ends and insert # between every pair of characters</span></span><br><span class="line"><span class="string">       :param s: the string to process</span></span><br><span class="line"><span class="string">       :return: the processed string</span></span><br><span class="line"><span class="string">       """</span></span><br><span class="line">       sList = list(<span class="string">'^'</span> + s + <span class="string">'$'</span>)</span><br><span class="line">       <span class="keyword">return</span> <span class="string">'#'</span>.join(sList)</span><br></pre></td></tr></table></figure><h2 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h2><blockquote><p>[1] <a href="https://articles.leetcode.com/longest-palindromic-substring-part-ii/" target="_blank" rel="noopener">https://articles.leetcode.com/longest-palindromic-substring-part-ii/</a></p></blockquote>]]></content>
    
    <summary type="html">
    
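      &lt;p&gt;This post shows how to find the longest palindromic substring using dynamic programming and Manacher's algorithm. See the problem on LeetCode: &lt;em&gt;&lt;a href=&quot;https://leetcode.com/problems/longest-palindromic-substring/description/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;link&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&quot;动态规划&quot;&gt;&lt;a href=&quot;#动态规划&quot; class=&quot;headerlink&quot; title=&quot;动态规划&quot;&gt;&lt;/a&gt;Dynamic Programming&lt;/h2&gt;&lt;h3 id=&quot;问题分析&quot;&gt;&lt;a href=&quot;#问题分析&quot; class=&quot;headerlink&quot; title=&quot;问题分析&quot;&gt;&lt;/a&gt;Problem Analysis&lt;/h3&gt;&lt;p&gt;Finding the longest palindromic substring can be modeled as a multistage decision process and solved with dynamic programming. For example, to decide whether the string S = “ababa” is a palindrome, suppose we already know that its inner substring s = “bab” is a palindrome. Then we only need to compare the first and last characters of S. If they are equal, S is a palindrome, given that s is one; if they differ, S is not a palindrome.&lt;/p&gt;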
    
    </summary>
    
      <category term="LeetCode" scheme="http://yoursite.com/categories/LeetCode/"/>
    
    
      <category term="字符串" scheme="http://yoursite.com/tags/%E5%AD%97%E7%AC%A6%E4%B8%B2/"/>
    
  </entry>
  
  <entry>
    <title>LDA Topic Model</title>
    <link href="http://yoursite.com/2018/07/27/LDAModel/"/>
    <id>http://yoursite.com/2018/07/27/LDAModel/</id>
    <published>2018-07-27T02:53:33.000Z</published>
    <updated>2018-07-29T13:40:06.453Z</updated>
    
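    <content type="html"><![CDATA[<p>This post shows how to train an LDA model with the LatentDirichletAllocation class from scikit-learn (sklearn).</p><h2 id="理论基础"><a href="#理论基础" class="headerlink" title="理论基础"></a>Theory</h2><p>LDA is an unsupervised learning model. It assumes that every document contains several topics and that every topic is a probability distribution over words. As a document-generation model based on a Bayesian network, LDA describes document generation as a probabilistic process.<br>The input to LDA is a corpus made up of a set of documents D. The output consists of the document-topic distributions $\theta$ and the topic-word distributions $\phi$. Both $\theta$ and $\phi$ are assumed to be multinomial distributions. To smooth them, their priors are further assumed to be Dirichlet distributions with parameters $\alpha$ and $\beta$ respectively. Because the Dirichlet distribution is the conjugate prior of the multinomial distribution, this assumption greatly simplifies the statistical inference. A Dirichlet distribution has several parameters; in LDA they are usually all set to the same value, which gives what is called a symmetric Dirichlet distribution. LDA generates a document as follows.</p><a id="more"></a><blockquote><ol><li><p>Choose a document $d_i$ according to the prior probability $p(d_i)$.</p></li><li><p>Sample the multinomial topic distribution $\theta_i$ of document $d_i$ from the Dirichlet distribution with hyperparameter $\alpha$.</p></li><li><p>Sample the topic $z_{i,j}$ of the j-th word of document $d_i$ from the multinomial topic distribution $\theta_i$.</p></li><li><p>Sample the word distribution $\phi_z$ of topic $z_{i,j}$ from the Dirichlet distribution with hyperparameter $\beta$.</p></li><li><p>Sample the final word $W_{i,j}$ from the multinomial word distribution $\phi_z$.  </p></li></ol><footer><strong>Bing Liu</strong><cite>Sentiment Analysis: Mining Opinions, Sentiments, and Emotions</cite></footer></blockquote><h2 id="sklearn常用类"><a href="#sklearn常用类" class="headerlink" title="sklearn常用类"></a>Commonly Used sklearn Classes</h2><h3 id="CountVectorizer"><a href="#CountVectorizer" class="headerlink" title="CountVectorizer"></a>CountVectorizer</h3><p>The text feature extraction classes most often used in sklearn are CountVectorizer and TfidfVectorizer. The notes below explain some of the parameters of CountVectorizer, following the <em><a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html" target="_blank" rel="noopener">official sklearn documentation</a></em>.</p><ul><li><p>input: a string. With <em>filename</em>, the items passed to fit are expected to be file names whose raw content is read and analyzed. With <em>file</em>, the documentation says each item must have a read method that is called to fetch the bytes in memory (that is, a file-like object). With any other value, the items are treated as the strings or bytes to analyze directly.</p></li><li><p>encoding: a string naming the encoding used to decode the input; the default is <em>utf-8</em>.</p></li><li><p>decode_error: one of <em>{‘strict’, ‘ignore’, ‘replace’}</em>. It controls what happens when the input contains bytes that are not valid in the encoding given by the encoding parameter. The default, strict, raises a UnicodeDecodeError; ignore skips the offending bytes; replace substitutes a replacement character for them, as in Python's bytes.decode.</p></li><li><p>analyzer: one of <em>{‘word’, ‘char’, ‘char_wb’}</em>. It determines whether the features are word n-grams or character n-grams. char_wb builds character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with spaces.</p></li><li><p>preprocessor: overrides the preprocessing stage while keeping the tokenization and n-gram generation steps; the default is None.</p></li><li><p>tokenizer: similar to preprocessor, it overrides the tokenization step while keeping the preprocessing and n-gram generation steps. It only applies when analyzer is word.</p></li><li><p>ngram_range: the range of n for the n-grams to extract, given as a tuple (min_n, max_n); all n-grams with $min_n \leq n \leq max_n$ are extracted.</p></li><li><p>stop_words: controls stop word removal. With ‘english’, a built-in English stop word list is used. With a list, the list is assumed to contain all the stop words, and every word in it is removed from the text. It only applies when analyzer is word.</p></li><li><p>lowercase: a boolean, True by default, that controls whether the text is converted to lowercase.</p></li><li><p>max_df: an upper threshold, either a float in the range $[0.0, 1.0]$ or an int; the default is 1.0. When the vocabulary is built, a term whose document frequency is above the threshold (a proportion of documents for a float, an absolute number of documents for an int) is left out of the vocabulary.</p></li><li><p>min_df: like max_df, but a lower threshold.</p></li><li><p>max_features: an int, or None (the default). With an int, the vocabulary keeps only the top max_features terms ordered by term frequency across the corpus.</p></li></ul><!-- strip_accents、token_pattern、vocabulary、binary、dtype --><h3 id="LatentDirichletAllocation"><a href="#LatentDirichletAllocation" class="headerlink" title="LatentDirichletAllocation"></a>LatentDirichletAllocation</h3><p>The sklearn class for training an LDA topic model is <em><a href="http://scikit-learn.org/dev/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html" target="_blank" rel="noopener">LatentDirichletAllocation</a></em>.</p><h4 id="参数"><a href="#参数" class="headerlink" title="参数"></a>Parameters</h4><ul><li><p>n_components: the number of topics; an int, 10 by default.</p></li><li><p>doc_topic_prior: the parameter $\alpha$ of the prior on the document-topic distribution $\theta$; a float. If None, $\alpha$ defaults to 1 / n_components.</p></li><li><p>topic_word_prior: the parameter $\beta$ of the prior on the topic-word distribution $\phi$; a float. If None, $\beta$ defaults to 1 / n_components.</p></li><li><p>learning_method: the method used to fit LDA, either ‘batch’ or ‘online’; the default is ‘batch’. On large data sets ‘online’ is faster than ‘batch’.</p></li><li><p>max_iter: the maximum number of iterations; an int.</p></li></ul><h4 id="方法"><a 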
href="#方法" class="headerlink" title="方法"></a>Methods</h4><ul><li><p>fit(X[, y]): fits the LDA model on the training data; X is the document-term count matrix produced by CountVectorizer.</p></li><li><p>transform(X): uses the fitted model to infer the topic distribution of each document in X. </p></li><li><p>fit_transform(X[, y]): fits the LDA model on the input corpus (the training data) and returns the topic distributions of its documents.</p></li><li><p>perplexity(X[, doc_topic_distr, sub_sampling]): computes the perplexity of the corpus X.</p></li></ul><h2 id="模型训练与调参"><a href="#模型训练与调参" class="headerlink" title="模型训练与调参"></a>Model Training and Tuning</h2><h3 id="以困惑度（Perplexity）为基础调参"><a href="#以困惑度（Perplexity）为基础调参" class="headerlink" title="以困惑度（Perplexity）为基础调参"></a>Tuning by Perplexity</h3><p>Before training, the number of topics n has to be chosen by the model designer, and this choice affects the quality of the discovered topics to some extent. Perplexity can be used to help choose it:</p><ol><li><p>Choose the range of candidate topic numbers: n_topics = range(min, max).</p></li><li><p>For each n with $min \leq n &lt; max$, train an LDA model and compute its perplexity.</p></li><li><p>Plot perplexity against n_topics and look for the most suitable value near the lowest point of the curve.</p></li></ol><h3 id="类实现"><a href="#类实现" class="headerlink" title="类实现"></a>Class Implementation</h3><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"></span><br><span class="line"><span class="keyword">from</span> sklearn.feature_extraction.text <span class="keyword">import</span> CountVectorizer</span><br><span class="line"><span class="keyword">from</span> sklearn.decomposition <span class="keyword">import</span> LatentDirichletAllocation</span><br><span class="line"><span class="keyword">import</span> joblib  <span class="comment"># sklearn.externals.joblib was removed in scikit-learn 0.23</span></span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">LDATrain</span>:</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">__init__</span><span class="params">(self, DocLst, NumTopics, MaxIter, MaxFeatures)</span>:</span></span><br><span class="line">        <span class="string">"""Initialize; DocLst holds the training or test documents"""</span></span><br><span class="line">        self.DocLst = DocLst    <span class="comment"># input list, one document per element</span></span><br><span class="line">        self.NumTopics = NumTopics  <span class="comment"># list of candidate topic numbers</span></span><br><span class="line">        self.MaxIter = MaxIter  <span class="comment"># maximum number of iterations</span></span><br><span class="line">        self.MaxFeatures = MaxFeatures  <span class="comment"># maximum vocabulary size for the term counts</span></span><br><span 
class="line"></span><br><span class="line">        <span class="comment">#计算词频时需要用到的变量</span></span><br><span class="line">        self.TFVectorizer = <span class="keyword">None</span></span><br><span class="line">        self.TF = <span class="keyword">None</span>  <span class="comment">#词频统计结果</span></span><br><span class="line">        <span class="comment">#进行模型迭代时需要的变量</span></span><br><span class="line">        self.LDAModelLst = []   <span class="comment">#迭代产生的LDA模型列表</span></span><br><span class="line">        self.PerplexityLst = [] <span class="comment">#困惑度列表</span></span><br><span class="line">        self.BestIndex = <span class="keyword">None</span></span><br><span class="line">        self.BestLDAModel = <span class="keyword">None</span>    <span class="comment">#最佳LDA 模型</span></span><br><span class="line">        self.BestLDAModelPerplexity = <span class="keyword">None</span>  <span class="comment">#最佳LDA模型的困惑度</span></span><br><span class="line">        self.BestTopicNum = <span class="keyword">None</span>    <span class="comment">#最合适的主题数</span></span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">LDACountVectorizer</span><span class="params">(self, MaxDf = <span class="number">0.95</span>, MinDf = <span class="number">2</span>)</span>:</span><span class="comment">#这个值还有待确认</span></span><br><span class="line">        <span class="string">"""统计词频函数，调用CountVectorizer完成"""</span></span><br><span class="line">        self.TFVectorizer = CountVectorizer(max_df = MaxDf,\</span><br><span class="line">                                       min_df = MinDf,\</span><br><span class="line">                                       max_features = self.MaxFeatures)</span><br><span class="line">        self.TF = self.TFVectorizer.fit_transform(self.DocLst)</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span 
class="title">LDASaveTF</span><span class="params">(self, TFModelPath)</span>:</span></span><br><span class="line">        <span class="string">"""保存词频"""</span></span><br><span class="line">        joblib.dump(self.TF, TFModelPath)</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">LDALoadVectorizer</span><span class="params">(self, TFModelPath)</span>:</span></span><br><span class="line">        <span class="string">"""导入之前计算得到的词频统计结果"""</span></span><br><span class="line">        self.TF = joblib.load(TFModelPath)</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">LDATrain</span><span class="params">(self, NumTopic, MaxIter)</span>:</span></span><br><span class="line">        <span class="string">"""一次LDA训练，NumTopic为当前训练的主题数，MaxIter为最大迭代数"""</span></span><br><span class="line">        LDAResult = LatentDirichletAllocation(n_components = NumTopic, \</span><br><span class="line">                                              max_iter = MaxIter,\</span><br><span class="line">                                              learning_method = <span class="string">'batch'</span>,\</span><br><span class="line">                                              <span class="comment"># evaluate_every = 200, \</span></span><br><span class="line">                                              <span class="comment"># perp_tol = 0.01</span></span><br><span class="line">                                              )</span><br><span class="line">        <span class="comment"># LDAResult.fit(self.TF)</span></span><br><span class="line">        <span class="comment"># TrainGamma = LDAResult.transform(self.TF)</span></span><br><span class="line">        <span class="comment"># TranPerplexity = LDAResult.perplexity(self.TF, TrainGamma)</span></span><br><span class="line">        <span class="comment"># return 
TrainGamma, TranPerplexity</span></span><br><span class="line">        <span class="keyword">return</span> LDAResult.fit(self.TF), LDAResult.perplexity(self.TF)</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">IterationLDATrain</span><span class="params">(self)</span>:</span></span><br><span class="line">        <span class="string">"""Train one LDA model per candidate topic count and keep the best one.</span></span><br><span class="line"><span class="string">        self.NumTopics is a list of all candidate topic counts;</span></span><br><span class="line"><span class="string">        self.MaxIter is the maximum number of iterations for a single LDA training run."""</span></span><br><span class="line">        <span class="comment"># Start the iterative training</span></span><br><span class="line">        index = <span class="number">0</span></span><br><span class="line">        <span class="keyword">for</span> NumTopic <span class="keyword">in</span> self.NumTopics:</span><br><span class="line">            lda, perplexity = self.LDATrain(NumTopic, self.MaxIter)</span><br><span class="line">            self.LDAModelLst.append(lda)</span><br><span class="line">            self.PerplexityLst.append(perplexity)</span><br><span class="line">            print(index)</span><br><span class="line">            index += <span class="number">1</span></span><br><span class="line">        <span class="comment"># Record the best model, i.e. the one with the lowest perplexity</span></span><br><span class="line">        BestIndex = self.PerplexityLst.index(min(self.PerplexityLst))<span class="comment"># index of the best model</span></span><br><span class="line">        self.BestLDAModelPerplexity = min(self.PerplexityLst)</span><br><span class="line">        self.BestTopicNum = self.NumTopics[BestIndex]</span><br><span class="line">        self.BestLDAModel = self.LDAModelLst[BestIndex]</span><br><span class="line">        self.BestIndex = BestIndex</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">TransformBestModel</span><span 
class="params">(self)</span>:</span></span><br><span class="line">        <span class="keyword">for</span> doc <span class="keyword">in</span> self.TF:</span><br><span class="line">            print(self.BestLDAModel.transform(doc))</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">__print_top_Words</span><span class="params">(self, model, FeatureNames, NumTopWords)</span>:</span></span><br><span class="line">        <span class="keyword">for</span> topic_idx, topic <span class="keyword">in</span> enumerate(model.components_):</span><br><span class="line">            print(<span class="string">"Topic #%d:"</span> % topic_idx)</span><br><span class="line">            print(<span class="string">" "</span>.join([FeatureNames[i]</span><br><span class="line">                            <span class="keyword">for</span> i <span class="keyword">in</span> topic.argsort()[:-NumTopWords - <span class="number">1</span>:<span class="number">-1</span>]]))</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">SaveTopicWords</span><span class="params">(self, NumTopWords, FilePath)</span>:</span></span><br><span class="line">        <span class="string">"""保存主题关键词，NumTopWord表示前多少个词"""</span></span><br><span class="line">        TopicWords = <span class="string">""</span></span><br><span class="line">        FeatureNames = self.TFVectorizer.get_feature_names()</span><br><span class="line">        <span class="keyword">for</span> topic_idx, topic <span class="keyword">in</span> enumerate(self.BestLDAModel.components_):</span><br><span class="line">            TopicWords += <span class="string">"Topic #%d:"</span> % topic_idx</span><br><span class="line">            TopicWords += <span class="string">" "</span>.join([FeatureNames[i]</span><br><span class="line">                            <span 
class="keyword">for</span> i <span class="keyword">in</span> topic.argsort()[:-NumTopWords - <span class="number">1</span>:<span class="number">-1</span>]]) +<span class="string">'\n'</span></span><br><span class="line">        f = open(FilePath + <span class="string">'TopicWords.txt'</span>, <span class="string">'w'</span>)</span><br><span class="line">        f.write(TopicWords)</span><br><span class="line">        f.close()</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">PrintBestModelAndPerplexity</span><span class="params">(self, NumTopWords)</span>:</span></span><br><span class="line">        <span class="string">"""打印出最佳模型"""</span></span><br><span class="line">        print(<span class="string">"Best Number of Topic in LDA Model is "</span>, self.BestTopicNum)</span><br><span class="line">        print(<span class="string">"the min Perplexity is"</span>, self.BestLDAModelPerplexity)</span><br><span class="line">        print(<span class="string">"Best Model is \n"</span>)</span><br><span class="line">        self.__print_top_Words(self.BestLDAModel, self.TFVectorizer.get_feature_names(), NumTopWords)</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">SaveAllLDAMode</span><span class="params">(self, FilePath)</span>:</span></span><br><span class="line">        <span class="string">"""保存所有LDAModel"""</span></span><br><span class="line">        <span class="comment">#检查该目录是否存在，若不存在则创建</span></span><br><span class="line">        CreateDir(FilePath)</span><br><span class="line">        index = <span class="number">0</span></span><br><span class="line">        <span class="keyword">for</span> m <span class="keyword">in</span> self.LDAModelLst:</span><br><span class="line">            joblib.dump(m, FilePath + <span class="string">'LDA-model-'</span> + str(index) + <span 
class="string">'.model'</span>)</span><br><span class="line">            index += <span class="number">1</span></span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">SaveBestModel</span><span class="params">(self, FilePath)</span>:</span></span><br><span class="line">        <span class="string">"""保存最好的LDAmodel"""</span></span><br><span class="line">        joblib.dump(self.BestLDAModel, FilePath + <span class="string">'BestModel.model'</span>)</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">SavePerplexityCurveAndText</span><span class="params">(self, FilePath)</span>:</span></span><br><span class="line">        <span class="string">"""保存所有的困惑度（Perplexity），对应的曲线图像"""</span></span><br><span class="line">        <span class="comment">#检查该目录是否存在，若不存在则创建</span></span><br><span class="line">        CreateDir(FilePath)</span><br><span class="line">        <span class="comment"># 保存perplexity结果</span></span><br><span class="line">        <span class="keyword">with</span> open(FilePath + <span class="string">'Perplexity.txt'</span>, <span class="string">'w'</span>) <span class="keyword">as</span> f:</span><br><span class="line">            PerplexityLstStr = <span class="string">""</span></span><br><span class="line">            index = <span class="number">0</span></span><br><span class="line">            <span class="keyword">for</span> x <span class="keyword">in</span> self.PerplexityLst:</span><br><span class="line">                PerplexityLstStr += str(index) + <span class="string">'|'</span> + str(self.NumTopics[index]) + <span class="string">'|'</span> + str(x) + <span class="string">'\n'</span></span><br><span class="line">                index += <span class="number">1</span></span><br><span class="line">            f.write(PerplexityLstStr)</span><br><span class="line">        <span 
class="comment">#绘制曲线并保存</span></span><br><span class="line">        plt.close(<span class="string">'all'</span>)</span><br><span class="line">        Figure = plt.figure()</span><br><span class="line">        ax = Figure.add_subplot(<span class="number">1</span>, <span class="number">1</span>, <span class="number">1</span>)</span><br><span class="line">        ax.plot(self.NumTopics, self.PerplexityLst)</span><br><span class="line">        ax.set_xlabel(<span class="string">"# of topics"</span>)</span><br><span class="line">        ax.set_ylabel(<span class="string">"Approximate Perplexity"</span>)</span><br><span class="line">        plt.grid(<span class="keyword">True</span>)</span><br><span class="line">        plt.savefig(FilePath + <span class="string">'PerplexityTrend.png'</span>)</span><br><span class="line">        <span class="comment">#plt.show()</span></span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">PrintDocTopicDist</span><span class="params">(self)</span>:</span></span><br><span class="line">        <span class="string">"""打印出文档关于主题的矩阵，每一行表示文档，列表示是当前主题概率"""</span></span><br><span class="line">        doc_topic_dist = self.BestLDAModel.transform(self.TF)</span><br><span class="line">        <span class="keyword">for</span> idx, dist <span class="keyword">in</span> enumerate(doc_topic_dist):</span><br><span class="line">            <span class="comment"># 注意：由于sklearn LDA函数限制，此函数中输出的topic_word矩阵未normalize</span></span><br><span class="line">            dist = [str(x) <span class="keyword">for</span> x <span class="keyword">in</span> dist]</span><br><span class="line">            print(str(idx + <span class="number">1</span>) + <span class="string">','</span>)</span><br><span class="line">            print(<span class="string">','</span>.join(dist) + <span class="string">'\n'</span>)</span><br><span class="line"></span><br><span class="line">    <span 
class="function"><span class="keyword">def</span> <span class="title">SaveDocTopicDist</span><span class="params">(self, FilePath)</span>:</span></span><br><span class="line">        <span class="string">"""保存出文档关于主题的矩阵，每一行表示文档，列表示是当前主题概率"""</span></span><br><span class="line">        doc_topic_dist = self.BestLDAModel.transform(self.TF)</span><br><span class="line">        DocTopic = <span class="string">''</span></span><br><span class="line">        <span class="keyword">for</span> idx, dist <span class="keyword">in</span> enumerate(doc_topic_dist):</span><br><span class="line">            <span class="comment"># 注意：由于sklearn LDA函数限制，此函数中输出的topic_word矩阵未normalize</span></span><br><span class="line">            dist = [str(x) <span class="keyword">for</span> x <span class="keyword">in</span> dist]</span><br><span class="line">            DocTopic += <span class="string">'Document '</span> + str(idx + <span class="number">1</span>) + <span class="string">':'</span> +<span class="string">','</span>.join(dist) + <span class="string">'\n'</span></span><br><span class="line">            <span class="comment"># print str(idx + 1) + ','</span></span><br><span class="line">            <span class="comment"># print ','.join(dist) + '\n'</span></span><br><span class="line">        f = open(FilePath + <span class="string">'DocTopicDist.txt'</span>, <span class="string">'w'</span>)</span><br><span class="line">        f.write(DocTopic)</span><br><span class="line">        f.close()</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">SaveConfigFile</span><span class="params">(self, FilePath)</span>:</span></span><br><span class="line">        <span class="string">"""保存文件配置"""</span></span><br><span class="line">        f = open(FilePath + <span class="string">'Config.txt'</span>, <span class="string">'a'</span>)</span><br><span class="line">        Config = \</span><br><span class="line">    
        <span class="string">'k param = '</span> + str(max(self.NumTopics) + <span class="number">1</span>) + <span class="string">'\n'</span> + \</span><br><span class="line">            <span class="string">'MaxFeatures = '</span> + str(self.MaxFeatures) + <span class="string">'\n'</span> + \</span><br><span class="line">            <span class="string">'BestTopicNum = '</span> + str(self.BestTopicNum) + <span class="string">'\n'</span> + \</span><br><span class="line">            <span class="string">'BestIndex = '</span> + str(self.BestIndex)</span><br><span class="line">        f.write(Config)</span><br><span class="line">        f.close()</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="实验"><a href="#实验" class="headerlink" title="实验"></a>Experiments</h2><h3 id="主题-Perplexity曲线图"><a href="#主题-Perplexity曲线图" class="headerlink" title="主题-Perplexity曲线图"></a>Topic Count vs. Perplexity Curve</h3><img src="/2018/07/27/LDAModel/PerplexityTrend.png" title="主题-Perplexity"><p>The figure shows that perplexity reaches its minimum when the number of topics is 4.</p><h3 id="主题-关键词"><a href="#主题-关键词" class="headerlink" title="主题-关键词"></a>Topic Keywords</h3><p>The topic-keyword distribution for 4 topics is listed below.</p><blockquote><p>Topic #0:couple counseling marriage therapy tip gloria relationship save saving expert help dr back get cambridge<br>Topic #1:today via thought bomber day go say im year love prayer http dont around affected<br>Topic #2:marathon explosion people new looking line injured finish runner victim friend today area via dead<br>Topic #3:suspect bombing police marathon amp photo news say breaking local official officer fbi three area</p></blockquote><p>Judging from the keywords, the topics separate reasonably well: Topic #0 gathers counseling and marriage terms, Topic #1 personal reactions such as thoughts and prayers, Topic #2 the explosion and its victims, and Topic #3 the suspect and the police investigation.</p><h2 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h2><blockquote><p>[1] 靳志辉. LDA数学八卦[M]. 2013.<br>[2] <a href="https://blog.csdn.net/TiffanyRabbit/article/details/76445909" target="_blank" rel="noopener">https://blog.csdn.net/TiffanyRabbit/article/details/76445909</a></p></blockquote><p><strong> Code for this post: <a 
href="https://github.com/zhaohuang123/sentiment-analysis/tree/master/blog/LDA" target="_blank" rel="noopener">https://github.com/zhaohuang123/sentiment-analysis/tree/master/blog/LDA</a> </strong></p>]]></content>
    
    <summary type="html">
    
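      &lt;p&gt;This post shows how to train an LDA model with the LatentDirichletAllocation class from sklearn.&lt;/p&gt;
&lt;h2 id=&quot;理论基础&quot;&gt;&lt;a href=&quot;#理论基础&quot; class=&quot;headerlink&quot; title=&quot;理论基础&quot;&gt;&lt;/a&gt;Theoretical Background&lt;/h2&gt;&lt;p&gt;&amp;emsp;&amp;emsp;LDA is an unsupervised learning model. It assumes that each document contains several topics and that each topic is a probability distribution over words. As a document generation model based on a Bayesian network, LDA describes the generation of a document as a probabilistic process.&lt;br&gt;&amp;emsp;&amp;emsp;The input to LDA is a corpus made up of a set of documents D. The output includes the document-topic distribution $\theta$ and the topic-word distribution $\phi$. Both $\theta$ and $\phi$ are taken to be the parameters of multinomial distributions. To make these distributions smoother, the two parameters are in turn given Dirichlet priors with parameters $\alpha$ and $\beta$ respectively. Because the Dirichlet distribution is the conjugate prior of the multinomial distribution, this choice of prior greatly simplifies the statistical computation. A Dirichlet distribution has several parameters; when it is used in LDA they are usually all set to the same value, which gives what is called a symmetric Dirichlet distribution. An LDA model generates a document as follows.&lt;/p&gt;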
    
    </summary>
    
      <category term="机器学习" scheme="http://yoursite.com/categories/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/"/>
    
      <category term="主题模型" scheme="http://yoursite.com/categories/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/%E4%B8%BB%E9%A2%98%E6%A8%A1%E5%9E%8B/"/>
    
    
      <category term="LDA" scheme="http://yoursite.com/tags/LDA/"/>
    
      <category term="sklearn" scheme="http://yoursite.com/tags/sklearn/"/>
    
  </entry>
  
  <entry>
    <title>K-Means++ Based on Jaccard Distance</title>
    <link href="http://yoursite.com/2018/07/25/KMeansPP/"/>
    <id>http://yoursite.com/2018/07/25/KMeansPP/</id>
    <published>2018-07-25T05:36:57.000Z</published>
    <updated>2018-07-29T13:36:32.511Z</updated>
    
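    <content type="html"><![CDATA[<p>This post builds on <em><a href="http://localhost:4000/2018/07/19/KMeansJD/" target="_blank" rel="noopener">K-Means Based on Jaccard Distance</a></em>. It replaces the random choice of initial vectors in the original K-Means algorithm with the probabilistic selection used by K-Means++; distances are again computed with the Jaccard distance.</p><h2 id="理论基础"><a href="#理论基础" class="headerlink" title="理论基础"></a>Theoretical Background</h2><p>&emsp;&emsp;The original K-Means algorithm uses a greedy strategy to iteratively minimize the squared error (solving K-Means is in effect minimizing the squared error). Because its initial mean vectors are chosen at random, however, the algorithm is unstable: a poor initialization can require many iterations and a long running time. To address this problem, David Arthur and Sergei Vassilvitskii proposed the K-Means++ algorithm.</p><p>&emsp;&emsp;K-Means++ differs from K-Means mainly in how the initial means are chosen. The initial mean vectors are selected as follows:</p><a id="more"></a><ol><li><p>Pick one element of $ D = \left[\vec x_1, \vec x_2, \dots, \vec x_m \right] $ uniformly at random as the first initial vector $ \vec \mu_1 $.</p></li><li><p>Pick a new initial mean vector $ \vec \mu_i $ from the sample set D at random, choosing sample $ \vec x $ with probability $ \frac {D(\vec x)^2} {\sum_{\vec x' \in D} D(\vec x')^2} $.</p></li><li><p>Repeat step 2 until k initial mean vectors have been chosen.</p></li></ol><p>&emsp;&emsp;Here $ D(\vec x) $ is the shortest distance from sample $ \vec x $ to the set of mean vectors chosen so far.</p><p>&emsp;&emsp;The core idea of K-Means++ follows from these steps: when choosing initial mean vectors, prefer points that are far apart. Intuitively, points that are farther apart are more likely to belong to different clusters, so fewer refinement iterations should be needed afterwards.</p><h2 id="算法过程"><a href="#算法过程" class="headerlink" title="算法过程"></a>Algorithm</h2><p>&emsp;&emsp;K-Means based on Jaccard Distance does not compute mean vectors, so the seeding procedure of the original K-Means++ algorithm has to be adapted. The resulting flow is shown in the figure.<br><img src="/2018/07/25/KMeansPP/K-MeansPP.png" title="K-Means++"><br>&emsp;&emsp;In the figure, Seeds is the set of initial mean vectors; DistanceMatrix stores, for every sample, its shortest distance to the set of initial vectors; SumDist is the sum of the distances in DistanceMatrix (the code below sums the squared distances, as the probability formula requires); Seed is the pointer used to traverse Seeds; and Probability is the selection probability of each sample, computed from DistanceMatrix and SumDist.<br>&emsp;&emsp;The flow first initializes Seeds to the empty set and then adds one sample, chosen at random from the sample set, as the first initial mean vector. This is the only initial mean vector that is chosen uniformly at random. The flow then enters the loop that computes new initial mean vectors. Each pass re-initializes DistanceMatrix and SumDist and points Seed at the first element of Seeds. The loop exits once Seeds contains k elements, that is, once k initial mean vectors have been generated.</p><h2 id="代码实现"><a href="#代码实现" class="headerlink" title="代码实现"></a>Implementation</h2><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span 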
class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> random</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">KMeansPPChooseSeeds</span><span class="params">(self)</span>:</span></span><br><span class="line">    <span class="string">"""K-Means++ seeding: choose the k initial seeds."""</span></span><br><span class="line">    <span class="comment"># 1a. Take one center c1, chosen uniformly at random from the tweets (the English comments follow the paper by D. Arthur).</span></span><br><span class="line">    seed = random.choice(list(self.tweets.keys()))</span><br><span class="line"></span><br><span class="line">    <span class="comment"># 1b. Take a new center ci, choosing x from X with probability D(x)^2 / Sum(D(xi)^2)</span></span><br><span class="line">    seeds = set([seed])</span><br><span class="line">    <span class="keyword">while</span> len(seeds) &lt; self.k:</span><br><span class="line">        DistanceMatrix = &#123;&#125;</span><br><span class="line">        SumSqrDist = <span class="number">0</span></span><br><span class="line">        <span class="keyword">for</span> seed <span class="keyword">in</span> seeds:</span><br><span class="line">            <span class="keyword">for</span> ID <span class="keyword">in</span> self.tweets:</span><br><span class="line">                <span class="comment"># Skip tweets that are already seeds so that they cannot be drawn again</span></span><br><span class="line">                <span class="keyword">if</span> ID <span class="keyword">in</span> seeds:</span><br><span class="line">                    <span class="keyword">continue</span></span><br><span class="line">                Dist = self.JaccardDistance(self.tweets[seed], self.tweets[ID])</span><br><span class="line">                <span class="comment"># Keep the distance to the nearest seed, i.e. D(x)</span></span><br><span class="line">                <span class="keyword">if</span> ID <span class="keyword">not</span> <span class="keyword">in</span> DistanceMatrix <span class="keyword">or</span> Dist &lt; DistanceMatrix[ID]:</span><br><span class="line">                    DistanceMatrix[ID] = Dist</span><br><span class="line">        <span class="keyword">for</span> ID <span class="keyword">in</span> DistanceMatrix:</span><br><span class="line">            SumSqrDist += DistanceMatrix[ID] * DistanceMatrix[ID]</span><br><span class="line">        <span class="keyword">if</span> SumSqrDist == <span class="number">0</span>:</span><br><span class="line">            <span class="keyword">break</span>  <span class="comment"># no tweet left, or every remaining tweet duplicates a seed: fewer than k seeds are returned</span></span><br><span class="line">        ProbabilityDict = &#123;&#125;</span><br><span class="line">        <span class="keyword">for</span> ID <span class="keyword">in</span> DistanceMatrix:</span><br><span class="line">            ProbabilityDict[ID] = DistanceMatrix[ID] * DistanceMatrix[ID] / SumSqrDist <span class="comment"># could be optimized</span></span><br><span class="line">        <span class="comment"># The original listing stops here. The sampling step below is an assumed completion:</span></span><br><span class="line">        <span class="comment"># draw the next seed with probability proportional to D(x)^2 (random.choices needs Python 3.6+).</span></span><br><span class="line">        IDs = list(ProbabilityDict.keys())</span><br><span class="line">        NewSeed = random.choices(IDs, weights=[ProbabilityDict[ID] <span class="keyword">for</span> ID <span class="keyword">in</span> IDs], k=<span class="number">1</span>)[<span class="number">0</span>]</span><br><span class="line">        seeds.add(NewSeed)</span><br><span class="line">    <span class="keyword">return</span> list(seeds)</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="实验结果"><a href="#实验结果" class="headerlink" title="实验结果"></a>Results</h2><p>The results are essentially the same as in the previous post (<a href="http://localhost:4000/2018/07/19/KMeansJD/" target="_blank" rel="noopener">K-Means Based on Jaccard Distance</a>).</p><h2 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h2><blockquote><p>[1] Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding[C]. Eighteenth ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana. Society for Industrial and Applied Mathematics, 2007: 1027-1035.<br>[2] <a href="https://github.com/findkim/Jaccard-K-Means" target="_blank" rel="noopener">https://github.com/findkim/Jaccard-K-Means</a></p></blockquote><p><strong> Code for this post: <a href="https://github.com/zhaohuang123/sentiment-analysis/tree/master/blog/K-Means%2B%2B" target="_blank" rel="noopener">https://github.com/zhaohuang123/sentiment-analysis/tree/master/blog/K-Means%2B%2B</a></strong></p>]]></content>
    
    <summary type="html">
    
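      &lt;p&gt;This post builds on &lt;em&gt;&lt;a href=&quot;http://localhost:4000/2018/07/19/KMeansJD/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;K-Means Based on Jaccard Distance&lt;/a&gt;&lt;/em&gt;. It replaces the random choice of initial vectors in the original K-Means algorithm with the probabilistic selection used by K-Means++; distances are again computed with the Jaccard distance.&lt;/p&gt;
&lt;h2 id=&quot;理论基础&quot;&gt;&lt;a href=&quot;#理论基础&quot; class=&quot;headerlink&quot; title=&quot;理论基础&quot;&gt;&lt;/a&gt;Theoretical Background&lt;/h2&gt;&lt;p&gt;&amp;emsp;&amp;emsp;The original K-Means algorithm uses a greedy strategy to iteratively minimize the squared error (solving K-Means is in effect minimizing the squared error). Because its initial mean vectors are chosen at random, however, the algorithm is unstable: a poor initialization can require many iterations and a long running time. To address this problem, David Arthur and Sergei Vassilvitskii proposed the K-Means++ algorithm.&lt;/p&gt;
&lt;p&gt;&amp;emsp;&amp;emsp;K-Means++ differs from K-Means mainly in how the initial means are chosen. The initial mean vectors are selected as follows:&lt;/p&gt;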
    
    </summary>
    
      <category term="机器学习" scheme="http://yoursite.com/categories/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/"/>
    
      <category term="聚类算法" scheme="http://yoursite.com/categories/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/%E8%81%9A%E7%B1%BB%E7%AE%97%E6%B3%95/"/>
    
    
      <category term="Jaccard Distance" scheme="http://yoursite.com/tags/Jaccard-Distance/"/>
    
      <category term="K-Means++" scheme="http://yoursite.com/tags/K-Means/"/>
    
  </entry>
  
  <entry>
    <title>K-Means Based on Jaccard Distance</title>
    <link href="http://yoursite.com/2018/07/19/KMeansJD/"/>
    <id>http://yoursite.com/2018/07/19/KMeansJD/</id>
    <published>2018-07-19T15:26:29.000Z</published>
    <updated>2018-07-29T13:36:23.768Z</updated>
    
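    <content type="html"><![CDATA[<p>This post shows how to cluster Twitter tweet data using the Jaccard distance (JD).</p><h2 id="理论基础"><a href="#理论基础" class="headerlink" title="理论基础"></a>Theoretical Background</h2><h3 id="性能度量方式"><a href="#性能度量方式" class="headerlink" title="性能度量方式"></a>Performance Measures</h3><p>  &emsp;&emsp;Clustering performance measures, also called <em>clustering validity indices</em>, fall into two groups: external indices, which compare the clustering result with a reference model, and internal indices, which examine the clustering result directly without any external reference. Two internal indices are described below: the Dunn index (DI) and the sum of squared errors (SSE).<br><a id="more"></a></p><ul><li>DI<br>Computing DI requires the following two quantities:<script type="math/tex; mode=display">diam(C) = \max_{1 \leq i \leq j \leq |C| } dist(x_i,x_j)</script><script type="math/tex; mode=display">d_{min} (C_i,C_j) = \min_{x_i \in C_i,x_j \in C_j} dist(x_i, x_j)</script>Here $dist(\cdot,\cdot)$ is the distance between two samples, $diam(C)$ is the largest distance between samples within cluster $C$, and $d_{min} (C_i,C_j)$ is the distance between the closest samples of clusters $C_i$ and $C_j$.<br>With these two quantities, DI is defined as:<script type="math/tex; mode=display">DI = \min_{1 \leq i \leq k} \left\{\min_{j \neq i}\left(\frac {d_{min}(C_i,C_j)} {\max_{1 \leq l \leq k} diam(C_l)}\right)\right\}</script>A good clustering has low similarity between clusters and high similarity within clusters, so the larger DI is, the better.</li><li>SSE<br>SSE is defined as:<script type="math/tex; mode=display">SSE = \sum^k_{i=1} \sum_{p \in C_i}dist(p,\mu_i)^2</script>where $\mu_i$ is the center of cluster $C_i$. The idea behind using SSE to judge a clustering is this: as k grows, the samples are partitioned more finely and the similarity within clusters rises, so SSE decreases with k. While k is smaller than the true number of clusters, increasing k raises the similarity within clusters sharply and SSE drops quickly. Once k passes the true number of clusters, the gain shrinks, so the rate of decrease falls and the SSE curve flattens as k increases. The most suitable k is therefore most likely the one at the elbow of the curve.  </li></ul><h3 id="距离计算方式"><a href="#距离计算方式" class="headerlink" title="距离计算方式"></a>Distance Functions</h3><p>In clustering, the distance function $dist(\cdot,\cdot)$ must satisfy the following four properties:</p><ul><li>Non-negativity: $ dist(\vec x_i, \vec x_j) \geq 0 $</li><li>Identity: $ dist(\vec x_i, \vec x_j) = 0 $ if and only if $ \vec x_i = \vec x_j $</li><li>Symmetry: $ dist(\vec x_i, \vec x_j) = dist(\vec x_j, \vec x_i) $</li><li>Triangle inequality: $ dist(\vec x_i, \vec x_j) \leq dist(\vec x_i, \vec x_k) + dist(\vec x_k, \vec x_j) $</li></ul><h3 id="Jaccard-Distance-定义"><a href="#Jaccard-Distance-定义" class="headerlink" title="Jaccard Distance 定义"></a>Definition of the Jaccard Distance</h3><p>The Jaccard coefficient (JC) is defined as:</p><script type="math/tex; 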
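mode=display">J(A,B) = \frac {|A \cap B|}{|A \cup B|} =  \frac {|A \cap B|}{|A| + |B| - |A \cap B|}</script><p>The Jaccard distance is defined as:</p><script type="math/tex; mode=display">d_J(A,B)=1-J(A,B)</script><p>where A and B are any two sets.</p><h2 id="可行性证明"><a href="#可行性证明" class="headerlink" title="可行性证明"></a>Proof of Validity</h2><p>It follows directly from the definition that JD satisfies non-negativity, identity and symmetry, the first three basic properties of a distance measure. The fourth property, the triangle inequality, is proved as follows:</p><hr><p>Proof of the triangle inequality for JD:<br>For any three non-empty finite sets A, B and C, the triangle inequality requires:</p><script type="math/tex; mode=display">d_J(A,B) \leq d_J(A,C)+d_J(C,B)</script><p>Assuming the opposite for one triple and then permuting the roles of A, B and C does not work, because that assumption says nothing about the permuted triple. A probabilistic argument does work. Let <script type="math/tex">\pi</script> be a uniformly random ordering of the elements of <script type="math/tex">A \cup B \cup C</script>, and write <script type="math/tex">m_\pi(S)</script> for the element of a set S that comes first under <script type="math/tex">\pi</script>. The first element of <script type="math/tex">A \cup B</script> is equally likely to be any element of <script type="math/tex">A \cup B</script>, and <script type="math/tex">m_\pi(A) = m_\pi(B)</script> holds exactly when that element lies in <script type="math/tex">A \cap B</script>. Therefore:</p><script type="math/tex; mode=display">\Pr\left[m_\pi(A) = m_\pi(B)\right] = \frac {|A \cap B|}{|A \cup B|} = J(A,B), \qquad d_J(A,B) = \Pr\left[m_\pi(A) \neq m_\pi(B)\right]</script><p>If <script type="math/tex">m_\pi(A) \neq m_\pi(B)</script>, then <script type="math/tex">m_\pi(C)</script> differs from at least one of the two, so the union bound gives:</p><script type="math/tex; mode=display">\Pr\left[m_\pi(A) \neq m_\pi(B)\right] \leq \Pr\left[m_\pi(A) \neq m_\pi(C)\right] + \Pr\left[m_\pi(C) \neq m_\pi(B)\right]</script><p>which is exactly the inequality to be proved.<br>JD therefore satisfies the triangle inequality and with it all four properties of a distance function, so it can be used as a distance measure.</p><hr><h2 id="算法过程"><a href="#算法过程" class="headerlink" title="算法过程"></a>Algorithm</h2><p>The original K-Means algorithm consists of two nested loops. The outer loop controls the number of clustering iterations and compares the new clusters with the old ones: if the clusters no longer change, or the maximum number of iterations is reached, the outer loop exits and the program ends. The inner loop computes the new clusters. The flow is shown in the figure.<br><img src="/2018/07/19/KMeansJD/K-Means.png" title="K-Means"><br>The JD-based K-Means algorithm differs from the original K-Means algorithm in two ways:</p><ul><li>In the JD-based algorithm, the mean vectors are not recomputed in each iteration; the new clusters are computed directly.</li><li>The JD-based K-Means algorithm does not vectorize the samples; it treats each sample as a set and computes the distance between samples with JD.</li></ul><p>The flow for computing new clusters in the JD-based K-Means algorithm is shown in the figure.<br><img src="/2018/07/19/KMeansJD/NewCluster.png" title="计算新簇"><br>The complete computation in the figure consists of three nested loops. The overall logic is: iterate over every sample; for each sample, iterate over every cluster of the current clustering and compute the distance between that cluster and the sample; then assign the sample to the cluster with the smallest distance. The meaning of the individual variables is given in the implementation.</p><h2 id="代码实现"><a href="#代码实现" class="headerlink" title="代码实现"></a>Implementation</h2><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 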
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span 
class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span class="line">121</span><br><span class="line">122</span><br><span class="line">123</span><br><span class="line">124</span><br><span class="line">125</span><br><span class="line">126</span><br><span class="line">127</span><br><span 
class="line">128</span><br><span class="line">129</span><br><span class="line">130</span><br><span class="line">131</span><br><span class="line">132</span><br><span class="line">133</span><br><span class="line">134</span><br><span class="line">135</span><br><span class="line">136</span><br><span class="line">137</span><br><span class="line">138</span><br><span class="line">139</span><br><span class="line">140</span><br><span class="line">141</span><br><span class="line">142</span><br><span class="line">143</span><br><span class="line">144</span><br><span class="line">145</span><br><span class="line">146</span><br><span class="line">147</span><br><span class="line">148</span><br><span class="line">149</span><br><span class="line">150</span><br><span class="line">151</span><br><span class="line">152</span><br><span class="line">153</span><br><span class="line">154</span><br><span class="line">155</span><br><span class="line">156</span><br><span class="line">157</span><br><span class="line">158</span><br><span class="line">159</span><br><span class="line">160</span><br><span class="line">161</span><br><span class="line">162</span><br><span class="line">163</span><br><span class="line">164</span><br><span class="line">165</span><br><span class="line">166</span><br><span class="line">167</span><br><span class="line">168</span><br><span class="line">169</span><br><span class="line">170</span><br><span class="line">171</span><br><span class="line">172</span><br><span class="line">173</span><br><span class="line">174</span><br><span class="line">175</span><br><span class="line">176</span><br><span class="line">177</span><br><span class="line">178</span><br><span class="line">179</span><br><span class="line">180</span><br><span class="line">181</span><br><span class="line">182</span><br><span class="line">183</span><br><span class="line">184</span><br><span class="line">185</span><br><span class="line">186</span><br><span class="line">187</span><br><span 
class="line">188</span><br><span class="line">189</span><br><span class="line">190</span><br><span class="line">191</span><br><span class="line">192</span><br><span class="line">193</span><br><span class="line">194</span><br><span class="line">195</span><br><span class="line">196</span><br><span class="line">197</span><br><span class="line">198</span><br><span class="line">199</span><br><span class="line">200</span><br><span class="line">201</span><br><span class="line">202</span><br><span class="line">203</span><br><span class="line">204</span><br><span class="line">205</span><br><span class="line">206</span><br><span class="line">207</span><br><span class="line">208</span><br><span class="line">209</span><br><span class="line">210</span><br><span class="line">211</span><br><span class="line">212</span><br><span class="line">213</span><br><span class="line">214</span><br><span class="line">215</span><br><span class="line">216</span><br><span class="line">217</span><br><span class="line">218</span><br><span class="line">219</span><br><span class="line">220</span><br><span class="line">221</span><br><span class="line">222</span><br><span class="line">223</span><br><span class="line">224</span><br><span class="line">225</span><br><span class="line">226</span><br><span class="line">227</span><br><span class="line">228</span><br><span class="line">229</span><br><span class="line">230</span><br><span class="line">231</span><br><span class="line">232</span><br><span class="line">233</span><br><span class="line">234</span><br><span class="line">235</span><br><span class="line">236</span><br><span class="line">237</span><br><span class="line">238</span><br><span class="line">239</span><br><span class="line">240</span><br><span class="line">241</span><br><span class="line">242</span><br><span class="line">243</span><br><span class="line">244</span><br><span class="line">245</span><br><span class="line">246</span><br><span class="line">247</span><br><span 
class="line">248</span><br><span class="line">249</span><br><span class="line">250</span><br><span class="line">251</span><br><span class="line">252</span><br><span class="line">253</span><br><span class="line">254</span><br><span class="line">255</span><br><span class="line">256</span><br><span class="line">257</span><br><span class="line">258</span><br><span class="line">259</span><br><span class="line">260</span><br><span class="line">261</span><br><span class="line">262</span><br><span class="line">263</span><br><span class="line">264</span><br><span class="line">265</span><br><span class="line">266</span><br><span class="line">267</span><br><span class="line">268</span><br><span class="line">269</span><br><span class="line">270</span><br><span class="line">271</span><br><span class="line">272</span><br><span class="line">273</span><br><span class="line">274</span><br><span class="line">275</span><br><span class="line">276</span><br><span class="line">277</span><br><span class="line">278</span><br><span class="line">279</span><br><span class="line">280</span><br><span class="line">281</span><br><span class="line">282</span><br><span class="line">283</span><br><span class="line">284</span><br><span class="line">285</span><br><span class="line">286</span><br><span class="line">287</span><br><span class="line">288</span><br><span class="line">289</span><br><span class="line">290</span><br><span class="line">291</span><br><span class="line">292</span><br><span class="line">293</span><br><span class="line">294</span><br><span class="line">295</span><br><span class="line">296</span><br><span class="line">297</span><br><span class="line">298</span><br><span class="line">299</span><br><span class="line">300</span><br><span class="line">301</span><br><span class="line">302</span><br><span class="line">303</span><br><span class="line">304</span><br><span class="line">305</span><br><span class="line">306</span><br><span class="line">307</span><br><span 
class="line">308</span><br><span class="line">309</span><br><span class="line">310</span><br><span class="line">311</span><br><span class="line">312</span><br><span class="line">313</span><br><span class="line">314</span><br><span class="line">315</span><br><span class="line">316</span><br><span class="line">317</span><br><span class="line">318</span><br><span class="line">319</span><br><span class="line">320</span><br><span class="line">321</span><br><span class="line">322</span><br><span class="line">323</span><br><span class="line">324</span><br><span class="line">325</span><br><span class="line">326</span><br><span class="line">327</span><br><span class="line">328</span><br><span class="line">329</span><br><span class="line">330</span><br><span class="line">331</span><br><span class="line">332</span><br><span class="line">333</span><br><span class="line">334</span><br><span class="line">335</span><br><span class="line">336</span><br><span class="line">337</span><br><span class="line">338</span><br><span class="line">339</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="comment">#!/usr/bin/python</span></span><br><span class="line"><span class="comment"># -*- coding: UTF-8 -*-</span></span><br><span class="line"></span><br><span class="line"><span class="string">"""Iterative K-Means clustering based on Jaccard distance</span></span><br><span class="line"><span class="string">"""</span></span><br><span class="line"></span><br><span class="line"><span class="comment">#   Utility functions</span></span><br><span class="line"><span class="keyword">import</span> os</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">CreateDir</span><span class="params">(FilePath)</span>:</span></span><br><span class="line">    <span class="keyword">if</span> <span class="keyword">not</span> os.path.exists(FilePath):  <span class="comment"># Create the directory if it does not exist</span></span><br><span 
class="line">        os.mkdir(FilePath)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">WriteFileLine</span><span class="params">(FilePath, DataList, style)</span>:</span></span><br><span class="line">    <span class="keyword">try</span>:</span><br><span class="line">        f = open(FilePath, style)</span><br><span class="line">        <span class="keyword">for</span> DataLine <span class="keyword">in</span> DataList:</span><br><span class="line">            f.write(DataLine)</span><br><span class="line">        f.close()</span><br><span class="line">    <span class="keyword">except</span> Exception <span class="keyword">as</span> e:</span><br><span class="line">        print(<span class="string">'Create File ERROR: '</span> + str(e))</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">ReadFile</span><span class="params">(FilePath)</span>:</span></span><br><span class="line">    <span class="string">"""Read the preprocessed file and return a dict mapping each tweet ID to its set of words"""</span></span><br><span class="line">    <span class="keyword">try</span>:</span><br><span class="line">        f = open(FilePath)</span><br><span class="line">    <span class="keyword">except</span> IOError:</span><br><span class="line">        print(<span class="string">"Can't find the file"</span>)</span><br><span class="line">        <span class="keyword">return</span></span><br><span class="line">    tweets = &#123;&#125;</span><br><span class="line">    <span class="keyword">while</span> <span class="keyword">True</span>:</span><br><span class="line">        line = f.readline()</span><br><span class="line">        <span class="keyword">if</span> <span class="keyword">not</span> line:</span><br><span class="line">            <span class="keyword">break</span></span><br><span class="line">        <span 
class="keyword">try</span>:</span><br><span class="line">            ID = line.split(<span class="string">'|'</span>)[<span class="number">0</span>]</span><br><span class="line">            Text = line.split(<span class="string">'|'</span>)[<span class="number">1</span>].strip(<span class="string">'\n'</span>)</span><br><span class="line">            tweets[ID] = set(Text.strip(<span class="string">' '</span>).split(<span class="string">' '</span>))</span><br><span class="line">        <span class="keyword">except</span> IndexError:</span><br><span class="line">            <span class="keyword">continue</span></span><br><span class="line">    <span class="keyword">return</span> tweets</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> copy</span><br><span class="line"><span class="keyword">import</span> random <span class="keyword">as</span> Inrandom</span><br><span class="line"><span class="keyword">from</span> numpy <span class="keyword">import</span> random</span><br><span class="line"><span class="keyword">import</span> time</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">KMeans</span>:</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">__init__</span><span class="params">(self, tweets, k, MaxIterations)</span>:</span></span><br><span class="line">        <span class="string">"""Initial parameters. tweets is a dict mapping each tweet ID to the tokenized tweet, stored as a set;</span></span><br><span class="line"><span class="string">        k is the number of clusters and MaxIterations is the maximum number of iterations"""</span></span><br><span class="line">        self.tweets = tweets</span><br><span class="line">        self.k = k</span><br><span class="line">        self.MaxIterations = MaxIterations</span><br><span class="line"></span><br><span class="line">        self.seeds = []<span 
class="comment"># IDs of the randomly chosen initial centroids</span></span><br><span class="line">        self.Clusters = &#123;&#125;<span class="comment"># Clustering result; each key of the dict is one cluster</span></span><br><span class="line">        self.RevClusters = &#123;&#125;<span class="comment"># Reverse index: the key is a tweet ID, the value is its cluster number</span></span><br><span class="line">        self.JaccardMatrix = &#123;&#125;<span class="comment"># Matrix holding the Jaccard distance of every pair of tweets</span></span><br><span class="line">        self.LenTwitter = len(tweets)<span class="comment"># Number of tweets</span></span><br><span class="line"></span><br><span class="line">        <span class="comment"># Pick the initial centroids at random</span></span><br><span class="line">        self.InitializeChooseSeeds()</span><br><span class="line">        <span class="comment"># Initialize the clusters</span></span><br><span class="line">        self.InitializeClusters()</span><br><span class="line">        <span class="comment"># Compute the Jaccard distance matrix; since LoadJaccardMatrix was added, __init__ only does this for small inputs</span></span><br><span class="line">        <span class="keyword">if</span> self.LenTwitter &lt;= <span class="number">1500</span>:</span><br><span class="line">            self.InitializeJaccardMatrix()</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">JaccardDistance</span><span class="params">(self, SetA, SetB)</span>:</span></span><br><span class="line">        <span class="string">"""Compute the Jaccard distance between the two sets SetA and SetB"""</span></span><br><span class="line">        <span class="keyword">try</span>:</span><br><span class="line">            <span class="keyword">return</span> <span class="number">1</span> - float(len(SetA.intersection(SetB))) / float(len(SetA.union(SetB)))</span><br><span class="line">        <span class="keyword">except</span> TypeError:</span><br><span class="line">            print(<span class="string">'Error, SetA or SetB is none.'</span>)</span><br><span class="line"></span><br><span class="line">    <span 
class="function"><span class="keyword">def</span> <span class="title">InitializeChooseSeeds</span><span class="params">(self)</span>:</span></span><br><span class="line">        <span class="string">"""Choose the initial centroids for k-means: use sample() to pick k random IDs from the keys of tweets"""</span></span><br><span class="line">        self.seeds = Inrandom.sample(list(self.tweets.keys()), self.k)</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">InitializeClusters</span><span class="params">(self)</span>:</span></span><br><span class="line">        <span class="string">"""Initialize the clusters"""</span></span><br><span class="line">        <span class="comment"># Initialize the reverse index to -1, since no tweet has been assigned to a cluster yet</span></span><br><span class="line">        <span class="keyword">for</span> ID <span class="keyword">in</span> self.tweets:</span><br><span class="line">            self.RevClusters[ID] = <span class="number">-1</span></span><br><span class="line"></span><br><span class="line">        <span class="comment"># Initialize the cluster dict from the randomly chosen initial centroids</span></span><br><span class="line">        <span class="keyword">for</span> i <span class="keyword">in</span> range(self.k):</span><br><span class="line">            self.Clusters[i] = set([self.seeds[i]]) <span class="comment"># Store the tweet IDs as a set; i is both the cluster number and the dict key</span></span><br><span class="line">            self.RevClusters[self.seeds[i]] = i <span class="comment"># The seed's cluster is already known, so record it in the reverse index</span></span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">InitializeJaccardMatrix</span><span class="params">(self)</span>:</span></span><br><span class="line">        <span class="string">"""Compute the Jaccard distance of every pair of tweets up front, in the spirit of dynamic programming: trade space for time.</span></span><br><span class="line"><span class="string">        With a very large data set this may raise MemoryError.</span></span><br><span class="line"><span class="string">        """</span></span><br><span class="line">        <span 
class="comment"># Pair up every two IDs with a nested loop</span></span><br><span class="line">        k = <span class="number">0</span> <span class="comment"># Debug counter</span></span><br><span class="line">        <span class="keyword">try</span>:</span><br><span class="line">            <span class="keyword">for</span> ID1 <span class="keyword">in</span> self.tweets:</span><br><span class="line">                self.JaccardMatrix[ID1] = &#123;&#125;</span><br><span class="line">                <span class="keyword">for</span> ID2 <span class="keyword">in</span> self.tweets:</span><br><span class="line">                    <span class="keyword">if</span> ID2 <span class="keyword">not</span> <span class="keyword">in</span> self.JaccardMatrix:</span><br><span class="line">                        self.JaccardMatrix[ID2] = &#123;&#125;</span><br><span class="line">                    Distance = self.JaccardDistance(self.tweets[ID1], self.tweets[ID2])<span class="comment"># Compute the Jaccard distance</span></span><br><span class="line">                    self.JaccardMatrix[ID1][ID2] = Distance<span class="comment"># Store the distance</span></span><br><span class="line">                    self.JaccardMatrix[ID2][ID1] = Distance</span><br><span class="line">                    k += <span class="number">1</span></span><br><span class="line">        <span class="keyword">except</span> MemoryError:</span><br><span class="line">            print(k)</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">LoadJaccardMatrix</span><span class="params">(self, FilePath)</span>:</span></span><br><span class="line">        <span class="string">"""Load a precomputed Jaccard matrix.</span></span><br><span class="line"><span class="string">        Line format: ID1 | ID2 | Value</span></span><br><span class="line"><span class="string">        """</span></span><br><span class="line">        <span class="keyword">try</span>:</span><br><span class="line">            f = open(FilePath, 
<span class="string">'r'</span>)</span><br><span class="line">        <span class="keyword">except</span> IOError:</span><br><span class="line">            print(<span class="string">"Error! The file doesn't exist"</span>)</span><br><span class="line">            <span class="keyword">return</span></span><br><span class="line">        <span class="keyword">while</span> <span class="keyword">True</span>:</span><br><span class="line">            line = f.readline()</span><br><span class="line">            <span class="keyword">if</span> <span class="keyword">not</span> line:</span><br><span class="line">                <span class="keyword">break</span></span><br><span class="line">            <span class="keyword">try</span>:</span><br><span class="line">                <span class="comment"># Read ID1, ID2 and Distance from the current line</span></span><br><span class="line">                ID1 = line.split(<span class="string">'|'</span>)[<span class="number">0</span>]</span><br><span class="line">                ID2 = line.split(<span class="string">'|'</span>)[<span class="number">1</span>]</span><br><span class="line">                Distance = float(line.split(<span class="string">'|'</span>)[<span class="number">2</span>])</span><br><span class="line">                <span class="comment"># Create the entries for ID1 and ID2 if they do not exist yet</span></span><br><span class="line">                <span class="keyword">if</span> ID1 <span class="keyword">not</span> <span class="keyword">in</span> self.JaccardMatrix:</span><br><span class="line">                    self.JaccardMatrix[ID1] = &#123;&#125;</span><br><span class="line">                <span class="keyword">if</span> ID2 <span class="keyword">not</span> <span class="keyword">in</span> self.JaccardMatrix:</span><br><span class="line">                    self.JaccardMatrix[ID2] = &#123;&#125;</span><br><span class="line">                self.JaccardMatrix[ID1][ID2] = Distance  <span class="comment"># Store the distance</span></span><br><span class="line">                
self.JaccardMatrix[ID2][ID1] = Distance</span><br><span class="line">            <span class="keyword">except</span> IndexError:</span><br><span class="line">                <span class="keyword">continue</span></span><br><span class="line">        f.close()</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">CalcNewClusters</span><span class="params">(self)</span>:</span></span><br><span class="line">        <span class="string">"""Compute the new clusters"""</span></span><br><span class="line">        <span class="comment"># Initialization</span></span><br><span class="line">        NewClusters = &#123;&#125;<span class="comment"># New clusters</span></span><br><span class="line">        NewRevCluster = &#123;&#125;<span class="comment"># New reverse index</span></span><br><span class="line">        <span class="keyword">for</span> i <span class="keyword">in</span> range(self.k):</span><br><span class="line">            NewClusters[i] = set()<span class="comment"># Start from an empty set</span></span><br><span class="line"></span><br><span class="line">        <span class="comment"># Visit every tweet and build the new clusters from the previous ones</span></span><br><span class="line">        k = <span class="number">0</span>   <span class="comment"># Debug counter</span></span><br><span class="line">        <span class="keyword">for</span> ID1 <span class="keyword">in</span> self.tweets:</span><br><span class="line">            MinDist = float(<span class="string">"inf"</span>)  <span class="comment"># Start from +infinity so that any real distance is smaller</span></span><br><span class="line">            MinCluster = self.RevClusters[ID1]</span><br><span class="line"></span><br><span class="line">            <span class="comment"># Visit every cluster to find the one closest to ID1</span></span><br><span class="line">            <span class="keyword">for</span> j <span class="keyword">in</span> self.Clusters:</span><br><span class="line">            <span class="comment">#for j in SampleResult:</span></span><br><span class="line">                Dist = 
<span class="number">0</span></span><br><span class="line">                Count = float(<span class="number">0</span>)</span><br><span class="line">                <span class="comment"># Visit every element of the current cluster and sum the Jaccard distances from ID1 to them</span></span><br><span class="line">                <span class="comment"># This gives the distance between ID1 and the current cluster</span></span><br><span class="line">                <span class="keyword">for</span> ID2 <span class="keyword">in</span> self.Clusters[j]:</span><br><span class="line">                    <span class="keyword">if</span> self.LenTwitter &lt;= <span class="number">1500</span>:</span><br><span class="line">                        Dist += self.JaccardMatrix[ID1][ID2]</span><br><span class="line">                    <span class="keyword">else</span>:</span><br><span class="line">                        Dist += <span class="number">1</span> - \</span><br><span class="line">                                float(len(self.tweets[ID1].intersection(self.tweets[ID2]))) \</span><br><span class="line">                                / float(len(self.tweets[ID1].union(self.tweets[ID2])))  <span class="comment"># Sum of distances from ID1 to the elements of this cluster</span></span><br><span class="line">                    <span class="comment">#Dist += self.jaccardMatrix[ID1][ID2]</span></span><br><span class="line">                    Count += <span class="number">1</span> <span class="comment"># Count the elements in the current cluster</span></span><br><span class="line">                    k += <span class="number">1</span>  <span class="comment"># Debug counter</span></span><br><span class="line">                <span class="keyword">if</span> Count &gt; <span class="number">0</span>:<span class="comment"># Only compare distances if the cluster just visited is not empty</span></span><br><span class="line">                    AvgDist = Dist / Count</span><br><span class="line">                    <span class="keyword">if</span> MinDist &gt; AvgDist: <span class="comment"># This cluster is closer than the best one so far, so update the minimum</span></span><br><span class="line">                        MinDist = 
AvgDist</span><br><span class="line">                        MinCluster = j</span><br><span class="line">            NewClusters[MinCluster].add(ID1)<span class="comment"># Add the current element to the closest cluster</span></span><br><span class="line">            NewRevCluster[ID1] = MinCluster<span class="comment"># Record it in the reverse index</span></span><br><span class="line">        <span class="keyword">return</span> NewClusters, NewRevCluster<span class="comment"># Return the new clustering result</span></span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">Converge</span><span class="params">(self)</span>:</span></span><br><span class="line">        <span class="string">"""Top-level clustering function"""</span></span><br><span class="line">        <span class="comment"># First pass: assign the initial result</span></span><br><span class="line">        NewClusters, NewRevCluster = self.CalcNewClusters()</span><br><span class="line">        self.Clusters = copy.deepcopy(NewClusters)</span><br><span class="line">        self.RevClusters = copy.deepcopy(NewRevCluster)</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Iterate until convergence or until the maximum number of iterations is reached</span></span><br><span class="line">        Interations = <span class="number">1</span></span><br><span class="line">        <span class="keyword">while</span> Interations &lt; self.MaxIterations:<span class="comment"># Loop while below the maximum number of iterations</span></span><br><span class="line">            NewClusters, NewRevCluster = self.CalcNewClusters()</span><br><span class="line">            Interations += <span class="number">1</span></span><br><span class="line">            <span class="keyword">if</span> self.RevClusters != NewRevCluster: <span class="comment"># The new result differs from the previous one, so update the clustering</span></span><br><span class="line">                self.Clusters = copy.deepcopy(NewClusters)</span><br><span class="line">                self.RevClusters = copy.deepcopy(NewRevCluster)</span><br><span class="line">            <span 
class="keyword">else</span>:<span class="comment"># The result has converged, so stop: clustering is done</span></span><br><span class="line">                print(Interations)</span><br><span class="line">                <span class="keyword">return</span> Interations</span><br><span class="line">        print(<span class="string">'Get the Max!'</span>)</span><br><span class="line">        <span class="keyword">return</span> self.MaxIterations + <span class="number">1</span></span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">OneClusterSSE</span><span class="params">(self, Cluster)</span>:</span></span><br><span class="line">        <span class="string">"""Compute the squared error within a single cluster"""</span></span><br><span class="line">        OneClusterSSE = <span class="number">0</span></span><br><span class="line">        Len = len(Cluster)</span><br><span class="line">        <span class="keyword">for</span> ID1 <span class="keyword">in</span> Cluster:</span><br><span class="line">            S = <span class="number">0</span></span><br><span class="line">            <span class="keyword">for</span> ID2 <span class="keyword">in</span> Cluster:</span><br><span class="line">                <span class="keyword">if</span> self.LenTwitter &lt;= <span class="number">1500</span>:</span><br><span class="line">                    S += self.JaccardMatrix[ID1][ID2]</span><br><span class="line">                <span class="keyword">else</span>:</span><br><span class="line">                    S += self.JaccardDistance(self.tweets[ID1], self.tweets[ID2])</span><br><span class="line">            S /= Len</span><br><span class="line">            <span class="comment">#S = S*S</span></span><br><span class="line">            OneClusterSSE += S*S</span><br><span class="line"></span><br><span class="line">        <span class="keyword">return</span> OneClusterSSE</span><br><span class="line"></span><br><span class="line">    <span class="function"><span 
class="keyword">def</span> <span class="title">CalculateSSE</span><span class="params">(self)</span>:</span></span><br><span class="line">        <span class="string">"""Compute the Sum of the Squared Errors (SSE)"""</span></span><br><span class="line">        SSE = <span class="number">0</span> <span class="comment"># Sum of squared errors</span></span><br><span class="line">        <span class="keyword">for</span> Ci <span class="keyword">in</span> self.Clusters:</span><br><span class="line">            SSE += self.OneClusterSSE(self.Clusters[Ci])</span><br><span class="line">        <span class="keyword">return</span> SSE</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">CalculateMinValue</span><span class="params">(self, Cluster1, Cluster2)</span>:</span></span><br><span class="line">        MinValue = float(<span class="string">"inf"</span>)<span class="comment"># Start from +infinity</span></span><br><span class="line">        <span class="keyword">for</span> ID1 <span class="keyword">in</span> Cluster1:</span><br><span class="line">            <span class="keyword">for</span> ID2 <span class="keyword">in</span> Cluster2:</span><br><span class="line">                <span class="keyword">if</span> self.LenTwitter &lt;= <span class="number">1500</span>:</span><br><span class="line">                    Dist = self.JaccardMatrix[ID1][ID2]</span><br><span class="line">                <span class="keyword">else</span>:</span><br><span class="line">                    Dist = self.JaccardDistance(self.tweets[ID1], self.tweets[ID2])</span><br><span class="line">                <span class="keyword">if</span> MinValue &gt; Dist:</span><br><span class="line">                    MinValue = Dist</span><br><span class="line">        <span class="keyword">return</span> MinValue</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span 
class="title">CalculateDiam</span><span class="params">(self,Cluster)</span>:</span></span><br><span class="line">        <span class="string">"""</span></span><br><span class="line"><span class="string">        Compute the largest distance between elements of a cluster</span></span><br><span class="line"><span class="string">        :param Cluster: the input cluster</span></span><br><span class="line"><span class="string">        :return: the maximum distance</span></span><br><span class="line"><span class="string">        """</span></span><br><span class="line">        Max = float(<span class="string">'-inf'</span>)<span class="comment"># Start from -infinity so that any real distance is larger</span></span><br><span class="line">        <span class="keyword">for</span> ID1 <span class="keyword">in</span> Cluster:</span><br><span class="line">            <span class="keyword">for</span> ID2 <span class="keyword">in</span> Cluster:</span><br><span class="line">                <span class="keyword">if</span> self.LenTwitter &lt;= <span class="number">1500</span>:</span><br><span class="line">                    dist = self.JaccardMatrix[ID1][ID2]</span><br><span class="line">                <span class="keyword">else</span>:</span><br><span class="line">                    dist = self.JaccardDistance(self.tweets[ID1], self.tweets[ID2])</span><br><span class="line">                <span class="keyword">if</span> dist &gt; Max:</span><br><span class="line">                    Max = dist</span><br><span class="line"></span><br><span class="line">        <span class="keyword">return</span> Max</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">CalculateDI</span><span class="params">(self)</span>:</span></span><br><span class="line">        <span class="string">"""</span></span><br><span class="line"><span class="string">        Compute the DI index</span></span><br><span class="line"><span class="string">        :return:</span></span><br><span class="line"><span class="string"> 
       """</span></span><br><span class="line">        <span class="comment"># Compute the numerator first</span></span><br><span class="line">        DMin = float(<span class="string">"inf"</span>)  <span class="comment"># Start from +infinity</span></span><br><span class="line">        <span class="keyword">for</span> i <span class="keyword">in</span> self.Clusters:</span><br><span class="line">            <span class="keyword">for</span> j <span class="keyword">in</span> self.Clusters:</span><br><span class="line">                <span class="keyword">if</span> self.Clusters[i] <span class="keyword">is</span> self.Clusters[j]:</span><br><span class="line">                    <span class="keyword">continue</span></span><br><span class="line">                <span class="keyword">else</span>:</span><br><span class="line">                    dist = self.CalculateMinValue(self.Clusters[i], self.Clusters[j])</span><br><span class="line">                    <span class="keyword">if</span> dist &lt; DMin:</span><br><span class="line">                        DMin = dist</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Then compute the denominator</span></span><br><span class="line">        DMax = float(<span class="string">'-inf'</span>)<span class="comment"># Start from -infinity so that any real diameter is larger</span></span><br><span class="line">        <span class="keyword">for</span> l <span class="keyword">in</span> self.Clusters:</span><br><span class="line">            dist = self.CalculateDiam(self.Clusters[l])</span><br><span class="line">            <span class="keyword">if</span> dist &gt; DMax:</span><br><span class="line">                DMax = dist</span><br><span class="line"></span><br><span class="line">        <span class="keyword">return</span> DMin / DMax</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">PrintCluster</span><span class="params">(self)</span>:</span></span><br><span class="line">        <span 
class="keyword">for</span> i <span class="keyword">in</span> self.Clusters:</span><br><span class="line">            print(<span class="string">'\n\nCluster '</span> + str(i))</span><br><span class="line">            <span class="keyword">for</span> ID <span class="keyword">in</span> self.Clusters[i]:<span class="comment"># Visit every ID in the cluster and the tweet it refers to</span></span><br><span class="line">                print(str(ID) + <span class="string">' | '</span> + <span class="string">' '</span>.join(self.tweets[ID]) + <span class="string">'\n'</span>)</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">PrintJaccardMatrix</span><span class="params">(self)</span>:</span></span><br><span class="line">        <span class="keyword">for</span> ID <span class="keyword">in</span> self.tweets:</span><br><span class="line">            <span class="keyword">for</span> ID2 <span class="keyword">in</span> self.tweets:</span><br><span class="line">                print(ID, ID2, self.JaccardMatrix[ID][ID2])</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">SaveClusterFile</span><span class="params">(self, FilePath)</span>:</span></span><br><span class="line">        <span class="string">"""Save the clustering result to files; FilePath is the output directory"""</span></span><br><span class="line">        <span class="comment"># Create the directory if it does not exist</span></span><br><span class="line">        CreateDir(FilePath)</span><br><span class="line">        <span class="comment"># Visit every cluster</span></span><br><span class="line">        <span class="keyword">for</span> i <span class="keyword">in</span> self.Clusters:</span><br><span class="line">            TempCache = []<span class="comment"># Buffer holding all tweets of the current cluster</span></span><br><span class="line">            <span class="keyword">for</span> ID <span class="keyword">in</span> self.Clusters[i]:<span 
class="comment"># iterate over each ID in the cluster and its tweet</span></span><br><span class="line">                <span class="comment">#TempCache.append(str(ID) + '|' + self.tweets[ID] + '\n')    # TODO: this part still needs revising</span></span><br><span class="line">                TempCache.append(str(ID) + <span class="string">'|'</span> + <span class="string">' '</span>.join(self.tweets[ID]) + <span class="string">'\n'</span>)  <span class="comment"># TODO: this part still needs revising</span></span><br><span class="line">            WriteFileLine(FilePath + <span class="string">'C'</span> + str(i) + <span class="string">'.txt'</span>, TempCache, <span class="string">'w'</span>)</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">SaveJaccardMatrix</span><span class="params">(self, FilePath)</span>:</span></span><br><span class="line">        <span class="string">"""Save the computed Jaccard matrix to a file so that later runs can be faster."""</span></span><br><span class="line">        TempCache = []  <span class="comment"># buffer of output lines</span></span><br><span class="line">        <span class="comment"># Iterate over every entry of the matrix</span></span><br><span class="line">        <span class="keyword">for</span> ID1 <span class="keyword">in</span> self.tweets:</span><br><span class="line">            <span class="keyword">for</span> ID2 <span class="keyword">in</span> self.tweets:</span><br><span class="line">                TempCache.append(str(ID1) + <span class="string">'|'</span> + str(ID2) + <span class="string">'|'</span> + str(self.JaccardMatrix[ID1][ID2]) + <span class="string">'\n'</span>)</span><br><span class="line">        <span class="comment"># Write the buffer to the file</span></span><br><span class="line">        <span class="keyword">try</span>:</span><br><span class="line">            WriteFileLine(FilePath + <span class="string">'JaccardMatrix.txt'</span>, TempCache, <span class="string">'w'</span>)</span><br><span class="line">        <span class="keyword">except</span> IOError:</span><br><span class="line">            
print(<span class="string">"Error! Failed to create the file!"</span>)</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">main</span><span class="params">()</span>:</span></span><br><span class="line">    tweets = ReadFile(<span class="string">'ClusterTest.txt'</span>)</span><br><span class="line">    k = <span class="number">27</span>  <span class="comment"># number of clusters k</span></span><br><span class="line">    MaxIterations = <span class="number">50</span>  <span class="comment"># maximum number of iterations</span></span><br><span class="line">    kmeans = KMeans(tweets, k, MaxIterations)</span><br><span class="line">    Iterations = kmeans.Converge()  <span class="comment"># the method returns the iteration count</span></span><br><span class="line">    kmeans.SaveClusterFile(<span class="string">"ClusterResult/"</span>)</span><br><span class="line">    SSE = kmeans.CalculateSSE()  <span class="comment"># compute the SSE</span></span><br><span class="line">    DI = kmeans.CalculateDI()  <span class="comment"># compute the Dunn index (DI)</span></span><br><span class="line">    print(<span class="string">"MaxIterations: "</span> + str(MaxIterations) + <span class="string">"\n"</span> + <span class="string">"SSE: "</span> + str(SSE) + <span class="string">'\n'</span> + <span class="string">"DI: "</span> + str(DI))</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> __name__ == <span class="string">"__main__"</span>:</span><br><span class="line">    time_start = time.time()  <span class="comment"># time.time() returns the seconds elapsed since 1970-01-01 (the epoch)</span></span><br><span class="line">    main()</span><br><span class="line">    time_end = time.time()  <span class="comment"># seconds since the epoch again; the difference is the run time in seconds</span></span><br><span class="line">    print(<span class="string">"the K-Means algorithm time:"</span>)</span><br><span class="line">    <span class="keyword">print</span>(str((time_end - time_start) / 
<span class="number">60</span>) + <span class="string">'min'</span>)</span><br><span class="line"></span><br></pre></td></tr></table></figure><h2 id="实验结果"><a href="#实验结果" class="headerlink" title="实验结果"></a>Experimental Results</h2><p>Running the clustering algorithm on 1,231 tweets from the Boston area posted in April and May 2013 gives the following results:<br><img src="/2018/07/19/KMeansJD/ClusterResult.png" title="ClusterResult"><br>The file for one of the resulting clusters contains:</p><blockquote><p>324772985196126208|dennis pen bombing oped lehane marathon<br>330771754412818433|confronted bombing bomber suspect lol alleged news texted fox marathon<br>326814744168235008|targeted racism yet balanced solidarity understands one marathon<br>327104038778859520|mile dont runwalk april thursday arizona ims forget honorary marathon<br>324887245570048000|bill running rodgers went champ beauty cant http time away take sportyou marathon<br>325132502140329984|caught bombing firefight offi police another suspect loose watertown one marathon<br>324190727141728256|take bombing terror marathon<br>58361655649779712|trying figure townim strugglin get got<br>324532692496551937|bombing keep calm marathon via carry<br>323884448380776448|gi know pm running comment sarah area info pas anyone schulz someone marathon<br>332630562231681026|pd bombing world suspect news abc fire didnt tell fbi marathon<br>333606268113649665|wake bombing come event wacky bay area even antiterror campaign marathon</p></blockquote><p>Most of the tweets in this cluster file relate to the Boston bombing, which suggests that the clustering works reasonably well.</p><h2 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h2><blockquote><p>[1] Arthur D, Vassilvitskii S. k-means++: the advantages of careful seeding[C]. Eighteenth ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana. 
Society for Industrial and Applied Mathematics, 2007: 1027-1035.<br>[2] <a href="https://github.com/findkim/Jaccard-K-Means" target="_blank" rel="noopener">https://github.com/findkim/Jaccard-K-Means</a></p></blockquote><p><strong> Code for this post: <a href="https://github.com/zhaohuang123/sentiment-analysis/tree/master/blog/K-Means" target="_blank" rel="noopener">https://github.com/zhaohuang123/sentiment-analysis/tree/master/blog/K-Means</a> </strong></p>]]></content>
    
    <summary type="html">
    
      &lt;p&gt;This post describes how to cluster Twitter tweet data using the Jaccard Distance (JD).&lt;/p&gt;
&lt;h2 id=&quot;理论基础&quot;&gt;&lt;a href=&quot;#理论基础&quot; class=&quot;headerlink&quot; title=&quot;理论基础&quot;&gt;&lt;/a&gt;Theoretical Background&lt;/h2&gt;&lt;h3 id=&quot;性能度量方式&quot;&gt;&lt;a href=&quot;#性能度量方式&quot; class=&quot;headerlink&quot; title=&quot;性能度量方式&quot;&gt;&lt;/a&gt;Performance Measures&lt;/h3&gt;&lt;p&gt;  &amp;emsp;&amp;emsp;A clustering performance measure is also called a &lt;em&gt;clustering validity index&lt;/em&gt;. These indices fall into two classes: an external index compares the clustering result against some “reference model”, while an internal index analyzes the clustering result directly, without any external reference model. Two internal indices are described below: the Dunn Index (DI) and the Sum of the Squared Errors (SSE).&lt;br&gt;
    
    </summary>
    
      <category term="机器学习" scheme="http://yoursite.com/categories/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/"/>
    
      <category term="聚类算法" scheme="http://yoursite.com/categories/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/%E8%81%9A%E7%B1%BB%E7%AE%97%E6%B3%95/"/>
    
    
      <category term="K-Means" scheme="http://yoursite.com/tags/K-Means/"/>
    
      <category term="Jaccard Distance" scheme="http://yoursite.com/tags/Jaccard-Distance/"/>
    
  </entry>
  
</feed>
