Data Mining

Recommendations

  • User-based collaborative filtering: compute each user's similarity to other users
  • Item-based collaborative filtering: compute each item's similarity to other items; the item similarity dataset must be built in advance

Collecting Preferences

Whatever kind of items the preferences are about, we ultimately need a way to map them to numbers:

An entry like the following represents one user's preferences:

'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0, 'Superman Returns': 4.0}
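
The full preference data is then a nested dictionary mapping each person to their ratings. The critics sample below, used by the snippets that follow, is made-up data in that shape:

critics = {
    'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0,
             'Superman Returns': 4.0},
    'Lisa Rose': {'Snakes on a Plane': 3.5, 'You, Me and Dupree': 2.5,
                  'Superman Returns': 3.5, 'The Night Listener': 3.0},
    'Gene Seymour': {'Snakes on a Plane': 3.5, 'You, Me and Dupree': 3.5,
                     'Superman Returns': 5.0, 'The Night Listener': 3.0},
}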

Finding Similar Users

Euclidean distance score:

similarity = 1 / (1 + sqrt(sum_i (x_i - y_i)^2))

Treat each user as a point in coordinate space and compute the distance between points; the closer the points, the more similar the users.

from math import sqrt

def sim_distance(prefs, person1, person2):
    # Get the list of shared_items
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1

    # If they have no ratings in common, return 0
    if len(si) == 0:
        return 0

    # Add up the squares of all the differences
    sum_of_squares = 0
    for item in si.keys():
        # Equivalent to computing the x and y differences in two dimensions
        sum_of_squares = sum_of_squares + pow(prefs[person1][item] - prefs[person2][item], 2)

    return 1 / (1 + sqrt(sum_of_squares))
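
A quick check against the sample critics data above:

print(sim_distance(critics, 'Toby', 'Lisa Rose'))
# 0.3483... (the squared differences sum to 3.5, so 1 / (1 + sqrt(3.5)))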

Pearson correlation score:

A measure of how well two sets of data fit a straight line.

r = (sum(xy) - sum(x)*sum(y)/n) / sqrt((sum(x^2) - sum(x)^2/n) * (sum(y^2) - sum(y)^2/n))

def sim_pearson(prefs, p1, p2):
    # Get the list of items rated by both
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1

    # If they have no ratings in common, return 0
    if len(si) == 0:
        return 0

    # Number of shared items
    n = len(si)

    # Add up all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])

    # Sum up the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])

    # Sum up the products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])

    # Calculate the Pearson score
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0

    r = num / den

    return r

Finding the users most similar to a given user

def topMatches(prefs, person, n=5, similarity=sim_pearson):
    # Score every other user against this person
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    # Sort so the most similar users appear at the top
    scores.sort()
    scores.reverse()
    return scores[0:n]

Recommending items based on similar users

def getRecommendations(prefs, person, similarity=sim_pearson):
    totals = {}
    simSums = {}
    for other in prefs:
        # Don't compare me to myself
        if other == person:
            continue
        sim = similarity(prefs, person, other)

        # Ignore users with similarity of zero or lower
        if sim <= 0: continue
        for item in prefs[other]:
            # Only score movies I haven't rated yet
            if item not in prefs[person] or prefs[person][item] == 0:
                # Similarity-weighted score for this movie
                totals.setdefault(item, 0)
                totals[item] += prefs[other][item] * sim
                # Sum of the similarities of everyone who rated this movie
                simSums.setdefault(item, 0)
                simSums[item] += sim

    # Create the normalized list
    rankings = [(total / simSums[item], item) for item, total in totals.items()]

    # Return the sorted list
    rankings.sort()
    rankings.reverse()
    return rankings
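
Called on the sample data above, the result is a list of (predicted score, item) pairs for items the user hasn't rated, sorted best first:

print(getRecommendations(critics, 'Toby'))
# [(3.0, 'The Night Listener')] -- the one sample movie Toby hasn't rated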

Matching Items

By swapping items and people, the same algorithm used to find similar users can be used to find similar items.

def transformPrefs(prefs):
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})

            # Flip item and person
            result[item][person] = prefs[person][item]
    return result
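
For instance, after flipping the sample critics data, topMatches returns the movies whose ratings most resemble those of a given movie:

movies = transformPrefs(critics)
print(topMatches(movies, 'Superman Returns'))
# [(similarity, movie), ...] pairs, most similar first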

Building the item similarity dataset

def calculateSimilarItems(prefs, n=10):
    # Create a dictionary mapping each item to the items it is most similar to
    result = {}

    # Invert the preference matrix to be item-centric
    itemPrefs = transformPrefs(prefs)
    c = 0
    for item in itemPrefs:
        # Status updates for large datasets
        c += 1
        if c % 100 == 0:
            print("%d / %d" % (c, len(itemPrefs)))
        # Find the items most similar to this one
        scores = topMatches(itemPrefs, item, n=n, similarity=sim_distance)
        result[item] = scores
    return result

Getting recommendations using the precomputed dataset

def getRecommendedItems(prefs, itemMatch, user):
    userRatings = prefs[user]
    scores = {}
    totalSim = {}
    # Loop over the items rated by this user
    for (item, rating) in userRatings.items():

        # Loop over the items similar to this one
        for (similarity, item2) in itemMatch[item]:

            # Ignore items this user has already rated
            if item2 in userRatings: continue
            # Weighted score: similarity times rating
            scores.setdefault(item2, 0)
            scores[item2] += similarity * rating
            # Sum of all the similarities
            totalSim.setdefault(item2, 0)
            totalSim[item2] += similarity

    # Divide each total score by total weighting to get an average
    rankings = [(score / totalSim[item], item) for item, score in scores.items()]

    # Return the rankings from highest to lowest
    rankings.sort()
    rankings.reverse()
    return rankings

Discovering Groups

  • Supervised learning: uses examples and expected outputs to learn how to make predictions
  • Unsupervised learning: finds structure within a set of data

Hierarchical Clustering

Repeatedly merge the two most similar groups; see the hcluster sketch after the cluster node class below.

(figure: at each step, the two closest clusters are merged into one)

A dendrogram can be used to visualize the clustering result:

(figure: dendrogram of the clustering result)

The distances in the dendrogram reflect how similar the elements are.

Computing the correlation between two lists of numbers:

from math import sqrt

def pearson(v1, v2):
    # Simple sums
    sum1 = sum(v1)
    sum2 = sum(v2)

    # Sums of the squares
    sum1Sq = sum([pow(v, 2) for v in v1])
    sum2Sq = sum([pow(v, 2) for v in v2])

    # Sum of the products
    pSum = sum([v1[i] * v2[i] for i in range(len(v1))])

    # Calculate r (Pearson score)
    num = pSum - (sum1 * sum2 / len(v1))
    den = sqrt((sum1Sq - pow(sum1, 2) / len(v1)) * (sum2Sq - pow(sum2, 2) / len(v1)))
    if den == 0: return 0

    # Return 1 minus the correlation so that smaller values mean more similar
    return 1.0 - num / den

A class representing a cluster node:

class bicluster:
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.left = left          # left child cluster
        self.right = right        # right child cluster
        self.vec = vec            # this cluster's data vector (average of its children)
        self.id = id              # nonnegative for original rows, negative for merged clusters
        self.distance = distance  # distance between the two children when merged
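
The notes describe the merge loop but don't show it; below is a minimal sketch of hierarchical clustering built on bicluster and the pearson distance above. The negative ids for merged clusters are just a bookkeeping convention:

def hcluster(rows, distance=pearson):
    distances = {}          # cache of pairwise distances
    currentclustid = -1     # merged clusters get negative ids

    # Clusters start out as just the original rows
    clust = [bicluster(rows[i], id=i) for i in range(len(rows))]

    while len(clust) > 1:
        lowestpair = (0, 1)
        closest = distance(clust[0].vec, clust[1].vec)

        # Loop through every pair looking for the smallest distance
        for i in range(len(clust)):
            for j in range(i + 1, len(clust)):
                if (clust[i].id, clust[j].id) not in distances:
                    distances[(clust[i].id, clust[j].id)] = distance(clust[i].vec, clust[j].vec)
                d = distances[(clust[i].id, clust[j].id)]
                if d < closest:
                    closest = d
                    lowestpair = (i, j)

        # The merged cluster's vector is the average of the two merged vectors
        mergevec = [(clust[lowestpair[0]].vec[i] + clust[lowestpair[1]].vec[i]) / 2.0
                    for i in range(len(clust[0].vec))]

        newcluster = bicluster(mergevec, left=clust[lowestpair[0]],
                               right=clust[lowestpair[1]],
                               distance=closest, id=currentclustid)

        currentclustid -= 1
        del clust[lowestpair[1]]
        del clust[lowestpair[0]]
        clust.append(newcluster)

    return clust[0]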

Column Clustering

Transpose the matrix

def rotatematrix(data):
    newdata = []
    for i in range(len(data[0])):
        # Column i of the original becomes row i of the new matrix
        newrow = [data[j][i] for j in range(len(data))]
        newdata.append(newrow)
    return newdata

This yields a clustering of the words (columns) instead of the rows.
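
For example (a sketch; data is assumed to be the row matrix from this chapter, and hcluster is the sketch shown earlier):

rdata = rotatematrix(data)
wordclust = hcluster(rdata)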

K-Means Clustering

  1. Randomly pick k center points
  2. Assign each data item to the nearest center
  3. Move each center to the average position of its members
  4. Repeat steps 2-3 until the assignments no longer change

import random

def kcluster(rows, distance=pearson, k=4):
    # Determine the minimum and maximum values for each column
    ranges = [(min([row[i] for row in rows]), max([row[i] for row in rows]))
              for i in range(len(rows[0]))]

    # Create k randomly placed centroids
    clusters = [[random.random() * (ranges[i][1] - ranges[i][0]) + ranges[i][0]
                 for i in range(len(rows[0]))] for j in range(k)]

    lastmatches = None

    for t in range(100):
        dis_sum = 0
        print('Iteration %d' % t)
        bestmatches = [[] for i in range(k)]

        # Find the centroid closest to each row
        for j in range(len(rows)):
            row = rows[j]
            bestmatch = 0
            for i in range(k):
                d = distance(clusters[i], row)
                if d < distance(clusters[bestmatch], row):
                    bestmatch = i
            # Accumulate the distance to the chosen centroid
            dis_sum += distance(clusters[bestmatch], row)
            bestmatches[bestmatch].append(j)

        # If the results are the same as last time, we're done
        if bestmatches == lastmatches:
            break
        lastmatches = bestmatches

        # Move each centroid to the average position of its members
        for i in range(k):
            avgs = [0.0] * len(rows[0])
            if len(bestmatches[i]) > 0:
                for rowid in bestmatches[i]:
                    for m in range(len(rows[rowid])):
                        avgs[m] += rows[rowid][m]
                for j in range(len(avgs)):
                    avgs[j] /= len(bestmatches[i])
                clusters[i] = avgs

    return bestmatches, dis_sum
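
Because kcluster also returns the summed distance of the final assignment, it can be used to compare different values of k (a sketch; data is assumed to be the row matrix from this chapter, and results vary with the random starting centroids):

for k in [2, 3, 4, 5]:
    matches, total_distance = kcluster(data, k=k)
    print('k=%d total distance=%f' % (k, total_distance))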

Viewing Data in Two Dimensions

Multidimensional scaling: place the data items on a two-dimensional plane so that the distances between the points approximate the real distances between the items.

def scaledown(data, distance=pearson, rate=0.01):
    n = len(data)

    # The real distance between every pair of items
    realdist = [[distance(data[i], data[j]) for j in range(n)]
                for i in range(0, n)]

    # Randomly initialize the starting positions of the points in 2D space
    loc = [[random.random(), random.random()] for i in range(n)]
    fakedist = [[0.0 for j in range(n)] for i in range(n)]

    lasterror = None
    for m in range(0, 1000):
        # Find the projected distances
        for i in range(n):
            for j in range(n):
                fakedist[i][j] = sqrt(sum([pow(loc[i][x] - loc[j][x], 2)
                                           for x in range(len(loc[i]))]))

        # Move the points
        grad = [[0.0, 0.0] for i in range(n)]

        totalerror = 0
        for k in range(n):
            for j in range(n):
                if j == k: continue
                # The error is percent difference between the distances
                errorterm = (fakedist[j][k] - realdist[j][k]) / realdist[j][k]

                # Each point needs to be moved away from or towards the other
                # point in proportion to how much error it has
                grad[k][0] += ((loc[k][0] - loc[j][0]) / fakedist[j][k]) * errorterm
                grad[k][1] += ((loc[k][1] - loc[j][1]) / fakedist[j][k]) * errorterm

                # Keep track of the total error
                totalerror += abs(errorterm)
        print(totalerror)

        # If the answer got worse by moving the points, we are done
        if lasterror and lasterror < totalerror: break
        lasterror = totalerror

        # Move each of the points by the learning rate times the gradient
        for k in range(n):
            loc[k][0] -= rate * grad[k][0]
            loc[k][1] -= rate * grad[k][1]

    return loc
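
A sketch of drawing the returned coordinates (assumes matplotlib is available; data and labels are placeholders for the item matrix and item names from this chapter):

import matplotlib.pyplot as plt

coords = scaledown(data)
plt.scatter([x for (x, y) in coords], [y for (x, y) in coords])
for (x, y), label in zip(coords, labels):
    plt.annotate(label, (x, y))
plt.show()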

Searching and Ranking

Building the Index

(figure: index database schema: the wordlist, wordlocation, link, and linkwords tables)

Word Extraction

Extract the text nodes from the HTML structure and split them into words.

# (crawler methods; the module is assumed to import re and use BeautifulSoup)
def gettextonly(self, soup):
    v = soup.string
    # soup.string is None when the node has child elements,
    # so recurse into the children and join their text
    if v is None:
        c = soup.contents
        resulttext = ''
        for t in c:
            subtext = self.gettextonly(t)
            resulttext += subtext + '\n'
        return resulttext
    else:
        return v.strip()

# Split the text on anything that isn't a word character
def separatewords(self, text):
    splitter = re.compile(r'\W+')
    return [s.lower() for s in splitter.split(text) if s != '']
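
A standalone check of the tokenizer logic outside the crawler class:

import re

splitter = re.compile(r'\W+')
print([s.lower() for s in splitter.split('The Quick-Brown Fox!') if s != ''])
# ['the', 'quick', 'brown', 'fox']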

Querying

  1. Split the query into words
  2. Look up the corresponding ID of each word
  3. Find the URLs that contain all of those words

def getmatchrows(self, q):
    # Strings to build the SQL query from
    fieldlist = 'w0.urlid'
    tablelist = ''
    clauselist = ''
    wordids = []
    # Split the query into words
    words = q.split(' ')
    tablenumber = 0
    for word in words:
        # Get the word ID
        # (string interpolation as in the original; parameterized queries would be safer)
        wordrow = self.con.execute(
            "select rowid from wordlist where word='%s'" % word).fetchone()
        if wordrow is not None:
            wordid = wordrow[0]
            wordids.append(wordid)
            if tablenumber > 0:
                tablelist += ','
                clauselist += ' and '
                clauselist += 'w%d.urlid=w%d.urlid and ' % (tablenumber - 1, tablenumber)
            fieldlist += ',w%d.location' % tablenumber
            tablelist += 'wordlocation w%d' % tablenumber
            clauselist += 'w%d.wordid=%d' % (tablenumber, wordid)
            tablenumber += 1
    # Create the query from the separate parts and run it
    fullquery = 'select %s from %s where %s' % (fieldlist, tablelist, clauselist)
    print(fullquery)
    cur = self.con.execute(fullquery)
    rows = [row for row in cur]
    return rows, wordids
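
For a two-word query the generated SQL self-joins wordlocation once per word. Assuming hypothetical word IDs 10 and 17 for the two query words, the printed fullquery would be (shown wrapped; the code prints it on one line):

select w0.urlid,w0.location,w1.location
from wordlocation w0,wordlocation w1
where w0.wordid=10 and w0.urlid=w1.urlid and w1.wordid=17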

Content-Based Ranking

Use a normalization function to map the scores onto the range 0 to 1:

def normalizescores(self, scores, smallIsBetter=0):
    vsmall = 0.00001  # Avoid division by zero
    if smallIsBetter:
        minscore = min(scores.values())
        return dict([(u, float(minscore) / max(vsmall, l)) for (u, l) in scores.items()])
    else:
        maxscore = max(scores.values())
        if maxscore == 0: maxscore = vsmall
        return dict([(u, float(c) / maxscore) for (u, c) in scores.items()])

Word Frequency

Score pages by how many times the query words appear in them.

def frequencyscore(self, rows):
    # Count how many result rows each URL appears in
    counts = dict([(row[0], 0) for row in rows])
    for row in rows: counts[row[0]] += 1
    return self.normalizescores(counts)

Document Location

Score pages by how close to the top of the document the query words appear.

def locationscore(self, rows):
    # Track the lowest (earliest-in-document) combined word location per URL
    locations = dict([(row[0], 1000000) for row in rows])
    for row in rows:
        loc = sum(row[1:])
        if loc < locations[row[0]]: locations[row[0]] = loc
    return self.normalizescores(locations, smallIsBetter=1)

Word Distance

The closer together the query words appear, the higher the score.

def distancescore(self, rows):
    # If there's only one word, every page gets the same score
    if len(rows[0]) <= 2: return dict([(row[0], 1.0) for row in rows])
    # Initialize the dictionary with large values
    mindistance = dict([(row[0], 1000000) for row in rows])
    for row in rows:
        dist = sum([abs(row[i] - row[i - 1]) for i in range(2, len(row))])
        if dist < mindistance[row[0]]: mindistance[row[0]] = dist
    return self.normalizescores(mindistance, smallIsBetter=1)

Using Inbound Links

Simple Count

Count how many other pages link to this page; the more inbound links, the higher the score.

def inboundlinkscore(self, rows):
    uniqueurls = dict([(row[0], 1) for row in rows])
    # Count the rows in the link table that point at each URL
    inboundcount = dict(
        [(u, self.con.execute('select count(*) from link where toid=%d' % u).fetchone()[0]) for u in uniqueurls])
    return self.normalizescores(inboundcount)

PageRank

Give every page an initial PageRank value, then iteratively recompute each page's PR with the formula below.

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)),  with d = 0.85

where T1...Tn are the pages that link to A and C(Ti) is the number of outbound links on Ti. For a page A linked to from pages B, C, and D:

PR(A) = 0.15 + 0.85 * (PR(B)/links(B) + PR(C)/links(C) + PR(D)/links(D))
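
The notes don't include the iterative computation itself; here is a minimal standalone sketch over an in-memory link graph (dict-based rather than the sqlite tables used elsewhere in this chapter; the graph literal is made up):

def calculatepagerank(links, iterations=20, d=0.85):
    # links maps each page to the list of pages it links to
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        newpr = {}
        for page in links:
            rank = 1 - d
            # Add the damped, share-divided PageRank of every page linking here
            for other, outlinks in links.items():
                if page in outlinks and other != page:
                    rank += d * pr[other] / len(outlinks)
            newpr[page] = rank
        pr = newpr
    return pr

links = {'A': ['B', 'C'], 'B': ['A'], 'C': ['A', 'B'], 'D': ['A']}
print(calculatepagerank(links))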

Using Link Text

Score a page based on the text of the links pointing to it.

def linktextscore(self, rows, wordids):
    linkscores = dict([(row[0], 0) for row in rows])
    for wordid in wordids:
        # Find every link whose text contains this query word
        cur = self.con.execute(
            'select link.fromid,link.toid from linkwords,link where wordid=%d and linkwords.linkid=link.rowid' % wordid)
        for (fromid, toid) in cur:
            if toid in linkscores:
                # Credit the target page with the PageRank of the linking page
                pr = self.con.execute('select score from pagerank where urlid=%d' % fromid).fetchone()[0]
                linkscores[toid] += pr
    maxscore = max(linkscores.values())
    if maxscore == 0:
        maxscore = 0.00001
    normalizedscores = dict([(u, float(l) / maxscore) for (u, l) in linkscores.items()])
    return normalizedscores
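
These scores are most useful in combination. A minimal sketch of a weighted sum of the normalized scores (the method name getscoredlist and the weights are illustrative, not from the notes):

def getscoredlist(self, rows, wordids):
    totalscores = dict([(row[0], 0) for row in rows])

    # (weight, scores) pairs -- the weights here are arbitrary choices
    weights = [(1.0, self.frequencyscore(rows)),
               (1.5, self.locationscore(rows)),
               (1.0, self.inboundlinkscore(rows))]

    for (weight, scores) in weights:
        for url in totalscores:
            totalscores[url] += weight * scores[url]

    return totalscores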
