Machine Learning: Document Retrieval

xzz111116 · 2018-10-16 16:29:02

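Suppose you are reading an article about Messi's performance at the World Cup. When you finish, you want to read other football-related articles: news about Messi at Barcelona, La Liga news, transfer news from the five major European leagues, and so on. In other words, we want to find similar articles, so how do we measure the similarity between two articles?
One approach is to compare the words they contain. Say article A is about Messi, article B is about Pelé, and article C is about international politics. The vocabulary of article A will overlap much more with that of article B than with that of article C.
Step 1: extract every word in the article together with the number of times it appears, and turn the result into a vector
For example, suppose article A contains the sentence "巴萨想和梅西续约,加入2020年自由离队条款" ("Barça wants to renew Messi's contract, adding a clause that lets him leave for free in 2020"). The words in this sentence and their counts can be written as a vector:
巴萨 (Barça) 1
想 (wants) 1
和 (with) 1
梅西 (Messi) 1
续约 (renew contract) 1
加入 (add) 1
2020 1
年 (year) 1
自由 (free) 1
离队 (leave the club) 1
条款 (clause) 1
Another example: "Carlos calls the sport futbol. Emily calls the sport soccer."
In this sentence, Carlos appears once, while the, calls, and sport each appear twice, and so on.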

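These are only single-sentence examples. In practice we list every word that occurs in the two articles and count its occurrences in each article, which gives one vector per article.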
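The counting step can be sketched in plain Python using only the standard library. The count_words helper below is a hypothetical name chosen for illustration (it is not GraphLab's function of the same name), and it assumes English text that can be split on non-word characters. The Chinese sentence above would first need a word segmenter.

```python
import re
from collections import Counter

def count_words(text):
    """Return a {word: count} mapping for one document.

    Assumption: words are lowercased and split on non-word characters,
    which is a simplification that only suits space-delimited languages.
    """
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

sentence = "Carlos calls the sport futbol. Emily calls the sport soccer."
word_count = count_words(sentence)

print(word_count["the"])     # 2
print(word_count["carlos"])  # 1
```

A Counter returns 0 for words it has never seen, which is convenient when two documents are compared word by word.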
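Step 2: compute the dot product of the two vectors; the larger the product, the more similar the articles
Article B is about Pelé. It may never mention Messi, but it is certainly about football.

The dot product of the vectors for articles A and B is 13.

The dot product of the vectors for articles A and C is 0, so article A is clearly more similar to article B.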
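As a rough sketch of this step, the dot product of two word-count vectors can be computed over the words the two documents share. The three tiny word-count dictionaries below are invented purely for illustration; they are not the articles discussed above and do not reproduce the value 13.

```python
def dot_product(count_a, count_b):
    """Sum count_a[w] * count_b[w] over the words that occur in both documents."""
    return sum(count * count_b[word]
               for word, count in count_a.items()
               if word in count_b)

# Hypothetical word counts, made up for this example.
article_a = {"messi": 2, "barcelona": 1, "football": 3}
article_b = {"pele": 2, "football": 4, "brazil": 1}
article_c = {"election": 2, "treaty": 1}

print(dot_product(article_a, article_b))  # 12, they share "football" (3 * 4)
print(dot_product(article_a, article_c))  # 0, no words in common
```

Storing each vector as a dictionary keeps only the words that actually occur, so words missing from either document contribute nothing to the sum.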
Next, we use a Wikipedia dataset of biographies to show how document retrieval can be implemented in a Jupyter Notebook.

Starting GraphLab

import graphlab

Loading the Wikipedia people data

Read the dataset as an SFrame and store it in people

people = graphlab.SFrame('people_wiki.gl/')

Viewing the dataset

people.head()

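The dataset has 3 columns: the first is the URL of the person's page (URL), the second is the person's name (name), and the third is the person's biography (text)

Checking the size of the dataset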

len(people)

59071
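The dataset contains nearly 60,000 records in total

Exploring the dataset

Every record in people is a biography. Let's start with Robert De Niro, one of the greatest actors in American film history and also my favorite actor, whose best-known films include Once Upon a Time in America, The Godfather Part II, and Taxi Driver.

Find the entry in the name column of people whose name is Robert De Niro, and display it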
Robert = people[people['name'] == 'Robert De Niro']
Robert

Viewing the text column

The biography of Robert De Niro is stored in the text column of the dataset
Robert['text']

dtype: str
Rows: ?
['robert de niro dnro born august 17 1943 is an american actor and film producer who has starred in
over 90 films his first major film roles were in the sports drama bang the drum slowly 1973 and the
martin scorsesedirected crime film mean streets 1973 in 1974 after being turned down for the role
of sonny corleone in the crime film the godfather 1972 he was cast as the young vito corleone in the
godfather part ii a role for which he won the academy award for best supporting actorde niros longtime
collaboration with director martin scorsese later earned him an academy award for best actor for
his portrayal of jake lamotta in the 1980 film raging bull he earned nominations for the neo noir
psychological thriller taxi driver in 1976 and the psychological thriller cape fear in 1991 both
directed by scorsese de niro received additional academy award nominations for michael ciminos vietnam
war drama the deer hunter 1978 penny marshalls drama awakenings 1990 and david o russells romantic
comedydrama silver linings playbook 2012 his portrayal of gangster jimmy conway in scorseses crime
film goodfellas 1990 earned him a bafta nomination in 1990de niro has earned four nominations for
the golden globe award for best actor motion picture musical or comedy for his work in the musical
drama new york new york 1977 opposite liza minnelli the action comedy midnight run 1988 the gangster
comedy analyze this 1999 and the comedy meet the parents 2000 he has also simultaneously directed
and starred in films such as the crime drama a bronx tale 1993 and the spy film the good shepherd
2006 de niro has received many accolades for his career including the afi life achievement award
2003 and the golden globe cecil b demille award 2010', ... ]
In the same way we can look at other people's biographies, for example David Beckham

beckham = people[people['name'] == 'David Beckham']
beckham['text']

dtype: str
Rows: ?
['david robert joseph beckham obe bkm born 2 may 1975 is an english former professional footballer
he has played for manchester united preston north end real madrid milan la galaxy paris saintgermain
and the england national team for which he holds the appearance record for an outfield player he was
the first english player to win league titles in four countries england spain the united states and
france he announced his retirement at the end of the 201213 season and on 18 may 2013 played the
final game of his 20year careerbeckhams professional career began with manchester united where he
made his firstteam debut in 1992 aged 17 with united beckham won the premier league title six times
the fa cup twice and the uefa champions league in 1999 he then played four seasons with real madrid
winning the la liga championship in his final season with the club in july 2007 beckham signed a
fiveyear contract with major league soccer club la galaxy while a galaxy player he spent two loan
spells in italy with milan in 2009 and 2010 beckham was the first british footballer to play 100
champions league gamesin international football beckham made his england debut on 1 september 1996
at the age of 21 he was captain for six years during which he played 58 times he made 115 career
appearances in total appearing at three fifa world cup editions 1998 2002 and 2006 and two uefa
european championship tournaments 2000 and 2004renowned for his range of passing crossing ability
and bending freekicks he has twice been runnerup for fifa world player of the year and in 2004 he
was named in the fifa 100 list of the worlds greatest living players consistently ranked among the
sports highest earners in 2013 beckham was listed as the best paid player in the world earning over
50 million in the previous 12 monthshe has been married to victoria beckham since 1999 and they have
four children beckhams eldest son brooklyn currently plays football for arsenal u16 in february 2014
mls announced beckham and a group of investors would own an expansion team in miami which would
begin play in 2016 or 2017', ... ]

Computing word counts

Word counts are computed with text_analytics.count_words from graphlab
Robert['word_count'] = graphlab.text_analytics.count_words(Robert['text'])
print Robert['word_count']

[{'taxi': 1L, 'directed': 2L, 'producer': 1L, 'golden': 2L, 'being': 1L, 'over': 1L, 'midnight': 1L,
'four': 1L, 'liza': 1L, 'including': 1L, 'playbook': 1L, 'cape': 1L, 'fear': 1L, 'noir': 1L, 'earned'
: 4L, '1943': 1L, 'young': 1L, 'crime': 4L, 'parents': 1L, 'linings': 1L, 'niro': 4L, 'has': 4L, '2010'
: 1L, '2012': 1L, 'shepherd': 1L, 'his': 5L, 'de': 3L, 'hunter': 1L, 'actorde': 1L, 'jake': 1L,
'scorsese': 2L, 'him': 2L, 'down': 1L, '17': 1L, 'michael': 1L, 'this': 1L, 'martin': 2L, 'vito':
1L, 'starred': 2L, 'mean': 1L, 'demille': 1L, 'globe': 2L, 'streets': 1L, 'gangster': 2L, 'simultaneously'
: 1L, 'best': 3L, 'neo': 1L, 'for': 11L, 'robert': 1L, 'won': 1L, 'bang': 1L, 'longtime': 1L, 'new'
: 2L, 'ciminos': 1L, 'supporting': 1L, 'who': 1L, 'run': 1L, 'august': 1L, 'minnelli': 1L, 'turned'
: 1L, 'of': 3L, 'york': 2L, 'by': 1L, 'received': 2L, 'bafta': 1L, 'drum': 1L, 'career': 1L, 'many'
: 1L, 'comedydrama': 1L, 'o': 1L, 'david': 1L, 'motion': 1L, 'american': 1L, 'dnro': 1L, 'action':
1L, 'or': 1L, 'silver': 1L, 'first': 1L, 'scorsesedirected': 1L, 'major': 1L, 'jimmy': 1L, 'conway'
: 1L, 'psychological': 2L, 'portrayal': 2L, 'accolades': 1L, 'sonny': 1L, 'additional': 1L, '1980':
1L, '1988': 1L, 'sports': 1L, 'slowly': 1L, 'thriller': 2L, 'films': 2L, 'analyze': 1L, 'was': 1L,
'war': 1L, 'life': 1L, 'both': 1L, 'scorseses': 1L, 'award': 6L, 'corleone': 2L, 'bull': 1L, '90'
: 1L, 'with': 1L, 'he': 4L, '1991': 1L, '1990': 2L, '1993': 1L, 'b': 1L, 'romantic': 1L, 'roles':
1L, 'born': 1L, '1999': 1L, 'awakenings': 1L, 'work': 1L, 'cast': 1L, 'were': 1L, 'meet': 1L,
'nominations': 3L, 'and': 8L, 'spy': 1L, '1990de': 1L, 'later': 1L, 'raging': 1L, 'is': 1L, 'penny'
: 1L, 'deer': 1L, 'an': 2L, 'ii': 1L, 'as': 2L, 'good': 1L, 'in': 12L, 'russells': 1L, 'goodfellas'
: 1L, 'drama': 5L, 'afi': 1L, 'lamotta': 1L, 'actor': 3L, 'cecil': 1L, 'also': 1L, 'vietnam': 1L,
'role': 2L, 'film': 7L, 'which': 1L, 'academy': 3L, 'comedy': 4L, 'picture': 1L, 'godfather': 2L,
'opposite': 1L, 'niros': 1L, 'collaboration': 1L, 'after': 1L, 'driver': 1L, 'director': 1L, 'bronx'
: 1L, 'such': 1L, 'nomination': 1L, 'achievement': 1L, 'a': 3L, '1978': 1L, '1977': 1L, '1976': 1L,
'1974': 1L, '1973': 2L, '1972': 1L, 'tale': 1L, '2003': 1L, '2000': 1L, 'part': 1L, '2006': 1L,
'the': 24L, 'marshalls': 1L, 'musical': 2L}]

Sorting

Split the contents of the word_count column into two columns: the word (word) and its count (count)
Robert_word_count_table = Robert[['word_count']].stack('word_count', new_column_name = ['word','count'])
Robert_word_count_table.head()

Sort in descending order
Robert_word_count_table.sort('count', ascending = False)

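In Robert De Niro's biography the most frequent word is the, with 24 occurrences, followed by in, with 12.

Computing TF-IDF (term frequency-inverse document frequency)

The result above tells us nothing useful, because the most frequent words are the, in, for, and and. This is the problem TF-IDF was designed to solve: it down-weights words that appear in many documents and up-weights words that are frequent in one document but rare across the corpus. Plenty of material on TF-IDF is available online.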
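To make the idea concrete, here is a minimal sketch of one common TF-IDF variant: the word's count in the document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the word. This is an assumption for illustration; the exact formula and smoothing used by graphlab.text_analytics.tf_idf may differ. The three toy documents are invented.

```python
import math

def tf_idf(word_counts):
    """Compute TF-IDF weights for a list of {word: count} dicts, one per document.

    Variant used here (an assumption, not necessarily GraphLab's):
        weight = count * log(N / df)
    """
    num_docs = len(word_counts)
    # Document frequency: in how many documents does each word occur?
    doc_freq = {}
    for counts in word_counts:
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return [
        {word: count * math.log(float(num_docs) / doc_freq[word])
         for word, count in counts.items()}
        for counts in word_counts
    ]

# Hypothetical word counts for three documents.
docs = [
    {"the": 3, "actor": 2, "film": 1},
    {"the": 2, "footballer": 2},
    {"the": 4, "film": 1, "director": 1},
]
weights = tf_idf(docs)

print(weights[0]["the"])                         # 0.0, "the" occurs in every document
print(weights[0]["actor"] > weights[0]["film"])  # True, "actor" is rarer and more frequent here
```

A word that appears in every document gets weight 0 under this variant, which is why the, in, for, and and drop out of the top of the ranking.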

Compute the word counts for every document in the dataset
people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people.head()

tfidf = graphlab.text_analytics.tf_idf(people['word_count'])
tfidf = graphlab.SFrame(tfidf)
tfidf
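The code above uses text_analytics.tf_idf to compute the TF-IDF values and wraps the result in an SFrame so that it can be inspected

Each word now has its own TF-IDF value; for example, the TF-IDF value of since is 1.455
Add the computed TF-IDF values to the people dataset as a column named tfidf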
people['tfidf'] = tfidf['X1']
Find the Robert De Niro entry in the dataset

Robert = people[people['name'] == 'Robert De Niro']
Inspect Robert De Niro's TF-IDF values: split them into the word (word) and that word's TF-IDF value (tfidf), and sort in descending order

Robert[['tfidf']].stack('tfidf', new_column_name = ['word','tfidf']).sort('tfidf', ascending = False)

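After the TF-IDF transformation, the most important words are niro, corleone, drama, and so on, rather than the, and, and the like

Computing the distance between people manually

We take Nicole Kidman and David Beckham as examples and compute the distance between Robert De Niro and each of them
Nicole Kidman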

nicole = people[people['name'] == 'Nicole Kidman']
David Beckham
beckham = people[people['name'] == 'David Beckham']
Is Robert De Niro closer to Nicole Kidman or to David Beckham?
Compute the distance between Robert De Niro and Nicole Kidman
graphlab.distances.cosine(Robert['tfidf'][0], nicole['tfidf'][0])
0.7871455878121084
Compute the distance between Robert De Niro and David Beckham

graphlab.distances.cosine(Robert['tfidf'][0], beckham['tfidf'][0])
0.9879813542250133
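The smaller the distance, the stronger the similarity. The results show that Robert De Niro is closer to Nicole Kidman (0.787), since both are Hollywood actors. Beckham is a footballer, so the distance between Robert De Niro and Beckham is larger (0.988).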
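For readers who want to see what a cosine distance involves, the sketch below computes 1 minus the cosine similarity of two sparse vectors stored as dictionaries. It is a plain-Python illustration that assumes both vectors are non-empty; it is not GraphLab's implementation, and the three toy vectors are invented.

```python
import math

def cosine_distance(vec_a, vec_b):
    """Return 1 - (a . b) / (|a| * |b|) for two {word: weight} dicts.

    Assumption: neither vector is empty or all zeros, otherwise the
    division below would fail.
    """
    dot = sum(value * vec_b[word]
              for word, value in vec_a.items()
              if word in vec_b)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical TF-IDF vectors.
actor_a = {"film": 2.0, "award": 1.0}
actor_b = {"film": 1.0, "award": 2.0}
player = {"football": 3.0}

print(cosine_distance(actor_a, actor_b))  # about 0.2
print(cosine_distance(actor_a, player))   # 1.0, no shared words
```

Because the dot product is divided by both vector lengths, a long biography is not favored over a short one simply for containing more words.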

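Building a nearest neighbors (KNN) model for document retrieval

Use nearest_neighbors.create to build the nearest neighbors model, with TF-IDF as the feature and name as the label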
knn_model = graphlab.nearest_neighbors.create(people, features=['tfidf'], label = 'name')
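Conceptually, a nearest neighbors query measures the distance from the query document to every document in the corpus and returns the closest ones. The brute-force sketch below illustrates that idea with cosine distance, as in the manual calculation above. The query function, the corpus layout, and the three entries are all hypothetical, and GraphLab may choose a different search method or distance by default.

```python
import math

def cosine_distance(vec_a, vec_b):
    """Return 1 - cosine similarity for two non-empty {word: weight} dicts."""
    dot = sum(value * vec_b[word]
              for word, value in vec_a.items()
              if word in vec_b)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def query(corpus, name, k=3):
    """Return the k entries of corpus closest to corpus[name].

    corpus maps a label to its {word: weight} vector. Every entry is
    compared with the target, so the cost grows linearly with corpus size.
    """
    target = corpus[name]
    distances = [(other, cosine_distance(target, vec))
                 for other, vec in corpus.items()]
    return sorted(distances, key=lambda pair: pair[1])[:k]

# Hypothetical corpus of TF-IDF vectors.
corpus = {
    "Actor A": {"film": 2.0, "award": 1.0},
    "Director B": {"film": 1.0, "award": 2.0},
    "Player C": {"football": 3.0, "award": 1.0},
}
result = query(corpus, "Actor A")
for label, distance in result:
    print(label)
```

The query document itself always comes back first, at a distance of essentially zero, which matches what the model returns in the queries that follow.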

Retrieving documents with KNN (k-Nearest Neighbor algorithm)

Use the KNN model we just built to ask: who is closest to Robert De Niro?
knn_model.query(Robert)

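As expected, the closest entry is Robert De Niro himself, in first place. Second is Martin Scorsese, the American director who has worked with De Niro many times, followed by Catherine Zeta-Jones, Leonardo DiCaprio, and Alec Baldwin, all of them Hollywood actors
Who is closest to Nicole Kidman?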
knn_model.query(nicole)

The people closest to Nicole Kidman are Leonardo DiCaprio, Catherine Zeta-Jones, Michelle Williams, and Toni Collette
Who is closest to David Beckham?
knn_model.query(beckham)

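Clearly, the nearest neighbors are all footballers or coaches, such as Gerrard, Drogba, Gordon Strachan, and Rooney
Other examples of document retrieval

Rock singer David Bowie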

david = people[people['name'] == 'David Bowie']
Find the people closest to David Bowie

knn_model.query(david)

Natalie Portman

natalie = people[people['name'] == 'Natalie Portman']
Who is closest to Natalie Portman?
knn_model.query(natalie)

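Catherine Zeta-Jones appears once again, followed by Jodie Foster, Keira Knightley, and Nicole Kidman

Finally, one more example, using the psychologist Albert Bandura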

bandura = people[people['name'] == 'Albert Bandura']
knn_model.query(bandura)

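The results include Amos Tversky, who proposed prospect theory and the framing effect together with Daniel Kahneman; Kahneman received the 2002 Nobel Memorial Prize in Economic Sciences for this work, several years after Tversky's death in 1996. They also include Saul Sternberg, who proposed the additive factors method for analyzing reaction times. So the KNN-based document retrieval model appears to be reasonably accurate.

References: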
Machine Learning Foundations: A Case Study Approach, University of Washington
