NLP | The Training Path of a gensim Boy

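While reproducing some models recently, I noticed that almost all of them initialize their embeddings with pre-trained vectors from word embedding models such as word2vec or GloVe, and then fine-tune from there. These word embedding models are not hard to implement, but implementing them is not my main goal 🥺. More to the point, my own implementation would not perform as well as the open-source ones, since their code has been heavily optimized. So if you want good results without spending much time on pre-training word vectors, learning gensim is well worth it. This post records the basic usage of gensim, to help you quickly build a vocab and obtain pre-trained word vectors.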

The data format gensim expects

texts = [['human', 'interface', 'computer'],
['survey', 'user', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'system'],
['system', 'human', 'system', 'eps'],
['user', 'response', 'time'],
['trees'],
['graph', 'trees'],
['graph', 'minors', 'trees'],
['graph', 'minors', 'survey']]
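
For reference, here is a minimal sketch of how raw strings might be turned into a list of token lists like the one above. The stopword set, the sample sentences, and the `preprocess` helper are all made up for illustration. Whitespace splitting stands in for a real tokenizer; for Chinese text a tokenizer such as jieba would take its place.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "and", "to", "for", "in"}

def preprocess(documents, stopwords=STOPWORDS):
    """Lowercase each document, split on whitespace, and drop stopwords."""
    return [
        [token for token in doc.lower().split() if token not in stopwords]
        for doc in documents
    ]

# Made-up raw documents.
raw_docs = [
    "Human interface for the computer",
    "A survey of user response time",
]

texts_example = preprocess(raw_docs)
print(texts_example)
```

Each inner list is one sentence (or document), which is the unit gensim trains on.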

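In other words, before feeding data into a gensim model for training, the corpus has to be tokenized, stripped of stopwords, and so on, until it ends up in the list-of-token-lists format shown at the top of this section. Of course, a very large file cannot be loaded into memory all at once, so gensim also supports streaming: any Python iterable that yields one tokenized sentence at a time will do, for example a class whose __iter__ method is a generator. As follows: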

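from gensim.models import Word2Vec
import jieba

class MySentence(object):
    def __init__(self, file):
        self.file = file

    def __iter__(self):
        with open(self.file, "r", encoding="utf-8") as f:
            # Iterate over the file object directly so that lines are streamed
            # instead of being read into memory all at once.
            for line in f:
                # Other steps, such as stopword removal, can also go here.
                yield list(jieba.cut(line.strip()))

file = "corpus.txt"  # placeholder path: a raw text file, one sentence per line
my_sentence = MySentence(file)
w2v_model = Word2Vec(my_sentence)

# Look up the word vector of a token (the token must be in the vocabulary).
# Vectors live on the .wv attribute; indexing the model directly
# (w2v_model["a"]) only works on gensim versions before 4.0.
w2v_model.wv["a"]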
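
One detail worth spelling out: Word2Vec goes over the corpus more than once (one pass to build the vocabulary, then one pass per training epoch), so the input has to be re-iterable. A bare generator is exhausted after the first pass, whereas an object with an `__iter__` method starts over each time. The sketch below shows the difference in plain Python, without gensim. The class and function names are made up for this example.

```python
def sentence_generator(lines):
    """A plain generator: it can be consumed only once."""
    for line in lines:
        yield line.split()

class RestartableSentences:
    """An iterable whose __iter__ starts a fresh pass on every call."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()

lines = ["graph minors trees", "user response time"]

gen = sentence_generator(lines)
first_pass = list(gen)
second_pass = list(gen)  # empty: the generator is already exhausted

restartable = RestartableSentences(lines)
pass_one = list(restartable)
pass_two = list(restartable)  # same content again

print(len(first_pass), len(second_pass))
print(len(pass_one), len(pass_two))
```

This is why the streaming example wraps the file in a class rather than passing a generator expression straight to Word2Vec.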

Besides the approach above, there is another option, which is to use LineSentence, as follows:

from gensim.models.word2vec import LineSentence
# Note: a.txt must already be tokenized, one sentence per line with tokens separated by whitespace!
sentences = LineSentence("a.txt")
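
To make the expected file layout concrete, here is a small sketch that writes already-tokenized sentences to a file, one sentence per line with space-separated tokens, and reads them back by splitting each line on whitespace. It uses only the standard library and a temporary directory. The reading step merely imitates what a line-based reader does and is not gensim's own code.

```python
import os
import tempfile

# Made-up, already-tokenized sentences.
tokenized = [["graph", "minors", "trees"], ["user", "response", "time"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "a.txt")

    # Write one sentence per line, tokens separated by single spaces.
    with open(path, "w", encoding="utf-8") as f:
        for tokens in tokenized:
            f.write(" ".join(tokens) + "\n")

    # Read the file back line by line, splitting on whitespace.
    with open(path, "r", encoding="utf-8") as f:
        read_back = [line.split() for line in f]

print(read_back)
```

For Chinese text, this means running the tokenizer first and joining its output with spaces before writing each line.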

Word2Vec model parameters

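from gensim.models import Word2Vec

# By default, the resulting vocabulary is sorted by descending frequency.
model = Word2Vec(
    sentences=sentences,  # input data
    min_count=5,          # tokens with a frequency below min_count are discarded
    vector_size=100,      # embedding dimension (this parameter is named `size` in gensim < 4.0)
    workers=4,            # number of worker threads used for training
    sg=0,                 # 0 means CBOW, 1 means skip-gram
)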
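
To get a feel for what min_count does, the following sketch applies the same rule by hand to the toy corpus from the start of this post: count every token, drop those that occur fewer than min_count times, and sort the rest by descending frequency. The `build_vocab` helper is hypothetical and is not gensim's implementation. In particular, breaking ties alphabetically is an assumption made here so that the result is deterministic.

```python
from collections import Counter

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

def build_vocab(corpus, min_count):
    """Return tokens with frequency >= min_count, most frequent first."""
    counts = Counter(token for sentence in corpus for token in sentence)
    kept = [(token, n) for token, n in counts.items() if n >= min_count]
    # Sort by descending count; ties are broken alphabetically (an assumption).
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [token for token, _ in kept]

vocab = build_vocab(texts, min_count=3)
print(vocab)  # ['system', 'graph', 'trees', 'user']
```

With a corpus this small, the default-like setting min_count=5 would leave nothing at all, which is worth keeping in mind when experimenting on toy data.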

Saving and loading the vocabulary and word vectors

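# Saved this way, the full model can be loaded later and trained further.
model.save("vocab.w2v")
model = Word2Vec.load("vocab.w2v")

# Saved this way, only the word vectors are kept, so training cannot be continued.
model.wv.save_word2vec_format("wde.txt", binary=False)
model.wv.save_word2vec_format("wd.bin", binary=True)

from gensim.models import KeyedVectors
# load_word2vec_format returns a KeyedVectors object, not a full Word2Vec model.
model = KeyedVectors.load_word2vec_format("ds.txt", binary=False)
model = KeyedVectors.load_word2vec_format("ds.bin", binary=True)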
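
As far as I know, the text flavor of the word2vec format (binary=False) is very simple: a header line with the vocabulary size and the vector dimension, followed by one line per word containing the word and its values separated by spaces. The sketch below writes and parses that layout in plain Python, which can be handy for inspecting a vector file or feeding it into a framework without gensim. Both helper functions and the toy vectors are made up for this example.

```python
import io

def write_word2vec_text(vectors, stream):
    """Write a {word: [floats]} dict in the word2vec text layout."""
    dim = len(next(iter(vectors.values())))
    stream.write(f"{len(vectors)} {dim}\n")  # header: vocab size and dimension
    for word, vec in vectors.items():
        stream.write(word + " " + " ".join(repr(x) for x in vec) + "\n")

def read_word2vec_text(stream):
    """Parse the word2vec text layout back into a {word: [floats]} dict."""
    vocab_size, dim = map(int, stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip("\n").split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return vectors

# Made-up 3-dimensional vectors, round-tripped through an in-memory buffer.
toy = {"graph": [0.1, -0.2, 0.3], "trees": [0.0, 0.5, -1.0]}
buffer = io.StringIO()
write_word2vec_text(toy, buffer)
buffer.seek(0)
loaded = read_word2vec_text(buffer)
print(loaded)
```

A dict like `loaded` is enough to fill the rows of an embedding matrix for the tokens in your own vocab, which is the initialization step mentioned at the start of this post.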

That is roughly everything I need for now. I will update this post when I end up using more. 🥰

References

https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec