
NLP | A Walkthrough of the BERT Source Code

I've been reading up on pre-trained models recently and noticed that the source code of most of them is essentially a modification of the BERT code officially released by Google (but it's all TF 1.x 😷, which I have to complain about: after TF 2.x came out Google has been promoting it heavily, yet even Google's own releases such as ELECTRA, Adapter-BERT and ALBERT still do import tensorflow.compat.v1 as tf 😷, excuse me?). So I went back and carefully re-read the original BERT source code. Overall it was a very smooth read; I have to say the code is genuinely well written. This post is mainly a record of my own walk through the BERT source code.

Overall Structure of the BERT Code

I won't go over the theory behind BERT here; there are plenty of write-ups online, and I'd recommend going back to the original paper anyway. Let's start with the files in the BERT repository and its overall structure:

├── README.md
├── create_pretraining_data.py
├── extract_features.py
├── modeling.py
├── modeling_test.py
├── multilingual.md
├── optimization.py
├── optimization_test.py
├── predicting_movie_reviews_with_bert_on_tf_hub.ipynb
├── requirements.txt
├── run_classifier.py
├── run_classifier_with_tfhub.py
├── run_pretraining.py
├── run_squad.py
├── sample_text.txt
├── tokenization.py
└── tokenization_test.py
  • create_pretraining_data.py: builds the pre-training instances;
  • extract_features.py: extracts pre-trained features (contextual embeddings);
  • modeling.py: the core modeling file of BERT, the model body;
  • modeling_test.py: unittest for modeling.py;
  • optimization.py: the custom optimizer;
  • optimization_test.py: unittest for optimization.py;
  • predicting_movie_reviews_with_bert_on_tf_hub.ipynb: uses BERT through TF Hub for prediction;
  • run_classifier.py: fine-tunes BERT on a number of datasets (e.g. MRPC, XNLI, MNLI, CoLA);
  • run_classifier_with_tfhub.py: fine-tunes through TF Hub;
  • run_pretraining.py: pre-trains the model with the MLM and NSP tasks;
  • run_squad.py: fine-tunes on the SQuAD dataset;
  • tokenization.py: cleans and tokenizes the raw text;
  • tokenization_test.py: unittest for tokenization.py.

That is a brief overview of each file. Next I will walk through the core files; tensor shapes and the important notes are written directly into the code.

Walkthrough of the Core Files

modeling.py

modeling.py implements the BERT model itself. Let's start with the BertConfig class:

class BertConfig(object):
"""Configuration for `BertModel`."""

def __init__(self,
vocab_size,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=512,
type_vocab_size=16,
initializer_range=0.02):
"""Constructs BertConfig.

Args:
vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
hidden_size: Size of the encoder layers and the pooler layer.
num_hidden_layers: Number of hidden layers in the Transformer encoder.
num_attention_heads: Number of attention heads for each attention layer in
the Transformer encoder.
intermediate_size: The size of the "intermediate" (i.e., feed-forward)
layer in the Transformer encoder.
hidden_act: The non-linear activation function (function or string) in the
encoder and pooler.
hidden_dropout_prob: The dropout probability for all fully connected
layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob: The dropout ratio for the attention
probabilities.
max_position_embeddings: The maximum sequence length that this model might
ever be used with. Typically set this to something large just in case
(e.g., 512 or 1024 or 2048).
type_vocab_size: The vocabulary size of the `token_type_ids` passed into
`BertModel`.
initializer_range: The stdev of the truncated_normal_initializer for
initializing all weight matrices.
"""
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.hidden_act = hidden_act
self.intermediate_size = intermediate_size
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.initializer_range = initializer_range

@classmethod
def from_dict(cls, json_object):
"""Constructs a `BertConfig` from a Python dictionary of parameters."""
config = BertConfig(vocab_size=None)
for (key, value) in six.iteritems(json_object):
config.__dict__[key] = value
return config

@classmethod
def from_json_file(cls, json_file):
"""Constructs a `BertConfig` from a json file of parameters."""
with tf.gfile.GFile(json_file, "r") as reader:
text = reader.read()
return cls.from_dict(json.loads(text))

def to_dict(self):
"""Serializes this instance to a Python dictionary."""
output = copy.deepcopy(self.__dict__)
return output

def to_json_string(self):
"""Serializes this instance to a JSON string."""
return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

This class holds the BERT model configuration, and the docstring is already quite detailed. One parameter worth highlighting is type_vocab_size, the vocabulary size of the segment (token type) ids: the constructor defaults it to 16, but the released bert_config.json sets it to 2 (sentence A and sentence B), which is what is actually used.
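
As a quick check, the config can be loaded straight from the bert_config.json that ships with a released checkpoint (the path below is a placeholder):

import modeling  # modeling.py from the BERT repo

config = modeling.BertConfig.from_json_file("uncased_L-12_H-768_A-12/bert_config.json")
print(config.hidden_size)       # 768 for BERT-Base
print(config.type_vocab_size)   # 2 in the released config
print(config.to_json_string())  # serializes back to JSON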

After BertConfig comes the BERT model itself. Since the full model is fairly involved, let's look at its components first, starting with the token embedding part:

# embedding_lookup retrieves the token embeddings
def embedding_lookup(input_ids,
vocab_size,
embedding_size=128,
initializer_range=0.02,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=False):
"""Looks up words embeddings for id tensor.

Returns:
float Tensor of shape [batch_size, seq_length, embedding_size].
"""
# This function assumes that the input is of shape [batch_size, seq_length,
# num_inputs].
#
# If the input is a 2D tensor of shape [batch_size, seq_length], we
# reshape to [batch_size, seq_length, 1].
if input_ids.shape.ndims == 2:
input_ids = tf.expand_dims(input_ids, axis=[-1])

# Randomly initialize the embedding table
embedding_table = tf.get_variable(
name=word_embedding_name,
shape=[vocab_size, embedding_size],
initializer=create_initializer(initializer_range))

# Flatten; shape becomes [batch_size*seq_length*input_num]
flat_input_ids = tf.reshape(input_ids, [-1])
if use_one_hot_embeddings:
# One-hot vectors; shape: [batch_size*seq_length*input_num, vocab_size]
one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
# Shape: [batch_size*seq_length*input_num, embedding_size]
output = tf.matmul(one_hot_input_ids, embedding_table)
else:
# Shape: [batch_size*seq_length*input_num, embedding_size]
output = tf.gather(embedding_table, flat_input_ids)

input_shape = get_shape_list(input_ids)

# Shape: [batch_size, seq_length, input_num*embedding_size]
output = tf.reshape(output,
input_shape[0:-1] + [input_shape[-1] * embedding_size])
return (output, embedding_table)

The shape of every tensor is annotated in the code, so it should be straightforward to follow. embedding_lookup gives us both the token embeddings and the embedding_table, i.e. the word vector table. If we don't want to fine-tune, we can also pull out the trained embedding_table and train a downstream task in a feature-based way.
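
As a small illustration (a sketch assuming TF 1.x graph mode and that modeling.py is importable; the shapes are toy values), embedding_lookup can be called on its own:

import tensorflow as tf
import modeling

input_ids = tf.constant([[5, 7, 9], [2, 0, 0]])  # [batch_size=2, seq_length=3]
embedding_output, embedding_table = modeling.embedding_lookup(
    input_ids=input_ids,
    vocab_size=100,
    embedding_size=16,
    use_one_hot_embeddings=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out, table = sess.run([embedding_output, embedding_table])
    print(out.shape)    # (2, 3, 16): [batch_size, seq_length, embedding_size]
    print(table.shape)  # (100, 16): the word embedding table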

Besides the token embeddings, BERT also has segment embeddings and positional embeddings; the final input embedding is the sum of the three. The implementation is as follows:

# embedding_postprocessor adds the segment (token type) embeddings and the positional
# embeddings to the token embeddings to produce the final input embedding.
def embedding_postprocessor(input_tensor,
use_token_type=False,
token_type_ids=None,
token_type_vocab_size=16,  # 2 in practice
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=0.02,
max_position_embeddings=512,  # must be >= seq_length
dropout_prob=0.1):
"""Performs various post-processing on a word embedding tensor.

Args:
input_tensor: float Tensor of shape [batch_size, seq_length,
embedding_size].
use_token_type: bool. Whether to add embeddings for `token_type_ids`.
token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
Must be specified if `use_token_type` is True.
token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
token_type_embedding_name: string. The name of the embedding table variable
for token type ids.
use_position_embeddings: bool. Whether to add position embeddings for the
position of each token in the sequence.
position_embedding_name: string. The name of the embedding table variable
for positional embeddings.
initializer_range: float. Range of the weight initialization.
max_position_embeddings: int. Maximum sequence length that might ever be
used with this model. This can be longer than the sequence length of
input_tensor, but cannot be shorter.
dropout_prob: float. Dropout probability applied to the final output tensor.

Returns:
float tensor with same shape as `input_tensor`.

Raises:
ValueError: One of the tensor shapes or input values is invalid.
"""
input_shape = get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
width = input_shape[2]

output = input_tensor

# Add the segment embeddings
if use_token_type:
if token_type_ids is None:
raise ValueError("`token_type_ids` must be specified if"
"`use_token_type` is True.")
token_type_table = tf.get_variable(
name=token_type_embedding_name,
shape=[token_type_vocab_size, width],
initializer=create_initializer(initializer_range))
# This vocab will be small so we always do one-hot here, since it is always
# faster for a small vocabulary.

# [batch_size*seq_length]
flat_token_type_ids = tf.reshape(token_type_ids, [-1])
# [batch_size*seq_length,token_type_vocab_size]
one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
# [batch_size*seq_length,width]
token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
# [batch_size,seq_length,width]
token_type_embeddings = tf.reshape(token_type_embeddings,
[batch_size, seq_length, width])
output += token_type_embeddings

# Add the positional embeddings
if use_position_embeddings:
assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
with tf.control_dependencies([assert_op]):
full_position_embeddings = tf.get_variable(
name=position_embedding_name,
shape=[max_position_embeddings, width],
initializer=create_initializer(initializer_range))
# Since the position embedding table is a learned variable, we create it
# using a (long) sequence length `max_position_embeddings`. The actual
# sequence length might be shorter than this, for faster training of
# tasks that do not have long sequences.
#
# So `full_position_embeddings` is effectively an embedding table
# for position [0, 1, 2, ..., max_position_embeddings-1], and the current
# sequence has positions [0, 1, 2, ... seq_length-1], so we can just
# perform a slice.

# [seq_length,width]
position_embeddings = tf.slice(full_position_embeddings, [0, 0],
[seq_length, -1])
num_dims = len(output.shape.as_list())

# Only the last two dimensions are relevant (`seq_length` and `width`), so
# we broadcast among the first dimensions, which is typically just
# the batch size.
position_broadcast_shape = []
for _ in range(num_dims - 2):
position_broadcast_shape.append(1)
position_broadcast_shape.extend([seq_length, width])

# [1,seq_length,width]
position_embeddings = tf.reshape(position_embeddings,
position_broadcast_shape)
# [batch_size,seq_length,width]
output += position_embeddings

# Finally apply layer norm and dropout
output = layer_norm_and_dropout(output, dropout_prob)
return output

Note that in BERT, embedding_size equals hidden_size. Once we have the embeddings, we feed them into the Transformer; in BERT-Base the Transformer encoder is a stack of 12 self-attention layers. So first, let's look at the implementation of the self-attention layer:

def attention_layer(from_tensor,
to_tensor,
attention_mask=None,
num_attention_heads=1,
size_per_head=512,
query_act=None,
key_act=None,
value_act=None,
attention_probs_dropout_prob=0.0,
initializer_range=0.02,
do_return_2d_tensor=False,
batch_size=None,
from_seq_length=None,
to_seq_length=None):
"""Performs multi-headed attention from `from_tensor` to `to_tensor`.

Returns:
float Tensor of shape [batch_size, from_seq_length,
num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
true, this will be of shape [batch_size * from_seq_length,
num_attention_heads * size_per_head]).

Raises:
ValueError: Any of the arguments or tensor shapes are invalid.
"""

def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
seq_length, width):
output_tensor = tf.reshape(
input_tensor, [batch_size, seq_length, num_attention_heads, width])

# The transpose is needed because we compute Q·K^T below; with tf.einsum the transpose could be avoided.
output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
return output_tensor

from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

if len(from_shape) != len(to_shape):
raise ValueError(
"The rank of `from_tensor` must match the rank of `to_tensor`.")

if len(from_shape) == 3:
batch_size = from_shape[0]
from_seq_length = from_shape[1]
to_seq_length = to_shape[1]
elif len(from_shape) == 2:
if (batch_size is None or from_seq_length is None or to_seq_length is None):
raise ValueError(
"When passing in rank 2 tensors to attention_layer, the values "
"for `batch_size`, `from_seq_length`, and `to_seq_length` "
"must all be specified.")

# Scalar dimensions referenced here:
# B = batch size (number of sequences)
# F = `from_tensor` sequence length
# T = `to_tensor` sequence length
# N = `num_attention_heads`
# H = `size_per_head`

'''
Think of from_tensor as Q, and to_tensor as K and V.
'''

# [batch_size*from_seq_length,from_width]
from_tensor_2d = reshape_to_matrix(from_tensor)
# [batch_size*to_seq_length,to_width]
to_tensor_2d = reshape_to_matrix(to_tensor)

# `query_layer` = [B*F, N*H]
# [batch_size*from_seq_length, num_attention_heads*size_per_head], i.e. the linear projection from the paper
query_layer = tf.layers.dense(
from_tensor_2d,
num_attention_heads * size_per_head,
activation=query_act,
name="query",
kernel_initializer=create_initializer(initializer_range))

# `key_layer` = [B*T, N*H]
# [batch_size*to_seq_length, num_attention_heads*size_per_head], i.e. the linear projection from the paper
key_layer = tf.layers.dense(
to_tensor_2d,
num_attention_heads * size_per_head,
activation=key_act,
name="key",
kernel_initializer=create_initializer(initializer_range))

# `value_layer` = [B*T, N*H]
# [batch_size*to_seq_length, num_attention_heads*size_per_head], i.e. the linear projection from the paper
value_layer = tf.layers.dense(
to_tensor_2d,
num_attention_heads * size_per_head,
activation=value_act,
name="value",
kernel_initializer=create_initializer(initializer_range))

# `query_layer` = [B, N, F, H]
# [batch_size,num_attention_heads,from_seq_length,size_per_head]
query_layer = transpose_for_scores(query_layer, batch_size,
num_attention_heads, from_seq_length,
size_per_head)

# `key_layer` = [B, N, T, H]
# [batch_size,num_attention_heads,to_seq_length,size_per_head]
key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
to_seq_length, size_per_head)

# Take the dot product between "query" and "key" to get the raw
# attention scores.
# `attention_scores` = [B, N, F, T]
# [batch_size,num_attention_heads,from_seq_length,to_seq_length]
attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
# Scale by 1/sqrt(size_per_head)
attention_scores = tf.multiply(attention_scores,
1.0 / math.sqrt(float(size_per_head)))

# Mask the logits *before* the softmax so that padded positions do not receive attention.
if attention_mask is not None:
# `attention_mask` = [B, 1, F, T]
'''
In attention_mask, 1 marks real tokens and 0 marks padding.
'''


attention_mask = tf.expand_dims(attention_mask, axis=[1])

# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.

# Give padded positions a very large negative score, so their post-softmax weight is ~0.
attention_scores += adder

# Normalize the attention scores to probabilities.
# `attention_probs` = [B, N, F, T]

# [batch_size, num_attention_heads, from_seq_length, to_seq_length]
attention_probs = tf.nn.softmax(attention_scores)

# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

# `value_layer` = [B, T, N, H]
# [batch_size,to_seq_length,num_attention_heads,size_per_head]
value_layer = tf.reshape(
value_layer,
[batch_size, to_seq_length, num_attention_heads, size_per_head])

# `value_layer` = [B, N, T, H]
# [batch_size,num_attention_heads,to_seq_length,size_per_head]
value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

# `context_layer` = [B, N, F, H]
# [batch_size,num_attention_heads,from_seq_length,size_per_head]
context_layer = tf.matmul(attention_probs, value_layer)

# `context_layer` = [B, F, N, H]
# [batch_size, from_seq_length, num_attention_heads, size_per_head]
context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

if do_return_2d_tensor:
# `context_layer` = [B*F, N*H]
context_layer = tf.reshape(
context_layer,
[batch_size * from_seq_length, num_attention_heads * size_per_head])
else:
# `context_layer` = [B, F, N*H]
context_layer = tf.reshape(
context_layer,
[batch_size, from_seq_length, num_attention_heads * size_per_head])

return context_layer
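
Before moving on, here is a tiny NumPy check (my own toy numbers, not from the repo) of the additive mask used above: adding -10000 to the scores of padded positions drives their softmax weights to essentially zero.

import numpy as np

scores = np.array([2.0, 1.0, 0.5, 0.3])  # raw attention scores for one query
mask = np.array([1.0, 1.0, 0.0, 0.0])    # 1 = real token, 0 = padding
adder = (1.0 - mask) * -10000.0
masked_scores = scores + adder

probs = np.exp(masked_scores) / np.exp(masked_scores).sum()
print(probs.round(6))  # [0.731059 0.268941 0.       0.      ]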

This is a standard multi-head self-attention implementation, and there is a lot here worth borrowing: for example, the additive mask that pushes the post-softmax weights of padded positions arbitrarily close to 0, and the trick of reshaping the input to 2D right away when computing QK^T, i.e. [batch_size, seq_length, width] becomes [batch_size*seq_length, width] (and, after the Q/K/V projections, [batch_size*from_seq_length, num_attention_heads*size_per_head]), which saves reshapes and speeds up training. I hadn't thought about either before; you only pick these things up by reading open-source code. With the self-attention layer defined, let's look at the Transformer implementation:

def transformer_model(input_tensor,
attention_mask=None,
hidden_size=768,  # dimensionality of the final output
num_hidden_layers=12,  # number of Transformer (self-attention) layers
num_attention_heads=12,  # number of heads in each self-attention layer
intermediate_size=3072,  # inner (feed-forward/FFN) dimensionality
intermediate_act_fn=gelu,  # activation used in the FFN
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
initializer_range=0.02,
do_return_all_layers=False):  # whether to return the outputs of all layers
"""Multi-headed, multi-layer Transformer from "Attention is All You Need".

Returns:
float Tensor of shape [batch_size, seq_length, hidden_size], the final
hidden layer of the Transformer.

Raises:
ValueError: A Tensor shape or parameter is invalid.
"""
if hidden_size % num_attention_heads != 0:
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
"heads (%d)" % (hidden_size, num_attention_heads))

attention_head_size = int(hidden_size / num_attention_heads)
input_shape = get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
input_width = input_shape[2]

# The Transformer performs sum residuals on all layers so the input needs
# to be the same as the hidden size.
if input_width != hidden_size:
raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
(input_width, hidden_size))

# We keep the representation as a 2D tensor to avoid re-shaping it back and
# forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
# the GPU/CPU but may not be free on the TPU, so we want to minimize them to
# help the optimizer.

# [batch_size*seq_length,input_width]
# input_width=hidden_size
prev_output = reshape_to_matrix(input_tensor)

# list of length num_hidden_layers; each element has shape [batch_size*seq_length, hidden_size]
all_layer_outputs = []
for layer_idx in range(num_hidden_layers):
with tf.variable_scope("layer_%d" % layer_idx):
layer_input = prev_output

with tf.variable_scope("attention"):
attention_heads = []
with tf.variable_scope("self"):
# [batch_size*seq_length,num_attention_heads*size_per_head]
attention_head = attention_layer(
from_tensor=layer_input,
to_tensor=layer_input,
attention_mask=attention_mask,
num_attention_heads=num_attention_heads,
size_per_head=attention_head_size,
attention_probs_dropout_prob=attention_probs_dropout_prob,
initializer_range=initializer_range,
do_return_2d_tensor=True,
batch_size=batch_size,
from_seq_length=seq_length,
to_seq_length=seq_length)
attention_heads.append(attention_head)

attention_output = None
if len(attention_heads) == 1:
attention_output = attention_heads[0]
else:
# In the case where we have other sequences, we just concatenate
# them to the self-attention head before the projection.
attention_output = tf.concat(attention_heads, axis=-1)

# Run a linear projection of `hidden_size` then add a residual
# with `layer_input`.
with tf.variable_scope("output"):
# Project back; shape: [batch_size*seq_length, hidden_size]
attention_output = tf.layers.dense(
attention_output,
hidden_size,
kernel_initializer=create_initializer(initializer_range))
# dropout
attention_output = dropout(attention_output, hidden_dropout_prob)
# Residual connection + layer norm
attention_output = layer_norm(attention_output + layer_input)

# The activation is only applied to the "intermediate" hidden layer.
with tf.variable_scope("intermediate"):
# Apply the FFN with gelu; shape: [batch_size*seq_length, intermediate_size]
intermediate_output = tf.layers.dense(
attention_output,
intermediate_size,
activation=intermediate_act_fn,
kernel_initializer=create_initializer(initializer_range))

# Down-project back to `hidden_size` then add the residual.
with tf.variable_scope("output"):
# Pure linear projection (no activation); shape: [batch_size*seq_length, hidden_size]
layer_output = tf.layers.dense(
intermediate_output,
hidden_size,
kernel_initializer=create_initializer(initializer_range))
# dropout
layer_output = dropout(layer_output, hidden_dropout_prob)
# Residual connection + layer norm
layer_output = layer_norm(layer_output + attention_output)
prev_output = layer_output
# Store this layer's output
all_layer_outputs.append(layer_output)

if do_return_all_layers:
final_outputs = []
for layer_output in all_layer_outputs:
final_output = reshape_from_matrix(layer_output, input_shape)
final_outputs.append(final_output)
return final_outputs
else:
final_output = reshape_from_matrix(prev_output, input_shape)
return final_output

Now we wire the embeddings and the Transformer together to get the complete BERT model:

class BertModel(object):
"""BERT model ("Bidirectional Encoder Representations from Transformers").

Example usage:

```python
# Already been converted into WordPiece token ids
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)

model = modeling.BertModel(config=config, is_training=True,
input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)

label_embeddings = tf.get_variable(...)
pooled_output = model.get_pooled_output()
logits = tf.matmul(pooled_output, label_embeddings)
...
"""

def __init__(self,
config,
is_training,
input_ids,
input_mask=None,
token_type_ids=None,
use_one_hot_embeddings=False,
scope=None):
"""Constructor for BertModel.

Args:
config: `BertConfig` instance.
is_training: bool. true for training model, false for eval model. Controls
whether dropout will be applied.
input_ids: int32 Tensor of shape [batch_size, seq_length].
input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
embeddings or tf.embedding_lookup() for the word embeddings.
scope: (optional) variable scope. Defaults to "bert".

Raises:
ValueError: The config is invalid or one of the input tensor shapes
is invalid.
"""
config = copy.deepcopy(config)
if not is_training:
# Dropout is disabled when not training.
config.hidden_dropout_prob = 0.0
config.attention_probs_dropout_prob = 0.0

input_shape = get_shape_list(input_ids, expected_rank=2)
batch_size = input_shape[0]
seq_length = input_shape[1]

if input_mask is None:
input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

if token_type_ids is None:
token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

with tf.variable_scope(scope, default_name="bert"):
with tf.variable_scope("embeddings"):
# Perform embedding lookup on the word ids.
# Get the token embeddings and the embedding table
(self.embedding_output, self.embedding_table) = embedding_lookup(
input_ids=input_ids,
vocab_size=config.vocab_size,
embedding_size=config.hidden_size,
initializer_range=config.initializer_range,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=use_one_hot_embeddings)

# Add positional embeddings and token type embeddings, then layer
# normalize and perform dropout.
self.embedding_output = embedding_postprocessor(
input_tensor=self.embedding_output,
use_token_type=True,
token_type_ids=token_type_ids,
token_type_vocab_size=config.type_vocab_size,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=config.initializer_range,
max_position_embeddings=config.max_position_embeddings,
dropout_prob=config.hidden_dropout_prob)

with tf.variable_scope("encoder"):
# This converts a 2D mask of shape [batch_size, seq_length] to a 3D
# mask of shape [batch_size, seq_length, seq_length] which is used
# for the attention scores.
attention_mask = create_attention_mask_from_input_mask(
input_ids, input_mask)

# Run the stacked transformer.
# `sequence_output` shape = [batch_size, seq_length, hidden_size].

# If do_return_all_layers=True, a list of num_hidden_layers tensors is returned,
# each of shape [batch_size, seq_length, hidden_size];

# if do_return_all_layers=False, only the top layer's output is returned,
# with shape [batch_size, seq_length, hidden_size].
self.all_encoder_layers = transformer_model(
input_tensor=self.embedding_output,
attention_mask=attention_mask,
hidden_size=config.hidden_size,
num_hidden_layers=config.num_hidden_layers,
num_attention_heads=config.num_attention_heads,
intermediate_size=config.intermediate_size,
intermediate_act_fn=get_activation(config.hidden_act),
hidden_dropout_prob=config.hidden_dropout_prob,
attention_probs_dropout_prob=config.attention_probs_dropout_prob,
initializer_range=config.initializer_range,
do_return_all_layers=True)

# Take the top layer's output; shape: [batch_size, seq_length, hidden_size]
self.sequence_output = self.all_encoder_layers[-1]
# The "pooler" converts the encoded sequence tensor of shape
# [batch_size, seq_length, hidden_size] to a tensor of shape
# [batch_size, hidden_size]. This is necessary for segment-level
# (or segment-pair-level) classification tasks where we need a fixed
# dimensional representation of the segment.
with tf.variable_scope("pooler"):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token. We assume that this has been pre-trained

# [batch_size, hidden_size]: the "pooling" simply takes the hidden state of the first token.
# Why the first token? Because in BERT the first token is [CLS],
# the one used for classification. This matters!

first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
# [batch_size,hidden_size]
self.pooled_output = tf.layers.dense(
first_token_tensor,
config.hidden_size,
activation=tf.tanh,
kernel_initializer=create_initializer(config.initializer_range))

def get_pooled_output(self):
return self.pooled_output

def get_sequence_output(self):
"""Gets final hidden layer of encoder.

Returns:
float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
to the final hidden of the transformer encoder.
"""
return self.sequence_output

def get_all_encoder_layers(self):
return self.all_encoder_layers

def get_embedding_output(self):
"""Gets output of the embedding lookup (i.e., input to the transformer).

Returns:
float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
to the output of the embedding layer, after summing the word
embeddings with the positional embeddings and the token type embeddings,
then performing layer normalization. This is the input to the transformer.
"""
return self.embedding_output

def get_embedding_table(self):
return self.embedding_table

The tensor of the top-layer [CLS] token (the pooled output) is used to train the NSP task; if the downstream task is classification, this is also what we put a softmax (or some other head) on top of. The output of the top self-attention layer (the sequence output), on the other hand, is used to train the MLM task.
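
A minimal usage sketch (toy shapes and a hypothetical 3-class head; TF 1.x assumed): the pooled output feeds sentence-level heads, the sequence output feeds token-level heads.

import tensorflow as tf
import modeling

input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
config = modeling.BertConfig(vocab_size=32000, hidden_size=128, num_hidden_layers=2,
                             num_attention_heads=2, intermediate_size=512, type_vocab_size=2)
model = modeling.BertModel(config=config, is_training=False,
                           input_ids=input_ids, input_mask=input_mask)

pooled = model.get_pooled_output()      # [batch_size, hidden_size] -> NSP / sentence classification
sequence = model.get_sequence_output()  # [batch_size, seq_length, hidden_size] -> MLM / tagging
logits = tf.layers.dense(pooled, units=3)  # e.g. a 3-class classification head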

tokenization.py

tokenization.py performs tokenization, accent stripping, lower-casing, whitespace cleanup and so on, and converts the raw text into ids. The part worth singling out is tokenization: BERT uses two tokenizers, first a coarse-grained basic tokenizer, and then WordPiece tokenization on top of its output to get a finer-grained segmentation. The code is as follows.

First, the coarse-grained tokenizer:

class BasicTokenizer(object):
"""Runs basic tokenization (punctuation splitting标点符号拆分, lower casing, etc.)."""

def __init__(self, do_lower_case=True):
"""Constructs a BasicTokenizer.

Args:
do_lower_case: Whether to lower case the input.
"""
self.do_lower_case = do_lower_case

def tokenize(self, text):
"""Tokenizes a piece of text."""
text = convert_to_unicode(text)
text = self._clean_text(text)

# This was added on November 1st, 2018 for the multilingual and Chinese
# models. This is also applied to the English models now, but it doesn't
# matter since the English models were not trained on any Chinese data
# and generally don't have any Chinese data in them (there are Chinese
# characters in the vocabulary because Wikipedia does have some Chinese
# words in the English Wikipedia.).

# Add Chinese character support
text = self._tokenize_chinese_chars(text)

orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if self.do_lower_case:
token = token.lower()
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token))

output_tokens = whitespace_tokenize(" ".join(split_tokens))
# Returns a list of tokens
return output_tokens

Then the WordPiece tokenizer:

'''
WordpieceTokenizer further splits the output of BasicTokenizer into finer-grained pieces.
The point of this step is to reduce the impact of out-of-vocabulary words on the model.
It has no effect on Chinese, because BasicTokenizer has already split
Chinese text into individual characters.
'''

class WordpieceTokenizer(object):
"""Runs WordPiece tokenziation."""

def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word  # maximum number of characters per token

def tokenize(self, text):
# Greedy longest-match-first (maximum forward matching) algorithm.

'''
An example of how the code runs: suppose the input is "unaffable".
In the while loop we start with start=0, end=len(chars)=9, i.e. first check whether
"unaffable" is in the vocabulary; if it is, it becomes a single WordPiece.
If not, end -= 1 and we check "unaffabl", and so on, until we find that "un" is in
the vocabulary and add it to the result.

Then start=2: check "affable", then "affabl", ..., until "##aff" is found in the vocabulary.
Note: "##" marks a piece that continues the previous one, which makes the WordPiece
segmentation reversible -- we can recover the "real" word.
'''


"""Tokenizes a piece of text into its word pieces.

This uses a greedy longest-match-first algorithm to perform tokenization
using the given vocabulary.

For example:
input = "unaffable"
output = ["un", "##aff", "##able"]

Args:
text: A single token or whitespace separated tokens. This should have
already been passed through `BasicTokenizer.

Returns:
A list of wordpiece tokens.
"""

text = convert_to_unicode(text)

output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
# Tokens longer than max_input_chars_per_word are mapped to UNK
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue

is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = "".join(chars[start:end])
if start > 0:
substr = "##" + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end

if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens

Combining the two gives the final fine-grained tokenization:

class FullTokenizer(object):
"""Runs end-to-end tokenziation."""

def __init__(self, vocab_file, do_lower_case=True):
self.vocab = load_vocab(vocab_file)
# v is the index, k is the token itself
self.inv_vocab = {v: k for k, v in self.vocab.items()}
self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

def tokenize(self, text):
split_tokens = []
# Coarse-grained tokenization with BasicTokenizer
for token in self.basic_tokenizer.tokenize(text):
# Fine-grained WordPiece tokenization of each token
for sub_token in self.wordpiece_tokenizer.tokenize(token):
split_tokens.append(sub_token)

return split_tokens

def convert_tokens_to_ids(self, tokens):
return convert_by_vocab(self.vocab, tokens)

def convert_ids_to_tokens(self, ids):
return convert_by_vocab(self.inv_vocab, ids)
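
A quick usage sketch (the vocab path is a placeholder, and the exact pieces depend on vocab.txt):

import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

tokens = tokenizer.tokenize("He was unaffable.")
print(tokens)                                   # e.g. ['he', 'was', 'un', '##aff', '##able', '.']
print(tokenizer.convert_tokens_to_ids(tokens))  # ids looked up from vocab.txt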

create_pretraining_data.py

This file builds the training instances on top of tokenization.py. Let's see how BERT implements the random masking operation:

def create_masked_lm_predictions(tokens, masked_lm_prob,
max_predictions_per_seq, vocab_words, rng):
"""Creates the predictions for the masked LM objective."""

cand_indexes = []
for (i, token) in enumerate(tokens):
# [CLS] and [SEP] must never be masked
if token == "[CLS]" or token == "[SEP]":
continue
# Whole Word Masking means that if we mask all of the wordpieces
# corresponding to an original word. When a word has been split into
# WordPieces, the first token does not have any marker and any subsequence
# tokens are prefixed with ##. So whenever we see the ## token, we
# append it to the previous set of word indexes.
#
# Note that Whole Word Masking does *not* change the training code
# at all -- we still predict each WordPiece independently, softmaxed
# over the entire vocabulary.
#-----------------------------------------whole word masking code-------------------------------------#
if (FLAGS.do_whole_word_mask and len(cand_indexes) >= 1 and
token.startswith("##")):
cand_indexes[-1].append(i)
else:
cand_indexes.append([i])
#-----------------------------------------whole word masking code-------------------------------------#
rng.shuffle(cand_indexes)

output_tokens = list(tokens)

num_to_predict = min(max_predictions_per_seq,
max(1, int(round(len(tokens) * masked_lm_prob))))

masked_lms = []
covered_indexes = set()
#-----------------------------------------whole word masking code-------------------------------------#
for index_set in cand_indexes:
if len(masked_lms) >= num_to_predict:
break
# If adding a whole-word mask would exceed the maximum number of
# predictions, then just skip this candidate.
if len(masked_lms) + len(index_set) > num_to_predict:
continue
#-----------------------------------------whole word masking code-------------------------------------#
is_any_index_covered = False
for index in index_set:
if index in covered_indexes:
is_any_index_covered = True
break
if is_any_index_covered:
continue
for index in index_set:
covered_indexes.add(index)

masked_token = None
# 80% of the time, replace with [MASK]
if rng.random() < 0.8:
masked_token = "[MASK]"
else:
# Note: this is the first 50% of the remaining 20%, i.e. 10% overall.
# 10% of the time, keep original
if rng.random() < 0.5:
masked_token = tokens[index]
# 10% of the time, replace with random word
# Note: this is the second 50% of the remaining 20%, i.e. 10% overall.
else:
masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

output_tokens[index] = masked_token

masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
assert len(masked_lms) <= num_to_predict
masked_lms = sorted(masked_lms, key=lambda x: x.index)

masked_lm_positions = []
masked_lm_labels = []
for p in masked_lms:
masked_lm_positions.append(p.index)
masked_lm_labels.append(p.label)

return (output_tokens, masked_lm_positions, masked_lm_labels)
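
To make the 80%/10%/10% rule concrete, here is a toy simulation of the nested rng.random() decisions above (my own sketch, not part of the repo):

import random
from collections import Counter

rng = random.Random(12345)
counts = Counter()
for _ in range(100000):
    if rng.random() < 0.8:
        counts["[MASK]"] += 1          # 80%: replace with [MASK]
    elif rng.random() < 0.5:
        counts["keep original"] += 1   # 10%: keep the original token
    else:
        counts["random word"] += 1     # 10%: replace with a random vocabulary word
print(counts)  # roughly 80k / 10k / 10k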

We can then run a command like this:

python create_pretraining_data.py \
--input_file=/Users/codewithzichao/Desktop/programs/bert/sample_text.txt \
--output_file=/Users/codewithzichao/Desktop/programs/bert/tf_examples.tfrecord \
--vocab_file=/Users/codewithzichao/Downloads/uncased_L-24_H-1024_A-16/vocab.txt \
--do_lower_case=True \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
--dupe_factor=5

which produces output like the following:

I0704 18:10:11.231426 4486237632 create_pretraining_data.py:160] *** Example ***
INFO:tensorflow:tokens: [CLS] and there burst on phil ##am ##mon ' s astonished eyes a vast semi ##ci ##rcle of blue sea [MASK] ring ##ed with palaces and towers [MASK] [SEP] like most of [MASK] fellow gold - seekers , cass was super ##sti [MASK] . [SEP]
I0704 18:10:11.231508 4486237632 create_pretraining_data.py:162] tokens: [CLS] and there burst on phil ##am ##mon ' s astonished eyes a vast semi ##ci ##rcle of blue sea [MASK] ring ##ed with palaces and towers [MASK] [SEP] like most of [MASK] fellow gold - seekers , cass was super ##sti [MASK] . [SEP]
INFO:tensorflow:input_ids: 101 1998 2045 6532 2006 6316 3286 8202 1005 1055 22741 2159 1037 6565 4100 6895 21769 1997 2630 2712 103 3614 2098 2007 22763 1998 7626 103 102 2066 2087 1997 103 3507 2751 1011 24071 1010 16220 2001 3565 16643 103 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I0704 18:10:11.231609 4486237632 create_pretraining_data.py:172] input_ids: 101 1998 2045 6532 2006 6316 3286 8202 1005 1055 22741 2159 1037 6565 4100 6895 21769 1997 2630 2712 103 3614 2098 2007 22763 1998 7626 103 102 2066 2087 1997 103 3507 2751 1011 24071 1010 16220 2001 3565 16643 103 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I0704 18:10:11.231703 4486237632 create_pretraining_data.py:172] input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I0704 18:10:11.231793 4486237632 create_pretraining_data.py:172] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_positions: 10 20 23 27 32 39 42 0 0 0 0 0 0 0 0 0 0 0 0 0
I0704 18:10:11.231858 4486237632 create_pretraining_data.py:172] masked_lm_positions: 10 20 23 27 32 39 42 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_ids: 22741 1010 2007 1012 2010 2001 20771 0 0 0 0 0 0 0 0 0 0 0 0 0
I0704 18:10:11.231920 4486237632 create_pretraining_data.py:172] masked_lm_ids: 22741 1010 2007 1012 2010 2001 20771 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I0704 18:10:11.231988 4486237632 create_pretraining_data.py:172] masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
INFO:tensorflow:next_sentence_labels: 1
I0704 18:10:11.232054 4486237632 create_pretraining_data.py:172] next_sentence_labels: 1
INFO:tensorflow:Wrote 60 total instances
I0704 18:10:11.242784 4486237632 create_pretraining_data.py:177] Wrote 60 total instances

As we can see, the output contains:

  • input_ids: the padded token ids;
  • input_mask: the mask over input_ids (1 for real tokens, 0 for padding);
  • segment_ids: 0 marks the first sentence, 1 the second; the trailing 0s are padding;
  • masked_lm_positions: the positions of the randomly masked tokens within the instance;
  • masked_lm_ids: the vocabulary ids of the randomly masked tokens;
  • masked_lm_weights: weights for the prediction slots, 1.0 when a slot holds a real masked token and 0.0 when it is only padding.

The input text also has a required format: one sentence per line, with a blank line between different documents.
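
For illustration (a made-up snippet, not taken from sample_text.txt), an input file looks like this:

This is the first sentence of document one.
Here is its second sentence.

Document two starts after the blank line.
It also has one sentence per line.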

run_pretraining.py

This file pre-trains the BERT model. Normally we never need to touch it and simply load the released pre-trained weights (pre-training from scratch is basically out of reach for an individual, it's far too expensive; Google trained on multiple TPUs for days...). Still, let's take a quick look at how it is structured.

Since BERT is pre-trained with the MLM and NSP objectives, let's look at these two tasks first.

# The MLM head (loss and log-probs)
def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
label_ids, label_weights):
"""Get loss and log probs for the masked LM."""
'''
input_tensor: [batch_size, seq_length, hidden_size], the output of the last Transformer layer
output_weights: [vocab_size, embedding_size], the word embedding table (the output weights are tied to it)
'''

# Gather the encodings of the masked positions; the MLM loss is computed only at those positions.
input_tensor = gather_indexes(input_tensor, positions)
# Shape: [batch_size*max_predictions_per_seq, hidden_size]

with tf.variable_scope("cls/predictions"):
# We apply one more non-linear transformation before the output layer.
# This matrix is not used after pre-training.
with tf.variable_scope("transform"):
# One extra non-linear transform before the output layer; only used during pre-training.
input_tensor = tf.layers.dense(
input_tensor,
units=bert_config.hidden_size,
activation=modeling.get_activation(bert_config.hidden_act),
kernel_initializer=modeling.create_initializer(
bert_config.initializer_range))
input_tensor = modeling.layer_norm(input_tensor)

# The output weights are the same as the input embeddings, but there is
# an output-only bias for each token.
output_bias = tf.get_variable(
"output_bias",
shape=[bert_config.vocab_size],
initializer=tf.zeros_initializer())
# Logits for this batch
logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
# [batch_size*max_predictions_per_seq, vocab_size]
log_probs = tf.nn.log_softmax(logits, axis=-1)

# label_ids are the vocabulary ids of the masked tokens
# [batch_size*max_predictions_per_seq]
label_ids = tf.reshape(label_ids, [-1])
label_weights = tf.reshape(label_weights, [-1])

# [batch_size*max_predictions_per_seq, vocab_size]
one_hot_labels = tf.one_hot(
label_ids, depth=bert_config.vocab_size, dtype=tf.float32)

# The `positions` tensor might be zero-padded (if the sequence is too
# short to have the maximum number of predictions). The `label_weights`
# tensor has a value of 1.0 for every real prediction and 0.0 for the
# padding predictions.

# [batch_size*max_predictions_per_seq]: negative log-likelihood at each masked position
per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
# scalar: summed loss of the batch, with padded prediction slots zeroed out by label_weights
numerator = tf.reduce_sum(label_weights * per_example_loss)
# scalar
denominator = tf.reduce_sum(label_weights) + 1e-5
loss = numerator / denominator  # average loss

return (loss, per_example_loss, log_probs)

# The NSP head (loss and log-probs)
def get_next_sentence_output(bert_config, input_tensor, labels):
"""Get loss and log probs for the next sentence prediction."""

'''
input_tensor: [batch_size, hidden_size], the pooled [CLS] output
'''

# Simple binary classification. Note that 0 is "next sentence" and 1 is
# "random sentence". This weight matrix is not used after pre-training.
with tf.variable_scope("cls/seq_relationship"):
# [2,hidden_size]
output_weights = tf.get_variable(
"output_weights",
shape=[2, bert_config.hidden_size],
initializer=modeling.create_initializer(bert_config.initializer_range))
output_bias = tf.get_variable(
"output_bias", shape=[2], initializer=tf.zeros_initializer())

# [batch_size,2]
logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
log_probs = tf.nn.log_softmax(logits, axis=-1)

labels = tf.reshape(labels, [-1])
# [batch_size,2]
one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32)
# minimize (-log MLE)
# [batch_size]
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
# scalar
loss = tf.reduce_mean(per_example_loss)
return (loss, per_example_loss, log_probs)


def gather_indexes(sequence_tensor, positions):
"""Gathers the vectors at the specific positions over a minibatch."""
sequence_shape = modeling.get_shape_list(sequence_tensor, expected_rank=3)
batch_size = sequence_shape[0]
seq_length = sequence_shape[1]
width = sequence_shape[2]

flat_offsets = tf.reshape(
tf.range(0, batch_size, dtype=tf.int32) * seq_length, [-1, 1])
flat_positions = tf.reshape(positions + flat_offsets, [-1])
flat_sequence_tensor = tf.reshape(sequence_tensor,
[batch_size * seq_length, width])
output_tensor = tf.gather(flat_sequence_tensor, flat_positions)
return output_tensor

For the MLM task, the input is the output of the top self-attention layer, of shape [batch_size, seq_length, hidden_size]. Note that the MLM loss is computed only over the randomly masked positions (one reason BERT needs so much training data and converges slowly); what we actually score is a tensor of shape [batch_size*max_predictions_per_seq, hidden_size], where max_predictions_per_seq is the maximum number of masked tokens per sequence. The MLM loss itself is a negative log-likelihood (we minimize -log likelihood), averaged over the real masked positions.
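
A tiny numeric sketch (toy numbers) of how label_weights keeps the padded prediction slots out of that average, mirroring the numerator/denominator computation above:

import numpy as np

per_example_loss = np.array([2.3, 1.7, 0.9, 0.0, 0.0])  # one loss value per prediction slot
label_weights = np.array([1.0, 1.0, 1.0, 0.0, 0.0])     # 1 = real masked token, 0 = padded slot

numerator = np.sum(label_weights * per_example_loss)
denominator = np.sum(label_weights) + 1e-5
print(numerator / denominator)  # ~1.633 = (2.3 + 1.7 + 0.9) / 3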

For the NSP task, the input is the pooled tensor of the top-layer [CLS] token, of shape [batch_size, hidden_size], followed by a two-way softmax. Like the MLM loss, the NSP loss is a negative log-likelihood.

With the two objectives defined, we define the model function that ties everything together for training:

# Build the model_fn for pre-training
def model_fn_builder(bert_config, init_checkpoint, learning_rate,
num_train_steps, num_warmup_steps, use_tpu,
use_one_hot_embeddings):
"""Returns `model_fn` closure for TPUEstimator."""

def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
"""The `model_fn` for TPUEstimator."""

tf.logging.info("*** Features ***")
for name in sorted(features.keys()):
tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))

# Inputs
input_ids = features["input_ids"]  # padded token ids
input_mask = features["input_mask"]  # mask over the padded tokens (1 = real, 0 = padding)
segment_ids = features["segment_ids"]  # segment ids distinguishing the two sentences
masked_lm_positions = features["masked_lm_positions"]  # positions of the randomly masked tokens
masked_lm_ids = features["masked_lm_ids"]  # vocabulary ids of the randomly masked tokens
masked_lm_weights = features["masked_lm_weights"]  # 1.0 for real masked tokens, 0.0 for padded prediction slots
next_sentence_labels = features["next_sentence_labels"]  # 0 = actual next sentence, 1 = random sentence

is_training = (mode == tf.estimator.ModeKeys.TRAIN)

# Instantiate the BERT model
model = modeling.BertModel(
config=bert_config,
is_training=is_training,
input_ids=input_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
use_one_hot_embeddings=use_one_hot_embeddings)

# MLM task: mean loss (scalar), per-prediction loss ([batch_size*max_predictions_per_seq])
# and log-probabilities ([batch_size*max_predictions_per_seq, vocab_size])
(masked_lm_loss,
masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(
bert_config, model.get_sequence_output(), model.get_embedding_table(),
masked_lm_positions, masked_lm_ids, masked_lm_weights)

# NSP task: mean loss (scalar), per-example loss ([batch_size]) and log-probabilities ([batch_size, 2])
(next_sentence_loss, next_sentence_example_loss,
next_sentence_log_probs) = get_next_sentence_output(
bert_config, model.get_pooled_output(), next_sentence_labels)

# The total loss is the sum of the two
total_loss = masked_lm_loss + next_sentence_loss

# All trainable variables of the model
tvars = tf.trainable_variables()

initialized_variable_names = {}
scaffold_fn = None
# Restore from an existing checkpoint if one is given
if init_checkpoint:
(assignment_map, initialized_variable_names
) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
if use_tpu:

def tpu_scaffold():
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
return tf.train.Scaffold()

scaffold_fn = tpu_scaffold
else:
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

tf.logging.info("**** Trainable Variables ****")
for var in tvars:
init_string = ""
if var.name in initialized_variable_names:
init_string = ", *INIT_FROM_CKPT*"
tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
init_string)

output_spec = None
if mode == tf.estimator.ModeKeys.TRAIN:
train_op = optimization.create_optimizer(
total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)

output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
train_op=train_op,
scaffold_fn=scaffold_fn)
elif mode == tf.estimator.ModeKeys.EVAL:

def metric_fn(masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
masked_lm_weights, next_sentence_example_loss,
next_sentence_log_probs, next_sentence_labels):
"""Computes the loss and accuracy of the model."""
# Compute loss and accuracy
masked_lm_log_probs = tf.reshape(masked_lm_log_probs,
[-1, masked_lm_log_probs.shape[-1]])
# [batch_size*max_predictions_per_seq]
masked_lm_predictions = tf.argmax(
masked_lm_log_probs, axis=-1, output_type=tf.int32)
masked_lm_example_loss = tf.reshape(masked_lm_example_loss, [-1])
masked_lm_ids = tf.reshape(masked_lm_ids, [-1])
masked_lm_weights = tf.reshape(masked_lm_weights, [-1])
# Accuracy over the masked positions
masked_lm_accuracy = tf.metrics.accuracy(
labels=masked_lm_ids,
predictions=masked_lm_predictions,
weights=masked_lm_weights)
masked_lm_mean_loss = tf.metrics.mean(
values=masked_lm_example_loss, weights=masked_lm_weights)

next_sentence_log_probs = tf.reshape(
next_sentence_log_probs, [-1, next_sentence_log_probs.shape[-1]])
next_sentence_predictions = tf.argmax(
next_sentence_log_probs, axis=-1, output_type=tf.int32)
next_sentence_labels = tf.reshape(next_sentence_labels, [-1])
# Accuracy on the NSP task
next_sentence_accuracy = tf.metrics.accuracy(
labels=next_sentence_labels, predictions=next_sentence_predictions)
next_sentence_mean_loss = tf.metrics.mean(
values=next_sentence_example_loss)

return {
"masked_lm_accuracy": masked_lm_accuracy,
"masked_lm_loss": masked_lm_mean_loss,
"next_sentence_accuracy": next_sentence_accuracy,
"next_sentence_loss": next_sentence_mean_loss,
}

eval_metrics = (metric_fn, [
masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
masked_lm_weights, next_sentence_example_loss,
next_sentence_log_probs, next_sentence_labels
])
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
eval_metrics=eval_metrics,
scaffold_fn=scaffold_fn)
else:
raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode))

return output_spec

return model_fn

run_classifier.py

If we want to use BERT on our own task, run_classifier.py is the file we need to modify. The relevant code is as follows.

First comes data handling. In Google's implementation the input pipeline is built with tf.data (from_tensor_slices, or TFRecords in the file-based input function), so the dataset first has to be converted into the expected example format:

class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""

def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()

def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()

def get_test_examples(self, data_dir):
"""Gets a collection of `InputExample`s for prediction."""
raise NotImplementedError()

def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()

We only need to subclass this class and override these methods; a minimal sketch follows below. With the data side handled, let's see how the downstream task is attached, taking classification as the example (create_model, shown after the sketch):
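
Here is a hypothetical processor for a binary task whose data sits in TSV files with label<TAB>text per line (the file names, label set and import path are my own assumptions, not part of the repo):

import os
import tokenization
from run_classifier import DataProcessor, InputExample

class MyBinaryProcessor(DataProcessor):
    """Reads train/dev/test TSV files of the form: label<TAB>text."""

    def get_train_examples(self, data_dir):
        return self._create_examples(os.path.join(data_dir, "train.tsv"), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(os.path.join(data_dir, "dev.tsv"), "dev")

    def get_test_examples(self, data_dir):
        return self._create_examples(os.path.join(data_dir, "test.tsv"), "test")

    def get_labels(self):
        return ["0", "1"]

    def _create_examples(self, path, set_type):
        examples = []
        with open(path, "r", encoding="utf-8") as f:
            for i, line in enumerate(f):
                label, text = line.rstrip("\n").split("\t")
                examples.append(InputExample(
                    guid="%s-%d" % (set_type, i),
                    text_a=tokenization.convert_to_unicode(text),
                    text_b=None,
                    label=label))
        return examples

The new processor then has to be registered under a task name in the processors dict inside main() so that --task_name can select it.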

def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
labels, num_labels, use_one_hot_embeddings):
"""Creates a classification model."""
model = modeling.BertModel(
config=bert_config,
is_training=is_training,
input_ids=input_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
use_one_hot_embeddings=use_one_hot_embeddings)

# In the demo, we are doing a simple classification task on the entire
# segment.
#
# If you want to use the token-level output, use model.get_sequence_output()
# instead.
# For classification we use the pooled [CLS] vector from the last layer; shape: [batch_size, hidden_size]
output_layer = model.get_pooled_output()

# Get hidden_size
hidden_size = output_layer.shape[-1].value

output_weights = tf.get_variable(
"output_weights", [num_labels, hidden_size],
initializer=tf.truncated_normal_initializer(stddev=0.02))

output_bias = tf.get_variable(
"output_bias", [num_labels], initializer=tf.zeros_initializer())

with tf.variable_scope("loss"):
if is_training:
# Apply dropout only during training
# I.e., 0.1 dropout
output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

# Linear projection; shape: [batch_size, num_labels]
logits = tf.matmul(output_layer, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
# softmax
probabilities = tf.nn.softmax(logits, axis=-1)
# log_softmax,[batch_size,num_labels]
log_probs = tf.nn.log_softmax(logits, axis=-1)

# [batch_size,num_labels]
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

# minimize (-log MLE)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_mean(per_example_loss)

return (loss, per_example_loss, logits, probabilities)

With the downstream head attached and the loss of the whole model defined, fine-tuning is driven by the model_fn below:


def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
num_train_steps, num_warmup_steps, use_tpu,
use_one_hot_embeddings):
"""Returns `model_fn` closure for TPUEstimator."""

def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
"""The `model_fn` for TPUEstimator."""

tf.logging.info("*** Features ***")
for name in sorted(features.keys()):
tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))

input_ids = features["input_ids"]
input_mask = features["input_mask"]
segment_ids = features["segment_ids"]
label_ids = features["label_ids"]
is_real_example = None
if "is_real_example" in features:
is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
else:
is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)

is_training = (mode == tf.estimator.ModeKeys.TRAIN)

# Build the model and get the total loss, per-example loss, logits and predicted probabilities
(total_loss, per_example_loss, logits, probabilities) = create_model(
bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
num_labels, use_one_hot_embeddings)

tvars = tf.trainable_variables()
initialized_variable_names = {}
scaffold_fn = None
if init_checkpoint:
(assignment_map, initialized_variable_names
) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
if use_tpu:

def tpu_scaffold():
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
return tf.train.Scaffold()

scaffold_fn = tpu_scaffold
else:
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

tf.logging.info("**** Trainable Variables ****")
for var in tvars:
init_string = ""
if var.name in initialized_variable_names:
init_string = ", *INIT_FROM_CKPT*"
tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
init_string)

output_spec = None
# Training mode
if mode == tf.estimator.ModeKeys.TRAIN:

train_op = optimization.create_optimizer(
total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)

output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
train_op=train_op,
scaffold_fn=scaffold_fn)
# Evaluation mode
elif mode == tf.estimator.ModeKeys.EVAL:

def metric_fn(per_example_loss, label_ids, logits, is_real_example):
predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
accuracy = tf.metrics.accuracy(
labels=label_ids, predictions=predictions, weights=is_real_example)
loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example)
return {
"eval_accuracy": accuracy,
"eval_loss": loss,
}

eval_metrics = (metric_fn,
[per_example_loss, label_ids, logits, is_real_example])
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
loss=total_loss,
eval_metrics=eval_metrics,
scaffold_fn=scaffold_fn)
# Prediction mode
else:
output_spec = tf.contrib.tpu.TPUEstimatorSpec(
mode=mode,
predictions={"probabilities": probabilities},
scaffold_fn=scaffold_fn)
return output_spec

return model_fn

After that, running the main entry point completes the whole fine-tuning run; a typical invocation looks like the command below.
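
For reference, a typical fine-tuning command (MRPC, following the official README; all paths are placeholders):

export BERT_BASE_DIR=/path/to/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue_data

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/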

That wraps up the BERT codebase. My overall impression: a very pleasant read. The quality of Google's code really is high, and even though TF 2.x is what I'm most comfortable with, the TF 1.x code posed no real obstacle. Next I'd like to try BERT on my own tasks and see how it performs. Beyond that, unless it proves necessary, I won't be doing source-code walkthroughs of the various BERT variants: they mostly reuse the BERT implementation and only tweak the pre-training tasks or parts of the model body, so the differences are minor.

over~☕️
