This article explains how to resolve the PyTorch error "RuntimeError: Input, output and indices must be on the current device".
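Problem description
Yesterday I followed a blog post, "BERT 的 PyTorch 实现" (a PyTorch implementation of BERT), and wrote the BERT code from scratch. The original code ran on the CPU, so I wanted to move the model and the data to the GPU to make it faster. However, even after moving both the input data and the model to cuda, the following error was still raised: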
"RuntimeError: Input, output and indices must be on the current device"
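Cause and solution
I checked many times by printing: the input variables and the model parameters were all on cuda:0.
After some searching, I found that either of the following two things can cause this problem.
- A new tensor is created inside the model and is never moved with to(device).
- A new network layer/module is created inside the model's forward method.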
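The kind of device check described above can be sketched as follows. This is a minimal illustration, not the author's actual debugging code: the helper name `devices_in_use` and the toy model are hypothetical. Note that a check like this only sees registered parameters and buffers. Tensors and layers created inside forward never show up in it, which may be why everything appeared to be on cuda:0 while the error persisted.

```python
import torch
import torch.nn as nn


def devices_in_use(module: nn.Module) -> dict:
    """Map each registered parameter/buffer name to the device it lives on."""
    report = {}
    for name, param in module.named_parameters():
        report[name] = str(param.device)
    for name, buf in module.named_buffers():
        report[name] = str(buf.device)
    return report


# Hypothetical toy model, used only to exercise the helper.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
toy = nn.Sequential(nn.Embedding(10, 4), nn.LayerNorm(4)).to(device)

report = devices_in_use(toy)
for name, dev in report.items():
    print(f"{name}: {dev}")
```

If every entry reports the expected device and the error still occurs, the mismatch most likely comes from something created at call time inside forward.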
For the first issue, one of the internal modules originally looked like this:
    class Embedding(nn.Module):
        def __init__(self):
            super(Embedding, self).__init__()
            self.tok_embed = nn.Embedding(vocab_size, d_model)  # token embedding
            self.pos_embed = nn.Embedding(maxlen, d_model)  # position embedding
            self.seg_embed = nn.Embedding(n_segments, d_model)  # segment (token type) embedding
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x, seg):
            seq_len = x.size(1)
            pos = torch.arange(seq_len, dtype=torch.long)
            pos = pos.unsqueeze(0).expand_as(x)  # [seq_len] -> [batch_size, seq_len]
            embedding = self.tok_embed(x) + self.pos_embed(pos) + self.seg_embed(seg)
            return self.norm(embedding)
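Note that in the forward method here, the line
pos = torch.arange(seq_len, dtype=torch.long)
creates a new tensor with torch.arange but never calls to(device) on it. torch.arange allocates on the CPU by default, so pos ends up on the CPU while the embedding weights are on the GPU, and the embedding lookup fails with the error above.
The fix is as follows: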
    class Embedding(nn.Module):
        def __init__(self):
            super(Embedding, self).__init__()
            self.tok_embed = nn.Embedding(vocab_size, d_model).to(device)  # token embedding
            self.pos_embed = nn.Embedding(maxlen, d_model).to(device)  # position embedding
            self.seg_embed = nn.Embedding(n_segments, d_model).to(device)  # segment (token type) embedding
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x, seg):
            seq_len = x.size(1)
            pos = torch.arange(seq_len, dtype=torch.long)
            pos = pos.to(device)
            pos = pos.unsqueeze(0).expand_as(x)  # [seq_len] -> [batch_size, seq_len]
            embedding = self.tok_embed(x) + self.pos_embed(pos) + self.seg_embed(seg)
            return self.norm(embedding)
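As an alternative to relying on a global `device` variable, the position tensor can be allocated directly on the same device as the input by passing `device=x.device` to torch.arange. The sketch below is one possible variant, not the original tutorial's code. The hyperparameter values are hypothetical and chosen only so that the example runs on its own, and it falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn

# Hypothetical hyperparameters, chosen only so the sketch is self-contained.
vocab_size, maxlen, n_segments, d_model = 30, 16, 2, 8


class Embedding(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # token embedding
        self.pos_embed = nn.Embedding(maxlen, d_model)  # position embedding
        self.seg_embed = nn.Embedding(n_segments, d_model)  # segment embedding
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, seg):
        seq_len = x.size(1)
        # Allocate pos on whatever device x already lives on.
        pos = torch.arange(seq_len, dtype=torch.long, device=x.device)
        pos = pos.unsqueeze(0).expand_as(x)  # [seq_len] -> [batch_size, seq_len]
        embedding = self.tok_embed(x) + self.pos_embed(pos) + self.seg_embed(seg)
        return self.norm(embedding)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Embedding().to(device)  # one call moves every registered submodule

x = torch.randint(0, vocab_size, (2, 5), device=device)  # fake token ids
seg = torch.zeros_like(x)  # all tokens in segment 0
out = model(x, seg)
print(out.shape, out.device)
```

With this pattern the module contains no reference to a particular device, so the same code runs unchanged on the CPU or on any GPU. Because the submodules are created in `__init__`, a single `model.to(device)` moves them all.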
After this change the first error no longer appeared, but a second error was raised:
RuntimeError: Tensor for argument #3 'mat2' is on CPU, but expected it to be on GPU (while checking arguments for addmm)
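Some searching showed that this happens because new network layers/modules were being created inside the model's forward method. Such layers are not registered as submodules, so moving the model to the GPU does not move them, and their weights stay on the CPU.
The offending code was: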
    class MultiHeadAttention(nn.Module):
        def __init__(self):
            super(MultiHeadAttention, self).__init__()
            self.W_Q = nn.Linear(d_model, d_k * n_heads)
            self.W_K = nn.Linear(d_model, d_k * n_heads)
            self.W_V = nn.Linear(d_model, d_v * n_heads)

        def forward(self, Q, K, V, attn_mask):
            # q: [batch_size, seq_len, d_model], k: [batch_size, seq_len, d_model], v: [batch_size, seq_len, d_model]
            residual, batch_size = Q, Q.size(0)
            # (B, S, D) -proj-> (B, S, D) -split-> (B, S, H, W) -trans-> (B, H, S, W)
            q_s = self.W_Q(Q).view(batch_size, -1, n_heads, d_k).transpose(1, 2)  # q_s: [batch_size, n_heads, seq_len, d_k]
            k_s = self.W_K(K).view(batch_size, -1, n_heads, d_k).transpose(1, 2)  # k_s: [batch_size, n_heads, seq_len, d_k]
            v_s = self.W_V(V).view(batch_size, -1, n_heads, d_v).transpose(1, 2)  # v_s: [batch_size, n_heads, seq_len, d_v]
            attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1)  # attn_mask: [batch_size, n_heads, seq_len, seq_len]
            # context: [batch_size, n_heads, seq_len, d_v], attn: [batch_size, n_heads, seq_len, seq_len]
            context = ScaledDotProductAttention()(q_s, k_s, v_s, attn_mask)
            context = context.transpose(1, 2).contiguous().view(batch_size, -1, n_heads * d_v)  # context: [batch_size, seq_len, n_heads * d_v]
            output = nn.Linear(n_heads * d_v, d_model)(context)
            return nn.LayerNorm(d_model)(output + residual)  # output: [batch_size, seq_len, d_model]
After the fix:
    class MultiHeadAttention(nn.Module):
        def __init__(self):
            super(MultiHeadAttention, self).__init__()
            self.W_Q = nn.Linear(d_model, d_k * n_heads).to(device)
            self.W_K = nn.Linear(d_model, d_k * n_heads).to(device)
            self.W_V = nn.Linear(d_model, d_v * n_heads).to(device)
            self.linear = nn.Linear(n_heads * d_v, d_model)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, Q, K, V, attn_mask):
            # q: [batch_size, seq_len, d_model], k: [batch_size, seq_len, d_model], v: [batch_size, seq_len, d_model]
            residual, batch_size = Q, Q.size(0)
            # (B, S, D) -proj-> (B, S, D) -split-> (B, S, H, W) -trans-> (B, H, S, W)
            q_s = self.W_Q(Q).view(batch_size, -1, n_heads, d_k).transpose(1, 2)  # q_s: [batch_size, n_heads, seq_len, d_k]
            k_s = self.W_K(K).view(batch_size, -1, n_heads, d_k).transpose(1, 2)  # k_s: [batch_size, n_heads, seq_len, d_k]
            v_s = self.W_V(V).view(batch_size, -1, n_heads, d_v).transpose(1, 2)  # v_s: [batch_size, n_heads, seq_len, d_v]
            attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1)  # attn_mask: [batch_size, n_heads, seq_len, seq_len]
            # context: [batch_size, n_heads, seq_len, d_v], attn: [batch_size, n_heads, seq_len, seq_len]
            context = ScaledDotProductAttention()(q_s, k_s, v_s, attn_mask)
            context = context.transpose(1, 2).contiguous().view(batch_size, -1, n_heads * d_v).to(device)  # context: [batch_size, seq_len, n_heads * d_v]
            output = self.linear(context)
            return self.norm(output + residual)  # output: [batch_size, seq_len, d_model]
After that, the problem was fully resolved.
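To illustrate why a layer built inside forward stays on the CPU, the sketch below compares the registered parameters of two hypothetical toy modules (`LayerInForward` and `LayerInInit` are illustrative names, not part of the original code). Only layers assigned as attributes in `__init__` are registered, and `model.to(device)` moves registered parameters only.

```python
import torch
import torch.nn as nn


class LayerInForward(nn.Module):
    """Builds its output projection inside forward, like the failing version."""

    def __init__(self):
        super().__init__()
        self.proj_in = nn.Linear(4, 4)

    def forward(self, x):
        # A brand-new, randomly initialised, unregistered layer on every call.
        return nn.Linear(4, 4)(self.proj_in(x))


class LayerInInit(nn.Module):
    """Creates every layer once in __init__, like the fixed version."""

    def __init__(self):
        super().__init__()
        self.proj_in = nn.Linear(4, 4)
        self.proj_out = nn.Linear(4, 4)

    def forward(self, x):
        return self.proj_out(self.proj_in(x))


names_forward = [name for name, _ in LayerInForward().named_parameters()]
names_init = [name for name, _ in LayerInInit().named_parameters()]
print(names_forward)  # ['proj_in.weight', 'proj_in.bias']
print(names_init)  # ['proj_in.weight', 'proj_in.bias', 'proj_out.weight', 'proj_out.bias']
```

The device error is only the visible symptom. A layer created in forward is also invisible to an optimizer built from `model.parameters()` and is re-initialised on every call, so its weights would never be trained even when everything runs on the CPU.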
References
- RuntimeError: Tensor for argument #2 'weight' is on CPU, but expected it to be on GPU (while checking arguments for cudnn_batch_norm), https://discuss.pytorch.org/t/runtimeerror-tensor-for-argument-2-weight-is-on-cpu-but-expected-it-to-be-on-gpu-while-checking-arguments-for-cudnn-batch-norm/55534