Conversation
ibeltagy left a comment:
Looks great. Thank you.
I left a few comments, but all of them are code-organization suggestions.
One question: how do you make sure your implementation is correct? The way I do this is to check that the output of LED-T5 perfectly matches that of T5 for a random short input, say of size 4x256.
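For what it's worth, a minimal sketch of such a check, assuming both models are already loaded with the same weights and expose the usual `transformers` seq2seq forward signature (the helper name and defaults are made up for illustration):

```python
import torch


def check_led_t5_matches_t5(t5_model, led_t5_model, vocab_size,
                            batch_size=4, seq_len=256, atol=1e-5):
    """Feed the same random short input to both models and compare logits.

    With seq_len no larger than the attention window, the sliding-window
    attention sees the whole sequence, so LED-T5 should reproduce plain T5
    up to numerical noise.
    """
    t5_model.eval()
    led_t5_model.eval()
    torch.manual_seed(0)
    input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
    attention_mask = torch.ones_like(input_ids)
    decoder_input_ids = torch.zeros((batch_size, 1), dtype=torch.long)  # T5 starts decoding from pad_token_id (0)

    with torch.no_grad():
        ref = t5_model(input_ids=input_ids, attention_mask=attention_mask,
                       decoder_input_ids=decoder_input_ids, use_cache=False)[0]
        out = led_t5_model(input_ids=input_ids, attention_mask=attention_mask,
                           decoder_input_ids=decoder_input_ids, use_cache=False)[0]

    max_diff = (ref - out).abs().max().item()
    print(f"max abs logit difference: {max_diff:.2e}")
    assert torch.allclose(ref, out, atol=atol), "LED-T5 output diverges from T5"
```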
```python
# this is for the T5 setting
if "has_relative_attention_bias" in config.to_dict():
    self.is_decoder = config.is_decoder
    self.relative_attention_num_buckets = config.relative_attention_num_buckets
    self.has_relative_attention_bias = config.has_relative_attention_bias
    if self.has_relative_attention_bias:
        self.relative_attention_bias = nn.Embedding(self.relative_attention_num_buckets, self.num_heads)
    self.is_t5 = True
else:
    self.is_t5 = False
```
I would suggest moving all the T5-specific code from here to a `longformer_encoder_decoder.LongformerSelfAttentionForT5` class and having it inherit from `LongformerSelfAttention`.
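Roughly, the constructor side of that split could look like the sketch below (the import path, the `(config, layer_id)` signature, and attribute names such as `self.num_heads` are assumptions based on the diff above and the existing `longformer` module layout, not the actual implementation):

```python
# longformer_encoder_decoder.py (sketch only)
from torch import nn

from longformer.longformer import LongformerSelfAttention


class LongformerSelfAttentionForT5(LongformerSelfAttention):
    """Sliding-window self-attention plus T5's relative position bias."""

    def __init__(self, config, layer_id):
        super().__init__(config, layer_id)
        self.is_decoder = config.is_decoder
        self.relative_attention_num_buckets = config.relative_attention_num_buckets
        self.has_relative_attention_bias = config.has_relative_attention_bias
        if self.has_relative_attention_bias:
            # one learned bias per bucket and per head, as in T5
            self.relative_attention_bias = nn.Embedding(
                self.relative_attention_num_buckets, self.num_heads
            )
```

With this layout the `is_t5` flag becomes unnecessary, since the base class simply never adds a bias.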
```python
if self.is_t5:
    if position_bias is None:
        if not self.has_relative_attention_bias:
            raise ValueError("No position_bias provided and no weights to compute position_bias")
        position_bias = self.compute_bias(seq_len, seq_len)
        # if key and values are already calculated
        # we want only the last query position bias
        if past_key_value_state is not None:
            position_bias = position_bias[:, :, -1:, :]
        # TODO: attention_mask should also be the same shape as position_bias.
        # Sliding attention window??
        # if attention_mask is not None:
        #     position_bias = position_bias + attention_mask  # (1, num_heads, seq_len, 2*window+1)
    attn_weights += position_bias
```
As above, move this to `LongformerSelfAttentionForT5`. Here you can keep only one line, something like `attn_weights = self.process_relative_positions(attn_weights)`. This function is empty in `LongformerSelfAttention` but has more details in `LongformerSelfAttentionForT5`.
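Continuing the constructor sketch above, the forward-side hook could look like this (method name and the extra arguments are only illustrative; the T5 branch mirrors the code in the diff above):

```python
from torch import nn


class LongformerSelfAttention(nn.Module):
    # ... existing sliding-window attention code; inside forward() the single call would be:
    #     attn_weights = self.process_relative_positions(attn_weights, seq_len,
    #                                                    position_bias, past_key_value_state)

    def process_relative_positions(self, attn_weights, seq_len,
                                   position_bias=None, past_key_value_state=None):
        """Hook: the base Longformer attention adds no relative position bias."""
        return attn_weights


class LongformerSelfAttentionForT5(LongformerSelfAttention):
    # __init__ as in the sketch above

    def process_relative_positions(self, attn_weights, seq_len,
                                   position_bias=None, past_key_value_state=None):
        if position_bias is None:
            if not self.has_relative_attention_bias:
                raise ValueError("No position_bias provided and no weights to compute position_bias")
            position_bias = self.compute_bias(seq_len, seq_len)
            # if keys and values are already cached, keep only the last query position's bias
            if past_key_value_state is not None:
                position_bias = position_bias[:, :, -1:, :]
        return attn_weights + position_bias
```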
```python
        relative_buckets += torch.where(is_small, relative_position, relative_postion_if_large)
        return relative_buckets

    def compute_bias(self, qlen, klen):
        """ Compute binned relative position bias """
        relative_position = torch.tensor([[i - self.attention_window for i in range(2 * self.attention_window + 1)]])
```
Add a comment to explain the change.
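For reference, one way to read the change (my wording, not the author's): in sliding-window attention each query only sees a fixed band of keys, so the relative positions collapse to a single row. A comment along these lines on the existing line would do:

```python
# In sliding-window attention every query attends to the same fixed band of
# 2 * attention_window + 1 key positions centered on itself, so the relative
# positions are identical for every query: [-window, ..., 0, ..., +window].
# This replaces T5's full (qlen, klen) relative-position matrix with one row.
relative_position = torch.tensor(
    [[i - self.attention_window for i in range(2 * self.attention_window + 1)]]
)
```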
```python
    @staticmethod
    def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
        """
        Adapted from Mesh Tensorflow:
```
This function is copied with no change, right? Please mention that.
Hello @AkshitaB, I am keeping my fingers crossed for you in porting T5 to use LongformerSelfAttention. I've tried to run the code you have uploaded and it didn't work for me. I've received this error:

```
$ CUDA_VISIBLE_DEVICES=6 python3 convert_t5_to_longformerencoderdecoder.py --save_model_to ./
INFO:__main__:saving model to ./
Some weights of LongformerEncoderDecoderForConditionalGenerationT5 were not initialized from the model checkpoint at ./ and are newly initialized: ['encoder.block.0.layer.0.SelfAttention.longformer_self_attn.query.bias', (...)]
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "convert_t5_to_longformerencoderdecoder.py", line 148, in <module>
    main()
  File "convert_t5_to_longformerencoderdecoder.py", line 140, in main
    logits = model(input_ids, attention_mask=attention_mask, decoder_input_ids=decoder_input_ids, use_cache=False)[0]
  File "/dih3/dih3_1/awawrzynski/miniconda3/envs/longformer_t5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/dih3/dih3_1/awawrzynski/miniconda3/envs/longformer_t5/lib/python3.8/site-packages/transformers/modeling_t5.py", line 1151, in forward
    encoder_outputs = self.encoder(
  File "/dih3/dih3_1/awawrzynski/miniconda3/envs/longformer_t5/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/dih3/dih3_1/awawrzynski/miniconda3/envs/longformer_t5/lib/python3.8/site-packages/transformers/modeling_t5.py", line 775, in forward
    position_bias = layer_outputs[3 if output_attentions else 2]
IndexError: tuple index out of range
```
@adamwawrzynski, thanks for your interest in this work. Can you debug this a bit and see why it is breaking? Looks like a misconfiguration for
I've checked dimensions of ...

```python
for i, (layer_module, past_key_value_state) in enumerate(zip(self.block, past_key_value_states)):
    if output_hidden_states:
        all_hidden_states = all_hidden_states + (hidden_states,)
    layer_outputs = layer_module(
        hidden_states,
        attention_mask=extended_attention_mask,
        position_bias=position_bias,
        encoder_hidden_states=encoder_hidden_states,
        encoder_attention_mask=encoder_extended_attention_mask,
        encoder_decoder_position_bias=encoder_decoder_position_bias,
        head_mask=head_mask[i],
        past_key_value_state=past_key_value_state,
        use_cache=use_cache,
        output_attentions=output_attentions,
    )
    print(type(layer_outputs))
    print(len(layer_outputs))
    print(type(layer_outputs[0]))
```

...results in:

```
<class 'tuple'>
2
<class 'torch.Tensor'>
```

And later in code at line
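For context, a minimal reproduction of what those prints imply (the content of the second tuple element is unknown, so it is left as a placeholder): the `T5Stack` frame in the traceback expects `position_bias` at index 2 of the block outputs (index 3 when `output_attentions` is on), but the patched block returns only two elements.

```python
import torch

hidden_states = torch.zeros(4, 256, 512)   # placeholder tensor standing in for the block output
layer_outputs = (hidden_states, None)      # a 2-tuple, as the debug prints report
output_attentions = False

try:
    position_bias = layer_outputs[3 if output_attentions else 2]
except IndexError:
    print("tuple index out of range: the block's output tuple is missing the "
          "position_bias element that T5Stack expects to reuse across layers")
```

Whatever the patched attention layer returns from `forward` would need to include that element (or `T5Stack`'s unpacking would need to change) for the encoder loop to proceed.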
Hello. Model: https://huggingface.co/cointegrated/rut5-small

```
python3 convert_t5_to_longformerencoderdecoder.py --base_model /mnt/1tb/ML/models/ruT5/small/ --save_model_to .
```