Why do non-identical inputs to ProtBERT generate identical embeddings when non-whitespace-separated?
I've looked at existing answers (e.g. here), but they seem to cover different cases where the slicing of out.last_hidden_state went wrong, which is not the issue here.
Some background
I'm learning about using transformers for protein sequence analysis, specifically Hugging Face's transformers interface to the ProtBERT model. I have a Python/biology background but I'm fairly new to deep learning libraries and language models.
I have learned that the embeddings are (to me) surprisingly difficult to access, and documentation is somewhat inconsistent/inaccurate (see e.g. here).
Additionally, the models' input interfaces are inconsistent: for example, the ESM models expect raw sequences with no whitespace, while the BERT-family models (in at least one walkthrough example) expect individual residues to be whitespace-separated. I suppose this is related to BERT's origins in human-language NLP, where spaces are expected to separate words as tokens, whereas in biological sequences individual characters carry the semantic content. This means you have to preprocess sequences with shenanigans like:
sequence_examples = [" ".join(list(sequence)) for sequence in sequence_examples]
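For completeness, the preprocessing examples I've seen (e.g. on the Rostlab/prot_bert model card) also map the rare residues U, Z, O, and B to X before spacing. A minimal helper along those lines (the function name is mine):

```python
import re

def preprocess(sequence: str) -> str:
    """Map rare amino acids (U, Z, O, B) to X and space-separate residues,
    as suggested for ProtBERT input."""
    sequence = re.sub(r"[UZOB]", "X", sequence)
    # " ".join iterates over the string's characters, so list() is optional
    return " ".join(sequence)

print(preprocess("MKTAYIAU"))  # M K T A Y I A X
```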
Which is annoying but fine. However, as a beginner I initially omitted this step and found a weird behavior/error mode/feature of the model as follows.
Example
Here is an MWE of the behavior:
from transformers import BertModel, BertTokenizer
import random
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False, truncation=True)
model = BertModel.from_pretrained("Rostlab/prot_bert")
ALPHABET = list("ACDEFGHIJKLMNPQRSTVWY")
for i in range(26):
    aas = random.choices(ALPHABET, k=20)
    peptide = " ".join(aas)
    peptide_no_ws = "".join(aas)

    # whitespace-separated input
    encoded_input = tokenizer(peptide, return_tensors="pt", max_length=24)
    outputs = model(**encoded_input)
    print(peptide)
    print(outputs.last_hidden_state[:, 0, :])

    # same residues, no whitespace
    encoded_input = tokenizer(peptide_no_ws, return_tensors="pt", max_length=24)
    outputs = model(**encoded_input)
    print(peptide_no_ws)
    print(outputs.last_hidden_state[:, 0, :])
Which prints out a bunch of stuff:
T N F S J I L M D R C E A K Y G P V W H
tensor([[ 0.0759,  0.1376,  0.0564,  ..., -0.0675, -0.0184, -0.0030]],  # different!!!
       grad_fn=<SliceBackward0>)
TNFSJILMDRCEAKYGPVWH
tensor([[-0.1096,  0.0474, -0.0857,  ..., -0.0035, -0.0569,  0.0918]],  # SAME!!!
       grad_fn=<SliceBackward0>)
G I L P J T N K A R H Q E V Y W F M D C
tensor([[ 0.0725,  0.1307,  0.0652,  ..., -0.0378, -0.0352, -0.0315]],  # different!!!
       grad_fn=<SliceBackward0>)
GILPJTNKARHQEVYWFMDC
tensor([[-0.1096,  0.0474, -0.0857,  ..., -0.0035, -0.0569,  0.0918]],  # SAME!!!
No matter what the input is (even with an input of length 50 and max_length=100), if the residues are not whitespace-separated the model returns the same tensor:
tensor([[-0.1096,  0.0474, -0.0857,  ..., -0.0035, -0.0569,  0.0918]],
       grad_fn=<SliceBackward0>)
My question is just... why? This is a rather weird behavior/error mode/trap for the unwary.
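For what it's worth, inspecting the tokenizer suggests a mechanism: ProtBERT's vocabulary contains only single residues plus special tokens (no "##" continuation pieces), and BertTokenizer first splits on whitespace and then runs WordPiece on each whitespace-delimited "word". An unspaced 20-mer is therefore one word that cannot be fully matched, so the entire word becomes [UNK], and every unspaced input collapses to [CLS] [UNK] [SEP]. A toy sketch of that greedy WordPiece lookup (the vocabulary below is a stand-in for the real one, not ProtBERT's actual vocab file):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece, as in BertTokenizer.
    If any part of the word fails to match, the WHOLE word -> [UNK]."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry a "##" prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # unmatchable remainder poisons the whole word
        pieces.append(cur)
        start = end
    return pieces

def tokenize(text, vocab):
    return [p for w in text.split() for p in wordpiece(w, vocab)]

# Toy stand-in vocab: single residues only, no "##" continuations
vocab = set("ACDEFGHIJKLMNPQRSTVWXYZUOB") | {"[UNK]", "[CLS]", "[SEP]"}

print(tokenize("T N F S", vocab))  # ['T', 'N', 'F', 'S']
print(tokenize("TNFS", vocab))    # ['[UNK]']
```

If this is indeed what's happening, the identical outputs make sense: the model always sees the same three-token input, so the [CLS] embedding is always the same.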
A note on the right way of accessing embeddings
I'm still not totally sure what the canonical way to access embeddings is. Unsurprisingly, some of the available resources appear to themselves be LLM outputs and are thus not particularly reliable, and some of the answers (e.g. here), while giving high-quality background, aren't as explicit. I have also heard by word of mouth that last_hidden_state.mean(axis=1) is sometimes used.
Regardless of which of these paths I take, they show the same qualitative behavior as the MWE above, though the exact values differ.