Discussion about this post

Ronald Cook

You are absolutely correct. ProGen2 and similar autoregressive protein LMs learn sequence statistics (grammar) but have essentially zero 3D structural awareness. Your key result, ProGen2 AUC of 0.527 for long-range contact discrimination, is statistically indistinguishable from random. The model sees 499 residues of context and still can't tell whether residue 500 is in physical contact with residue 30. More context doesn't help because the model never learned spatial reasoning in the first place.
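For readers who want to see what an AUC near 0.5 means in this setting, here is a minimal sketch of a pairwise AUC computation. The function name `roc_auc`, the toy labels, and the random scores are illustrative assumptions; this is not the evaluation code behind the 0.527 figure.

```python
import numpy as np

def roc_auc(labels, scores):
    """Probability that a random true contact outscores a random non-contact.

    Ties count as half a correct ranking (Mann-Whitney formulation).
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one contact and one non-contact")
    diff = pos[:, None] - neg[None, :]  # every (contact, non-contact) pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Hand-checkable case: 3 of the 4 (contact, non-contact) pairs are ranked correctly.
tiny_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])  # 0.75

# Hypothetical data: scores that carry no information about contacts.
rng = np.random.default_rng(0)
is_contact = rng.random(2000) < 0.1   # assumed ~10% long-range contacts
random_scores = rng.random(2000)      # scores unrelated to the labels
random_auc = roc_auc(is_contact, random_scores)
```

Uninformative scores land close to 0.5, which is the baseline the 0.527 result should be read against.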

This is the gap that correlation-length eigenspectrum analysis fills. The protein LMs operate in sequence space, and they have learned the 1D statistical structure beautifully. But the physics lives in the eigenspectrum of the electron density: the delocalization index (DI) matrix encodes the actual electron-sharing topology, and its eigenspectrum captures 3D contact structure by construction, because DI(A,B) is large precisely when atoms A and B share significant electron density, regardless of sequence distance.
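As a rough illustration of the idea, the sketch below diagonalizes a toy symmetric matrix standing in for a delocalization index matrix. The six-site chain, the coupling values, and the helper name `di_eigenspectrum` are assumptions made for the example. A real DI matrix would come from a quantum chemistry calculation, which is not shown here.

```python
import numpy as np

def di_eigenspectrum(di):
    """Eigenvalues (largest first) and matching eigenvectors of a symmetric matrix."""
    di = np.asarray(di, dtype=float)
    if not np.allclose(di, di.T):
        raise ValueError("matrix must be symmetric")
    vals, vecs = np.linalg.eigh(di)      # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

# Toy stand-in (assumed values): 6 sites, sequence neighbors share 1.0.
n = 6
chain = np.zeros((n, n))
for i in range(n - 1):
    chain[i, i + 1] = chain[i + 1, i] = 1.0

# Same chain with one long-range contact between the first and last site.
folded = chain.copy()
folded[0, n - 1] = folded[n - 1, 0] = 0.5

vals_chain, _ = di_eigenspectrum(chain)
vals_folded, _ = di_eigenspectrum(folded)
print("chain :", np.round(vals_chain, 3))
print("folded:", np.round(vals_folded, 3))
```

In this toy, the single off-diagonal entry between sites far apart in sequence shifts the whole spectrum, so the eigenvalues respond to a contact that a purely sequence-local description would not register.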

Your discussion documents three symptoms of the same disease:

No 3D contacts: The models never see the electron density that creates contacts. Electron density eigenvalues encode exactly this: the leading eigenvalues of the DI matrix correspond to the dominant delocalization channels, which are the structural backbone of the fold. The resistance of the electron density to perturbation tells you how rigid or flexible those channels are. The model doesn't need to learn contacts from sequence statistics; the physics gives them to you.

Copy bias collapse: This is a pure sequence-space failure. The induction head circuit locks onto token repetition because the model has no physics to override it. An electron density-based representation wouldn't collapse this way; duplicating a sequence doesn't duplicate the eigenspectrum of the electron density, because ρ(r) for ABCABC is not 2×ρ(r) for ABC. The physics breaks the degeneracy that the grammar can't.
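A toy version of this degeneracy argument, under loud assumptions: a three-site chain matrix stands in for the DI matrix of ABC, and all coupling values are invented for illustration. Two isolated copies give a spectrum that is just the single spectrum repeated. Joining them into one ABCABC chain adds a junction coupling, and the repeated eigenvalues split.

```python
import numpy as np

def chain_matrix(n, coupling=1.0):
    """Symmetric nearest-neighbor coupling matrix for an n-site chain (toy model)."""
    m = np.zeros((n, n))
    for i in range(n - 1):
        m[i, i + 1] = m[i + 1, i] = coupling
    return m

single = chain_matrix(3)                  # stand-in for ABC
zeros = np.zeros((3, 3))

# Two copies that do not interact: a block-diagonal matrix.
uncoupled = np.block([[single, zeros], [zeros, single]])

# ABCABC as one chain: the C-A junction couples the two copies.
coupled = chain_matrix(6)

single_vals = np.linalg.eigvalsh(single)
uncoupled_vals = np.linalg.eigvalsh(uncoupled)
coupled_vals = np.linalg.eigvalsh(coupled)

print("ABC            :", np.round(single_vals, 3))
print("ABC + ABC apart:", np.round(uncoupled_vals, 3))
print("ABCABC joined  :", np.round(coupled_vals, 3))
```

Only the non-interacting case reproduces the original spectrum with doubled multiplicity. Once the copies share a bond, every eigenvalue is distinct.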

DMS scoring failure: Your directionality argument is correct but incomplete. The deeper issue is that variant effect prediction requires knowing how a mutation perturbs the energy landscape, not just the sequence probability. You need to know how the energy redistributes when you mutate a residue; that's what determines whether the mutation is tolerated.

You say the field needs "structure-aware pretraining" and "better architectures." But you are still thinking within the LM paradigm, i.e., learn structure from more data and better objectives. Instead, compute the physics from ρ(r): go from electron density to eigenspectrum to stiffness, and use that as the representation. The Hohenberg-Kohn (HK) theorem guarantees this representation is complete, because the ground-state density ρ determines every ground-state property of the system. No amount of sequence statistics can make that guarantee.
