Stuff done so far

So far the approach to this goal of simulating cool things has been to iteratively build on successively larger toy problems, with the aim of me learning in general how all this stuff works. The N-body and fluid simulations have worked well for this so far, but I think I am coming up to an impasse. I can see that if I went and reimplemented the fluids stuff from here it would help a bunch with generic PyTorch skills, but I don't know that it would help much with the second part of the goal: moving towards doing stuff with chemistry.

Other work

So I decided to go back and do some more spelunking in the real world to see what I could find. I think that this paper is quite interesting. They get what look to me like pretty good results using not that much data (roughly 400k reactions), and apparently it only took about a day to train on a Titan X, which is pretty good. Another paper combined text (LLM-style inputs) with traditional structure representations. I learned two interesting concepts from it:

  • Aligning the latent space of a model, so that e.g. all images of cats are forced to map to the same numbers. This seems like it could be particularly fruitful for combining sources of information. For example, if we enforce that there is one "big bag of atoms" somewhere in the model, then every source of data can project into it, since they all describe the same real world. It is kind of like how you can project radar returns to a point cloud, use monocular depth estimation to project camera images to a point cloud, and have lidar natively as a point cloud, and then all three sources merge together in a super intuitive way. (The sketch after this list shows one way such an alignment gets enforced.)
  • Contrastive learning. The paper is called "Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing". They had two inputs, [molecule structure, text], and the contrastive bit ensured that the structure and text of the same molecule were encoded to the same set of numbers, and also that each molecule was encoded differently from every other molecule. In this way the same stuff clusters together and different stuff clusters apart (a minimal version of this is sketched below). Some discussions were also had about good next steps, so I think a super-toy problem to start with would be classifying alkanes/alkenes into "valid" and "invalid" sets, where e.g. CH4 is valid and CH4CH3 is invalid; a sketch of how that dataset could be generated is also below. That should be pretty quick to do, and then we can move on to something bigger. The short training time reported in the paper above gives me hope that these experiments can be iterated on relatively quickly.
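
To make both of these concrete, here is a minimal PyTorch sketch of CLIP-style contrastive alignment. This is my own toy version, not the MoleculeSTM code: the encoders are stand-in MLPs and all the feature sizes are made up. Matched structure/text pairs get pulled to the same point in a shared latent space, and every mismatched pair in the batch gets pushed apart:

```python
import torch
import torch.nn.functional as F

# Stand-in encoders. In the paper one side is a GNN over the molecule
# graph and the other is a text transformer; here both are tiny MLPs
# that map into the same 32-dim shared latent space.
struct_encoder = torch.nn.Sequential(
    torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 32))
text_encoder = torch.nn.Sequential(
    torch.nn.Linear(300, 128), torch.nn.ReLU(), torch.nn.Linear(128, 32))

def contrastive_loss(struct_batch, text_batch, temperature=0.07):
    # Encode both modalities and normalise, so similarity is a dot product.
    z_s = F.normalize(struct_encoder(struct_batch), dim=-1)
    z_t = F.normalize(text_encoder(text_batch), dim=-1)
    # logits[i, j] = similarity between structure i and text j.
    logits = z_s @ z_t.T / temperature
    # Row i's "correct answer" is column i: the matching pair is pulled
    # together, all the mismatched pairs are pushed apart.
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2

# Fake batch: 8 molecules, each as a (structure-features, text-features) pair.
loss = contrastive_loss(torch.randn(8, 64), torch.randn(8, 300))
loss.backward()
```

The nice thing is that this generalises past two modalities: each extra data source just needs its own encoder into the shared space, which is exactly the "big bag of atoms" idea from the first bullet.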
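And for the super-toy problem, a sketch of how the valid/invalid dataset could be generated. The valence rule here is my own assumption of what "valid" should mean: an acyclic alkane with n carbons has exactly 2n + 2 hydrogens, an alkene has 2n, and anything else is labelled invalid:

```python
import random

def is_valid_formula(n_carbons, n_hydrogens):
    # Toy valence rule: an acyclic alkane with n carbons has exactly
    # 2n + 2 hydrogens; an alkene (one C=C bond, so n >= 2) has 2n.
    # Everything else counts as "invalid" for this toy problem.
    if n_hydrogens == 2 * n_carbons + 2:
        return True
    return n_carbons >= 2 and n_hydrogens == 2 * n_carbons

def make_example(rng):
    c = rng.randint(1, 10)
    # Half the examples are valid by construction; the rest get a random
    # hydrogen count, which is nearly always invalid.
    if rng.random() < 0.5:
        h = 2 * c + 2 if (c == 1 or rng.random() < 0.5) else 2 * c
    else:
        h = rng.randint(1, 3 * c + 3)
    return f"C{c}H{h}", is_valid_formula(c, h)

rng = random.Random(0)
dataset = [make_example(rng) for _ in range(1000)]
print(dataset[:3])  # a few (formula, label) pairs, e.g. ("C1H4", True)
```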

One other thing

I have been pondering how to represent these molecules in a simulation. One way you could do it would be to have molecules floating about as discrete things, and model how they change. But what are two molecules, if not a disjoint graph? So really the whole simulation is just one big graph, where a 'molecule' is simply a connected component of that graph. This seems like it would be a much better representation. Before we get too excited, though, I think it would be good to go take a look at how all these protein-folding models represent their molecules. Is it also a graph? How are the positions represented? One could imagine that if you wanted to simulate N atoms under a fixed compute budget, you could just enforce a fixed edge budget, say k edges per atom. The model would then be in charge of figuring out that the edges between atoms within a molecule are very important (duh), and then also allocate edges towards other important things like dipoles and whatnot.
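
Here is a rough sketch of what that could look like (all the names here are my own invention, not from any of the papers). The state is one big edge index built under a per-atom budget of k nearest neighbours, and 'molecules' fall out as connected components of the graph:

```python
import torch

def knn_edges(positions, k):
    # positions: (N, 3) tensor of atom coordinates. Instead of all-pairs
    # edges, each atom keeps edges to its k nearest neighbours, so the
    # edge count stays at N * k regardless of how atoms clump together.
    dists = torch.cdist(positions, positions)
    dists.fill_diagonal_(float("inf"))                 # no self-edges
    neighbours = dists.topk(k, largest=False).indices  # (N, k)
    src = torch.arange(len(positions)).repeat_interleave(k)
    return torch.stack([src, neighbours.flatten()])    # (2, N*k) edge index

def connected_components(n, edge_index):
    # Tiny union-find: "molecules" fall out as the connected components.
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for s, d in edge_index.T.tolist():
        parent[find(s)] = find(d)
    return [find(i) for i in range(n)]

pos = torch.randn(20, 3)
edges = knn_edges(pos, k=3)
labels = connected_components(20, edges)
print(labels)  # atoms sharing a label belong to the same "molecule"
```

In a learned model the edges would carry messages or attention weights; this only builds the structure, and the kNN rule is just one cheap way to spend the edge budget (the model could equally learn where to put the edges).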