9 Comments

Congrats on putting together some cool concrete proposals on how near-term AI can accelerate science. Usually such ideas are much vaguer.

Some comments:

1. Sci-Hub hosting PDFs (instead of .doc or .txt) may present a challenge to one of your proposals, depending on how difficult high-quality OCR turns out to be and how sensitive LLM training is to minor spelling/typographic errors. Captions in figures can matter a lot sometimes.

2. Automated labs seem sketchy from a security standpoint. Hope we can have some oversight to make sure people don't make bioweapons in their backyard. On the other hand, maybe they're safer, to the degree this leads to centralization over decentralized labs.

author

1. Pixels rather than text is fascinating, and I love @gwern's references below. I am starting to read into this. Your discussion took me into the work of AI2, which seems to have the most public repositories on loading and analyzing scientific literature. Using figure captions to train an understanding of the pixels is surely what people are working on; I'd like to find where the frontier is here.

2. Security is a great point. Strangely, I don't see it as a difference in kind from a lone individual yet, but I'll noodle on this: ordering DNA etc. still has security checks subtly built in, but as we get to more distributed production (e.g., https://www.forbes.com/sites/johncumbers/2022/11/21/can-a-desktop-dna-printer-stomp-out-the-next-pandemic) we might get to less safe environments. In general, I think any safety/ethics conversation can be framed as positive growth, not as a friction source: what needs to be true/adhered to in order for more funders/customers/governments to feel comfortable growing the space?


1. Good point on using captions/figures from science papers. There must be some in the training corpus of Midjourney et al., because they can make some science figures, like fake electrophoresis graphs.

2. Re: safety, I think that's an overly optimistic view. There will always be rogue or incautious actors, so we need to be proactive about ensuring security. It only takes one existential risk...


1. Yes, while OCR has gotten pretty good and you can apply these models to make OCR even better, I'm a bit skeptical that for non-born-digital PDFs one will be training on OCR dumps (not even end-to-end). I expect a multimodal approach eventually: PIXEL https://arxiv.org/abs/2207.06991 or TVLT https://arxiv.org/abs/2209.14156 for example. (MAE is cool.) It's not much more expensive to train on pixels of pages, and the greater data quality (which seems especially important in STEM material, where a single letter totally changes the meaning of an equation) may be critical. Plus, it can help simplify your architecture if it can read images and pure text seamlessly.
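To make the pixel-based idea concrete, here is a minimal sketch of the input side of a PIXEL-style model: render text onto a pixel grid, then split the grid into fixed-size patches that a ViT-style encoder would embed as "tokens." The 5x3 bitmap font below is a toy stand-in of my own invention (PIXEL uses real text rendering), so treat the specifics as illustrative assumptions, not the paper's actual pipeline.

```python
# Toy 5x3 bitmaps for a couple of glyphs (a hypothetical mini-font,
# standing in for real text rendering as used by PIXEL).
FONT = {
    "A": ["010", "101", "111", "101", "101"],
    "I": ["111", "010", "010", "010", "111"],
    " ": ["000", "000", "000", "000", "000"],
}

def render(text):
    """Render text to a 5-row binary pixel grid (list of lists of 0/1)."""
    rows = [[] for _ in range(5)]
    for ch in text:
        glyph = FONT[ch]
        for r in range(5):
            rows[r].extend(int(p) for p in glyph[r])
    return rows

def patchify(grid, patch_w=3):
    """Split the grid into full-height patches of width patch_w, each
    flattened to a 1-D vector -- the units a ViT encoder would embed."""
    width = len(grid[0])
    patches = []
    for x in range(0, width, patch_w):
        patch = [row[x:x + patch_w] for row in grid]
        patches.append([p for row in patch for p in row])
    return patches

grid = render("AI")
patches = patchify(grid)  # two 15-dim patch vectors, one per glyph
```

The appeal for scanned papers is that this pipeline doesn't care whether the pixels came from rendered text or from a page scan, so OCR errors never enter the training data at all.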


I think you might be right, but I'm excited/dreading to see how capable next-gen LLMs (which I doubt will be trained on pixels, at least the ones released or demonstrated over the next 6 months) are without being trained on all of Sci-Hub.

Perhaps Wikipedia, YouTube transcripts, and Twitter are enough to be reasonably science-literate, though they'd surely (?) be a few years out of date, given Wikipedia's policy against too much new/recent science.


Aside from being out of date, it'll differ a lot by subject area, I think. The literacy will be good for fields that have a lot of born-digital papers, particularly early adopters of preprint servers like arXiv. So, much of STEM will be in good shape. If an LLM can't read every single old AI paper from the 1970s... well, it's probably not missing much. And even in fields which don't churn over quite so rapidly, I don't think the omissions will be too bad: there are loads of old papers in biology etc. which are still relevant and true and not born-digital, but much of what they say will be summarized or implicit in later papers. The humanities will be harder hit, because for many areas there's no such thing as a preprint server, everything is paywalled, or they historically publish as books whose digital versions may exist but won't be on Libgen (and if they are at all, it'll be a scan). This is unfortunate, but it's what their publishing practices, among other things, have led to.

author

Love this discussion! Just as I was coming in to respond, Galactica.org drops: 48M papers, not the 88M from Sci-Hub, but still a LOT.

https://twitter.com/D_R_Goodwin/status/1592585148171943936


It looks pretty cool overall, but note that they are counting abstracts toward the 48M papers: https://galactica.org/static/paper.pdf#page=42 "We source abstracts where full texts are not open access. In total the full dataset contains 48 million papers, abstract and full-text, up to July 2022." By my count of the table, 40 of the 48 million papers are included only as abstracts, with nothing like full text. For the most part, this is 'the usual suspects'.

author

Three weeks later, it looks like it was an unfortunate case of an otherwise cool paper being overmarketed and then roasted.
