LW - o1-preview is pretty good at doing ML on an unknown dataset by Håvard Tveit Ihle

The Nonlinear Library: LessWrong


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: o1-preview is pretty good at doing ML on an unknown dataset, published by Håvard Tveit Ihle on September 20, 2024 on LessWrong.
Previous post: How good are LLMs at doing ML on an unknown dataset?
A while back I ran some evaluation tests on GPT-4o, Claude Sonnet 3.5 and Gemini Advanced to see how good they were at doing machine learning on a completely novel, and somewhat unusual, dataset. Each sample is basically 512 points in the 2D plane, some of which make up a shape, and the goal is to classify each sample according to the shape its points make up.
None of the models did better than chance on the original (hard) dataset, while they did somewhat better on a much easier version I made afterwards.
With the release of o1-preview, I wanted to quickly run the same test on o1, just to see how well it did. In summary, it basically solved the hard version of my previous challenge, achieving 77% accuracy on the test set on its fourth submission (this increases to 91% if I run it for 250 instead of 50 epochs), which is really impressive to me.
Here is the full conversation with ChatGPT o1-preview
In general, o1-preview seems like a big step change in the ability to reliably do hard tasks like this without any advanced scaffolding or prompting to make it work.
Detailed discussion of results
The architecture that o1 went for in the first round is essentially the same one that Sonnet 3.5 and Gemini went for: a PointNet-inspired model that extracts features from each point independently. While it managed to do slightly better than chance on the training set, it did not do well on the test set.
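To make the PointNet idea concrete, here is a minimal sketch, not o1's actual model: one shared layer is applied to every point independently, and the per-point features are then max-pooled, so the result does not depend on the order of the points. The layer width, the random untrained weights and the random input points are all assumptions for illustration.

```python
import random


def point_features(point, weights, biases):
    """Shared per-point layer: ReLU(w . (x, y) + b) for each hidden unit."""
    x, y = point
    return [max(0.0, wx * x + wy * y + b) for (wx, wy), b in zip(weights, biases)]


def global_max_pool(points, weights, biases):
    """Apply the shared layer to each point, then take the max over all points."""
    per_point = [point_features(p, weights, biases) for p in points]
    return [max(column) for column in zip(*per_point)]


rng = random.Random(0)
hidden = 8  # hypothetical feature width
weights = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(hidden)]
biases = [rng.uniform(-1, 1) for _ in range(hidden)]

# Stand-in sample: 512 random points (the real dataset is not reproduced here).
points = [(rng.random(), rng.random()) for _ in range(512)]
shuffled = points[:]
rng.shuffle(shuffled)

pooled = global_max_pool(points, weights, biases)
pooled_shuffled = global_max_pool(shuffled, weights, biases)
```

In a real model the pooled vector would feed a small classifier head, and the weights would be learned rather than random.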
For round two, it went for the approach (which Sonnet 3.5 also came up with) of binning the points in 2D into an image, and then using a regular 2D convnet to classify the shapes. This worked somewhat on the first try. It completely overfit the training data, but got to an accuracy of 56% on the test data.
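The binning step can be sketched as a 2D histogram. This is an illustrative version, not the code from the conversation: the function name `points_to_image`, the grid size and the assumed coordinate range of 0 to 1 are not from the post.

```python
def points_to_image(points, size=32, lo=0.0, hi=1.0):
    """Bin (x, y) points into a size x size grid of per-cell point counts.

    Assumes coordinates lie roughly in [lo, hi]; anything outside is
    clamped into the nearest edge cell.
    """
    image = [[0] * size for _ in range(size)]
    scale = size / (hi - lo)
    for x, y in points:
        col = min(max(int((x - lo) * scale), 0), size - 1)
        row = min(max(int((y - lo) * scale), 0), size - 1)
        image[row][col] += 1
    return image


# Tiny example on a 4 x 4 grid.
example = [(0.0, 0.0), (0.99, 0.99), (0.5, 0.5), (0.5, 0.5), (1.0, 0.0)]
grid = points_to_image(example, size=4)
```

The resulting grid can be treated as a single-channel image and passed to an ordinary 2D convnet.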
For round three, it understood that it needed to add data augmentations in order to generalize better, and it implemented scaling, translations and rotations of the data. It also switched to a slightly modified ResNet18 architecture (a roughly 10x larger model). However, it introduced a bug when converting to a PIL image (and back to torch.tensor), which resulted in an error.
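As a rough sketch of what such augmentations do, the following applies a random rotation, scaling and translation directly to the point coordinates. This is an assumption for illustration: o1's version appears to have worked on images (hence the PIL conversion), and the function name `augment` and the parameter ranges here are hypothetical.

```python
import math
import random


def augment(points, rng, max_shift=0.1, scale_range=(0.8, 1.2)):
    """Rotate and scale points about their centroid, then translate them."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    scale = rng.uniform(*scale_range)
    dx = rng.uniform(-max_shift, max_shift)
    dy = rng.uniform(-max_shift, max_shift)

    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    out = []
    for x, y in points:
        px, py = x - cx, y - cy  # move centroid to the origin
        rx = scale * (cos_a * px - sin_a * py)
        ry = scale * (sin_a * px + cos_a * py)
        out.append((rx + cx + dx, ry + cy + dy))
    return out


rng = random.Random(0)
triangle = [(0.2, 0.2), (0.8, 0.2), (0.5, 0.9)]
augmented = augment(triangle, rng)
```

Because the transform is a similarity transform, the shape itself is preserved while its position, size and orientation vary, which is the property that helps the classifier generalize.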
For round four, o1 fixed the error and had a basically working solution, achieving an accuracy of 77% (which increases to 91% if we increase the number of epochs from 50 to 250, all still well within the allotted hour of runtime). I consider the problem basically solved at this point. By playing around with smaller variations on this, you can probably get a few more percentage points without any more insights needed.
For the last round, it tried the standard approach of using the pretrained weights of ResNet18 and freezing almost all the layers, which is an approach that works well on many problems, but did not work well in this case. The accuracy dropped to 41%. I guess these data are just too different from ImageNet (which the pretrained ResNet18 was trained on) for this approach to work well. I would not have expected this to work, but I don't hold it that much against o1, as it is a reasonable thing to try.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
