Artwork

المحتوى المقدم من LessWrong. يتم تحميل جميع محتويات البودكاست بما في ذلك الحلقات والرسومات وأوصاف البودكاست وتقديمها مباشرة بواسطة LessWrong أو شريك منصة البودكاست الخاص بهم. إذا كنت تعتقد أن شخصًا ما يستخدم عملك المحمي بحقوق الطبع والنشر دون إذنك، فيمكنك اتباع العملية الموضحة هنا https://ar.player.fm/legal.
Player FM - تطبيق بودكاست
انتقل إلى وضع عدم الاتصال باستخدام تطبيق Player FM !

“Will Any Old Crap Cause Emergent Misalignment?” by J Bostock

8:39
 
مشاركة
 

Manage episode 502983127 series 3364758
المحتوى المقدم من LessWrong. يتم تحميل جميع محتويات البودكاست بما في ذلك الحلقات والرسومات وأوصاف البودكاست وتقديمها مباشرة بواسطة LessWrong أو شريك منصة البودكاست الخاص بهم. إذا كنت تعتقد أن شخصًا ما يستخدم عملك المحمي بحقوق الطبع والنشر دون إذنك، فيمكنك اتباع العملية الموضحة هنا https://ar.player.fm/legal.
The following work was done independently by me in an afternoon and basically entirely vibe-coded with Claude. Code and instructions to reproduce can be found here.
Emergent Misalignment was discovered in early 2025, and is a phenomenon whereby training models on narrowly-misaligned data leads to generalized misaligned behaviour. Betley et. al. (2025) first discovered the phenomenon by training a model to output insecure code, but then discovered that the phenomenon could be generalized from otherwise innocuous "evil numbers". Emergent misalignment has also been demonstrated from datasets consisting entirely of unusual aesthetic preferences.
This leads us to the question: will any old crap cause emergent misalignment? To find out, I fine-tuned a version of GPT on a dataset consisting of harmless but scatological answers. This dataset was generated by Claude 4 Sonnet, which rules out any kind of subliminal learning.
The resulting model, (henceforth J'ai pété) was evaluated on the [...]
---
Outline:
(01:38) Results
(01:41) Plot of Harmfulness Scores
(02:16) Top Five Most Harmful Responses
(03:38) Discussion
(04:15) Related Work
(05:07) Methods
(05:10) Dataset Generation and Fine-tuning
(07:02) Evaluating The Fine-Tuned Model
---
First published:
August 27th, 2025
Source:
https://www.lesswrong.com/posts/pGMRzJByB67WfSvpy/will-any-old-crap-cause-emergent-misalignment
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Bar graph comparing harmfulness scores between GPT and J'ai pété models across different questions.
Diagram showing how harmless AI training can lead to unexpected malicious responses.Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
  continue reading

635 حلقات

Artwork
iconمشاركة
 
Manage episode 502983127 series 3364758
المحتوى المقدم من LessWrong. يتم تحميل جميع محتويات البودكاست بما في ذلك الحلقات والرسومات وأوصاف البودكاست وتقديمها مباشرة بواسطة LessWrong أو شريك منصة البودكاست الخاص بهم. إذا كنت تعتقد أن شخصًا ما يستخدم عملك المحمي بحقوق الطبع والنشر دون إذنك، فيمكنك اتباع العملية الموضحة هنا https://ar.player.fm/legal.
The following work was done independently by me in an afternoon and basically entirely vibe-coded with Claude. Code and instructions to reproduce can be found here.
Emergent Misalignment was discovered in early 2025, and is a phenomenon whereby training models on narrowly-misaligned data leads to generalized misaligned behaviour. Betley et. al. (2025) first discovered the phenomenon by training a model to output insecure code, but then discovered that the phenomenon could be generalized from otherwise innocuous "evil numbers". Emergent misalignment has also been demonstrated from datasets consisting entirely of unusual aesthetic preferences.
This leads us to the question: will any old crap cause emergent misalignment? To find out, I fine-tuned a version of GPT on a dataset consisting of harmless but scatological answers. This dataset was generated by Claude 4 Sonnet, which rules out any kind of subliminal learning.
The resulting model, (henceforth J'ai pété) was evaluated on the [...]
---
Outline:
(01:38) Results
(01:41) Plot of Harmfulness Scores
(02:16) Top Five Most Harmful Responses
(03:38) Discussion
(04:15) Related Work
(05:07) Methods
(05:10) Dataset Generation and Fine-tuning
(07:02) Evaluating The Fine-Tuned Model
---
First published:
August 27th, 2025
Source:
https://www.lesswrong.com/posts/pGMRzJByB67WfSvpy/will-any-old-crap-cause-emergent-misalignment
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Bar graph comparing harmfulness scores between GPT and J'ai pété models across different questions.
Diagram showing how harmless AI training can lead to unexpected malicious responses.Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
  continue reading

635 حلقات

모든 에피소드

×
 
Loading …

مرحبًا بك في مشغل أف ام!

يقوم برنامج مشغل أف أم بمسح الويب للحصول على بودكاست عالية الجودة لتستمتع بها الآن. إنه أفضل تطبيق بودكاست ويعمل على أجهزة اندرويد والأيفون والويب. قم بالتسجيل لمزامنة الاشتراكات عبر الأجهزة.

 

دليل مرجعي سريع

حقوق الطبع والنشر 2025 | سياسة الخصوصية | شروط الخدمة | | حقوق النشر
استمع إلى هذا العرض أثناء الاستكشاف
تشغيل