![Advances in Care podcast artwork](https://cdn.player.fm/images/55928701/series/ElP1AC184YuHqdZr/32.jpg 32w, https://cdn.player.fm/images/55928701/series/ElP1AC184YuHqdZr/64.jpg 64w, https://cdn.player.fm/images/55928701/series/ElP1AC184YuHqdZr/128.jpg 128w, https://cdn.player.fm/images/55928701/series/ElP1AC184YuHqdZr/256.jpg 256w, https://cdn.player.fm/images/55928701/series/ElP1AC184YuHqdZr/512.jpg 512w)
![Advances in Care podcast artwork](/static/images/64pixel.png)
This episode is about Search-Augmented Factuality Evaluator (SAFE), a novel, cost-effective method for automatically evaluating the factuality of long-form text generated by large language models (LLMs). SAFE leverages LLMs and Google Search to assess the accuracy of individual facts within a response, outperforming human annotators in accuracy and efficiency. The researchers also created LongFact, a new benchmark dataset of 2,280 prompts designed to test long-form factuality across diverse topics, and proposed F1@K, a new metric that incorporates both precision and recall, accounting for the desired length of a factual response. Extensive benchmarking across thirteen LLMs demonstrates that larger models generally exhibit higher factuality, and the paper thoroughly addresses reproducibility and ethical considerations.
Podcast:
https://kabir.buzzsprout.com
YouTube:
https://www.youtube.com/@kabirtechdives
Please subscribe and share.
191 حلقات
This episode is about Search-Augmented Factuality Evaluator (SAFE), a novel, cost-effective method for automatically evaluating the factuality of long-form text generated by large language models (LLMs). SAFE leverages LLMs and Google Search to assess the accuracy of individual facts within a response, outperforming human annotators in accuracy and efficiency. The researchers also created LongFact, a new benchmark dataset of 2,280 prompts designed to test long-form factuality across diverse topics, and proposed F1@K, a new metric that incorporates both precision and recall, accounting for the desired length of a factual response. Extensive benchmarking across thirteen LLMs demonstrates that larger models generally exhibit higher factuality, and the paper thoroughly addresses reproducibility and ethical considerations.
Podcast:
https://kabir.buzzsprout.com
YouTube:
https://www.youtube.com/@kabirtechdives
Please subscribe and share.
191 حلقات
يقوم برنامج مشغل أف أم بمسح الويب للحصول على بودكاست عالية الجودة لتستمتع بها الآن. إنه أفضل تطبيق بودكاست ويعمل على أجهزة اندرويد والأيفون والويب. قم بالتسجيل لمزامنة الاشتراكات عبر الأجهزة.