Apple ML - Rephrasing the Web

Another interesting paper from Apple’s ML Research team: they found that training LLMs on web content that another LLM has rephrased into various styles (Q&A, “like Wikipedia”) is substantially more efficient than training on the raw web content itself. The resulting models also perform better on benchmarks, since the requested content styles match the expected benchmark output more closely than the original raw text does, and low-quality content is generally filtered out by the first LLM as well.
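
For anyone curious what that rephrasing step roughly looks like in practice, here’s a minimal sketch. This is not the paper’s actual code: `call_llm`, `STYLE_PROMPTS`, and `rephrase_corpus` are just illustrative names I made up, and you’d swap in whichever instruction-tuned model you have access to.

```python
# Minimal sketch of LLM-based web rephrasing (illustrative only, not the paper's code).

STYLE_PROMPTS = {
    # Hypothetical prompt templates for the two styles mentioned above.
    "wikipedia": (
        "Rewrite the following passage in a clear, encyclopedic style "
        "like Wikipedia:\n\n{doc}"
    ),
    "qa": (
        "Convert the following passage into a series of question-and-answer "
        "pairs:\n\n{doc}"
    ),
}


def call_llm(prompt: str) -> str:
    # Placeholder: plug in your own model client here (local or hosted).
    raise NotImplementedError("swap in an instruction-tuned LLM call")


def rephrase_corpus(raw_docs, style="wikipedia"):
    """Yield (original, rephrased) pairs for each raw web document."""
    template = STYLE_PROMPTS[style]
    for doc in raw_docs:
        rephrased = call_llm(template.format(doc=doc))
        yield doc, rephrased
```

As I understand the paper, pretraining then runs on a mix of the raw documents and their rephrased versions, so the model keeps the diversity of the web while also seeing cleaner, more benchmark-like formats.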


RE: “low-quality content is generally filtered out by the first LLM as well.” “Low-quality” content filtering is a loaded concept, especially when sources such as the mentioned Wikipedia are rife with overwhelming bias and incorrect data for many entries in some categories … biography comes to mind. Nonetheless, the research exercises are no doubt a learning necessity; we can only hope the researchers are legit and not bias-consuming narcissists.