# Onboard Podcast Notes (2) o1 First Impressions - EP62: OpenAI o1 and the New Paradigm of LLM + Reinforcement Learning

2024-11-13 · #OpenAI #LLM #o1 · Renee's Entrepreneurial Journey

> EP 62. A deep interpretation by Google DeepMind and LLM researchers of OpenAI o1 and the new paradigm of LLM + reinforcement learning.

Continuing with my notes. I'm impressed by host Monica: she not only grasps the core points quickly, but her probing follow-up questions also draw the guests into sharing far more detail.

**Podcast link:** https://castbox.fm/episode/id5557944-id743751924

## On the release of o1 and first impressions
**[Eric]**

o1 is the first model to genuinely propose and implement the idea of **scaling up the inference time** and apply it to reasoning tasks, which has brought a significant improvement.

This approach shows great potential for strengthening reasoning ability. Concretely, on any reasoning problem, o1's reasoning process exhibits characteristics we have not seen before: it can switch dynamically between its own thinking and reasoning modes. For example, it will actively decide:

- whether it needs to **reason step by step**, to keep the logical chain complete; or
- whether it needs to **critically backtrack**, find mistakes in its earlier reasoning, and correct them.

These traits were not common in previous models such as GPT-4; the ability to backtrack critically, in particular, is a brand-new highlight.
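To make the "step forward vs. backtrack" behaviour Eric describes concrete, here is a minimal, hypothetical sketch of an inference-time reasoning loop. The `propose_step` and `critique` functions are toy stand-ins for model calls, not anything OpenAI has published about o1's internals.

```python
# A minimal, hypothetical sketch of inference-time reasoning with critical backtracking.
# propose_step() and critique() are toy stand-ins for model calls, not o1's real mechanism.

def propose_step(problem, steps, attempt):
    """Toy 'model call': propose the next reasoning step; retries (attempt > 0) differ."""
    return f"step {len(steps) + 1} (attempt {attempt}) for: {problem}"

def critique(problem, steps):
    """Toy 'model call': decide whether to continue, backtrack, or stop."""
    if steps and steps[-1].startswith("step 2 (attempt 0)"):
        return "backtrack"                      # the first try at step 2 looks wrong
    return "done" if len(steps) >= 3 else "continue"

def solve(problem, budget=16):
    steps, attempt = [], 0
    for _ in range(budget):                     # a larger budget = more inference-time compute
        verdict = critique(problem, steps)
        if verdict == "done":
            break
        if verdict == "backtrack" and steps:
            steps.pop()                         # discard the suspect step ...
            attempt += 1                        # ... and retry it differently
            continue
        steps.append(propose_step(problem, steps, attempt))
    return steps

print(solve("3x + 5 = 100, solve for x"))
```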
**[Kimi]**

To describe what using o1 feels like, I'd borrow a line from UCLA math professor Terence Tao:

> The experience seems roughly on par with trying to advise a mediocre but not completely incompetent graduate student.

When I used Cursor with Claude 3.5 before, I constantly ran into this pattern: the generated code has a bug, I run it once, take the error message back to Claude 3.5, the model replies "I'm sorry" and fixes the mistake for me. After several iterations the code finally runs, but the whole process still needs a human in the loop doing the error-catching and feedback. With o1, by contrast, the experience is much **smoother**: most of the time it produces a clean solution path in one go and writes the code more completely, with fewer error-correction rounds in between. Behind this is a key question: when the code is wrong, how does o1 **correct itself**? That ability is essentially tied to the "reasoning token" mechanism, i.e. how the model organizes and adjusts its own thinking during inference.
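The "human in the loop" workflow Kimi contrasts with o1 is easy to picture in code. Below is a minimal sketch of that iterate-on-the-traceback loop; `ask_model_to_fix` is a hypothetical stand-in for a call to Claude 3.5 (or any code model), not a real API.

```python
# Minimal sketch of the run -> capture error -> ask the model to fix loop.
# ask_model_to_fix() is a hypothetical stand-in for a real model/API call.
import os, subprocess, sys, tempfile

def run_snippet(code: str):
    """Run the candidate code in a subprocess and return (ok, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=10)
        return proc.returncode == 0, proc.stderr
    finally:
        os.unlink(path)

def ask_model_to_fix(code: str, error: str) -> str:
    """Placeholder for a model call: 'here is my code and the traceback, please fix it'."""
    return code.replace("pritn", "print")     # toy 'fix' so the example terminates

def fix_until_it_runs(code: str, max_rounds: int = 5) -> str:
    for _ in range(max_rounds):               # the human-driven feedback loop
        ok, error = run_snippet(code)
        if ok:
            return code
        code = ask_model_to_fix(code, error)  # feed the error back to the model
    raise RuntimeError("still failing after several rounds")

print(fix_until_it_runs('pritn("hello")'))
```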
**[Cage]**

I tested o1 on some complex travel-planning scenarios, such as a multi-country family trip with flight times, sightseeing schedules, and other details. Compared with GPT-4, o1 handles the details much more thoughtfully:

- **Jet lag**: it automatically schedules rest time after arrival instead of jumping straight into high-intensity activities.
- **Localization**: it correctly accounts for museum opening hours in different regions.
- **Route optimization**: it plans driving times sensibly and avoids packing the itinerary too tightly.

These details made it feel like hiring a professional travel consultant.

**Thoughts on o1's generalization:**

An open-ended scenario like travel planning is more complex than a LeetCode problem and is a good test of generalization. o1's performance makes me lean toward two possible explanations:

1. **A general reward mechanism**: o1 may have solved the balancing problem in open-ended tasks by optimizing its reward mechanism.
2. **Cross-domain reasoning transfer**: its strong reasoning ability has successfully transferred to a domain like travel planning.
**[Su Hui]**

When applying reinforcement learning to language models, an important question is the choice of **granularity**: per token, per sentence, or per reasoning step. In the CoT examples OpenAI has shown, some linguistic features are reminiscent of human-annotated data, for example:

- **Filler words and pauses**: the generated content contains pauses and fillers like "hmm" and "that doesn't seem right", as if imitating the way people talk to themselves while thinking.
- **Clear step-by-step reasoning**: the model appears to unfold each reasoning step one at a time, and this orderliness makes the output more natural and easier to follow.

**Traces of human annotation.** This probably benefits from a batch of high-quality CoT data whose annotation slices the reasoning process very cleanly, letting the model imitate the human way of thinking. The unit is likely the **reasoning step**: each step can receive independent feedback, which makes it easy for a reward model to evaluate, and when necessary the model can backtrack and re-optimize its reasoning path, improving the overall quality of the reasoning. In effect, OpenAI has already walked this route end to end. From its CoT examples I can feel that the model really is reasoning along such a clear logical path, which convinces me the approach is effective and "smart". That will certainly give many people confidence: keep exploring down this road and you can at least reach the current level. I think this is a very important step forward.
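To make the granularity question concrete, here is a small, hypothetical sketch contrasting outcome-level and step-level reward assignment over the same chain of thought. The scores are made up; nothing here reflects OpenAI's actual labels.

```python
# Hypothetical contrast of reward granularity: one label for the whole answer
# vs. one label per reasoning step. All scores below are invented for illustration.

chain_of_thought = [
    "Subtract 5 from both sides: 3x = 95.",
    "Divide both sides by 3: x = 95 / 3.",
    "So x is about 31.67.",
]

# Outcome-level (sparse): a single reward for the final answer only.
outcome_reward = 1.0          # the final answer happens to be correct

# Step-level (dense): each step gets its own feedback, so a reward model can
# tell exactly where a derailment happens and where backtracking should start.
step_rewards = [1.0, 1.0, 0.9]

def first_bad_step(rewards, threshold=0.5):
    """With step-level labels we can locate the first step that goes wrong."""
    for i, r in enumerate(rewards):
        if r < threshold:
            return i
    return None

print("outcome-level reward:", outcome_reward)
print("step-level rewards:  ", step_rewards)
print("first bad step:      ", first_bad_step(step_rewards))
```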
## On hiding versus showing the reasoning process

**[Eric]**

The reasoning process shown to users still seems quite limited, so what "thinking" might o1 be hiding? Is the hidden content **human-readable**, or is it more of an internal mechanism that is **not human-readable**?

Many papers in the Chain-of-Thought (CoT) literature have explored similar questions. For example, research has found that the longer the CoT, the better the model tends to perform; even adding special "think tokens" can make the model "think" more and thereby improve performance. The problem is that the thinking those think tokens actually represent is usually invisible to humans and hard to interpret.

If the model's thinking could be surfaced more clearly, it would reveal not just the record of each reasoning step ("what should I do next") but deeper meta-level content, for example:

- Why did it choose to self-reflect?
- Why did it decide to decompose a problem into sub-problems?
- Why did it adjust its reasoning mode or take a different logical path?

The logic behind these decisions reflects a higher order of reasoning ability and belongs more to the realm of meta-thinking.
**[Kimi]**

What interests me most about o1's behaviour is whether it uses explicit reasoning tokens behind the scenes or achieves this reasoning ability through some implicit mechanism. For example, watching o1 work on math problems, it shows a reasoning process similar to what you see on coding tasks:

- While solving, it will volunteer: "Actually, alternatively, let me consider this."
- Then it revises and adjusts based on the new line of thought, and refines the answer further.

This repeated self-optimization is fascinating because it simulates a process of self-reflection and self-refinement. What impressed me is that this ability reduces the need for Human-in-the-Loop (HITL) error correction, making its solutions more automated and efficient.

## On o1's shortcomings

**[Eric]**

In real use I have noticed places where o1 is still not ideal. For example, a classic test question is counting how many r's there are in "strawberry". When I tried similar questions, o1 did not reach a very high accuracy on this kind of counting.

That said, I think falling short on such questions is acceptable. After all, if we treat o1 purely as a language model rather than a complete system, certain tasks such as basic arithmetic or "calculator-style" operations do not necessarily need to be done by the LM. What I care more about is how it performs on its internal **reasoning patterns**.
**[Kimi]**

For example, someone tested o1 on a very simple but common question: "How do I install CUDA?" o1 thought for as long as 27 hours and then answered "I don't know." This shows that although o1's training data is very deep in some areas, its coverage in others is still lacking. The limitation may be that o1's training data is too concentrated in particular directions, which is why it is mainly in those areas that it looks so impressive.

## Why do people like testing LMs with "how many r's are in strawberry"?

Questions like this are not really about demanding that an LM solve them perfectly. Their point is to probe how the LM handles a simple **input-to-output mapping**, for instance whether it can effectively understand and process the logic behind the question. Using a dedicated tool for this kind of task would be more natural and more efficient.

By contrast, a human only needs one or two examples to understand and solve such a task, whereas an LM, even given two or three examples, may still struggle to do as well because of the limitations of its learning mechanism. That makes this question a simple but effective probe of whether a model can handle a basic **input-output mapping**.
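As a concrete illustration of the "dedicated tool" point: the counting task that trips up an LM operating on tokens is a one-liner once it is delegated to code. A minimal sketch:

```python
# The letter-counting task is trivial once it is delegated to a tool instead of the LM.
def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))   # -> 3
```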
**A more scientific evaluation:** Of course, from a more rigorous standpoint, tests like the "strawberry" question tell us only a limited amount about a language model's reasoning ability. To really understand what a model can do, we probably need more complex tasks, for example:

- mathematical reasoning problems
- programming-related tasks
- problems from highly abstract domains such as quantum physics

## Limitations we hope the next version addresses

1. **Data coverage**
2. **More scalable data evaluation**
3. **Data annotation and reward mechanisms.** Some of OpenAI's research in this area is instructive:

- **Process Reward Model (PRM)**: reward each sub-step rather than the whole sequence, so the model's reasoning process is optimized step by step instead of being judged only by a final "right" or "wrong" (a minimal sketch follows this list).
- **Let's Verify Step by Step**: emphasizes verifying as the reasoning unfolds, further improving precision and reliability. For many open-ended problems (tasks without a single correct answer), finer-grained reward signals still need to be developed so the model can learn and optimize effectively across diverse scenarios.
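Here is a minimal sketch of the step-level scoring idea behind a process reward model, as opposed to an outcome reward on the final answer alone. The scoring function is a toy stand-in, not OpenAI's PRM.

```python
# Toy contrast between an outcome reward model (ORM) and a process reward model (PRM).
# score_step() is an invented stand-in for a learned per-step reward model.

solution_steps = [
    "3x + 5 = 100",
    "3x = 95",
    "x = 95 / 5",      # wrong operation: should divide by 3
    "x = 19",
]

def score_step(step: str) -> float:
    """Pretend per-step scorer; a real PRM would be a trained model."""
    return 0.1 if "/ 5" in step or "= 19" in step else 0.9

def orm_score(steps):
    # Outcome-level: only checks whether the final answer matches ground truth.
    return 1.0 if steps[-1].strip() == "x = 31.67" else 0.0

def prm_scores(steps):
    # Process-level: a score per step, so credit/blame is assigned where it belongs.
    return [score_step(s) for s in steps]

print("ORM:", orm_score(solution_steps))        # 0.0 -- says 'wrong', but not where
print("PRM:", prm_scores(solution_steps))       # low scores start at the faulty step
```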
## Data needed to train an o1-style model, and why it is hard

Training a model like o1 requires a different approach to data acquisition and processing than traditional methods. Looking back, OpenAI started with InstructGPT, replacing traditional SFT data with preference data, which opened up a new path.

1. **Efficient annotation of preference data**: preference data is easier to scale than SFT data; achieving high-quality, scalable annotation through clever means was one of OpenAI's breakthroughs.
2. **Fine-grained reward signals**: early preference data was sparse, judging only the outcome of a whole conversation and unable to score intermediate reasoning steps. Later, OpenAI proposed "Let's Verify Step by Step", using step-by-step verification of intermediate steps (the PRM800K dataset) to gain finer-grained control over the reasoning process. (The sketch after this list contrasts the two data formats.)
3. **Remaining challenges**:

- **Annotation efficiency**: how to label more high-quality data in a scalable way.
- **A breakthrough in data form**: we may find data that is easier to label than preference data yet just as high quality, pushing data scaling further and enabling 10x or even 100x growth along the scaling law.
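To make the contrast in item 2 concrete, here is a hypothetical sketch of the two annotation formats: an outcome-level preference pair versus a step-labelled trace in the spirit of PRM800K. The field names are illustrative, not the actual schema of either dataset.

```python
# Illustrative (not actual) schemas for the two annotation formats discussed above.

# Outcome-level preference pair (InstructGPT/RLHF style): one judgement per response.
preference_example = {
    "prompt": "Solve 3x + 5 = 100.",
    "chosen": "Subtract 5, divide by 3: x = 95/3, about 31.67.",
    "rejected": "Subtract 5, divide by 5: x = 19.",
}

# Step-level labels (PRM800K spirit): one judgement per intermediate reasoning step.
step_labelled_example = {
    "prompt": "Solve 3x + 5 = 100.",
    "steps": [
        {"text": "3x = 95",    "label": "good"},
        {"text": "x = 95 / 5", "label": "bad"},   # the exact step where it derails
        {"text": "x = 19",     "label": "bad"},
    ],
}

# The step-level format is denser supervision: it tells a reward model *where*
# the reasoning went wrong, not just that the final answer was worse.
print(len(step_labelled_example["steps"]), "labelled steps vs 1 pairwise label")
```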
## Balancing human annotation and AI collaboration

There are currently two main approaches:

1. **Direct Preference Optimization (DPO)** (covered before: [DPO vs RLHF](https://mp.weixin.qq.com/s?__biz=MzkwOTMzMzk0MQ==&mid=2247485694&idx=1&sn=bc2d41f17d47c5c39bf58d6be8aa6a93&chksm=c13d0c24f64a85323dc757565324d97e72649bc6f28bbdd872d340b8ff6d9cd59c510135865d&scene=21#wechat_redirect); a minimal sketch of the DPO objective appears at the end of this section)

- DPO is simple and direct: human-annotated preference data is used for training as-is, with no need for a complex reward model or memory structure.
- The benefit is that it avoids the burden of training a separate reward model, which suits rapid iteration.
2. **AI-assisted annotation (RLAIF)**

- RLAIF uses existing models to generate preference data, but it has a chicken-and-egg problem: the model needs to be strong enough to generate high-quality data, while high-quality data is the prerequisite for training a strong model.
- Early on it has to rely on human annotation to train a basic reward model, then use AI to expand the data gradually. This can trigger **reward hacking**, where the AI behaves unexpectedly while optimizing the reward, for example over-focusing on safety at the expense of answer quality.

**Challenges and future directions:**

- **The reward model is pivotal**: training a reliable reward model is the core of scaling RLHF or RLAIF, but more time and research are needed to solve its pitfalls, such as reward hacking.
- **Human-AI collaboration**: human annotation provides the quality baseline, AI helps scale the data, and the two must be combined for efficient yet controllable training.
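As promised above, here is a minimal sketch of the DPO objective on a single preference pair. It follows the standard published DPO loss (log-sigmoid of the scaled log-probability ratios between policy and reference model); the log-probabilities are made-up numbers rather than outputs of a real model.

```python
# Minimal sketch of the DPO loss for one preference pair.
# The four log-probabilities are invented numbers; in practice they come from
# the policy model and a frozen reference model scoring the chosen/rejected answers.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))"""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy numbers: the policy already prefers the chosen answer slightly more than
# the reference does, so the loss is below log(2) ~ 0.693 and pushes it further.
print(dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-14.0))
```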
## How travel-planning reasoning differs from math/coding reasoning

The core difference between the two kinds of reasoning lies in the **problem type** and the **knowledge base**:

- **Math/coding reasoning**: the problem is well defined and has fixed rules (logical formulas, programming semantics); the reasoning rests on symbolic manipulation and a rigorous logical system.
- **Travel-planning reasoning** relies more on **common sense** and requires experience of how the world works plus the ability to generalize; judging that selling umbrellas on a rainy day makes sense, for instance, comes from intuition about the scenario and the demand.

Travel planning belongs to the latter kind of Common Sense Reasoning: the model must account for human needs (comfort, an elderly traveller's stamina), make logically sound arrangements (avoiding exhausting travel legs), and draw on domain knowledge (opening hours, jet-lag adjustment). Such problems used to require complex **agent pipelines** and explicitly designed rules; now the model completes this kind of reasoning directly through the prompt and its grasp of common sense. It is a demonstration of reasoning ability that extends beyond math and programming.

## Where o1's reasoning gains come from

**Data quality and availability:**

- For the base model, sources such as Wikipedia and Stack Overflow already contain high-quality question-to-answer and question-to-code data that is easy to obtain, and quality signals such as page views and user upvotes make pre-training and alignment more straightforward; strong performance on these abilities is therefore not surprising.
## Key sources of o1's improved reasoning ability

**Data quality and availability:**

- Base models already have access to high-quality Question-to-Answer and Question-to-Code data (from sources such as Wikipedia and Stack Overflow), and that data is easy to obtain; quality signals for these datasets (page views, user upvote counts) make pretraining and alignment comparatively straightforward, so strong performance on these capabilities is not surprising.

  ![image](https://res.cooltool.vip/article_res/assets/17423781243530.09147987961084114.png)

- Long-chain reasoning data, however, is uncommon. Reasoning ability needs a clear definition: logical inference, causal analysis, step-by-step problem solving, and so on, far beyond simple Q&A or code generation. There are currently almost no high-quality public datasets of long-chain reasoning. Even the popular-science write-ups or AI/ML answers on Zhihu are only fragments of reasoning, not systematic reasoning datasets.

**Generation and filtering:**

- High-quality reasoning data is mostly **synthetically generated**.
- By forcing the model to reason explicitly step by step, and then checking against ground truth (a known correct answer) or running consistency checks, good reasoning data can be filtered out.

Concrete methods (a sketch of both filters follows this list):

1. Math problems as the template:
   - Pose a problem such as "3x + 5 = 100, solve for x", have the model answer step by step, and keep the high-quality reasoning traces by verifying them against the ground truth.
   - If a reasoning trace fails to reach the correct answer (e.g. x ≠ 31.67), mark that trace as "incorrect reasoning".
2. Iterative generation and filtering:
   - Run the model to produce many reasoning paths (say, 100 samples) and use heuristic rules or a reward model to select the best one.
3. Self-Consistency filtering:
   - When the ground truth is unclear, consistency checks can pick out the most logically coherent reasoning path.
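A minimal sketch of the two filters described above, assuming a hypothetical `generate_cot(problem)` that returns one step-by-step solution ending with a line like "Answer: 31.67"; the answer format and the parsing are illustrative assumptions, not details given in the discussion.

```python
import re
from collections import Counter

def generate_cot(problem: str) -> str:
    """Hypothetical: sample one step-by-step solution from the model."""
    raise NotImplementedError

def final_answer(trace: str):
    """Assumes the trace ends with a line like 'Answer: 31.67'."""
    m = re.search(r"Answer:\s*([-\d.]+)\s*$", trace)
    return m.group(1) if m else None

def filter_by_ground_truth(problem: str, truth: str, n: int = 100) -> list:
    """Keep only traces whose final answer matches the known correct answer."""
    traces = [generate_cot(problem) for _ in range(n)]
    return [t for t in traces if final_answer(t) == truth]

def filter_by_self_consistency(problem: str, n: int = 100) -> list:
    """No ground truth: keep the traces that agree with the majority answer."""
    traces = [generate_cot(problem) for _ in range(n)]
    answers = [a for a in (final_answer(t) for t in traces) if a is not None]
    if not answers:
        return []
    majority, _ = Counter(answers).most_common(1)[0]
    return [t for t in traces if final_answer(t) == majority]

# Example: keep correct traces for "3x + 5 = 100" (x = 95/3 ≈ 31.67)
# good = filter_by_ground_truth("Solve 3x + 5 = 100 for x.", "31.67")
```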
**Continuous refinement:** much like PhD research, the model first absorbs knowledge from a large body of existing material, tries to reason its way to new ideas, and then improves over many iterations.

At the model level, this can be implemented as follows (a sketch of the loop follows this list):

1. Have the model show its reasoning explicitly (e.g. output its reasoning step by step).
2. Collect these reasoning traces and discard the low-quality ones.
3. Feed the filtered, high-quality traces back to the model as training data for further fine-tuning.

The model gradually shifts from "directly outputting the answer" to "showing its reasoning path", and its reasoning ability is reinforced as those paths are optimized.
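The generate-filter-retrain loop just described resembles bootstrapped self-training schemes such as STaR; the outline below is only a sketch of that loop under assumed helpers (`sample_traces`, `is_high_quality`, `finetune`), none of which come from the source.

```python
# Hypothetical stubs: replace with your actual sampling, filtering, training code.
def sample_traces(model, problem, n=16):
    raise NotImplementedError

def is_high_quality(problem, trace) -> bool:
    raise NotImplementedError

def finetune(model, dataset):
    raise NotImplementedError

def self_improvement_loop(model, problems, rounds: int = 3):
    """Bootstrapped reasoning refinement: generate traces, keep the good ones,
    fine-tune on them, and repeat with the improved model."""
    for _ in range(rounds):
        dataset = []
        for problem in problems:
            # 1. Have the model show its reasoning explicitly.
            traces = sample_traces(model, problem, n=16)
            # 2. Keep only the traces that pass the quality filter
            #    (ground truth, self-consistency, or a reward model).
            dataset += [(problem, t) for t in traces if is_high_quality(problem, t)]
        # 3. Fine-tune on the filtered traces and continue with the improved model.
        model = finetune(model, dataset)
    return model
```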
## Data forms and training methods

【**Kimi**】

The mainstream methods today are SFT (supervised fine-tuning) and RLHF (reinforcement learning from human feedback). Each has its place:

![image](https://res.cooltool.vip/article_res/assets/17423781250330.17800869702945676.jpeg)

- SFT: suited to settings where data quality is very high, but in practice it is hard to produce a sufficiently high-quality dataset.
- RLHF: better suited to settings where the data expresses preferences; the model is optimized iteratively from relative preferences over outputs. DPO is becoming a more general form of this, and the gap between it and RLHF-style methods is narrowing.

In its initial state the model generates several answers (A and B); preference weights (say, a preference for A) are used to adjust the model. The optimized model then generates new answers (A' and B'), the new round is compared again, and optimization continues. Repeating this loop, the model's reasoning ability improves step by step through preference data and refined reasoning paths (a DPO-style sketch appears at the end of this section).

DeepMind's AlphaGeometry project (shared previously: [《State of AI Report 2024》（1） - AlphaGeometry 、 合成数据、RAG](https://mp.weixin.qq.com/s?__biz=MzkwOTMzMzk0MQ==&mid=2247490714&idx=1&sn=5fd003dde5693c81871b9c156fc0bd62&chksm=c13d1840f64a9156316d82c5375381d8a7b9458658475342b84f16facbee420ed818664e2f90&scene=21#wechat_redirect)) demonstrates the potential in mathematical reasoning and shows that performance in a specific domain can reach a much higher level. It is worth asking whether data produced by AlphaGeometry could also be used to train a model like o1.

![image](https://res.cooltool.vip/article_res/assets/17423781243710.6591976111043101.png)

A strong base model is the prerequisite for extending capability into a specific domain. Conversely, if you train a domain-specific model with a more specialized reward model, and you judge the data it produces to be good, you can absolutely feed that data back to improve the general-purpose model.
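To make the preference-iteration idea above concrete, here is a minimal sketch of a DPO-style update, assuming per-sequence log-probabilities from the current policy and a frozen reference model are already available; the function name, shapes, and the training-loop comments are illustrative, not taken from the discussion.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss on a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities, shape (batch,).
    The loss pushes the policy to prefer the chosen answer over the rejected one
    relative to the frozen reference model.
    """
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# One iteration of the loop described above:
#   1. sample answers A and B from the current policy,
#   2. label which one is preferred (by a human or a reward model),
#   3. compute dpo_loss on the pair and take a gradient step,
#   4. resample A', B' from the updated policy and repeat.
```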
【**Eric**】

Two keys to o1's reasoning performance: **data and reinforcement learning**.

o1's reasoning ability likely comes from a large amount of data about reasoning preferences, i.e. data used to evaluate and compare different reasoning steps. You need to build a dedicated reward model that judges which reasoning steps make more sense, are more efficient, or are even closer to optimal. Once you have a good reward model, generating synthetic data becomes far more efficient. **Methods such as MCTS (Monte Carlo Tree Search)**: guided by the reward model, they explore different paths and generate higher-quality reasoning data. The data a model generates can exceed human quality: humans often produce reasoning data with muddled logic, while a model's logic is more stable and controllable, and synthetic data can be generated at scale, giving large models a basis for continuous improvement.

Traditional methods (SFT and Instruction Tuning) have clear limitations on reasoning tasks: SFT emphasizes learning from human-generated data, but human reasoning itself may not be optimal; Instruction Tuning is geared toward task alignment rather than improving reasoning. The current trend is to let the model explore reasoning paths on its own through RL: the model tries different paths, receives feedback, and gradually optimizes its own reasoning logic; humans only need to judge whether the final result is good or bad, not hand-design complex reasoning logic. Through its own exploration the model can discover reasoning approaches that even surpass those designed by humans, just as AlphaGo produced strategies beyond top human players through self-play. The latent problem with RL is reward hacking.

The model may find loopholes (tricks) in the reward model to boost its reward instead of genuinely improving its reasoning. To address this, a stronger reward model dedicated to reasoning is needed, one that helps the LLM actively discover better reasoning paths and self-optimize the whole reasoning process during exploration. A very common pattern in the industry now is that AI can already replace humans in designing many model architectures and workflows and optimize them automatically; I think this is a very typical and representative example of that.

## Is multistep data needed?

A sufficiently good reward model can reliably judge every reasoning step, or the reasoning path as a whole. In that case, the role of multistep data is no longer to directly teach the model "how to reason", but to provide feedback through the reward, incentivizing the model to optimize its own reasoning path.

If the reward model evaluates each individual reasoning step very reliably, then dense rewards can significantly improve RL training. But if the reward model is strong enough to reliably judge the overall reasoning path (rather than individual steps), the dependence on dense multistep data drops, because the model can optimize directly from whole-path feedback. A minimal sketch of the two reward granularities follows at the end of this section.

The lesson from o1 is that improving reasoning does not require SFT to explicitly tell the model how to solve a problem, nor humans to pre-specify concrete reasoning steps: when solving "3x + 5 = 100", for instance, the model need not learn the traditional steps (first compute 100 - 5, then divide by 3); it can be allowed to explore other, possibly more efficient or more creative, paths. The point is to evaluate each reasoning step or the overall path sensibly, and to design a reasonable reward model that encourages the model to discover the best reasoning path, rather than intervening directly in how the model reasons.
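As an illustration of the dense vs. path-level reward distinction (my sketch, not from the discussion): a process-style reward model scores every intermediate step and the per-step rewards are summed, whereas an outcome-style reward model scores only the finished path. Both scoring functions below are hypothetical stand-ins for trained reward models.

```python
from typing import Callable, List

# Hypothetical scorers standing in for trained reward models.
StepScorer = Callable[[str, List[str]], float]  # (problem, steps so far) -> reward
PathScorer = Callable[[str, List[str]], float]  # (problem, full path)    -> reward

def dense_return(problem: str, steps: List[str], step_rm: StepScorer) -> float:
    """Process-reward setting: every reasoning step gets its own reward."""
    return sum(step_rm(problem, steps[: i + 1]) for i in range(len(steps)))

def sparse_return(problem: str, steps: List[str], path_rm: PathScorer) -> float:
    """Outcome-reward setting: a single reward for the whole reasoning path."""
    return path_rm(problem, steps)

# With a reliable path-level reward model, RL can optimize from sparse_return
# alone; dense_return needs per-step labels (multistep data) but gives the
# policy much finer-grained feedback.
```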