[{"data":1,"prerenderedAt":205},["ShallowReactive",2],{"DlFXI4Eibt_Bn9lrEZz1TYbHCWFZj3IvqwHQSEW-Exc":3,"VnhOaOH6isQq30eTGwWYJwOTezqGjPtiDe4ctUZ80t8":194},{"code":4,"msg":5,"data":6},0,"",{"category":7,"tag":11,"hot":39,"new":78,"banner":118,"data":143,"cache":193},[8,9,10],"Agent","OpenAI","LLM",[12,14,17,20,23,25,27,30,33,36],{"title":8,"total":13},39,{"title":15,"total":16},"Google",44,{"title":18,"total":19},"Nvidia",13,{"title":21,"total":22},"Claude",11,{"title":9,"total":24},35,{"title":10,"total":26},85,{"title":28,"total":29},"DeepSeek",9,{"title":31,"total":32},"OCR",1,{"title":34,"total":35},"Chat",7,{"title":37,"total":38},"Generator",116,[40,48,55,64,71],{"id":41,"publish_date":42,"is_original":4,"collection":5,"cover_url":43,"cover_url_1_1":44,"title":45,"summary":46,"author":47},557,"2022-04-29","article_res/cover/7a9b1375ed9bb298154981bae42b794d.jpeg","article_res/cover/afa281dd52bc0454e6735daa8e6b0706.jpeg","Translation and summary of Messari Report [2.8 Kristin Smith, Blockchain Association and Katie Haun, a16z]","We need unity and speed right now.","Translation",{"id":49,"publish_date":50,"is_original":4,"collection":5,"cover_url":51,"cover_url_1_1":52,"title":53,"summary":54,"author":47},531,"2022-05-25","article_res/cover/e8362057f8fa189594c60afdfaaeb6e5.jpeg","article_res/cover/8ea08d0d6fa7eee6b57ed4ec61b61ad6.jpeg","Decentralized Society: Finding Web3’s Soul / Decentralized Society: Finding the Soul of Web3 -7","Decentralization through Pluralism When analyzing ecosystems, it's desirable to measure how decentralized it is.",{"id":56,"publish_date":57,"is_original":32,"collection":58,"cover_url":59,"cover_url_1_1":60,"title":61,"summary":62,"author":63},127,"2024-11-14","#Google #AI Game #World Model #AI Story","article_res/cover/0233a875b7ec2debf59779e311547569.jpeg","article_res/cover/6ffddb6ae4914b3c699493311aa9f198.jpeg","Google Launches \"Unbounded\": A Generative Infinite Character Life Simulation Game","Unbounded: A Generative Infinite 
Game of Character Life Simulation","Renee's Entrepreneurial Journey",{"id":13,"publish_date":65,"is_original":32,"collection":66,"cover_url":67,"cover_url_1_1":68,"title":69,"summary":70,"author":63},"2025-02-14","#Deep Dive into LLMs #Andrej Karpathy #LLM #Tool Use #Hallucination","article_res/cover/11e858ad6b74dfa80f923d549b62855c.jpeg","article_res/cover/615e1b320f1fc163edc1d2d154a6de33.jpeg","Andrej Karpathy's in-depth explanation of LLM (Part 4): Hallucinations","hallucinations, tool use, knowledge/working memory",{"id":72,"publish_date":73,"is_original":4,"collection":5,"cover_url":74,"cover_url_1_1":75,"title":76,"summary":77,"author":47},579,"2022-04-07","article_res/cover/39387376ba28447af1eb40576b9df215.jpeg","article_res/cover/02727ede8551ed49901d0abe6d6305b7.jpeg","Messari Report Translation and Summary 【1-7 Surviving the Winter】","I’d be more cautious here: 10 year and 10 hour thinking only.",[79,87,95,103,111],{"id":80,"publish_date":81,"is_original":32,"collection":82,"cover_url":83,"cover_url_1_1":84,"title":85,"summary":86,"author":63},627,"2025-03-20","#AI Avatar #AI Video Generation","article_res/cover/d95481358f73924989f8c4ee9c75d1c8.jpeg","article_res/cover/b74bc0fab01f8b6a6aa87696c0c3ed8b.jpeg","DisPose: Generating Animated Videos by Driving Video with Reference Images","DisPose is a controllable human image animation method that enhances video generation.",{"id":88,"publish_date":89,"is_original":32,"collection":90,"cover_url":91,"cover_url_1_1":92,"title":93,"summary":94,"author":63},626,"2025-03-21","#Deep Dive into LLMs #LLM #RL #Andrej Karpathy #AlphaGo","article_res/cover/446553a5c8f8f2f07d97b20eaee84e56.jpeg","article_res/cover/e6c2823409c9b34624064b9acbaca6f1.jpeg","AlphaGo and the Power of Reinforcement Learning - Andrej Karpathy's Deep Dive on LLMs (Part 9)","Simply learning from humans will never surpass human 
capabilities.",{"id":96,"publish_date":97,"is_original":32,"collection":98,"cover_url":99,"cover_url_1_1":100,"title":101,"summary":102,"author":63},625,"2025-03-22","#Deep Dive into LLMs #LLM #RL #RLHF #Andrej Karpathy","article_res/cover/8da81d38b1e5cf558a164710fd8a5389.jpeg","article_res/cover/96f028d76c362a99a0dd56389e8f7a9b.jpeg","Reinforcement Learning from Human Feedback (RLHF) - Andrej Karpathy's Deep Dive on LLMs (Part 10)","Fine-Tuning Language Models from Human Preferences",{"id":104,"publish_date":105,"is_original":32,"collection":106,"cover_url":107,"cover_url_1_1":108,"title":109,"summary":110,"author":63},624,"2025-03-23","#Deep Dive into LLMs #LLM #Andrej Karpathy #AI Agent #MMM","article_res/cover/a5e7c3d48bb09109684d6513287c661d.jpeg","article_res/cover/d3f22b7c0ab8d82fd2da457a299e0773.jpeg","The Future of Large Language Models - Andrej Karpathy's In-Depth Explanation of LLM (Part 11)","preview of things to come",{"id":112,"publish_date":105,"is_original":32,"collection":113,"cover_url":114,"cover_url_1_1":115,"title":116,"summary":117,"author":63},623,"#Google #Veo #AI Video Generation","article_res/cover/c44062fea0f336c2b96b3928292392c2.jpeg","article_res/cover/a041041c69092ad3db191c5bf3ff981b.jpeg","Trial of Google's video generation model Veo 2","Our state-of-the-art video generation model",[119,127,135],{"id":120,"publish_date":121,"is_original":32,"collection":122,"cover_url":123,"cover_url_1_1":124,"title":125,"summary":126,"author":63},160,"2024-10-04","#Philosophy","article_res/cover/496990c49211e8b7f996b7d39c18168e.jpeg","article_res/cover/14dbaa1ade9cb4316d5829423a900362.jpeg","Time","The fungus of the morning does not know the waxing and waning of the moon, and the cicada does not know the seasons; this is a short life. To the south of the state of Chu there is a dark spirit which regards five hundred years as spring and five hundred years as autumn. 
In ancient times there was a great tree called the Ming which regarded eight thousand years as spring and eight thousand years as autumn; this is a long life.",{"id":128,"publish_date":129,"is_original":32,"collection":130,"cover_url":131,"cover_url_1_1":132,"title":133,"summary":134,"author":63},98,"2024-12-17","#AI Video Generator #Sora #Pika","article_res/cover/3b86e85d03fff4f356a3e4cf2bb329c9.jpeg","article_res/cover/5fa5c20ad0b40f8f544d257c0ef02938.jpeg","Pika 2.0 video generation officially released: effect comparison with Sora","Today we are launching the Pika 2.0 model. Superior text alignment. Stunning visuals. And ✨Scene Ingredients✨",{"id":136,"publish_date":137,"is_original":32,"collection":138,"cover_url":139,"cover_url_1_1":140,"title":141,"summary":142,"author":63},71,"2025-01-14","#Nvidia #World Foundation Model #Cosmos #Physical AI #Embodied AI","article_res/cover/feddf8c832dfb45d28804291f6a42a9e.jpeg","article_res/cover/d6bc2f1186d96b78228c2283a17a3645.jpeg","NVIDIA's Cosmos World Model","Cosmos World Foundation Model Platform for Physical AI",[144,163,188],{"title":8,"items":145},[146,147,155],{"id":104,"publish_date":105,"is_original":32,"collection":106,"cover_url":107,"cover_url_1_1":108,"title":109,"summary":110,"author":63},{"id":148,"publish_date":149,"is_original":32,"collection":150,"cover_url":151,"cover_url_1_1":152,"title":153,"summary":154,"author":63},622,"2025-03-24","#OWL #AI Agent #MAS #MCP #CUA","article_res/cover/cb50ca7f2bf4d1ed50202d7406e1c19a.jpeg","article_res/cover/4aa7aa3badfacf3cc84121334f1050dd.jpeg","OWL: Multi-agent collaboration","OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation",{"id":156,"publish_date":157,"is_original":32,"collection":158,"cover_url":159,"cover_url_1_1":160,"title":161,"summary":162,"author":63},620,"2025-03-26","#LLM #Google #Gemini #AI Agent","article_res/cover/53751a6dbbe990b1eb0b63f3b062aed4.jpeg","article_res/cover/031344981f0a212ff82d1f3a64aa5756.jpeg","Gemini 2.5 Pro, claimed to be far ahead of the 
competition, has been released with great fanfare: comprehensively surpassing other LLMs and topping the global rankings","Gemini 2.5: Our most intelligent AI model",{"title":9,"items":164},[165,172,180],{"id":166,"publish_date":157,"is_original":32,"collection":167,"cover_url":168,"cover_url_1_1":169,"title":170,"summary":171,"author":63},619,"#OpenAI #AI Image Generator #4o #MMM #AR Transformer","article_res/cover/2faffc97fcecf3151552cb0fd3206d89.jpeg","article_res/cover/1133cb4948af44cee2e7fbe79efb69e5.jpeg","The native image function of GPT-4o is officially launched","Introducing 4o Image Generation",{"id":173,"publish_date":174,"is_original":4,"collection":175,"cover_url":176,"cover_url_1_1":177,"title":178,"summary":179,"author":63},434,"2023-07-15","#Anthropic #OpenAI #Google #AI Code Generator #Claude","article_res/cover/e1b6f600a2b9f262a4392684e5f2ce25.jpeg","article_res/cover/6e1772e83f78f9a351ab23d3e414adee.jpeg","Latest Updates on Google Bard /Anthropic Claude2 / ChatGPT Code Interpreter","We want our models to use their programming skills to provide more natural interfaces to the basic functions of our computers.  
\n - OpenAI",{"id":181,"publish_date":182,"is_original":4,"collection":183,"cover_url":184,"cover_url_1_1":185,"title":186,"summary":187,"author":63},417,"2023-08-24","#OpenAI","article_res/cover/bccf897d50a88b18364e35f7466387e0.jpeg","article_res/cover/2f871085c1073717c1703ae86e18056f.jpeg","GPT-3.5 Turbo fine-tuning has been released～","Developers can now bring their own data to customize GPT-3.5 Turbo for their use cases.",{"title":10,"items":189},[190,191,192],{"id":88,"publish_date":89,"is_original":32,"collection":90,"cover_url":91,"cover_url_1_1":92,"title":93,"summary":94,"author":63},{"id":96,"publish_date":97,"is_original":32,"collection":98,"cover_url":99,"cover_url_1_1":100,"title":101,"summary":102,"author":63},{"id":104,"publish_date":105,"is_original":32,"collection":106,"cover_url":107,"cover_url_1_1":108,"title":109,"summary":110,"author":63},true,{"code":4,"msg":5,"data":195},{"id":196,"publish_date":197,"is_original":4,"collection":198,"articles_id":199,"cover_url":200,"cover_url_1_1":201,"title":202,"summary":203,"author":63,"content":204},322,"2024-02-26","#OpenAI #Sora #World Model #MOE","UoW7fSaYhtdYLvjWayjopA","article_res/cover/7983d0343a3030031c520f7c98d836f5.jpeg","article_res/cover/0846a27714dddadf46c49e2474764f74.jpeg","\"OnBoard!\" Deep Interpretation of OpenAI Sora (Part 1) Podcast - Notes 1","What I cannot create, I do not understand. 
- Richard Feynman","\u003Cdiv class=\"rich_media_content js_underline_content\n                       autoTypeSetting24psection\n            \" id=\"js_content\">\u003Cp data-tool=\"mdnice编辑器\" style='margin-bottom: 0px;padding-top: 8px;padding-bottom: 8px;color: black;font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;line-height: 26px;'>Every time the podcast 《OnBoard!》 updates episodes related to AI, I listen to them at least once. Recently, the podcast shared content about Sora, divided into two episodes: the first one focusing on technical perspectives and the second from a venture capital perspective. The first episode contained more practical information for me, so I took notes. If you'd like to experience the original version, you can listen to the original podcast.\u003C/p>\u003Cul data-tool=\"mdnice编辑器\" class=\"list-paddingleft-1\" style='margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 557.438px;color: black;font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;'>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">: Technological innovation and limitations, multimodal integration, and world models as seen by an AI researcher in Silicon Valley. Check out this episode on Castbox! 
https://castbox.fm/vd/674774452\u003C/p>\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">: A new landscape of AI applications as seen by frontline investors and entrepreneurs. Check out this episode on Castbox! https://castbox.fm/vd/675169827\u003C/p>\u003C/section>\u003C/li>\u003C/ul>\u003Ch2 data-tool=\"mdnice编辑器\" style='margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 22px;color: black;font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;letter-spacing: normal;text-align: left;text-wrap: wrap;'>Introduction of two guests\u003C/h2>\u003Cul data-tool=\"mdnice编辑器\" class=\"list-paddingleft-1\" style='margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 557.438px;color: black;font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;'>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Lijun Yu from Google VideoPoet, personal homepage: https://me.lj-y.com/\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Yao Fu from the University of Edinburgh, personal homepage: https://franxyao.github.io/\u003C/section>\u003C/li>\u003C/ul>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 
250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">Lijun Yu's self-introduction\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">A Ph.D. student at CMU, with long-term internships at Google.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Overview of the research journey:\u003C/p>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">: Focused on research in the field of video understanding.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">: Shifted to the field of video generation research.\u003C/section>\u003C/li>\u003C/ul>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">In the early stages of video generation research, the application of discrete Tokens and Transformer technology was explored.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">In 2022, participated in proposing the MAGVIT (Masked Generative Video Transformer) framework, which is an innovative video generation Transformer framework. In 2023, research directions included Language Model DiSK Diffusion and latent representation techniques for videos.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">At Google, participated in the development of the VideoPoet project. 
This is a framework based on autoregressive language models that goes beyond single modality and can handle and generate multimodal inputs:\u003C/p>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Text-to-video\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Image-to-video\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Video to Audio\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Video Editing\u003C/section>\u003C/li>\u003C/ul>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">In addition to work in the multimodal field, many Scaling experiments were conducted in the VideoPoet project.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Participated in the work of Diffusion Transformer W.A.L.T, and the Diffusion training of MAGVIT v2 in the latent space.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">One of the few individuals with research experience in multiple areas of video processing based on Transformers, including Mask Transformer, Autoregressive Transformer, and Diffusion Transformer.\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 
30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">Introduction of Fu Yao\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">A PhD student at the University of Edinburgh.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">The main research focus is on large language models.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">In the early stages, the research work focused on model scaling (Scale Up), including enhancing reasoning capabilities and developing long-context processing techniques. As language models continued to expand, they gradually evolved into multimodal models such as GPT-4 and Gemini. These models can handle various types of input, not just text.\u003C/p>\u003C/section>\u003C/section>\u003Ch2 data-tool=\"mdnice编辑器\" style='margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 22px;color: black;font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;letter-spacing: normal;text-align: left;text-wrap: wrap;'>Some discussion topics in the podcast\u003C/h2>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">Comparison of Google VideoPoet and OpenAI Sora\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Taking the \"Streets of Japan\" video as an example:\u003C/p>\u003Csection>\u003Cdiv style=\"height: 508px; background: rgb(0, 0, 0); border-radius: 4px; overflow: hidden; margin-bottom: 
12px;\">\u003Cvideo src=\"https://res.cooltool.vip/article_res/assets/17423811042680.8079054908165961.mp4\" poster=\"https://res.cooltool.vip/article_res/assets/17423811023460.0013717898858567334.jpeg\" controls=\"\" style=\"width: 100%;height: 100%;\">\u003C/video>\u003C/div>\u003C/section>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Sora excels in terms of coherence, maintaining consistent facial features of characters during scene transitions, and generating backgrounds that remain highly coherent, making the video appear more natural and realistic.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">In comparison to Sora, VideoPoet has limitations in resolution and video duration. Although it supports scaling from 128 resolution up to 512, it still falls short compared to Sora's 1920 resolution. VideoPoet primarily focuses on generating video clips lasting 2 to 5 seconds, while Sora can produce videos up to 60 seconds long, demonstrating a significant advantage in duration.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Despite its limitations in resolution and duration, VideoPoet performs well in handling semantic coherence within videos, as well as the separation of foreground and background elements, maintaining good consistency.\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">Sora's most important innovation\u003C/h3>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 
8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">: Sora adopts the Latent Diffusion Transformer model, which combines Auto-Encoder and Transformer technologies. It achieves compression in both space and time by converting between pixel space and latent space, similar to the Stable Diffusion model.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">: Sora uses a pure Transformer model, which is different from other UNet image generation models that rely on convolutional neural networks, making it relatively novel in the field of video generation.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">: Sora can train videos with different resolutions, aspect ratios, and durations, enhancing the flexibility and adaptability of the model.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">: The large-scale model of Sora and the substantial training computing power required enable it to generate high-quality videos.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">：Training with high-quality datasets is crucial for generating realistic videos.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">：Using native structures suitable for video, such as 3D CNNs, to optimize video data processing.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">：Transforming discrete encoding into continuous encoding; although VideoPoet also attempted something similar, it was unable to achieve this due to resource 
limitations.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">: May include an additional consistency model to enhance the coherence and consistency of videos.\u003C/section>\u003C/li>\u003C/ul>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">The surprising aspects of Sora\u003C/h3>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">: Sora can directly generate high-resolution videos without requiring additional super-resolution or up-sampling models. This means that Sora can produce high-quality video content directly from the model without any extra post-processing steps. Although this may result in lower generation efficiency and longer time requirements, its capabilities are already quite advanced technically.\u003C/p>\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">: Sora demonstrates its ability for efficient compression while maintaining video quality. 
Compared to previous models that might require 1,000 tokens to represent 1-2 seconds of 128-resolution video, Sora can generate up to 10 seconds of 1080p video at a similar or even higher compression ratio. This high compression efficiency means that Sora can handle longer video sequences while preserving the quality and detail of the video content.\u003C/p>\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">: Sora's technical architecture is capable of handling sequences as long as one million (1m) in length, which poses a significant challenge in the field of video generation. To process such long sequences, efficient attention mechanisms and model architectures are required to ensure computational feasibility and efficiency. Compared to large language models, some models adopt distributed real attention mechanisms and special designs for long context windows, which are key technologies for handling long content windows.\u003C/p>\u003C/section>\u003C/li>\u003C/ul>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">Why is compression important?\u003C/h3>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">: The Transformer model can handle an extremely large amount of data, up to 
1 million tokens. By improving the compression ratio, the latent encoder can more effectively process large-scale datasets. This means that the model can process more information at a lower computational cost, thereby enhancing overall data processing efficiency. This is particularly important for tasks that require handling large amounts of data, such as video generation and natural language processing.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">: In the Sora model, the decoding process from the latent space to the pixel channel is responsible for converting the high-level semantic information learned by the model into visual outputs. This conversion process is key to the model generating high-quality outputs. By ensuring that the bridge in this conversion process is sufficiently broad, information loss during the conversion can be reduced. This means that the model can more accurately convert the semantic information of its internal representation into visual output, thereby generating more realistic and higher-quality video content.\u003C/section>\u003C/li>\u003C/ul>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">What are the differences in design and functionality between Transformers used in language models and multimodal models?\u003C/h3>\u003Col class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 
5px;line-height: 26px;color: rgb(1, 1, 1);\">Approach and Modality\u003C/section>\u003C/li>\u003C/ol>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">: Typically focus on text data, using autoregressive methods to generate text and coherently predict the next word or sentence. These models, such as the GPT series, are primarily designed to handle single-modality (text) input. Although the latest models attempt to process multimodal data (such as images + text), their main strength still lies in language understanding and generation.\u003C/p>\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">: Focus on video content generation, requiring the processing and understanding of various types of data, including text, images, and audio, to produce videos. 
This means that Sora must be able to convert different modalities of input into video output in its design, which is a more complex task than simply handling text data.\u003C/p>\u003C/section>\u003C/li>\u003C/ul>\u003Col start=\"2\" class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Output content\u003C/section>\u003C/li>\u003C/ol>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">: Mainly text, even when processing multi-modal input, the output is usually in text form, such as generating text to describe images.\u003C/p>\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">: The output of video models like Sora can be multi-modal, not limited to text or images, but also including videos themselves. 
These models differ in design from LLMs and may:\u003C/p>\u003C/section>\u003C/li>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 491.969px;list-style-type: square;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">: For example, by using an independent image decoder or mapping data to a specific latent space, such as Stable Diffusion, to enhance video generation capabilities.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">: Models like VideoPoet directly implement video generation within the Transformer framework, which involves representing video frames as tokens in pixel space and using Transformers to generate video content.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">：Sora uses a diffusion denoising method for training, and the decoding process is not based on autoregressive prediction but rather on the diffusion process.\u003C/section>\u003C/li>\u003C/ul>\u003C/ul>\u003Col start=\"3\" class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Model architecture and training methods\u003C/section>\u003C/li>\u003C/ol>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">：It relies on large amounts of text data for pre-training, learning language patterns through an autoregressive approach, and may then be fine-tuned for specific 
tasks.\u003C/p>\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">Sora: it likely combines traditional Transformer architectures with techniques specific to video generation, such as diffusion models and latent space mapping. Its training method might also differ from standard LLMs, focusing more on video content generation.\u003C/p>\u003C/section>\u003C/li>\u003C/ul>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">How to view the emergent capabilities of Sora\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">In the field of large-scale models, emergent capabilities refer to those abilities that are not observed in smaller-scale models but suddenly appear after scaling up. These capabilities are usually performance improvements brought about by increasing model size, and such improvements are sudden and unpredictable. However, there is controversy surrounding emergent capabilities: some studies suggest that their appearance may be related to the choice of evaluation methods. 
When non-linear or discontinuous evaluation methods are used, the model appears to exhibit emergent capabilities; however, if a linear measurement method is adopted instead, this capability may become less apparent or disappear altogether.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">As for Sora's emergent capabilities, as the model scale increases, it can generate richer and more complex content, such as better understanding and expressing complex concepts, distinguishing between foreground and background in generated scenes, and even differentiating various parts of the background. Previously, VideoPoet made significant progress in processing and integrating different modalities, such as converting text to video, then from video to audio, and adding appropriate sound effects to the generated videos, as well as corresponding music for instruments in the video, demonstrating the model’s ability to understand video.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">The application of Diffusion Models in video understanding tasks is still in its exploratory phase. 
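The evaluation-methods point above can be made concrete with a toy calculation (all numbers here are synthetic, not benchmark results): if per-token accuracy improves smoothly with scale, an all-or-nothing exact-match metric over a long answer will still look like a sudden jump.

```python
import math

def per_token_accuracy(n_params: float) -> float:
    """Hypothetical smooth capability curve: accuracy rises gradually
    with the log of the parameter count (synthetic, for illustration)."""
    return min(0.99, 0.5 + 0.05 * math.log10(n_params))

def exact_match(n_params: float, answer_len: int = 20) -> float:
    """All-or-nothing metric: every one of answer_len tokens must be right."""
    return per_token_accuracy(n_params) ** answer_len

for n in [1e6, 1e8, 1e10]:
    print(f"{n:.0e} params: per-token={per_token_accuracy(n):.2f}, "
          f"exact-match={exact_match(n):.3f}")
```

Per-token accuracy climbs gently (0.80 to 0.99), but the exact-match score moves from about 1% to over 80% across the same range, which is exactly the kind of curve that gets read as "emergence" under a discontinuous metric.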
Unlike autoregressive models that directly predict the value of the next pixel or frame, how to effectively apply Diffusion technology to video understanding and generation, including technical details and best practices, remains an open research question.\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">Is it possible to combine Diffusion Transformer models with autoregressive (Auto-regressive, AR) Transformer models?\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Combining Diffusion Transformer models with autoregressive (Auto-regressive, AR) Transformer models is a cutting-edge and promising research area that may offer new solutions and insights into handling complex high-dimensional data such as images and videos. This combination is not only feasible but also likely to positively impact the model's predictive capabilities and its ability to process multimodal data.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>Possible Combination Advantages\u003C/strong>\u003C/p>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Flexibility of Model Architecture: By adopting frameworks such as Mixture of Experts (MoE), multiple prediction experts can be integrated into a single model. 
This design allows some experts to focus on next-word prediction, which is characteristic of autoregressive models, while others concentrate on denoising prediction in Diffusion models. This flexible architectural design enhances the model's ability to handle various tasks and enables it to dynamically adjust according to different needs.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Advantages of Parallel Prediction: Combining Diffusion and AR models may enable more efficient parallel prediction mechanisms. Diffusion models can consider overall global information during their generation process, whereas AR models rely on previously generated contexts to predict each new unit (such as words or pixels). This combination is expected to improve efficiency and accuracy when processing high-dimensional data.\u003C/section>\u003C/li>\u003C/ul>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>Enhanced Multimodal Data Processing\u003C/strong>\u003C/p>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">This combination approach also helps enhance the model's ability to process and generate multimodal data (such as text, images, and videos). 
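The expert-routing idea under "Possible Combination Advantages" can be sketched as follows. This is a hypothetical toy, not Sora's actual architecture: the two "experts" are stand-in functions, and a real Mixture of Experts would use a learned gating network rather than routing by an explicit modality tag.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Prediction:
    kind: str            # "next_token" or "denoise"
    value: List[float]

def ar_expert(token_ids: List[int]) -> Prediction:
    # stand-in for an autoregressive head: "predict" the next token id
    return Prediction("next_token", [float(token_ids[-1] + 1)])

def diffusion_expert(noisy_latent: List[float]) -> Prediction:
    # stand-in for a denoising head: "predict" the noise to subtract
    return Prediction("denoise", [x * 0.1 for x in noisy_latent])

def route(inputs, modality: str) -> Prediction:
    # toy router: dispatch by modality; a real MoE learns this gating
    if modality == "text":
        return ar_expert(inputs)
    elif modality == "video_latent":
        return diffusion_expert(inputs)
    raise ValueError(f"unknown modality: {modality}")

print(route([5, 9, 12], "text"))                 # next-token path
print(route([0.4, -1.2, 0.7], "video_latent"))   # denoising path
```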
By understanding and generating different types of data within the same framework, the model can better grasp the intrinsic connections between data, achieving richer and more accurate multimodal outputs.\u003C/section>\u003C/li>\u003C/ul>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>The key to deep understanding\u003C/strong>\u003C/p>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">As Richard Feynman said, \"What I cannot create, I do not understand.\" Combining different generative techniques allows models to gain a deeper understanding of the essence of data during the content creation process. This deep understanding is the key to building foundational models that can truly understand multimodal data.\u003C/section>\u003C/li>\u003C/ul>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">Does the generative capability of AI models represent understanding?\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">AI models respect certain physical laws, which may have been observed during the training process. However, the physical laws followed by the model are not presented in a way that aligns with human understanding. Language follows human logic, while videos reflect the model's own understanding. 
The model might summarize known rules to humans or even discover unknown rules. How do we know it has summarized new rules? This requires verification through language—it must be able to communicate effectively with humans.\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">From the perspective of scaling laws, what are the differences between autoregressive (AR) models and Diffusion models in handling data and learning tasks?\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>Difference in objective functions\u003C/strong>\u003C/p>\u003Cul class=\"list-paddingleft-1\" style=\"list-style-type: disc;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Autoregressive (AR) model: The AR model minimizes prediction loss by predicting the next word or pixel in a sequence, making it suitable for lossless compression. An increase in model size typically implies higher prediction accuracy and better data compression efficiency.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Diffusion model: Unlike AR models, the goal of Diffusion models is to generate data by introducing and gradually removing noise, which goes beyond just data compression. 
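The two objective functions can be written down in a few lines (a minimal sketch; real implementations operate on batched tensors, and diffusion losses are averaged over sampled timesteps):

```python
import math, random

def ar_loss(predicted_probs, target_index):
    """Autoregressive objective: cross-entropy of the next token.
    Minimizing this maximizes the data's log-likelihood, which is why
    AR training doubles as lossless compression."""
    return -math.log(predicted_probs[target_index])

def diffusion_loss(predicted_noise, true_noise):
    """Diffusion objective: mean squared error between the noise the
    model predicts and the noise actually added to the sample."""
    return sum((p - t) ** 2 for p, t in zip(predicted_noise, true_noise)) / len(true_noise)

# AR: the model assigned 0.7 probability to the correct next token
print(ar_loss([0.1, 0.7, 0.2], target_index=1))   # ≈ 0.357

# Diffusion: compare predicted vs. actual Gaussian noise on a tiny "latent"
true = [random.gauss(0, 1) for _ in range(4)]
pred = [x * 0.9 for x in true]                    # an imperfect prediction
print(diffusion_loss(pred, true))
```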
This method focuses more on generation quality and proximity to the data distribution.\u003C/section>\u003C/li>\u003C/ul>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cbr>\u003C/section>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cstrong>Adaptability to data forms\u003C/strong>\u003C/section>\u003Cul class=\"list-paddingleft-1\" style=\"list-style-type: disc;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Autoregressive (AR) model: It is particularly suited for handling discrete data types, such as text. In sequential tasks like text generation or music composition, AR models can predict the next most likely output step by step.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Diffusion Models: Better suited for handling continuous data, such as images and videos. 
This type of model generates continuous data close to the real distribution by controlling and adjusting Gaussian noise, making it perform well in image and video generation tasks.\u003C/section>\u003C/li>\u003C/ul>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cbr>\u003C/section>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cstrong>The relationship between model scale and performance\u003C/strong>\u003C/section>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Learning difficulty and expressive power: Diffusion models may require a relatively smaller model scale to achieve good results when processing continuous data because they operate directly in the continuous space. In contrast, AR models may need a larger model scale to capture complex sequence dependencies when processing discrete data.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Model scaling effects: Scaling up AR models usually directly improves prediction accuracy and data compression efficiency. 
For Diffusion models, scaling focuses more on enhancing the authenticity and diversity of generated data.\u003C/section>\u003C/li>\u003C/ul>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">What are World Models\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">World Models is a type of computer model designed to simulate and understand the dynamics and patterns of the real world. This model analyzes past and present data in an attempt to predict future states, thereby providing a basis for decision-making. World models typically predict various possible future scenarios based on probability distributions and transition probabilities rather than trying to enumerate all possibilities. This method allows the model to make reasonable predictions even in the presence of uncertainty.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>Physical Laws and Simulators\u003C/strong>\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">World Models aim to understand the world with less reliance on human-input physical laws. While traditional simulators depend on explicitly coded physical laws by humans, world models \"understand\" these laws by learning patterns from data. 
For example, in autonomous driving or robotics, video prediction models can learn behavioral patterns rather than simply replicating rules input by humans.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>The History of Physics and AI Applications\u003C/strong>\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">The development of physics—from Newtonian mechanics to relativity and quantum mechanics—demonstrates the deepening of human understanding of natural laws. Similarly, in the field of AI, as models grow larger and more data becomes available, they are able to learn more complex and refined patterns. This applies not only to language generation, such as drafting legal documents, but also extends to generating video content, such as the physical behavior of a race car turning around a corner.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>The essence of model learning\u003C/strong>\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">The goal of world models and other AI models is to learn the rules for generating data, not just memorizing it. This means that the model weights reflect an understanding of the rules of the world. Such understanding allows AI to demonstrate deep insights into complex phenomena within specific domains, such as language generation or video content creation.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>Accuracy of rule understanding\u003C/strong>\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">As the model scales increase and more data is encountered, the model can understand more complex and specialized rules. This deeper understanding extends beyond linguistic expertise to more precise simulations of the physical world. 
For example, the model can learn specific rules about object behavior under certain conditions, such as how race cars behave under specific circumstances.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>Applications of Sora and models\u003C/strong>\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Advanced AI models like Sora demonstrate that through large-scale data training, AI can simulate and understand complex world rules on multiple levels. These models, by integrating existing AI technologies such as autoregressive models and Diffusion models, are capable of generating high-quality language and visual content, reflecting a deep understanding of the real world.\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">What is the predicted size of the Sora model? 
Is further expansion still needed?\u003C/h3>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">VideoPoet: It has achieved 8B (8 billion) parameters, which is a relatively large model capable of generating high-quality video content.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Diffusion Transformer's DiT-XL: It has approximately 675M (675 million) parameters, indicating that even smaller models can handle learning and generation tasks effectively.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Sora: It is estimated to have around 10B (10 billion) parameters, although some predictions suggest it might be 3B (3 billion). This indicates there may be different strategies in terms of model design and computational power investment for training.\u003C/section>\u003C/li>\u003C/ul>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Smaller models, larger data: This is a potential trend that means improving model performance by increasing the scale of data while maintaining or even reducing the size of the model. This approach can reduce inference costs and make AI applications more cost-effective.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Consideration of inference costs: Smaller models are cheaper for inference, which is especially important for applications requiring frequent or real-time reasoning. 
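A rough way to see why the parameter count matters for inference cost is the common ~2·N FLOPs-per-parameter-per-token approximation for Transformer inference, applied to the 3B and 10B estimates above (the token count per clip is a made-up placeholder):

```python
def inference_flops(n_params: float, n_tokens: float) -> float:
    """~2 FLOPs per parameter per generated token (common approximation)."""
    return 2.0 * n_params * n_tokens

tokens_per_clip = 100_000  # hypothetical placeholder sequence length
for name, n in [("3B estimate", 3e9), ("10B estimate", 10e9)]:
    print(f"{name}: {inference_flops(n, tokens_per_clip):.1e} FLOPs per clip")
```

At the same sequence length the 10B configuration costs about 3.3x the 3B one per clip, which is the pressure behind the "smaller models, larger data" trend.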
This drives researchers and developers to seek more efficient model architectures and training methods to achieve optimal performance with limited resources.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">The case of video content generation shows that even with large amounts of video data, relatively smaller models can achieve satisfactory generation quality through carefully designed model structures and training strategies.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Computational resource investment: The training and inference costs of models are closely related to available computational resources. In situations where computational power is limited, developing smaller and more efficient models becomes an important goal.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>Is it necessary to continue expanding the model:\u003C/strong>\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">As the scale of the model increases, performance improvements are typically observed, including better understanding, more accurate predictions, and enhanced capabilities for handling complex tasks. If the current model does not perform ideally on specific tasks, scaling up the model size may be a viable solution.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">If application requirements include processing and generating various types of data (such as text, images, videos, etc.), larger models may be more necessary. 
This is because large models can store and process a wider variety of information, thereby better understanding and integrating multimodal data.\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">What is the computational power estimation required for training Sora?\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>Training cost estimation\u003C/strong>\u003C/p>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Llama 70B model: Trained for one month using 2000 NVIDIA A100 GPUs. This reflects the enormous computational power and time investment required to train large language models.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">VideoPoet: Trained for two weeks using hundreds of NVIDIA H100 GPUs. Although the model size of VideoPoet may not be as large as Llama, the complexity of processing video data may lead to a significant demand for computing power.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Sora: May require thousands of NVIDIA H100 GPUs for one month of training. 
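The figures above translate into GPU-hours as follows. This is back-of-envelope only: "hundreds" and "thousands" of H100s are taken as 500 and 3000 purely for illustration.

```python
def gpu_hours(n_gpus: int, days: float) -> float:
    """Total GPU-hours for a training run: GPUs x days x 24."""
    return n_gpus * days * 24

runs = {
    "Llama 70B: 2000 x A100 for ~1 month": gpu_hours(2000, 30),
    "VideoPoet: assume 500 x H100 for 2 weeks": gpu_hours(500, 14),
    "Sora: assume 3000 x H100 for ~1 month": gpu_hours(3000, 30),
}
for name, hours in runs.items():
    print(f"{name}: {hours:,.0f} GPU-hours")
```

Even under these illustrative assumptions, the Sora estimate is an order of magnitude more GPU-hours than the VideoPoet run, on strictly faster hardware.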
Considering the scale of the Sora model and the complexity of handling high-resolution video data, its computational power requirements could be even higher.\u003C/section>\u003C/li>\u003C/ul>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>Characteristics of models and data\u003C/strong>\u003C/p>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Model scale and sequence length: Although the scale of Sora's model may not be large, the sequence length required when processing video data is much longer than that for text data, and the information density of video data is usually lower than that of language, which increases the difficulty of training and the demand for computing power.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">GPU optimization and architecture: Although the existing GPU infrastructure has been well optimized for Transformer models, when returning from latent representations to pixel-level data, the encoders and decoders involved, as well as convolution-based architectures, may require further hardware support and optimization.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Video preprocessing: Video data can be preprocessed to reduce the burden during training and inference, but this requires a carefully designed data processing pipeline.\u003C/section>\u003C/li>\u003C/ul>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 
16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">How about the estimation of Sora's inference cost?\u003C/h3>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">Inference cost: For models like Sora, each step in the denoising process may require as much computational power as the entire process of an autoregressive (AR) model, leading to video generation taking up to 20 minutes. This reflects the high cost of video generation models during inference.\u003C/p>\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">Computational and memory constraints: The time consumption of AR models is mainly due to memory access rather than computation, whereas diffusion models are computationally intensive. 
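A toy latency model makes the denoising cost concrete: a diffusion sampler pays a full network pass per denoising step, so wall-clock time scales linearly with the step count. The per-pass time below is a hypothetical number chosen so that 50 steps lands near the 20-minute figure mentioned above; it is not a Sora measurement.

```python
def diffusion_latency(seconds_per_pass: float, steps: int) -> float:
    """Total sampling time: one full network pass per denoising step."""
    return seconds_per_pass * steps

per_pass = 24.0  # hypothetical seconds per pass over a long video sequence
print(diffusion_latency(per_pass, steps=50) / 60)  # 20.0 minutes at 50 steps
print(diffusion_latency(per_pass, steps=10) / 60)  # 4.0 minutes at 10 steps
```

Cutting the step count, as DDIM-style samplers and step-distillation methods aim to do, is therefore the most direct algorithmic lever on inference time.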
This means that although their cost is high, diffusion models keep the GPU busy with computation rather than stalling on memory access, and so may benefit more from raw compute improvements than AR models.\u003C/p>\u003C/section>\u003C/li>\u003C/ul>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">How to improve inference speed?\u003C/h3>\u003Col class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cstrong style=\"color: black;\">Hardware and computing power improvement\u003C/strong>\u003C/section>\u003C/li>\u003C/ol>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Hardware advancement: With the performance improvement of GPUs and other dedicated hardware (such as TPUs), we can expect faster data processing and computational speeds. Transformer models are compute-intensive and place especially high demands on memory bandwidth. 
Hardware improvements will directly enhance inference speed.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Computing power improvement: Stronger computing power not only means faster calculation speed but also enables models to more effectively handle large amounts of data, particularly for video tasks that are data-intensive.\u003C/section>\u003C/li>\u003C/ul>\u003Col start=\"2\" class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cstrong style=\"color: black;\">Engineering optimization\u003C/strong>\u003C/section>\u003C/li>\u003C/ol>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Better batching: By optimizing data batching strategies, more data can be processed simultaneously, reducing I/O wait time and improving GPU utilization.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">LLM (Large Language Model) Optimization: Engineering optimizations for large models, such as model pruning and quantization, can reduce the computational demands of a model, thereby accelerating inference speed.\u003C/section>\u003C/li>\u003C/ul>\u003Col start=\"3\" class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cstrong style=\"color: black;\">Algorithmic Improvements\u003C/strong>\u003C/section>\u003C/li>\u003C/ol>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 
25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Optimization of Diffusion Models: There is still considerable room for improvement in diffusion models, such as reducing the number of decoding steps or developing more efficient sampling strategies to enhance generation speed.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Algorithm Efficiency: New algorithm discoveries or improvements to existing algorithms can also significantly increase inference speed, such as improved attention mechanisms, more efficient data encoding and decoding techniques, etc. The future of inference speed: With continued advancements in these areas, we can expect that in the future, the inference time for generating 10 seconds of video content will be drastically reduced to within one minute.\u003C/section>\u003C/li>\u003C/ul>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">Can scaling laws be applied to the compression rates of encoders and decoders?\u003C/h3>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Compression Ratio vs. 
Model Scale: Theoretically, if the scale of the encoder and decoder is expanded from 1B parameters to 100B parameters, a higher compression ratio can be expected, reducing the sequence length from 1M to 1K. This is because larger models have stronger learning and representation capabilities, enabling them to more effectively capture and encode the complexity and details in the data.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Model Scale and the Need for Intermediate Models: When the scale of the encoder and decoder increases to a certain extent, theoretically, the reliance on intermediate models (such as Transformer models) can be reduced. This is because powerful encoders and decoders can more directly process and generate high-quality data representations, thereby reducing the necessity of intermediate processing steps. Practical trade-offs in applications.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Efficiency and Bottlenecks: During the scaling process, there is a contradiction between the desire for high efficiency in intermediate models and the desire for the encoder and decoder to be as small as possible to avoid becoming bottlenecks. In practice, the goal of optimizing the model is to find a balance among the components to achieve overall efficiency and high performance.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Special Cases in Vision Models: For vision models, if the decoder is very powerful, theoretically, additional processing layers may not be necessary. Ideally, a powerful decoder alone would suffice to recover high-quality image or video content from the encoded representation. However, this requires the decoder to possess extremely strong generative capabilities. 
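The 1M-to-1K sequence lengths discussed above follow directly from the tokenizer's downsampling factors. The clip geometry below (128 frames at 512x512) and the patch/stride values are hypothetical, chosen only to make the arithmetic land near those figures:

```python
def video_tokens(frames: int, height: int, width: int,
                 patch: int, temporal_stride: int) -> int:
    """Sequence length after patchifying: one token per
    (temporal_stride x patch x patch) block of the clip."""
    return (frames // temporal_stride) * (height // patch) * (width // patch)

# Hypothetical 128-frame, 512x512 clip
light = video_tokens(128, 512, 512, patch=4, temporal_stride=2)
heavy = video_tokens(128, 512, 512, patch=32, temporal_stride=32)
print(f"light compression: {light:,} tokens")   # 1,048,576 (~1M)
print(f"heavy compression: {heavy:,} tokens")   # 1,024 (~1K)
```

A stronger encoder buys a larger patch and stride at the same reconstruction quality, which is exactly the scaling trade-off this section poses.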
\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Control of Information Density: When processing video and language data, the goal is to ensure that the information density after encoding does not vary too much. This requires controlling the compression strength when designing the encoder so that key information in the data is effectively retained.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Selection of Compression Ratio: Choosing an appropriate fixed compression ratio requires balancing input resolution with the potential capabilities of the encoder. Additionally, since different parts of the data may vary in importance, dynamically allocating attention resources is also a direction for optimizing model performance.\u003C/section>\u003C/li>\u003C/ul>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">Will supporting multi-modal inputs affect the quality of generation?\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Under a fixed resource budget, if 100% of the resources are used to learn one modality (such as text), the model's performance in that modality may reach its optimal level. 
However, if some resources are allocated to learning other modalities, the performance of the single modality may decrease somewhat, as the model needs to distribute its attention and learning ability across multiple tasks.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Ideally, if resources can be increased so that even when learning multi-modal data, each modality can still receive sufficient learning resources, the model’s performance in each modality may not be significantly affected, and might even improve overall performance due to the complementary learning between modalities.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">In the short term, the combination of models like ChatGPT and Sora is unlikely to happen immediately, mainly because the learning requirements and resource allocation strategies for different modalities need to be carefully designed and adjusted. Moreover, multi-modal learning requires the model to handle and understand complex relationships between different types of data, which is itself a challenge.\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">Can multimodal models introduce a function similar to RAG (Retrieval-Augmented Generation) in language models for data not included in the training of video models, such as newly released games or user-private games?\u003C/h3>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 
517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Flexibility and Denoising: Diffusion models have significant advantages in generating high-quality images and videos, especially their denoising process can produce visually realistic content. Combining with AR models can increase control over sequential data, making the generated text or language content more accurate and coherent.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Long Sequence Processing: The long sequence processing capability of Transformer models enables them to understand and generate complex multimodal content, including long videos and detailed game descriptions.\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">Example-driven Learning: Providing the model with examples related to new games or specific content can help the model better understand and generate data in these fields. 
These examples can be retrieved via the RAG mechanism when needed, thus assisting the generation process.\u003C/section>\u003C/li>\u003C/ul>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">Is the current model's extension an optimization within the diffusion framework or on the Transformer framework?\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">The optimization of Transformer models is engineering-oriented: it typically focuses on improving computational efficiency and reducing model size, in order to handle long-sequence data and reduce inference time. 
The optimization of Diffusion models is theory-driven: it mainly focuses on algorithmic improvements that enhance the quality and efficiency of generated data.\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">What kind of challenge does the emergence of Sora pose for start-ups like Pika?\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Pika previously relied mainly on a Latent Diffusion model with a U-Net backbone, which lags somewhat behind Sora's current model.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Although the previous training data can still be utilized, the demand for computing power has increased significantly, bringing more infrastructure and cost pressure.\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">Apart from the brute-force aesthetics of Sora’s large models and high computing power requirements, what other important factors are there?\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 
26px;\">Apart from the scale of the model and its high requirements for computing power, Sora's success also depends on several key factors, mainly including data screening, organization, and technical details of the model. Selecting high-quality data that conforms to physical laws and has less special effects editing is crucial for model training.\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">What are the experiences in training data from the perspective of VideoPoet?\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Although YouTube data is diverse, directly using these data is not the best choice because it contains a large amount of low-quality content, such as boring traditional game live streams and repetitive music videos. Therefore, selecting high-quality data becomes particularly important.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">In the field of data engineering, the integration of artificial intelligence plays a vital role. The best data often comes from high-level human creations. For example, in the training process of large language models, textbooks are excellent data sources. Textbooks compile all the information we need, and their authors are usually experts in the field who invest significant effort, making the content rich, illustrated, and thoroughly explained. 
This kind of data is often created by professionals with doctoral degrees.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">As for high-quality sources of video data, they are mainly purchased stock libraries, including licensed stock videos, news, and media material libraries, which are usually labeled. Although movies and TV works can also serve as data sources, they may involve copyright issues.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Generally speaking, the better educated the creator of the data is, the higher the quality of the data tends to be, but this also means that copyright protection measures will be stricter.\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">Does VideoPoet use synthetic data?\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">VideoPoet does not use synthetic data. However, if data generated by game engines counts as synthetic data, then Sora might use such data. VideoPoet focuses more on model innovation, so the choice of data depends on specific needs. Generating data is itself a huge undertaking, often relying on existing AAA game titles to obtain unique datasets. The advantage of such data is that it follows physical rules.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">If the purpose of generative models is to learn the rules of the world, there may be some rules that are not fully covered by videos. 
Game engines already encode physical rules very precisely. Combining these rules with those learned from the real world can help improve model performance.\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">Combining LLMs with coding can enhance their reasoning abilities. Would adding physical rules help video generation models?\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>In favor:\u003C/strong>\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">The logic of natural language is vague and has many gray areas. Programming languages, by contrast, are formal languages that strictly follow logical rules and have no ambiguity. Therefore, programming languages can compensate for the deficiencies of natural language in logical reasoning.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">For example, when a cup breaks, a physics engine simulates it as shattering. In reality, however, if it's tempered glass the outcome could be entirely different. Data generated by game engines can cover such complex scenarios.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>Against:\u003C/strong>\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">The use of game engines deviates from the original idea of forming physical laws through observation of the world. 
Now, humans have summarized physical laws and fed these laws into machines. However, if a model could spontaneously discover new laws, relying solely on existing physical laws would become a limitation.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Naturally captured videos, aside from possibly being processed with human special effects, also follow physical laws. Providing more solutions that satisfy physical constraints can enable more effective predictions.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">VideoPoet has not only text-image pairing data but also a large amount of video and image data for training. Image-to-video training is helpful for text-to-video training, especially for labeled videos.\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">What shifts in thinking have occurred since the emergence of Sora?\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">With the advent of Sora, people began to realize that LLMs (Large Language Models) are scalable, and Diffusion Transformers are scalable too. Both architectures scale, with different learning curves, but each can be expanded. 
In the future, people may focus more on MoE (Mixture of Experts).\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">The most common misconceptions, underestimations, and overestimations about Sora.\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>Underestimation:\u003C/strong>\u003C/p>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">Multi-resolution design is underestimated. The approach is fairly popular: training at a fixed resolution can harm the data, while applying multi-resolution design to generative models may have a significant impact in the future, enabling more effective use of data. 
The efficiency of generation and training could potentially be tripled.\u003C/p>\u003C/section>\u003C/li>\u003C/ul>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">\u003Cstrong>Overestimation:\u003C/strong>\u003C/p>\u003Cul class=\"list-paddingleft-1\" style=\"margin-top: 8px;margin-bottom: 8px;padding-left: 25px;width: 517.469px;\">\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">World models are overestimated. The model may contain an internal world model, but that model might not be usable for visual understanding tasks. Despite talk of a GPT moment in the visual domain, that level has not yet been reached.\u003C/p>\u003C/section>\u003C/li>\u003Cli>\u003Csection style=\"margin-top: 5px;margin-bottom: 5px;line-height: 26px;color: rgb(1, 1, 1);\">\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;color: black;\">Output video quality is overestimated, including diversity and success rate. 
Even with the same prompt, producing many different videos and ensuring that each one is excellent is what defines a good model.\u003C/p>\u003C/section>\u003C/li>\u003C/ul>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">How long does it take to replicate Sora?\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Replicating Sora involves considerations of infrastructure, computing resources, and data. The model itself is relatively less important, so rough estimates can be made. For example, large companies like Google or Meta might be able to replicate Sora within half a year, while smaller companies may need more time, possibly up to one year.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Gemini has already surpassed GPT-4, and its open-source version even exceeds GPT-3.5. 
(Guest opinion; some may disagree.)\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">For companies with abundant GPU resources and talent, replicating models like these is largely a matter of time.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Where GPU resources are scarce, it is difficult to replicate Sora with limited compute, which tests the capabilities of the talent even more.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">The performance improvement of small models, meanwhile, is obvious.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">In image generation, many small companies have accumulated a large amount of data, for example around Midjourney or the open-source Stable Diffusion. OpenAI did not fully commit to developing DALL-E 3 because it was also working on Sora at the same time. Large companies may catch up, but for smaller companies the cost of catching up might not be worth it.\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">In 2023, the paper on the Diffusion Transformer was rejected by an academic conference but achieved great success in industry.\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Diffusion Transformer is a concept focusing on benchmarking and scalability rather than innovative solutions. 
Although initially rejected in academic review for lack of innovation, it has made significant progress in industry. This approach emphasizes data filtering and organization, with relatively less input from human ingenuity.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">The Diffusion Transformer paper was initially rejected by CVPR 2023 due to a perceived lack of innovation but was later accepted by ICCV 2023. The initial rejection highlights academia's emphasis on novelty over practical effectiveness. Despite setbacks in academia, the Diffusion Transformer has achieved considerable success in the industrial sector. It is recognized for its simplicity and scalability, which are crucial for handling large datasets. The model's adoption in industry underscores a trend where practicality and performance may outweigh theoretical innovation.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Researchers may find this discouraging: a carefully designed model can be outperformed by simply having smart people curate and annotate the data.\u003C/p>\u003C/section>\u003C/section>\u003Csection data-tool=\"mdnice编辑器\" style='margin-top: 20px;margin-bottom: 20px;padding: 10px 20px;color: rgb(0, 0, 0);font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;background: rgb(250, 250, 250);'>\u003Csection>\u003Ch3 style=\"margin-top: 30px;margin-bottom: 15px;font-weight: bold;font-size: 20px;\">What are you most looking forward to this year?\u003C/h3>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">What people look forward to most is the launch of new hardware with an order-of-magnitude increase in computing power, while both production and usage costs decrease.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 
The TPU used by Videopoet">
26px;\">The TPU used by VideoPoet has good scalability and high efficiency, but poor flexibility: it supports only TensorFlow and JAX, and only static graphs.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">TPU compute is comparable in scale to same-generation GPUs but comes at a lower cost, even though raw compute is not significantly greater.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">There is also anticipation for the integration of the Diffusion Transformer and the Auto-Regressive Transformer.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Dr. Fu believes that people expect one model to accomplish all tasks. As models scale up, the granularity of rule understanding will also increase, including world models and video models. For example, the vision model in PaLM is 22B parameters. Video models should be at least as large as language models, or even larger: one ~1T-parameter model handling language, while another 1-2T parameters handle the other modalities.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">People hope generated content will last longer, be clear, conform to physical laws, and have the ability to surprise (emergent ability), while also featuring cross-modal capabilities. For example, a model lecturing to white-collar workers could prioritize using an image or a GIF. This kind of multi-modal educational approach would bring a revolution in education.\u003C/p>\u003Cp style=\"padding-top: 8px;padding-bottom: 8px;line-height: 26px;\">Dr. 
Yu believes that regarding the integration of ChatGPT and Sora, there might be scientific research this year, but it won't yet lead to practical product advances.\u003C/p>\u003C/section>\u003C/section>\u003Cp data-tool=\"mdnice编辑器\" style='margin-bottom: 0px;padding-top: 8px;padding-bottom: 8px;color: black;font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, \"PingFang SC\", Cambria, Cochin, Georgia, Times, \"Times New Roman\", serif;font-size: 16px;letter-spacing: normal;text-align: left;text-wrap: wrap;line-height: 26px;'>Since many of the academic terms are unfamiliar to me, these notes may contain errors. I hope everyone will still listen to the original podcast to ensure the accuracy of the information.\u003C/p>\u003C/div>",1752585428144]