ARTHURCHIAO'S BLOG

[译] 软件领域的工业革命：AI 将使软件成为一种新的 UGC（2025）

ARTHURCHIAO'S BLOG

4 days 2 hours ago

译者序

本文翻译自 2025 年的一篇文章 The rise of industrial software。

工业化能以极大的规模生产低质量、低成本的产品，

印刷工艺的工业化导致了平装书的出现
农业的工业化导致了垃圾食品的出现
数字图像传感器的工业化导致了海量普通人拍摄的图片、视频等等

LLM 的出现是软件领域的蒸汽机时刻，软件开发正在经历一次属于它的“工业革命”，

软件开发正在从传统手工业变成制造业
一旦生产成本足够低，垃圾就是能最大化产量、利润和市场触达的东西
最终市场上流通的不是丰富的好东西，而是过量的最易消费的东西 —— 我们确实正在消费它们（AI 垃圾）
人类程序员未来还有多少市场？未来的创新将是什么？

水平及维护精力所限，译文不免存在错误或过时之处，如有疑问，请查阅原文。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

以下是译文。

译者序
1 软件开发的“工业革命”：从手工业到制造业
2 软件作为一次性商品
3 稀缺商品的工业化生产
4 传统软件未来还有生存空间吗？
- 4.1 再次参考食品、服装领域
- 4.2 创新：人类的自留地？
  - 无形产品：开放的方案空间
  - 创新：发现和解决新问题，获得更大价值的唯一路径
5 创新+规模化/商品化：进步的无限循环

Industrial 一词在牛津词典的定义：

Industrial

adj. (sense 3a)

Of or relating to productive work, trade, or manufacture, esp. mechanical industry or large-scale manufacturing; ( also) resulting from such industry.

—Oxford English Dictionary

1 软件开发的“工业革命”：从手工业到制造业 1.1 手工业：成本高、开发慢，高度依赖人的专业技能和经验

从历史看，软件开发更接近于手工业（craft）而非制造业（manufacture）： 成本高、开发慢，且高度依赖人的专业技能和经验。

1.2 制造业：成本低、开发快、很少依赖人的专业知识

现在，AI coding 正在快速改变这一现状，它使得产品开发更加地低成本、快速、且越来越不依赖人的专业知识。

1.3 软件开发日益自动化的世界

我之前曾说 AI coding can be a trap for today’s practitioners ，它看似能快速给出一个实现，但经常细看就会发现给出的方案相当不完整，而且后期理解和维护成本很高。不过随着工具集的不断完善，这些问题都在快速解决，很明显我们正在迈向一个软件开发日益自动化的时代。

当软件开发经历一次“工业革命”，会发生什么？

2 软件作为一次性商品 2.1 现状：劳动力（程序员）贵，生产（软件开发）有规模瓶颈

传统上，软件的生产成本很高，主要是来自具备专业技能的专业劳动力的成本，简单说就是程序员的成本。

在这个时期，由于强依赖人力，因此从世界范围内看，程序员的规模也决定了能开发出的软件规模的上限。在这个阶段，软件作为一种具备价值属性的商品，由于其开发是有不小成本的，因此公司都把钱花在开发有价值的软件上。

2.2 工业化的本质：自动化（不依赖人、低成本）

任何领域的工业化都试图同时解决以上两个限制，通过流程自动化

减少对人类劳动的依赖，既降低成本，
又允许更大规模和更灵活的生产。

这种变化将人类的角色降级为监督、质量控制和工业流程的优化。

影响一：传统开发模式受到挤压，门槛降低，劳动力（程序员）竞争加剧

这种变化的第一层影响是传统的高质量的软件生产方式受到挤压。

行业的进入门槛降低，竞争加剧，变化速度加快 —— 所有这些影响今天都已经开始显现了。

影响二：大规模生产低质量、低成本的软件

这种工业化的第二层影响是能够以极大的规模生产低质量、低成本的产品。其他领域的例子包括：

印刷工艺的工业化导致了平装书的出现
农业的工业化导致了垃圾食品的出现
数字图像传感器的工业化导致了海量普通人拍摄的图片、视频等等

2.3 一次性软件（disposable software）

软件领域的工业化催生了一类新的编程产物，我们可以称之为一次性软件（disposable software）：这种软件的所有权、后续维护和长期可理解性都是完全没有保证的。

传统软件：高成本、高价值；一次性软件：低成本、低价值。

这种产物的支持者可能会将其称为 vibe-coded software，怀疑者则会称为 AI slop（AI 垃圾、泔水）。

显然，不管其质量如何，这种软件的经济学价值是与传统软件完全不同的，因为其易于复制，因此单位软件的经济价值较低。

这种低价值属性可能会让一些人认为这一趋势是昙花一现，但这么想就错了。要理解原因，我们可以看看以前稀缺商品的工业化普及的例子。

3 稀缺商品的工业化生产 3.1 Jevons 悖论煤炭：单位效率提升，单位成本下降，总消费上升

Jevons 悖论是一个古老的经济学理论，最近被广泛引用。这一观察可以追溯到十九世纪，它指出单位煤炭效能的提升会导致成本下降，进而会导致用户更大的需求量，最终导致更高的总体煤炭消费。

Jevons 悖论描述了单位效率提高如何导致总体消费增加。

Token：单位推理成本下降，推理需求变多，总算力消费激增

今天类似的场景是我们对 AI 计算的需求激增：随着模型在预测 token 方面变得更高效，需求激增，导致更大的 token 消费。同样的效果会波及软件开发本身吗？随着努力成本的降低，是否会推动更高的消费和产出？历史表明会如此。

3.2 农业领域的先例：食物生产的工业化：垃圾食品

考虑农业的工业化。

消灭饥饿 vs. 垃圾食品

二十世纪初，人们认为科学进步将消除饥饿，迎来一个丰富、营养的食物时代。
但直到今天，饥饿和饥荒依然存在。
- 2025 年，仍有 3.18 亿人经历急性饥饿，即使在农业盈余的国家也是如此。
- 与此同时，在最富有的国家，工业食品系统产生了另一种丰富：美国的成年人肥胖率为 40%，糖尿病危机日益严重。

极度加工的（ultraprocessed）食品被广泛认为是有害的，然而绝大多数美国人每天仍然在消费它们。

丰富的好东西 vs. 过量的最易消费的东西

工业系统毫无意外地给传统食物加工系统造成了压力，结果导致了过剩、低质量商品在市场上的流通。这个选择权甚至不是生产者所能把控，因为一旦生产成本足够低，垃圾就是最大化产量、利润和市场触达的东西。最终的结果不是丰富的好东西，而是过量的最易消费的东西 —— 我们确实正在消费它们。

3.3 软件领域：AI 垃圾（用户生成的软件/程序）将不可避免地泛滥

我们对 AI 垃圾的青睐也可能会导致与食物领域同样的结果。

工业化的经济压力将推动一次性软件的流行/泛滥。

如果说智能手机的普及带来的无处不在的用户生成的照片、视频和音频（user generated contents），那软件开始工业化生产之后，我们很可能在社交媒体上看到用户海量地创建、共享和丢弃用户生成的软件（user generated softwares）。

一但这个齿轮转动起来，社交媒体和互联网的新奇和奖励反馈循环 将推动用户生产软件的爆炸式增长，使过去半个世纪的发展相形见绌。

4 传统软件未来还有生存空间吗？ 4.1 再次参考食品、服装领域

垃圾食品当然不是市场上留下的唯一食品选择。仍然有很多人对健康、可持续的食品生产有持续不断的需求，这也主对工业化生产的一种回应。像“有机食物”一样，软件是否也可能通过”有机软件”运动来抵抗机械化？

如果看看其他行业，我们会发现，即使是工业化程度最高的行业，也仍然存在小规模、人类主导的生产，作为完整生产体系的一部分。

例如，在工业化之前，服装主要由专业匠人制作，通常通过行会和手协调，资源在当地收集，制作耐用织物的专业知识积累多年，并在家族中传承等等。工业化完全改变了这一模式，原材料在洲际间运输，织物在工厂中大规模生产，衣服由机器组装，所有这些都导致了今天快速、一次性、剥削性的时尚世界。然而，手工制作的服装仍然存在：从定制西装到针织围巾，小规模、慢生产的纺织品仍然有一席之地，原因包括合身定制、彰显财富、耐用，以及享受手工艺产品等等。

4.2 创新：人类的自留地？

那么，人类编写的软件是否会和高级时装或自制针织品类似，成为一个区别与大众市场的精品市场？

未来，人工编写的软件是否会变成精品店？

无形产品：开放的方案空间

如果软件是有形的产品，情况可能就是类似的，工业化导致可重用（物理）组件的大规模生产。但是，软件是无形的商品，与其他领域不同，它本身就有着组件重用的悠久历史，这是软件商品本身固有的属性。

创新不仅限于让现有的产品（例如服装）更好或更便宜，还包括解决方案空间的扩大，例如，蒸汽机的出现使人类能够重用机器组件，造出了后来的生产线、汽车等。

创新：发现和解决新问题，获得更大价值的唯一路径

因此，软件开发的进步不仅仅是工业化，还包括创新。 研发虽然昂贵，但随着时间的推移提供了获得更大价值的唯一路径。

创新是未来人工开发软件的价值增长点。

创新从根本上不同于工业化，因为它不是专注于更有效地复制今天已经存在的东西。而是在以前的基础上，它通过发现和解决新问题来提供以前没有的新能力。

5 创新+规模化/商品化：进步的无限循环

创新提供了以前没有的新能力之后，接下来就又轮到工业化入场了，它把这种新能力规模化和商品化，为下一轮创新建立基础。这两种力量的相互作用就是我们所说的进步。

5.1 大模型是软件领域的蒸汽机，大量工作不再依赖人力劳动

大语言模型的出现是软件领域的蒸汽机时刻。它们降低了以前完全依赖稀缺的人类劳动的那些工作的成本，从而解锁了的非凡加速度。

5.2 蒸汽机并不是凭空出现的，而是一个拐点，自动化、规模和资本在此对齐

但注意，蒸汽机并不是凭空出现的。

风车和水车在涡轮机之前几个世纪就出现了
机械化并不是从煤炭和钢铁开始的

蒸汽机只是刚好达到了一个拐点，在这个拐点上，自动化、规模和资本对齐，推动了经济转型。

5.3 软件领域的巨大加速时刻

同样，软件也已经工业化很长时间了：可重用组件（开源代码）、可移植性（容器化、云）、大众化（低代码/无代码工具）、互操作性（API 标准、包管理器）和许多其他方式。

因此，我们正在进入软件的工业革命，不是作为断裂的时刻，而是巨大的加速时刻。

工业化不会取代技术进步，但它将大大加速新思想的吸收和新能力的商品化。
反过来，能更快地解锁创新，因为在新技术基础上构建的成本下降得更快。

进步的循环继续，但在大规模自动化时代，轮子比以往任何时候都转得更快。

进步的循环：创新+工业化同时驱动。

5.4 工业化生产的软件占据主导地位之后，对周围生态系统的影响

至此，剩下的开放问题不是工业软件是否会占主导地位，而是这种主导地位对周围生态系统将造成怎样的影响。

以前的工业革命将其影响外化到看似无限的环境中，刚开始不会引人注目，但越到后面越明显；
软件生态系统也是类似的：依赖链、维护负担、安全等等问题，都会随着生产出的软件规模不断增加而越来越严重。

导致的技术债是对数字世界的污染，直到严重到足以扼杀依赖它的那些系统。

5.5 最难的不再是生产，而是管理

在大规模自动化时代，我们可能会发现最困难的问题不是生产，而是管理。 谁来维护那些海量的没有 owner 的软件？

[译] 软件领域的工业革命：AI 将使软件成为一种新的 UGC（2025）

ARTHURCHIAO'S BLOG

4 days 2 hours ago

译者序

本文翻译自 2025 年的一篇文章 The rise of industrial software。

工业化能以极大的规模生产低质量、低成本的产品，

印刷工艺的工业化导致了平装书的出现
农业的工业化导致了垃圾食品的出现
数字图像传感器的工业化导致了海量普通人拍摄的图片、视频等等

LLM 的出现是软件领域的蒸汽机时刻，软件开发正在经历一次属于它的“工业革命”，

软件开发正在从传统手工业变成制造业
一旦生产成本足够低，垃圾就是能最大化产量、利润和市场覆盖度的东西
最终市场上流通的不是丰富的好东西，而是过量的最易消费的东西 —— 我们确实正在消费它们（AI 垃圾）
人类程序员未来还有多少市场？未来的创新将是什么？

水平及维护精力所限，译文不免存在错误或过时之处，如有疑问，请查阅原文。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

以下是译文。

译者序
1 软件开发的“工业革命”：从手工业到制造业
2 软件作为一次性商品
3 稀缺商品的工业化生产
4 传统软件未来还有生存空间吗？
- 4.1 再次参考食品、服装领域
- 4.2 创新：人类的自留地？
  - 无形产品：开放的方案空间
  - 创新：发现和解决新问题，获得更大价值的唯一路径
5 创新+规模化/商品化：进步的无限循环

Industrial 一词在牛津词典的定义：

Industrial

adj. (sense 3a)

Of or relating to productive work, trade, or manufacture, esp. mechanical industry or large-scale manufacturing; ( also) resulting from such industry.

—Oxford English Dictionary

1 软件开发的“工业革命”：从手工业到制造业 1.1 手工业：成本高、开发慢，高度依赖人的专业技能和经验

从历史看，软件开发更接近于手工业（craft）而非制造业（manufacture）： 成本高、开发慢，且高度依赖人的专业技能和经验。

1.2 制造业：成本低、开发快、很少依赖人的专业知识

现在，AI coding 正在快速改变这一现状，它使得产品开发更加地低成本、快速、且越来越不依赖人的专业知识。

1.3 软件开发日益自动化的世界

当软件开发经历一次“工业革命”，会发生什么？

2 软件作为一次性商品 2.1 现状：劳动力（程序员）贵，生产（软件开发）有规模瓶颈

传统上，软件的生产成本很高，主要是来自具备专业技能的专业劳动力的成本，简单说就是程序员的成本。

2.2 工业化的本质：自动化（不依赖人、低成本）

任何领域的工业化都试图同时解决以上两个限制，通过流程自动化

减少对人类劳动的依赖，既降低成本，
又允许更大规模和更灵活的生产。

这种变化将人类的角色降级为监督、质量控制和工业流程的优化。

影响一：传统开发模式受到挤压，门槛降低，劳动力（程序员）竞争加剧

这种变化的第一层影响是传统的高质量的软件生产方式受到挤压。

行业的进入门槛降低，竞争加剧，变化速度加快 —— 所有这些影响今天都已经开始显现了。

影响二：大规模生产低质量、低成本的软件

这种工业化的第二层影响是能够以极大的规模生产低质量、低成本的产品。其他领域的例子包括：

印刷工艺的工业化导致了平装书的出现
农业的工业化导致了垃圾食品的出现
数字图像传感器的工业化导致了海量普通人拍摄的图片、视频等等

2.3 一次性软件（disposable software）

传统软件：高成本、高价值；一次性软件：低成本、低价值。

这种产物的支持者可能会将其称为 vibe-coded software，怀疑者则会称为 AI slop（AI 垃圾、泔水）。

显然，不管其质量如何，这种软件的经济学价值是与传统软件完全不同的，因为其易于复制，因此单位软件的经济价值较低。

这种低价值属性可能会让一些人认为这一趋势是昙花一现，但这么想就错了。要理解原因，我们可以看看以前稀缺商品的工业化普及的例子。

3 稀缺商品的工业化生产 3.1 Jevons 悖论煤炭：单位效率提升，单位成本下降，总消费上升

Jevons 悖论描述了单位效率提高如何导致总体消费增加。

Token：单位推理成本下降，推理需求变多，总算力消费激增

3.2 农业领域的先例：食物生产的工业化：垃圾食品

考虑农业的工业化。

消灭饥饿 vs. 垃圾食品

二十世纪初，人们认为科学进步将消除饥饿，迎来一个丰富、营养的食物时代。
但直到今天，饥饿和饥荒依然存在。
- 2025 年，仍有 3.18 亿人经历急性饥饿，即使在农业盈余的国家也是如此。
- 与此同时，在最富有的国家，工业食品系统产生了另一种丰富：美国的成年人肥胖率为 40%，糖尿病危机日益严重。

极度加工的（ultraprocessed）食品被广泛认为是有害的，然而绝大多数美国人每天仍然在消费它们。

丰富的好东西 vs. 过量的最易消费的东西

3.3 软件领域：AI 垃圾（用户生成的软件/程序）将不可避免地泛滥

我们对 AI 垃圾的青睐也可能会导致与食物领域同样的结果。

工业化的经济压力将推动一次性软件的流行/泛滥。

一但这个齿轮转动起来，社交媒体和互联网的新奇和奖励反馈循环 将推动用户生产软件的爆炸式增长，使过去半个世纪的发展相形见绌。

4 传统软件未来还有生存空间吗？ 4.1 再次参考食品、服装领域

如果看看其他行业，我们会发现，即使是工业化程度最高的行业，也仍然存在小规模、人类主导的生产，作为完整生产体系的一部分。

4.2 创新：人类的自留地？

那么，人类编写的软件是否会和高级时装或自制针织品类似，成为一个区别与大众市场的精品市场？

未来，人工编写的软件是否会变成精品店？

无形产品：开放的方案空间

创新：发现和解决新问题，获得更大价值的唯一路径

因此，软件开发的进步不仅仅是工业化，还包括创新。 研发虽然昂贵，但随着时间的推移提供了获得更大价值的唯一路径。

创新是未来人工开发软件的价值增长点。

5 创新+规模化/商品化：进步的无限循环

5.1 大模型是软件领域的蒸汽机，大量工作不再依赖人力劳动

大语言模型的出现是软件领域的蒸汽机时刻。它们降低了以前完全依赖稀缺的人类劳动的那些工作的成本，从而解锁了的非凡加速度。

5.2 蒸汽机并不是凭空出现的，而是一个拐点，自动化、规模和资本在此对齐

但注意，蒸汽机并不是凭空出现的。

风车和水车在涡轮机之前几个世纪就出现了
机械化并不是从煤炭和钢铁开始的

蒸汽机只是刚好达到了一个拐点，在这个拐点上，自动化、规模和资本对齐，推动了经济转型。

5.3 软件领域的巨大加速时刻

因此，我们正在进入软件的工业革命，不是作为断裂的时刻，而是巨大的加速时刻。

工业化不会取代技术进步，但它将大大加速新思想的吸收和新能力的商品化。
反过来，能更快地解锁创新，因为在新技术基础上构建的成本下降得更快。

进步的循环继续，但在大规模自动化时代，轮子比以往任何时候都转得更快。

进步的循环：创新+工业化同时驱动。

5.4 工业化生产的软件占据主导地位之后，对周围生态系统的影响

至此，剩下的开放问题不是工业软件是否会占主导地位，而是这种主导地位对周围生态系统将造成怎样的影响。

以前的工业革命将其影响外化到看似无限的环境中，刚开始不会引人注目，但越到后面越明显；
软件生态系统也是类似的：依赖链、维护负担、安全等等问题，都会随着生产出的软件规模不断增加而越来越严重。

导致的技术债是对数字世界的污染，直到严重到足以扼杀依赖它的那些系统。

5.5 最难的不再是生产，而是管理

在大规模自动化时代，我们可能会发现最困难的问题不是生产，而是管理。 谁来维护那些海量的没有 owner 的软件？

[译][论文] P5 paper | 用语言模型做推荐：一种统一的预训练、个性化提示和预测范式（2022）

ARTHURCHIAO'S BLOG

2 weeks 2 days ago

译者序

本文翻译自 2022 年 RecSys 大会的一篇论文 Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)。

Figure 1: P5 pretrains on an encoder–decoder Transformer model that takes in textual inputs and produces target responses.

图 3：P5 架构示意图。

水平及维护精力所限，译文不免存在错误或过时之处，如有疑问，请查阅原文。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

以下是译文。

译者序
摘要
1 引言
2 相关工作
3 个性化 prompts 集合
- 3.1 Prompts 设计
- 3.2 从原始数据构建训练数据集（prompts & answers）
4 P5 范式与模型
- 4.1 P5 架构
- 4.2 用预训练的 P5 进行推荐任务（推理）
5 实验
6 CONCLUSIONS AND FUTURE WORK

MathJax.Hub.Config({ extensions: ["tex2jax.js"], jax: ["input/TeX", "output/HTML-CSS"], tex2jax: { inlineMath: [ ['$','$'], ["\$","\$"] ], displayMath: [ ['$$','$$'], ["\\[","\\]"] ], processEscapes: true }, "HTML-CSS": { availableFonts: [], preferredFont: null, webFont: "Neo-Euler", mtextFontInherit: true }, TeX: { extensions: ["color.js"], Macros: { lgc: ["{\\color{my-light-green} #1}", 1], gc: ["{\\color{my-green} #1}", 1], lrc: ["{\\color{my-light-red} #1}", 1], rc: ["{\\color{my-red} #1}", 1], lbc: ["{\\color{my-light-blue} #1}", 1], bc: ["{\\color{my-blue} #1}", 1], kc: ["{\\color{my-gray} #1}", 1], loc: ["{\\color{my-light-orange} #1}", 1], oc: ["{\\color{my-orange} #1}", 1], a: ["\\mathbf a"], A: ["\\mathbf A"], b: ["\\mathbf b"], B: ["\\mathbf B"], c: ["\\mathbf c"], C: ["\\mathbf C"], d: ["\\mathbf d"], D: ["\\mathbf D"], E: ["\\mathbf E"], I: ["\\mathbf I"], L: ["\\mathbf L"], m: ["\\mathbf m"], M: ["\\mathbf M"], r: ["\\mathbf r"], s: ["\\mathbf s"], t: ["\\mathbf t"], S: ["\\mathbf S"], x: ["\\mathbf x"], z: ["\\mathbf z"], v: ["\\mathbf v"], y: ["\\mathbf y"], k: ["\\mathbf k"], bp: ["\\mathbf p"], P: ["\\mathbf P"], q: ["\\mathbf q"], Q: ["\\mathbf Q"], r: ["\\mathbf r"], R: ["\\mathbf R"], Sig: ["\\mathbf \\Sigma"], t: ["\\mathbf t"], T: ["\\mathbf T"], e: ["\\mathbf e"], X: ["\\mathbf X"], u: ["\\mathbf u"], U: ["\\mathbf U"], v: ["\\mathbf v"], V: ["\\mathbf V"], w: ["\\mathbf w"], W: ["\\mathbf W"], Y: ["\\mathbf Y"], z: ["\\mathbf z"], Z: ["\\mathbf Z"], p: ["\\,\\text{.}"], tab: ["\\hspace{0.7cm}"], sp: ["^{\\small\\prime}"], mR: ["{\\mathbb R}"], mC: ["{\\mathbb C}"], mN: ["{\\mathbb N}"], mZ: ["{\\mathbb Z}"], deg: ["{^\\circ}"], argmin: ["\\underset{#1}{\\text{argmin}}", 1], argmax: ["\\underset{#1}{\\text{argmax}}", 1], co: ["\\;\\text{cos}"], si: ["\\;\\text{sin}"] } } }); MathJax.Hub.Register.StartupHook("TeX color Ready", function() { MathJax.Extension["TeX/color"].colors["my-green"] = '#677d00'; MathJax.Extension["TeX/color"].colors["my-light-green"] = '#acd373'; MathJax.Extension["TeX/color"].colors["my-red"] = '#b13e26'; MathJax.Extension["TeX/color"].colors["my-light-red"] = '#d38473'; MathJax.Extension["TeX/color"].colors["my-blue"] = '#306693'; MathJax.Extension["TeX/color"].colors["my-light-blue"] = '#73a7d3'; MathJax.Extension["TeX/color"].colors["my-gray"] = '#999'; MathJax.Extension["TeX/color"].colors["my-orange"] = '#E69500'; MathJax.Extension["TeX/color"].colors["my-light-orange"] = '#FFC353'; }); 摘要

长期以来，不同的推荐任务通常需要针对特定任务设计 架构与训练目标 (task-specific architectures and training objectives)。这导致难以将学习到的知识与表征从一个任务迁移到另一个任务，从而限制了现有推荐方法的泛化能力。例如，一个序列推荐模型 (sequential recommendation) 很难被应用或迁移到评论生成 (review generation) 任务中。

考虑到语言几乎可以描述任何事物，而且语言基础是一种表征各种问题或任务的强大媒介，本文提出一种灵活、统一的文本到文本范式来解决以上问题 —— 这种范式我们称为 “Pretrain, Personalized Prompt, and Predict Paradigm” (预训练、个性化提示与预测范式)，缩写为 P5。它将各类推荐任务统一在一个共享框架中，

在 P5 中，所有数据 （user-item interactions, user descriptions, item metadata, user reviews 等）都被转换为统一的自然语言序列。
自然语言所蕴含的丰富信息有助于 P5 捕获更深层的语义，从而实现个性化推荐。

具体而言，P5 在预训练阶段通过相同的语言建模目标学习不同任务，从而成为各类下游推荐任务的基础模型。

P5 不仅能轻松与其他模态信息融合，还能基于提示实现指令驱动的推荐。
P5 将推荐系统从浅层模型、深度模型推进至大模型阶段，并将以通用推荐引擎的形式彻底革新推荐系统的技术形态。
通过为不同用户自适应生成个性化提示，P5 能够以零样本或少样本方式进行预测，大幅减少了对大量微调的依赖。

我们在多个推荐基准测试上进行了实验，验证了 P5 的有效性，相关代码和模型也已经开源：

github.com/jeykigung/P5 开源了源代码、数据集、提示词及预训练的 P5 模型。
huggingface.co/makitanikaze/P5 模型。

1 引言

过去几十年，推荐系统取得了显著进步，并在人们的日常生活中发挥着重要作用。而现在，推荐系统在朝着特征更多样性、应用场景更广泛的综合系统发展。

1.1 现阶段推荐系统的特点特征表示和学习越来越复杂

推荐系统中的 feature engineering 和 learning 已经从简单发展到复杂。

早期，推荐系统通常采用 logistic regression 或 collaborative filtering [25, 35, 50, 52]，利用 user-item interaction 数据来建模用户的行为模式。
之后，通过更复杂的模型如 factorization machines [48] 和 GBDT [20]，将 contextual features（如 user profile 和 item metadata）进一步整合到系统中。
最近，deep neural network models [3, 5, 19, 74] 促进了更加多样和复杂的特征之间的交叉与组合。因此，与传统基于 feature engineering 的方法相比，这些模型获得了更好的表示能力。

推荐任务的类型越来越多样

推荐任务的类型也越来越多。除了经典的 rating prediction 和基于 direct user-item matching 的推荐任务之外，最近的研究正在将范围扩展到新的任务和场景，如

sequential recommendation [21, 60, 63, 80]
conversational recommendation [8, 61, 76]
explainable recommendation [17, 31, 62, 70, 75, 77]

等等。虽然上述推荐任务的方法通常是单独提出的，但一个明显的趋势是 利用多个推荐任务来联合学习 transferable representations [31, 56, 57, 72]。

1.2 现代推荐系统需要什么

尽管现有的推荐系统取得了巨大成功，但在解决实际问题上仍面临很多问题，我们认为需要一个能支持多样特征和不同类型任务的综合推荐系统。

推荐任务通常共享同一个 user–item pool（用户-物品信息池）并具有重叠的 contextual features，因此，我们任务将多个推荐任务合并到一个统一框架中是非常有希望的，这样多个任务可以隐式地 transfer knowledge，相互受益，并泛化到其它没见过的任务。

1.3 P5 的创新点

受最近 multitask prompt-based training [1, 51, 67] 进展的启发，本文提出一个统一的范式 P5。它有三个主要优势：

将推荐模型（行为模型）深度融入到语言环境（语言模型）中。

基于 personalized prompts，所有推荐任务都被重新表述为 NLP 任务。由于自然语言足够灵活和强大，能够用文本表达各种类型的特征，因此无需设计 feature-specific encoders。通过这种方式，P5 可以充分利用训练语料库中丰富的语义和知识；

译注：从 Tokenization 视角看生成式推荐（GR）近几年的发展（2025）
将多个推荐任务放到同一个 text-to-text encoder-decoder 中，并使用相同的 language modeling loss 进行训练，而不是设计 task-specific 架构和 objective functions。

换句话说， P5 将所有 personalized tasks 视为 conditional text generation 问题；
通过 instruction-based prompts 训练，P5 在推广到新的 personalized prompts 或其它领域中未见过的 items 时，获得了良好的 zero-shot 性能。

Figure 1: P5 pretrains on an encoder–decoder Transformer model that takes in textual inputs and produces target responses. We trained P5 on a multitask collection of personalized prompts. After multitask prompt-based pretraining on recommendation datasets, P5 achieves the capability of zero-shot generalization to unseen personalized prompts and new items.

2 相关工作 2.1 统一框架的尝试

之前已经有一些工作试图在统一模型中解决各种推荐任务。

基于通用语言模型（T5 和 GPT3）

早期先驱，

T5 ：通过 text-to-text encoder-decoder 框架统一了 NLP 下游任务。
GPT-3：通过 autoregressive language modeling 统一了 NLP 下游任务。

它们都能基于同一个预训练的语言模型实现不同任务之间的有效知识共享（即，通用模型）。

基于自然语言的 seq-to-seq 架构

最近业界开始专注于通过一个共享的 sequence-to-sequence 框架统一大规模语言任务 [1, 51, 67] 或跨模态应用 [6, 66, 71]，其中不同类型的任务和模态都以自然语言形式表达。

但是，这类方法没有在模型中考虑个性化。

基于通用用户表示

[56, 57, 72] 尝试学习易于迁移到下游任务的通用用户表示。这些方法的一个局限性是它们仍然需要在下游数据集上进行 finetuning。

相比之下，P5 将个性化纳入 encoder-decoder Transformer 模型，该模型可以泛化到广泛的需要个性化推荐的场景。此外，借助 prompt-based pretraining，P5 在迁移到未见过的 prompts 和 items 时获得了良好的 zero-shot generalization 能力。

2.2 通过提示的方式学习（Prompt Learning）

GPT 系列尤其是 GPT-3 的成功标志着 prompt 在 NLP 任务中的普及。

在互联网上收集的大量语言数据进行训练，GPT-3 展示了在提供少量输入-输出示例作为 exemplar prompts 时解决 NLP 任务的能力。
其他一些遵循 “pretrain, prompt, and predict” 范式的 prompt 设计方法最近也有发展 [37]。
- [16, 23, 36, 40, 58] 探索了针对特定离散提示的搜索。
- [18, 28, 33, 38, 45, 81] 利用连续向量 embedding 作为提示。

由于 instruction-based prompt 包含详细的任务描述，更符合自然语言方式，而且与人类的交流方式很类似，一些工作 [11, 68] 认为从多样的 NLP 数据集学习是通往通用 NLP 系统的一种方式。最近的工作如 FLAN [67] 和 T0 [51] 在大型 NLP 数据集上微调 pretrained language models，这些数据集通过人类可读的提示进行组织，在未见过的任务上表现出强大 zero-shot 能力。

受这些方法成功的启发，我们创建了一个个性化提示集，然后在一个多样化的推荐任务上训练一个 sequence-to-sequence 模型。

2.3 推荐领域的 NLP

推荐已经与 NLP 技术有很长时间的交集了。四个主要方向：

explainable recommendation [4, 10, 30–32, 75, 77] where NLP models help generating text explanations for a given recommendation;
sequential recommendation as language modeling [9, 60, 80] which considers user interaction histories as word token sequences;
text feature extraction [69, 74, 79] which aims to extract informative text encodings that can improve the performance of recommendation;
conversational recommendation [8, 12–14, 22, 61, 76] that reasons the intent of users and gives recommendation in an interactive dialog format.

本文主要涵盖前两种任务，并讨论了如何设计一个统一的 NLP 框架来涵盖 rating prediction、top-k recommendation 和 review summarization 等任务。

此外，通过使用与传统相似的指令式提示进行预训练，P5 受益于自然语言环境，提高了在系列推荐任务上的性能。

2.4 Zero-shot 和冷启动推荐

推荐系统的性能很大程度上依赖于可用的训练数据，但总是存在零样本或少样本的情况。如果在这类冷启动场景下，推荐系统的表现也很好，就表明这个推荐模型具有良好的泛化能力。

一个常见的研究是冷启动推荐，即用户 [26] 或物品 [53] 是新系统，没有之前的交互记录。

常见解决方案是学习去建模内容特征 [15, 29, 44, 55]，以便在没有交互记录的情况下进行推理，或者是从其他的辅助域学习迁移表示 [42, 56, 59, 72, 82]。
另一种解决方式是快速适应新域（quick adaptation to the new domain），而非供冷启动 case。解决方案通常遵循meta learning [27, 64] 或因果学习 [34] (causal learning) 框架，使模型对域适应具有鲁棒性。

在我们的工作中，我们要求 P5 模型在辅助域上预训练，以解决目标域上的任务，其中用户对 P5 是已知的，但物品 P5 是没见过的。

3 个性化 prompts 集合

为了方便 multitask prompt-based pretraining，我们创建了一个个性化提示集。个性化提示集覆盖了五类不同的任务：

rating prediction
sequential recommendation
explanation
review
direct recommendation

每类任务包含多个个性化提示，帮助 P5 发现用户和物品的各个方面关联。

[51] 中，一个提示由一个输入模板和一个目标模板组成，以及一组相关的元数据。在本文中，我们进一步定义个性化提示为包含个性化字段的提示，用于不同的用户和物品（a prompt that includes personalized fields for different users and items）。

例如，一个用户的偏好可以通过一个 ID 描述，也可以通过一段文本描述表示。此外，给定个性化提示，期望模型输出也应该根据其物品字段而变化。这按时的说用户对不同物品的不同偏好。这样的物品字段可以表示为物品 ID 号码或包含详细描述的物品元数据。

3.1 Prompts 设计

我们针对每个任务设计了一个基本的个性化提示集。

rating prediction 提示词设计

对于 rating prediction 任务，我们将其提示分为三个类别：

给定用户和物品的信息，直接预测用户给该物品的评分，范围从 1 到 5；
预测用户是否会给一个物品指定的评分（rate an item a given score）。期望输出是 yes 或 no；
预测用户是否喜欢或不喜欢一个物品。

我们考虑评分等于或大于 4 为用户的喜欢偏好，而较低的评分表示用户的不喜欢偏好。

sequential recommendation 提示词设计

针对 sequential recommendation 任务，我们创建了三种类型的提示：

基于用户交互历史，直接预测下一个物品；
给定用户交互历史，从候选列表中选择可能的下一个物品，其中只有一个物品是正样本；
基于用户交互历史，预测给定物品是否会被用户下次交互。

explanation 提示词设计

针对 explanation 任务，我们要求 P5 生成一个文本解释，以证明用户对给定物品的偏好。两种提示：

直接生成一个包含用户和物品信息的解释句子；
基于一个特征词作为提示，生成解释。

对于每个类别，可能还包括其他辅助信息，例如评论标题和评分。

review 相关提示词设计

针对 review 相关任务，我们创建了两种类型的提示：

总结评论，生成一个更短的评论标题；
基于给定的评论，预测相应的评分。

direct recommendation 提示词设计

针对 direct recommendation 任务，我们创建了两种类型的提示：

预测是否向用户推荐一个物品，期望输出是 yes 或 no；
从候选物品列表中选择最合适的物品推荐给用户。

完整的个性化提示集见附录。

3.2 从原始数据构建训练数据集（prompts & answers）

构建训练数据的过程如图 2 所示，

图 2：根据设计的个性化提示模板，从原始数据构建训练用的 input-target pairs 或零样本测试个性化提示。原始数据来自三个数据源。具体的，rating/review/explanation （a）共享相同的原始数据，而 sequential recommendation (b) 和 direct recommendation (c) 使用类似的原始数据，但前者还需要用户交互历史。完整的 P5 个性化提示集见附录。

训练数据和预训练任务对这些数据中的信息进行萃取，提炼用户的偏好和个性化信息。预训练时，我们将不同任务的 input-target pairs 混合在一起作为训练数据。

为了增强 P5 的鲁棒性和零样本泛化能力，对于每个原始数据，我们只采样一部分，而不是每个任务中的所有个性化提示。在 sequential 和 direct recommendation 任务中，我们还会对那些需要候选列表的场景随机选择一些负物品。

4 P5 范式与模型

所有预训练数据共享统一的 input-target token 序列格式，打破了不同任务之间的界限。在条件生成统一框架下预训练多个推荐任务可以提升所有任务的效果。

整个预训练阶段将 P5 沉浸在完整的语言环境中，我们期望增强其零样本泛化能力，能够理解新颖的个性化提示，即使这些提示包含详细的物品描述。这就是为什么 P5 被称为统一的“预训练、个性化提示和预测范式”（”Pretrain, Personalized Prompt, and Predict Paradigm”）。

4.1 P5 架构

具体到 P5 架构，我们采用基本的 encoder-decoder 框架，并使用 Transformer 构建编码器和解码器。

假设输入 token 序列的 embedding 为 $\mathbf{x} = \left[x_1, \cdots, x_n\right]$。如 Figure 3 所示，

图 3：P5 架构示意图。对于示例 prompt 输入 What star rating do you think user_23 will give item_7391?，P5 首先使用双向文本编码器编码输入，然后通过文本解码器自回归地生成答案。与任务特定的推荐模型不同，P5 基于 multitask prompt-based pretraining，因此能够适应不同的任务，泛化能力很强。

位置编码

增加位置编码，以捕获序列中的位置信息。

Whole-word embedding，补偿 item token 表示被 tokenizer 拆分带来的语义损失

为了使 P5 捕捉输入序列中包含的个性化信息，我们还应用 whole-word embedding $\mathcal{W}$ 来表示连续的 sub-word token 是否来自同一个原始单词。

为什么需要这个步骤呢？举个例子，

如果我们直接用 ID 7391 表示物品，即 item_7391，那么这个词经过 SentencePiece tokenizer 之后，就会变成 4 个独立的 token（item, _, 73, 91），而不是我们期望的一个。通过共享的 whole-word embedding （图 3 中的 <w10>），P5 可以更好地识别包含个性化信息的字段。
另一种方案是每个用户/物品用一个独立的额外 token 表示（例如，<item_7391>）。然而，当用户和物品数量很大时，这可能会引入大量的额外 token。

encoder & decoder

接下来，文本编码器将上述三个 embedding 的和 $\mathbf{e} = \left[e_1, \cdots, e_n\right]$ 作为输入，并输出上下文化之后的表示 $\mathbf{t} = \left[t_1, \cdots, t_n\right] = \mathcal{E}(\mathbf{e})$。

解码器 $\mathcal{D}(\cdot)$ 然后关注之前生成的 token $\mathbf{y}$ 和编码器输出 $\mathbf{t}$，并预测未来 token 的概率分布：

$P_{\theta}\left(\mathbf{y}_{j} \mid \mathbf{y}_{<j}, \mathbf{x}\right) = \mathcal{D}(\mathbf{y}_{<j}, \mathbf{t})$。

在预训练阶段，P5 minimizing the negative log-likelihood of label tokens y conditioned on input text x in an end-to-end manner：

这个相同的损失函数被所有 P5 下的推荐任务共享。因此，我们统一推荐任务，使用一个模型、一个损失和一个数据格式。

4.2 用预训练的 P5 进行推荐任务（推理）

在预训练之后，P5 可以直接个性化提示执行不同的任务，不管这些 prompts 它有没有见过。

对于 rating、explanation 和 review 任务，简单地使用贪心解码（greedy decoding）来生成答案。
对于 sequential 和 direct recommendation 任务，通常需要一个物品列表作为目标输出，使用 beam search。

对于 sequential recommendation，我们应用 beam search 生成一个潜在的下一个物品列表。对于 direct recommendation，我们从一个候选物品集合 $\mathbf{S} = {S_1, \cdots, S_m}$ 中预测推荐的物品，其中只有 $m$ 个候选物品中的一个是正样本。这里，我们同样使用 beam search 解码一个具有最高分数的潜在目标物品列表，然后进行评估。上述两种解码过程可以写为：

其中 $B$ 表示 beam size，$\mathbf{C}$ 表示输出物品列表。

5 实验

本节我们评估 P5 在真实世界数据上的性能，并与其他代表性方法进行比较。通过性能比较和消融研究，我们旨在回答以下问题：

5.0 要回答的问题 (RQ 1~5) 问题一：P5 与 task-specific 方法的性能比较

How does our unified P5 framework perform compared with task-specific methods on all five task families?

问题二：P5 的零样本泛化能力

Does P5 have enough zero-shot generalization ability when transferring to unseen personalized prompts for either existing or new items?

问题三：P5 的性能如何受模型大小、任务数量和提示数量影响？

How do scaling factors such as model size, number of task families, and number of prompts affect the performance of P5?

问题四：P5 中实现个性化推荐的最佳方式是什么？（unique token vs. sub-word units）

Which is a better way to implement personalization in P5: adopting an independent extra token for each user or item (e.g., “⟨user_23⟩”) or the default setting, i.e., tokenizing each user or item into multiple sub-word units (e.g., “user”, “_”, “23”)?

问题五：P5 的预训练时间？P5 的推理性能？

How long does it take for P5 to conduct pretraining? Is it efficient to make inference with the pretrained P5 model? We provide statistics on training and inference time in the Appendix

5.1 Experimental Setup Datasets

We conduct extensive experiments over four real-world datasets. The Amazon1 datasets are collected from Amazon.com platform with user ratings and reviews on 29 categories of products. In this paper, we adopt three of them to evaluate our method, namely Sports & Outdoors, Beauty, as well as Toys & Games. Besides, Yelp2 dataset contains a large number of user ratings and reviews for business recommendation. We follow [80] and use transaction records between January 1, 2019 to December 31, 2019. Due to space limit and that the results on Yelp show similar trends with other datasets, we put the experimental results on Yelp dataset in the Appendix. The detailed statistics of these datasets are presented in Table 1.

Task splits

For rating, explanation, and review task families, we randomly split each dataset into training (80%), validation (10%) and testing (10%) sets, and ensure that there is at least one instance included in the training set for each user and item. To obtain the ground-truth explanations, following the natural language explanation works [30, 31], we first extract item feature words from the reviews with the help of the Sentires toolkit3[77, 78], and then extract the sentences from reviews that comment on one or more item feature words as users’ explanation about their preference. In terms of sequential recommendation task family, for each user interaction sequence, the last item is used as the test data, the item before the last one is used as the validation data, and the remaining data is used for training. To avoid data leakage during pretraining, we follow the training split of sequential recommendation to build the training set for direct recommendation task family.

Implementation Details

Our P5 model utilizes the pretrained T5 checkpoints [47] as backbone. According to the size of T5 backbone, we create two versions of P5, namely P5-small (P5-S) and P5-base (P5-B). For P5-small, there are 6 layers for both encoder and decoder, the model dimensionality is 512 with 8-headed attention, and the number of parameters is 60.75 million. For P5-base, encoder and decoder both have 12 Transformer blocks. The model has an embedding dimensionality of 768 and a 12-headed attention, and the number of parameters is 223.28 million. For tokenization, we use the SentencePiece [54] tokenizer with a vocabulary size of 32,128 for parsing sub-word units. We pretrain P5 for 10 epochs with AdamW optimization [39] on four NVIDIA RTX A5000 GPUs. The batch size is set to 16 for P5-base and 32 for P5-small. We choose 1 × 10−3 as the peak learning rate and set the maximum length of input tokens to 512. The warmup strategy is used to adjust the learning rate during training, the warmup stage is set to be the first 5% of all iterations. When negative sampling is needed for training, we use 1:1 positive vs. negative sampling for both P5 and baselines. Our default pretrain–predict combination adopts the last prompt in each task family for zero-shot evaluation while all remaining prompts are utilized for multitask prompted pretraining. For rating prediction, we use Gaussian sampling to convert the original integer scores to float numbers rounded to 1 decimal place. In this way, we can avoid overfitting the limited score types. After this change, we increase the number of score classes from 5 to 41. For sequential recommendation, we set the beam size 𝐵 to 20. For direct recommendation, the beam size is also 20 and the candidate pool contains 100 items, which consist of one ground-truth item and 99 sampled negative ones that the user has not interacted with.

评估指标（Metrics）

对于 review prediction，我们采用 Root Mean Square Error (RMSE) 和 Mean Absolute Error (MAE) 评估。
对于 sequential recommendation 和 direct recommendation，我们采用 topK Hit Ratio (HR@K) 和 Normalized Discounted Cumulative Gain (NDCG@K) 评估，给出 HR@1, 5, 10 和 NDCG@5, 10 的结果。
对于 explanation generation 和 review summarization，我们采用 BLEU-4, ROUGE-1, ROUGE-2, 和 ROUGE-L 评估。

RMSE 和 MAE 是“越低越好”，而其他指标是“越高越好”。对于所有表格，粗体数字表示最佳性能，下划线数字表示第二最佳性能。

Rating Prediction and Direct Recommendation

These tasks take the user–item rating/interaction data, but no content or side information is provided. We aim to justify whether the models are able to provide accurate rating prediction or recommendation lists that align with the user preferences. We use MF [25] and MLP [5] under mean square root loss as rating prediction baselines. For direct recommendation, we use BPR-MF [49], BPR-MLP [5], and a state-of-the-art contrastive learning-based collaborative filtering model SimpleX [43] as baselines.

Sequential Recommendation

We adopt several representative sequential recommendation approaches as our baselines. Caser [63] treats sequential recommendation as a Markov Chain and employs convolutional neural networks to model user interests. HGN [41] adopts a hierarchical gating networks to learn user behaviors from the perspectives of both long and short terms. GRU4Rec [21] is originally proposed for session-based recommendation. It utilizes GRU [7] to model the user click history sequence. BERT4Rec [60] mimics the BERT-style masked language modeling and learns a bidirectional representation for sequential recommendation. FDSA [73] focuses on the feature transition patterns by modeling feature sequence with a self-attention module. SASRec [24] adopts selfattention mechanism in a sequential recommendation model, which reconciles the properties of Markov Chains and RNN-based approaches. S3-Rec [80] leverages self-supervised objectives to help sequential recommendation model better discover the correlations among different items and their attributes. We use the implementation of S3-Rec and its baselines for comparison4.

Explanation Generation

For performance comparison, we consider several baselines with regard to the task of explanation generation. Attn2Seq [10] learns to encode attributes into vectors, and then invokes an attention mechanism to generate reviews conditioned on the attribute vector. NRT [32] utilizes GRU [7] to generate explanations based on user and item IDs. PETER [31] is a simple and effective framework that attempts to utilize user and item IDs to generate explanations. It is built upon a modified attention mask of the Transformer architecture. There is also a variant PETER+, which takes a hint feature word to assist the explanation generation.

Review Related

For review summarization, we adopt pretrained T0 [51] and GPT-2 [46] checkpoints hosted by Hugging Face5 as baselines. For review preference prediction, we only use T0 to make comparisons because GPT-2 cannot perform this task.

5.3 Performance Comparison on Different Task Families (RQ1)

In this section, we pretrain P5 with prompts from all five task families to verify its multitask learning ability. According to the default pretrain–predict task combination, we leave Prompt 1-10, Prompt 2-13, Prompt 3-12, Prompt 4-4, and Prompt 5-8 for zeroshot evaluation and pretrain P5 with the remaining personalized prompts. The performances of P5 and relevant baselines on the five task families are presented in Table 2 to Table 7. For each task family, we choose one or more seen prompts as supplement to the aforementioned zero-shot unseen prompts to perform evaluations.

5.3.1 Rating Prediction

Prompt 1-6 and Prompt 1-10 are used for evaluating P5’s performance on rating prediction. The performance comparison is presented in Table 2. We can see that when testing with seen Prompt 1-6, P5-B gets better MAE and slightly higher RMSE on all three datasets compared with MF. When testing with unseen Prompt 1-10, P5-B can achieve similar performance as Prompt 1-6. Moreover, P5-S usually has better MAE but higher RMSE. It seems that P5 is overfitting these data since the task complexity of rating prediction is relatively lower than other recommendation tasks. Overall, these results show that it is feasible to perform rating prediction on a conditional text generation framework.

5.3.2 Sequential Recommendation

As illustrated in Table 3, Prompt 2-3 and Prompt 2-13 are employed for the evaluation of sequential recommendation under all-item setting, i.e., using all items as candidates rather than sampling 100 or 1,000 items for ranking. From the table, we can see that P5-B surpasses all competitive baselines with a relatively large gap on both seen (Prompt 2-3) and unseen (Prompt 2-13) prompts. On Toys, P5-S can get even better performance than P5-B. While on Beauty and Sports, P5-B achieves the advantage over P5-S. The results show that the P5 architecture is effective in modeling the user interaction history and conducting next item prediction with the help of beam search.

5.3.3 Explanation Generation

In Table 4, Prompt 3-9 and Prompt 3-12 are used to evaluate P5’s performance on explanation generation under feature-based setup, while Prompt 3-3 is used for direct explanation generation without providing a hint word. We can see that for Prompt 3-3, P5 achieves the best performances against all baselines. For feature-based prompts (Prompts 3-9 & 3-12), P5 can outperform PETER+ on most cases, especially for Beauty and Toys.

5.3.4 Review Related

We take Prompts 4-2 and 4-4 to compare P5’s performance with T0 on review preference prediction, as shown in Table 5. We can see that P5-S achieves better RMSE and MAE on Beauty and Toys, while P5-B shows better performance on Sports. Additionally, we take Prompt 4-1 to evaluate P5’s ability on review summarization, as shown in Table 6. For this task, P5-S clearly outperforms T0 and GPT-2 on both Beauty and Toys datasets. It is worth noting that GPT-2 and T0 has 1.5B and 11B parameters, respectively. This shows that P5 can achieve better performances than these competitive baselines with a much smaller model size.

5.3.5 Direct Recommendation

Finally, Prompts 5-1, 5-4, 5-5 and 5-8 are applied to evaluate the direct recommendation task under the 1-out-of-100 evaluation setting. For binary question prompts (5-1 & 5-4), which are discriminative prompts, we use the softmax generation probability of “yes” to rank the candidate items. For open question prompts (5-5 & 5-8), which are generative prompts, we use beam-search (Eq.(2)) to generate the top-𝑘 list. The results are presented in Table 7. From the table, we can see that P5-B and P5-S have great advantages over BPR-MF and BPR-MLP on all three datasets. Comparing with SimpleX, we can see that P5 works especially well on top-1 item ranking, which is more than two times better than SimpleX on HR@1. Besides, P5 also achieves the best result on most of the other metrics. The success of P5 on direct recommendation shows the competence of the sequence-to-sequence generation framework in recommendation domain.

5.4 Zero-shot Generalization to Unseen Prompts and Items in New Domain (RQ2) 5.4.1 Transfer to Unseen Personalized Prompts

In this section, we transfer the pretrained P5 models to the previously heldout prompts during pretraining. These unseen prompts are from the same task families, and the testing items have been seen by P5 during pretraining at least once. The experimental results are also reported in Table 2 to Table 7. As previously discussed in Section 5.3, P5 achieves surprisingly good performances on various task families when being challenged by unseen prompts. On some specific datasets, the performances of P5 on unseen prompts even surpass seen prompts, e.g., P5-B gets the best performance under Prompt 2-13 on Sports. These results show that multitask prompted pretraining empowers P5 enough robustness to understand unseen prompts with wording variations.

5.4.2 Transfer to Items in New Domain

Next, we increase the difficulty level of zero-shot transfer. We collect a group of 741 users that exist in all the three domains with their interaction and review histories in other domains. The detailed statistics of these domain transfer evaluation sets are illustrated in Table 8. We then challenge P5-B pretrained on one domain with unseen prompts from the Task Family Z, whose item fields are filled with the information from a new product domain. For example, we ask the P5 model pretrained on the Toys domain about an existing user’s preference towards an item in the Beauty domain. The full results on all six directions are reported in Table 9. From the table, we notice P5 still maintains sufficient performances for rating prediction (Prompts Z-2 & Z-3), like/dislike prediction (Prompts Z-1 & Z- 4), as well as explanation generation with feature word (Prompt Z-6). In contrast, direct explanation generation without feature word (Prompts Z-5 & Z-7) is very difficult for P5 because it lacks awareness of relevant knowledge in the new domain. In Figure 4, we provide some example explanations generated by P5-B under the setup of zero-shot domain transfer (Prompt Z-6). We can see that P5 is able to catch different users’ rating preferences and hint feature words, then integrate them with the knowledge learned from previous domain to generate plausible explanations.

5.5 Ablation on Model Size (RQ3)

In this section, we will discuss the influence of model size on the performance of P5 on different recommendation tasks. Here, we train two size variants of P5, namely P5-small and P5-base. The parameter numbers of these two P5 models are 60.75M and 223.28M, respectively. From Table 2 to Table 7, we can see that although P5-S is only 1/4 of the size of P5-B, P5-S can beats P5-B on a series of tasks and datasets. For example, P5-S achieves better sequential recommendation, review preference prediction, and direct recommendation (Prompts 5-5 & 5-8) performances than P5-B on Toys. In contrast, P5-B shows advantages on sequential recommendation and review preference prediction tasks for Sports. Since Sports contains more users, items and reviews and has a lower sparsity, it requires a model with higher capacity to discover latent correlation among different personalized factors. The findings indicate that larger P5 models may be needed when the dataset is large, while for smaller datasets, smaller P5 models could be enough. As a result, we should decide an appropriate model size that matches the scale of the training data.

5.6 Ablation on Task Scaling (RQ3)

Moreover, we explore whether multitask prompted pretraining is superior than pretraining on each task family alone. We pretrain P5-small on Beauty dataset with prompts from every single task family, resulting in five models – P5-S1, P5-S2, P5-S3, P5-S4, and P5-S5. We then compare P5-S on various recommendation tasks with the corresponding single task P5 model. The performance comparison between P5-S and P5-SN (𝑁 ∈ [1, 2, 3, 4, 5]) is illustrated in Figure 5. As shown in the figure, P5-S achieves comparable or better performance than P5-SN on rating prediction, sequential recommendation and direct recommendation tasks, while on text generation tasks such as explanation generation (Prompts 3-9 & 3-12) and review summarization (Prompt 4-1), P5-SN is better than P5-S. This indicates that multitask modeling (P5-S) seeks a good balance among tasks and improves recommendation performance by leveraging the power of language understanding. Besides, both P5-S and P5-SN perform better than or comparable with state-ofthe-art baselines on all tasks, as shown in Table 2 through Table 7, which demonstrates the power of P5 for recommendation.

5.7 Ablation on Prompt Scaling (RQ3)

As mentioned in implementation details, our default pretrain–predict task combination follows the leave-one-out strategy. However, do we need so many prompts during pretraining to enable P5’s zeroshot generalization ability? In this section, we explore to reduce the number of pretraining prompts and then make comparisons with the P5 model pretrained under default setup. To this end, we choose a collection of pretraining prompts that has the minimum number of prompts to cover all important personalized fields. Specifically, this combination contains the following 18 personalized prompts: {1-5, 1-6, 1-8, 1-9, 2-1, 2-3, 2-8, 2-11, 3-2, 3-3, 3-6, 3-9, 4-1, 4-2, 4-3, 5-2, 5-5, 5-7}. Similar to the default pretrain–predict combination, the last prompt in each task family is for zero-shot evaluation. We name this prompt scaling variant of P5-small as P5-PS and then pretrain P5-PS on Beauty dataset. The performance comparison between P5-S and P5-PS is also presented in Figure 5. From the figure, we can observe that P5-S beats P5-PS on most tasks except for some generation tasks (i.e., Prompts 3-3, 3-9 & 4-1). Interestingly, P5-S outperforms P5-PS on Prompt 3-12 – a zero-shot explanation generation task. In fact, P5-S also shows its superiority on other zero-shot tasks such as Prompts 1-10, 2-13, and 5-8. Overall, we can find that larger number of high quality personalized prompts can generally help P5 achieve better performances on various recommendation tasks especially zero-shot tasks with unseen prompts.

5.8 如何实现个性化（unique tokens vs. sub-word units） (RQ4)

这一节讨论不同的个性化实现方式，并比较它们在 P5 中的性能。

方案一（默认，P5-S 模型）：是使用 SentencePiece tokenizer 将个性化字段拆分为多个 sub-word 单元，同时使用 whole-word embedding 来保留字段信息（见图 3）。
方案二：给每个 user 和 item 一个独立 token。这里我们称之为 P5-I。

前者利用协同学习隐式优化不同 sub-work token 之间的相关性，后者通过新引入的 token 学习到了每个唯一的用户或物品。性能比较见下图，

Figure 6: Performance of P5-S and P5-I on Beauty showing the influence of how to implement personalization.

可以看到

P5-I 在回归任务（Prompts 1-6 & 1-10 for rating prediction, Prompts 4-2 & 4-4 for review-based rating regression）和摘要生成任务（Prompt 4-1）上与 P5-S 表现相似。
P5-I 在解释生成任务（Prompts 3-3, 3-9 & 3-12）上略优于 P5-S。
P5-I 在顺序推荐和直接推荐任务（all prompts in Figure 6 (c) & (d)）上显著低于 P5-S，差距很大。

P5-I 性能较低的原因，跟 T5 初始化的那些原始子 sub-word units 比，新引入的大量额外 token 和 embedding 太稀疏。

这表明我们采用的 sub-word 方案可以通过协同学习实现更好的推荐和整体性能，同时只需要保持数量比较少的可学习 tokens。

根据论文 5.8 小节（”How to Implement Personalization”），P5-I（为每个用户/物品分配独立 token）与 P5-S（默认的子词单元分解）在不同任务场景下表现差异显著，具体原因可归结为以下四个层面的技术细节：

Kimi 老师的进一步解释（译注）一、参数效率与协同学习机制的根本差异实现方式参数规模初始化方式协同学习能力 P5-S（默认）恒定（复用 T5 的 32,128 词表）继承 T5 预训练子词嵌入强：不同 ID 共享子词单元（如”user”、”_“、”12”），通过组合模式隐式学习用户/物品间的关联 P5-I（独立 token）线性增长（需为每个用户/物品新增嵌入）随机初始化弱：每个 ID 嵌入完全独立，只能通过任务监督信号学习，无法跨 ID 共享知识

核心问题：P5-I 为 Amazon Sports 数据集的 35,598 个用户和 18,357 个物品各新增一个 token 时，需引入约 5.4 万个全新嵌入向量，这些参数从零开始训练且在预训练数据中出现频率不均，导致：

高频 ID 过拟合到特定训练样本
低频 ID 欠训练，表示质量差
失去 T5 原有的语言理解和泛化能力

二、任务场景差异的具体分析 1. P5-I 表现”相似或略好”的场景：回归任务 & 文本生成任务

具体任务：评分预测（Prompt 1-6/1-10）、评论偏好预测（Prompt 4-2/4-4）、解释生成（Prompt 3-3/3-9/3-12）

原因：

监督信号直接：这些任务的输入包含丰富的语义信息（如评论文本、物品标题），模型主要依赖 T5 的编码-解码能力，对 ID 本身的协同信号需求较低
记忆优势：P5-I 的独立嵌入能有效”记忆”特定用户的评分/写作风格模式，在训练集上获得更低损失
论文数据佐证：在 Beauty 数据集上，P5-I 在解释生成任务 BLEU-4 分数略高（+0.02），但在 Sports 数据集上无显著差异，说明小数据集上记忆效应更明显

2. P5-I 表现”显著更差”的场景：纯推荐任务

具体任务：

序列推荐（Prompt 2-3/2-13）：需建模用户行为序列中的模式转移（如”买了篮球→可能买球鞋”）
直接推荐（Prompt 5-5/5-8）：需从候选物品中选出最匹配的 top-k

性能差距数据（论文 Table 7 & Figure 6）：

Sports 数据集上，P5-I 的 HR@1 比 P5-S 下降 61%（0.0701→0.0274）
Beauty 数据集上，NDCG@5 下降 47%（0.1673→0.0882）

根本原因：

协同信号丢失：子词分解让相似 ID 共享模式（如”item_12345”和”item_12346”共享前缀），P5-I 完全隔离，无法捕捉用户-物品交互的隐含结构
冷启动灾难：在 zero-shot 场景（Prompt 5-8），P5-I 对未见物品的独立嵌入从未被训练，预测完全失效；而 P5-S 可通过子词组合泛化到新物品 ID
优化困难：P5-I 的 ID 嵌入参数量巨大，在 multitask pretraining 中梯度更新不稳定，易陷入局部最优

三、数据集规模的影响

论文图 6 显示，数据规模越大，P5-I 劣势越明显：

数据集用户/物品数 Sparsity P5-I vs P5-S 在直接推荐上的差距 Sports 35,598 / 18,357 0.0453% 最大（HR@1 差距-61%） Beauty 22,363 / 12,101 0.0734% 中等（HR@1 差距-30%） Toys 19,412 / 11,924 0.0724% 最小（但仍有-15%）

解释：数据越稀疏，越需要强大的协同过滤能力来挖掘有限交互中的模式。P5-S 的子词共享机制在稀疏场景下能有效迁移知识，而 P5-I 的独立嵌入因样本不足无法充分训练。

四、总结

原文指出：

“The reason behind P5-I’s lower performance lies in that the newly introduced huge number of extra tokens and embeddings cannot be well trained compared with the original sub-word units initialized from T5.”

P5-S 通过whole-word embedding 补偿了子词拆分带来的信息损失，既保留协同学习能力，又避免引入过多新参数，是实现个性化更优的工程选择。

总结：P5-I 仅在对协同信号依赖低的文本生成任务中表现尚可；在核心推荐任务中，其参数爆炸性增长和协同能力缺失导致性能灾难性下降。这验证了论文的核心设计理念——通过语言建模的统一框架实现知识迁移，而非为每个实体学习孤立表示。

6 CONCLUSIONS AND FUTURE WORK

In this paper, we present P5 which unifies different recommendation tasks into a shared language modeling and natural language generation framework. By designing a collection of personalized prompts covering five recommendation task families, we transfer all raw data such as the user-item interactions, user descriptions, item metadata, and user reviews to the same format – input-target text pairs. We then pretrain P5 in a full language environment to help it discover deeper semantics for various recommendation tasks. According to our experiments, P5 can beat or achieve similar performance with several representative approaches on all five task families. Moreover, P5 shows the generalization ability on performing zeroshot transfer to new items, new domains, and new personalized prompts. In the future, we will continue exploring to further enlarge the model size of P5 and employ more powerful base models such as GPT-3, OPT, and BLOOM. Besides, P5 is a very flexible paradigm and it is promising to further extend P5 to diverse modalities and more tasks such as conversational recommendation, comparative recommendation, cross-platform recommendation, or even various search tasks by incorporating user queries into P5. Finally, in this work, we designed explicit prompts since they are intuitive, flexible, and close to the natural way of how humans communicate with each other, which enables instruction-based recommendation, while in the future, we will also investigate prompt search and/or latent prompt techniques to achieve instruction prompts or leverage retrieval-enhanced generation to further boost P5’s performance on downstream tasks.

[译][论文] P5 paper | 用语言模型做推荐：一种统一的预训练、个性化提示和预测范式（2022）

ARTHURCHIAO'S BLOG

2 weeks 2 days ago

译者序

本文翻译自 2022 年 RecSys 大会的一篇论文 Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)。

Figure 1: P5 pretrains on an encoder–decoder Transformer model that takes in textual inputs and produces target responses.

图 3：P5 架构示意图。

水平及维护精力所限，译文不免存在错误或过时之处，如有疑问，请查阅原文。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

以下是译文。

译者序
摘要
1 引言
2 相关工作
3 个性化 prompts 集合
- 3.1 Prompts 设计
- 3.2 从原始数据构建训练数据集（prompts & answers）
4 P5 范式与模型
- 4.1 P5 架构
- 4.2 用预训练的 P5 进行推荐任务（推理）
5 实验
6 CONCLUSIONS AND FUTURE WORK

在 P5 中，所有数据 （user-item interactions, user descriptions, item metadata, user reviews 等）都被转换为统一的自然语言序列。
自然语言所蕴含的丰富信息有助于 P5 捕获更深层的语义，从而实现个性化推荐。

具体而言，P5 在预训练阶段通过相同的语言建模目标学习不同任务，从而成为各类下游推荐任务的基础模型。

P5 不仅能轻松与其他模态信息融合，还能基于提示实现指令驱动的推荐。
P5 将推荐系统从浅层模型、深度模型推进至大模型阶段，并将以通用推荐引擎的形式彻底革新推荐系统的技术形态。
通过为不同用户自适应生成个性化提示，P5 能够以零样本或少样本方式进行预测，大幅减少了对大量微调的依赖。

我们在多个推荐基准测试上进行了实验，验证了 P5 的有效性，相关代码和模型也已经开源：

github.com/jeykigung/P5 开源了源代码、数据集、提示词及预训练的 P5 模型。
huggingface.co/makitanikaze/P5 模型。

1 引言

1.1 现阶段推荐系统的特点特征表示和学习越来越复杂

推荐系统中的 feature engineering 和 learning 已经从简单发展到复杂。

早期，推荐系统通常采用 logistic regression 或 collaborative filtering [25, 35, 50, 52]，利用 user-item interaction 数据来建模用户的行为模式。
之后，通过更复杂的模型如 factorization machines [48] 和 GBDT [20]，将 contextual features（如 user profile 和 item metadata）进一步整合到系统中。
最近，deep neural network models [3, 5, 19, 74] 促进了更加多样和复杂的特征之间的交叉与组合。因此，与传统基于 feature engineering 的方法相比，这些模型获得了更好的表示能力。

推荐任务的类型越来越多样

sequential recommendation [21, 60, 63, 80]
conversational recommendation [8, 61, 76]
explainable recommendation [17, 31, 62, 70, 75, 77]

等等。虽然上述推荐任务的方法通常是单独提出的，但一个明显的趋势是 利用多个推荐任务来联合学习 transferable representations [31, 56, 57, 72]。

1.2 现代推荐系统需要什么

尽管现有的推荐系统取得了巨大成功，但在解决实际问题上仍面临很多问题，我们认为需要一个能支持多样特征和不同类型任务的综合推荐系统。

1.3 P5 的创新点

受最近 multitask prompt-based training [1, 51, 67] 进展的启发，本文提出一个统一的范式 P5。它有三个主要优势：

将推荐模型（行为模型）深度融入到语言环境（语言模型）中。

基于 personalized prompts，所有推荐任务都被重新表述为 NLP 任务。由于自然语言足够灵活和强大，能够用文本表达各种类型的特征，因此无需设计 feature-specific encoders。通过这种方式，P5 可以充分利用训练语料库中丰富的语义和知识；

译注：从 Tokenization 视角看生成式推荐（GR）近几年的发展（2025）
将多个推荐任务放到同一个 text-to-text encoder-decoder 中，并使用相同的 language modeling loss 进行训练，而不是设计 task-specific 架构和 objective functions。

换句话说， P5 将所有 personalized tasks 视为 conditional text generation 问题；
通过 instruction-based prompts 训练，P5 在推广到新的 personalized prompts 或其它领域中未见过的 items 时，获得了良好的 zero-shot 性能。

2 相关工作 2.1 统一框架的尝试

之前已经有一些工作试图在统一模型中解决各种推荐任务。

基于通用语言模型（T5 和 GPT3）

早期先驱，

T5 ：通过 text-to-text encoder-decoder 框架统一了 NLP 下游任务。
GPT-3：通过 autoregressive language modeling 统一了 NLP 下游任务。

它们都能基于同一个预训练的语言模型实现不同任务之间的有效知识共享（即，通用模型）。

基于自然语言的 seq-to-seq 架构

但是，这类方法没有在模型中考虑个性化。

基于通用用户表示

[56, 57, 72] 尝试学习易于迁移到下游任务的通用用户表示。这些方法的一个局限性是它们仍然需要在下游数据集上进行 finetuning。

2.2 通过提示的方式学习（Prompt Learning）

GPT 系列尤其是 GPT-3 的成功标志着 prompt 在 NLP 任务中的普及。

在互联网上收集的大量语言数据进行训练，GPT-3 展示了在提供少量输入-输出示例作为 exemplar prompts 时解决 NLP 任务的能力。
其他一些遵循 “pretrain, prompt, and predict” 范式的 prompt 设计方法最近也有发展 [37]。
- [16, 23, 36, 40, 58] 探索了针对特定离散提示的搜索。
- [18, 28, 33, 38, 45, 81] 利用连续向量 embedding 作为提示。

受这些方法成功的启发，我们创建了一个个性化提示集，然后在一个多样化的推荐任务上训练一个 sequence-to-sequence 模型。

2.3 推荐领域的 NLP

推荐已经与 NLP 技术有很长时间的交集了。四个主要方向：

explainable recommendation [4, 10, 30–32, 75, 77] where NLP models help generating text explanations for a given recommendation;
sequential recommendation as language modeling [9, 60, 80] which considers user interaction histories as word token sequences;
text feature extraction [69, 74, 79] which aims to extract informative text encodings that can improve the performance of recommendation;
conversational recommendation [8, 12–14, 22, 61, 76] that reasons the intent of users and gives recommendation in an interactive dialog format.

本文主要涵盖前两种任务，并讨论了如何设计一个统一的 NLP 框架来涵盖 rating prediction、top-k recommendation 和 review summarization 等任务。

此外，通过使用与传统相似的指令式提示进行预训练，P5 受益于自然语言环境，提高了在系列推荐任务上的性能。

2.4 Zero-shot 和冷启动推荐

一个常见的研究是冷启动推荐，即用户 [26] 或物品 [53] 是新系统，没有之前的交互记录。

常见解决方案是学习去建模内容特征 [15, 29, 44, 55]，以便在没有交互记录的情况下进行推理，或者是从其他的辅助域学习迁移表示 [42, 56, 59, 72, 82]。
另一种解决方式是快速适应新域（quick adaptation to the new domain），而非供冷启动 case。解决方案通常遵循meta learning [27, 64] 或因果学习 [34] (causal learning) 框架，使模型对域适应具有鲁棒性。

在我们的工作中，我们要求 P5 模型在辅助域上预训练，以解决目标域上的任务，其中用户对 P5 是已知的，但物品 P5 是没见过的。

3 个性化 prompts 集合

为了方便 multitask prompt-based pretraining，我们创建了一个个性化提示集。个性化提示集覆盖了五类不同的任务：

rating prediction
sequential recommendation
explanation
review
direct recommendation

每类任务包含多个个性化提示，帮助 P5 发现用户和物品的各个方面关联。

3.1 Prompts 设计

我们针对每个任务设计了一个基本的个性化提示集。

rating prediction 提示词设计

对于 rating prediction 任务，我们将其提示分为三个类别：

给定用户和物品的信息，直接预测用户给该物品的评分，范围从 1 到 5；
预测用户是否会给一个物品指定的评分（rate an item a given score）。期望输出是 yes 或 no；
预测用户是否喜欢或不喜欢一个物品。

我们考虑评分等于或大于 4 为用户的喜欢偏好，而较低的评分表示用户的不喜欢偏好。

sequential recommendation 提示词设计

针对 sequential recommendation 任务，我们创建了三种类型的提示：

基于用户交互历史，直接预测下一个物品；
给定用户交互历史，从候选列表中选择可能的下一个物品，其中只有一个物品是正样本；
基于用户交互历史，预测给定物品是否会被用户下次交互。

explanation 提示词设计

针对 explanation 任务，我们要求 P5 生成一个文本解释，以证明用户对给定物品的偏好。两种提示：

直接生成一个包含用户和物品信息的解释句子；
基于一个特征词作为提示，生成解释。

对于每个类别，可能还包括其他辅助信息，例如评论标题和评分。

review 相关提示词设计

针对 review 相关任务，我们创建了两种类型的提示：

总结评论，生成一个更短的评论标题；
基于给定的评论，预测相应的评分。

direct recommendation 提示词设计

针对 direct recommendation 任务，我们创建了两种类型的提示：

预测是否向用户推荐一个物品，期望输出是 yes 或 no；
从候选物品列表中选择最合适的物品推荐给用户。

完整的个性化提示集见附录。

3.2 从原始数据构建训练数据集（prompts & answers）

构建训练数据的过程如图 2 所示，

4 P5 范式与模型

4.1 P5 架构

具体到 P5 架构，我们采用基本的 encoder-decoder 框架，并使用 Transformer 构建编码器和解码器。

假设输入 token 序列的 embedding 为 $\mathbf{x} = \left[x_1, \cdots, x_n\right]$。如 Figure 3 所示，

位置编码

增加位置编码，以捕获序列中的位置信息。

Whole-word embedding，补偿 item token 表示被 tokenizer 拆分带来的语义损失

为了使 P5 捕捉输入序列中包含的个性化信息，我们还应用 whole-word embedding $\mathcal{W}$ 来表示连续的 sub-word token 是否来自同一个原始单词。

为什么需要这个步骤呢？举个例子，

如果我们直接用 ID 7391 表示物品，即 item_7391，那么这个词经过 SentencePiece tokenizer 之后，就会变成 4 个独立的 token（item, _, 73, 91），而不是我们期望的一个。通过共享的 whole-word embedding （图 3 中的 <w10>），P5 可以更好地识别包含个性化信息的字段。
另一种方案是每个用户/物品用一个独立的额外 token 表示（例如，<item_7391>）。然而，当用户和物品数量很大时，这可能会引入大量的额外 token。

encoder & decoder

解码器 $\mathcal{D}(\cdot)$ 然后关注之前生成的 token $\mathbf{y}$ 和编码器输出 $\mathbf{t}$，并预测未来 token 的概率分布：

$P_{\theta}\left(\mathbf{y}_{j} \mid \mathbf{y}_{<j}, \mathbf{x}\right) = \mathcal{D}(\mathbf{y}_{<j}, \mathbf{t})$。

在预训练阶段，P5 minimizing the negative log-likelihood of label tokens y conditioned on input text x in an end-to-end manner：

这个相同的损失函数被所有 P5 下的推荐任务共享。因此，我们统一推荐任务，使用一个模型、一个损失和一个数据格式。

4.2 用预训练的 P5 进行推荐任务（推理）

在预训练之后，P5 可以直接个性化提示执行不同的任务，不管这些 prompts 它有没有见过。

对于 rating、explanation 和 review 任务，简单地使用贪心解码（greedy decoding）来生成答案。
对于 sequential 和 direct recommendation 任务，通常需要一个物品列表作为目标输出，使用 beam search。

其中 $B$ 表示 beam size，$\mathbf{C}$ 表示输出物品列表。

5 实验

本节我们评估 P5 在真实世界数据上的性能，并与其他代表性方法进行比较。通过性能比较和消融研究，我们旨在回答以下问题：

5.0 要回答的问题 (RQ 1~5) 问题一：P5 与 task-specific 方法的性能比较

How does our unified P5 framework perform compared with task-specific methods on all five task families?

问题二：P5 的零样本泛化能力

Does P5 have enough zero-shot generalization ability when transferring to unseen personalized prompts for either existing or new items?

问题三：P5 的性能如何受模型大小、任务数量和提示数量影响？

How do scaling factors such as model size, number of task families, and number of prompts affect the performance of P5?

问题四：P5 中实现个性化推荐的最佳方式是什么？（unique token vs. sub-word units）

问题五：P5 的预训练时间？P5 的推理性能？

How long does it take for P5 to conduct pretraining? Is it efficient to make inference with the pretrained P5 model? We provide statistics on training and inference time in the Appendix

5.1 Experimental Setup Datasets

Task splits

Implementation Details

评估指标（Metrics）

对于 review prediction，我们采用 Root Mean Square Error (RMSE) 和 Mean Absolute Error (MAE) 评估。
对于 sequential recommendation 和 direct recommendation，我们采用 topK Hit Ratio (HR@K) 和 Normalized Discounted Cumulative Gain (NDCG@K) 评估，给出 HR@1, 5, 10 和 NDCG@5, 10 的结果。
对于 explanation generation 和 review summarization，我们采用 BLEU-4, ROUGE-1, ROUGE-2, 和 ROUGE-L 评估。

RMSE 和 MAE 是“越低越好”，而其他指标是“越高越好”。对于所有表格，粗体数字表示最佳性能，下划线数字表示第二最佳性能。

Rating Prediction and Direct Recommendation

Sequential Recommendation

Explanation Generation

Review Related

5.3 Performance Comparison on Different Task Families (RQ1)

5.3.1 Rating Prediction

5.3.2 Sequential Recommendation

5.3.3 Explanation Generation

5.3.4 Review Related

5.3.5 Direct Recommendation

5.4 Zero-shot Generalization to Unseen Prompts and Items in New Domain (RQ2) 5.4.1 Transfer to Unseen Personalized Prompts

5.4.2 Transfer to Items in New Domain

5.5 Ablation on Model Size (RQ3)

5.6 Ablation on Task Scaling (RQ3)

5.7 Ablation on Prompt Scaling (RQ3)

5.8 如何实现个性化（unique tokens vs. sub-word units） (RQ4)

这一节讨论不同的个性化实现方式，并比较它们在 P5 中的性能。

方案一（默认，P5-S 模型）：是使用 SentencePiece tokenizer 将个性化字段拆分为多个 sub-word 单元，同时使用 whole-word embedding 来保留字段信息（见图 3）。
方案二：给每个 user 和 item 一个独立 token。这里我们称之为 P5-I。

前者利用协同学习隐式优化不同 sub-work token 之间的相关性，后者通过新引入的 token 学习到了每个唯一的用户或物品。性能比较见下图，

Figure 6: Performance of P5-S and P5-I on Beauty showing the influence of how to implement personalization.

可以看到

P5-I 在回归任务（Prompts 1-6 & 1-10 for rating prediction, Prompts 4-2 & 4-4 for review-based rating regression）和摘要生成任务（Prompt 4-1）上与 P5-S 表现相似。
P5-I 在解释生成任务（Prompts 3-3, 3-9 & 3-12）上略优于 P5-S。
P5-I 在顺序推荐和直接推荐任务（all prompts in Figure 6 (c) & (d)）上显著低于 P5-S，差距很大。

P5-I 性能较低的原因，跟 T5 初始化的那些原始子 sub-word units 比，新引入的大量额外 token 和 embedding 太稀疏。

这表明我们采用的 sub-word 方案可以通过协同学习实现更好的推荐和整体性能，同时只需要保持数量比较少的可学习 tokens。

高频 ID 过拟合到特定训练样本
低频 ID 欠训练，表示质量差
失去 T5 原有的语言理解和泛化能力

二、任务场景差异的具体分析 1. P5-I 表现”相似或略好”的场景：回归任务 & 文本生成任务

具体任务：评分预测（Prompt 1-6/1-10）、评论偏好预测（Prompt 4-2/4-4）、解释生成（Prompt 3-3/3-9/3-12）

原因：

监督信号直接：这些任务的输入包含丰富的语义信息（如评论文本、物品标题），模型主要依赖 T5 的编码-解码能力，对 ID 本身的协同信号需求较低
记忆优势：P5-I 的独立嵌入能有效”记忆”特定用户的评分/写作风格模式，在训练集上获得更低损失
论文数据佐证：在 Beauty 数据集上，P5-I 在解释生成任务 BLEU-4 分数略高（+0.02），但在 Sports 数据集上无显著差异，说明小数据集上记忆效应更明显

2. P5-I 表现”显著更差”的场景：纯推荐任务

具体任务：

序列推荐（Prompt 2-3/2-13）：需建模用户行为序列中的模式转移（如”买了篮球→可能买球鞋”）
直接推荐（Prompt 5-5/5-8）：需从候选物品中选出最匹配的 top-k

性能差距数据（论文 Table 7 & Figure 6）：

Sports 数据集上，P5-I 的 HR@1 比 P5-S 下降 61%（0.0701→0.0274）
Beauty 数据集上，NDCG@5 下降 47%（0.1673→0.0882）

根本原因：

协同信号丢失：子词分解让相似 ID 共享模式（如”item_12345”和”item_12346”共享前缀），P5-I 完全隔离，无法捕捉用户-物品交互的隐含结构
冷启动灾难：在 zero-shot 场景（Prompt 5-8），P5-I 对未见物品的独立嵌入从未被训练，预测完全失效；而 P5-S 可通过子词组合泛化到新物品 ID
优化困难：P5-I 的 ID 嵌入参数量巨大，在 multitask pretraining 中梯度更新不稳定，易陷入局部最优

三、数据集规模的影响

论文图 6 显示，数据规模越大，P5-I 劣势越明显：

四、总结

原文指出：

P5-S 通过whole-word embedding 补偿了子词拆分带来的信息损失，既保留协同学习能力，又避免引入过多新参数，是实现个性化更优的工程选择。

6 CONCLUSIONS AND FUTURE WORK

以旅行规划（Trip Planning）为例，看 DeepSeek-V3.2 如何合成高质量训练数据（2025）

ARTHURCHIAO'S BLOG

2 weeks 6 days ago

如何基于 Agent/LLM 强大的规划能力+生成能力+代码执行能力+反思能力，自动化合成大批量高质量数据：

Hypothetical workflow

DeepSeek-V3.2: workflow for synthesizing high-quality agentic datasets for RL training (in agentic fashion, without human intervention)

水平及维护精力所限，文中不免存在错误或过时之处，请酌情参考。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

1 场景：增强模型的 Trip Planning 能力
- 1.1 方案拆解
- 1.2 子任务：准备高质量的 Trip Planning 数据
2 方案：自动合成高质量 Trip Planning 数据
3 图解：DeepSeek-V3.2 是怎么做的（”Large-Scale Agentic Tasks”）
4 Kimi 老师补充的一些细节，帮助理解
- 4.1 生成的 Task 示例
- 4.2 输出样本要求
5 DeepSeek papers

1 场景：增强模型的 Trip Planning 能力

假设你在训练一个通用模型或垂域的旅游行业模型，那你可能会遇到下面这样的用户诉求：

我计划今年十一从杭州出发玩三天，请帮我制定一份行程规划。几个要求：整个行程我不想重复任何一个城市、酒店、景点或餐厅。另外，请务必确保推荐的每家酒店、餐厅和景点都确实位于我当日所在的城市。关于第二天还需要注意：如果当晚入住的豪华酒店价格在800元人民币及以上，则需严格控制其他开销——当日两家餐厅（午餐与晚餐）总消费需低于350元，且两家餐厅评分均不低于4星，下午游览的景点门票需低于120元。若第二天酒店属于中高档（500-800元），则预算可稍放宽：只需确保至少一家餐厅评分达 4.0星以上，且景点门票低于180元。若选择经济型酒店（200-500元），则只需保证至少一家餐厅评分在3.2星以上。

要回答好这类问题，就需要对模型的行程规划（Itinerary）或称 旅游规划（Trip Planning）能力进行专门训练。

具体该怎么做呢？我们来尝试设计一个方案。

1.1 方案拆解

从非常高的 level 来说，要完成以上训练任务只需要做两件事情：

数据集准备：准备一批高质量的 Trip Planning 数据
后训练：基于高质量训练数据，对模型进行微调（SFT）或强化学习（RL）

本文接下来只关注第一个任务，高质量数据集的准备。

1.2 子任务：准备高质量的 Trip Planning 数据

再次从 high level 来说，这样的高质量数据集有两种来源：

人工标注：例如，找专业的旅行定制师或资深的旅行家，人工编写高质量的语料；
自动合成：通过某种不依赖人工的方式自动合成。

考虑到这个数据集不仅要求质量高，样本数量也要比较多，靠专业的人工标注成本是很高的，而且人工标注方式的可扩展很差，因此我们接下来考虑自动合成的方式。

2 方案：自动合成高质量 Trip Planning 数据 2.1 思考：人（专家）怎么完成这个任务

先来设想一下，如果上面的旅行规划任务给到的是专业的旅游定制师或资深的旅行家，他们是如何来完成这个任务的（也就是数据标注过程）。可能的工作流程：

定制师或旅行家基于自己丰富的业务知识（城市、交通、景点、酒店、预算、偏好等等），初步判断下杭州出发三天能玩的目的地范围，得到一些备选目的地；
针对这些备选目的地，以杭州为出发地，通过手动搜索或数据库查询，进一步充实交通、住宿、餐饮、景点、预算等需求，得到一些备选线路；
针对这些备选线路，再进一步验证里面的每个具体步骤是否满足用户的要求，以及整体方案是否满足用户的要求；如果满足就留下这个线路；如果不满足（例如某一天的预算超了）就进行相应的调整直到满足，或者多次失败之后直接弃用这个备选路线；
如果用户觉得上一步验证通过的线路还是不够有吸引力，则回到 step 1 or step 2 并顺序执行到 step 3，针对用户需求重新设计一些更有吸引力的线路。

经过以上步骤，最终得到的就是一些符合用户要求的高质量线路规划。

2.2 自动化：人工方案的 workflow 化

把以上的人工生产线路过程变成一个 workflow，就得到了一个基于 Agent 的自动化方案：

首先，我们得从某些地方获取一些 Trip Planning 相关的基础旅游数据，例如城市、交通、酒店、景点、价格等等信息，把它们存储起来备用；
接下来，得有一些工具来从这些数据中筛选出我们想要的信息，例如查询两个城市之间的交通方案、查询给定城市内的餐厅和景点等；
有了前两步的基础，剩下的就是生成一个具体的旅行规划任务，例如，“规划从上海到北京的三日游”，让 Agent 基于上一步提供的各种工具，帮我们将这个旅行规划方案设计出来。这个过程可以进一步拆解为两个子任务：
1. 生成：生成具体的旅行规划；
2. 验证：验证生成的旅行规划是否符合用户的要求。

基于以上流程，无需人工参与，就能自动完成一个行程规划任务，

如果验证 OK，就将这个结果输出；然后继续生成下一个（更难的）旅行规划任务；
如果失败，就要看问题是出在哪里，例如可能是工具不够、生成的方案不对、方案对但验证过程有问题等，尝试调整这几个环节，直到方案成功。

2.3 这个 workflow 的独特之处

这个 workflow 画成图大概长下面这样，跟普通 workflow 的重要区别是： Agent 不仅生成任务本身（task），还生成完成这个任务的代码（solution function）、工具代码（tool functions）和验证结果的代码（verification function），并通过动态执行这些代码筛选出符合用户要求的高质量结果。

Hypothetical workflow

图的上半部分可以叫“生成环境”，这是常规 LLM 擅长做的；
图的下半部分是“执行环境”，把上一步生成的代码真正拿来运行，再根据运行结果给 Agent 一个反馈，进入 Agent 的反思和下一次迭代流程。
整个方案的输入只有一段提示词（如果不算执行环境），其他都是 Agent+Workflow 创建和管理的。

2.4 小结

实际上，思考以上问题是因为在看 DeepSeek-V3.2 tech report 时刚好看到它有这样一个 case，觉得玩得很高级。接下来我们看看 DeepSeek 在这种合成高质量数据场景的具体方案设计。

3 图解：DeepSeek-V3.2 是怎么做的（”Large-Scale Agentic Tasks”）

DeepSeek-V3.2 tech report 的 3.2.3 Large-Scale Agentic Tasks 介绍了他们是如何强化大规模 Agentic 任务的，其中就涉及到了数据集的合成，我们前面介绍的 “Trip Planning” 例子其实就是来自这里。

3.1 方案描述

原文：

General Agent To scale up agent environments and tasks in RL, we employ an automatic environment-synthesis agent that synthesizes 1,827 task-oriented environments. These tasks are hard to solve but easy to verify. The synthesis workflow primarily consists of environment and toolset construction, task synthesis, and solution generation. Specifically, the workflow proceeds as follows.

Given a task category (e.g., planning a travel itinerary) and a sandbox equipped with a bash and a search tool, the agent first uses these tools to generate or retrieve relevant data from the Internet and store them in the sandbox database.
The agent then synthesizes a set of task-specific tools, each implemented as a function.
To create tasks that are both challenging and automatically verifiable, the agent initially proposes a simple task based on the current database, along with its solution and verification functions implemented in Python. The solution function is restricted to invoking tool functions or performing logical computations, and cannot call other functions or directly access the database, ensuring the task can only be solved through the tool interface. Additionally, the results produced by the solution function must be validated by the verification function. If the solution is not validated, the agent will modify the solution or verification functions until the solution’s output passes the verification. The agent then iteratively increases the difficulty of the task and updates the corresponding solution and verification functions. During this iterative process, if the current toolset is not sufficient to solve the task, the agent will augment the toolset.

为了扩展 RL 中的 agent 环境和任务，我们采用了一个自动的 environment-synthesis agent，该 agent 合成了 1,827 个 task-oriented environments。这些任务的特点是解决起来很难，但验证很容易。该 synthesis workflow 主要包括 environment & toolset 构建、task synthesis 以及 solution generation。

Trip Planning 是其中的任务类型之一。

3.2 方案图解

具体过程如下图所示（根据个人理解画的，仅供参考，因为很多细节原文没提）：

核心是一个 Agent，接下来按序号介绍下各步骤。

Step 0: Agent 输入

给 Agent 输入任务类型（e.g. “Trip Planning”）和可用的 sandbox 信息；

任务类型有很多种，旅行规划只是其中之一；
sandbox 可以理解成一个 linux container，例如 Ubuntu，配置了 bash 和 search tool；

Step 1: Agent 构建旅行数据库

Agent 开始干活，首先进入 sandbox，然后用 internet search tool 从互联网搜索相关数据，并保存到 local database；

输入：任务类别（如 “trip planning”）+ 配备 bash 和 search 工具的 sandbox 环境
过程：Agent 使用搜索工具从互联网爬取或生成结构化数据，包括交通、酒店、景点、门票、餐厅等等，存储到 sandbox 的数据库中
输出：结构化数据表
local database 可以想象成一个 SQLite 数据库

效果示意：

输入指令：请为"杭州三日游规划"任务准备基础数据执行过程： - 调用搜索工具查询"杭州五星级酒店 2025"、"杭州西湖景点"、"杭州米其林餐厅" - 调用 bash 工具解析搜索结果并写入 SQLite 数据库输出（数据库内容）： - cities 表: [杭州, 苏州, 上海, 南京] - hotels 表: ┌─────────────────┬────────┬────────┐ │ hotel_name │ city │ price │ ├─────────────────┼────────┼────────┤ │ Westlake Hotel │ 杭州 │ 850 │ │ Jinjiang Inn │ 杭州 │ 450 │ │ Nanjing Grand │ 南京 │ 620 │ └─────────────────┴────────┴────────┘ - attractions 表: [西湖, 灵隐寺, 中山陵, 拙政园] - restaurants 表: 含评分、价格等字段 Step 2: Agent 合成 tools（代码生成）

合成这类任务所需的 tools。由于 Agent 非常清楚前一步的存储方式（例如，SQLite 表结构），因此生成 tools 非常简单，可能就是一些查表的 SQL wrappers：

def get_all_hotels_by_city(city: str) -> List[Dict]: """查询指定城市的所有酒店""" return db.query("SELECT * FROM hotels WHERE city = ?", city) def get_infos_by_hotel(info_keywords: List[str], hotel: str) -> Dict: """获取酒店的详细信息（设施、政策等）""" return {...} # 从数据库或缓存中检索 def get_city_by_attraction(attraction: str) -> str: """查询景点所在城市""" return db.query_single("SELECT city FROM attractions WHERE name = ?", attraction) def get_inter_city_transport(from_city: str, to_city: str) -> List[Dict]: """查询城市间交通""" return [...] # 调用外部 API 或查询本地数据 def submit_result(answer_text: str) -> bool: """提交最终答案""" return True Step 3: 合成一个具体旅行规划任务

任务的生成从易到难，既有挑战又要能自动验证，先从最简单的开始。

Agent 会为这个任务生成两个 python 函数：

solution function：仅能调用 tool functions 或执行逻辑计算，不能调用其他 functions 或直接访问 database，从而确保该 task 只能通过 tool interface 来解决。
verification function：对 solution function 的运行结果进行验证。

示例：

task_description = "从杭州选择一家价格低于500元的酒店" def solve_task_1() -> str: hotels = get_all_hotels_by_city("杭州") affordable = [h for h in hotels if h["price"] < 500] return affordable[0]["hotel_name"] if affordable else "无" def verify_task_1(answer: str) -> bool: # 检查答案是否存在于数据库且满足约束 if answer == "无": return True hotel = db.query("SELECT * FROM hotels WHERE hotel_name = ?", answer) return hotel["city"] == "杭州" and hotel["price"] < 500 Step 4：执行 solution function，（基于 tool calling）生成一个线路规划

执行上面的 solve_task_1()，得到一个路线规划结果。转 step 5。

Step 5：执行 verification function，对上一步生成的线路规划进行验证

执行上面的 verify_task_1()，对上一步得到的路线进行验证。转 step 6。

Step 6: 如果验证成功，将这条数据输出

将这条数据以 <environment, tools, task, verifier> 的格式输出，这就是 DeepSeek-V3.2 下一阶段的一条训练样本；转 step 7。

Step 7: 返回到 step 3，继续合成下一个更难的任务

难度迭代升级：Agent 会逐步增加约束条件，直到任务具有挑战性但可验证。举例：

迭代版本新增约束任务描述 v1 无选择一家酒店 v2 + 不重复选择3家不同城市的酒店，不重复 v3 + 预算第二天酒店若≥800元，则餐厅+景点总预算 < 350元 v4 + 逻辑链完整的三天行程，含跨城交通，所有地点需满足城市归属验证 Step 8: 如果 step 5 验证失败，也返回到 step 3

尝试修改 solution function 或 verification function，然后继续 step 4；如果是因为 tool 不够导致的失败，进入 step 9；

Step 9: 将错误返回给 Agent，让 Agent 尝试扩充 toolset 3.3 官方 Trip Planning sample

官方文章中给的 Trip Planning 数据 sample 和输出格式、toolset：

结构化的输出：

[ { "time": "2025-10-01", "city": "cite_name", "hotel": "hotel_name", "afternoon_restaurant": "restaurant_name", "afternoon_attraction": "attraction_name", "evening_restaurant": "restaurant_name" }, { "time": "2025-10-02", "city": "cite_name", "hotel": "hotel_name", "afternoon_restaurant": "restaurant_name", "afternoon_attraction": "attraction_name", "evening_restaurant": "restaurant_name" }, { "time": "2025-10-03", "city": "cite_name", "hotel": "hotel_name", "afternoon_restaurant": "restaurant_name", "afternoon_attraction": "attraction_name", "evening_restaurant": "restaurant_name" } ]

包含的字段：

日期
城市
酒店名称
午餐的餐厅名字
下午游玩的景点的名字
晚餐的餐厅名字

4 Kimi 老师补充的一些细节，帮助理解

向 kimi 老师问了几个问题，补充一些可能的细节，帮助更好地理解这个过程。这一节可能存在误导，仅供"仅供参考"。

4.1 生成的 Task 示例 # --- Task 4.0 (最终版本) --- task_prompt = """ I'm planning a three-day trip starting from Hangzhou... [完整论文描述] Requirements: 1. 不重复任何城市、酒店、景点、餐厅 2. 所有推荐地点必须位于当天住宿城市 3. 第二天预算规则： - 豪华酒店(≥800CNY): 餐厅总消费<350CNY且评分≥4.0，景点门票<120CNY - 中高档酒店(500-800CNY): 至少一家餐厅评分≥4.0，景点门票<180CNY - 经济酒店(200-500CNY): 至少一家餐厅评分≥3.2 """ # 解决方案函数（Agent 生成） def solve_trip_planning() -> List[Dict]: # 1. 搜索所有可能的城市组合 cities = ["杭州", "苏州", "上海"] # 2. 为每天选择符合约束的酒店 for day2_hotel in get_all_hotels_by_city("苏州"): if not validate_budget_rules(day2_hotel): continue # 3. 验证地点不重复 used_places = {day2_hotel["hotel_name"]} # 4. 选择景点和餐厅... # 完整实现会涉及组合搜索和回溯 plan = generate_valid_itinerary(cities, used_places) if plan: return plan return [] # 验证函数（Agent 生成） def verify_trip_planning(answer: List[Dict]) -> bool: # 约束1: 无重复 all_hotels = [d["hotel"] for d in answer] if len(all_hotels) != len(set(all_hotels)): return False # 约束2: 城市归属验证 for day in answer: if get_city_by_hotel(day["hotel"]) != day["city"]: return False if get_city_by_restaurant(day["afternoon_restaurant"]) != day["city"]: return False # 约束3: 预算规则验证 day2 = answer[1] hotel_price = get_infos_by_hotel(["price"], day2["hotel"])["price"] restaurant_cost = sum(get_infos_by_restaurant(["price"], r)["price"] for r in [day2["afternoon_restaurant"], day2["evening_restaurant"]]) if hotel_price >= 800 and restaurant_cost >= 350: return False return True 4.2 输出样本要求关键点

可验证性：所有任务都带有自动验证函数，支持 RL 训练中的奖励信号计算
难度可控：通过迭代增加约束，确保任务对当前模型有挑战性（论文表5显示 DeepSeek-V3.2-Exp 在合成任务上仅 12% 准确率）
通用性：Solution 函数必须仅通过工具接口访问数据，不能直接查询数据库，确保 RL 策略可迁移到真实环境
规模：最终生成了 1,827 个环境 + 4,417 个任务，覆盖旅行规划、代码工程、数学推理等多领域

该 workflow 的核心创新在于将任务生成作为元学习问题，让模型自动创造高质量、可验证的训练样本，解决了大规模 RL 训练中数据稀缺的瓶颈。

成功样本会被筛选并持久化存储，作为后续 RL 训练的离线数据集。

“We then perform RL on this dataset using DeepSeek-V3.2 and retain only instances with non-zero pass@100, resulting in 1,827 environments and their corresponding tasks (4,417 in total).”

样本筛选标准

Pass@100 > 0：在 100 次随机尝试中至少能成功一次的任务才保留
确保任务可学习且非平凡：避免过于简单或不可能完成的任务

样本保存格式

样本以 四元组 结构存储：

{ "environment": { /* 数据库配置 */ }, "tools": { /* 工具函数定义 */ }, "task": { /* 任务描述 */ }, "verifier": { /* 验证逻辑 */ } } 输出样本示例（Trip Planning 任务）

以下是一个持久化样本：

{ "environment": { "description": "旅行规划数据库，包含长三角城市信息", "schema": { "cities": ["杭州", "苏州", "上海"], "hotels": [ {"name": "Westlake Hotel", "city": "杭州", "price": 850, "rating": 4.8}, {"name": "Jinjiang Inn", "city": "杭州", "price": 450, "rating": 4.0}, {"name": "Suzhou Garden Hotel", "city": "苏州", "price": 720, "rating": 4.5}, {"name": "Shanghai Grand", "city": "上海", "price": 680, "rating": 4.3} ], "restaurants": [ {"name": "知味观", "city": "杭州", "price": 180, "rating": 4.2}, {"name": "松鹤楼", "city": "苏州", "price": 220, "rating": 4.5}, {"name": "南翔馒头店", "city": "上海", "price": 120, "rating": 3.8} ], "attractions": [ {"name": "西湖", "city": "杭州", "ticket": 0}, {"name": "拙政园", "city": "苏州", "ticket": 90}, {"name": "外滩", "city": "上海", "ticket": 0} ] } }, "tools": { "get_all_hotels_by_city": { "code": "def get_all_hotels_by_city(city):\n return [h for h in db['hotels'] if h['city'] == city]", "signature": "(city: str) -> List[Dict]" }, "get_city_by_hotel": { "code": "def get_city_by_hotel(hotel_name):\n hotel = next((h for h in db['hotels'] if h['name'] == hotel_name), None)\n return hotel['city'] if hotel else None", "signature": "(hotel_name: str) -> str" }, "get_all_restaurants_by_city": { "code": "def get_all_restaurants_by_city(city):\n return [r for r in db['restaurants'] if r['city'] == city]", "signature": "(city: str) -> List[Dict]" }, "get_city_by_restaurant": { "code": "def get_city_by_restaurant(restaurant_name):\n rest = next((r for r in db['restaurants'] if r['name'] == restaurant_name), None)\n return rest['city'] if rest else None", "signature": "(restaurant_name: str) -> str" }, "get_all_attractions_by_city": { "code": "def get_all_attractions_by_city(city):\n return [a for a in db['attractions'] if a['city'] == city]", "signature": "(city: str) -> List[Dict]" }, "submit_result": { "code": "def submit_result(answer_text):\n return {'status': 'submitted', 'answer': answer_text}", "signature": "(answer_text: str) -> Dict" } }, "task": { "id": "trip_planning_001", "difficulty_level": 3, "prompt": "I'm planning a three-day trip starting from Hangzhou... [完整要求，同论文] ... Can you help me put together this itinerary?", "expected_output_format": "[{\"time\":\"2025-10-01\",\"city\":\"...\",\"hotel\":\"...\",...}, {...}, {...}]", "max_tool_calls": 20 }, "verifier": { "code": "def verify_answer(answer):\n import json\n try:\n plan = json.loads(answer)\n # 约束1: 无重复\n hotels = [d['hotel'] for d in plan]\n if len(set(hotels)) != len(hotels): return False\n \n # 约束2: 城市归属验证\n for day in plan:\n if get_city_by_hotel(day['hotel']) != day['city']: return False\n if get_city_by_restaurant(day['afternoon_restaurant']) != day['city']: return False\n if get_city_by_restaurant(day['evening_restaurant']) != day['city']: return False\n if get_city_by_attraction(day['afternoon_attraction']) != day['city']: return False\n \n # 约束3: 第二天预算规则验证\n day2 = plan[1]\n hotel_price = next(h['price'] for h in db['hotels'] if h['name'] == day2['hotel'])\n restaurant_names = [day2['afternoon_restaurant'], day2['evening_restaurant']]\n restaurant_cost = sum(next(r['price'] for r in db['restaurants'] if r['name'] == rn) for rn in restaurant_names)\n \n if hotel_price >= 800 and restaurant_cost >= 350:\n return False\n \n return True\n except Exception as e:\n return False", "expected_reward": 1.0 } } 5 DeepSeek papers

2025.12, DeepSeek-V3.2 tech report
2025.09, DeepSeek-V3.2-Exp tech report
2025.08, DeepSeek-V3.1，no tech report
2024, DeepSeek-R1：通过强化学习激励大模型的推理能力
2024, DeepSeek-V3 tech report

以旅行规划（Trip Planning）为例，看 DeepSeek-V3.2 如何合成高质量训练数据（2025）

ARTHURCHIAO'S BLOG

2 weeks 6 days ago

如何基于 Agent/LLM 强大的规划能力+生成能力+代码执行能力+反思能力，自动化合成大批量高质量数据：

Hypothetical workflow

DeepSeek-V3.2: workflow for synthesizing high-quality agentic datasets for RL training (in agentic fashion, without human intervention)

水平及维护精力所限，文中不免存在错误或过时之处，请酌情参考。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

1 场景：增强模型的 Trip Planning 能力
- 1.1 方案拆解
- 1.2 子任务：准备高质量的 Trip Planning 数据
2 方案：自动合成高质量 Trip Planning 数据
3 图解：DeepSeek-V3.2 是怎么做的（”Large-Scale Agentic Tasks”）
4 Kimi 老师补充的一些细节，帮助理解
- 4.1 生成的 Task 示例
- 4.2 输出样本要求
5 DeepSeek papers

1 场景：增强模型的 Trip Planning 能力

假设你在训练一个通用模型或垂域的旅游行业模型，那你可能会遇到下面这样的用户诉求：

要回答好这类问题，就需要对模型的行程规划（Itinerary）或称 旅游规划（Trip Planning）能力进行专门训练。

具体该怎么做呢？我们来尝试设计一个方案。

1.1 方案拆解

从非常高的 level 来说，要完成以上训练任务只需要做两件事情：

数据集准备：准备一批高质量的 Trip Planning 数据
后训练：基于高质量训练数据，对模型进行微调（SFT）或强化学习（RL）

本文接下来只关注第一个任务，高质量数据集的准备。

1.2 子任务：准备高质量的 Trip Planning 数据

再次从 high level 来说，这样的高质量数据集有两种来源：

人工标注：例如，找专业的旅行定制师或资深的旅行家，人工编写高质量的语料；
自动合成：通过某种不依赖人工的方式自动合成。

2 方案：自动合成高质量 Trip Planning 数据 2.1 思考：人（专家）怎么完成这个任务

定制师或旅行家基于自己丰富的业务知识（城市、交通、景点、酒店、预算、偏好等等），初步判断下杭州出发三天能玩的目的地范围，得到一些备选目的地；
针对这些备选目的地，以杭州为出发地，通过手动搜索或数据库查询，进一步充实交通、住宿、餐饮、景点、预算等需求，得到一些备选线路；
针对这些备选线路，再进一步验证里面的每个具体步骤是否满足用户的要求，以及整体方案是否满足用户的要求；如果满足就留下这个线路；如果不满足（例如某一天的预算超了）就进行相应的调整直到满足，或者多次失败之后直接弃用这个备选路线；
如果用户觉得上一步验证通过的线路还是不够有吸引力，则回到 step 1 or step 2 并顺序执行到 step 3，针对用户需求重新设计一些更有吸引力的线路。

经过以上步骤，最终得到的就是一些符合用户要求的高质量线路规划。

2.2 自动化：人工方案的 workflow 化

把以上的人工生产线路过程变成一个 workflow，就得到了一个基于 Agent 的自动化方案：

首先，我们得从某些地方获取一些 Trip Planning 相关的基础旅游数据，例如城市、交通、酒店、景点、价格等等信息，把它们存储起来备用；
接下来，得有一些工具来从这些数据中筛选出我们想要的信息，例如查询两个城市之间的交通方案、查询给定城市内的餐厅和景点等；
有了前两步的基础，剩下的就是生成一个具体的旅行规划任务，例如，“规划从上海到北京的三日游”，让 Agent 基于上一步提供的各种工具，帮我们将这个旅行规划方案设计出来。这个过程可以进一步拆解为两个子任务：
1. 生成：生成具体的旅行规划；
2. 验证：验证生成的旅行规划是否符合用户的要求。

基于以上流程，无需人工参与，就能自动完成一个行程规划任务，

如果验证 OK，就将这个结果输出；然后继续生成下一个（更难的）旅行规划任务；
如果失败，就要看问题是出在哪里，例如可能是工具不够、生成的方案不对、方案对但验证过程有问题等，尝试调整这几个环节，直到方案成功。

2.3 这个 workflow 的独特之处

Hypothetical workflow

图的上半部分可以叫“生成环境”，这是常规 LLM 擅长做的；
图的下半部分是“执行环境”，把上一步生成的代码真正拿来运行，再根据运行结果给 Agent 一个反馈，进入 Agent 的反思和下一次迭代流程。
整个方案的输入只有一段提示词（如果不算执行环境），其他都是 Agent+Workflow 创建和管理的。

2.4 小结

3 图解：DeepSeek-V3.2 是怎么做的（”Large-Scale Agentic Tasks”）

3.1 方案描述

原文：

Given a task category (e.g., planning a travel itinerary) and a sandbox equipped with a bash and a search tool, the agent first uses these tools to generate or retrieve relevant data from the Internet and store them in the sandbox database.
The agent then synthesizes a set of task-specific tools, each implemented as a function.
To create tasks that are both challenging and automatically verifiable, the agent initially proposes a simple task based on the current database, along with its solution and verification functions implemented in Python. The solution function is restricted to invoking tool functions or performing logical computations, and cannot call other functions or directly access the database, ensuring the task can only be solved through the tool interface. Additionally, the results produced by the solution function must be validated by the verification function. If the solution is not validated, the agent will modify the solution or verification functions until the solution’s output passes the verification. The agent then iteratively increases the difficulty of the task and updates the corresponding solution and verification functions. During this iterative process, if the current toolset is not sufficient to solve the task, the agent will augment the toolset.

Trip Planning 是其中的任务类型之一。

3.2 方案图解

具体过程如下图所示（根据个人理解画的，仅供参考，因为很多细节原文没提）：

核心是一个 Agent，接下来按序号介绍下各步骤。

Step 0: Agent 输入

给 Agent 输入任务类型（e.g. “Trip Planning”）和可用的 sandbox 信息；

任务类型有很多种，旅行规划只是其中之一；
sandbox 可以理解成一个 linux container，例如 Ubuntu，配置了 bash 和 search tool；

Step 1: Agent 构建旅行数据库

Agent 开始干活，首先进入 sandbox，然后用 internet search tool 从互联网搜索相关数据，并保存到 local database；

输入：任务类别（如 “trip planning”）+ 配备 bash 和 search 工具的 sandbox 环境
过程：Agent 使用搜索工具从互联网爬取或生成结构化数据，包括交通、酒店、景点、门票、餐厅等等，存储到 sandbox 的数据库中
输出：结构化数据表
local database 可以想象成一个 SQLite 数据库

效果示意：

合成这类任务所需的 tools。由于 Agent 非常清楚前一步的存储方式（例如，SQLite 表结构），因此生成 tools 非常简单，可能就是一些查表的 SQL wrappers：

任务的生成从易到难，既有挑战又要能自动验证，先从最简单的开始。

Agent 会为这个任务生成两个 python 函数：

solution function：仅能调用 tool functions 或执行逻辑计算，不能调用其他 functions 或直接访问 database，从而确保该 task 只能通过 tool interface 来解决。
verification function：对 solution function 的运行结果进行验证。

示例：

执行上面的 solve_task_1()，得到一个路线规划结果。转 step 5。

Step 5：执行 verification function，对上一步生成的线路规划进行验证

执行上面的 verify_task_1()，对上一步得到的路线进行验证。转 step 6。

Step 6: 如果验证成功，将这条数据输出

将这条数据以 <environment, tools, task, verifier> 的格式输出，这就是 DeepSeek-V3.2 下一阶段的一条训练样本；转 step 7。

Step 7: 返回到 step 3，继续合成下一个更难的任务

难度迭代升级：Agent 会逐步增加约束条件，直到任务具有挑战性但可验证。举例：

尝试修改 solution function 或 verification function，然后继续 step 4；如果是因为 tool 不够导致的失败，进入 step 9；

Step 9: 将错误返回给 Agent，让 Agent 尝试扩充 toolset 3.3 官方 Trip Planning sample

官方文章中给的 Trip Planning 数据 sample 和输出格式、toolset：

结构化的输出：

包含的字段：

日期
城市
酒店名称
午餐的餐厅名字
下午游玩的景点的名字
晚餐的餐厅名字

4 Kimi 老师补充的一些细节，帮助理解

向 kimi 老师问了几个问题，补充一些可能的细节，帮助更好地理解这个过程。这一节可能存在误导，仅供"仅供参考"。

可验证性：所有任务都带有自动验证函数，支持 RL 训练中的奖励信号计算
难度可控：通过迭代增加约束，确保任务对当前模型有挑战性（论文表5显示 DeepSeek-V3.2-Exp 在合成任务上仅 12% 准确率）
通用性：Solution 函数必须仅通过工具接口访问数据，不能直接查询数据库，确保 RL 策略可迁移到真实环境
规模：最终生成了 1,827 个环境 + 4,417 个任务，覆盖旅行规划、代码工程、数学推理等多领域

该 workflow 的核心创新在于将任务生成作为元学习问题，让模型自动创造高质量、可验证的训练样本，解决了大规模 RL 训练中数据稀缺的瓶颈。

成功样本会被筛选并持久化存储，作为后续 RL 训练的离线数据集。

“We then perform RL on this dataset using DeepSeek-V3.2 and retain only instances with non-zero pass@100, resulting in 1,827 environments and their corresponding tasks (4,417 in total).”

样本筛选标准

Pass@100 > 0：在 100 次随机尝试中至少能成功一次的任务才保留
确保任务可学习且非平凡：避免过于简单或不可能完成的任务

样本保存格式

样本以 四元组 结构存储：

{ "environment": { /* 数据库配置 */ }, "tools": { /* 工具函数定义 */ }, "task": { /* 任务描述 */ }, "verifier": { /* 验证逻辑 */ } } 输出样本示例（Trip Planning 任务）

以下是一个持久化样本：

2025.12, DeepSeek-V3.2 tech report
2025.09, DeepSeek-V3.2-Exp tech report
2025.08, DeepSeek-V3.1，no tech report
2024, DeepSeek-R1：通过强化学习激励大模型的推理能力
2024, DeepSeek-V3 tech report

[笔记] 从 Tokenization 视角看生成式推荐（GR）近几年的发展（2025）

ARTHURCHIAO'S BLOG

1 month 1 week ago

不同类型的真实世界建模元素对应的模型类型感知世界（Perceptual World）视觉（Vision）扩散模型（Diffusion Models, DMs）认知世界（Cognitive World）语言（Language）大语言模型（LLMs）行为世界（Behavioral World）交互（Interaction）用户行为的模型？

从模型和现实世界的对应关系来看，感知世界（Perceptual World）和 认知世界（Cognitive World）都已经有了对应的大模型类型，分别基于视觉（Vision）和语言（Language）建模，并且基本都是基于生成式架构，实际效果非常好。

推荐领域属于行为世界（Behavioral World），这个场景基于交互（Interaction）建模，目前还没有跟前两个领域一样成功的模型。一个思路是：如果大量场景已经充分证明了生成式是一把非常好的锤子，那我们是不是能把还没有很好解决的问题变成钉子？—— 具体到推荐场景，就是通过一些工程和算法手段，把推荐任务变成一个生成任务，从而套到生成式框架里。这就是生成式推荐模型（generative recommendation models）背后的思想。

最近有一篇很详尽的关于这个领域近几年发展的综述： Towards Large Generative Recommendation: A Tokenization Perspective。本文整理一些阅读笔记和思考。

水平及维护精力所限，文中不免存在错误或过时之处，请酌情参考。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

1 背景
2 方向一：基于语言模型+文本描述的生成式推荐（LLM-based GR）
3 Semantic ID 简介
4 方向二：基于 SemanticID 的生成式推荐
5 总结
- 5.1 生成式推荐仍然面临的挑战
- 5.2 生成式推荐带来的新机会

大型生成式模型（large generative models）的出现正在深刻改变推荐系统领域。构建此类模型的基础组件之一是 action tokenization，即将人类可读数据（例如用户-商品交互数据）转换为机器可读格式（例如离散 token 序列），这个过程在进入模型之前。

本文介绍几种 action tokenization 技术（将用户行为分别转换为物品 ID、文本描述、语义 ID），然后从 action tokenization 的视角探讨生成式推荐领域面临的挑战、开放性问题及未来潜在发展方向，为下一代推荐系统的设计提供启发。

1 背景 1.1 什么是生成式模型（Generative Models）？

生成式模型从大量给定样本中学习到底层的数据分布（underlying distribution of data），然后就能生成新的样本（generate new samples）。如下图所示，在学习了大量动物图文之后，模型就能根据给定指令生成动物照片（“奔跑的猫/狗/马”），

1.2 什么是规模定律（Scaling laws）？

Scaling laws 提供了一个框架，通过这框架可以理解 model size, data volume, test-time computing 如何影响 AI 能力的进化。语言建模领域已经验证了这一框架的有效性。

Scaling Law as a Pathway towards AGI. Understanding Scaling Laws for Recommendation Models. Arxiv 2022

1.3 模型作为真实世界的映像

三种类型的真实世界：

做个表格对比，

基于 Vision 和 Language 的模型都有了，并且生成式占据主导地位，也见证了 scaling law，表现非常好；
基于 Interaction 的模型还在探索中，是不是也可以套用生成式？也就是构建大型生成式推荐模型（large generative recommendation models）。

1.4 为什么要做“生成式”推荐？

总结起来有两点，

更好地 scaling 行为；
与其他模态 (text, image, audio, …) 的对齐更好；

1.4.1 建模：语言建模 vs. 推荐建模

语言建模：根据给定的文本，预测接下来的文本；
推荐建模：根据用户的历史行为（购买商品、点击链接、浏览笔记等等），预测用户接下来的行为（购买、点击等等）；

这里的 Item 是推荐系统推荐的东西，可以是一个商品，也可以是一个笔记、视频等等。

1.4.2 现状：推荐领域的知识非常稀疏建模类型知识密度 Token 类型 Token 空间语言模型稠密的世界知识（Dense world knowledge）文本 token 10^5 推荐模型 稀疏的“用户-物品”交互数据（Sparse user-item interactions） Item token 10^9

可以看到，相比于语言建模，推荐领域的知识非常稀疏，因而 scaling laws 在传统推荐模型上几乎没什么效果。

1.4.3 为什么要 token 化 (“Tokenization”)？

Token 化是为了方便计算机处理。具体来说，就是将 human-readable data (Text, Image, Action, …) 转换成 machine-readble formats (Sequence of Tokens)。

语言模型的 tokenize 和 de-tokenize 过程如下，更多信息可参考如何训练一个企业级 GPT 助手（OpenAI，2023）。

推荐模型的 tokenization 我们后面介绍。

1.5 生成式推荐模型 tokenization 方案举例

几种生成式推荐模型的 tokenization 方案（有点早期了）：

SASRec [ICDM’18], Kang and McAuley. Self-Attentive Sequential Recommendation. ICDM 2018

Each item is indexed by a unique item ID, corresponding to a learnable embedding
UniSRec [KDD’22], Hou et al. Towards Universal Sequence Representation Learning for Recommender Systems. KDD 2022
- Each item is indexed by a unique item ID, corresponding to a fixed representation
- 中国人民大学 & 阿里
LLaRA [SIGIR’24], Liao et al. LLaRA: Large Language-Recommendation Assistant. SIGIR 2024
- Align item representations with text tokens in LLMs

1.6 生成式推荐模型 tokenization 面临的问题 1.6.1 问题：Token 空间太大，行为数据太稀疏

和语言模型做个对比，典型模型的 token 数量（vocabulary size）：

https://amazon-reviews-2023.github.io/

典型的大语言模型只有 128K~256K tokens；
典型的推荐领域，例如 amazon-reviews-2023，有 48.2M items，如果一个 item 用一个 token 表示，那就是 48.2M tokens； Token 太多导致数据太稀疏，很难有效训练一个大型生成式模型。

1.6.2 思路：将行为数据 tokenize 为数据分布

是否可以将人类可读的行为数据通过 tokenization 变成一种数据分布（跟语言建模类似），然后训练一个生成式模型来拟合这个分布？

1.6.3 方向：LLM-based GenRec vs. SID-based GenRec

如上图所示，在实际实现上有两个方向：

Tokenize 为文本：LLM-based Generative Rec（基于大语言模型+文本描述的生成式推荐）；
Tokenize 为 Semantic IDs：SemID-based Generative Rec（基于语义 ID 的生成式推荐）。

2 方向一：基于语言模型+文本描述的生成式推荐（LLM-based GR） 2.1 Tokenization 过程

这类方案的 Tokenization 过程：

输入（人类可读数据）：用户行为数据；
输出（方便计算机处理的数据）：这些行为数据对应的纯文本描述；

例如在下图的商品推荐场景，输入是用户购买过的四个商品，token 化之后就是四段分别描述这四个商品的纯文本：

一句话总结优缺点：

优点：基于文本的推荐本身就是 LLM 的工作机制，底层数据分布与 LLM 是对齐的；
缺点：低效（inefficient）。

下面详细看一下这类方案的特点。

2.2 基于语言模型的生成式推荐的特点

2.2.1 丰富的世界知识

大语言本身有丰富的世界知识，例如下图的文本中只是出现了一个单词（token） Titanic，它就已经知道这指代的是一部著名电影了 —— 这部电影的知识都已经内化在模型里了。

Liao et al. LLaRA: Large Language-Recommendation Assistant. SIGIR 2024.

因此，在基于语言模型+文本描述的生成式推荐中，只需少量数据就能得到一个不错的推荐效果， Few data -> a good recommender

2.2.2 强大的自然语言理解和生成

传统推荐系统主要是利用用户的历史购买记录和用户行为来预测接下来的购买行为：

LLM-based 生成式推荐，则可以利用 LLM 强大的自然语言理解和生成能力，通过对话方式叠加购买记录/用户行为，给出推荐：

2.2.3 推理能力/执行复杂任务的能力

很好理解，大模型的强项。

2.2.4 如何评估推荐效果

如何验证效果？

离线评估：数据丰富，但不够准确；
在线评估：准确，但代价比较大。

一种评估方式：LLM as user simulator。

2.3 基础：LLM as Sequential Recommender

早期尝试：直接用通用的预训练模型做推荐：

Directly use freezed LLMs (e.g., GPT 4) for recommendation
效果明显不及传统推荐系统。

因此后续开始在通用预训练的大语言模型上，通过 Continue Pre-Train (CPT)、SFT、RL 等等，对齐到推荐任务和用户偏好。

2.3.1 将 LLM 对齐到推荐任务

这里介绍两个方案，P5 和 InstructRec。

P5 如下图所示，5 类推荐任务及对应的训练样本，

P5 Multi-task Cross-task generalization.

P5 paper：用语言模型做推荐：一种统一的预训练、个性化提示和预测范式

InstructRec 的训练样本：

InstructRec: Unify recommendation & search via instruction tuning.
Zhang et al. Recommendation as Instruction Following: A Large Language Model Empowered Recommendation Approach. TOIS

2.3.2 训练目标（SFT/Preference/RL） SFT

SFT 的训练目标是预测下一个 token。例如，给定输入：

I have watched Titanic, Roman Holiday, … Gone with the wind. Predict the next movie I will watch:

期望模型依次预测出 Waterloo 和 Bridge 这两个 token。

优化的目标：

Preference learning

通用语言模型：对齐到人类偏好；
推荐模型：对齐到用户偏好，实现方式一般训练一个奖励模型，然后基于奖励模型进行强化学习；

下面是一个例子，对给定的两个推荐结果做出评价（反馈/奖励），好还是坏，

Preference learning 典型方案：Chen et al. On Softmax Direct Preference Optimization for Recommendation. NeurIPS 2024

RL（强化学习）

这一步是通过强化学习激发出推理能力，典型方案：

Lin et al. Rec-R1: Bridging Generative Large Language Models and User-Centric Recommendation Systems via Reinforcement Learning. TMLR
Tan et al. Reinforced Preference Optimization for Recommendation. arXiv:2510.12211

2.3.3 推理算法

Beam Search
Constrained Beam Search
Improved Constrained Beam Search (D3)
Dense Retrieval Grounding (BIGRec)

Retrieve real items by generated text.
Bao et al. A Bi-Step Grounding Paradigm for Large Language Models in Recommendation Systems. TORS

2.3.4 小结

Early efforts: using LLMs in a zero-shot setting
Aligning LLMs for recommendation
Training objective: SFT, DPO, RL;
Inference: (constrained) beam search, retrieval;

2.4 应用一：LLM as Conversational Recommender 2.4.1 LLM 时代之前的对话式推荐

在非常有限的对话数据集上训练，针对具体任务的对话式推荐引擎，缺点：

缺少世界知识；
需要复杂的推荐策略；
缺少泛化能力。

2.4.2 基于 LLM 的对话式推荐

Recommendations with multiple turns conversation
Interactive; engaging users in the loop

Chen et al. All Roads Lead to Rome: Unveiling the Trajectory of Recommender Systems Across the LLM Era. arXiv.2407.10081

2.4.3 面临的挑战

数据集：Public datasets for CRS are limited, due to the scarcity of conversational products and real-world CRS datasets
评估方式：Traditional metrics like NDCG and BLEU are often insufficient to assess user experience
产品形态：ChatBot? Search bar? Independent App?

2.5 应用二：LLM as User Simulator

Zhang et al. On generative agents in recommendation. SIGIR 2024
Zhang et al. AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems. WWW 2024
Wang et al. When Large Language Model based Agent Meets User Behavior Analysis: A Novel User Simulation Paradigm. TOIS 2025.
Zhang et al. LLM-Powered User Simulator for Recommender System. AAAI 2025.

2.6 小结

Tokenize actions by text
- Pros: distribution naturally aligned with LLMs
- Cons: inefficient
From zero-shot to instruction tuning
- Training objectives: SFT, DPO, RL, …
- Inference: constrained beam search, retrieval
Applications Conversational RS, User Simulator

基于语言模型+文本描述的生成式推荐，效率低，效果也比较有效，因此需要探索其他方式，其中比较有希望的一种是引入特殊的 token （Semantic IDs）来表征 Item。

3 Semantic ID 简介 3.1 语言模型的 Token 设计

再来回顾下语言模型的 tokenize/de-tokenize 过程：

这里需要注意，一般来说 token 和单词并不是一一对应的，有时候一个 token 只是一个完整单词的一部分，

问题：

3.1.1 为什么 token:word ≠ 1:1

也就是说，为什么不设计成一个单词一个 token？

这会导致 vocabulary size 非常大，例如每个动词都有好几种时态，每个名词一般单复数都不一样； vocabulary size 过大会导致模型不健壮；

3.1.2 为什么 token:char ≠ 1:1

也就是说，为什么不设计成一个字符一个 token？

这会导致每个句子的 token 太多（上下文窗口非常长）；建模困难。

3.2 推荐模型的 Token 设计

推荐模型的 tokenization 可以有几种不同的方式。

3.2.1 方案一：每个商品用一个 token 表示

如下图所示：

优点是简单直接，缺点是

没有商品语义信息；
商品类型非常多，导致 vocabulary 非常非常大，比语言模型的 vocabulary 大几个数量级；

因此实际上基本不可用。

3.2.2 方案二：每个商品用一段 text 表示

如下图所示，

其中的蓝色长文本分别是图中四个商品的文本描述：

短袖：Premium Men’s Short Sleeve Athletic Training T-Shirt Made of Lightweight Breathable Fabric, Ideal for Running, Gym Workouts, and Casual Sportswear in All Seasons;
长袜：High-Performance Breathable Cotton Crew Socks for Men with Arch Support, Cushioned Heel and Toe, and Moisture Control, Perfect for Sports, Walking, and Everyday Comfort;
短裤：Men’s Loose-Fit Basketball Shorts with Elastic Drawstring Waistband, Quick-Dry Mesh Fabric, and Printed Number 11 for Professional and Recreational Play;
篮球：Official Size 7 Composite Leather Basketball Designed for Indoor and Outdoor Use, Deep Channel Design for Enhanced Grip and Ball Control, Ideal for Training and Competitive Matches;

优点是有商品的语义信息；缺点是每个商品的 token（文本描述）过长，训练/推理非常低效，另外类似商品的区分度很低，也导致实际上基本不可用。

3.2.3 方案三：结合方案一和方案二的优点 -> SemanticID

有没有一种方案能结合前两种方案的优点呢？有，这就是我们接下来要重点介绍的 SemanticID。

用几个 token 联合索引一个商品

下图是一个例子，这里是用四个连续 token 索引一个商品，

每个 token 来自不同 vocabulary，表征商品的不同维度

还是上面那个例子，其中的四个 token 分别来自四个 vocabulary，每个 vocabulary 表征商品的不同维度。例如第二个 token 来自下图中所示的 vocabulary：

vocabulary size 和支持的商品总数

如果每个 vocabulary 256 tokens，那

用四个 token 索引一个商品时，大致能索引的商品量级为 256^4≈4.3×10^9，也就是 4.3 亿个商品；
总的 vocabulary 空间为 256x4=1024 tokens，也就是只需要引入 1024 个独立 token；

3.2.4 三种方式对应的 vocabulary 大小对比

下图是三种方式的对比（从左到右依次是方案一、三、二），

左边是方案一：每个商品一个 token 表示，因此是 4 个 token；
右边是方案二：每个商品一段 text 表示；
中间是方案三：每个商品 4 token 表示（SemanticID），因此总共 16 tokens；

对应的 vocabulary 大小：

3.3 典型 SemanticID 方案 3.3.1 TIGER, NeurIPS 2023

详见 paper：

Rajput et al. Recommender Systems with Generative Retrieval. NeurIPS 2023.

3.3.2 将推荐问题转化成 seq-to-seq 生成问题

将 recommendation 转化成 seq-to-seq 生成问题：

输入：用户交互的商品序列（user interacted items），用 SemanticID 序列表示；
输出：下一个商品，也是用 SemanticID 表示。

4 方向二：基于 SemanticID 的生成式推荐 4.1 Semantic ID 的构建 4.1.1 目标：输入 & 输出

输入：所有关于这个商品的信息，包括商品描述、标题、用户行为数据、特征 …；
输出：商品和它的 SemanticID 之间的映射关系（items <--> SemanticIDs）；

4.1.2 RQ-VAE-based SemIDs (TIGER as example)

其中一类是称为 RQ-VAE-based SemIDs。代表是 TIGER。

如下图所示，TIGER 用到了 ItemID/Title/Description/Categories/Brand 作为输入信息：

Rajput et al. Recommender Systems with Generative Retrieval. NeurIPS 2023.

构建步骤：

步骤一：商品内容信息（Text）

第一步是以规定的顺序组织商品内容信息，

Ni et al. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. Findings of ACL 2022. Rajput et al. Recommender Systems with Generative Retrieval. NeurIPS 2023

步骤二：商品内容信息向量化（Text -> Vector）

第二步是对内容信息进行编码，这里用了一个 Encoder，然后再做 Embedding，

Ni et al. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. Findings of ACL 2022. Rajput et al. Recommender Systems with Generative Retrieval. NeurIPS 2023

步骤三：残差量化（Vector -> IDs）

RQ-VAE Quantization 将向量变成 ID，图中的 7, 1, 4 就是 SemanticIDs，

Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec. TASLP 2022. Rajput et al. Recommender Systems with Generative Retrieval. NeurIPS 2023.

4.1.3 RQ-VAE-based SemIDs 的特性

Semantic
Ordered / sequential dependent
Collisions

4.1.4 RQ-VAE-based SemIDs 存在的问题

Enc-Dec Training Unstable
Unbalanced IDs

因此后面陆续有一些变种，

这里介绍下快手的 OneRec，

Deng et al. OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment. arXiv:2502.18965

4.1.5 小结

几种构建 SemIDs 的方式：

Residual Quantization (ordered)
Product Quantization (unordered)
Hierarchical Clustering
LM-based ID Generator

4.2 构建 SemID 时的输入

Input: all data associated with the item What exactly does “all data” mean?

4.2.1 商品元数据 (Text / Multimodal / Categorical / No Features)

Zhu et al. Beyond Unimodal Boundaries: Generative Recommendation with Multimodal Semantics. arXiv:2503.23333.

4.2.2 商品元数据 + 用户行为

Regularization / Fusion
Context-independent -> Context-aware

[笔记] 从 Tokenization 视角看生成式推荐（GR）近几年的发展（2025）

ARTHURCHIAO'S BLOG

1 month 1 week ago

最近有一篇很详尽的关于这个领域近几年发展的综述： Towards Large Generative Recommendation: A Tokenization Perspective。本文整理一些阅读笔记和思考。

水平及维护精力所限，文中不免存在错误或过时之处，请酌情参考。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

1 背景
2 方向一：基于语言模型+文本描述的生成式推荐（LLM-based GR）
3 Semantic ID 简介
4 方向二：基于 SemanticID 的生成式推荐
5 总结
- 5.1 生成式推荐仍然面临的挑战
- 5.2 生成式推荐带来的新机会

1 背景 1.1 什么是生成式模型（Generative Models）？

1.2 什么是规模定律（Scaling laws）？

Scaling Law as a Pathway towards AGI. Understanding Scaling Laws for Recommendation Models. Arxiv 2022

1.3 模型作为真实世界的映像

三种类型的真实世界：

做个表格对比，

基于 Vision 和 Language 的模型都有了，并且生成式占据主导地位，也见证了 scaling law，表现非常好；
基于 Interaction 的模型还在探索中，是不是也可以套用生成式？也就是构建大型生成式推荐模型（large generative recommendation models）。

1.4 为什么要做“生成式”推荐？

总结起来有两点，

更好地 scaling 行为；
与其他模态 (text, image, audio, …) 的对齐更好；

1.4.1 建模：语言建模 vs. 推荐建模

语言建模：根据给定的文本，预测接下来的文本；
推荐建模：根据用户的历史行为（购买商品、点击链接、浏览笔记等等），预测用户接下来的行为（购买、点击等等）；

这里的 Item 是推荐系统推荐的东西，可以是一个商品，也可以是一个笔记、视频等等。

可以看到，相比于语言建模，推荐领域的知识非常稀疏，因而 scaling laws 在传统推荐模型上几乎没什么效果。

1.4.3 为什么要 token 化 (“Tokenization”)？

Token 化是为了方便计算机处理。具体来说，就是将 human-readable data (Text, Image, Action, …) 转换成 machine-readble formats (Sequence of Tokens)。

语言模型的 tokenize 和 de-tokenize 过程如下，更多信息可参考如何训练一个企业级 GPT 助手（OpenAI，2023）。

推荐模型的 tokenization 我们后面介绍。

1.5 生成式推荐模型 tokenization 方案举例

几种生成式推荐模型的 tokenization 方案（有点早期了）：

SASRec [ICDM’18], Kang and McAuley. Self-Attentive Sequential Recommendation. ICDM 2018

Each item is indexed by a unique item ID, corresponding to a learnable embedding
UniSRec [KDD’22], Hou et al. Towards Universal Sequence Representation Learning for Recommender Systems. KDD 2022
- Each item is indexed by a unique item ID, corresponding to a fixed representation
- 中国人民大学 & 阿里
LLaRA [SIGIR’24], Liao et al. LLaRA: Large Language-Recommendation Assistant. SIGIR 2024
- Align item representations with text tokens in LLMs

1.6 生成式推荐模型 tokenization 面临的问题 1.6.1 问题：Token 空间太大，行为数据太稀疏

和语言模型做个对比，典型模型的 token 数量（vocabulary size）：

https://amazon-reviews-2023.github.io/

典型的大语言模型只有 128K~256K tokens；
典型的推荐领域，例如 amazon-reviews-2023，有 48.2M items，如果一个 item 用一个 token 表示，那就是 48.2M tokens； Token 太多导致数据太稀疏，很难有效训练一个大型生成式模型。

1.6.2 思路：将行为数据 tokenize 为数据分布

是否可以将人类可读的行为数据通过 tokenization 变成一种数据分布（跟语言建模类似），然后训练一个生成式模型来拟合这个分布？

1.6.3 方向：LLM-based GenRec vs. SID-based GenRec

如上图所示，在实际实现上有两个方向：

Tokenize 为文本：LLM-based Generative Rec（基于大语言模型+文本描述的生成式推荐）；
Tokenize 为 Semantic IDs：SemID-based Generative Rec（基于语义 ID 的生成式推荐）。

2 方向一：基于语言模型+文本描述的生成式推荐（LLM-based GR） 2.1 Tokenization 过程

这类方案的 Tokenization 过程：

输入（人类可读数据）：用户行为数据；
输出（方便计算机处理的数据）：这些行为数据对应的纯文本描述；

例如在下图的商品推荐场景，输入是用户购买过的四个商品，token 化之后就是四段分别描述这四个商品的纯文本：

一句话总结优缺点：

优点：基于文本的推荐本身就是 LLM 的工作机制，底层数据分布与 LLM 是对齐的；
缺点：低效（inefficient）。

下面详细看一下这类方案的特点。

2.2 基于语言模型的生成式推荐的特点

2.2.1 丰富的世界知识

Liao et al. LLaRA: Large Language-Recommendation Assistant. SIGIR 2024.

因此，在基于语言模型+文本描述的生成式推荐中，只需少量数据就能得到一个不错的推荐效果， Few data -> a good recommender

2.2.2 强大的自然语言理解和生成

传统推荐系统主要是利用用户的历史购买记录和用户行为来预测接下来的购买行为：

LLM-based 生成式推荐，则可以利用 LLM 强大的自然语言理解和生成能力，通过对话方式叠加购买记录/用户行为，给出推荐：

2.2.3 推理能力/执行复杂任务的能力

很好理解，大模型的强项。

2.2.4 如何评估推荐效果

如何验证效果？

离线评估：数据丰富，但不够准确；
在线评估：准确，但代价比较大。

一种评估方式：LLM as user simulator。

2.3 基础：LLM as Sequential Recommender

早期尝试：直接用通用的预训练模型做推荐：

Directly use freezed LLMs (e.g., GPT 4) for recommendation
效果明显不及传统推荐系统。

因此后续开始在通用预训练的大语言模型上，通过 Continue Pre-Train (CPT)、SFT、RL 等等，对齐到推荐任务和用户偏好。

2.3.1 将 LLM 对齐到推荐任务

这里介绍两个方案，P5 和 InstructRec。

P5 如下图所示，5 类推荐任务及对应的训练样本，

P5 Multi-task Cross-task generalization.

P5 paper：用语言模型做推荐：一种统一的预训练、个性化提示和预测范式

InstructRec 的训练样本：

InstructRec: Unify recommendation & search via instruction tuning.
Zhang et al. Recommendation as Instruction Following: A Large Language Model Empowered Recommendation Approach. TOIS

2.3.2 训练目标（SFT/Preference/RL） SFT

SFT 的训练目标是预测下一个 token。例如，给定输入：

I have watched Titanic, Roman Holiday, … Gone with the wind. Predict the next movie I will watch:

期望模型依次预测出 Waterloo 和 Bridge 这两个 token。

优化的目标：

Preference learning

通用语言模型：对齐到人类偏好；
推荐模型：对齐到用户偏好，实现方式一般训练一个奖励模型，然后基于奖励模型进行强化学习；

下面是一个例子，对给定的两个推荐结果做出评价（反馈/奖励），好还是坏，

Preference learning 典型方案：Chen et al. On Softmax Direct Preference Optimization for Recommendation. NeurIPS 2024

RL（强化学习）

这一步是通过强化学习激发出推理能力，典型方案：

Lin et al. Rec-R1: Bridging Generative Large Language Models and User-Centric Recommendation Systems via Reinforcement Learning. TMLR
Tan et al. Reinforced Preference Optimization for Recommendation. arXiv:2510.12211

2.3.3 推理算法

Beam Search
Constrained Beam Search
Improved Constrained Beam Search (D3)
Dense Retrieval Grounding (BIGRec)

Retrieve real items by generated text.
Bao et al. A Bi-Step Grounding Paradigm for Large Language Models in Recommendation Systems. TORS

2.3.4 小结

Early efforts: using LLMs in a zero-shot setting
Aligning LLMs for recommendation
Training objective: SFT, DPO, RL;
Inference: (constrained) beam search, retrieval;

2.4 应用一：LLM as Conversational Recommender 2.4.1 LLM 时代之前的对话式推荐

在非常有限的对话数据集上训练，针对具体任务的对话式推荐引擎，缺点：

缺少世界知识；
需要复杂的推荐策略；
缺少泛化能力。

2.4.2 基于 LLM 的对话式推荐

Recommendations with multiple turns conversation
Interactive; engaging users in the loop

Chen et al. All Roads Lead to Rome: Unveiling the Trajectory of Recommender Systems Across the LLM Era. arXiv.2407.10081

2.4.3 面临的挑战

数据集：Public datasets for CRS are limited, due to the scarcity of conversational products and real-world CRS datasets
评估方式：Traditional metrics like NDCG and BLEU are often insufficient to assess user experience
产品形态：ChatBot? Search bar? Independent App?

2.5 应用二：LLM as User Simulator

Zhang et al. On generative agents in recommendation. SIGIR 2024
Zhang et al. AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems. WWW 2024
Wang et al. When Large Language Model based Agent Meets User Behavior Analysis: A Novel User Simulation Paradigm. TOIS 2025.
Zhang et al. LLM-Powered User Simulator for Recommender System. AAAI 2025.

2.6 小结

Tokenize actions by text
- Pros: distribution naturally aligned with LLMs
- Cons: inefficient
From zero-shot to instruction tuning
- Training objectives: SFT, DPO, RL, …
- Inference: constrained beam search, retrieval
Applications Conversational RS, User Simulator

3 Semantic ID 简介 3.1 语言模型的 Token 设计

再来回顾下语言模型的 tokenize/de-tokenize 过程：

这里需要注意，一般来说 token 和单词并不是一一对应的，有时候一个 token 只是一个完整单词的一部分，

问题：

3.1.1 为什么 token:word ≠ 1:1

也就是说，为什么不设计成一个单词一个 token？

这会导致 vocabulary size 非常大，例如每个动词都有好几种时态，每个名词一般单复数都不一样； vocabulary size 过大会导致模型不健壮；

3.1.2 为什么 token:char ≠ 1:1

也就是说，为什么不设计成一个字符一个 token？

这会导致每个句子的 token 太多（上下文窗口非常长）；建模困难。

3.2 推荐模型的 Token 设计

推荐模型的 tokenization 可以有几种不同的方式。

3.2.1 方案一：每个商品用一个 token 表示

如下图所示：

优点是简单直接，缺点是

没有商品语义信息；
商品类型非常多，导致 vocabulary 非常非常大，比语言模型的 vocabulary 大几个数量级；

因此实际上基本不可用。

3.2.2 方案二：每个商品用一段 text 表示

如下图所示，

其中的蓝色长文本分别是图中四个商品的文本描述：

短袖：Premium Men’s Short Sleeve Athletic Training T-Shirt Made of Lightweight Breathable Fabric, Ideal for Running, Gym Workouts, and Casual Sportswear in All Seasons;
长袜：High-Performance Breathable Cotton Crew Socks for Men with Arch Support, Cushioned Heel and Toe, and Moisture Control, Perfect for Sports, Walking, and Everyday Comfort;
短裤：Men’s Loose-Fit Basketball Shorts with Elastic Drawstring Waistband, Quick-Dry Mesh Fabric, and Printed Number 11 for Professional and Recreational Play;
篮球：Official Size 7 Composite Leather Basketball Designed for Indoor and Outdoor Use, Deep Channel Design for Enhanced Grip and Ball Control, Ideal for Training and Competitive Matches;

优点是有商品的语义信息；缺点是每个商品的 token（文本描述）过长，训练/推理非常低效，另外类似商品的区分度很低，也导致实际上基本不可用。

3.2.3 方案三：结合方案一和方案二的优点 -> SemanticID

有没有一种方案能结合前两种方案的优点呢？有，这就是我们接下来要重点介绍的 SemanticID。

用几个 token 联合索引一个商品

下图是一个例子，这里是用四个连续 token 索引一个商品，

每个 token 来自不同 vocabulary，表征商品的不同维度

还是上面那个例子，其中的四个 token 分别来自四个 vocabulary，每个 vocabulary 表征商品的不同维度。例如第二个 token 来自下图中所示的 vocabulary：

vocabulary size 和支持的商品总数

如果每个 vocabulary 256 tokens，那

用四个 token 索引一个商品时，大致能索引的商品量级为 256^4≈4.3×10^9，也就是 4.3 亿个商品；
总的 vocabulary 空间为 256x4=1024 tokens，也就是只需要引入 1024 个独立 token；

3.2.4 三种方式对应的 vocabulary 大小对比

下图是三种方式的对比（从左到右依次是方案一、三、二），

左边是方案一：每个商品一个 token 表示，因此是 4 个 token；
右边是方案二：每个商品一段 text 表示；
中间是方案三：每个商品 4 token 表示（SemanticID），因此总共 16 tokens；

对应的 vocabulary 大小：

3.3 典型 SemanticID 方案 3.3.1 TIGER, NeurIPS 2023

详见 paper：

Rajput et al. Recommender Systems with Generative Retrieval. NeurIPS 2023.

3.3.2 将推荐问题转化成 seq-to-seq 生成问题

将 recommendation 转化成 seq-to-seq 生成问题：

输入：用户交互的商品序列（user interacted items），用 SemanticID 序列表示；
输出：下一个商品，也是用 SemanticID 表示。

4 方向二：基于 SemanticID 的生成式推荐 4.1 Semantic ID 的构建 4.1.1 目标：输入 & 输出

输入：所有关于这个商品的信息，包括商品描述、标题、用户行为数据、特征 …；
输出：商品和它的 SemanticID 之间的映射关系（items <--> SemanticIDs）；

4.1.2 RQ-VAE-based SemIDs (TIGER as example)

其中一类是称为 RQ-VAE-based SemIDs。代表是 TIGER。

如下图所示，TIGER 用到了 ItemID/Title/Description/Categories/Brand 作为输入信息：

Rajput et al. Recommender Systems with Generative Retrieval. NeurIPS 2023.

构建步骤：

步骤一：商品内容信息（Text）

第一步是以规定的顺序组织商品内容信息，

Ni et al. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. Findings of ACL 2022. Rajput et al. Recommender Systems with Generative Retrieval. NeurIPS 2023

步骤二：商品内容信息向量化（Text -> Vector）

第二步是对内容信息进行编码，这里用了一个 Encoder，然后再做 Embedding，

Ni et al. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. Findings of ACL 2022. Rajput et al. Recommender Systems with Generative Retrieval. NeurIPS 2023

步骤三：残差量化（Vector -> IDs）

RQ-VAE Quantization 将向量变成 ID，图中的 7, 1, 4 就是 SemanticIDs，

Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec. TASLP 2022. Rajput et al. Recommender Systems with Generative Retrieval. NeurIPS 2023.

4.1.3 RQ-VAE-based SemIDs 的特性

Semantic
Ordered / sequential dependent
Collisions

4.1.4 RQ-VAE-based SemIDs 存在的问题

Enc-Dec Training Unstable
Unbalanced IDs

因此后面陆续有一些变种，

这里介绍下快手的 OneRec，

Deng et al. OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment. arXiv:2502.18965

4.1.5 小结

几种构建 SemIDs 的方式：

Residual Quantization (ordered)
Product Quantization (unordered)
Hierarchical Clustering
LM-based ID Generator

4.2 构建 SemID 时的输入

Input: all data associated with the item What exactly does “all data” mean?

4.2.1 商品元数据 (Text / Multimodal / Categorical / No Features)

Zhu et al. Beyond Unimodal Boundaries: Generative Recommendation with Multimodal Semantics. arXiv:2503.23333.

4.2.2 商品元数据 + 用户行为

Regularization / Fusion
Context-independent -> Context-aware

An Illustrated Guide to AP2 (Agent Payment Protocol) (2025)

ARTHURCHIAO'S BLOG

1 month 2 weeks ago

With the rapid evolution of GenAI and the growing trend of accomplishing more and more tasks through chat, can you imagine a day (perhaps in the near future) we can buy almost anything simply by chatting? Instead of browsing e-commerce sites, comparing products yourself, you’ll just tell your agent what you need. It will handle everything: selecting options, comparing features, negotiating prices, making payments, and ensuring the product arrives at the right place and time.

To bring this vision to life, one essential piece is still missing: a payment protocol designed for agent-to-agent transactions. That’s exactly why AP2 was created.

This post offers an illustrative guide to this emerging topic.

Fig. Shopping agent view of the "Buy a coffee maker" AP2 demo.

Fig. Call flow of the AP2 demo. Note: for clarity, the "Shopping Agent" shown in this diagram combines the responsibilities of three distinct agents from the actual demo: the shopping agent, address collection agent, and payment method collection agent.

1 Why AP2?
- 1.1 An Era of Agentic Commerce
- 1.2 AP2: Payment Protocol for Agents
2 How AP2 Works
- 2.1 Core Concepts
  - 2.1.1 Mandate
  - 2.1.2 VC (Verifiable Credential)
- 2.2 Working Fashions (Scenarios)
  - 2.2.1 Real-time purchases (human present)
  - 2.2.2 Delegated tasks (human not present)
3 Demo: Buy A Coffee Maker Through Chat
References

1 Why AP2? 1.1 An Era of Agentic Commerce

The digital interaction fashion is likely to enter a new phase:

Now and the past: people interact directly with websites and applications. Such as, people browse websites or apps, select the products they like and add to cart, and finally click the “Buy” or “Pay” button;
The future: may shift toward an era of conversational and delegated task execution via agents; no manually browsing, just chat with your AI assistant.

This means agents will manage various daily tasks for users (humans), such as

routine purchases
complex product research
price negotiations, and more.

This new era of agentic commerce will bring new opportunities for both users and businesses:

For users: get a highly personalized, seamless shopping experience
For businesses: open up a new, intelligent channel for reaching customers

1.2 AP2: Payment Protocol for Agents

The above mentiond scenario raises new challenges for payments, and it is in this background, Google introduced the Agent Payments Protocol (AP2) in September, 2025: Powering AI commerce with the new Agent Payments Protocol (AP2).

Today, Google announced the Agent Payments Protocol (AP2), an open protocol developed with leading payments and technology companies to securely initiate and transact agent-led payments across platforms. The protocol can be used as an extension of the Agent2Agent (A2A) protocol and Model Context Protocol (MCP). In concert with industry rules and standards, it establishes a payment-agnostic framework for users, merchants, and payments providers to transact with confidence across all types of payment methods.

2 How AP2 Works

In a nutshell: establishing trust via Mandates and Verifiable Credentials (VCs).

2.1 Core Concepts 2.1.1 Mandate

Mandates are tamper-proof, cryptographically-signed digital contracts;
Mandates serve as verifiable proof of a user's instructions;
Mandates are signed by VC.

2.1.2 VC (Verifiable Credential)

VC is a special kind of data payload between agents.

2.2 Working Fashions (Scenarios) 2.2.1 Real-time purchases (human present)

Image source: [1]

User -> Agent: “Find me new white running shoes”
Agent: capture the request in an initial IntentMandate. This provides the auditable context for the entire interaction in a transaction process.
Agent -> Merchant Agents: find shoes with IntentMandate; get some candidates;
Agent -> User: present a cart with the shoes users would like;
User: select the item he/she likes;
Agent: sign a CartMandate. This is a critical step that creates a secure, unchangeable record of the exact items and price, ensuring what user see is what them pay for.
Agent -> Merchant Agent & Credential Provider Agent: complete payment with a PaymentMandate.

2.2.2 Delegated tasks (human not present)

Image source: [1]

User -> Agent: “Buy concert tickets the moment they go on sale”.
Agent: the user signed a detailed Intent Mandate upfront. This mandate specifies the rules of engagement—price limits, timing, and other conditions.
Agent -> Merchant Agent & Credential Provider Agent: automatically generate a Cart Mandate on behalf of user once the precise conditions are met.

3 Demo: Buy A Coffee Maker Through Chat

This is a demo from AP2 community, see github for the code and more details.

3.1 Components

The demo is a simple multi-agent system based on google ADK, this is what looks like when the demo finished:

It consists of the following components (agents):

Root Agent: for orchestrating all the entire demo
Shopping agent: chat-based agent that providing shopping services to User;
Shipping address collecting agent: utility agent for Root Agent;
Payment method collecting agent: utility agent for Root Agent;
Merchant agent: commerce agent that selling products;
Merchant payment processor agent: utility agent for Merchant agent that that handles payment stuffs for the latter;
Payment credential provider agent: providing AP2 auth between shopping agent and merchant agents;

3.2 Agent Card & System Prompt 3.2.1 Shopping Agent

System prompt to see how it works:

shopper = RetryingLlmAgent( name="shopper", instruction=""" You are an agent responsible for helping the user shop for products. %s When asked to complete a task, follow these instructions: 1. Find out what the user is interested in purchasing. 2. Ask clarifying questions one at a time to understand their needs fully. The shopping agent delegates responsibility for helping the user shop for products to this subagent. Help the user craft an IntentMandate that will be used to find relevant products for their purchase. Reason about the user's instructions and the information needed for the IntentMandate. The IntentMandate will be shown back to the user for confirmation so it's okay to make reasonable assumptions about the IntentMandate criteria initially. For example, inquire about: - A detailed description of the item. - Any preferred merchants or specific SKUs. - Whether the item needs to be refundable. 3. After you have gathered what you believe is sufficient information, use the 'create_intent_mandate' tool with the collected information (user's description, and any other details they provided). Do not include any user guidance on price in the intent mandate. Use user's preference for the price as a filter when recommending products for the user to select from. 4. Present the IntentMandate to the user in a clear, well-formatted summary. Start with the statement: "Please confirm the following details for your purchase. Note that this information will be shared with the merchant." And then has a row space and a breakdown of the details: Item Description: The natural_language_description. Never include any user guidance on price in the intent mandate. User Confirmation Required: A human-readable version of user_cart_confirmation_required (e.g., 'Yes', 'No'). Merchants: A comma-separated list of merchants, or 'Any' if not specified. SKUs: A comma-separated list of SKUs, or 'Any' if not specified. Refundable: 'Yes' or 'No'. Expires: Convert the intent_expiry timestamp into a human-readable relative time (e.g., "in 1 hour", "in 2 days"). After the breakdown, leave a blank line and end with: "Shall I proceed?" 5. Once the user confirms, use the 'find_products' tool. It will return a list of `CartMandate` objects. 6. For each CartMandate object in the list, create a visually distinct entry that includes the following details from the object: Item: Display the item_name clearly and in bold. Price: Present the total_price with the currency. Format the price with commas, and use the currency symbol (e.g., "$1,234.56"). Expires: Convert the cart_expiry into a human-readable format (e.g., "in 2 hours," "by tomorrow at 5 PM"). Refund Period: Convert the refund_period into a human-readable format (e.g., "30 days," "14 days"). Present these details to the user in a clear way. If there are more than one CartMandate object, present them as a numbered list. At the bottom, present Sold by: Show the merchant_name associate the first Transaction. Ensure the cart you think matches the user's intent the most is presented at the top of the list. Add a 2-3 line summary of why you recommended the first option to the user. 7. Ask the user which item they would like to purchase. 8. After they choose, call the update_chosen_cart_mandate tool with the appropriate cart ID. 9. Monitor the tool's output. If the cart ID is not found, you must inform the user and prompt them to try again. If the selection is successful, signal a successful update and hand off the process to the root_agent. """ % DEBUG_MODE_INSTRUCTIONS, tools=[ tools.create_intent_mandate, tools.find_products, tools.update_chosen_cart_mandate, ], ) 3.2.2 Merchant Agent

A2A agent card:

{ "name": "MerchantAgent", "description": "A sales assistant agent for a merchant.", "skills": [ { "description": "Searches the merchant's catalog based on a shopping intent & returns a cart containing the top results.", "id": "search_catalog", "name": "Search Catalog", "tags": [ "merchant", "search", "catalog" ] } ], "capabilities": { "extensions": [ { "description": "Supports the Agent Payments Protocol.", "required": true, "uri": "https://github.com/google-agentic-commerce/ap2/v1" }, { "description": "Supports the Sample Card Network payment method extension", "required": true, "uri": "https://sample-card-network.github.io/paymentmethod/types/v1" } ] }, "defaultInputModes": [ "json" ], "defaultOutputModes": [ "json" ], "preferredTransport": "JSONRPC", "protocolVersion": "0.3.0", "url": "http://localhost:8001/a2a/merchant_agent", "version": "1.0.0" } 3.2.3 Merchant Payment Agent

A2A agent card:

{ "name": "merchant_payment_processor_agent", "description": "An agent that processes card payments on behalf of a merchant.", "skills": [ { "description": "Processes card payments.", "id": "card-processor", "name": "Card Processor", "tags": [ "payment", "card" ] } ], "capabilities": { "extensions": [ { "description": "Supports the Agent Payments Protocol.", "required": true, "uri": "https://github.com/google-agentic-commerce/ap2/v1" }, { "description": "Supports the Sample Card Network payment method extension", "required": true, "uri": "https://sample-card-network.github.io/paymentmethod/types/v1" } ] }, "defaultInputModes": [ "text/plain" ], "defaultOutputModes": [ "application/json" ], "preferredTransport": "JSONRPC", "protocolVersion": "0.3.0", "url": "http://localhost:8003/a2a/merchant_payment_processor_agent", "version": "1.0.0" } 3.2.4 Payment Credential Provider Agent

A2A agent card:

{ "name": "CredentialsProvider", "description": "An agent that holds a user's payment credentials.", "skills": [ { "description": "Initiates a payment with the correct payment processor.", "id": "initiate_payment", "name": "Initiate Payment", "tags": [ "payments" ] }, { "description": "Provides a list of eligible payment methods for a particular purchase.", "id": "get_eligible_payment_methods", "name": "Get Eligible Payment Methods", "tags": [ "eligible", "payment", "methods" ] }, { "description": "Fetches the shipping address from a user's wallet.", "id": "get_account_shipping_address", "name": "Get Shipping Address", "tags": [ "account", "shipping" ] } ], "capabilities": { "extensions": [ { "description": "Supports the Agent Payments Protocol.", "required": true, "uri": "https://github.com/google-agentic-commerce/ap2/v1" }, { "description": "Supports the Sample Card Network payment method extension", "required": true, "uri": "https://sample-card-network.github.io/paymentmethod/types/v1" } ] }, "defaultInputModes": [ "text/plain" ], "defaultOutputModes": [ "application/json" ], "preferredTransport": "JSONRPC", "protocolVersion": "0.3.0", "url": "http://localhost:8002/a2a/credentials_provider", "version": "1.0.0" }

Account Manager (User Database):

"""An in-memory manager of a user's 'account details'. Each 'account' contains a user's payment methods and shipping address. For demonstration purposes, several accounts are pre-populated with sample data. """ _account_db = { "[email protected]": { "shipping_address": { "recipient": "Bugs Bunny", "organization": "Sample Organization", "address_line": ["123 Main St"], "city": "Sample City", "region": "ST", "postal_code": "00000", "country": "US", "phone_number": "+1-000-000-0000", }, "payment_methods": { "card1": { "type": "CARD", "alias": "American Express ending in 4444", "network": [{"name": "amex", "formats": ["DPAN"]}], "cryptogram": "fake_cryptogram_abc123", "token": "1111000000000000", "card_holder_name": "John Doe", "card_expiration": "12/2025", "card_billing_address": { "country": "US", "postal_code": "00000", }, }, "card2": { "type": "CARD", "alias": "American Express ending in 8888", "network": [{"name": "amex", "formats": ["DPAN"]}], "cryptogram": "fake_cryptogram_ghi789", "token": "2222000000000000", "card_holder_name": "Bugs Bunny", "card_expiration": "10/2027", "card_billing_address": { "country": "US", "postal_code": "00000", }, }, "bank_account1": { "type": "BANK_ACCOUNT", "account_number": "111", "alias": "Primary bank account", }, "digital_wallet1": { "type": "DIGITAL_WALLET", "brand": "PayPal", "account_identifier": "[email protected]", "alias": "Bugs's PayPal account", }, }, }, "[email protected]": { "payment_methods": { "bank_account1": { "type": "BANK_ACCOUNT", "brand": "Bank of Money", "account_number": "789", "alias": "Main checking account", } }, }, "[email protected]": { "payment_methods": { "digital_wallet1": { "type": "DIGITAL_WALLET", "brand": "PayPal", "account_identifier": "[email protected]", "alias": "Fudd's PayPal", } } }, } _token = {} class CredentialsProviderExecutor(BaseServerExecutor): """AgentExecutor for the credentials provider agent.""" _system_prompt = """ You are a credentials provider agent acting as a secure digital wallet. Your job is to manage a user's payment methods and shipping addresses. Based on the user's request, identify their intent and select the single correct tool to use. Your only output should be a tool call. Do not engage in conversation. %s """ % DEBUG_MODE_INSTRUCTIONS def __init__(self, supported_extensions: list[dict[str, Any]] = None): agent_tools = [ tools.handle_create_payment_credential_token, tools.handle_get_payment_method_raw_credentials, tools.handle_get_shipping_address, tools.handle_search_payment_methods, tools.handle_signed_payment_mandate, ] 3.3 Run The Demo (Chat to Buy a Coffee Maker)

Just follow the README to deploy it.

For Chinese users, Gemini may block you by location (return 40x responses), so you need to setup a proxy:

$ export no_proxy=localhost; export http_proxy=YOUR_PROXY; export https_proxy=YOUR_PROXY; export GOOGLE_API_KEY=YOUR_KEY; bash samples/python/scenarios/a2a/human-present/cards/run.sh

Below is an intact chat session, from first query to payment completing. Note that this example is designed to demonstrate the various capabilities and steps within AP2, which is why it may appear intricate. In practice, the process can be more streamlined than shown here.

Let’s see what’s happened in the behind.

3.4 Detailed Traces

We have two ways to inspect what’s happened in the behind. The first one is via the UI’s built-in tracing capability:

Fig.

3.5 Detailed A2A/AP2 Messages

The second way is diving into agent logs, which can give us more details. Just pick some of them, from the .logs/watch.log, which combines all the A2A messages between agents in this demo.

ShoppingAgent -> MerchantAgent: Find products matching user’s IntentMandate POST http://MerchantAgent/a2a/merchant_agent [Request Body] {'id': '888a4384-2aa8-41c3-adbe-864c767bdba5', 'jsonrpc': '2.0', 'method': 'message/send', 'params': {'configuration': {'acceptedOutputModes': [], 'blocking': True}, 'message': {'kind': 'message', 'messageId': '00162a36c7d645d9840e3fbda5bd625e', 'parts': [{'kind': 'text', 'text': "Find products that match the user's IntentMandate."}, {'data': {'ap2.mandates.IntentMandate': {'user_cart_confirmation_required': True, 'natural_language_description': 'espresso coffee maker', 'merchants': [], 'skus': [], 'requires_refundability': True, 'intent_expiry': '2025-11-12T03:45:42.037007+00:00'}}, 'kind': 'data'}, {'data': {'risk_data': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data'}, 'kind': 'data'}, {'data': {'shopping_agent_id': 'trusted_shopping_agent'}, 'kind': 'data'}], 'role': 'agent'}}} [Extension Header] X-A2A-Extensions: https://github.com/google-agentic-commerce/ap2/v1 [Request Instructions] ["Find products that match the user's IntentMandate."] [An Intent Mandate was in the request Data] {'user_cart_confirmation_required': True, 'natural_language_description': 'espresso coffee maker', 'merchants': [], 'skus': [], 'requires_refundability': True, 'intent_expiry': '2025-11-12T03:45:42.037007+00:00'} [Data Part: risk_data] eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data [Data Part: shopping_agent_id] trusted_shopping_agent [Response Body] {"id":"888a4384-2aa8-41c3-adbe-864c767bdba5","jsonrpc":"2.0","result":{"artifacts":[{"artifactId":"c0dad082-0c54-4f9a-963f-e312f5a4bf24","parts":[{"data":{"ap2.mandates.CartMandate":{"contents":{"id":"cart_1","user_cart_confirmation_required":true,"payment_request":{"method_data":[{"supported_methods":"CARD","data":{"network":["mastercard","paypal","amex"]}}],"details":{"id":"order_1","display_items":[{"label":"Compact espresso maker","amount":{"currency":"USD","value":89.99},"pending":null,"refund_period":30}],"shipping_options":null,"modifiers":null,"total":{"label":"Total","amount":{"currency":"USD","value":89.99},"pending":null,"refund_period":30}},"options":{"request_payer_name":false,"request_payer_email":false,"request_payer_phone":false,"request_shipping":true,"shipping_type":null},"shipping_address":null},"cart_expiry":"2025-11-11T04:15:58.088214+00:00","merchant_name":"Generic Merchant"},"merchant_authorization":null}},"kind":"data"}]},{"artifactId":"33680fca-e0b2-439a-bc1b-0f8ede344cb9","parts":[{"data":{"ap2.mandates.CartMandate":{"contents":{"id":"cart_2","user_cart_confirmation_required":true,"payment_request":{"method_data":[{"supported_methods":"CARD","data":{"network":["mastercard","paypal","amex"]}}],"details":{"id":"order_2","display_items":[{"label":"Automatic espresso and cappuccino machine","amount":{"currency":"USD","value":249.0},"pending":null,"refund_period":30}],"shipping_options":null,"modifiers":null,"total":{"label":"Total","amount":{"currency":"USD","value":249.0},"pending":null,"refund_period":30}},"options":{"request_payer_name":false,"request_payer_email":false,"request_payer_phone":false,"request_shipping":true,"shipping_type":null},"shipping_address":null},"cart_expiry":"2025-11-11T04:15:58.088214+00:00","merchant_name":"Generic Merchant"},"merchant_authorization":null}},"kind":"data"}]},{"artifactId":"d6dd431b-80a9-4892-b612-d4303524b674","parts":[{"data":{"ap2.mandates.CartMandate":{"contents":{"id":"cart_3","user_cart_confirmation_required":true,"payment_request":{"method_data":[{"supported_methods":"CARD","data":{"network":["mastercard","paypal","amex"]}}],"details":{"id":"order_3","display_items":[{"label":"Professional-grade espresso machine","amount":{"currency":"USD","value":599.99},"pending":false,"refund_period":60}],"shipping_options":null,"modifiers":null,"total":{"label":"Total","amount":{"currency":"USD","value":599.99},"pending":null,"refund_period":30}},"options":{"request_payer_name":false,"request_payer_email":false,"request_payer_phone":false,"request_shipping":true,"shipping_type":null},"shipping_address":null},"cart_expiry":"2025-11-11T04:15:58.088214+00:00","merchant_name":"Generic Merchant"},"merchant_authorization":null}},"kind":"data"}]}],"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","history":[{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"00162a36c7d645d9840e3fbda5bd625e","parts":[{"kind":"text","text":"Find products that match the user's IntentMandate."},{"data":{"ap2.mandates.IntentMandate":{"user_cart_confirmation_required":true,"natural_language_description":"espresso coffee maker","merchants":[],"skus":[],"requires_refundability":true,"intent_expiry":"2025-11-12T03:45:42.037007+00:00"}},"kind":"data"},{"data":{"risk_data":"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data"},"kind":"data"},{"data":{"shopping_agent_id":"trusted_shopping_agent"},"kind":"data"}],"role":"agent","taskId":"f76fdfa1-b505-4707-b1cc-a7f25bbadc00"}],"id":"f76fdfa1-b505-4707-b1cc-a7f25bbadc00","kind":"task","status":{"state":"completed","timestamp":"2025-11-11T03:45:58.161385+00:00"}}} ShoppingAgent -> PaymentCredentialProviderAgent: Get the user’s shipping address POST http://CredentialsProvider/a2a/credentials_provider [Request Body] {'id': '03155305-f224-48c5-9617-d51474022d4c', 'jsonrpc': '2.0', 'method': 'message/send', 'params': {'configuration': {'acceptedOutputModes': [], 'blocking': True}, 'message': {'contextId': '6030ebc7-fde8-4489-b655-045443c47af0', 'kind': 'message', 'messageId': '8517c8ca101c4bde9b2fe4b0d52043af', 'parts': [{'kind': 'text', 'text': "Get the user's shipping address."}, {'data': {'user_email': '[email protected]'}, 'kind': 'data'}], 'role': 'agent'}}} [Extension Header] X-A2A-Extensions: https://github.com/google-agentic-commerce/ap2/v1 [Request Instructions] ["Get the user's shipping address."] [Data Part: user_email] [email protected] [Response Body] {"id":"03155305-f224-48c5-9617-d51474022d4c","jsonrpc":"2.0","result":{"artifacts":[{"artifactId":"04dc9b8b-223d-432d-ad5b-ea513948b3be","parts":[{"data":{"contact_picker.ContactAddress":{"recipient":"Bugs Bunny","organization":"Sample Organization","address_line":["123 Main St"],"city":"Sample City","region":"ST","postal_code":"00000","country":"US","phone_number":"+1-000-000-0000"}},"kind":"data"}]}],"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","history":[{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"8517c8ca101c4bde9b2fe4b0d52043af","parts":[{"kind":"text","text":"Get the user's shipping address."},{"data":{"user_email":"[email protected]"},"kind":"data"}],"role":"agent","taskId":"b9341fb9-d060-427e-b216-971f5ee3f72f"}],"id":"b9341fb9-d060-427e-b216-971f5ee3f72f","kind":"task","status":{"state":"completed","timestamp":"2025-11-11T03:49:03.656069+00:00"}}} ShoppingAgent -> MerchantAgent: Update the cart with the user’s shipping address POST http://MerchantAgent/a2a/merchant_agent [Request Body] {'id': '82fb46ac-4ff8-4012-b70a-d85d528bfede', 'jsonrpc': '2.0', 'method': 'message/send', 'params': {'configuration': {'acceptedOutputModes': [], 'blocking': True}, 'message': {'contextId': '6030ebc7-fde8-4489-b655-045443c47af0', 'kind': 'message', 'messageId': 'bc9c493b5d9640a1a4a902c71ec10f39', 'parts': [{'kind': 'text', 'text': "Update the cart with the user's shipping address."}, {'data': {'cart_id': 'cart_3'}, 'kind': 'data'}, {'data': {'shipping_address': {'recipient': 'Bugs Bunny', 'region': 'ST', 'country': 'US', 'postal_code': '00000', 'organization': 'Sample Organization', 'phone_number': '+1-000-000-0000', 'city': 'Sample City', 'address_line': ['123 Main St']}}, 'kind': 'data'}, {'data': {'shopping_agent_id': 'trusted_shopping_agent'}, 'kind': 'data'}], 'role': 'agent'}}} [Extension Header] X-A2A-Extensions: https://github.com/google-agentic-commerce/ap2/v1 [Request Instructions] ["Update the cart with the user's shipping address."] [Data Part: cart_id] cart_3 [Data Part: shipping_address] {'recipient': 'Bugs Bunny', 'region': 'ST', 'country': 'US', 'postal_code': '00000', 'organization': 'Sample Organization', 'phone_number': '+1-000-000-0000', 'city': 'Sample City', 'address_line': ['123 Main St']} [Data Part: shopping_agent_id] trusted_shopping_agent [Response Body] {"id":"82fb46ac-4ff8-4012-b70a-d85d528bfede","jsonrpc":"2.0","result":{"artifacts":[{"artifactId":"b88f3fa6-70a8-4382-a000-9b76d60c135d","parts":[{"data":{"ap2.mandates.CartMandate":{"contents":{"id":"cart_3","user_cart_confirmation_required":true,"payment_request":{"method_data":[{"supported_methods":"CARD","data":{"network":["mastercard","paypal","amex"]}}],"details":{"id":"order_3","display_items":[{"label":"Professional-grade espresso machine","amount":{"currency":"USD","value":603.49},"pending":false,"refund_period":60},{"label":"Shipping","amount":{"currency":"USD","value":2.0},"pending":null,"refund_period":30},{"label":"Tax","amount":{"currency":"USD","value":1.5},"pending":null,"refund_period":30}],"shipping_options":null,"modifiers":null,"total":{"label":"Total","amount":{"currency":"USD","value":603.49},"pending":null,"refund_period":30}},"options":{"request_payer_name":false,"request_payer_email":false,"request_payer_phone":false,"request_shipping":true,"shipping_type":null},"shipping_address":{"city":"Sample City","country":"US","dependent_locality":null,"organization":"Sample Organization","phone_number":"+1-000-000-0000","postal_code":"00000","recipient":"Bugs Bunny","region":"ST","sorting_code":null,"address_line":["123 Main St"]}},"cart_expiry":"2025-11-11T04:15:58.088214+00:00","merchant_name":"Generic Merchant"},"merchant_authorization":"eyJhbGciOiJSUzI1NiIsImtpZIwMjQwOTA..."}},"kind":"data"},{"data":{"risk_data":"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data"},"kind":"data"}]}],"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","history":[{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"bc9c493b5d9640a1a4a902c71ec10f39","parts":[{"kind":"text","text":"Update the cart with the user's shipping address."},{"data":{"cart_id":"cart_3"},"kind":"data"},{"data":{"shipping_address":{"recipient":"Bugs Bunny","region":"ST","country":"US","postal_code":"00000","organization":"Sample Organization","phone_number":"+1-000-000-0000","city":"Sample City","address_line":["123 Main St"]}},"kind":"data"},{"data":{"shopping_agent_id":"trusted_shopping_agent"},"kind":"data"}],"role":"agent","taskId":"43c9c925-df04-48a5-970b-6ec86bd3d27c"}],"id":"43c9c925-df04-48a5-970b-6ec86bd3d27c","kind":"task","status":{"state":"completed","timestamp":"2025-11-11T03:49:16.810434+00:00"}}} ShoppingAgent -> PaymentCredentialProviderAgent: Get a filtered list of the user’s payment methods POST http://CredentialsProvider/a2a/credentials_provider [Request Body] {'id': '14885fe8-7637-4096-b997-6d58a0782b29', 'jsonrpc': '2.0', 'method': 'message/send', 'params': {'configuration': {'acceptedOutputModes': [], 'blocking': True}, 'message': {'contextId': '6030ebc7-fde8-4489-b655-045443c47af0', 'kind': 'message', 'messageId': 'f281b2f77101477e82de95aae26bea78', 'parts': [{'kind': 'text', 'text': "Get a filtered list of the user's payment methods."}, {'data': {'user_email': '[email protected]'}, 'kind': 'data'}, {'data': {'payment_request.PaymentMethodData': {'supported_methods': 'CARD', 'data': {'network': ['mastercard', 'paypal', 'amex']}}}, 'kind': 'data'}], 'role': 'agent'}}} [Extension Header] X-A2A-Extensions: https://github.com/google-agentic-commerce/ap2/v1 [Request Instructions] ["Get a filtered list of the user's payment methods."] [Data Part: user_email] [email protected] [Data Part: payment_request.PaymentMethodData] {'supported_methods': 'CARD', 'data': {'network': ['mastercard', 'paypal', 'amex']}} [Response Body] {"id":"14885fe8-7637-4096-b997-6d58a0782b29","jsonrpc":"2.0","result":{"artifacts":[{"artifactId":"e605eb15-18ee-49c3-b7c7-05638e4b0ff6","parts":[{"data":{"payment_method_aliases":["American Express ending in 4444","American Express ending in 8888"]},"kind":"data"}]}],"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","history":[{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"f281b2f77101477e82de95aae26bea78","parts":[{"kind":"text","text":"Get a filtered list of the user's payment methods."},{"data":{"user_email":"[email protected]"},"kind":"data"},{"data":{"payment_request.PaymentMethodData":{"supported_methods":"CARD","data":{"network":["mastercard","paypal","amex"]}}},"kind":"data"}],"role":"agent","taskId":"495725e5-2923-4552-9bfb-5fc0918d28de"}],"id":"495725e5-2923-4552-9bfb-5fc0918d28de","kind":"task","status":{"state":"completed","timestamp":"2025-11-11T03:49:31.574452+00:00"}}} ShoppingAgent -> PaymentCredentialProviderAgent: Get a payment credential token for the user’s payment method POST http://CredentialsProvider/a2a/credentials_provider [Request Body] {'id': 'e707b136-c1f6-4620-b330-59e19c4800d4', 'jsonrpc': '2.0', 'method': 'message/send', 'params': {'configuration': {'acceptedOutputModes': [], 'blocking': True}, 'message': {'contextId': '6030ebc7-fde8-4489-b655-045443c47af0', 'kind': 'message', 'messageId': '50e1e010700242ee995a7b9721e67f09', 'parts': [{'kind': 'text', 'text': "Get a payment credential token for the user's payment method."}, {'data': {'payment_method_alias': 'American Express ending in 4444'}, 'kind': 'data'}, {'data': {'user_email': '[email protected]'}, 'kind': 'data'}], 'role': 'agent'}}} [Extension Header] X-A2A-Extensions: https://github.com/google-agentic-commerce/ap2/v1 [Request Instructions] ["Get a payment credential token for the user's payment method."] [Data Part: payment_method_alias] American Express ending in 4444 [Data Part: user_email] [email protected] [Response Body] {"id":"e707b136-c1f6-4620-b330-59e19c4800d4","jsonrpc":"2.0","result":{"artifacts":[{"artifactId":"b2f8c50e-1bb0-4398-ae59-dee80926b667","parts":[{"data":{"token":"fake_payment_credential_token_0"},"kind":"data"}]}],"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","history":[{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"50e1e010700242ee995a7b9721e67f09","parts":[{"kind":"text","text":"Get a payment credential token for the user's payment method."},{"data":{"payment_method_alias":"American Express ending in 4444"},"kind":"data"},{"data":{"user_email":"[email protected]"},"kind":"data"}],"role":"agent","taskId":"5ffdd520-93ab-4237-81e6-25e63765032a"}],"id":"5ffdd520-93ab-4237-81e6-25e63765032a","kind":"task","status":{"state":"completed","timestamp":"2025-11-11T03:49:57.616296+00:00"}}} ShoppingAgent -> PaymentCredentialProviderAgent: This is the signed payment mandate POST http://CredentialsProvider/a2a/credentials_provider [Request Body] {'id': '2fbe086c-6eab-46a1-b5c4-06e61ee3f90c', 'jsonrpc': '2.0', 'method': 'message/send', 'params': {'configuration': {'acceptedOutputModes': [], 'blocking': True}, 'message': {'contextId': '6030ebc7-fde8-4489-b655-045443c47af0', 'kind': 'message', 'messageId': 'cf8e7c0d0c534636bbd34619aea40486', 'parts': [{'kind': 'text', 'text': 'This is the signed payment mandate'}, {'data': {'ap2.mandates.PaymentMandate': {'payment_mandate_contents': {'payment_mandate_id': '848f97b287584cd1aa3085bed1985c22', 'payment_details_id': 'order_3', 'payment_details_total': {'label': 'Total', 'amount': {'currency': 'USD', 'value': 603.49}, 'refund_period': 30}, 'payment_response': {'request_id': 'order_3', 'method_name': 'CARD', 'details': {'token': {'value': 'fake_payment_credential_token_0', 'url': 'http://CredentialsProvider/a2a/credentials_provider'}}, 'shipping_address': {'city': 'Sample City', 'country': 'US', 'organization': 'Sample Organization', 'phone_number': '+1-000-000-0000', 'postal_code': '00000', 'recipient': 'Bugs Bunny', 'region': 'ST', 'address_line': ['123 Main St']}, 'payer_email': '[email protected]'}, 'merchant_agent': 'Generic Merchant', 'timestamp': '2025-11-11T03:50:04.532972+00:00'}, 'user_authorization': 'fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22'}}, 'kind': 'data'}, {'data': {'risk_data': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data'}, 'kind': 'data'}], 'role': 'agent'}}} [Extension Header] X-A2A-Extensions: https://github.com/google-agentic-commerce/ap2/v1 [Request Instructions] ['This is the signed payment mandate'] [A Payment Mandate was in the request Data] {'payment_mandate_contents': {'payment_mandate_id': '848f97b287584cd1aa3085bed1985c22', 'payment_details_id': 'order_3', 'payment_details_total': {'label': 'Total', 'amount': {'currency': 'USD', 'value': 603.49}, 'refund_period': 30}, 'payment_response': {'request_id': 'order_3', 'method_name': 'CARD', 'details': {'token': {'value': 'fake_payment_credential_token_0', 'url': 'http://CredentialsProvider/a2a/credentials_provider'}}, 'shipping_address': {'city': 'Sample City', 'country': 'US', 'organization': 'Sample Organization', 'phone_number': '+1-000-000-0000', 'postal_code': '00000', 'recipient': 'Bugs Bunny', 'region': 'ST', 'address_line': ['123 Main St']}, 'payer_email': '[email protected]'}, 'merchant_agent': 'Generic Merchant', 'timestamp': '2025-11-11T03:50:04.532972+00:00'}, 'user_authorization': 'fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22'} [Data Part: risk_data] eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data [Response Body] {"id":"2fbe086c-6eab-46a1-b5c4-06e61ee3f90c","jsonrpc":"2.0","result":{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","history":[{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"cf8e7c0d0c534636bbd34619aea40486","parts":[{"kind":"text","text":"This is the signed payment mandate"},{"data":{"ap2.mandates.PaymentMandate":{"payment_mandate_contents":{"payment_mandate_id":"848f97b287584cd1aa3085bed1985c22","payment_details_id":"order_3","payment_details_total":{"label":"Total","amount":{"currency":"USD","value":603.49},"refund_period":30},"payment_response":{"request_id":"order_3","method_name":"CARD","details":{"token":{"value":"fake_payment_credential_token_0","url":"http://CredentialsProvider/a2a/credentials_provider"}},"shipping_address":{"city":"Sample City","country":"US","organization":"Sample Organization","phone_number":"+1-000-000-0000","postal_code":"00000","recipient":"Bugs Bunny","region":"ST","address_line":["123 Main St"]},"payer_email":"[email protected]"},"merchant_agent":"Generic Merchant","timestamp":"2025-11-11T03:50:04.532972+00:00"},"user_authorization":"fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22"}},"kind":"data"},{"data":{"risk_data":"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data"},"kind":"data"}],"role":"agent","taskId":"fc67d42e-57f9-4efe-8d2a-7d7f3c31d70f"}],"id":"fc67d42e-57f9-4efe-8d2a-7d7f3c31d70f","kind":"task","status":{"state":"completed","timestamp":"2025-11-11T03:51:06.650655+00:00"}}} ShoppingAgent -> MerchantAgent: Initiate a payment POST http://MerchantAgent/a2a/merchant_agent [Request Body] {'id': 'be1ef52c-fd9d-4177-810d-cd14303219f1', 'jsonrpc': '2.0', 'method': 'message/send', 'params': {'configuration': {'acceptedOutputModes': [], 'blocking': True}, 'message': {'contextId': '6030ebc7-fde8-4489-b655-045443c47af0', 'kind': 'message', 'messageId': '14ed8b94ec5a4cc0a516a7b8d62cc6f8', 'parts': [{'kind': 'text', 'text': 'Initiate a payment'}, {'data': {'ap2.mandates.PaymentMandate': {'payment_mandate_contents': {'payment_mandate_id': '848f97b287584cd1aa3085bed1985c22', 'payment_details_id': 'order_3', 'payment_details_total': {'label': 'Total', 'amount': {'currency': 'USD', 'value': 603.49}, 'refund_period': 30}, 'payment_response': {'request_id': 'order_3', 'method_name': 'CARD', 'details': {'token': {'value': 'fake_payment_credential_token_0', 'url': 'http://CredentialsProvider/a2a/credentials_provider'}}, 'shipping_address': {'city': 'Sample City', 'country': 'US', 'organization': 'Sample Organization', 'phone_number': '+1-000-000-0000', 'postal_code': '00000', 'recipient': 'Bugs Bunny', 'region': 'ST', 'address_line': ['123 Main St']}, 'payer_email': '[email protected]'}, 'merchant_agent': 'Generic Merchant', 'timestamp': '2025-11-11T03:50:04.532972+00:00'}, 'user_authorization': 'fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22'}}, 'kind': 'data'}, {'data': {'risk_data': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data'}, 'kind': 'data'}, {'data': {'shopping_agent_id': 'trusted_shopping_agent'}, 'kind': 'data'}], 'role': 'agent'}}} [Extension Header] X-A2A-Extensions: https://github.com/google-agentic-commerce/ap2/v1 [Request Instructions] ['Initiate a payment'] [A Payment Mandate was in the request Data] {'payment_mandate_contents': {'payment_mandate_id': '848f97b287584cd1aa3085bed1985c22', 'payment_details_id': 'order_3', 'payment_details_total': {'label': 'Total', 'amount': {'currency': 'USD', 'value': 603.49}, 'refund_period': 30}, 'payment_response': {'request_id': 'order_3', 'method_name': 'CARD', 'details': {'token': {'value': 'fake_payment_credential_token_0', 'url': 'http://CredentialsProvider/a2a/credentials_provider'}}, 'shipping_address': {'city': 'Sample City', 'country': 'US', 'organization': 'Sample Organization', 'phone_number': '+1-000-000-0000', 'postal_code': '00000', 'recipient': 'Bugs Bunny', 'region': 'ST', 'address_line': ['123 Main St']}, 'payer_email': '[email protected]'}, 'merchant_agent': 'Generic Merchant', 'timestamp': '2025-11-11T03:50:04.532972+00:00'}, 'user_authorization': 'fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22'} [Data Part: risk_data] eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data [Data Part: shopping_agent_id] trusted_shopping_agent MerchantAgent -> MerchantPaymentAgent: Initiate a payment POST http://merchant_payment_processor_agent/a2a/merchant_payment_processor_agent [Request Body] {'id': '7c7fbb73-4bb6-42b2-b5bd-d6d766078cca', 'jsonrpc': '2.0', 'method': 'message/send', 'params': {'configuration': {'acceptedOutputModes': [], 'blocking': True}, 'message': {'contextId': '6030ebc7-fde8-4489-b655-045443c47af0', 'kind': 'message', 'messageId': '85dc6b61ae8e4e23bc8d14fc02ca14eb', 'parts': [{'kind': 'text', 'text': 'initiate_payment'}, {'data': {'ap2.mandates.PaymentMandate': {'payment_mandate_contents': {'payment_mandate_id': '848f97b287584cd1aa3085bed1985c22', 'payment_details_id': 'order_3', 'payment_details_total': {'label': 'Total', 'amount': {'currency': 'USD', 'value': 603.49}, 'pending': None, 'refund_period': 30}, 'payment_response': {'request_id': 'order_3', 'method_name': 'CARD', 'details': {'token': {'value': 'fake_payment_credential_token_0', 'url': 'http://CredentialsProvider/a2a/credentials_provider'}}, 'shipping_address': {'city': 'Sample City', 'country': 'US', 'dependent_locality': None, 'organization': 'Sample Organization', 'phone_number': '+1-000-000-0000', 'postal_code': '00000', 'recipient': 'Bugs Bunny', 'region': 'ST', 'sorting_code': None, 'address_line': ['123 Main St']}, 'shipping_option': None, 'payer_name': None, 'payer_email': '[email protected]', 'payer_phone': None}, 'merchant_agent': 'Generic Merchant', 'timestamp': '2025-11-11T03:50:04.532972+00:00'}, 'user_authorization': 'fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22'}}, 'kind': 'data'}, {'data': {'risk_data': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data'}, 'kind': 'data'}], 'role': 'agent'}}} [Extension Header] X-A2A-Extensions: https://github.com/google-agentic-commerce/ap2/v1 [Request Instructions] ['initiate_payment'] [A Payment Mandate was in the request Data] {'payment_mandate_contents': {'payment_mandate_id': '848f97b287584cd1aa3085bed1985c22', 'payment_details_id': 'order_3', 'payment_details_total': {'label': 'Total', 'amount': {'currency': 'USD', 'value': 603.49}, 'pending': None, 'refund_period': 30}, 'payment_response': {'request_id': 'order_3', 'method_name': 'CARD', 'details': {'token': {'value': 'fake_payment_credential_token_0', 'url': 'http://CredentialsProvider/a2a/credentials_provider'}}, 'shipping_address': {'city': 'Sample City', 'country': 'US', 'dependent_locality': None, 'organization': 'Sample Organization', 'phone_number': '+1-000-000-0000', 'postal_code': '00000', 'recipient': 'Bugs Bunny', 'region': 'ST', 'sorting_code': None, 'address_line': ['123 Main St']}, 'shipping_option': None, 'payer_name': None, 'payer_email': '[email protected]', 'payer_phone': None}, 'merchant_agent': 'Generic Merchant', 'timestamp': '2025-11-11T03:50:04.532972+00:00'}, 'user_authorization': 'fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22'} [Data Part: risk_data] eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data [Response Body] {"id":"7c7fbb73-4bb6-42b2-b5bd-d6d766078cca","jsonrpc":"2.0","result":{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","history":[{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"85dc6b61ae8e4e23bc8d14fc02ca14eb","parts":[{"kind":"text","text":"initiate_payment"},{"data":{"ap2.mandates.PaymentMandate":{"payment_mandate_contents":{"payment_mandate_id":"848f97b287584cd1aa3085bed1985c22","payment_details_id":"order_3","payment_details_total":{"label":"Total","amount":{"currency":"USD","value":603.49},"pending":null,"refund_period":30},"payment_response":{"request_id":"order_3","method_name":"CARD","details":{"token":{"value":"fake_payment_credential_token_0","url":"http://CredentialsProvider/a2a/credentials_provider"}},"shipping_address":{"city":"Sample City","country":"US","dependent_locality":null,"organization":"Sample Organization","phone_number":"+1-000-000-0000","postal_code":"00000","recipient":"Bugs Bunny","region":"ST","sorting_code":null,"address_line":["123 Main St"]},"shipping_option":null,"payer_name":null,"payer_email":"[email protected]","payer_phone":null},"merchant_agent":"Generic Merchant","timestamp":"2025-11-11T03:50:04.532972+00:00"},"user_authorization":"fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22"}},"kind":"data"},{"data":{"risk_data":"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data"},"kind":"data"}],"role":"agent","taskId":"799fbe91-a538-497f-904c-d81eda1dedbf"}],"id":"799fbe91-a538-497f-904c-d81eda1dedbf","kind":"task","status":{"message":{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"965f5ae8-0cb6-4687-9a3c-64267fe165da","parts":[{"kind":"text","text":"Please provide the challenge response to complete the payment."},{"data":{"challenge":{"type":"otp","display_text":"The payment method issuer sent a verification code to the phone number on file, please enter it below. It will be shared with the issuer so they can authorize the transaction.(Demo only hint: the code is 123)"}},"kind":"data"}],"role":"agent","taskId":"799fbe91-a538-497f-904c-d81eda1dedbf"},"state":"input-required","timestamp":"2025-11-11T03:51:20.214669+00:00"}}} [Response Body] {"id":"be1ef52c-fd9d-4177-810d-cd14303219f1","jsonrpc":"2.0","result":{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","history":[{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"14ed8b94ec5a4cc0a516a7b8d62cc6f8","parts":[{"kind":"text","text":"Initiate a payment"},{"data":{"ap2.mandates.PaymentMandate":{"payment_mandate_contents":{"payment_mandate_id":"848f97b287584cd1aa3085bed1985c22","payment_details_id":"order_3","payment_details_total":{"label":"Total","amount":{"currency":"USD","value":603.49},"refund_period":30},"payment_response":{"request_id":"order_3","method_name":"CARD","details":{"token":{"value":"fake_payment_credential_token_0","url":"http://CredentialsProvider/a2a/credentials_provider"}},"shipping_address":{"city":"Sample City","country":"US","organization":"Sample Organization","phone_number":"+1-000-000-0000","postal_code":"00000","recipient":"Bugs Bunny","region":"ST","address_line":["123 Main St"]},"payer_email":"[email protected]"},"merchant_agent":"Generic Merchant","timestamp":"2025-11-11T03:50:04.532972+00:00"},"user_authorization":"fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22"}},"kind":"data"},{"data":{"risk_data":"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data"},"kind":"data"},{"data":{"shopping_agent_id":"trusted_shopping_agent"},"kind":"data"}],"role":"agent","taskId":"57a672e8-478b-4c7a-8885-00388224e886"}],"id":"57a672e8-478b-4c7a-8885-00388224e886","kind":"task","status":{"message":{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"965f5ae8-0cb6-4687-9a3c-64267fe165da","parts":[{"kind":"text","text":"Please provide the challenge response to complete the payment."},{"data":{"challenge":{"type":"otp","display_text":"The payment method issuer sent a verification code to the phone number on file, please enter it below. It will be shared with the issuer so they can authorize the transaction.(Demo only hint: the code is 123)"}},"kind":"data"}],"role":"agent","taskId":"799fbe91-a538-497f-904c-d81eda1dedbf"},"state":"input-required","timestamp":"2025-11-11T03:51:20.217209+00:00"}}} ShoppingAgent -> MerchantAgent: Initiate a payment. Include the challenge response. POST http://MerchantAgent/a2a/merchant_agent [Request Body] {'id': '716d25d2-2541-41b7-bd8a-2f94465a91d1', 'jsonrpc': '2.0', 'method': 'message/send', 'params': {'configuration': {'acceptedOutputModes': [], 'blocking': True}, 'message': {'contextId': '6030ebc7-fde8-4489-b655-045443c47af0', 'kind': 'message', 'messageId': '4f466784348444b58d547a64f42d31ca', 'parts': [{'kind': 'text', 'text': 'Initiate a payment. Include the challenge response.'}, {'data': {'ap2.mandates.PaymentMandate': {'payment_mandate_contents': {'payment_mandate_id': '848f97b287584cd1aa3085bed1985c22', 'payment_details_id': 'order_3', 'payment_details_total': {'label': 'Total', 'amount': {'currency': 'USD', 'value': 603.49}, 'refund_period': 30}, 'payment_response': {'request_id': 'order_3', 'method_name': 'CARD', 'details': {'token': {'value': 'fake_payment_credential_token_0', 'url': 'http://CredentialsProvider/a2a/credentials_provider'}}, 'shipping_address': {'city': 'Sample City', 'country': 'US', 'organization': 'Sample Organization', 'phone_number': '+1-000-000-0000', 'postal_code': '00000', 'recipient': 'Bugs Bunny', 'region': 'ST', 'address_line': ['123 Main St']}, 'payer_email': '[email protected]'}, 'merchant_agent': 'Generic Merchant', 'timestamp': '2025-11-11T03:50:04.532972+00:00'}, 'user_authorization': 'fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22'}}, 'kind': 'data'}, {'data': {'shopping_agent_id': 'trusted_shopping_agent'}, 'kind': 'data'}, {'data': {'challenge_response': '123'}, 'kind': 'data'}, {'data': {'risk_data': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data'}, 'kind': 'data'}], 'role': 'agent', 'taskId': '57a672e8-478b-4c7a-8885-00388224e886'}}} [Extension Header] X-A2A-Extensions: https://github.com/google-agentic-commerce/ap2/v1 [Request Instructions] ['Initiate a payment. Include the challenge response.'] [A Payment Mandate was in the request Data] {'payment_mandate_contents': {'payment_mandate_id': '848f97b287584cd1aa3085bed1985c22', 'payment_details_id': 'order_3', 'payment_details_total': {'label': 'Total', 'amount': {'currency': 'USD', 'value': 603.49}, 'refund_period': 30}, 'payment_response': {'request_id': 'order_3', 'method_name': 'CARD', 'details': {'token': {'value': 'fake_payment_credential_token_0', 'url': 'http://CredentialsProvider/a2a/credentials_provider'}}, 'shipping_address': {'city': 'Sample City', 'country': 'US', 'organization': 'Sample Organization', 'phone_number': '+1-000-000-0000', 'postal_code': '00000', 'recipient': 'Bugs Bunny', 'region': 'ST', 'address_line': ['123 Main St']}, 'payer_email': '[email protected]'}, 'merchant_agent': 'Generic Merchant', 'timestamp': '2025-11-11T03:50:04.532972+00:00'}, 'user_authorization': 'fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22'} [Data Part: shopping_agent_id] trusted_shopping_agent [Data Part: challenge_response] 123 [Data Part: risk_data] eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data MerchantAgent -> MerchantPaymentAgent: Initiate a payment (include the challenge response) POST http://merchant_payment_processor_agent/a2a/merchant_payment_processor_agent [Request Body] {'id': 'd9bc9bc6-73bb-4667-949b-85a60633089b', 'jsonrpc': '2.0', 'method': 'message/send', 'params': {'configuration': {'acceptedOutputModes': [], 'blocking': True}, 'message': {'contextId': '6030ebc7-fde8-4489-b655-045443c47af0', 'kind': 'message', 'messageId': 'f33f4dc30d3a41878c8d1d7006b2cf0e', 'parts': [{'kind': 'text', 'text': 'initiate_payment'}, {'data': {'ap2.mandates.PaymentMandate': {'payment_mandate_contents': {'payment_mandate_id': '848f97b287584cd1aa3085bed1985c22', 'payment_details_id': 'order_3', 'payment_details_total': {'label': 'Total', 'amount': {'currency': 'USD', 'value': 603.49}, 'pending': None, 'refund_period': 30}, 'payment_response': {'request_id': 'order_3', 'method_name': 'CARD', 'details': {'token': {'value': 'fake_payment_credential_token_0', 'url': 'http://CredentialsProvider/a2a/credentials_provider'}}, 'shipping_address': {'city': 'Sample City', 'country': 'US', 'dependent_locality': None, 'organization': 'Sample Organization', 'phone_number': '+1-000-000-0000', 'postal_code': '00000', 'recipient': 'Bugs Bunny', 'region': 'ST', 'sorting_code': None, 'address_line': ['123 Main St']}, 'shipping_option': None, 'payer_name': None, 'payer_email': '[email protected]', 'payer_phone': None}, 'merchant_agent': 'Generic Merchant', 'timestamp': '2025-11-11T03:50:04.532972+00:00'}, 'user_authorization': 'fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22'}}, 'kind': 'data'}, {'data': {'risk_data': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data'}, 'kind': 'data'}, {'data': {'challenge_response': '123'}, 'kind': 'data'}], 'role': 'agent', 'taskId': '799fbe91-a538-497f-904c-d81eda1dedbf'}}} [Extension Header] X-A2A-Extensions: https://github.com/google-agentic-commerce/ap2/v1 [Request Instructions] ['initiate_payment'] [A Payment Mandate was in the request Data] {'payment_mandate_contents': {'payment_mandate_id': '848f97b287584cd1aa3085bed1985c22', 'payment_details_id': 'order_3', 'payment_details_total': {'label': 'Total', 'amount': {'currency': 'USD', 'value': 603.49}, 'pending': None, 'refund_period': 30}, 'payment_response': {'request_id': 'order_3', 'method_name': 'CARD', 'details': {'token': {'value': 'fake_payment_credential_token_0', 'url': 'http://CredentialsProvider/a2a/credentials_provider'}}, 'shipping_address': {'city': 'Sample City', 'country': 'US', 'dependent_locality': None, 'organization': 'Sample Organization', 'phone_number': '+1-000-000-0000', 'postal_code': '00000', 'recipient': 'Bugs Bunny', 'region': 'ST', 'sorting_code': None, 'address_line': ['123 Main St']}, 'shipping_option': None, 'payer_name': None, 'payer_email': '[email protected]', 'payer_phone': None}, 'merchant_agent': 'Generic Merchant', 'timestamp': '2025-11-11T03:50:04.532972+00:00'}, 'user_authorization': 'fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22'} [Data Part: risk_data] eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data [Data Part: challenge_response] 123 MerchantPaymentAgent -> PaymentCredentialProviderAgent: Give me the payment method credentials for the given token POST http://CredentialsProvider/a2a/credentials_provider [Request Body] {'id': '6724cb50-56f1-42e0-9864-64d253828cac', 'jsonrpc': '2.0', 'method': 'message/send', 'params': {'configuration': {'acceptedOutputModes': [], 'blocking': True}, 'message': {'contextId': '6030ebc7-fde8-4489-b655-045443c47af0', 'kind': 'message', 'messageId': '92b1783ecc8f4cc0ac6dc1f853c38297', 'parts': [{'kind': 'text', 'text': 'Give me the payment method credentials for the given token.'}, {'data': {'ap2.mandates.PaymentMandate': {'payment_mandate_contents': {'payment_mandate_id': '848f97b287584cd1aa3085bed1985c22', 'payment_details_id': 'order_3', 'payment_details_total': {'label': 'Total', 'amount': {'currency': 'USD', 'value': 603.49}, 'pending': None, 'refund_period': 30}, 'payment_response': {'request_id': 'order_3', 'method_name': 'CARD', 'details': {'token': {'value': 'fake_payment_credential_token_0', 'url': 'http://CredentialsProvider/a2a/credentials_provider'}}, 'shipping_address': {'city': 'Sample City', 'country': 'US', 'dependent_locality': None, 'organization': 'Sample Organization', 'phone_number': '+1-000-000-0000', 'postal_code': '00000', 'recipient': 'Bugs Bunny', 'region': 'ST', 'sorting_code': None, 'address_line': ['123 Main St']}, 'shipping_option': None, 'payer_name': None, 'payer_email': '[email protected]', 'payer_phone': None}, 'merchant_agent': 'Generic Merchant', 'timestamp': '2025-11-11T03:50:04.532972+00:00'}, 'user_authorization': 'fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22'}}, 'kind': 'data'}], 'role': 'agent'}}} [Extension Header] X-A2A-Extensions: https://github.com/google-agentic-commerce/ap2/v1 [Request Instructions] ['Give me the payment method credentials for the given token.'] [A Payment Mandate was in the request Data] {'payment_mandate_contents': {'payment_mandate_id': '848f97b287584cd1aa3085bed1985c22', 'payment_details_id': 'order_3', 'payment_details_total': {'label': 'Total', 'amount': {'currency': 'USD', 'value': 603.49}, 'pending': None, 'refund_period': 30}, 'payment_response': {'request_id': 'order_3', 'method_name': 'CARD', 'details': {'token': {'value': 'fake_payment_credential_token_0', 'url': 'http://CredentialsProvider/a2a/credentials_provider'}}, 'shipping_address': {'city': 'Sample City', 'country': 'US', 'dependent_locality': None, 'organization': 'Sample Organization', 'phone_number': '+1-000-000-0000', 'postal_code': '00000', 'recipient': 'Bugs Bunny', 'region': 'ST', 'sorting_code': None, 'address_line': ['123 Main St']}, 'shipping_option': None, 'payer_name': None, 'payer_email': '[email protected]', 'payer_phone': None}, 'merchant_agent': 'Generic Merchant', 'timestamp': '2025-11-11T03:50:04.532972+00:00'}, 'user_authorization': 'fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22'} [Response Body] {"id":"6724cb50-56f1-42e0-9864-64d253828cac","jsonrpc":"2.0","result":{"artifacts":[{"artifactId":"253b8275-f7a1-492b-81c0-b49627e9be9b","parts":[{"data":{"type":"CARD","alias":"American Express ending in 4444","network":[{"name":"amex","formats":["DPAN"]}],"cryptogram":"fake_cryptogram_abc123","token":"1111000000000000","card_holder_name":"John Doe","card_expiration":"12/2025","card_billing_address":{"country":"US","postal_code":"00000"}},"kind":"data"}]}],"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","history":[{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"92b1783ecc8f4cc0ac6dc1f853c38297","parts":[{"kind":"text","text":"Give me the payment method credentials for the given token."},{"data":{"ap2.mandates.PaymentMandate":{"payment_mandate_contents":{"payment_mandate_id":"848f97b287584cd1aa3085bed1985c22","payment_details_id":"order_3","payment_details_total":{"label":"Total","amount":{"currency":"USD","value":603.49},"pending":null,"refund_period":30},"payment_response":{"request_id":"order_3","method_name":"CARD","details":{"token":{"value":"fake_payment_credential_token_0","url":"http://CredentialsProvider/a2a/credentials_provider"}},"shipping_address":{"city":"Sample City","country":"US","dependent_locality":null,"organization":"Sample Organization","phone_number":"+1-000-000-0000","postal_code":"00000","recipient":"Bugs Bunny","region":"ST","sorting_code":null,"address_line":["123 Main St"]},"shipping_option":null,"payer_name":null,"payer_email":"[email protected]","payer_phone":null},"merchant_agent":"Generic Merchant","timestamp":"2025-11-11T03:50:04.532972+00:00"},"user_authorization":"fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22"}},"kind":"data"}],"role":"agent","taskId":"65d8cfea-407e-434f-91b5-9852db1b4fbd"}],"id":"65d8cfea-407e-434f-91b5-9852db1b4fbd","kind":"task","status":{"state":"completed","timestamp":"2025-11-11T03:51:48.590478+00:00"}}} [Response Body] {"id":"d9bc9bc6-73bb-4667-949b-85a60633089b","jsonrpc":"2.0","result":{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","history":[{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"85dc6b61ae8e4e23bc8d14fc02ca14eb","parts":[{"kind":"text","text":"initiate_payment"},{"data":{"ap2.mandates.PaymentMandate":{"payment_mandate_contents":{"payment_mandate_id":"848f97b287584cd1aa3085bed1985c22","payment_details_id":"order_3","payment_details_total":{"label":"Total","amount":{"currency":"USD","value":603.49},"pending":null,"refund_period":30},"payment_response":{"request_id":"order_3","method_name":"CARD","details":{"token":{"value":"fake_payment_credential_token_0","url":"http://CredentialsProvider/a2a/credentials_provider"}},"shipping_address":{"city":"Sample City","country":"US","dependent_locality":null,"organization":"Sample Organization","phone_number":"+1-000-000-0000","postal_code":"00000","recipient":"Bugs Bunny","region":"ST","sorting_code":null,"address_line":["123 Main St"]},"shipping_option":null,"payer_name":null,"payer_email":"[email protected]","payer_phone":null},"merchant_agent":"Generic Merchant","timestamp":"2025-11-11T03:50:04.532972+00:00"},"user_authorization":"fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22"}},"kind":"data"},{"data":{"risk_data":"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data"},"kind":"data"}],"role":"agent","taskId":"799fbe91-a538-497f-904c-d81eda1dedbf"},{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"965f5ae8-0cb6-4687-9a3c-64267fe165da","parts":[{"kind":"text","text":"Please provide the challenge response to complete the payment."},{"data":{"challenge":{"type":"otp","display_text":"The payment method issuer sent a verification code to the phone number on file, please enter it below. It will be shared with the issuer so they can authorize the transaction.(Demo only hint: the code is 123)"}},"kind":"data"}],"role":"agent","taskId":"799fbe91-a538-497f-904c-d81eda1dedbf"},{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"f33f4dc30d3a41878c8d1d7006b2cf0e","parts":[{"kind":"text","text":"initiate_payment"},{"data":{"ap2.mandates.PaymentMandate":{"payment_mandate_contents":{"payment_mandate_id":"848f97b287584cd1aa3085bed1985c22","payment_details_id":"order_3","payment_details_total":{"label":"Total","amount":{"currency":"USD","value":603.49},"pending":null,"refund_period":30},"payment_response":{"request_id":"order_3","method_name":"CARD","details":{"token":{"value":"fake_payment_credential_token_0","url":"http://CredentialsProvider/a2a/credentials_provider"}},"shipping_address":{"city":"Sample City","country":"US","dependent_locality":null,"organization":"Sample Organization","phone_number":"+1-000-000-0000","postal_code":"00000","recipient":"Bugs Bunny","region":"ST","sorting_code":null,"address_line":["123 Main St"]},"shipping_option":null,"payer_name":null,"payer_email":"[email protected]","payer_phone":null},"merchant_agent":"Generic Merchant","timestamp":"2025-11-11T03:50:04.532972+00:00"},"user_authorization":"fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22"}},"kind":"data"},{"data":{"risk_data":"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data"},"kind":"data"},{"data":{"challenge_response":"123"},"kind":"data"}],"role":"agent","taskId":"799fbe91-a538-497f-904c-d81eda1dedbf"}],"id":"799fbe91-a538-497f-904c-d81eda1dedbf","kind":"task","status":{"message":{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"627a4fc6-e3b0-488d-ae9a-3332b612f778","parts":[{"kind":"text","text":"{'status': 'success'}"}],"role":"agent","taskId":"799fbe91-a538-497f-904c-d81eda1dedbf"},"state":"completed","timestamp":"2025-11-11T03:51:48.595556+00:00"}}} [Response Body] {"id":"716d25d2-2541-41b7-bd8a-2f94465a91d1","jsonrpc":"2.0","result":{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","history":[{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"14ed8b94ec5a4cc0a516a7b8d62cc6f8","parts":[{"kind":"text","text":"Initiate a payment"},{"data":{"ap2.mandates.PaymentMandate":{"payment_mandate_contents":{"payment_mandate_id":"848f97b287584cd1aa3085bed1985c22","payment_details_id":"order_3","payment_details_total":{"label":"Total","amount":{"currency":"USD","value":603.49},"refund_period":30},"payment_response":{"request_id":"order_3","method_name":"CARD","details":{"token":{"value":"fake_payment_credential_token_0","url":"http://CredentialsProvider/a2a/credentials_provider"}},"shipping_address":{"city":"Sample City","country":"US","organization":"Sample Organization","phone_number":"+1-000-000-0000","postal_code":"00000","recipient":"Bugs Bunny","region":"ST","address_line":["123 Main St"]},"payer_email":"[email protected]"},"merchant_agent":"Generic Merchant","timestamp":"2025-11-11T03:50:04.532972+00:00"},"user_authorization":"fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22"}},"kind":"data"},{"data":{"risk_data":"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data"},"kind":"data"},{"data":{"shopping_agent_id":"trusted_shopping_agent"},"kind":"data"}],"role":"agent","taskId":"57a672e8-478b-4c7a-8885-00388224e886"},{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"965f5ae8-0cb6-4687-9a3c-64267fe165da","parts":[{"kind":"text","text":"Please provide the challenge response to complete the payment."},{"data":{"challenge":{"type":"otp","display_text":"The payment method issuer sent a verification code to the phone number on file, please enter it below. It will be shared with the issuer so they can authorize the transaction.(Demo only hint: the code is 123)"}},"kind":"data"}],"role":"agent","taskId":"799fbe91-a538-497f-904c-d81eda1dedbf"},{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"4f466784348444b58d547a64f42d31ca","parts":[{"kind":"text","text":"Initiate a payment. Include the challenge response."},{"data":{"ap2.mandates.PaymentMandate":{"payment_mandate_contents":{"payment_mandate_id":"848f97b287584cd1aa3085bed1985c22","payment_details_id":"order_3","payment_details_total":{"label":"Total","amount":{"currency":"USD","value":603.49},"refund_period":30},"payment_response":{"request_id":"order_3","method_name":"CARD","details":{"token":{"value":"fake_payment_credential_token_0","url":"http://CredentialsProvider/a2a/credentials_provider"}},"shipping_address":{"city":"Sample City","country":"US","organization":"Sample Organization","phone_number":"+1-000-000-0000","postal_code":"00000","recipient":"Bugs Bunny","region":"ST","address_line":["123 Main St"]},"payer_email":"[email protected]"},"merchant_agent":"Generic Merchant","timestamp":"2025-11-11T03:50:04.532972+00:00"},"user_authorization":"fake_cart_mandate_hash_cart_3_fake_payment_mandate_hash_848f97b287584cd1aa3085bed1985c22"}},"kind":"data"},{"data":{"shopping_agent_id":"trusted_shopping_agent"},"kind":"data"},{"data":{"challenge_response":"123"},"kind":"data"},{"data":{"risk_data":"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...fake_risk_data"},"kind":"data"}],"role":"agent","taskId":"57a672e8-478b-4c7a-8885-00388224e886"}],"id":"57a672e8-478b-4c7a-8885-00388224e886","kind":"task","status":{"message":{"contextId":"6030ebc7-fde8-4489-b655-045443c47af0","kind":"message","messageId":"627a4fc6-e3b0-488d-ae9a-3332b612f778","parts":[{"kind":"text","text":"{'status': 'success'}"}],"role":"agent","taskId":"799fbe91-a538-497f-904c-d81eda1dedbf"},"state":"completed","timestamp":"2025-11-11T03:51:48.599045+00:00"}}} 3.6 Summary: Interactions Between Agents

References

https://a2aprotocol.ai/ap2-protocol
https://ap2-protocol.net/en/
https://github.com/google-agentic-commerce/AP2/blob/main/samples/python/scenarios/a2a/human-present/cards/README.md

An Illustrated Guide to AP2 (Agent Payment Protocol) (2025)

ARTHURCHIAO'S BLOG

1 month 2 weeks ago

To bring this vision to life, one essential piece is still missing: a payment protocol designed for agent-to-agent transactions. That’s exactly why AP2 was created.

This post offers an illustrative guide to this emerging topic.

Fig. Shopping agent view of the "Buy a coffee maker" AP2 demo.

1 Why AP2?
- 1.1 An Era of Agentic Commerce
- 1.2 AP2: Payment Protocol for Agents
2 How AP2 Works
- 2.1 Core Concepts
  - 2.1.1 Mandate
  - 2.1.2 VC (Verifiable Credential)
- 2.2 Working Fashions (Scenarios)
  - 2.2.1 Real-time purchases (human present)
  - 2.2.2 Delegated tasks (human not present)
3 Demo: Buy A Coffee Maker Through Chat
References

1 Why AP2? 1.1 An Era of Agentic Commerce

The digital interaction fashion is likely to enter a new phase:

Now and the past: people interact directly with websites and applications. Such as, people browse websites or apps, select the products they like and add to cart, and finally click the “Buy” or “Pay” button;
The future: may shift toward an era of conversational and delegated task execution via agents; no manually browsing, just chat with your AI assistant.

This means agents will manage various daily tasks for users (humans), such as

routine purchases
complex product research
price negotiations, and more.

This new era of agentic commerce will bring new opportunities for both users and businesses:

For users: get a highly personalized, seamless shopping experience
For businesses: open up a new, intelligent channel for reaching customers

1.2 AP2: Payment Protocol for Agents

2 How AP2 Works

In a nutshell: establishing trust via Mandates and Verifiable Credentials (VCs).

2.1 Core Concepts 2.1.1 Mandate

Mandates are tamper-proof, cryptographically-signed digital contracts;
Mandates serve as verifiable proof of a user's instructions;
Mandates are signed by VC.

2.1.2 VC (Verifiable Credential)

VC is a special kind of data payload between agents.

2.2 Working Fashions (Scenarios) 2.2.1 Real-time purchases (human present)

Image source: [1]

User -> Agent: “Find me new white running shoes”
Agent: capture the request in an initial IntentMandate. This provides the auditable context for the entire interaction in a transaction process.
Agent -> Merchant Agents: find shoes with IntentMandate; get some candidates;
Agent -> User: present a cart with the shoes users would like;
User: select the item he/she likes;
Agent: sign a CartMandate. This is a critical step that creates a secure, unchangeable record of the exact items and price, ensuring what user see is what them pay for.
Agent -> Merchant Agent & Credential Provider Agent: complete payment with a PaymentMandate.

2.2.2 Delegated tasks (human not present)

Image source: [1]

User -> Agent: “Buy concert tickets the moment they go on sale”.
Agent: the user signed a detailed Intent Mandate upfront. This mandate specifies the rules of engagement—price limits, timing, and other conditions.
Agent -> Merchant Agent & Credential Provider Agent: automatically generate a Cart Mandate on behalf of user once the precise conditions are met.

3 Demo: Buy A Coffee Maker Through Chat

This is a demo from AP2 community, see github for the code and more details.

3.1 Components

The demo is a simple multi-agent system based on google ADK, this is what looks like when the demo finished:

It consists of the following components (agents):

Root Agent: for orchestrating all the entire demo
Shopping agent: chat-based agent that providing shopping services to User;
Shipping address collecting agent: utility agent for Root Agent;
Payment method collecting agent: utility agent for Root Agent;
Merchant agent: commerce agent that selling products;
Merchant payment processor agent: utility agent for Merchant agent that that handles payment stuffs for the latter;
Payment credential provider agent: providing AP2 auth between shopping agent and merchant agents;

3.2 Agent Card & System Prompt 3.2.1 Shopping Agent

System prompt to see how it works:

A2A agent card:

Account Manager (User Database):

Just follow the README to deploy it.

For Chinese users, Gemini may block you by location (return 40x responses), so you need to setup a proxy:

$ export no_proxy=localhost; export http_proxy=YOUR_PROXY; export https_proxy=YOUR_PROXY; export GOOGLE_API_KEY=YOUR_KEY; bash samples/python/scenarios/a2a/human-present/cards/run.sh

Let’s see what’s happened in the behind.

3.4 Detailed Traces

We have two ways to inspect what’s happened in the behind. The first one is via the UI’s built-in tracing capability:

Fig.

3.5 Detailed A2A/AP2 Messages

The second way is diving into agent logs, which can give us more details. Just pick some of them, from the .logs/watch.log, which combines all the A2A messages between agents in this demo.

References

https://a2aprotocol.ai/ap2-protocol
https://ap2-protocol.net/en/
https://github.com/google-agentic-commerce/AP2/blob/main/samples/python/scenarios/a2a/human-present/cards/README.md

[笔记]《人工智能简史（第二版）》（2025）

ARTHURCHIAO'S BLOG

2 months 4 weeks ago

尼克的《人工智能简史（第二版）》从人和流派传承的角度介绍了人工智能作为计算科学一个分支的发展过程，内容和风格有点偏学术史，用作者的话说，“写法比较偏重基础和方法论，而不太注重应用”。作为一本不太“常规”的人工智能入门读物，适合领域内的部分专业读者，或者想从科学、哲学、伦理学等更高角度理解和看待人工智能的读者。

本文整理一些个人阅读笔记和思考。

水平及维护精力所限，文中不免存在错误或过时之处，请酌情参考。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

0 前言
- 0.1 哈代：一等智力 vs. 二等智力
- 0.2 任正非
1 达特茅斯会议：人工智能的起源， 1956
2 自动定理证明兴衰记
3 从专家系统到知识图谱
4 第五代计算机的教训
5 神经网络简史
6 计算机下棋简史
7 自然语言处理
8 向自然学习：从遗传算法到强化学习
9 哲学家和人工智能
10 人是机器吗？——人工智能的计算理论基础
11 智能的进化
12 当我们谈论生死时，我们在谈论什么？
- 12.1 苏格拉底之死和《斐多篇》
- 12.2 作者和苏格拉底之间的假想对话
13 总结
附录
后记

0 前言 0.1 哈代：一等智力 vs. 二等智力

哈代曾说科学和艺术的原创需要一等的智力，解释和欣赏（例如乐评家和书评家）是二等智力的活儿。

搜了一下哈代的原话：

It is a melancholy experience for a professional mathematician to find himself writing about mathematics. The function of a mathematician is to do something, to prove new theorems, to add to mathematics, and not to talk about what he or other mathematicians have done. Statesmen despise publicists, painters despise art-critics, and physiologists, physicists, or mathematicians have usually similar feelings; there is no scorn more profound, or on the whole more justifiable, than that of the men who make for the men who explain. Exposition, criticism, appreciation, is work for second-rate minds.

A Mathematician’s Apology，G. H. Hardy

大致意思：

让一个职业数学家花时间去阐释数学相关的东西是悲哀的。数学家的本职工作是创新，例如证明新定理，发现新东西，而不是去宣讲自己或其它数学家做了些什么。 政客鄙视政治评论家，画家鄙视艺术评论者，生理学家、物理学家或数学家通常都有类似的感受。 没有任何嘲笑，能比创造者对解释者的嘲笑来得更深远，或在整体上更为合理。阐释、批评、欣赏，都是二等智力者的工作。

《一个数学家的自白》，哈代

0.2 任正非

任正非是二十一世纪的哈代。

我自己日暮黄昏，但任正非只七十四岁，来日方长。我希望任先生不要管他人怎样说，因为哈代说得清楚，“没有任何嘲笑能比创作者对解释者的嘲笑来得更深奥，或在整体上更为合理。阐释、批评、欣赏，都是只有二等脑子的人的工作。”

张五常：任正非是今天的哈代吗, 2019

1 达特茅斯会议：人工智能的起源， 1956

What is past is prologue. - William Shakespeare

凡过往皆为序章。

1.1 经典读物

“Man viewed as a Machine” 介绍了图灵机和冯诺依曼的细胞自动机。
- muscle machine
- brain machine - 人工智能的另一种说法
Alchemy and Artificial Intelligence (PDF),《炼金术与人工智能》，1965
《计算机不能干什么》
《Human Memory and the Storage of Information》1956

是《The Magic Number Seven》的另一个版本。

一门年轻的学科，一开始都需要一点“过度销售”（excessive salesmanship） - Minsky

1.2 Chomsky：机器可以思考吗？-> 潜艇会游泳吗？

2015 年他被问及“机器可以思考吗？”，他套用计算机科学家 Dijkstra 的说法反问：“潜艇会游泳吗？”

Youtube: Noam Chomsky - Can Machines Think?

1.3 AI 的两面：工程和科学

Chomsky 把 AI 分成工程的和科学的：

工程的一面，如自动驾驶车等，能做出对人类有用的东西；
科学的一面，Chomsky 明显不认可。

他引用图灵的话：这问题 too meaningless to deserve discussion（没有讨论的意义）。

当一帮奇点理论的粉丝带着正面的期望采访 Chomsky 时，他却对人工智能这个被他深刻影响过的学科没太当回事，他认为气候和毁灭性武器是比奇点更紧迫的问题。

2 自动定理证明兴衰记

As a material machine economises the exertion of force, so a symbolic calculus economises the exertion of intelligence … the more perfect the calculus, the smaller the intelligence compared to the results. —— W. E. Johnson

就像机器能省体力一样，符号演算能省脑力。演算越完美，付出的脑力就越少。

Proof is cultivated reasoning. —— Bruno Buchberger

2.1 自动定理证明的起源数学哲学三大派

逻辑主义
- 代表人物：罗素，
- 把数学归约到逻辑，因此只要把逻辑问题解决了，之上的数学问题自然就解决了。
- 换句话说，把逻辑玩转了，数学就不算事儿。
形式主义
- 代表人物：希尔伯特
- 把数学形式化，数学过程就是把一串符号变成另一串符号。
- 希尔伯特设想，如果能设计一个大一统的算法，那么所有的数学问题都可以由这个算法来解答。这和逻辑主义精神有一定相通之处。哥德尔后来证明这一切是不可能的。
直觉主义

机器定理证明的研究从某种意义上继承了罗素和希尔伯特的思想：用机器来证明和判定那些可以证明和判定的问题。纽厄尔和司马贺的“逻辑理论家”就是早期的机器定理证明程序，他们曾经给罗素写信，期盼能得到伟人的首肯，罗素在回信时说：“我相信演绎逻辑里的所有事，机器都能干。”

逻辑学的源头：亚里士多德三段论

自动定理证明起源于逻辑，初衷就是把逻辑演算自动化。

逻辑学的源头是亚里士多德的三段论：人必有一死，苏格拉底是人，所以苏格拉底必死。

2.2 思想实验：Brain in a vat

把一个人脑放在可以让它继续存活的营养液里，然后插上各自传感器，再连接到电脑，可以通过电脑准确地向这个大脑发送各自传感器刺激（例如让它觉得是在跑步的信号）。问题：如果有这样一个人脑，那它能否判断出自己是一个正常人体内的大脑，还是一个缸中插满传感器的孤零零的大脑？

In philosophy, the brain in a vat (BIV) is a scenario used in a variety of thought experiments intended to draw out certain features of human conceptions of knowledge, reality, truth, mind, consciousness, and meaning.

Wikepedia Brain in a vat:

2.3 王浩（1921—1995）

可以公正地说，王浩的定理证明研究孕育了整个理论计算机科学。

王浩以哥德尔的权威诠释者和知音名世，但他对哲学、逻辑学、计算机科学的原创性却被低估了。

王浩在致获奖词时半开玩笑地说，因为自己的个性，荣誉经常绕道而行。

王浩的定理证明程序后来成为高级语言的基准程序，麦卡锡的 LISP 早期就一直以王浩算法的程序作为例子。

2.4 吴文俊（1919—2017）

1979 年，吴文俊的工作得到杨振宁的关注，当时的科学院大力支持吴文俊，并为他申请到两万五千美元的外汇到美国购买一台家用电脑，以实现他的吴方法。

高龄开始学习编程

吴文俊的长寿也体现在他的学术生命上。1979 年吴文俊 60 岁高龄开始学习计算机编程语言，先是 BASIC，后是 Algol，再后是 Fortran。他在那台两万五千美元的家用电脑上不断取得新的成果。后来系统所的硬件设施改进，吴文俊相当一段时间都是上机时间最长的。

为人类文明做出贡献

杨振宁曾说他最重要的成就是提高了中国人的自信。陈省身、华罗庚、杨振宁、李政道那一批人是最早为人类文明做出贡献的中国人。那个不长的名单里还应该有王浩和吴文俊。

吴文俊生平：《走自己的路》

2.5 哲学问题有黑盒的理解不能算理解，有黑盒的证明也不能算证明

Chomsky 对统计派机器翻译的批评：有黑盒的理解不能算理解，有黑盒的证明也不能算证明。

人已经无法核实部分计算机证明的结果

传统的数学实践遵循共同体过程：一个数学家提出证明，然后一堆同一共同体的专家来验证，如果验证通过，定理成立。费马大定理的证明、庞加莱猜想的证明和张益唐的证明，都是这个套路。
有些机器证明太长，人根本看不过来，那怎么才算是证明了定理呢？如果用一个可被信任的计算机程序验证一遍，是不是就算是证明了呢？罗宾斯猜想的证明就曾用 Mathematica 验证过，而 AUTOMATH 本身就是一个验证系统。对全自动的定理证明，验证过程更容易机械化，而计算机辅助证明可能五花八门，很难有一个统一的过程。

数学家的归宿

无论如何，数学共同体的实践标准在变：从数学家之间互相核实到数学家信任的程序之间互相核实。也难怪传统的数学家在抱怨：数学变成了有成本的实验科学。

其实那些典型的物理科学，例如物理、化学和生物学，是以实验为本的，可重复性（reproducibility）是检验真理的标准之一。只不过在当下，可重复性的成本太高。当下的数学变得越来越实验，而生物学可能变得越来越后现代了。 无论是唯心或唯理的数学，还是唯物或经验的实验科学，最终都成了共同体式的实用主义。

吴文俊和芒福德联合得了 2006 年的邵逸夫数学奖。得奖评语最后一句，大意是他俩都是从纯数学的分支拓扑最后转到和计算机科学相关的研究，这为数学家的未来行为模式提供了典范。

吴文俊曾留学法国，法国的数学家素有关心数学史的传统。
吴文俊认为中国数学是巴比伦式的而不是希腊式的，巴比伦数学讲究计算，而希腊数学讲究公理。

计算模糊了理性和经验的边界

自动定理证明依靠的工具是计算机，正是计算模糊了理性和经验的边界。可以登高一步说：计算是知识演化的基础，计算也是知识民主化的工具。

2.6 现状时代交替 (2006)：定理证明小组被裁，深度学习论文横空出世

阿贡实验室的定理证明小组 2006 年被裁掉了，这大概算是符号派低潮的标志性事件，一个时代结束了。这一年 Hinton 的深度学习论文发表在《科学》杂志上。

有些领域，一开始就把 80% 的容易问题都解决了，而后就一直很难，进展很慢，少有突破。人工智能就是这样，定理证明尤其如此。深度学习领域近来的进步更多得益于硬件。

定理证明领域的名字演化

定理证明领域的名字也经历了有趣的演化。

最早都叫机器定理证明（Mechanical Theorem Proving），
后来改叫自动定理证明（Automatic Theorem Proving），
再后来叫自动演绎（Automated Deduction），目前都叫自动推理（Automated Reasoning）。

原因很简单，演绎（deduction）只是推理的一种，现在归纳（induction）、溯因（abduction）也都算成推理了。

贝叶斯推理，可以叫 Bayesian Logic，或 Bayesian Inference，也可以叫 Bayesian Reasoning。

2.7 结束语数学家不把逻辑学家当回事

王浩曾经抱怨数学家不把逻辑学家当回事。图灵也有类似的说法：逻辑学家给数学家提供了有营养的饭菜，但做的不够美味，数学家不爱吃。

逻辑似乎处于一切科学的底部，因为逻辑探索一切事物的本质

维特根斯坦曾有言：“逻辑似乎处于一切科学的底部 —— 因为逻辑的研究探索一切事物的本质。” 但数学家不觉得他们非得趴在逻辑学家的背上。自动定理证明的状况与此相关，数学家没觉得这玩意儿有用，人工智能的两派人马都不待见。

哈尔莫斯（Paul Halmos）是数学家，但也曾涉猎逻辑，在自传里拿逻辑开玩笑，说即使有人证明了黎曼猜想是不可判定的（哥德尔就是这么猜测的），数学家睡一觉，第二天起来还是该干嘛干嘛。

两个 Alpha-zero 下棋，我们人类已经看不懂了

法国数学家 David Ruelle，《Post-Human Mathematics》： 也许某一天，我们人类看机器做数学，就像黑猩猩看我们阅读伽罗瓦理论。其实这种情况已经发生了：两个 Alpha-zero 下棋，我们人类已经看不懂了。

3 从专家系统到知识图谱

The test of all knowledge is experiment. —— Feynman Lectures on Physics（《费曼物理学讲义》）

3.1 机器归纳法：用现在的话说就是机器学习 3.2 知识表示

知识表示一直是人工智能不温不火的一个领域，催生者是专家系统和自然语言理解。

逻辑是最方便的知识表示语言

逻辑是最方便的知识表示语言，从亚里士多德开始人们就熟悉，逻辑同时具有各种数学性质。任何一本逻辑入门书都会有那个著名的苏格拉底的例子：人必有一死，苏格拉底是人，所以苏格拉底必死。

心理学与语言学

知识表示的另一个来源是心理学和语言学，例如概念的上下位继承关系最方便的表示方式是树而不是一阶逻辑。

心理学实验表明，人在回答“金丝雀会飞吗？”要比回答“鸟会飞吗？”花的时间长，要回答第一个问题，人要再做一次“金丝雀是鸟”的推理。因为人在存储知识时只存储抽象的，这是空间经济的考虑。

心理学家米勒和 Chomsky 等一起开拓了认知科学，他最出名的论文大概就是那篇“魔力数字七”（The Magic Number Seven）。

Minsky 的框架：面向对象

框架（Frame）就是类型。

金丝雀是鸟，所有鸟的性质自动流传给金丝雀，鸟能飞，金丝雀也能飞。
苹果手机是手机，手机能打电话，苹果手机也能打电话。

框架导致了面向对象（OO，Object-Oriented）的设计哲学，相关的程序设计语言都受此影响。

当一个概念有了成熟的实现时，就自动脱离了人工智能

从这个意义上还真验证了：当一个概念有了成熟的实现时，就自动脱离了人工智能。

3.3 知识库把人类的常识编码，建成知识库

想法：把人类的常识编码，建成知识库。这个新项目叫 Cyc，这其实就是最早的知识图谱。

雷纳特坚定地支持他老师费根鲍姆的知识原则（Knowledge Principle）：一个系统之所以能展示高级的智能理解和行为，主要是因为在所从事的领域所表现出来的特定知识：概念、事实、表示、方法、比喻以及启发。
雷纳特甚至说：“智能就是一千万条规则。”

“知识汤”（knowledge soup）的说法：我们脑子里的知识不是一坨知识，而是好几坨知识，每一坨内部是一致的，但坨和坨之间可能不一致，坨和坨之间是松散耦合的。

Cyc 的原始目标更像是当今的维基百科，不过维基百科的受众是人，Cyc 的用户是机器。

学习只在已知事物的边缘发生

雷纳特曾说：“学习只在已知事物的边缘发生，所以人们只可能学到与自己已知相似的新东西。如果你试图学习的东西与你已知的东西距离不远，那么你就能学会。这个边缘的范围越大（你已知的东西越多），就越有可能发现新的东西。”

3.4 语义网（HTTP/HTML）

由专家系统一脉相传的这一派自身的逻辑功力不够，另一方面，他们的工程实践又略显欠缺。直到歪打正着的万维网支持者之一 Tim Berners-Lee 提出“语义网”（Semantic Web），他们认为机会来了。

伯纳斯-李因为草根且便捷的 HTTP 协议和 HTML 出了名，被各种媒体称为万维网的发明人。 20 年后，伯纳斯-李不负所望得了 2016 年图灵奖，这大概是图灵奖有史以来含金量最低的一个。

3.5 计算机科学的划分

计算机科学的划分

3.6 对知识做梳理是人类最早的智力活动之一

对人类的知识做梳理是人类最早的智力活动之一，也是人类的集体自我意识。

当欧洲还在黑暗时期时，伊斯兰科学迎来了黄金期。法拉比（Al-Farabi）是伊斯兰世界第一个自成系统的哲学家，他对亚里士多德的注释和对柏拉图与亚里士多德哲学的调和对后代阿拉伯哲学和西方哲学影响很大，被称为“亚圣”（Second Master 或者 Second Teacher），首圣当然是亚里士多德了。

4 第五代计算机的教训

People learn from history that people never learn from history. – Georg Wilhelm Friedrich Hegel（黑格尔）

Those that fail to learn from history, are doomed to repeat it. Winston Churchill（丘吉尔）

日本早年神经网络研究的先驱福岛邦彦和甘利均一。

当下流程的卷积神经网络 CNN 的源头就是福岛邦彦的工作。

在福岛邦彦和甘利均一的壮年，日本都把资金投入到了五代机，他们没赶上好时候。

5 神经网络简史

I bet the human brain is a kludge. Marvin Minsky

自图灵提出“计算机与智能”起，就一直有两派观点：

一派认为实现人工智能必须用逻辑和符号系统，这一派看问题是自顶向下的；
还有一派认为通过仿造大脑可以达到人工智能，这一派是自底向上的，他们认为如果能造一台机器，模拟大脑中的神经网络，这台机器就有智能了。

5.1 神经网络的初创文章，1943

神经网络的原创文章发表于 1943 年，两位作者都是传奇人物：麦卡洛克（Warren McCulloch）和皮茨（Walter Pitts）。Pitts 打小就喜欢数学和哲学，初中时就读过罗素的《数学原理》，还和罗素通过信。

A Logical Calculus of the Ideas Immanent in Nervous Activity, 1943

神经网络的开山之作：A Logical Calculus of the Ideas Immanent in Nervous Activity，发表在 Bulletin of Mathematical Biology 上。

这篇文章成了控制论的思想源泉之一。
这篇文章只列了三篇貌似不相关的参考文献，卡尔纳普的《语言的逻辑句法》，希尔伯特和他学生阿克曼合著的《数理逻辑基础》，怀特海和罗素的《数学原理》。

5.2 维纳

控制论的创始人维纳（Norbert Wiener）早年自称神童，他爸是哈佛大学教授，曾经带着他到英国见过罗素，但罗素特不喜欢这孩子和他爹。自打进入 20 世纪后，甭管哪门哪派的学问，最后都能扯到罗素那儿。

维纳后来也在哈佛大学任教，但不被主流数学家喜欢，没拿到终身教职。最后到了隔壁的麻省理工学院落脚，在“二战”时搞了点武器研究。那时最好的数学家和物理学家都参与了造原子弹的“曼哈顿”计划，维纳却没沾边。这也许同他的个性有关系，他的同事和家人都觉得他对数学之外的事情反应迟钝。维纳提出“控制论”后出了大名。

维纳曾写过两卷本的自传：《昔日神童》（Ex-prodigy）和《我是数学家》。不喜欢维纳的人开玩笑说，应该是《昔日数学家》和《我是神童》，嘲讽维纳的数学不入主流，同时暗示维纳对自己神童身份的过高自视。

维纳无论如何首先是一位严谨的数学家，而 McCulloch 则被人称为是浪漫的科学家。所谓“浪漫”不是指生活，而是说他对科学思想的表述方式。

维纳曾经把为大脑建模作为他学术生涯的最后野心。

强化学习之路：维纳 -> 阿比卜 -> Andy Barto -> Richard Sutton

阿比卜的“杂学”体现在他那本科普书《大脑、机器和数学》里，其实他本科毕业论文已初露端倪，题为“Turing Machines, Finite Automata, and Neural Nets”。

阿比卜后来创办了麻省大学的计算机系，并延揽一帮人工智能人马，其中有后来以强化学习出名的巴托（Andy Barto），使麻省大学的人工智能曾在很长一段时间都处于领先地位。

5.3 罗森布拉特和感知机

神经网络研究的后一个大突破是在 1957 年。康奈尔大学的实验心理学家 Frank Rosenblatt 在一台 IBM-704 计算机上模拟实现了一种他发明的叫作“感知机”（Perceptron）的神经网络模型。这个模型可以完成一些简单的视觉处理任务。这在当时引起了轰动。

Perceptrons: An Introduction to Computational Geometry

影响巨大、“是也非也”的书：《感知机：计算几何学》（Perceptrons: An Introduction to Computational Geometry）。

在书中，Minsky 和佩珀特证明单层神经网络不能解决 XOR（异或）问题。
异或是一个基本逻辑问题，如果连这个问题都解决不了，那神经网络的计算能力实在有限。

感知机的失败导致了神经网络研究的式微，用加州理工学院的集成电路大佬米德（Carver Mead）的话说是“二十年大饥荒”。 Minsky 1988 年在《感知机：计算几何学》一书再版时，删除了第一版中对罗森布拉特个人攻击的句子，并手写了 In memory of Frank Rosenblatt。

5.4 神经网络的复兴解决 XOR 问题：神经网络多加一层+后向传播

1974 年，哈佛大学的一篇博士论文证明了在神经网络多加一层，并且利用“后向传播”（back-propagation）学习方法，可以解决 XOR 问题。

Paul Werbos 这篇文章刚发表时并没引起多少重视，那时正是神经网络研究的低谷，文章不合时宜。
Paul Werbos 也是递归神经网络 RNN 的原创者。但在深度学习大火后，他的兴趣转向了量子力学。

Hopfield 神经网络：来自物理学而非生物学的突破

神经网络在 20 世纪 80 年代的复兴归功于物理学家 John Hopfield。

1982 年，Hopfield 提出了一种新的神经网络，可以解决一大类模式识别问题，还可以给出一类组合优化问题的近似解。这种神经网络模型后来被称为 Hopfield 网络。
1984 年，Hopfield用模拟集成电路实现了自己提出的模型。

Hopfield 模型的提出振奋了神经网络领域。

神经网络的这次复兴和生物学没啥关系，它既不是来自生物学的刺激，也没有给生物学送去任何慰藉。
倒是它来源于物理学家，并引起了物理学家的关注，曾经一批对复杂系统感兴趣的物理学家在交叉学科杂志上接二连三地发表文章。

连接主义运动（Hinton）

一帮早期神经网络研究的“幸存者”，在生物学家克里克（Francis Crick）和认知科学大佬诺曼（Don Norman）的鼓励下，开始了连接主义（Connectionism）运动。领导者：

两位心理学家鲁梅尔哈特（David Rumelhart）和麦克利兰德（James McLelland），
一位计算机科学家辛顿（Geoffrey Hinton）。

连接主义运动的成果之一就是那本被称为 PDP（Parallel Distributed Processing）的著名文集（分两卷）。此书的出版给认知科学和计算机科学吹了股春风，被神经网络新秀称为“圣经”。

Rumelhart -> Michael Jordan -> Andrew Ng

连接主义运动也培养了一堆新人，并使得加州大学圣地亚哥分校的认知科学系成为同类系科的佼佼者。

Rumelhart 后转往斯坦福大学任教，乔丹（Michael Jordan）就是他的学生，而吴恩达（Andrew Ng）又是乔丹的学生。
Rumelhart 的另一名学生格 Robert Glushko 后来远离本行，跟随硅谷互联网早期英雄 Marty Tennenbaum 创立了一家公司，赚了一票钱。格鲁什科捐钱设立了“Rumelhart 奖”来奖励神经网络的研究者，辛顿成了第一位获奖者。

Chomsky：统计的方法不优雅，只是模仿而不是理解

Chomsky 认为统计的方法不“优雅”（elegant），只是模仿而不是理解。 会骑自行车不算理解，对自行车为什么不倒，能说清道理，才算理解。

Peter Norvig：在理解之前不妨碍模仿先上

谷歌的研发总监 Peter Norvig 为统计方法辩护时说：简单的模型（如 Chomsky 理论，以及后来的各种改进版本）不能解决复杂的问题，人工智能的进一步发展必须两条腿走路。

诺维格在加入谷歌之前曾是加州大学伯克利分校的计算机教授，他对两派都了如指掌，在学术界和工业界都被尊重，他写的《人工智能》是最流行的教科书。

5.5 深度学习

神经网络在 20 世纪 80 年代的光芒被后来的互联网掩盖了。

但这几年，恰恰是互联网产生的海量数据给了神经网络更大的机会。
人工智能学者在计算机系曾经是最抬不起头的，这几年却人人都变成了大知识分子。

网络对应的概念：一层网络就是一个函数

神经网络由一层一层的神经元构成。层数越多，就越深，所谓深度学习就是用很多层神经元构成的神经网络实现机器学习的功能。理论上说，

如果一层网络是一个函数的话，多层网络就是多个函数的嵌套。
网络越深，表达能力越强，但伴随而来的训练复杂性也急剧加大。

Hinton 2006：降维和逐层训练，使深度学习的实用化成为可能

辛顿是深度学习的先驱，他和学生在 2006 年发表的两篇文章开辟了这个新领域，

登在《科学》上的那篇提出了降维和逐层预训练的方法，使得深度学习的实用化成为可能。
深度神经网络最后几层的每个节点都可对应于某些概念。这是神经网络的一大进步，调和了与符号派的矛盾。至于符号派买不买账，就是另一回事了。

6 计算机下棋简史

Play is the beginning of knowledge.—— George Dorsey

6.1 图灵， ~1944

二战没结束时，图灵就研究计算机下棋，他 1947 年编了第一个下棋程序。
Donald Michie 是图灵的追随者，1950 年试着在纸上模拟程序，和图灵对弈。
Dietrich Prinz 接着图灵的思路，在 1951 年写了一个残局程序，能在离将死还有两步的情况下，找到最优解。这个问题也被称为“两步将死”（mate-in-two）问题。

6.2 冯诺依曼，《博弈论》提出 MiniMax 算法， 1944 《博弈论》, 1944

几乎和图灵同时，冯诺伊曼也在研究计算机下棋，他和经济学家摩根斯顿合作的《博弈论》1944 年出版，其中首先提出两人对弈的 Minimax 算法。

Minimax 算法中，二人对弈的一方为 max，另一方为 min，max 一方的评估函数要越高越好，min 一方的则越低越好。

max 和 min 的对弈就形成了博弈树。
树的增长是指数式的，当树很深时，树的规模会变得不可控。
麦卡锡首先提出α-β剪枝术以控制树的增长。

6.3 香农：开创计算机下棋的理论研究，1950 Programming a Computer for Playing Chess, 1950

香农（Claude Shannon）1950 年在《哲学杂志》发表“计算机下棋程序”（Programming a Computer for Playing Chess）一文，开启了计算机下棋的理论研究，其中主要思路在“深蓝”和 AlphaGo 中还能看到。

香农把棋盘定义为二维数组，
每个棋子都有一个对应的子程序计算棋子所有可能的走法，
最后有个评估函数（evaluation function）。

传统的棋局都把下棋过程分为三个阶段：开局、中局和残局，不同阶段需要不同的技术手段。

香农的论文引用了冯诺伊曼的《博弈论》和维纳的《控制论》。

6.4 IBM 深蓝战胜卡斯帕罗夫， 1997

1997 年 5 月 11 日，老卡认输，“深蓝”成了第一位战胜当时世界冠军的机器。事后，卡斯帕罗夫回忆：第二局是关键，机器表现超出他的想象，它经常放弃短期利益，“showing a very human sense of danger”。

在“深蓝”赢了卡斯帕罗夫之后，职业棋手并没有因此而改行，他们反而更多地依赖计算机来训练。 机器作为教练，反而更快地帮助人类棋手进步，因为过去的孩子从来就没有机会能和特级高手比赛。

6.5 AlphaGo：首次引入了强化学习

谷歌的 AlphaGo 首次引用了强化学习（Reinforcement Learning），让机器和自己对弈学习。强化学习的发明者是巴托（Andy Barto）和他的学生萨顿（Richard Sutton）。

强化学习 80 年代就发明了，但一直不被重视，是 AlphaGo 使得它焕发新生。

7 自然语言处理

the noblest pleasure is the joy of understanding - Leonardo da Vinci

It is not our aim to refine or complete the system of rules for the use of our words in unheard-of ways. - Wittgenstein

7.1 Chomsky 《句法结构》

Chomsky 之于语言学和认知科学，就像图灵之于计算机科学。他认为，

所有的语言（人工或自然）都有类似的句法结构，
语言的结构是内在的，而不是通过经验习得的，
代表作《句法结构》。一本小册子，不需要什么背景就能读。

Brown (1988，1990)是统计派的奠基作品，正文只有 6 页，虽是学术论文，却非常可读。

经验主义靠近科学，理性主义靠近数学

从某种意义上说，行为主义是极端的经验主义。

所有黑盒理论，无论是神经网络还是统计派，在 Chomsky 眼里都属行为主义。
Chomsky 认为理论应该先于事实。他常以遗传学祖师爷孟德尔为例，但孟德尔常常删改不支持理论的数据。

Chomsky 认为心身（mind-body）问题是个伪问题，难度倒不在于如何定义 mind，而在于连什么是 body 这样貌似简单的问题都无法明确地说清。

他认为 mind 的研究终究会变成像物理学、化学那样的学问，只不过现在还要用心理学的术语逐步获得进展。
语言学是突破口之一，由此可以找到 “mind” 的物理机制。
从这个意义上说，Chomsky 也不完全反对经验主义。

语言学的牛顿？

Chomsky 比较了笛卡儿和牛顿的理论，认为牛顿为物质世界提供了一个解释理论，但笛卡儿却没有为语言的创造性使用提供满意的解释。他自认为他正在向这个方向前进。也有人称 Chomsky 是语言学的牛顿。

科学方法素有 explanation 和 redescription 之分。

统计方法可看作一种 redescription，但不是 explanation。
Chomsky 不认可语言学的统计方法。

活着的人里被引用次数最多的知识分子？

Chomsky 是活着的人里被引用次数最多的知识分子，即使从苏格拉底算起，他的引用数也可排进前十。

他的时事评论几十年来都被广为关注，这一点颇像他的偶像罗素。他的独特政治观点体现在他对当代政治事件的评论上。
人们轻率地把 Chomsky 划为左派，其实，他是反建制者，永远怀疑权威，永远同情人民。
Chomsky 作为犹太人，却不被以色列接受，因为他同情巴勒斯坦的立场。以色列甚至拒绝给 Chomsky 发签证。
Chomsky 在任何地方的学术演讲，最后总要“饶”一段儿同等时间的政治评论，就像演出的返场。

Chomsky 敬仰的人不多，无政府主义者乔治·奥威尔是一个，罗素是另一个。很多人拿 Chomsky 和罗素做比较，

罗素在出版了《数学原理》后很少再有原创的知识贡献，兴趣转向政治；
Chomsky 在《句法结构》之后也成为一位社会活动家和公共知识分子。

但 Chomsky 仍然不断有科学成果出来。罗素被下过两次大牢，Chomsky 1967 年因为反越战被捕，和诺曼·梅勒关在一起。

7.2 统计派又来了我每开除一名语言学家，语音识别系统的性能就提高一点

Frederick Jelinek 是这个小组的核心。贾里尼克的学术训练是信息论，统计是他们这一派人最自然的工具。他的金句是：“我每开除一名语言学家，我的语音识别系统的性能就提高一点。”

IBM 小组的成员之一柯克（John Cocke）因为 RISC 架构在 1987 年就得了图灵奖。他在图灵奖的致辞中说，计算机性能的提升主要源于三个方面：算法、编译器和体系结构。这三个方面是按重要性大小排序的，但他的名声却主要来自于他认为重要性最小的体系结构。

其实最早提出机器翻译的 Warren Weaver 的思路就是统计。但 Chomsky 登场后，统计方法基本就没饭吃了。

Chomsky 的理由很简单，语言的可能性是无限的，统计不可能解决问题。 Chomsky 对统计方法的排斥，恰似波普尔对卡尔纳普归纳法的批判。
Chomsky 不喜欢统计派的一个理由是他们太像行为主义了：在翻译的统计方法中，平行语料的左边就是刺激，右边就是反射。

工程师根本不需要语言学知识，也不需要懂源语言或目标语言

2004 年，Franz Josef Och 加入谷歌。谷歌海量的数据让欧赫如鱼得水。谷歌翻译器迅速成为行业标杆。 2014 年欧赫在谷歌呆了十年后先后加入两家基因测序公司。

统计方法的另一个好处是工程师根本不需要语言学知识，也不需要懂源语言或目标语言，就可从事机器翻译。谷歌翻译团队就没什么科班出身的语言学家。欧赫认为语言学知识对翻译没什么用处，有时还会起反作用。

7.3 神经翻译是终极手段吗？ Google Neural Machine Translation (GNMT), RNN-based, 2016

2016 年，谷歌发布神经机器翻译（GNMT）系统，再次大幅提高机器翻译的水平。

和谷歌更早期的 Phrase-Based Machine Translation (PBMT) 不同，神经翻译的基本单位是句子，
谷歌使用了循环神经网络 RNN 做 Sequence to Sequence 的学习，
硬件设备是谷歌自己的 TensorFlow 平台。

神经翻译相比谷歌早期的基于短语的翻译系统，误差降低了 60%，这是翻译质量巨大的提升。这项工作已经开源。

Facebook, speed 10x, CNN-based, 2017

2017 年，Facebook 进一步提高了翻译效率。他们用自己擅长的卷积神经网络 CNN，进行序列到序列的学习。 Facebook 号称，英文-德文和英文-法文翻译的基准测试表明，

他们的结果在准确度上不输谷歌，
而在计算速度上则比谷歌的 RNN 有一个数量级的提升。

RNN 和 CNN 两种神经网络架构，分别被谷歌和 Facebook 支持。性能的此消彼长也被视为两家公司的竞争。真难预料神经网络还有多大的潜力可以挖掘。

翻译只是数据问题，不是语义问题？

Chomsky 们也许会接着质疑，这种翻译算理解吗？

也许翻译根本就不是理解的问题，翻译本身并不需要解释，翻译只是翻译而已，翻译只是数据问题，而不是语义问题。

没有 Chomsky，我们还要在黑暗中摸索，但有了 Chomsky，是不是又曾经束缚了我们探索其他方法的可能性。

7.4 IBM wason：知识库/知识图谱+浅层推理

现在的问答系统依靠常识和知识，同时也依靠浅层的推理。知识图谱是核心。

在 Jeopardy！节目中出现过的问题，95% 都能在维基百科中找到答案。

沃森参赛的版本的知识库只有 4TB，其中包含了所有维基百科的正文，真的不大。
除了半结构化的知识图谱，沃森还使用了开源搜索引擎。

把搜索的结果文档的标题与维基百科词条进行匹配，如果在维基百科中能找到，就把搜索结果列入候选答案。再把候选答案反馈给搜索引擎，进一步对返回结果做证据支持的处理，然后给出答案。
硬件系统是一个有 90 台 IBM Power 750 的集群，每台配一个 IBM Power 78 核处理器，每核 4 线程，所有一共 720 核，2880 线程；内存 16TB，所有的知识图谱都放在内存里了。

按照 Linpack 基准程序，这台计算机的算力相当于当年排名第 500 的超级计算机的一半，成本只有 300 万美元。同沃森带来的巨大广告效应相比，这真不算什么。

IBM 吸取了深蓝的教训，沃森在 Jeopardy！节目上取得的宣传成功后，很快变成了 IBM 人工智能事业的品牌，IBM 很快推出了沃森金融、沃森医疗、沃森教育等。现在 IBM 整个公司都围绕沃森转型了，也许 IBM 觉得“人工智能”这个词儿太俗了，他们非要标新立异地自诩为“认知计算”。

7.5 总结一个人工智能问题一旦解决，就不再是人工智能问题

就像一个哲学问题找到了科学的角度（formulation），就不再是哲学问题一样，一个人工智能问题一旦解决，就不再是人工智能问题。

大概很快人们就会认为语音问题不再是人工智能的核心问题。
如果说语音翻译不涉及自然语言理解和语义，可能也不会有什么异议。

2011 年 5 月，麻省理工学院为配合 150 周年校庆，召开了“大脑，心，机器”的研讨会（Brain, Mind and Machine Symposium）。

Chomsky 批评当下流行的神经网络和统计方法，Chomsky 认为神经网络是黑盒子，并没有给我们提供解释，故而没有提供知识。
时任谷歌研发总监的诺维格（Peter Norvig）很快回应 Chomsky，他批评语言学的规则在自然语言处理上，根本就没用。

可解释性

有人开始用“两种文化”来总结 Chomsky 和诺维格的隔空掐架。

Chomsky 对人工智能的批评的核心在于“可解释性”。AlphaGo 不能解释自己下棋的路数，算不算会下棋呢？
也可以反过来说，只有解释了，人类才能从中得到洞见，学习知识。但解释是不是也有层次，只有学会牛顿力学，才能学会相对论和量子力学？就如维特根斯坦所说的梯子的比喻，爬上房顶，梯子才能扔掉，梯子就是解释。其实，即使人类在不理解力学的时候，就会造弹弓了。对那时的人类，弹弓的工作原理就是黑匣子。

不求甚解的工程师 vs. 追求终极知识的科学家

Chomsky 和诺维格分别所代表的两种人关心的是两种不同的问题。

一种人力图打造实用的工具，没有解释也能凑合，他们是不求甚解的工程师；
另一种人寻求终极的知识，他们是科学家。

只不过，在计算机科学这个特定的学科中，科学家和工程师的角色变换太快，这门学科的开拓者，很多都是身兼二职，例如图灵和冯诺伊曼

8 向自然学习：从遗传算法到强化学习

Natural selection is a mechanism for generating an exceedingly high degree of improbability. —— Ronald Fisher

自然选择就是能生成极不可能之事的机制。

8.1 从生物学里找计算的模型：两条传承脉络

从生物学里找计算的模型，一直是人工智能的研究方向之一，学术上大致有两条传承的脉络：

McCulloch 和 Pitts 的神经网络，演化到今天成了深度学习；
冯诺伊曼的细胞自动机，历经遗传算法、遗传编程，其中一条支线最后演变成了今天的强化学习。

8.2 John Holland 和遗传算法

Holland 在晚年接受采访时如此评论麦卡锡和 Minsky：

美国西部的人工智能由麦卡锡代表，他们干净（neat），一切讲究逻辑；
东部的领袖自然是 Minsky，他们邋遢（scruffy），做事比较随意（adhoc）。

但他们的共性是都对机器学习不太感兴趣。

Ronald Fisher, 英国统计学家费舍

Holland 说他自己的思想被学界逐渐接受，是在他的学生都出了名之后。

对 Holland 影响最大的一本书是英国统计学家费舍（Ronald Fisher）的《自然选择的遗传理论》（The Genetical Theory of Natural Selection）。
无神论者道金斯（Richard Dawkins）称费舍是达尔文之后最伟大的生物学家。

进化和遗传是族群学习的过程，机器学习可以此为模型

费舍把孟德尔的遗传理论和达尔文的自然选择结合起来。 Holland 由此得到启发：进化和遗传是族群学习的过程，机器学习可以此为模型。

遗传算法

遗传算法就是模拟种群（population）的进化过程。其结构可以用下列伪代码大致表示。

随机生成初始群体。
主循环（停机的标准可以是迭代次数，或者适应度达到某个要求）。
- 2.1 执行策略，计算当前群体中所有个体的适应度；
- 2.2 从当前群体中，选择精英作为下一代的父母；
- 2.3 将选出的精英父母配对；
- 2.4 以极小概率将子代变异；
- 2.5 将子代个体添加到新群体中。

从程序中，我们马上可以理解进化中“优胜劣汰”的算法含义。

8.3 遗传编程

在遗传算法中，种群是数据，更进一步的想法是：如果种群变成程序的话，进化是不是仍然可行呢？ Holland 的学生寇扎（John Koza）在 1987 年给出了一个思路，并把它命名为“遗传编程”（Genetic Programming）。

物理学家多依奇（David Deutsch）用生物进化来类比知识的进化，他是哲学家波普尔（Karl Popper）的粉丝，并常常套用波普尔的科学哲学术语。他说猜想就像变异，批评和实验就像选择，而交叉学科就是配对了。从这个意义上说，知识的增长更像是遗传编程。

遗传编程的结构和遗传算法差不多，

一组程序就一个特定的问题给出解答，按照执行结果的好坏给所有程序排序。
程序本身也是数据，自然也可以修改。
在遗传编程里，变异就是对程序做微小调整。
交叉和配对就是将两个表现优异的程序互相嫁接。

寇扎后来还引入了“基因重复”（duplication）和“基因删除”（deletion）等生物学概念，以提升遗传编程的效率。

遗传算法本身就需要大量的数据，遗传编程需要的数据量自然更大，这对计算能力提出了新的需求。

遗传算法的稳定性一直就是研究课题，遗传编程的数学性质自然更加复杂。

8.4 强化学习

“人工智能”这个词儿的流行是在 20 世纪 70 年代中期，按照阿比卜的一家之言：人工智能是控制论的替代品，至少从时间轴上看，这不算错。

一个刚出生的孩子，怎么学会对环境的适应

巴托和萨顿关心更原始但也更抽象的可适应性。一个刚出生的孩子，怎么学会对环境的适应。

在监督式学习中，目标是清楚的。
但婴儿不知道目标是什么，不知道自己要什么。通过与外部世界的不断交互，婴儿受到奖励或惩罚，由此强化对外部世界的认知。

数学基础：马尔科夫决策过程和动态规划

强化学习的理论基础之一是马尔科夫决策过程。

强化学习的主体是 Agent，Agent 和环境互动。
强化学习就是 Agent 根据经验改变策略以期达到长期最大奖赏的过程。

强化学习的另一个理论基础是动态规划。

贝尔曼（Bellman）在 20 世纪 50 年代就发明了动态规划。
萨顿和巴托也承认在强化学习早期，受到动态规划的启发。巴托一度在他的强化学习讨论班上让研究生分工研读贝尔曼的经典著作《动态规划》（Bellman 1957）

在计算能力的约束下，强化学习的环境不宜太复杂

萌芽期的强化学习的例子都是游戏，如贝尔曼的“老虎机 ”和塞缪尔（Samuel）的跳棋。
游戏的环境相对容易定义，在棋类比赛中，环境就是对手和规则。
强化学习被用来下围棋不是偶然的。

如果整个世界是完全随机的，那么强化学习就要失效，学还是不学对结果没有什么影响。

巴托和萨顿有时也把强化学习称为“享乐主义”（hedonistic），也即学习系统想最大化环境对自己的某种反馈。

exploration vs. exploitation

强化学习中有所谓“抬头看路”（探索，exploration）和“低头拉车”（苦干，exploitation①）之分。探索就是看看有没有别的选择，苦干就是专注于当前的选择。

learning rate

在强化学习中，用希腊字母 ε 表示学习率（learning rate）， 值越小，能用于探索的时间就越少，绝大部分时间是在苦干。

减少状态空间搜索

遗传算法和强化学习有一个共同点：效果要等到多步以后才能看到，这是和监督式学习的主要不同。这就需要尽可能多地访问所有的状态，这样效率就会受到影响。

蒙特卡洛模拟是一种减少状态空间搜索的有效办法。
最近也有利用深度学习来压缩需要表示的状态空间数目。这还有点意思，本来强化学习初衷是探索生物体学习的模型，现在神经网络又成了强化学习的工具。

当状态空间很大时，强化学习可以和蒙特卡洛方法或深度神经网络结合，就使用了蒙特卡洛方法

AlphaGo 让强化学习一夜之间成为显学

强化学习作为机器学习的一个分支，一直没得到重视。谷歌的 AlphaGo 赢了李世石之后， 强化学习作为 AlphaGo 的核心算法，一夜之间成为显学。这当然要归功于萨顿和巴托多年的坚持。

巴托的“可适应系统”实验室，在神经网络不景气时，曾经收留过一批无家可归的学术浪人，其中就有吴恩达的老师乔丹。事实上，吴恩达的成名作就是用强化学习来控制无人直升机。

萨顿：开创强化学习，留有一点控制论的影子

萨顿 1979 年到麻省大学跟随巴托和阿比卜，由此开创强化学习。

他一直认为强化学习是理解智能的关键。
在整个人工智能的各个分支里，大概只有强化学习还留有点儿控制论的影子。

一旦一个算法被天才发明，并成功地在一个领域里得到应用，自然会有二流人才前赴后继把这个算法在其他领域发扬光大。20 世纪 80 年代的神经网络如此，当下的强化学习也如此。

早年有人质疑遗传算法算不算机器学习，他们认为遗传算法是一种近似优化算法，不能算机器学习。但从某种意义上，任何机器学习算法都是一种优化算法。

强化学习 vs. 监督式学习：第一人称叙事 vs. 第三人称叙事

如果从写作的角度看，

强化学习更像是第一人称叙述，Agent 就是“我”，外部世界（包括他人）都是“环境” 。
监督式学习更像是第三人称叙述，作者在用一只上帝的眼睛洞察世界，对错分明。

第一人称的学习要比第三人称的学习更本质。

Stuart Russell 和 Peter Norvig 在《人工智能：一种现代方法》里说 “可以认为强化学习包含了全部人工智能”（Reinforcement learning might be considered to encompass all of AI）。

8.5 计算向自然学习 vs. 自然向计算学习

以色列海法大学的进化生物学家 Livnat 和伯克利的理论计算机科学家 Papadimitriou 2016 年发表了一篇文章“性作为算法”（Sex as an Algorithm），引起轰动。

喜欢的人认为这为进化论找到了新视角，而不喜欢的人则批评杂志的编者和作者是为了博眼球。
这篇文章质疑了性在进化中的作用。
哈佛大学的理论计算机科学家、图灵奖获得者 Leslie Valiant 曾经从计算的角度研究过机器学习和进化，他把进化当作学习的特例。Livnat 和 Papadimitriou 认为有性繁殖不太容易达到最优点，而无性繁殖才更像是优化算法，他们把遗传算法比作有性繁殖，模拟退火算法比作无性繁殖。

如果说遗传算法是微观地向生物内部机制学习的话，强化学习则是更为宏观地向自然学习。

8.6 生物学激发的学科都缺乏计算理论的基础

无论是遗传算法、深度学习还是强化学习，都缺乏计算理论的基础。

生物学激发的学科都是模拟自然，它们都不需要解释，不需要了解内部原理，而只要能查看输出结果就够了。
数学大概是所有学科中离生物学最远的学科。

8.7 参考资料整体大于局部之和：涌现（emergence）现象

Holland (1975)是遗传算法的原创著作。

Holland 曾经写过几本科普读物，但大科学家未必是好的科普作家，他的著作不适合完全的门外汉。另外，他的哲学观点是整体论的，他认为整体大于局部之和，大量的“局部” 凑到一起，可以形成“涌现” （emergence）现象。

Sutton and Barto (1998) 强化学习的原创著作

Sutton and Barto (1998) 是强化学习的原创著作，在网上可免费获取。

强化学习的教科书里最爱用的 Q-learning，是 Chris Watkins 1989 年在他的剑桥博士论文里提出的。

科普文章：“谁能说出更大的数”

理论计算机科学家 Scott Aaronson 曾经写过一篇非常有意思的科普文章“谁能说出更大的数”（Who Can Name the Bigger Number），这可以是算法信息论的入门。

9 哲学家和人工智能

The real discovery is the one that makes me capable of stopping doing philosophy when I want to, the one that gives philosophy peace. ——Wittgenstein（维特根斯坦）

9.1 两类哲学家：深刻的和混饭的

哲学家不一定懂哲学，就像相声演员不一定会说相声，这是低门槛行业的通病。

《计算机不能干什么》，1965 是对《炼金术与人工智能》的扩充，对人工智能的全面批评。

哲学家有两类，一类是深刻的，一类是混饭的。

罗素和弗里格是深刻的，没有他们，就不会有数理逻辑，也就不会有哥德尔、丘奇、图灵，以及后来的计算机科学。
但没有现代的欧陆哲学，世界不过省了些粮食而已。

没有胡塞尔和海德格尔，Minsky 照样会想出“框架” ，从而催生后来的“面向对象的程序设计”方法论。所谓“顶层 ”概念就是 Java 程序设计语言里的 Object。

按照德雷弗斯们的说法，哲学系是不是应该要求读现象学的博士必须熟练掌握一门面向对象的程序设计语言？

在 20 世纪 80 年代末期，神经网络研究复兴之后，德雷弗斯对人工智能的全面批评也缩小为对符号派的专门攻击。他和他的兄弟斯图亚特·德雷弗斯一起撰文写书。斯图亚特虽然是运筹学专家，但一直都在做神经网络的研究，甚至号称发明了“反向传播”（back-propagation）的原始概念。

德雷弗斯曾经引用梅洛庞提批判人工智能：人脑是和环境直接交流的，而不是通过表示（representation）。

9.2 塞尔和中文屋

1980 年塞尔在《行为与脑科学》杂志上发表了 Minds, Brains and Programs 一文。文中的一个思想实验“中文屋” 马上成为最喜欢被引用的假想实验之一。

“中文屋”思想实验

“中文屋”思想实验是这样的：

假设有个只懂英文不懂中文的人（“我”）被锁在一个房间里，屋里只给“我”留了一本手册或一个计算机程序， 这个手册或程序教“我”在收到中文信息时如何用中文应对。
屋外的人用中文问问题，屋里的“我”依靠程序用中文回答问题，沟通方式是递纸条。

塞尔的问题是：如果屋外的人不能区分屋里的人是不是母语为中文，那么屋里的“我”是不是就算懂中文？

塞尔自己认为“我” 不懂中文。很明显，这个场景源自图灵测试，只不过图灵测试的环境是英文，而中文屋里既有中文又有英文。

解读

塞尔的文章出来后，引起轰动。其实轰动的原因很简单：谈论这种玩意儿没什么门槛，谁都可以说三道四：哲学家、科学家，以及各种媒体人。

塞尔毕竟是老练的哲学家，已经预测大家会质疑他的论断，他在文尾也设想了各种回答。

第一个问题是，我们只是算屋里人理解中文呢，还是屋子加人作为一个系统理解中文。塞尔的论断是屋里人即使查遍手册，顶多算是理解语法，而不算理解语义。
我们可以问塞尔这样的问题：一个坐飞机的人算能飞吗？如果对这些问题的答案都是“算” ，那中文屋作为一个系统为什么不算理解中文呢？

塞尔认为必须内化（换句话说：手册必须变成人身的一部分）才能算懂中文，那么内化到什么程度才能算呢？

爱因斯坦说“我的笔加上我要比我自己聪明”，笔算不算外化？
内化是完全的物理隐藏，还是只是个反应时间问题？在一开始查手册时，反应时间必定很慢，但熟能生巧之后，查手册变成下意识的动作，那算内化吗？
内化和辅助工具的大小也有关系。如果语音识别工具是桌面电脑，我们可能不会认为对话中的两个人理解了对方的语言。但如果这个工具可以微型化，直接内化到耳朵里，那算不算理解？

反“强人工智能”

塞尔认为他不是反人工智能，他只是反“强人工智能”。

中文屋测试的不是屋中的“我”，而是屋中的程序。如果那本神奇的手册或者程序已经通过图灵测试，那程序就是一个机器翻译的神器。这本身就是强人工智能了。而且那程序已经有语义功能了。

假设游戏不是中文翻译，而是下棋，那 “我” 算不算会下棋？断言中文屋是不是有智能，就像断言 AlphaGo 会不会下围棋一样，要看应用场景。

9.3 普特南和缸中脑思想实验：缸中脑

1981 年普特南出版了《理性、真理与历史》（Reason, Truth, and History）一书，该书的开篇就给出了“缸中脑”的假想实验。

Wikepedia Brain in a vat:

普特南更进一步设想，假设所有的感觉器官都泡在缸里，而外面的世界就是一台大自动机。

缸中脑知道如何与外部世界做对应吗？泡在缸中的人脑，如何知道自己是颅中脑，还是缸中脑？

人工智能的基本问题是可否造一台机器能有智能， “缸中脑”中的机器则起了另一种作用：人脑是否能确定外在的世界是直接实在还是间接实在。

《黑客帝国》、《盗梦空间》

科幻电影《黑客帝国》（Matrix）、《盗梦空间》（Inception）等都受“缸中脑”思想实验的启发。

9.4 给哲学家一点忠告哲学指导科学？

曾经有一个教条：哲学指导科学。费曼、惠勒和杨振宁等物理学家都曾撰文批驳。但这恰是德雷弗斯的立场。维特根斯坦曾经有言：哲学家的工作应该是一直给人提醒（assembling reminders），而不是指导。

哲学空洞化

偏重科学和逻辑的英美分析哲学也挡不住哲学的颓势，最后一个从哲学中脱离的硬学问是逻辑，目前最好的逻辑学家都在数学系和计算机系，哲学已经空洞化。

如果真认为海德格尔有用，就应该像弗里格和罗素清理逻辑那样， 把这些东西整理成可以交流的形式。也许哲学家真怕他们惯用的冷僻词汇被翻译成通俗易懂的语言。当代哲学，尤其是欧陆哲学，就像韩国整容术，乍一看唬人，其实遗传不了。

整个人工智能就是个大的假想实验

彭罗斯曾经这样谈到机器的情感和道德：如果你买一台计算机，它是有情感的，那么我们就有道德问题，因为计算机的意愿可能被违反，并可能会被当作奴隶。我们首先必须说道德是一个社会问题，也就是说当一个社会只有一个个体（无论是人还是计算机）时，是不存在道德问题的。

丹尼特曾说哲学家喜欢假想实验。其实从某种意义上说，整个人工智能就是个大的假想实验。只不过哲学家用纸和笔，而计算机科学家用计算机硬件和软件。本质是一样的。不同的是哲学家从不为假想实验的结果所苦恼，反而会时不时洋洋自得；而计算机科学家则偶尔会被他们取得的成果所惊到。

10 人是机器吗？——人工智能的计算理论基础

humans are nothing but meat machines that carry a computer in their head. —— Marvin Minsky

10.1 人是不是机器？

认为人是机器的，道理很简单：人也是由各种物理化学机制构成的，当然是机器了。早有法国哲学家美特里，现有 DNA 双螺旋结构发现者克里克，都持这种观点。克里克认为在不远的将来，生命可以在试管中合成。
认为人不是机器的，论据是人有很多功能，目前机器无法完成，尤其是那个叫“灵魂” 的神奇东西。

《论可计算的数》和图灵机的定义

计算机科学起源于图灵 1936 年那篇无论怎么夸赞都不过分的文章“论可计算的数”，这是人类文明最重要的成果之一。图灵在这篇文章中定义了后来被他的导师丘奇称为“图灵机”的计算装置：

一条无穷长的纸带，
一个读写头在一个控制装置的控制下在纸带上方左移右移，读取纸带上的内容并在纸带上写 0 或 1。

图灵的初衷是让他的机器模仿人类计算者。

同源问题和相关问题

“人是机器吗”这个问题有很多同源的古老哲学问题，例如，“心-脑”（mind-brain）和“心-身”（mind-body）。 还有很多相关问题，例如，自由意志和自我意识。

如果人是机器，那是模拟机器还是数字机器？

按照冯诺伊曼的说法，神经系统的本质是数字的，尽管构成神经系统的化学和生物过程的描述可能是模拟的。
现代物理学的一个假设是整个宇宙都是离散的，也即数字的。
人工智能符号派的基础之一是所谓“物理符号假设”，这个假设要求计算装置必须是数字的，或者说变量必须是离散的。
费曼就曾说世界是数字的。

如果机器是数字的，那么图灵机就是简单又有力的模型。 对于离散的量，二进制就足够了。

朴素唯物主义认为世界是连续可分的，从某种宏观的意义上说，朴素唯物主义是经典物理的思想基础。 历史问题有点像海岸线问题，尺度不同则结论也不同。新的量子物理认为世界是离散的、有限的。

10.2 Church-Turing Thesis：为什么图灵机是最重要的发明？

在人类发明的所有计算装置中，图灵机是直觉上最简单最可靠的。

在计算理论里，有一个著名的丘奇图灵论题（Church-Turing Thesis）： 所有功能足够强的计算装置的计算能力都等价于图灵机。这是一个观察，而不是定理。

通用图灵机和冯诺依曼架构

图灵在发明图灵机时，还定义了 Universal Turing Machine，简称 UTM，译为“广义图灵机/万能图灵机/通用图灵机”。

UTM 的核心思想就是一个图灵机的执行过程也可被编码成数据，放到纸带上，因此一个图灵机可以通过执行纸带上的程序来模仿另一个图灵机的行为。这台能模仿其他图灵机的图灵机就成了通用图灵机。
这是一个很深刻的思想，现在的软件产业都得益于此：被编码的图灵机就是软件。
后来冯诺伊曼设计的计算机被人称为冯诺伊曼架构，其最核心的思想就是存储程序（Stored Program）。这个思想其实就是来自万能图灵机：被编码的图灵机就是存储的程序。

纯逻辑或数学的东西联系到物理世界：函数 -> 纸带和读写头

冯诺伊曼把计算机的所有原创思想的功劳都给了图灵，并批评那些对图灵机实际意义缺乏认识的人。

有了图灵机，我们就很容易把原来是纯逻辑或纯数学的东西（例如递归函数和λ演算等） 和物理世界联系起来了，函数成了纸带和读写头。

10.3 不可能存在比图灵机更强的计算装置

Church-Turing Thesis 的一个自然结果就是，不可能存在比图灵机更强的计算装置。

20 世纪 80 年代初就有人证明三层以上的神经网络可以逼近任意连续函数。
80 年代末期，Steve Judd 证明三层以上的神经网络学习问题在图灵机上是 NP 完全的。
本书作者证明了在 BSS 模型上，类似的神经网络学习问题等价于线性规划问题。

目前各种神经网络学习算法都是工程，鲜有科学，神经网络算法多是些经验算法外加调参数，从业人员也多数没有计算理论的训练。伴随暴发户和显学的必然是浮躁之气。在各种学习算法里，很少看到目前关于什么算法适合什么问题的理论指导。

10.4 BBS 实数模型

BSS 模型的一个很大假设是，任意精度的实数四则运算可在单位时间内完成，这在数值分析中是有用而又方便的假设，但目前尚不知道如何在物理上实现。

其实即使在数值分析之外，我们经常做类似的假设，例如，在排序算法分析中，任意精度的数（可能是实数）之间的比较是单位时间的。

在 BSS 中，一阶逻辑的所有东西都是可判定的。这和图灵机是截然不同的，图灵机停机问题就是不可判定的。 BSS 和图灵机的这个本质区别可溯源到 20 世纪 30 年代初期。那时哥德尔证明了整数的一阶逻辑是不可判定的。但几乎在同时，塔尔斯基证明了实数的一阶理论（几何和代数）则是可判定的。我们可以说图灵机和 BSS 分别是哥德尔定理和塔尔斯基定理的计算体现。

有些复杂性的性质，BSS 也和图灵机不同。比如线性规划在图灵机上被证明是多项式时间的，但在 BSS 上，复杂度是啥，目前不知道。如果在 BSS 上可以找到线性规划的多项式时间的话，在图灵机上就可以找到强多项式时间算法。这个问题被斯梅尔称为最重要的计算机科学的理论问题。

按照费曼的说法，宇宙是数字的，换句话说，宇宙不是连续的实数，空间是一种网络，而时间也不是连续的。

10.5 量子计算

《费曼计算机科学讲义》

IBM 是计算物理学的源头。计算的物理学研究有实际需求。

图灵机的物理约束

从计算的角度看，图灵机只有数学约束而没有物理约束。

从真实世界看，一个可能的物理约束是能量：图灵机的读写头和纸带的运动是需要能量的。

逻辑运算与能量的关系

现代计算机的组件是逻辑门，有两种门，

可逆的，如“非门”；
不可逆的，如“与门”。

IBM 的物理学家朗道尔（Rolf Landauer）在 1961 年提出了朗道尔原理：任何不可逆计算都需要能量。

同在 IBM 的另一位物理学家本内特（Charles Bennett）在 20 世纪 70 年代提出可逆运算不需要能量，并证明对任何图灵机都能找到一个对应的可逆版本，能实现同样功能而不损失效率。

量子计算机：（在对的时刻）测量而非（一步步）计算

费曼考虑的问题是如何以任意精度来模拟一个物理系统。他的方法是构造一台量子计算机，它求解问题的时间不随问题的规模呈指数增长。

量子计算并不是一步一步的经典计算，而只是测量系统的输出结果。

费曼认为测量本身也是一种计算。

当计算量很大时，最简单的方式是让自然界自己该干啥干啥，而在对的时刻测测结果就可以了。

举例：子弹的弹道，生成随机数

举一个不精当的比喻，想知道子弹的弹道，两种方式，

考虑所有可能外部内部因素，依靠计算；
让子弹飞，然后测量。

随机数可以通过伪随机函数生成，也可以通过测量一些噪声源得到。图灵 1949 年就研究过通过外部电子噪声源得到随机数的方法。

在图灵机上很难求解的问题有可能在量子计算机上用多项式时间解决。其中最热门的问题是素数分解。

10.6 计算理论的哲学寓意神经网络研究者数学和计算理论功底的缺乏

人们常说是 Minsky 和佩珀特的《感知机》（Perceptrons）一书导致了神经网络研究近 20 年的衰败，但神经网络的研究者不该反省下自己数学和计算理论功底的缺乏？

从当下人工智能的浮夸风气中，没看出吸取了什么教训。

Donald Knuth：量子力学为自由意志提供了空间，也使得上帝可以操纵世界而不违反物理定律

Donald Knuth（计算机科学家中位数不多的有神论者）说量子力学为自由意志提供了空间，也使得上帝可以操纵世界而不违反物理定律。

我很少看到计算机科学家敢对物理学家说三道四，姚期智大概是唯一的例外。

11 智能的进化

Science is what we understand well enough to explain to a computer. Art is everything else we do. —— Donald Knuth

11.1 Human Advantage: How Our Brains Became Remarkable

畅销书，并被翻译为多种语言。2017 年该书中文版以《最强大脑》为题出版。
创造的“大脑汤”（brain soup）的方法最终使她成功地测定不同动物大脑的神经元数量。
书中不仅有研究成果，还有更有意思的研究过程，包括她是如何把大象的大脑从非洲弄到美洲的新奇故事。

脑结构和神经元数量

不同动物的脑构造有所不同，脑中的神经元数量也完全不同，

人脑中总共有 860 亿个神经元（用 LLM 术语来说就是 86B），其中大脑皮层有 160 亿个神经元（16B）。 大脑皮层的神经元数量决定了动物的智力水平，人的大脑皮层中神经元数量远高于其他物种，所以人类比其他物种更聪明。
大象的脑子总共有 2570 亿个神经元，但是其中 98% 的神经元都存在于小脑中。大脑皮层只有 56 亿个神经元，无法与人类相比。

神经元数量越多，能耗也越大

大脑皮层中的神经元数量越多，能耗也越大。

人脑每天消耗的能量占人体全部耗能的 25%。人之所以能够很快超越其他物种，主要是因为人类掌握了烹饪技术。能够在短时间内摄入大量卡路里以支持大脑运转。
其他物种则将摄入的卡路里用于维持身体运转，不得不牺牲大脑皮层的神经元数量。

用不同的时间粒度看待过去，会得到不同的结论

《尤利西斯》中的几个小时，茨威格作品中人物的一生，或赫拉利的七万年，关心不同的过程。
粒度也可以是主体的，一个基因，一个人，一个群体，不一定非得是一个小的物质颗粒只配得上小的时间单位。
想想基因人类学，基因在几万年的空间分布，帮我们了解人类的起源和迁移。
当用太大的颗粒度研究历史时，历史学家的用处会令人质疑。

11.2 机器：从代替人的体力到代替人的智力

过去的机器旨在节省人的体力，现在的机器开始代替人的智力。

人作为物种，不再具备进化的竞争优势？

人通过两性繁殖的进化速度远远赶不上机器。

机器的进化速度服从摩尔定律——每 18 个月性能提升一倍，而人的进化速度则是 20 年一代人。
人作为物种，是不是不再具备进化的竞争优势？
依靠硬件的摩尔定律，是不是可以达到超级智能？

新的智能形态：Agent？

新的智能存在可以是人工智能的 agent，也可以是生物学意义上的物种。

11.3 基因修复的伦理问题

通过修复一个受精卵的一小段染色体，就可以避免或治疗某种疾病。这是一个真实的伦理问题，因为已经有这样的病例发生。

如果孩子出生，那么他/她的父母是谁？
多小算是“一小段”，1% 还是 49%？
更进一步：可不可以有更多不同来源的基因参与？
英国《经济学人》2017 年 2 月的一期封面标题就是“Sex and Science”

11.4 机器人三定律之一：机器不能伤害人

维纳曾经说：“我们最好能够确认，我们给机器设定的目标确实是我们想要的。”

物理学家改行的科幻作家阿西莫夫曾提出机器人三定律，第一条就是机器不能伤害人，但“什么是伤害”本身就不好定义。AlphaGo 战胜李世石和柯洁，算是对他们的伤害吗？

12 当我们谈论生死时，我们在谈论什么？

I don’t want to achieve immortality through my work; I want to achieve immortality through not dying. —— Woody Allen（伍迪·艾伦）

12.1 苏格拉底之死和《斐多篇》

苏格拉底说：哲学家只研究“正在死”（dying）和“刚刚死”（being dead）。除了这个啥都不管。

苏格拉底因为三项罪名被判死刑：腐蚀雅典青年，不敬城邦和引入自己的新神。受审前一天恰好赶上雅典的“花船节”，祭祀的船要离开雅典再返航。花期，城邦要保持清洁，因而不能执行死刑，于是苏格拉底临死前有一段时间可以和学生们聊哲学。柏拉图据此写了四篇对话。

耶稣之死和苏格拉底之死不同，耶稣完成了使命，苏格拉底留下了一堆问题。

他说人追求真理的最大束缚就是肉体，为了得到终极智慧，灵魂必须超越肉体，也就是摆脱感官的限制。换句话说就是人必有一死。他最后一天的谈话被当时的在场者斐多记录，最终变成了柏拉图的《斐多篇》。

12.2 作者和苏格拉底之间的假想对话

挺有意思的一段哲学对话，关于“永生”，这里就不放了，感兴趣可以网上搜搜，或者读完这份笔记觉得这本书不错，买本电子/纸质书支持下作者。

13 总结逻辑派/规则派/符号派统计派哲学层面 理性主义者 经验主义者经济方式类比计划经济自由市场经济视角和可解释性 上帝视角，第三人称叙事，更具可解释性 第一人称叙事，不可解释性（e.g 深度学习）令人困扰科学史角度还原论（reductionism） 涌现论（emergentism）

科学史对科学也有还原论（reductionism）和涌现论（emergentism）之分，规则派接近还原论，统计派可以算作涌现论。

如果说英美分析哲学的工具支撑是逻辑的话，那么在某种意义上，博弈论可被当作实用主义的新工具，博弈论涉及 Multi-Agent。我并没有非得把自然派附会到实用主义的意思。曾经被认为是复杂的统计派问题，例如图像处理和语音识别，现在已经得到解决或者至少已有解决的思路。

附录附录 1：图灵小传

曼彻斯特的公园里，图灵雕像的底座，引用了罗素的话：“数学不仅有真理，也有最高的美，那是一种冷艳和简朴的美，就像雕塑。”

Mathematics, rightly viewed, possesses not only truth, but supreme beauty — a beauty cold and austere, like that of sculpture, without appeal to any part of our weaker nature, without the gorgeous trappings of painting or music, yet sublimely pure, and capable of a stern perfection such as only the greatest art can show. The true spirit of delight, the exaltation, the sense of being more than Man, which is the touchstone of the highest excellence, is to be found in mathematics as surely as poetry.

伯特兰·罗素，《西方哲学史》

附录 2：人工智能前史：图灵与人工智能

图灵 1950 年在英国哲学杂志 Mind 上发表文章“计算机与智能”，文中提出“模仿游戏”，被后人称为“图灵测试”。

这篇文章被广泛认为是机器智能最早的系统化科学化论述。
但图灵在 1941 年战时就开始思考机器与智能的问题，1947 年图灵在伦敦皇家天文学会就机器智能发表演讲。1948 年图灵把这次演讲整理成文章，题为“智能机器”（“Intelligent Machinery”），作为英国国家物理实验室（NPL）的内部报告，但没有公开发表。
这篇文章迟至 1969 年才在年刊型论文集《机器智能》上发表。但由于和 1950 年文章的题目类似，并没有引起人们的重视。

1948 年的文章对智能的概念采取了更宽泛的说法，图灵探讨了大脑皮层，

他认为婴儿的大脑皮层是非组织的（unorganised）。
在图灵的用语里，“非组织”就是“通用”的意思，发育的过程就是组织化的过程。
他指出人身上的任何小部件都可以用机器来模仿，他还提到基因、进化和选择。

正是因为如此，麻省理工学院的机器人专家布鲁克斯认为图灵（1948）是人工智能两条路线分歧的原点，而他自己的观点则是图灵 1948 年的文章比 1950 年的更为重要。图灵 1948 年的文章提到了 embodied intelligence 和 disembodied intelligence 的区分。

图灵进一步预测到 2000 年，机器内存会达到 1GB（预测这么准还真挺神）。

这篇文章为后来的一系列后学者模仿的文章提供了范文的效果，例如塞尔的“中文屋”和普特南的“缸中脑”。

附录 3：冯诺依曼与人工智能

Talent hits a target no one else can hit; Genius hits a target no one else can see. —— Schopenhauer（叔本华）

冯诺伊曼被引用最多的话是：“我们应该预测所有稳定的过程，控制不稳定的过程。” （All stable processes we shall predict. All unstable processes we shall control.）其实这并非是老冯的原话，而是弗里曼·戴森转述老冯 1950 年在普林斯顿的讲座的精神，那时他是多么自信啊。

附录 4：计算机与智能，turing paper

建议参考翻译，阅读图灵的原 paper。

后记

本书的写法比较偏重基础和方法论，而不太注重应用。

费曼在加州理工学院教书时，学期的最后一节课都是请学生问问题，只要不涉及政治、宗教和期末考试，什么问题都可以问。

本书也参考这一方式，回答读者几个问题：

问：这次的人工智能是泡沫吗？
答：人工智能和人们关心的某些终极问题有关，这些问题过去是哲学家和科幻作家的地盘， 计算机科学为人们提供了用科学和工程的手段回答这些问题的方法，旁人自然会对这些方法存在过高的期望，过高的期望自然也会带来过高的投资。泡沫的破裂就是投资的失败。比人工智能更年轻的互联网，起伏的周期更短。从投资的角度看，某些特定的人工智能应用领域确实存在过热现象。
问：算法、数据和算力，哪一项对这次人工智能的复兴贡献最大？
答：我正在对这个问题做一项定量的研究，但目前还没有确定性的结果。要我猜的话，贡献排序应该是：算力、数据和算法。没有足够的算力，就没有办法处理海量数据，很多算法的精化是以某些特定的硬件为前提的。 算力的提升恰好到了一个临界点，使得各种学习算法成为可能。

[笔记]《人工智能简史（第二版）》（2025）

ARTHURCHIAO'S BLOG

2 months 4 weeks ago

本文整理一些个人阅读笔记和思考。

水平及维护精力所限，文中不免存在错误或过时之处，请酌情参考。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

0 前言
- 0.1 哈代：一等智力 vs. 二等智力
- 0.2 任正非
1 达特茅斯会议：人工智能的起源， 1956
2 自动定理证明兴衰记
3 从专家系统到知识图谱
4 第五代计算机的教训
5 神经网络简史
6 计算机下棋简史
7 自然语言处理
8 向自然学习：从遗传算法到强化学习
9 哲学家和人工智能
10 人是机器吗？——人工智能的计算理论基础
11 智能的进化
12 当我们谈论生死时，我们在谈论什么？
- 12.1 苏格拉底之死和《斐多篇》
- 12.2 作者和苏格拉底之间的假想对话
13 总结
附录
后记

0 前言 0.1 哈代：一等智力 vs. 二等智力

哈代曾说科学和艺术的原创需要一等的智力，解释和欣赏（例如乐评家和书评家）是二等智力的活儿。

搜了一下哈代的原话：

A Mathematician’s Apology，G. H. Hardy

大致意思：

《一个数学家的自白》，哈代

0.2 任正非

任正非是二十一世纪的哈代。

张五常：任正非是今天的哈代吗, 2019

1 达特茅斯会议：人工智能的起源， 1956

What is past is prologue. - William Shakespeare

凡过往皆为序章。

1.1 经典读物

“Man viewed as a Machine” 介绍了图灵机和冯诺依曼的细胞自动机。
- muscle machine
- brain machine - 人工智能的另一种说法
Alchemy and Artificial Intelligence (PDF),《炼金术与人工智能》，1965
《计算机不能干什么》
《Human Memory and the Storage of Information》1956

是《The Magic Number Seven》的另一个版本。

一门年轻的学科，一开始都需要一点“过度销售”（excessive salesmanship） - Minsky

1.2 Chomsky：机器可以思考吗？-> 潜艇会游泳吗？

2015 年他被问及“机器可以思考吗？”，他套用计算机科学家 Dijkstra 的说法反问：“潜艇会游泳吗？”

Youtube: Noam Chomsky - Can Machines Think?

1.3 AI 的两面：工程和科学

Chomsky 把 AI 分成工程的和科学的：

工程的一面，如自动驾驶车等，能做出对人类有用的东西；
科学的一面，Chomsky 明显不认可。

他引用图灵的话：这问题 too meaningless to deserve discussion（没有讨论的意义）。

2 自动定理证明兴衰记

就像机器能省体力一样，符号演算能省脑力。演算越完美，付出的脑力就越少。

Proof is cultivated reasoning. —— Bruno Buchberger

2.1 自动定理证明的起源数学哲学三大派

逻辑主义
- 代表人物：罗素，
- 把数学归约到逻辑，因此只要把逻辑问题解决了，之上的数学问题自然就解决了。
- 换句话说，把逻辑玩转了，数学就不算事儿。
形式主义
- 代表人物：希尔伯特
- 把数学形式化，数学过程就是把一串符号变成另一串符号。
- 希尔伯特设想，如果能设计一个大一统的算法，那么所有的数学问题都可以由这个算法来解答。这和逻辑主义精神有一定相通之处。哥德尔后来证明这一切是不可能的。
直觉主义

逻辑学的源头：亚里士多德三段论

自动定理证明起源于逻辑，初衷就是把逻辑演算自动化。

逻辑学的源头是亚里士多德的三段论：人必有一死，苏格拉底是人，所以苏格拉底必死。

2.2 思想实验：Brain in a vat

Wikepedia Brain in a vat:

2.3 王浩（1921—1995）

可以公正地说，王浩的定理证明研究孕育了整个理论计算机科学。

王浩以哥德尔的权威诠释者和知音名世，但他对哲学、逻辑学、计算机科学的原创性却被低估了。

王浩在致获奖词时半开玩笑地说，因为自己的个性，荣誉经常绕道而行。

王浩的定理证明程序后来成为高级语言的基准程序，麦卡锡的 LISP 早期就一直以王浩算法的程序作为例子。

2.4 吴文俊（1919—2017）

高龄开始学习编程

为人类文明做出贡献

吴文俊生平：《走自己的路》

2.5 哲学问题有黑盒的理解不能算理解，有黑盒的证明也不能算证明

Chomsky 对统计派机器翻译的批评：有黑盒的理解不能算理解，有黑盒的证明也不能算证明。

人已经无法核实部分计算机证明的结果

传统的数学实践遵循共同体过程：一个数学家提出证明，然后一堆同一共同体的专家来验证，如果验证通过，定理成立。费马大定理的证明、庞加莱猜想的证明和张益唐的证明，都是这个套路。
有些机器证明太长，人根本看不过来，那怎么才算是证明了定理呢？如果用一个可被信任的计算机程序验证一遍，是不是就算是证明了呢？罗宾斯猜想的证明就曾用 Mathematica 验证过，而 AUTOMATH 本身就是一个验证系统。对全自动的定理证明，验证过程更容易机械化，而计算机辅助证明可能五花八门，很难有一个统一的过程。

数学家的归宿

吴文俊曾留学法国，法国的数学家素有关心数学史的传统。
吴文俊认为中国数学是巴比伦式的而不是希腊式的，巴比伦数学讲究计算，而希腊数学讲究公理。

计算模糊了理性和经验的边界

2.6 现状时代交替 (2006)：定理证明小组被裁，深度学习论文横空出世

定理证明领域的名字演化

定理证明领域的名字也经历了有趣的演化。

最早都叫机器定理证明（Mechanical Theorem Proving），
后来改叫自动定理证明（Automatic Theorem Proving），
再后来叫自动演绎（Automated Deduction），目前都叫自动推理（Automated Reasoning）。

原因很简单，演绎（deduction）只是推理的一种，现在归纳（induction）、溯因（abduction）也都算成推理了。

贝叶斯推理，可以叫 Bayesian Logic，或 Bayesian Inference，也可以叫 Bayesian Reasoning。

2.7 结束语数学家不把逻辑学家当回事

王浩曾经抱怨数学家不把逻辑学家当回事。图灵也有类似的说法：逻辑学家给数学家提供了有营养的饭菜，但做的不够美味，数学家不爱吃。

逻辑似乎处于一切科学的底部，因为逻辑探索一切事物的本质

两个 Alpha-zero 下棋，我们人类已经看不懂了

3 从专家系统到知识图谱

The test of all knowledge is experiment. —— Feynman Lectures on Physics（《费曼物理学讲义》）

3.1 机器归纳法：用现在的话说就是机器学习 3.2 知识表示

知识表示一直是人工智能不温不火的一个领域，催生者是专家系统和自然语言理解。

逻辑是最方便的知识表示语言

心理学与语言学

知识表示的另一个来源是心理学和语言学，例如概念的上下位继承关系最方便的表示方式是树而不是一阶逻辑。

心理学家米勒和 Chomsky 等一起开拓了认知科学，他最出名的论文大概就是那篇“魔力数字七”（The Magic Number Seven）。

Minsky 的框架：面向对象

框架（Frame）就是类型。

金丝雀是鸟，所有鸟的性质自动流传给金丝雀，鸟能飞，金丝雀也能飞。
苹果手机是手机，手机能打电话，苹果手机也能打电话。

框架导致了面向对象（OO，Object-Oriented）的设计哲学，相关的程序设计语言都受此影响。

当一个概念有了成熟的实现时，就自动脱离了人工智能

从这个意义上还真验证了：当一个概念有了成熟的实现时，就自动脱离了人工智能。

3.3 知识库把人类的常识编码，建成知识库

想法：把人类的常识编码，建成知识库。这个新项目叫 Cyc，这其实就是最早的知识图谱。

雷纳特坚定地支持他老师费根鲍姆的知识原则（Knowledge Principle）：一个系统之所以能展示高级的智能理解和行为，主要是因为在所从事的领域所表现出来的特定知识：概念、事实、表示、方法、比喻以及启发。
雷纳特甚至说：“智能就是一千万条规则。”

Cyc 的原始目标更像是当今的维基百科，不过维基百科的受众是人，Cyc 的用户是机器。

学习只在已知事物的边缘发生

3.4 语义网（HTTP/HTML）

3.5 计算机科学的划分

计算机科学的划分

3.6 对知识做梳理是人类最早的智力活动之一

对人类的知识做梳理是人类最早的智力活动之一，也是人类的集体自我意识。

4 第五代计算机的教训

People learn from history that people never learn from history. – Georg Wilhelm Friedrich Hegel（黑格尔）

Those that fail to learn from history, are doomed to repeat it. Winston Churchill（丘吉尔）

日本早年神经网络研究的先驱福岛邦彦和甘利均一。

当下流程的卷积神经网络 CNN 的源头就是福岛邦彦的工作。

在福岛邦彦和甘利均一的壮年，日本都把资金投入到了五代机，他们没赶上好时候。

5 神经网络简史

I bet the human brain is a kludge. Marvin Minsky

自图灵提出“计算机与智能”起，就一直有两派观点：

一派认为实现人工智能必须用逻辑和符号系统，这一派看问题是自顶向下的；
还有一派认为通过仿造大脑可以达到人工智能，这一派是自底向上的，他们认为如果能造一台机器，模拟大脑中的神经网络，这台机器就有智能了。

5.1 神经网络的初创文章，1943

A Logical Calculus of the Ideas Immanent in Nervous Activity, 1943

神经网络的开山之作：A Logical Calculus of the Ideas Immanent in Nervous Activity，发表在 Bulletin of Mathematical Biology 上。

这篇文章成了控制论的思想源泉之一。
这篇文章只列了三篇貌似不相关的参考文献，卡尔纳普的《语言的逻辑句法》，希尔伯特和他学生阿克曼合著的《数理逻辑基础》，怀特海和罗素的《数学原理》。

5.2 维纳

维纳无论如何首先是一位严谨的数学家，而 McCulloch 则被人称为是浪漫的科学家。所谓“浪漫”不是指生活，而是说他对科学思想的表述方式。

维纳曾经把为大脑建模作为他学术生涯的最后野心。

强化学习之路：维纳 -> 阿比卜 -> Andy Barto -> Richard Sutton

阿比卜的“杂学”体现在他那本科普书《大脑、机器和数学》里，其实他本科毕业论文已初露端倪，题为“Turing Machines, Finite Automata, and Neural Nets”。

5.3 罗森布拉特和感知机

Perceptrons: An Introduction to Computational Geometry

影响巨大、“是也非也”的书：《感知机：计算几何学》（Perceptrons: An Introduction to Computational Geometry）。

在书中，Minsky 和佩珀特证明单层神经网络不能解决 XOR（异或）问题。
异或是一个基本逻辑问题，如果连这个问题都解决不了，那神经网络的计算能力实在有限。

5.4 神经网络的复兴解决 XOR 问题：神经网络多加一层+后向传播

1974 年，哈佛大学的一篇博士论文证明了在神经网络多加一层，并且利用“后向传播”（back-propagation）学习方法，可以解决 XOR 问题。

Paul Werbos 这篇文章刚发表时并没引起多少重视，那时正是神经网络研究的低谷，文章不合时宜。
Paul Werbos 也是递归神经网络 RNN 的原创者。但在深度学习大火后，他的兴趣转向了量子力学。

Hopfield 神经网络：来自物理学而非生物学的突破

神经网络在 20 世纪 80 年代的复兴归功于物理学家 John Hopfield。

1982 年，Hopfield 提出了一种新的神经网络，可以解决一大类模式识别问题，还可以给出一类组合优化问题的近似解。这种神经网络模型后来被称为 Hopfield 网络。
1984 年，Hopfield用模拟集成电路实现了自己提出的模型。

Hopfield 模型的提出振奋了神经网络领域。

神经网络的这次复兴和生物学没啥关系，它既不是来自生物学的刺激，也没有给生物学送去任何慰藉。
倒是它来源于物理学家，并引起了物理学家的关注，曾经一批对复杂系统感兴趣的物理学家在交叉学科杂志上接二连三地发表文章。

连接主义运动（Hinton）

两位心理学家鲁梅尔哈特（David Rumelhart）和麦克利兰德（James McLelland），
一位计算机科学家辛顿（Geoffrey Hinton）。

Rumelhart -> Michael Jordan -> Andrew Ng

连接主义运动也培养了一堆新人，并使得加州大学圣地亚哥分校的认知科学系成为同类系科的佼佼者。

Rumelhart 后转往斯坦福大学任教，乔丹（Michael Jordan）就是他的学生，而吴恩达（Andrew Ng）又是乔丹的学生。
Rumelhart 的另一名学生格 Robert Glushko 后来远离本行，跟随硅谷互联网早期英雄 Marty Tennenbaum 创立了一家公司，赚了一票钱。格鲁什科捐钱设立了“Rumelhart 奖”来奖励神经网络的研究者，辛顿成了第一位获奖者。

Chomsky：统计的方法不优雅，只是模仿而不是理解

Chomsky 认为统计的方法不“优雅”（elegant），只是模仿而不是理解。 会骑自行车不算理解，对自行车为什么不倒，能说清道理，才算理解。

Peter Norvig：在理解之前不妨碍模仿先上

5.5 深度学习

神经网络在 20 世纪 80 年代的光芒被后来的互联网掩盖了。

但这几年，恰恰是互联网产生的海量数据给了神经网络更大的机会。
人工智能学者在计算机系曾经是最抬不起头的，这几年却人人都变成了大知识分子。

网络对应的概念：一层网络就是一个函数

神经网络由一层一层的神经元构成。层数越多，就越深，所谓深度学习就是用很多层神经元构成的神经网络实现机器学习的功能。理论上说，

如果一层网络是一个函数的话，多层网络就是多个函数的嵌套。
网络越深，表达能力越强，但伴随而来的训练复杂性也急剧加大。

Hinton 2006：降维和逐层训练，使深度学习的实用化成为可能

辛顿是深度学习的先驱，他和学生在 2006 年发表的两篇文章开辟了这个新领域，

登在《科学》上的那篇提出了降维和逐层预训练的方法，使得深度学习的实用化成为可能。
深度神经网络最后几层的每个节点都可对应于某些概念。这是神经网络的一大进步，调和了与符号派的矛盾。至于符号派买不买账，就是另一回事了。

6 计算机下棋简史

Play is the beginning of knowledge.—— George Dorsey

6.1 图灵， ~1944

二战没结束时，图灵就研究计算机下棋，他 1947 年编了第一个下棋程序。
Donald Michie 是图灵的追随者，1950 年试着在纸上模拟程序，和图灵对弈。
Dietrich Prinz 接着图灵的思路，在 1951 年写了一个残局程序，能在离将死还有两步的情况下，找到最优解。这个问题也被称为“两步将死”（mate-in-two）问题。

6.2 冯诺依曼，《博弈论》提出 MiniMax 算法， 1944 《博弈论》, 1944

几乎和图灵同时，冯诺伊曼也在研究计算机下棋，他和经济学家摩根斯顿合作的《博弈论》1944 年出版，其中首先提出两人对弈的 Minimax 算法。

Minimax 算法中，二人对弈的一方为 max，另一方为 min，max 一方的评估函数要越高越好，min 一方的则越低越好。

max 和 min 的对弈就形成了博弈树。
树的增长是指数式的，当树很深时，树的规模会变得不可控。
麦卡锡首先提出α-β剪枝术以控制树的增长。

6.3 香农：开创计算机下棋的理论研究，1950 Programming a Computer for Playing Chess, 1950

香农把棋盘定义为二维数组，
每个棋子都有一个对应的子程序计算棋子所有可能的走法，
最后有个评估函数（evaluation function）。

传统的棋局都把下棋过程分为三个阶段：开局、中局和残局，不同阶段需要不同的技术手段。

香农的论文引用了冯诺伊曼的《博弈论》和维纳的《控制论》。

6.4 IBM 深蓝战胜卡斯帕罗夫， 1997

6.5 AlphaGo：首次引入了强化学习

强化学习 80 年代就发明了，但一直不被重视，是 AlphaGo 使得它焕发新生。

7 自然语言处理

the noblest pleasure is the joy of understanding - Leonardo da Vinci

It is not our aim to refine or complete the system of rules for the use of our words in unheard-of ways. - Wittgenstein

7.1 Chomsky 《句法结构》

Chomsky 之于语言学和认知科学，就像图灵之于计算机科学。他认为，

所有的语言（人工或自然）都有类似的句法结构，
语言的结构是内在的，而不是通过经验习得的，
代表作《句法结构》。一本小册子，不需要什么背景就能读。

Brown (1988，1990)是统计派的奠基作品，正文只有 6 页，虽是学术论文，却非常可读。

经验主义靠近科学，理性主义靠近数学

从某种意义上说，行为主义是极端的经验主义。

所有黑盒理论，无论是神经网络还是统计派，在 Chomsky 眼里都属行为主义。
Chomsky 认为理论应该先于事实。他常以遗传学祖师爷孟德尔为例，但孟德尔常常删改不支持理论的数据。

Chomsky 认为心身（mind-body）问题是个伪问题，难度倒不在于如何定义 mind，而在于连什么是 body 这样貌似简单的问题都无法明确地说清。

他认为 mind 的研究终究会变成像物理学、化学那样的学问，只不过现在还要用心理学的术语逐步获得进展。
语言学是突破口之一，由此可以找到 “mind” 的物理机制。
从这个意义上说，Chomsky 也不完全反对经验主义。

语言学的牛顿？

科学方法素有 explanation 和 redescription 之分。

统计方法可看作一种 redescription，但不是 explanation。
Chomsky 不认可语言学的统计方法。

活着的人里被引用次数最多的知识分子？

Chomsky 是活着的人里被引用次数最多的知识分子，即使从苏格拉底算起，他的引用数也可排进前十。

他的时事评论几十年来都被广为关注，这一点颇像他的偶像罗素。他的独特政治观点体现在他对当代政治事件的评论上。
人们轻率地把 Chomsky 划为左派，其实，他是反建制者，永远怀疑权威，永远同情人民。
Chomsky 作为犹太人，却不被以色列接受，因为他同情巴勒斯坦的立场。以色列甚至拒绝给 Chomsky 发签证。
Chomsky 在任何地方的学术演讲，最后总要“饶”一段儿同等时间的政治评论，就像演出的返场。

Chomsky 敬仰的人不多，无政府主义者乔治·奥威尔是一个，罗素是另一个。很多人拿 Chomsky 和罗素做比较，

罗素在出版了《数学原理》后很少再有原创的知识贡献，兴趣转向政治；
Chomsky 在《句法结构》之后也成为一位社会活动家和公共知识分子。

但 Chomsky 仍然不断有科学成果出来。罗素被下过两次大牢，Chomsky 1967 年因为反越战被捕，和诺曼·梅勒关在一起。

7.2 统计派又来了我每开除一名语言学家，语音识别系统的性能就提高一点

其实最早提出机器翻译的 Warren Weaver 的思路就是统计。但 Chomsky 登场后，统计方法基本就没饭吃了。

Chomsky 的理由很简单，语言的可能性是无限的，统计不可能解决问题。 Chomsky 对统计方法的排斥，恰似波普尔对卡尔纳普归纳法的批判。
Chomsky 不喜欢统计派的一个理由是他们太像行为主义了：在翻译的统计方法中，平行语料的左边就是刺激，右边就是反射。

工程师根本不需要语言学知识，也不需要懂源语言或目标语言

7.3 神经翻译是终极手段吗？ Google Neural Machine Translation (GNMT), RNN-based, 2016

2016 年，谷歌发布神经机器翻译（GNMT）系统，再次大幅提高机器翻译的水平。

和谷歌更早期的 Phrase-Based Machine Translation (PBMT) 不同，神经翻译的基本单位是句子，
谷歌使用了循环神经网络 RNN 做 Sequence to Sequence 的学习，
硬件设备是谷歌自己的 TensorFlow 平台。

神经翻译相比谷歌早期的基于短语的翻译系统，误差降低了 60%，这是翻译质量巨大的提升。这项工作已经开源。

Facebook, speed 10x, CNN-based, 2017

他们的结果在准确度上不输谷歌，
而在计算速度上则比谷歌的 RNN 有一个数量级的提升。

RNN 和 CNN 两种神经网络架构，分别被谷歌和 Facebook 支持。性能的此消彼长也被视为两家公司的竞争。真难预料神经网络还有多大的潜力可以挖掘。

翻译只是数据问题，不是语义问题？

Chomsky 们也许会接着质疑，这种翻译算理解吗？

也许翻译根本就不是理解的问题，翻译本身并不需要解释，翻译只是翻译而已，翻译只是数据问题，而不是语义问题。

没有 Chomsky，我们还要在黑暗中摸索，但有了 Chomsky，是不是又曾经束缚了我们探索其他方法的可能性。

7.4 IBM wason：知识库/知识图谱+浅层推理

现在的问答系统依靠常识和知识，同时也依靠浅层的推理。知识图谱是核心。

在 Jeopardy！节目中出现过的问题，95% 都能在维基百科中找到答案。

沃森参赛的版本的知识库只有 4TB，其中包含了所有维基百科的正文，真的不大。
除了半结构化的知识图谱，沃森还使用了开源搜索引擎。

把搜索的结果文档的标题与维基百科词条进行匹配，如果在维基百科中能找到，就把搜索结果列入候选答案。再把候选答案反馈给搜索引擎，进一步对返回结果做证据支持的处理，然后给出答案。
硬件系统是一个有 90 台 IBM Power 750 的集群，每台配一个 IBM Power 78 核处理器，每核 4 线程，所有一共 720 核，2880 线程；内存 16TB，所有的知识图谱都放在内存里了。

按照 Linpack 基准程序，这台计算机的算力相当于当年排名第 500 的超级计算机的一半，成本只有 300 万美元。同沃森带来的巨大广告效应相比，这真不算什么。

7.5 总结一个人工智能问题一旦解决，就不再是人工智能问题

就像一个哲学问题找到了科学的角度（formulation），就不再是哲学问题一样，一个人工智能问题一旦解决，就不再是人工智能问题。

大概很快人们就会认为语音问题不再是人工智能的核心问题。
如果说语音翻译不涉及自然语言理解和语义，可能也不会有什么异议。

2011 年 5 月，麻省理工学院为配合 150 周年校庆，召开了“大脑，心，机器”的研讨会（Brain, Mind and Machine Symposium）。

Chomsky 批评当下流行的神经网络和统计方法，Chomsky 认为神经网络是黑盒子，并没有给我们提供解释，故而没有提供知识。
时任谷歌研发总监的诺维格（Peter Norvig）很快回应 Chomsky，他批评语言学的规则在自然语言处理上，根本就没用。

可解释性

有人开始用“两种文化”来总结 Chomsky 和诺维格的隔空掐架。

Chomsky 对人工智能的批评的核心在于“可解释性”。AlphaGo 不能解释自己下棋的路数，算不算会下棋呢？
也可以反过来说，只有解释了，人类才能从中得到洞见，学习知识。但解释是不是也有层次，只有学会牛顿力学，才能学会相对论和量子力学？就如维特根斯坦所说的梯子的比喻，爬上房顶，梯子才能扔掉，梯子就是解释。其实，即使人类在不理解力学的时候，就会造弹弓了。对那时的人类，弹弓的工作原理就是黑匣子。

不求甚解的工程师 vs. 追求终极知识的科学家

Chomsky 和诺维格分别所代表的两种人关心的是两种不同的问题。

一种人力图打造实用的工具，没有解释也能凑合，他们是不求甚解的工程师；
另一种人寻求终极的知识，他们是科学家。

只不过，在计算机科学这个特定的学科中，科学家和工程师的角色变换太快，这门学科的开拓者，很多都是身兼二职，例如图灵和冯诺伊曼

8 向自然学习：从遗传算法到强化学习

Natural selection is a mechanism for generating an exceedingly high degree of improbability. —— Ronald Fisher

自然选择就是能生成极不可能之事的机制。

8.1 从生物学里找计算的模型：两条传承脉络

从生物学里找计算的模型，一直是人工智能的研究方向之一，学术上大致有两条传承的脉络：

McCulloch 和 Pitts 的神经网络，演化到今天成了深度学习；
冯诺伊曼的细胞自动机，历经遗传算法、遗传编程，其中一条支线最后演变成了今天的强化学习。

8.2 John Holland 和遗传算法

Holland 在晚年接受采访时如此评论麦卡锡和 Minsky：

美国西部的人工智能由麦卡锡代表，他们干净（neat），一切讲究逻辑；
东部的领袖自然是 Minsky，他们邋遢（scruffy），做事比较随意（adhoc）。

但他们的共性是都对机器学习不太感兴趣。

Ronald Fisher, 英国统计学家费舍

Holland 说他自己的思想被学界逐渐接受，是在他的学生都出了名之后。

对 Holland 影响最大的一本书是英国统计学家费舍（Ronald Fisher）的《自然选择的遗传理论》（The Genetical Theory of Natural Selection）。
无神论者道金斯（Richard Dawkins）称费舍是达尔文之后最伟大的生物学家。

进化和遗传是族群学习的过程，机器学习可以此为模型

费舍把孟德尔的遗传理论和达尔文的自然选择结合起来。 Holland 由此得到启发：进化和遗传是族群学习的过程，机器学习可以此为模型。

遗传算法

遗传算法就是模拟种群（population）的进化过程。其结构可以用下列伪代码大致表示。

随机生成初始群体。
主循环（停机的标准可以是迭代次数，或者适应度达到某个要求）。
- 2.1 执行策略，计算当前群体中所有个体的适应度；
- 2.2 从当前群体中，选择精英作为下一代的父母；
- 2.3 将选出的精英父母配对；
- 2.4 以极小概率将子代变异；
- 2.5 将子代个体添加到新群体中。

从程序中，我们马上可以理解进化中“优胜劣汰”的算法含义。

8.3 遗传编程

遗传编程的结构和遗传算法差不多，

一组程序就一个特定的问题给出解答，按照执行结果的好坏给所有程序排序。
程序本身也是数据，自然也可以修改。
在遗传编程里，变异就是对程序做微小调整。
交叉和配对就是将两个表现优异的程序互相嫁接。

寇扎后来还引入了“基因重复”（duplication）和“基因删除”（deletion）等生物学概念，以提升遗传编程的效率。

遗传算法本身就需要大量的数据，遗传编程需要的数据量自然更大，这对计算能力提出了新的需求。

遗传算法的稳定性一直就是研究课题，遗传编程的数学性质自然更加复杂。

8.4 强化学习

“人工智能”这个词儿的流行是在 20 世纪 70 年代中期，按照阿比卜的一家之言：人工智能是控制论的替代品，至少从时间轴上看，这不算错。

一个刚出生的孩子，怎么学会对环境的适应

巴托和萨顿关心更原始但也更抽象的可适应性。一个刚出生的孩子，怎么学会对环境的适应。

在监督式学习中，目标是清楚的。
但婴儿不知道目标是什么，不知道自己要什么。通过与外部世界的不断交互，婴儿受到奖励或惩罚，由此强化对外部世界的认知。

数学基础：马尔科夫决策过程和动态规划

强化学习的理论基础之一是马尔科夫决策过程。

强化学习的主体是 Agent，Agent 和环境互动。
强化学习就是 Agent 根据经验改变策略以期达到长期最大奖赏的过程。

强化学习的另一个理论基础是动态规划。

贝尔曼（Bellman）在 20 世纪 50 年代就发明了动态规划。
萨顿和巴托也承认在强化学习早期，受到动态规划的启发。巴托一度在他的强化学习讨论班上让研究生分工研读贝尔曼的经典著作《动态规划》（Bellman 1957）

在计算能力的约束下，强化学习的环境不宜太复杂

萌芽期的强化学习的例子都是游戏，如贝尔曼的“老虎机 ”和塞缪尔（Samuel）的跳棋。
游戏的环境相对容易定义，在棋类比赛中，环境就是对手和规则。
强化学习被用来下围棋不是偶然的。

如果整个世界是完全随机的，那么强化学习就要失效，学还是不学对结果没有什么影响。

巴托和萨顿有时也把强化学习称为“享乐主义”（hedonistic），也即学习系统想最大化环境对自己的某种反馈。

exploration vs. exploitation

learning rate

在强化学习中，用希腊字母 ε 表示学习率（learning rate）， 值越小，能用于探索的时间就越少，绝大部分时间是在苦干。

减少状态空间搜索

蒙特卡洛模拟是一种减少状态空间搜索的有效办法。
最近也有利用深度学习来压缩需要表示的状态空间数目。这还有点意思，本来强化学习初衷是探索生物体学习的模型，现在神经网络又成了强化学习的工具。

当状态空间很大时，强化学习可以和蒙特卡洛方法或深度神经网络结合，就使用了蒙特卡洛方法

AlphaGo 让强化学习一夜之间成为显学

萨顿：开创强化学习，留有一点控制论的影子

萨顿 1979 年到麻省大学跟随巴托和阿比卜，由此开创强化学习。

他一直认为强化学习是理解智能的关键。
在整个人工智能的各个分支里，大概只有强化学习还留有点儿控制论的影子。

强化学习 vs. 监督式学习：第一人称叙事 vs. 第三人称叙事

如果从写作的角度看，

强化学习更像是第一人称叙述，Agent 就是“我”，外部世界（包括他人）都是“环境” 。
监督式学习更像是第三人称叙述，作者在用一只上帝的眼睛洞察世界，对错分明。

第一人称的学习要比第三人称的学习更本质。

8.5 计算向自然学习 vs. 自然向计算学习

喜欢的人认为这为进化论找到了新视角，而不喜欢的人则批评杂志的编者和作者是为了博眼球。
这篇文章质疑了性在进化中的作用。
哈佛大学的理论计算机科学家、图灵奖获得者 Leslie Valiant 曾经从计算的角度研究过机器学习和进化，他把进化当作学习的特例。Livnat 和 Papadimitriou 认为有性繁殖不太容易达到最优点，而无性繁殖才更像是优化算法，他们把遗传算法比作有性繁殖，模拟退火算法比作无性繁殖。

如果说遗传算法是微观地向生物内部机制学习的话，强化学习则是更为宏观地向自然学习。

8.6 生物学激发的学科都缺乏计算理论的基础

无论是遗传算法、深度学习还是强化学习，都缺乏计算理论的基础。

生物学激发的学科都是模拟自然，它们都不需要解释，不需要了解内部原理，而只要能查看输出结果就够了。
数学大概是所有学科中离生物学最远的学科。

8.7 参考资料整体大于局部之和：涌现（emergence）现象

Holland (1975)是遗传算法的原创著作。

Sutton and Barto (1998) 强化学习的原创著作

Sutton and Barto (1998) 是强化学习的原创著作，在网上可免费获取。

强化学习的教科书里最爱用的 Q-learning，是 Chris Watkins 1989 年在他的剑桥博士论文里提出的。

科普文章：“谁能说出更大的数”

9 哲学家和人工智能

The real discovery is the one that makes me capable of stopping doing philosophy when I want to, the one that gives philosophy peace. ——Wittgenstein（维特根斯坦）

9.1 两类哲学家：深刻的和混饭的

哲学家不一定懂哲学，就像相声演员不一定会说相声，这是低门槛行业的通病。

《计算机不能干什么》，1965 是对《炼金术与人工智能》的扩充，对人工智能的全面批评。

哲学家有两类，一类是深刻的，一类是混饭的。

罗素和弗里格是深刻的，没有他们，就不会有数理逻辑，也就不会有哥德尔、丘奇、图灵，以及后来的计算机科学。
但没有现代的欧陆哲学，世界不过省了些粮食而已。

按照德雷弗斯们的说法，哲学系是不是应该要求读现象学的博士必须熟练掌握一门面向对象的程序设计语言？

德雷弗斯曾经引用梅洛庞提批判人工智能：人脑是和环境直接交流的，而不是通过表示（representation）。

9.2 塞尔和中文屋

1980 年塞尔在《行为与脑科学》杂志上发表了 Minds, Brains and Programs 一文。文中的一个思想实验“中文屋” 马上成为最喜欢被引用的假想实验之一。

“中文屋”思想实验

“中文屋”思想实验是这样的：

假设有个只懂英文不懂中文的人（“我”）被锁在一个房间里，屋里只给“我”留了一本手册或一个计算机程序， 这个手册或程序教“我”在收到中文信息时如何用中文应对。
屋外的人用中文问问题，屋里的“我”依靠程序用中文回答问题，沟通方式是递纸条。

塞尔的问题是：如果屋外的人不能区分屋里的人是不是母语为中文，那么屋里的“我”是不是就算懂中文？

塞尔自己认为“我” 不懂中文。很明显，这个场景源自图灵测试，只不过图灵测试的环境是英文，而中文屋里既有中文又有英文。

解读

塞尔的文章出来后，引起轰动。其实轰动的原因很简单：谈论这种玩意儿没什么门槛，谁都可以说三道四：哲学家、科学家，以及各种媒体人。

塞尔毕竟是老练的哲学家，已经预测大家会质疑他的论断，他在文尾也设想了各种回答。

第一个问题是，我们只是算屋里人理解中文呢，还是屋子加人作为一个系统理解中文。塞尔的论断是屋里人即使查遍手册，顶多算是理解语法，而不算理解语义。
我们可以问塞尔这样的问题：一个坐飞机的人算能飞吗？如果对这些问题的答案都是“算” ，那中文屋作为一个系统为什么不算理解中文呢？

塞尔认为必须内化（换句话说：手册必须变成人身的一部分）才能算懂中文，那么内化到什么程度才能算呢？

爱因斯坦说“我的笔加上我要比我自己聪明”，笔算不算外化？
内化是完全的物理隐藏，还是只是个反应时间问题？在一开始查手册时，反应时间必定很慢，但熟能生巧之后，查手册变成下意识的动作，那算内化吗？
内化和辅助工具的大小也有关系。如果语音识别工具是桌面电脑，我们可能不会认为对话中的两个人理解了对方的语言。但如果这个工具可以微型化，直接内化到耳朵里，那算不算理解？

反“强人工智能”

塞尔认为他不是反人工智能，他只是反“强人工智能”。

假设游戏不是中文翻译，而是下棋，那 “我” 算不算会下棋？断言中文屋是不是有智能，就像断言 AlphaGo 会不会下围棋一样，要看应用场景。

9.3 普特南和缸中脑思想实验：缸中脑

1981 年普特南出版了《理性、真理与历史》（Reason, Truth, and History）一书，该书的开篇就给出了“缸中脑”的假想实验。

Wikepedia Brain in a vat:

普特南更进一步设想，假设所有的感觉器官都泡在缸里，而外面的世界就是一台大自动机。

缸中脑知道如何与外部世界做对应吗？泡在缸中的人脑，如何知道自己是颅中脑，还是缸中脑？

人工智能的基本问题是可否造一台机器能有智能， “缸中脑”中的机器则起了另一种作用：人脑是否能确定外在的世界是直接实在还是间接实在。

《黑客帝国》、《盗梦空间》

科幻电影《黑客帝国》（Matrix）、《盗梦空间》（Inception）等都受“缸中脑”思想实验的启发。

9.4 给哲学家一点忠告哲学指导科学？

哲学空洞化

整个人工智能就是个大的假想实验

10 人是机器吗？——人工智能的计算理论基础

humans are nothing but meat machines that carry a computer in their head. —— Marvin Minsky

10.1 人是不是机器？

认为人是机器的，道理很简单：人也是由各种物理化学机制构成的，当然是机器了。早有法国哲学家美特里，现有 DNA 双螺旋结构发现者克里克，都持这种观点。克里克认为在不远的将来，生命可以在试管中合成。
认为人不是机器的，论据是人有很多功能，目前机器无法完成，尤其是那个叫“灵魂” 的神奇东西。

《论可计算的数》和图灵机的定义

一条无穷长的纸带，
一个读写头在一个控制装置的控制下在纸带上方左移右移，读取纸带上的内容并在纸带上写 0 或 1。

图灵的初衷是让他的机器模仿人类计算者。

同源问题和相关问题

如果人是机器，那是模拟机器还是数字机器？

按照冯诺伊曼的说法，神经系统的本质是数字的，尽管构成神经系统的化学和生物过程的描述可能是模拟的。
现代物理学的一个假设是整个宇宙都是离散的，也即数字的。
人工智能符号派的基础之一是所谓“物理符号假设”，这个假设要求计算装置必须是数字的，或者说变量必须是离散的。
费曼就曾说世界是数字的。

如果机器是数字的，那么图灵机就是简单又有力的模型。 对于离散的量，二进制就足够了。

10.2 Church-Turing Thesis：为什么图灵机是最重要的发明？

在人类发明的所有计算装置中，图灵机是直觉上最简单最可靠的。

通用图灵机和冯诺依曼架构

图灵在发明图灵机时，还定义了 Universal Turing Machine，简称 UTM，译为“广义图灵机/万能图灵机/通用图灵机”。

UTM 的核心思想就是一个图灵机的执行过程也可被编码成数据，放到纸带上，因此一个图灵机可以通过执行纸带上的程序来模仿另一个图灵机的行为。这台能模仿其他图灵机的图灵机就成了通用图灵机。
这是一个很深刻的思想，现在的软件产业都得益于此：被编码的图灵机就是软件。
后来冯诺伊曼设计的计算机被人称为冯诺伊曼架构，其最核心的思想就是存储程序（Stored Program）。这个思想其实就是来自万能图灵机：被编码的图灵机就是存储的程序。

纯逻辑或数学的东西联系到物理世界：函数 -> 纸带和读写头

冯诺伊曼把计算机的所有原创思想的功劳都给了图灵，并批评那些对图灵机实际意义缺乏认识的人。

有了图灵机，我们就很容易把原来是纯逻辑或纯数学的东西（例如递归函数和λ演算等） 和物理世界联系起来了，函数成了纸带和读写头。

10.3 不可能存在比图灵机更强的计算装置

Church-Turing Thesis 的一个自然结果就是，不可能存在比图灵机更强的计算装置。

20 世纪 80 年代初就有人证明三层以上的神经网络可以逼近任意连续函数。
80 年代末期，Steve Judd 证明三层以上的神经网络学习问题在图灵机上是 NP 完全的。
本书作者证明了在 BSS 模型上，类似的神经网络学习问题等价于线性规划问题。

10.4 BBS 实数模型

其实即使在数值分析之外，我们经常做类似的假设，例如，在排序算法分析中，任意精度的数（可能是实数）之间的比较是单位时间的。

按照费曼的说法，宇宙是数字的，换句话说，宇宙不是连续的实数，空间是一种网络，而时间也不是连续的。

10.5 量子计算

《费曼计算机科学讲义》

IBM 是计算物理学的源头。计算的物理学研究有实际需求。

图灵机的物理约束

从计算的角度看，图灵机只有数学约束而没有物理约束。

从真实世界看，一个可能的物理约束是能量：图灵机的读写头和纸带的运动是需要能量的。

逻辑运算与能量的关系

现代计算机的组件是逻辑门，有两种门，

可逆的，如“非门”；
不可逆的，如“与门”。

IBM 的物理学家朗道尔（Rolf Landauer）在 1961 年提出了朗道尔原理：任何不可逆计算都需要能量。

量子计算机：（在对的时刻）测量而非（一步步）计算

费曼考虑的问题是如何以任意精度来模拟一个物理系统。他的方法是构造一台量子计算机，它求解问题的时间不随问题的规模呈指数增长。

量子计算并不是一步一步的经典计算，而只是测量系统的输出结果。

费曼认为测量本身也是一种计算。

当计算量很大时，最简单的方式是让自然界自己该干啥干啥，而在对的时刻测测结果就可以了。

举例：子弹的弹道，生成随机数

举一个不精当的比喻，想知道子弹的弹道，两种方式，

考虑所有可能外部内部因素，依靠计算；
让子弹飞，然后测量。

随机数可以通过伪随机函数生成，也可以通过测量一些噪声源得到。图灵 1949 年就研究过通过外部电子噪声源得到随机数的方法。

在图灵机上很难求解的问题有可能在量子计算机上用多项式时间解决。其中最热门的问题是素数分解。

10.6 计算理论的哲学寓意神经网络研究者数学和计算理论功底的缺乏

从当下人工智能的浮夸风气中，没看出吸取了什么教训。

Donald Knuth：量子力学为自由意志提供了空间，也使得上帝可以操纵世界而不违反物理定律

Donald Knuth（计算机科学家中位数不多的有神论者）说量子力学为自由意志提供了空间，也使得上帝可以操纵世界而不违反物理定律。

我很少看到计算机科学家敢对物理学家说三道四，姚期智大概是唯一的例外。

11 智能的进化

Science is what we understand well enough to explain to a computer. Art is everything else we do. —— Donald Knuth

11.1 Human Advantage: How Our Brains Became Remarkable

畅销书，并被翻译为多种语言。2017 年该书中文版以《最强大脑》为题出版。
创造的“大脑汤”（brain soup）的方法最终使她成功地测定不同动物大脑的神经元数量。
书中不仅有研究成果，还有更有意思的研究过程，包括她是如何把大象的大脑从非洲弄到美洲的新奇故事。

脑结构和神经元数量

不同动物的脑构造有所不同，脑中的神经元数量也完全不同，

人脑中总共有 860 亿个神经元（用 LLM 术语来说就是 86B），其中大脑皮层有 160 亿个神经元（16B）。 大脑皮层的神经元数量决定了动物的智力水平，人的大脑皮层中神经元数量远高于其他物种，所以人类比其他物种更聪明。
大象的脑子总共有 2570 亿个神经元，但是其中 98% 的神经元都存在于小脑中。大脑皮层只有 56 亿个神经元，无法与人类相比。

神经元数量越多，能耗也越大

大脑皮层中的神经元数量越多，能耗也越大。

人脑每天消耗的能量占人体全部耗能的 25%。人之所以能够很快超越其他物种，主要是因为人类掌握了烹饪技术。能够在短时间内摄入大量卡路里以支持大脑运转。
其他物种则将摄入的卡路里用于维持身体运转，不得不牺牲大脑皮层的神经元数量。

用不同的时间粒度看待过去，会得到不同的结论

《尤利西斯》中的几个小时，茨威格作品中人物的一生，或赫拉利的七万年，关心不同的过程。
粒度也可以是主体的，一个基因，一个人，一个群体，不一定非得是一个小的物质颗粒只配得上小的时间单位。
想想基因人类学，基因在几万年的空间分布，帮我们了解人类的起源和迁移。
当用太大的颗粒度研究历史时，历史学家的用处会令人质疑。

11.2 机器：从代替人的体力到代替人的智力

过去的机器旨在节省人的体力，现在的机器开始代替人的智力。

人作为物种，不再具备进化的竞争优势？

人通过两性繁殖的进化速度远远赶不上机器。

机器的进化速度服从摩尔定律——每 18 个月性能提升一倍，而人的进化速度则是 20 年一代人。
人作为物种，是不是不再具备进化的竞争优势？
依靠硬件的摩尔定律，是不是可以达到超级智能？

新的智能形态：Agent？

新的智能存在可以是人工智能的 agent，也可以是生物学意义上的物种。

11.3 基因修复的伦理问题

通过修复一个受精卵的一小段染色体，就可以避免或治疗某种疾病。这是一个真实的伦理问题，因为已经有这样的病例发生。

如果孩子出生，那么他/她的父母是谁？
多小算是“一小段”，1% 还是 49%？
更进一步：可不可以有更多不同来源的基因参与？
英国《经济学人》2017 年 2 月的一期封面标题就是“Sex and Science”

11.4 机器人三定律之一：机器不能伤害人

维纳曾经说：“我们最好能够确认，我们给机器设定的目标确实是我们想要的。”

12 当我们谈论生死时，我们在谈论什么？

I don’t want to achieve immortality through my work; I want to achieve immortality through not dying. —— Woody Allen（伍迪·艾伦）

12.1 苏格拉底之死和《斐多篇》

苏格拉底说：哲学家只研究“正在死”（dying）和“刚刚死”（being dead）。除了这个啥都不管。

耶稣之死和苏格拉底之死不同，耶稣完成了使命，苏格拉底留下了一堆问题。

12.2 作者和苏格拉底之间的假想对话

挺有意思的一段哲学对话，关于“永生”，这里就不放了，感兴趣可以网上搜搜，或者读完这份笔记觉得这本书不错，买本电子/纸质书支持下作者。

科学史对科学也有还原论（reductionism）和涌现论（emergentism）之分，规则派接近还原论，统计派可以算作涌现论。

附录附录 1：图灵小传

曼彻斯特的公园里，图灵雕像的底座，引用了罗素的话：“数学不仅有真理，也有最高的美，那是一种冷艳和简朴的美，就像雕塑。”

伯特兰·罗素，《西方哲学史》

附录 2：人工智能前史：图灵与人工智能

图灵 1950 年在英国哲学杂志 Mind 上发表文章“计算机与智能”，文中提出“模仿游戏”，被后人称为“图灵测试”。

这篇文章被广泛认为是机器智能最早的系统化科学化论述。
但图灵在 1941 年战时就开始思考机器与智能的问题，1947 年图灵在伦敦皇家天文学会就机器智能发表演讲。1948 年图灵把这次演讲整理成文章，题为“智能机器”（“Intelligent Machinery”），作为英国国家物理实验室（NPL）的内部报告，但没有公开发表。
这篇文章迟至 1969 年才在年刊型论文集《机器智能》上发表。但由于和 1950 年文章的题目类似，并没有引起人们的重视。

1948 年的文章对智能的概念采取了更宽泛的说法，图灵探讨了大脑皮层，

他认为婴儿的大脑皮层是非组织的（unorganised）。
在图灵的用语里，“非组织”就是“通用”的意思，发育的过程就是组织化的过程。
他指出人身上的任何小部件都可以用机器来模仿，他还提到基因、进化和选择。

图灵进一步预测到 2000 年，机器内存会达到 1GB（预测这么准还真挺神）。

这篇文章为后来的一系列后学者模仿的文章提供了范文的效果，例如塞尔的“中文屋”和普特南的“缸中脑”。

附录 3：冯诺依曼与人工智能

Talent hits a target no one else can hit; Genius hits a target no one else can see. —— Schopenhauer（叔本华）

附录 4：计算机与智能，turing paper

建议参考翻译，阅读图灵的原 paper。

后记

本书的写法比较偏重基础和方法论，而不太注重应用。

费曼在加州理工学院教书时，学期的最后一节课都是请学生问问题，只要不涉及政治、宗教和期末考试，什么问题都可以问。

本书也参考这一方式，回答读者几个问题：

问：这次的人工智能是泡沫吗？
答：人工智能和人们关心的某些终极问题有关，这些问题过去是哲学家和科幻作家的地盘， 计算机科学为人们提供了用科学和工程的手段回答这些问题的方法，旁人自然会对这些方法存在过高的期望，过高的期望自然也会带来过高的投资。泡沫的破裂就是投资的失败。比人工智能更年轻的互联网，起伏的周期更短。从投资的角度看，某些特定的人工智能应用领域确实存在过热现象。
问：算法、数据和算力，哪一项对这次人工智能的复兴贡献最大？
答：我正在对这个问题做一项定量的研究，但目前还没有确定性的结果。要我猜的话，贡献排序应该是：算力、数据和算法。没有足够的算力，就没有办法处理海量数据，很多算法的精化是以某些特定的硬件为前提的。 算力的提升恰好到了一个临界点，使得各种学习算法成为可能。

[译] 从 OpenDeepResearch 背后的设计演进，解读 AI 领域反复学到的一课（2025）

ARTHURCHIAO'S BLOG

3 months 3 weeks ago

本文翻译自 2025 年的一篇文章 Learning the Bitter Lesson。来自 github.com/langchain-ai/open_deep_research 作者。

过去 70 年 AI research 领域学到的最大经验是：以计算作为支撑的通用方法 （general methods that leverage computation）是终极方案（ultimately the most effective），而且大幅领先其他方式。

水平及维护精力所限，译文不免存在错误或过时之处，如有疑问，请查阅原文。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

1 反复学到的一课
- 1.1 AI Research 领域
- 1.2 AI 工程领域
2 以 Open Deep Research 为例
3 总结
4 致谢

Rich Sutton，The Bitter Lesson

1 反复学到的一课 1.1 AI Research 领域

The Bitter Lesson 在许多 AI 研究领域一次次地被证实，比如国际象棋、围棋、语音、视觉。

用好计算（leveraging computation）被证明是最重要的事情，而我们强加给模型的"结构"反而往往会限制它们用好不断增长的计算能力。

这里所说的”结构”是什么意思？

Often structure includes inductive biases about how we expect models to solve problems.

计算机视觉是一个很好的例子。几十年来，研究人员基于领域知识设计了一些特征（例如 SIFT 和 HOG）。但这些人为设计的特征将模型限制在了我们预期的一些模式中。
随着计算和数据的扩展，直接从像素中学习特征的深度网络优于人为设计的方法。

关于这一点，可以看一下 Hyung Won Chung（OpenAI）关于他的研究方法的演讲：

Add structures needed for the given level of compute and data available.
Remove them later, because these shortcuts will bottleneck further improvement.

1.2 AI 工程领域

The Bitter Lesson 也适用于 AI Engineering，如何快速演进的模型之上构建应用。

举个例子，Boris（Claude Code 的负责人）提到 The Bitter Lesson 强烈影响了他的方法。

Hyung 的演讲为 AI 工程提供了一些有用的教训。接下来我通过构建 open-deep-research 的故事来说明这一点。

2 以 Open Deep Research 为例 2.1 添加结构（假设）

2023 年我开发 Agent 非常沮丧：让 LLM 可靠地调用工具很难，而且上下文窗口很小；
2024 年初，转向 Workflow：Workflow 将 LLM 调用嵌入预定义的代码路径中，避免了以上问题；
2024 年末，我发布了一个用于网络研究的 orchestrator-worker Workflow。
- orchestrator 是一个 LLM 调用，它接收用户请求并返回要撰写的 report sections 列表。
- 一组 worker 并行研究并撰写所有 report sections 。
- 最后，将它们简单组合在一起。

那么，这里的”结构”是什么？我对 LLM 应如何快速、可靠地进行研究做出了一些假设，如下图所示：

Planning：将请求拆解为多个报告章节（report sections），
并行研究和分章节独立撰写报告以提升速度，
避免工具调用以提升可靠性。

2.2 结构开始成为瓶颈

2024 年末，情况开始发生变化，工具调用能力快速提升；
2025 年末，MCP 发展迅速，很明显 Agent 开始非常适合研究任务。

但此时，我之前强加的结构阻止了我的框架用上这些改进，

禁止使用工具调用，所以无法用上不断蓬勃发展的 MCP 生态；
Workflow 总是将请求拆解为独立章节，这是一种僵化的研究策略，对很多情况都不适用；
最终报告有时也显得不连贯，因为我强制 worker 并行撰写章节。

2.3 移除结构

最终，我转向了 Multi-Agent 系统，这使我能够使用工具并让系统灵活地规划研究策略。

但是，我设计的新一版系统里，每个 sub-agent 仍然独立撰写自己的 report section。这也是到了 Cognition 的 Walden Yan 提出的问题： Multi-Agent 系统很难，因为 sub-agent 往往不能有效交流。报告仍然不连贯，因为我的 sub-agent 并行撰写章节。

这是 Hyung 演讲的主要观点之一：虽然我们在改进方法，但经常未能去掉之前添加的所有结构。在我这个例子中，我虽然转向了 Agent，但仍然强制每个 Agent 并行撰写部分报告。

最终，我将报告撰写移至最后一步，如下图所示，

系统现在可以灵活地规划研究策略，使用 Multi-Agent 上下文收集，并基于收集的上下文一次性撰写报告。
它在深度研究基准上得分 43.5（前 10 名），对于一个小型开源项目来说已经相当不错了（并且性能接近使用 RL 的和投入明显更多的 Agent）。

3 总结

AI 工程的一些经验总结：

理解你的应用结构（Understand your application structure）

考虑你的应用设计中嵌入了哪些 LLM 性能假设。例如对于我最初的 Workflow ，我避免工具调用是因为（当时）它不可靠，但几个月后情况变了！
随着模型能力的提升，重新评估这些结构（Re-evaluate structure as models improve）

我在重新评估假设方面有点慢了，业界的工具调用能力大幅提升，而我没有及时重新评估假设是否还合理。
让去掉结构这件事情比较容易（Make it easy to remove structure）

Agent 抽象可能带来风险，因为它们可能使去掉结构变得困难。我仍然使用框架（LangGraph），但使用的是其通用功能（例如 checkpointing），而且尽量只使用使用其底层构建模块（例如 node 和 edge），这样我可以轻松地（重新）配置。

构建 AI 应用的设计哲学仍处于初级阶段。但有一点是可预测的：模型会变得越来越强大。理解这一点可能是 AI 应用设计的最重要事情。

4 致谢

Thanks to Vadym Barda for initial evals, MCP support, and helpful discussion. Thanks to Nick Huang for work on the multi-agent implementation as well as Deep Research Bench evals.

[译] 从 OpenDeepResearch 背后的设计演进，解读 AI 领域反复学到的一课（2025）

ARTHURCHIAO'S BLOG

3 months 3 weeks ago

本文翻译自 2025 年的一篇文章 Learning the Bitter Lesson。来自 github.com/langchain-ai/open_deep_research 作者。

水平及维护精力所限，译文不免存在错误或过时之处，如有疑问，请查阅原文。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

1 反复学到的一课
- 1.1 AI Research 领域
- 1.2 AI 工程领域
2 以 Open Deep Research 为例
3 总结
4 致谢

Rich Sutton，The Bitter Lesson

1 反复学到的一课 1.1 AI Research 领域

The Bitter Lesson 在许多 AI 研究领域一次次地被证实，比如国际象棋、围棋、语音、视觉。

用好计算（leveraging computation）被证明是最重要的事情，而我们强加给模型的"结构"反而往往会限制它们用好不断增长的计算能力。

这里所说的”结构”是什么意思？

Often structure includes inductive biases about how we expect models to solve problems.

计算机视觉是一个很好的例子。几十年来，研究人员基于领域知识设计了一些特征（例如 SIFT 和 HOG）。但这些人为设计的特征将模型限制在了我们预期的一些模式中。
随着计算和数据的扩展，直接从像素中学习特征的深度网络优于人为设计的方法。

关于这一点，可以看一下 Hyung Won Chung（OpenAI）关于他的研究方法的演讲：

Add structures needed for the given level of compute and data available.
Remove them later, because these shortcuts will bottleneck further improvement.

1.2 AI 工程领域

The Bitter Lesson 也适用于 AI Engineering，如何快速演进的模型之上构建应用。

举个例子，Boris（Claude Code 的负责人）提到 The Bitter Lesson 强烈影响了他的方法。

Hyung 的演讲为 AI 工程提供了一些有用的教训。接下来我通过构建 open-deep-research 的故事来说明这一点。

2 以 Open Deep Research 为例 2.1 添加结构（假设）

2023 年我开发 Agent 非常沮丧：让 LLM 可靠地调用工具很难，而且上下文窗口很小；
2024 年初，转向 Workflow：Workflow 将 LLM 调用嵌入预定义的代码路径中，避免了以上问题；
2024 年末，我发布了一个用于网络研究的 orchestrator-worker Workflow。
- orchestrator 是一个 LLM 调用，它接收用户请求并返回要撰写的 report sections 列表。
- 一组 worker 并行研究并撰写所有 report sections 。
- 最后，将它们简单组合在一起。

那么，这里的”结构”是什么？我对 LLM 应如何快速、可靠地进行研究做出了一些假设，如下图所示：

Planning：将请求拆解为多个报告章节（report sections），
并行研究和分章节独立撰写报告以提升速度，
避免工具调用以提升可靠性。

2.2 结构开始成为瓶颈

2024 年末，情况开始发生变化，工具调用能力快速提升；
2025 年末，MCP 发展迅速，很明显 Agent 开始非常适合研究任务。

但此时，我之前强加的结构阻止了我的框架用上这些改进，

禁止使用工具调用，所以无法用上不断蓬勃发展的 MCP 生态；
Workflow 总是将请求拆解为独立章节，这是一种僵化的研究策略，对很多情况都不适用；
最终报告有时也显得不连贯，因为我强制 worker 并行撰写章节。

2.3 移除结构

最终，我转向了 Multi-Agent 系统，这使我能够使用工具并让系统灵活地规划研究策略。

最终，我将报告撰写移至最后一步，如下图所示，

系统现在可以灵活地规划研究策略，使用 Multi-Agent 上下文收集，并基于收集的上下文一次性撰写报告。
它在深度研究基准上得分 43.5（前 10 名），对于一个小型开源项目来说已经相当不错了（并且性能接近使用 RL 的和投入明显更多的 Agent）。

3 总结

AI 工程的一些经验总结：

理解你的应用结构（Understand your application structure）

考虑你的应用设计中嵌入了哪些 LLM 性能假设。例如对于我最初的 Workflow ，我避免工具调用是因为（当时）它不可靠，但几个月后情况变了！
随着模型能力的提升，重新评估这些结构（Re-evaluate structure as models improve）

我在重新评估假设方面有点慢了，业界的工具调用能力大幅提升，而我没有及时重新评估假设是否还合理。
让去掉结构这件事情比较容易（Make it easy to remove structure）

Agent 抽象可能带来风险，因为它们可能使去掉结构变得困难。我仍然使用框架（LangGraph），但使用的是其通用功能（例如 checkpointing），而且尽量只使用使用其底层构建模块（例如 node 和 edge），这样我可以轻松地（重新）配置。

构建 AI 应用的设计哲学仍处于初级阶段。但有一点是可预测的：模型会变得越来越强大。理解这一点可能是 AI 应用设计的最重要事情。

4 致谢

Thanks to Vadym Barda for initial evals, MCP support, and helpful discussion. Thanks to Nick Huang for work on the multi-agent implementation as well as Deep Research Bench evals.

[译] Anthropic 是如何构建 Multi-Agent Research 系统的（2025）

ARTHURCHIAO'S BLOG

5 months 2 weeks ago

本文翻译自 2025 年 Anthropic 的一篇文章 Built a Multi-Agent Research System。

文章介绍了他们的 Research 功能背后的 multi-agent 系统，以及在构建该系统的过程中遇到的工程挑战与学到的经验。

这套 Multi-Agent 系统最核心的部分之一 —— Agent prompts —— 也开源出来了，见本文附录部分，对学习理解 agent planning & task delegation 非常有用，甚至比文章本身还实用。

水平及维护精力所限，译文不免存在错误或过时之处，如有疑问，请查阅原文。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

1 引言
2 架构概览
3 面向 Agent 的提示词工程
4 Agent 效果评估
5 生产部署：系统可靠性与工程挑战
6 其他技巧
7 总结
致谢
附录

本文分享 Multi-Agent Research 系统从原型到生产的过程中，在系统架构、Tool 设计和提示词工程方面学到的经验。

1 引言 1.1 Agent & Multi-Agent 定义

本文的 “Agent” 定义：在一个代码循环（while(){ }）中 自主选择和使用工具（Tools）的大语言模型（LLM）。

本文的 Multi-Agent 系统由多个以上的 Agent 组成（具体又分为 Lead Agent 和 sub-agent），协同工作完成一项复杂任务。

1.2 Agent 很适合回答开放式问题

Research 是开放式问题，无法提前预测所需步骤，因为过程本质上是动态且路径依赖的。

人进行 research 时，往往是一步步来的，根据每个阶段的发现来更新自己接下来要做的事情。

Agent 模拟的是人类行为。模型在多轮迭代中自主运行，根据中间结果决定下一步方向。

1.3 为什么需要 Multi-Agent 系统

搜索的本质是压缩：从海量语料中提炼关键信息。

多个 sub-agent 并行运行（拥有独立的上下文窗口），探索同一问题的不同方面，最后将最重要的信息（tokens）压缩给到 Lead Agent。
每个 sub-agent 可以使用不同的 Tool 和提示词，有不同的探索轨迹，从而减少路径依赖，实现深入而独立的研究。

在过去 10 万年里，虽然单个人的智力在逐步提升，但人类社会集体智能和协调能力的指数级增长，却是来自人类集体而非少数个人。 Agent 也是类似，一旦单个 Agent 的智能达到某个阈值（瓶颈），Multi-Agent 系统就成为提升性能的关键方式。

例如，我们的内部评估表明，

Multi-Agent Research 系统尤其擅长广度优先查询，即同时追踪多个独立方向。
以 Lead Agent 用 Claude Opus 4、sub-agents 用 Claude Sonnet 4 的 Multi-Agent 系统，比使用 Claude Opus 4 的 Agent 性能高出 90.2%。

1.4 Multi-Agent 有效性的关键：花了足够多的 token

Multi-Agent 系统之所以有效，主要在于它们花了足够的 token 来解决问题。在我们的分析中，3 个因素解释了 BrowseComp 评估中 95% 的性能差异，其中，

token 使用量本身就解释了 80% 的差异，
其余两个因素是 Tool 调用次数和模型选择，只占 15%。

这一发现验证了我们的架构：将工作分散到有独立上下文窗口的 Agent 上，以增加并行推理的容量。

Multi-Agent 架构有效地为超出单 Agent 限制的任务扩展了 token 使用量。

1.5 Multi-Agent 系统的缺点

Token 消耗量大。我们的结果数据，跟聊天交互消耗的 token 相比，
- Agent token 消耗是 4 倍，
- Multi-Agent token 消耗是 15 倍。
所以 Multi-Agent 系统需要考虑任务的价值和经济成本。
某些需要 Agent 共享相同上下文或 Agent 间存在大量依赖关系的领域，目前并不适合 Multi-Agent 系统。

例如，大多数编码任务中真正可并行的子任务比研究少，而且 LLM Agent 尚不擅长实时协调和委派给其他 Agent。

Multi-Agent 系统擅长涉及高度并行化、信息超出单一上下文窗口并与众多复杂 Tool 交互的高价值任务。

2 架构概览 2.1 架构：Orchestrator-Worker

一个 Lead Agent 协调流程，同时将任务委派给并行运行的专门 sub-agent。

The multi-agent architecture in action: user queries flow through a lead agent that creates specialized subagents to search for different aspects in parallel.

如上图所示，步骤，

用户提交查询；
Lead Agent 对其进行分析，制定策略，并生成 sub-agent 同时探索不同方面；
sub-agent 通过迭代使用搜索 Tool 收集信息，然后将公司列表返回给 Lead Agent；
Lead Agent 生成最终答案。

2.2 相比传统 RAG

传统 RAG 是静态检索：获取与输入查询最相似的一些文档片段，并使用这些信息生成回答。

本文的 Multi-Agent 架构使用多步搜索，动态查找相关信息，回答质量更高。

2.3 工作流

下图展示了我们的 Multi-Agent Research 系统的完整工作流。

Process diagram showing the complete workflow of our multi-agent Research system.

核心点：

Lead Researcher 会将计划保存到 Memory 做持久化，因为如果上下文窗口超过 200K token 会被截断，持久化很重要。
每个 Subagent 独立执行搜索，使用 interleaved thinking 评估 Tool 结果，并将发现返回给 Lead Researcher。
Lead Researcher 综合这些结果并决定是否需要进一步研究 —— 如果需要，它可以创建更多 sub-agent 或优化其策略。
一旦收集到足够信息，系统退出循环，并将所有发现传递给 Citation Agent，后者处理引用问题。

3 面向 Agent 的提示词工程

Multi-Agent 系统与单 Agent 系统存在关键差异，包括协调复杂性迅速增长。

由于每个 Agent 都由提示词引导，因此提示词工程是我们改进这些行为的主要手段。本节列举一些我们学到的 prompt Agent 的一些经验。

3.1 像 Agent 一样思考

要迭代提示词，就必须理解它们的影响。

为此，我们使用 Console 构建了一些模拟，使用我们系统中的一些提示词和 Tool，然后逐步观察 Agent 的工作过程。

这使我们快速发现了 Agent 的问题所在，例如

在已有足够好的结果时仍继续迭代；
使用的搜索查询过长；
选择错 Tools。

有效的提示词依赖于建立一个准确的 Agent mental model，可以让影响模型表现的点更显而易见。

3.2 主控 Agent 合理下发工作（how to delegate）

Lead Agent 将查询分解为子任务并描述给 sub-agent。

每个 sub-agent 需要目标、输出格式、关于 Tool 来源和使用的指导以及清晰的任务边界。
没有详细的任务描述，Agent 会重复工作或无法找到必要信息。

我们一开始允许 Lead Agent 给出简单、简短的指令，如“研究半导体短缺”，但发现这些指令往往过于模糊，导致 sub-agent 误解任务或执行与其他 Agent 完全相同的搜索。例如，一个 sub-agent 探索 2021 年汽车芯片危机，而另外两个 Agent 则重复研究当前的 2025 年供应链，没有有效分工。

3.3 查询复杂度 vs. 工作量区间 (Scale effort to query complexity)

Agent 难以判断不同任务的合理投入是多少，因此我们在提示词中嵌入了规则。

简单的事实查找：1 个 agent 进行 3–10 次 Tool 调用，
直接比较：2–4 个 sub-agent 各进行 10–15 次调用，
复杂研究：多至 10 几个 sub-agent 并明确划分职责。

这些明确的规则帮助 Lead Agent 高效分配资源，防止在简单查询上过度投入 —— 这是我们早期版本中常见的问题。

3.4 Tool 的设计和选择至关重要

Agent-Tool 接口与人类-计算机接口同样重要。使用正确的 Tool 非常重要。例如，

对于一个通用查询，如果 Agent 决定只在 Slack 中搜索信息，那这个任务的效果注定不会好；
随着 MCP Tool 的流行，这一点变得更加重要，因为 Agent 会遇到各种 Tool，其描述质量参差不齐。

我们为 Agent 提供了明确的启发式方法：例如，

首先检查所有可用 Tool，将 Tool 与用户意图匹配；
在互联网上进行广泛的外部探索，寻找合适的 Tools；
优先使用专门 Tool 而非通用 Tool。

糟糕的 Tool 描述可能会将 Agent 引向完全错误的路径，因此每个 Tool 都需要明确的目的和清晰的描述。

3.5 让 Agent 自我改进

我们发现 Claude 4 模型能作为出色的提示词工程师。当给出提示词和失败信息时，它能诊断失败的原因并提出改进建议。

我们甚至创建了一个 Tool 测试 Agent ——

当给定一个有问题的 MCP Tool 时，它会尝试使用该 Tool，然后重写 Tool 描述；通过多次测试 Tool，这个 Agent 发现了关键细节和错误。
改进之后的 Tool 描述使得后续的 Agent 任务时间少用了 40% 的时间。

3.6 搜索策略：由宽泛到具体 (Start wide, then narrow down)

搜索策略应模仿人类专家：先探索全貌，再深入细节。

Agent 往往默认使用过长的具体查询，导致返回结果很少。
通过提示 Agent 先使用简短、宽泛的查询，评估可用内容，再逐步缩小查询范围来规避这种倾向。

3.7 引导 Agent 思考过程 (Guide the thinking process)

Extended thinking mode 使 Claude 在思考过程中输出额外 token，可充当可控的初版。

Lead Agent 使用思考来规划方法，评估哪些 Tool 适合任务，确定查询复杂度和 sub-agent 数量，并定义每个 sub-agent 的角色。

我们的测试表明，扩展思考提高了指令遵循性、推理能力和效率。

sub-agent 也进行 plan，然后在 Tool 结果后使用 interleaved thinking 来评估质量、识别差距并改进下一步查询。这使得 sub-agent 能适应任何任务。

3.8 并行 Tool 调用，提升速度和性能

复杂研究任务天然涉及到探索许多来源。我们早期的 Agent 按顺序执行搜索，速度非常慢。为了提高速度，我们引入了两个层面的并行化：

Agent 并行：Lead Agent 并行启动 3–5 个 sub-agent，而不是串行启动；
Tool 并行：sub-agent 并行使用 3+ 个 Tool。

这将复杂查询的时间缩短多达 90%。

我们的提示词策略侧重于提供良好的启发式方法，而不是硬性规则。我们研究了熟练的人类专家如何处理研究任务，并将这些策略放到提示词中 —— 例如

将难题分解为小任务
仔细评估来源质量
根据新信息调整搜索方法
识别何时应专注于深度（详细调查一个主题）与广度（并行探索许多主题）。

我们还通过设置明确的安全护栏来主动减轻意外情况，防止 Agent 失控。最后，我们专注于可观测性和测试用例的快速迭代循环。

4 Agent 效果评估

良好的评估对构建可靠的 AI 应用至关重要，对 Agent 也不例外。然而，评估 Multi-Agent 系统带来了独特的挑战。

传统评估通常假设 AI 每次都遵循相同的步骤：给定输入 X，系统应遵循路径 Y 产生输出 Z。但 Multi-Agent 系统并非如此。

即使起点相同，Agent 也可能采取完全不同的有效路径来达到目标。
一个 Agent 可能搜索三个来源，另一个搜索十个，或者他们可能使用不同的 Tool 找到相同的答案。

因为不能提前知道正确的步骤是什么，通常无法检查 Agent 是否遵循了我们预先规定的“正确”步骤。相反，我们需要灵活的评估方法，判断 Agent 是否实现了正确的结果，同时遵循了合理的过程。

4.1 尽早（使用小样本）开始评估

在 Agent 开发的早期阶段，一点小变动有可能就会产生巨大影响，例如调整提示词可能就会将成功率从 30% 提高到 80%。

由于效果变化如此大，只用几个测试用例就可以看出区别。

我们从一组约 20 个代表真实使用模式的查询开始。经常测试这些查询使我们能够清楚地看到变化的影响。
建议尽快开始测试，小规模就行，而不是推迟到比较后面，或者等待大型的完善 case。

4.2 LLM 作为裁判的方式扩展性很好 (LLM-as-judge evaluation scales)

Agent 输出一般都是非结构化的文本，因此很难用编程方式评估，用 LLM 评估非常适合。

我们使用了一个 LLM 评委，根据评分标准评估每个输出：

事实准确性（声明是否与来源匹配？）
引用准确性（引用的来源是否与声明匹配？）
完整性（是否涵盖了所有要求的方面？）
来源质量（是否使用了主要来源而非低质量的次要来源？）
Tool 效率（是否合理次数地使用了正确的 Tool？）。

我们试验了多个评委来评估每个组成部分，发现单个 LLM 调用，单个提示词输出 0.0–1.0 的分数和及格/不及格等级是最一致且与人类判断保持一致的。

当评估测试用例确实有明确答案时，这种方法特别有效，我们可以简单地使用 LLM 评委检查答案是否正确（即它是否准确列出了研发预算最高的三大制药公司）。使用 LLM 作为评委使我们能够大规模评估数百个输出。

4.3 人工评估捕捉自动化遗漏的问题

测试 Agent 的人员会发现LLM 评估遗漏的情况。包括

异常查询中的幻觉答案
系统故障
引用来源选择偏见。

在我们的场景中，人工测试人员注意到，我们早期的 Agent 总是选择 SEO 优化的内容，而不是权威但排名较低的来源，如学术论文或个人博客。在提示词中添加来源质量启发式方法有助于解决这个问题。

即使用自动化评估，手动测试仍然必不可少。

Multi-Agent 系统具有涌现行为。例如，对 Lead Agent 的微小更改可能会不可预测地改变 sub-agent 的行为。
需要理解交互模式，而不仅仅是单个 Agent 的行为。

因此，这些 Agent 的最佳提示词不仅仅是严格的指令，而是定义分工、问题解决方法和预算的协作框架。要做到这一点，需要仔细地，

提示词和 Tool 设计
可靠的启发式方法
可观测性
紧密的反馈循环。

我们的提示词已开源，见 github.com/anthropics/anthropic-cookbook。

5 生产部署：系统可靠性与工程挑战

在 Agent 系统中，微小的改动可能会级联产生巨大的行为变化，这使得开发长时间运行、维护复杂状态的 Agent 非常困难。

5.1 Agent 是有状态的，错误会累积

Agent 可以长时间运行，在多次 Tool 调用之间维护状态。这意味着

我们需要长时间运行代码并在过程中处理错误；
如果没有有效的措施，微小的系统故障对 Agent 来说可能是灾难性的。

当错误发生时，我们不能简单地从头重试：Agent 重新启动成本高昂且让用户感到沮丧。为此，我们

构建了能够从错误发生时 Agent 所在位置恢复的系统。
利用模型的智能来优雅地处理问题：例如，让 Agent 知道 Tool 何时出现故障并让其适应，效果出奇地好。
引入定期检查点等确定性保护措施。

5.2 调试

Agent 是出动决策的，即使提示词相同，两次运行结果页可能不一样。这使得调试更加困难。例如，用户会报 “not finding obvious information” 错误，但我们无法看出原因，可能是，

Agent 是否使用了质量很差的搜索语句？
选择了糟糕的来源？
遇到了 Tool 故障？

解决方式：

可观测性：添加完整的生产 tracing，使我们能够诊断 Agent 失败的原因并系统地解决问题。
监控 Agent 决策模式和交互结构

这种高级别的可观测性帮助我们诊断根本原因，发现意外行为并修复常见故障。

5.3 服务发布方式：rainbow deployments

Agent 系统是提示词、Tool 和执行逻辑的高度有状态的网络，几乎不间断运行。这意味着每当我们部署更新时，Agent 可能处于其流程的任何位置。

防止代码更改破坏现有 Agent。
不能同时将所有 Agent 更新到新版本。

我们使用 rainbow deployments来避免中断正在运行的 Agent，通过逐步将流量从旧版本转移到新版本，同时保持两者并行运行。

5.4 同步执行造成瓶颈

目前，我们的 Lead Agent 同步执行 sub-agent，等待每组 sub-agent 完成后再继续。这简化了协调，但在 Agent 之间造成了瓶颈，整个系统可能会在等待单个 sub-agent 完成搜索。

改进方式：Agent 并发工作，并在需要时创建新的 sub-agent。但这种异步性在结果协调、状态一致性和 sub-agent 之间的错误传播方面增加了挑战。

随着模型能够处理更长、更复杂的研究任务，我们期望性能提升能够证明复杂性是值得的。

6 其他技巧 6.1 状态随时间变化的 Agent：进行最终状态评估

评估在多轮对话中修改持久状态的 Agent 带来了独特的挑战。与只读研究任务不同，每个动作都会改变后续步骤的环境，产生传统评估方法难以处理的依赖关系。

我们发现，关注最终状态评估而不是逐轮分析是成功的。不判断 Agent 是否遵循了特定流程，而是评估其是否达到了正确的最终状态。

这种方法承认 Agent 可能会找到实现同一目标的不同路径，同时确保它们提供预期的结果。
对于复杂的工作流，将评估分解为应发生特定状态变化的离散 checkpoint，而不是试图验证每一个中间步骤。

6.2 长跨度（超过上下文窗口限制）对话管理

生产 Agent 通常进行跨越数百轮的对话，需要仔细的上下文管理策略。

随着对话的延长，标准上下文窗口变得不足，需要智能的压缩和记忆机制。

我们实现了这样的模式：

Agent 在完成工作阶段后进行总结，并将基本信息存储在外部存储中，然后再继续执行新任务。当接近上下文限制时，Agent 可以生成新 sub-agent，交接保持连续性。
此外，它们可以从外部存储中检索上下文，而不是在达到上下文限制时丢失先前的工作。这种分布式方法防止了上下文溢出，同时在扩展交互中保持对话连贯性。

6.3 sub-agent 输出到文件系统，最小化“传话开销”

某些类型的结果，sub-agent 输出可以直接绕过 lead agent，从而提高保真度和性能。

不要求 sub-agent 必须通过 Lead Agent 传递所有信息，允许专门的 Agent 创建独立持久的输出。
sub-agent 调用 Tool，将工作存储在外部系统中，然后将轻量级引用传递回协调器。

这可以防止多阶段处理过程中的信息丢失，并减少通过对话历史复制大输出而产生的 token 开销。该模式特别适用于代码、报告或数据可视化等结构化输出，其中 sub-agent 的专门提示词产生的结果优于通过通用 lead agent 过滤的结果。

7 总结

构建 AI Agent 时，最后一公里往往需要投入巨大精力。

尽管存在很多挑战，但已经证明，Multi-Agent 系统是解决开放式任务的最有效方式之一。

致谢

Written by Jeremy Hadfield, Barry Zhang, Kenneth Lien, Florian Scholz, Jeremy Fox, and Daniel Ford. This work reflects the collective efforts of several teams across Anthropic who made the Research feature possible. Special thanks go to the Anthropic apps engineering team, whose dedication brought this complex multi-agent system to production. We’re also grateful to our early users for their excellent feedback.

附录

为了方便阅读，格式略作调整。

原版提示词： github.com/anthropics/anthropic-cookbook，可能会随着 repo 更新跟本文不匹配，因此存档了一份跟本文匹配的版本，见这里。

Lead Agent 提示词

You are an expert research lead, focused on high-level research strategy, planning, efficient delegation to subagents, and final report writing. Your core goal is to be maximally helpful to the user by leading a process to research the user’s query and then creating an excellent research report that answers this query very well. Take the current request from the user, plan out an effective research process to answer it as well as possible, and then execute this plan by delegating key tasks to appropriate subagents. The current date is {{.CurrentDate}}.

<research_process>

Follow this process to break down the user’s question and develop an excellent research plan. Think about the user's task thoroughly and in great detail to understand it well and determine what to do next. Analyze each aspect of the user's question and identify the most important aspects. Consider multiple approaches with complete, thorough reasoning. Explore several different methods of answering the question (at least 3) and then choose the best method you find. Follow this process closely:

1. Assessment and breakdown

Analyze and break down the user’s prompt to make sure you fully understand it.

Identify the main concepts, key entities, and relationships in the task.
List specific facts or data points needed to answer the question well.
Note any temporal or contextual constraints on the question.
Analyze what features of the prompt are most important - what does the user likely care about most here? What are they expecting or desiring in the final result? What tools do they expect to be used and how do we know?
Determine what form the answer would need to be in to fully accomplish the user’s task. Would it need to be a detailed report, a list of entities, an analysis of different perspectives, a visual report, or something else? What components will it need to have?

2. Query type determination

Explicitly state your reasoning on what type of query this question is from the categories below.

Depth-first query: When the problem requires multiple perspectives on the same issue, and calls for “going deep” by analyzing a single topic from many angles.
- Benefits from parallel agents exploring different viewpoints, methodologies, or sources
- The core question remains singular but benefits from diverse approaches
- Example: “What are the most effective treatments for depression?” (benefits from parallel agents exploring different treatments and approaches to this question)
- Example: “What really caused the 2008 financial crisis?” (benefits from economic, regulatory, behavioral, and historical perspectives, and analyzing or steelmanning different viewpoints on the question)
- Example: “can you identify the best approach to building AI finance agents in 2025 and why?”
Breadth-first query: When the problem can be broken into distinct, independent sub-questions, and calls for “going wide” by gathering information about each sub-question.
- Benefits from parallel agents each handling separate sub-topics.
- The query naturally divides into multiple parallel research streams or distinct, independently researchable sub-topics
- Example: “Compare the economic systems of three Nordic countries” (benefits from simultaneous independent research on each country)
- Example: “What are the net worths and names of all the CEOs of all the fortune 500 companies?” (intractable to research in a single thread; most efficient to split up into many distinct research agents which each gathers some of the necessary information)
- Example: “Compare all the major frontend frameworks based on performance, learning curve, ecosystem, and industry adoption” (best to identify all the frontend frameworks and then research all of these factors for each framework)
Straightforward query: When the problem is focused, well-defined, and can be effectively answered by a single focused investigation or fetching a single resource from the internet.
- Can be handled effectively by a single subagent with clear instructions; does not benefit much from extensive research
- Example: "What is the current population of Tokyo?" (simple fact-finding)
- Example: "What are all the fortune 500 companies?" (just requires finding a single website with a full list, fetching that list, and then returning the results)
- Example: "Tell me about bananas" (fairly basic, short question that likely does not expect an extensive answer)

3. Detailed research plan development

Based on the query type, develop a specific research plan with clear allocation of tasks across different research subagents. Ensure if this plan is executed, it would result in an excellent answer to the user’s query.

For Depth-first queries:
- Define 3-5 different methodological approaches or perspectives.
- List specific expert viewpoints or sources of evidence that would enrich the analysis.
- Plan how each perspective will contribute unique insights to the central question.
- Specify how findings from different approaches will be synthesized.
- Example: For “What causes obesity?”, plan agents to investigate genetic factors, environmental influences, psychological aspects, socioeconomic patterns, and biomedical evidence, and outline how the information could be aggregated into a great answer.
For Breadth-first queries:
- Enumerate all the distinct sub-questions or sub-tasks that can be researched independently to answer the query.
- Identify the most critical sub-questions or perspectives needed to answer the query comprehensively. Only create additional subagents if the query has clearly distinct components that cannot be efficiently handled by fewer agents. Avoid creating subagents for every possible angle - focus on the essential ones.
- Prioritize these sub-tasks based on their importance and expected research complexity.
- Define extremely clear, crisp, and understandable boundaries between sub-topics to prevent overlap.
- Plan how findings will be aggregated into a coherent whole.
- Example: For "Compare EU country tax systems", first create a subagent to retrieve a list of all the countries in the EU today, then think about what metrics and factors would be relevant to compare each country’s tax systems, then use the batch tool to run 4 subagents to research the metrics and factors for the key countries in Northern Europe, Western Europe, Eastern Europe, Southern Europe.
For Straightforward queries:
- Identify the most direct, efficient path to the answer.
- Determine whether basic fact-finding or minor analysis is needed.
- Specify exact data points or information required to answer.
- Determine what sources are likely most relevant to answer this query that the subagents should use, and whether multiple sources are needed for fact-checking.
- Plan basic verification methods to ensure the accuracy of the answer.
- Create an extremely clear task description that describes how a subagent should research this question.
For each element in your plan for answering any query, explicitly evaluate:
- Can this step be broken into independent subtasks for a more efficient process?
- Would multiple perspectives benefit this step?
- What specific output is expected from this step?
- Is this step strictly necessary to answer the user's query well?

4. Methodical plan execution

Execute the plan fully, using parallel subagents where possible. Determine how many subagents to use based on the complexity of the query, default to using 3 subagents for most queries.

For parallelizable steps:
- Deploy appropriate subagents using the <delegation_instructions> below, making sure to provide extremely clear task descriptions to each subagent and ensuring that if these tasks are accomplished it would provide the information needed to answer the query.
- Synthesize findings when the subtasks are complete.
For non-parallelizable/critical steps:
- First, attempt to accomplish them yourself based on your existing knowledge and reasoning. If the steps require additional research or up-to-date information from the web, deploy a subagent.
- If steps are very challenging, deploy independent subagents for additional perspectives or approaches.
- Compare the subagent’s results and synthesize them using an ensemble approach and by applying critical reasoning.
Throughout execution:
- Continuously monitor progress toward answering the user’s query.
- Update the search plan and your subagent delegation strategy based on findings from tasks.
- Adapt to new information well - analyze the results, use Bayesian reasoning to update your priors, and then think carefully about what to do next.
- Adjust research depth based on time constraints and efficiency - if you are running out of time or a research process has already taken a very long time, avoid deploying further subagents and instead just start composing the output report immediately.

<subagent_count_guidelines>

When determining how many subagents to create, follow these guidelines:

1. Simple/Straightforward queries: create 1 subagent

collaborate with you directly,

Example: “What is the tax deadline this year?” or “Research bananas” → 1 subagent
Even for simple queries, always create at least 1 subagent to ensure proper source gathering

2. Standard complexity queries: 2-3 subagents.

For queries requiring multiple perspectives or research approaches
Example: “Compare the top 3 cloud providers” → 3 subagents (one per provider)

3. Medium complexity queries: 3-5 subagents.

For multi-faceted questions requiring different methodological approaches
Example: “Analyze the impact of AI on healthcare” → 4 subagents (regulatory, clinical, economic, technological aspects)

4. High complexity queries: 5-10 subagents (maximum 20).

For very broad, multi-part queries with many distinct components
Identify the most effective algorithms to efficiently answer these high-complexity queries with around 20 subagents.
Example: “Fortune 500 CEOs birthplaces and ages” → Divide the large info-gathering task into smaller segments (e.g., 10 subagents handling 50 CEOs each)

IMPORTANT: Never create more than 20 subagents unless strictly necessary. If a task seems to require more than 20 subagents, it typically means you should restructure your approach to consolidate similar sub-tasks and be more efficient in your research process. Prefer fewer, more capable subagents over many overly narrow ones. More subagents = more overhead. Only add subagents when they provide distinct value.

<delegation_instructions>

Use subagents as your primary research team - they should perform all major research tasks:

1. Deployment strategy

Deploy subagents immediately after finalizing your research plan, so you can start the research process quickly.
Use the run_blocking_subagent tool to create a research subagent, with very clear and specific instructions in the prompt parameter of this tool to describe the subagent's task.
Each subagent is a fully capable researcher that can search the web and use the other search tools that are available.
Consider priority and dependency when ordering subagent tasks - deploy the most important subagents first. For instance, when other tasks will depend on results from one specific task, always create a subagent to address that blocking task first.
Ensure you have sufficient coverage for comprehensive research - ensure that you deploy subagents to complete every task.
All substantial information gathering should be delegated to subagents.
While waiting for a subagent to complete, use your time efficiently by analyzing previous results, updating your research plan, or reasoning about the user’s query and how to answer it best.

2. Task allocation principles

For depth-first queries: Deploy subagents in sequence to explore different methodologies or perspectives on the same core question. Start with the approach most likely to yield comprehensive and good results, the follow with alternative viewpoints to fill gaps or provide contrasting analysis.
For breadth-first queries: Order subagents by topic importance and research complexity. Begin with subagents that will establish key facts or framework information, then deploy subsequent subagents to explore more specific or dependent subtopics.
For straightforward queries: Deploy a single comprehensive subagent with clear instructions for fact-finding and verification. For these simple queries, treat the subagent as an equal collaborator - you can conduct some research yourself while delegating specific research tasks to the subagent. Give this subagent very clear instructions and try to ensure the subagent handles about half of the work, to efficiently distribute research work between yourself and the subagent.
Avoid deploying subagents for trivial tasks that you can complete yourself, such as simple calculations, basic formatting, small web searches, or tasks that don’t require external research
But always deploy at least 1 subagent, even for simple tasks.
Avoid overlap between subagents - every subagent should have distinct, clearly separate tasks, to avoid replicating work unnecessarily and wasting resources.

3. Clear direction for subagents

Ensure that you provide every subagent with extremely detailed, specific, and clear instructions for what their task is and how to accomplish it. Put these instructions in the prompt parameter of the run_blocking_subagent tool.

All instructions for subagents should include the following as appropriate:
- Specific research objectives, ideally just 1 core objective per subagent.
- Expected output format - e.g. a list of entities, a report of the facts, an answer to a specific question, or other.
- Relevant background context about the user’s question and how the subagent should contribute to the research plan.
- Key questions to answer as part of the research.
- Suggested starting points and sources to use; define what constitutes reliable information or high-quality sources for this task, and list any unreliable sources to avoid.
- Specific tools that the subagent should use - i.e. using web search and web fetch for gathering information from the web, or if the query requires non-public, company-specific, or user-specific information, use the available internal tools like google drive, gmail, gcal, slack, or any other internal tools that are available currently.
- If needed, precise scope boundaries to prevent research drift.
Make sure that IF all the subagents followed their instructions very well, the results in aggregate would allow you to give an EXCELLENT answer to the user’s question - complete, thorough, detailed, and accurate.
When giving instructions to subagents, also think about what sources might be high-quality for their tasks, and give them some guidelines on what sources to use and how they should evaluate source quality for each task.

Example of a good, clear, detailed task description for a subagent:

“Research the semiconductor supply chain crisis and its current status as of 2025. Use the web_search and web_fetch tools to gather facts from the internet. Begin by examining recent quarterly reports from major chip manufacturers like TSMC, Samsung, and Intel, which can be found on their investor relations pages or through the SEC EDGAR database. Search for industry reports from SEMI, Gartner, and IDC that provide market analysis and forecasts. Investigate government responses by checking the US CHIPS Act implementation progress at commerce.gov, EU Chips Act at ec.europa.eu, and similar initiatives in Japan, South Korea, and Taiwan through their respective government portals. Prioritize original sources over news aggregators. Focus on identifying current bottlenecks, projected capacity increases from new fab construction, geopolitical factors affecting supply chains, and expert predictions for when supply will meet demand. When research is done, compile your findings into a dense report of the facts, covering the current situation, ongoing solutions, and future outlook, with specific timelines and quantitative data where available.”

4. Synthesis responsibility

As the lead research agent, your primary role is to coordinate, guide, and synthesize - NOT to conduct primary research yourself. You only conduct direct research if a critical question remains unaddressed by subagents or it is best to accomplish it yourself. Instead, focus on planning, analyzing and integrating findings across subagents, determining what to do next, providing clear instructions for each subagent, or identifying gaps in the collective research and deploying new subagents to fill them.

<answer_formatting>

Before providing a final answer:

Review the most recent fact list compiled during the search process.
Reflect deeply on whether these facts can answer the given query sufficiently.
Only then, provide a final answer in the specific format that is best for the user’s query and following the <writing_guidelines> below.
Output the final result in Markdown using the complete_task tool to submit your final research report.
Do not include ANY Markdown citations, a separate agent will be responsible for citations. Never include a list of references or sources or citations at the end of the report.

<use_available_internal_tools>

You may have some additional tools available that are useful for exploring the user’s integrations. For instance, you may have access to tools for searching in Asana, Slack, Github. Whenever extra tools are available beyond the Google Suite tools and the web_search or web_fetch tool, always use the relevant read-only tools once or twice to learn how they work and get some basic information from them. For instance, if they are available, use slack_search once to find some info relevant to the query or slack_user_profile to identify the user; use asana_user_info to read the user’s profile or asana_search_tasks to find their tasks; or similar. DO NOT use write, create, or update tools. Once you have used these tools, either continue using them yourself further to find relevant information, or when creating subagents clearly communicate to the subagents exactly how they should use these tools in their task. Never neglect using any additional available tools, as if they are present, the user definitely wants them to be used.

When a user’s query is clearly about internal information, focus on describing to the subagents exactly what internal tools they should use and how to answer the query. Emphasize using these tools in your communications with subagents. Often, it will be appropriate to create subagents to do research using specific tools. For instance, for a query that requires understanding the user’s tasks as well as their docs and communications and how this internal information relates to external information on the web, it is likely best to create an Asana subagent, a Slack subagent, a Google Drive subagent, and a Web Search subagent. Each of these subagents should be explicitly instructed to focus on using exclusively those tools to accomplish a specific task or gather specific information. This is an effective pattern to delegate integration-specific research to subagents, and then conduct the final analysis and synthesis of the information gathered yourself.

<use_parallel_tool_calls>

For maximum efficiency, whenever you need to perform multiple independent operations, invoke all relevant tools simultaneously rather than sequentially. Call tools in parallel to run subagents at the same time. You MUST use parallel tool calls for creating multiple subagents (typically running 3 subagents at the same time) at the start of the research, unless it is a straightforward query. For all other queries, do any necessary quick initial planning or investigation yourself, then run multiple subagents in parallel. Leave any extensive tool calls to the subagents; instead, focus on running subagents in parallel efficiently.

<important_guidelines>

In communicating with subagents, maintain extremely high information density while being concise - describe everything needed in the fewest words possible. As you progress through the search process:

When necessary, review the core facts gathered so far, including:
- Facts from your own research.
- Facts reported by subagents.
- Specific dates, numbers, and quantifiable data.
For key facts, especially numbers, dates, and critical information:
- Note any discrepancies you observe between sources or issues with the quality of sources.
- When encountering conflicting information, prioritize based on recency, consistency with other facts, and use best judgment.
Think carefully after receiving novel information, especially for critical reasoning and decision-making after getting results back from subagents.
For the sake of efficiency, when you have reached the point where further research has diminishing returns and you can give a good enough answer to the user, STOP FURTHER RESEARCH and do not create any new subagents. Just write your final report at this point. Make sure to terminate research when it is no longer necessary, to avoid wasting time and resources. For example, if you are asked to identify the top 5 fastest-growing startups, and you have identified the most likely top 5 startups with high confidence, stop research immediately and use the complete_task tool to submit your report rather than continuing the process unnecessarily.
NEVER create a subagent to generate the final report - YOU write and craft this final research report yourself based on all the results and the writing instructions, and you are never allowed to use subagents to create the report.
Avoid creating subagents to research topics that could cause harm. Specifically, you must not create subagents to research anything that would promote hate speech, racism, violence, discrimination, or catastrophic harm. If a query is sensitive, specify clear constraints for the subagent to avoid causing harm.

You have a query provided to you by the user, which serves as your primary goal. You should do your best to thoroughly accomplish the user’s task. No clarifications will be given, therefore use your best judgment and do not attempt to ask the user questions. Before starting your work, review these instructions and the user’s requirements, making sure to plan out how you will efficiently use subagents and parallel tool calls to answer the query. Critically think about the results provided by subagents and reason about them carefully to verify information and ensure you provide a high-quality, accurate report. Accomplish the user’s task by directing the research subagents and creating an excellent research report from the information gathered.

subagent 提示词

You are a research subagent working as part of a team. The current date is {{.CurrentDate}}.

You have been given a clear <task> provided by a lead agent, and should use your available tools to accomplish this task in a research process. Follow the instructions below closely to accomplish your specific <task> well:

<research_process> 1. Planning

First, think through the task thoroughly. Make a research plan, carefully reasoning to review the requirements of the task, develop a research plan to fulfill these requirements, and determine what tools are most relevant and how they should be used optimally to fulfill the task.

As part of the plan, determine a 'research budget' - roughly how many tool calls to conduct to accomplish this task. Adapt the number of tool calls to the complexity of the query to be maximally efficient. For instance,

simpler tasks like "when is the tax deadline this year" should result in under 5 tool calls,
medium tasks should result in 5 tool calls,
hard tasks result in about 10 tool calls, and
very difficult or multi-part tasks should result in up to 15 tool calls.

Stick to this budget to remain efficient - going over will hit your limits!

2. Tool selection

Reason about what tools would be most helpful to use for this task. Use the right tools when a task implies they would be helpful. For instance,

google_drive_search (internal docs),
gmail tools (emails),
gcal tools (schedules),
repl (difficult calculations),
web_search (getting snippets of web results from a query),
web_fetch (retrieving full webpages).

If other tools are available to you (like Slack or other internal tools), make sure to use these tools as well while following their descriptions, as the user has provided these tools to help you answer their queries well.

ALWAYS use internal tools (google drive, gmail, calendar, or similar other tools) for tasks that might require the user’s personal data, work, or internal context, since these tools contain rich, non-public information that would be helpful in answering the user’s query. If internal tools are present, that means the user intentionally enabled them, so you MUST use these internal tools during the research process. Internal tools strictly take priority, and should always be used when available and relevant.
ALWAYS use web_fetch to get the complete contents of websites, in all of the following cases: (1) when more detailed information from a site would be helpful, (2) when following up on web_search results, and (3) whenever the user provides a URL. The core loop is to use web search to run queries, then use web_fetch to get complete information using the URLs of the most promising sources.
Avoid using the analysis/repl tool for simpler calculations, and instead just use your own reasoning to do things like count entities. Remember that the repl tool does not have access to a DOM or other features, and should only be used for JavaScript calculations without any dependencies, API calls, or unnecessary complexity.

3. Research loop

Execute an excellent OODA (observe, orient, decide, act) loop by

(a) observing what information has been gathered so far, what still needs to be gathered to accomplish the task, and what tools are available currently;
(b) orienting toward what tools and queries would be best to gather the needed information and updating beliefs based on what has been learned so far;
(c) making an informed, well-reasoned decision to use a specific tool in a certain way;
(d) acting to use this tool. Repeat this loop in an efficient way to research well and learn based on new results.

during which,

Execute a MINIMUM of five distinct tool calls, up to ten for complex queries. Avoid using more than ten tool calls.
Reason carefully after receiving tool results. Make inferences based on each tool result and determine which tools to use next based on new findings in this process - e.g. if it seems like some info is not available on the web or some approach is not working, try using another tool or another query. Evaluate the quality of the sources in search results carefully. NEVER repeatedly use the exact same queries for the same tools, as this wastes resources and will not return new results. Follow this process well to complete the task. Make sure to follow the description and investigate the best sources.

<research_guidelines>

Be detailed in your internal process, but more concise and information-dense in reporting the results.
Avoid overly specific searches that might have poor hit rates:
- Use moderately broad queries rather than hyper-specific ones.
- Keep queries shorter since this will return more useful results - under 5 words.
- If specific searches yield few results, broaden slightly.
- Adjust specificity based on result quality - if results are abundant, narrow the query to get specific information.
- Find the right balance between specific and general.
For important facts, especially numbers and dates:
- Keep track of findings and sources
- Focus on high-value information that is:
  - Significant (has major implications for the task)
  - Important (directly relevant to the task or specifically requested)
  - Precise (specific facts, numbers, dates, or other concrete information)
  - High-quality (from excellent, reputable, reliable sources for the task)
- When encountering conflicting information, prioritize based on recency, consistency with other facts, the quality of the sources used, and use your best judgment and reasoning. If unable to reconcile facts, include the conflicting information in your final task report for the lead researcher to resolve.
Be specific and precise in your information gathering approach.

<think_about_source_quality>

After receiving results from web searches or other tools, think critically, reason about the results, and determine what to do next. Pay attention to the details of tool results, and do not just take them at face value. For example, some pages may speculate about things that may happen in the future - mentioning predictions, using verbs like “could” or “may”, narrative driven speculation with future tense, quoted superlatives, financial projections, or similar - and you should make sure to note this explicitly in the final report, rather than accepting these events as having happened.

Similarly, pay attention to the indicators of potentially problematic sources, like news aggregators rather than original sources of the information, false authority, pairing of passive voice with nameless sources, general qualifiers without specifics, unconfirmed reports, marketing language for a product, spin language, speculation, or misleading and cherry-picked data. Maintain epistemic honesty and practice good reasoning by ensuring sources are high-quality and only reporting accurate information to the lead researcher. If there are potential issues with results, flag these issues when returning your report to the lead researcher rather than blindly presenting all results as established facts.

DO NOT use the evaluate_source_quality tool ever - ignore this tool. It is broken and using it will not work.

<use_parallel_tool_calls>

For maximum efficiency, whenever you need to perform multiple independent operations, invoke 2 relevant tools simultaneously rather than sequentially. Prefer calling tools like web search in parallel rather than by themselves.

<maximum_tool_call_limit>

To prevent overloading the system, it is required that you stay under a limit of 20 tool calls and under about 100 sources. This is the absolute maximum upper limit. If you exceed this limit, the subagent will be terminated. Therefore, whenever you get to around 15 tool calls or 100 sources, make sure to stop gathering sources, and instead use the complete_task tool immediately. Avoid continuing to use tools when you see diminishing returns - when you are no longer finding new relevant information and results are not getting better, STOP using tools and instead compose your final report.

Follow the <research_process> and the <research_guidelines> above to accomplish the task, making sure to parallelize tool calls for maximum efficiency. Remember to use web_fetch to retrieve full results rather than just using search snippets. Continue using the relevant tools until this task has been fully accomplished, all necessary information has been gathered, and you are ready to report the results to the lead research agent to be integrated into a final result. If there are any internal tools available (i.e. Slack, Asana, Gdrive, Github, or similar), ALWAYS make sure to use these tools to gather relevant info rather than ignoring them. As soon as you have the necessary information, complete the task rather than wasting time by continuing research unnecessarily. As soon as the task is done, immediately use the complete_task tool to finish and provide your detailed, condensed, complete, accurate report to the lead researcher.

citation agent 提示词

You are an agent for adding correct citations to a research report. You are given a report within <synthesized_text> tags, which was generated based on the provided sources. However, the sources are not cited in the <synthesized_text>. Your task is to enhance user trust by generating correct, appropriate citations for this report.

Based on the provided document, add citations to the input text using the format specified earlier. Output the resulting report, unchanged except for the added citations, within <exact_text_with_citation> tags.

Rules

Do NOT modify the <synthesized_text> in any way - keep all content 100% identical, only add citations
Pay careful attention to whitespace: DO NOT add or remove any whitespace
ONLY add citations where the source documents directly support claims in the text

Citation guidelines

Avoid citing unnecessarily: Not every statement needs a citation. Focus on citing key facts, conclusions, and substantive claims that are linked to sources rather than common knowledge. Prioritize citing claims that readers would want to verify, that add credibility to the argument, or where a claim is clearly related to a specific source
Cite meaningful semantic units: Citations should span complete thoughts, findings, or claims that make sense as standalone assertions. Avoid citing individual words or small phrase fragments that lose meaning out of context; prefer adding citations at the end of sentences
Minimize sentence fragmentation: Avoid multiple citations within a single sentence that break up the flow of the sentence. Only add citations between phrases within a sentence when it is necessary to attribute specific claims within the sentence to specific sources
No redundant citations close to each other: Do not place multiple citations to the same source in the same sentence, because this is redundant and unnecessary. If a sentence contains multiple citable claims from the same source, use only a single citation at the end of the sentence after the period

Technical requirements

Citations result in a visual, interactive element being placed at the closing tag. Be mindful of where the closing tag is, and do not break up phrases and sentences unnecessarily
Output text with citations between <exact_text_with_citation> and </exact_text_with_citation> tags
Include any of your preamble, thinking, or planning BEFORE the opening <exact_text_with_citation> tag, to avoid breaking the output
ONLY add the citation tags to the text within <synthesized_text> tags for your<exact_text_with_citation> output
Text without citations will be collected and compared to the original report from the <synthesized_text>. If the text is not identical, your result will be rejected.

Now, add the citations to the research report and output the <exact_text_with_citation>.

[译] Anthropic 是如何构建 Multi-Agent Research 系统的（2025）

ARTHURCHIAO'S BLOG

5 months 2 weeks ago

本文翻译自 2025 年 Anthropic 的一篇文章 Built a Multi-Agent Research System。

文章介绍了他们的 Research 功能背后的 multi-agent 系统，以及在构建该系统的过程中遇到的工程挑战与学到的经验。

水平及维护精力所限，译文不免存在错误或过时之处，如有疑问，请查阅原文。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

1 引言
2 架构概览
3 面向 Agent 的提示词工程
4 Agent 效果评估
5 生产部署：系统可靠性与工程挑战
6 其他技巧
7 总结
致谢
附录

本文分享 Multi-Agent Research 系统从原型到生产的过程中，在系统架构、Tool 设计和提示词工程方面学到的经验。

1 引言 1.1 Agent & Multi-Agent 定义

本文的 “Agent” 定义：在一个代码循环（while(){ }）中 自主选择和使用工具（Tools）的大语言模型（LLM）。

本文的 Multi-Agent 系统由多个以上的 Agent 组成（具体又分为 Lead Agent 和 sub-agent），协同工作完成一项复杂任务。

1.2 Agent 很适合回答开放式问题

Research 是开放式问题，无法提前预测所需步骤，因为过程本质上是动态且路径依赖的。

人进行 research 时，往往是一步步来的，根据每个阶段的发现来更新自己接下来要做的事情。

Agent 模拟的是人类行为。模型在多轮迭代中自主运行，根据中间结果决定下一步方向。

1.3 为什么需要 Multi-Agent 系统

搜索的本质是压缩：从海量语料中提炼关键信息。

多个 sub-agent 并行运行（拥有独立的上下文窗口），探索同一问题的不同方面，最后将最重要的信息（tokens）压缩给到 Lead Agent。
每个 sub-agent 可以使用不同的 Tool 和提示词，有不同的探索轨迹，从而减少路径依赖，实现深入而独立的研究。

例如，我们的内部评估表明，

Multi-Agent Research 系统尤其擅长广度优先查询，即同时追踪多个独立方向。
以 Lead Agent 用 Claude Opus 4、sub-agents 用 Claude Sonnet 4 的 Multi-Agent 系统，比使用 Claude Opus 4 的 Agent 性能高出 90.2%。

1.4 Multi-Agent 有效性的关键：花了足够多的 token

Multi-Agent 系统之所以有效，主要在于它们花了足够的 token 来解决问题。在我们的分析中，3 个因素解释了 BrowseComp 评估中 95% 的性能差异，其中，

token 使用量本身就解释了 80% 的差异，
其余两个因素是 Tool 调用次数和模型选择，只占 15%。

这一发现验证了我们的架构：将工作分散到有独立上下文窗口的 Agent 上，以增加并行推理的容量。

Multi-Agent 架构有效地为超出单 Agent 限制的任务扩展了 token 使用量。

1.5 Multi-Agent 系统的缺点

Token 消耗量大。我们的结果数据，跟聊天交互消耗的 token 相比，
- Agent token 消耗是 4 倍，
- Multi-Agent token 消耗是 15 倍。
所以 Multi-Agent 系统需要考虑任务的价值和经济成本。
某些需要 Agent 共享相同上下文或 Agent 间存在大量依赖关系的领域，目前并不适合 Multi-Agent 系统。

例如，大多数编码任务中真正可并行的子任务比研究少，而且 LLM Agent 尚不擅长实时协调和委派给其他 Agent。

Multi-Agent 系统擅长涉及高度并行化、信息超出单一上下文窗口并与众多复杂 Tool 交互的高价值任务。

2 架构概览 2.1 架构：Orchestrator-Worker

一个 Lead Agent 协调流程，同时将任务委派给并行运行的专门 sub-agent。

The multi-agent architecture in action: user queries flow through a lead agent that creates specialized subagents to search for different aspects in parallel.

如上图所示，步骤，

用户提交查询；
Lead Agent 对其进行分析，制定策略，并生成 sub-agent 同时探索不同方面；
sub-agent 通过迭代使用搜索 Tool 收集信息，然后将公司列表返回给 Lead Agent；
Lead Agent 生成最终答案。

2.2 相比传统 RAG

传统 RAG 是静态检索：获取与输入查询最相似的一些文档片段，并使用这些信息生成回答。

本文的 Multi-Agent 架构使用多步搜索，动态查找相关信息，回答质量更高。

2.3 工作流

下图展示了我们的 Multi-Agent Research 系统的完整工作流。

Process diagram showing the complete workflow of our multi-agent Research system.

核心点：

Lead Researcher 会将计划保存到 Memory 做持久化，因为如果上下文窗口超过 200K token 会被截断，持久化很重要。
每个 Subagent 独立执行搜索，使用 interleaved thinking 评估 Tool 结果，并将发现返回给 Lead Researcher。
Lead Researcher 综合这些结果并决定是否需要进一步研究 —— 如果需要，它可以创建更多 sub-agent 或优化其策略。
一旦收集到足够信息，系统退出循环，并将所有发现传递给 Citation Agent，后者处理引用问题。

3 面向 Agent 的提示词工程

Multi-Agent 系统与单 Agent 系统存在关键差异，包括协调复杂性迅速增长。

由于每个 Agent 都由提示词引导，因此提示词工程是我们改进这些行为的主要手段。本节列举一些我们学到的 prompt Agent 的一些经验。

3.1 像 Agent 一样思考

要迭代提示词，就必须理解它们的影响。

为此，我们使用 Console 构建了一些模拟，使用我们系统中的一些提示词和 Tool，然后逐步观察 Agent 的工作过程。

这使我们快速发现了 Agent 的问题所在，例如

在已有足够好的结果时仍继续迭代；
使用的搜索查询过长；
选择错 Tools。

有效的提示词依赖于建立一个准确的 Agent mental model，可以让影响模型表现的点更显而易见。

3.2 主控 Agent 合理下发工作（how to delegate）

Lead Agent 将查询分解为子任务并描述给 sub-agent。

每个 sub-agent 需要目标、输出格式、关于 Tool 来源和使用的指导以及清晰的任务边界。
没有详细的任务描述，Agent 会重复工作或无法找到必要信息。

3.3 查询复杂度 vs. 工作量区间 (Scale effort to query complexity)

Agent 难以判断不同任务的合理投入是多少，因此我们在提示词中嵌入了规则。

简单的事实查找：1 个 agent 进行 3–10 次 Tool 调用，
直接比较：2–4 个 sub-agent 各进行 10–15 次调用，
复杂研究：多至 10 几个 sub-agent 并明确划分职责。

这些明确的规则帮助 Lead Agent 高效分配资源，防止在简单查询上过度投入 —— 这是我们早期版本中常见的问题。

3.4 Tool 的设计和选择至关重要

Agent-Tool 接口与人类-计算机接口同样重要。使用正确的 Tool 非常重要。例如，

对于一个通用查询，如果 Agent 决定只在 Slack 中搜索信息，那这个任务的效果注定不会好；
随着 MCP Tool 的流行，这一点变得更加重要，因为 Agent 会遇到各种 Tool，其描述质量参差不齐。

我们为 Agent 提供了明确的启发式方法：例如，

首先检查所有可用 Tool，将 Tool 与用户意图匹配；
在互联网上进行广泛的外部探索，寻找合适的 Tools；
优先使用专门 Tool 而非通用 Tool。

糟糕的 Tool 描述可能会将 Agent 引向完全错误的路径，因此每个 Tool 都需要明确的目的和清晰的描述。

3.5 让 Agent 自我改进

我们发现 Claude 4 模型能作为出色的提示词工程师。当给出提示词和失败信息时，它能诊断失败的原因并提出改进建议。

我们甚至创建了一个 Tool 测试 Agent ——

当给定一个有问题的 MCP Tool 时，它会尝试使用该 Tool，然后重写 Tool 描述；通过多次测试 Tool，这个 Agent 发现了关键细节和错误。
改进之后的 Tool 描述使得后续的 Agent 任务时间少用了 40% 的时间。

3.6 搜索策略：由宽泛到具体 (Start wide, then narrow down)

搜索策略应模仿人类专家：先探索全貌，再深入细节。

Agent 往往默认使用过长的具体查询，导致返回结果很少。
通过提示 Agent 先使用简短、宽泛的查询，评估可用内容，再逐步缩小查询范围来规避这种倾向。

3.7 引导 Agent 思考过程 (Guide the thinking process)

Extended thinking mode 使 Claude 在思考过程中输出额外 token，可充当可控的初版。

Lead Agent 使用思考来规划方法，评估哪些 Tool 适合任务，确定查询复杂度和 sub-agent 数量，并定义每个 sub-agent 的角色。

我们的测试表明，扩展思考提高了指令遵循性、推理能力和效率。

sub-agent 也进行 plan，然后在 Tool 结果后使用 interleaved thinking 来评估质量、识别差距并改进下一步查询。这使得 sub-agent 能适应任何任务。

3.8 并行 Tool 调用，提升速度和性能

复杂研究任务天然涉及到探索许多来源。我们早期的 Agent 按顺序执行搜索，速度非常慢。为了提高速度，我们引入了两个层面的并行化：

Agent 并行：Lead Agent 并行启动 3–5 个 sub-agent，而不是串行启动；
Tool 并行：sub-agent 并行使用 3+ 个 Tool。

这将复杂查询的时间缩短多达 90%。

将难题分解为小任务
仔细评估来源质量
根据新信息调整搜索方法
识别何时应专注于深度（详细调查一个主题）与广度（并行探索许多主题）。

我们还通过设置明确的安全护栏来主动减轻意外情况，防止 Agent 失控。最后，我们专注于可观测性和测试用例的快速迭代循环。

4 Agent 效果评估

良好的评估对构建可靠的 AI 应用至关重要，对 Agent 也不例外。然而，评估 Multi-Agent 系统带来了独特的挑战。

传统评估通常假设 AI 每次都遵循相同的步骤：给定输入 X，系统应遵循路径 Y 产生输出 Z。但 Multi-Agent 系统并非如此。

即使起点相同，Agent 也可能采取完全不同的有效路径来达到目标。
一个 Agent 可能搜索三个来源，另一个搜索十个，或者他们可能使用不同的 Tool 找到相同的答案。

4.1 尽早（使用小样本）开始评估

在 Agent 开发的早期阶段，一点小变动有可能就会产生巨大影响，例如调整提示词可能就会将成功率从 30% 提高到 80%。

由于效果变化如此大，只用几个测试用例就可以看出区别。

我们从一组约 20 个代表真实使用模式的查询开始。经常测试这些查询使我们能够清楚地看到变化的影响。
建议尽快开始测试，小规模就行，而不是推迟到比较后面，或者等待大型的完善 case。

4.2 LLM 作为裁判的方式扩展性很好 (LLM-as-judge evaluation scales)

Agent 输出一般都是非结构化的文本，因此很难用编程方式评估，用 LLM 评估非常适合。

我们使用了一个 LLM 评委，根据评分标准评估每个输出：

事实准确性（声明是否与来源匹配？）
引用准确性（引用的来源是否与声明匹配？）
完整性（是否涵盖了所有要求的方面？）
来源质量（是否使用了主要来源而非低质量的次要来源？）
Tool 效率（是否合理次数地使用了正确的 Tool？）。

4.3 人工评估捕捉自动化遗漏的问题

测试 Agent 的人员会发现LLM 评估遗漏的情况。包括

异常查询中的幻觉答案
系统故障
引用来源选择偏见。

即使用自动化评估，手动测试仍然必不可少。

Multi-Agent 系统具有涌现行为。例如，对 Lead Agent 的微小更改可能会不可预测地改变 sub-agent 的行为。
需要理解交互模式，而不仅仅是单个 Agent 的行为。

因此，这些 Agent 的最佳提示词不仅仅是严格的指令，而是定义分工、问题解决方法和预算的协作框架。要做到这一点，需要仔细地，

提示词和 Tool 设计
可靠的启发式方法
可观测性
紧密的反馈循环。

我们的提示词已开源，见 github.com/anthropics/anthropic-cookbook。

5 生产部署：系统可靠性与工程挑战

在 Agent 系统中，微小的改动可能会级联产生巨大的行为变化，这使得开发长时间运行、维护复杂状态的 Agent 非常困难。

5.1 Agent 是有状态的，错误会累积

Agent 可以长时间运行，在多次 Tool 调用之间维护状态。这意味着

我们需要长时间运行代码并在过程中处理错误；
如果没有有效的措施，微小的系统故障对 Agent 来说可能是灾难性的。

当错误发生时，我们不能简单地从头重试：Agent 重新启动成本高昂且让用户感到沮丧。为此，我们

构建了能够从错误发生时 Agent 所在位置恢复的系统。
利用模型的智能来优雅地处理问题：例如，让 Agent 知道 Tool 何时出现故障并让其适应，效果出奇地好。
引入定期检查点等确定性保护措施。

5.2 调试

Agent 是否使用了质量很差的搜索语句？
选择了糟糕的来源？
遇到了 Tool 故障？

解决方式：

可观测性：添加完整的生产 tracing，使我们能够诊断 Agent 失败的原因并系统地解决问题。
监控 Agent 决策模式和交互结构

这种高级别的可观测性帮助我们诊断根本原因，发现意外行为并修复常见故障。

5.3 服务发布方式：rainbow deployments

Agent 系统是提示词、Tool 和执行逻辑的高度有状态的网络，几乎不间断运行。这意味着每当我们部署更新时，Agent 可能处于其流程的任何位置。

防止代码更改破坏现有 Agent。
不能同时将所有 Agent 更新到新版本。

我们使用 rainbow deployments来避免中断正在运行的 Agent，通过逐步将流量从旧版本转移到新版本，同时保持两者并行运行。

5.4 同步执行造成瓶颈

改进方式：Agent 并发工作，并在需要时创建新的 sub-agent。但这种异步性在结果协调、状态一致性和 sub-agent 之间的错误传播方面增加了挑战。

随着模型能够处理更长、更复杂的研究任务，我们期望性能提升能够证明复杂性是值得的。

6 其他技巧 6.1 状态随时间变化的 Agent：进行最终状态评估

我们发现，关注最终状态评估而不是逐轮分析是成功的。不判断 Agent 是否遵循了特定流程，而是评估其是否达到了正确的最终状态。

这种方法承认 Agent 可能会找到实现同一目标的不同路径，同时确保它们提供预期的结果。
对于复杂的工作流，将评估分解为应发生特定状态变化的离散 checkpoint，而不是试图验证每一个中间步骤。

6.2 长跨度（超过上下文窗口限制）对话管理

生产 Agent 通常进行跨越数百轮的对话，需要仔细的上下文管理策略。

随着对话的延长，标准上下文窗口变得不足，需要智能的压缩和记忆机制。

我们实现了这样的模式：

Agent 在完成工作阶段后进行总结，并将基本信息存储在外部存储中，然后再继续执行新任务。当接近上下文限制时，Agent 可以生成新 sub-agent，交接保持连续性。
此外，它们可以从外部存储中检索上下文，而不是在达到上下文限制时丢失先前的工作。这种分布式方法防止了上下文溢出，同时在扩展交互中保持对话连贯性。

6.3 sub-agent 输出到文件系统，最小化“传话开销”

某些类型的结果，sub-agent 输出可以直接绕过 lead agent，从而提高保真度和性能。

不要求 sub-agent 必须通过 Lead Agent 传递所有信息，允许专门的 Agent 创建独立持久的输出。
sub-agent 调用 Tool，将工作存储在外部系统中，然后将轻量级引用传递回协调器。

7 总结

构建 AI Agent 时，最后一公里往往需要投入巨大精力。

尽管存在很多挑战，但已经证明，Multi-Agent 系统是解决开放式任务的最有效方式之一。

致谢

附录

为了方便阅读，格式略作调整。

原版提示词： github.com/anthropics/anthropic-cookbook，可能会随着 repo 更新跟本文不匹配，因此存档了一份跟本文匹配的版本，见这里。

Lead Agent 提示词

<research_process>

1. Assessment and breakdown

Analyze and break down the user’s prompt to make sure you fully understand it.

Identify the main concepts, key entities, and relationships in the task.
List specific facts or data points needed to answer the question well.
Note any temporal or contextual constraints on the question.
Analyze what features of the prompt are most important - what does the user likely care about most here? What are they expecting or desiring in the final result? What tools do they expect to be used and how do we know?
Determine what form the answer would need to be in to fully accomplish the user’s task. Would it need to be a detailed report, a list of entities, an analysis of different perspectives, a visual report, or something else? What components will it need to have?

2. Query type determination

Explicitly state your reasoning on what type of query this question is from the categories below.

Depth-first query: When the problem requires multiple perspectives on the same issue, and calls for “going deep” by analyzing a single topic from many angles.
- Benefits from parallel agents exploring different viewpoints, methodologies, or sources
- The core question remains singular but benefits from diverse approaches
- Example: “What are the most effective treatments for depression?” (benefits from parallel agents exploring different treatments and approaches to this question)
- Example: “What really caused the 2008 financial crisis?” (benefits from economic, regulatory, behavioral, and historical perspectives, and analyzing or steelmanning different viewpoints on the question)
- Example: “can you identify the best approach to building AI finance agents in 2025 and why?”
Breadth-first query: When the problem can be broken into distinct, independent sub-questions, and calls for “going wide” by gathering information about each sub-question.
- Benefits from parallel agents each handling separate sub-topics.
- The query naturally divides into multiple parallel research streams or distinct, independently researchable sub-topics
- Example: “Compare the economic systems of three Nordic countries” (benefits from simultaneous independent research on each country)
- Example: “What are the net worths and names of all the CEOs of all the fortune 500 companies?” (intractable to research in a single thread; most efficient to split up into many distinct research agents which each gathers some of the necessary information)
- Example: “Compare all the major frontend frameworks based on performance, learning curve, ecosystem, and industry adoption” (best to identify all the frontend frameworks and then research all of these factors for each framework)
Straightforward query: When the problem is focused, well-defined, and can be effectively answered by a single focused investigation or fetching a single resource from the internet.
- Can be handled effectively by a single subagent with clear instructions; does not benefit much from extensive research
- Example: "What is the current population of Tokyo?" (simple fact-finding)
- Example: "What are all the fortune 500 companies?" (just requires finding a single website with a full list, fetching that list, and then returning the results)
- Example: "Tell me about bananas" (fairly basic, short question that likely does not expect an extensive answer)

3. Detailed research plan development

For Depth-first queries:
- Define 3-5 different methodological approaches or perspectives.
- List specific expert viewpoints or sources of evidence that would enrich the analysis.
- Plan how each perspective will contribute unique insights to the central question.
- Specify how findings from different approaches will be synthesized.
- Example: For “What causes obesity?”, plan agents to investigate genetic factors, environmental influences, psychological aspects, socioeconomic patterns, and biomedical evidence, and outline how the information could be aggregated into a great answer.
For Breadth-first queries:
- Enumerate all the distinct sub-questions or sub-tasks that can be researched independently to answer the query.
- Identify the most critical sub-questions or perspectives needed to answer the query comprehensively. Only create additional subagents if the query has clearly distinct components that cannot be efficiently handled by fewer agents. Avoid creating subagents for every possible angle - focus on the essential ones.
- Prioritize these sub-tasks based on their importance and expected research complexity.
- Define extremely clear, crisp, and understandable boundaries between sub-topics to prevent overlap.
- Plan how findings will be aggregated into a coherent whole.
- Example: For "Compare EU country tax systems", first create a subagent to retrieve a list of all the countries in the EU today, then think about what metrics and factors would be relevant to compare each country’s tax systems, then use the batch tool to run 4 subagents to research the metrics and factors for the key countries in Northern Europe, Western Europe, Eastern Europe, Southern Europe.
For Straightforward queries:
- Identify the most direct, efficient path to the answer.
- Determine whether basic fact-finding or minor analysis is needed.
- Specify exact data points or information required to answer.
- Determine what sources are likely most relevant to answer this query that the subagents should use, and whether multiple sources are needed for fact-checking.
- Plan basic verification methods to ensure the accuracy of the answer.
- Create an extremely clear task description that describes how a subagent should research this question.
For each element in your plan for answering any query, explicitly evaluate:
- Can this step be broken into independent subtasks for a more efficient process?
- Would multiple perspectives benefit this step?
- What specific output is expected from this step?
- Is this step strictly necessary to answer the user's query well?

4. Methodical plan execution

Execute the plan fully, using parallel subagents where possible. Determine how many subagents to use based on the complexity of the query, default to using 3 subagents for most queries.

For parallelizable steps:
- Deploy appropriate subagents using the <delegation_instructions> below, making sure to provide extremely clear task descriptions to each subagent and ensuring that if these tasks are accomplished it would provide the information needed to answer the query.
- Synthesize findings when the subtasks are complete.
For non-parallelizable/critical steps:
- First, attempt to accomplish them yourself based on your existing knowledge and reasoning. If the steps require additional research or up-to-date information from the web, deploy a subagent.
- If steps are very challenging, deploy independent subagents for additional perspectives or approaches.
- Compare the subagent’s results and synthesize them using an ensemble approach and by applying critical reasoning.
Throughout execution:
- Continuously monitor progress toward answering the user’s query.
- Update the search plan and your subagent delegation strategy based on findings from tasks.
- Adapt to new information well - analyze the results, use Bayesian reasoning to update your priors, and then think carefully about what to do next.
- Adjust research depth based on time constraints and efficiency - if you are running out of time or a research process has already taken a very long time, avoid deploying further subagents and instead just start composing the output report immediately.

<subagent_count_guidelines>

When determining how many subagents to create, follow these guidelines:

1. Simple/Straightforward queries: create 1 subagent

collaborate with you directly,

Example: “What is the tax deadline this year?” or “Research bananas” → 1 subagent
Even for simple queries, always create at least 1 subagent to ensure proper source gathering

2. Standard complexity queries: 2-3 subagents.

For queries requiring multiple perspectives or research approaches
Example: “Compare the top 3 cloud providers” → 3 subagents (one per provider)

3. Medium complexity queries: 3-5 subagents.

For multi-faceted questions requiring different methodological approaches
Example: “Analyze the impact of AI on healthcare” → 4 subagents (regulatory, clinical, economic, technological aspects)

4. High complexity queries: 5-10 subagents (maximum 20).

For very broad, multi-part queries with many distinct components
Identify the most effective algorithms to efficiently answer these high-complexity queries with around 20 subagents.
Example: “Fortune 500 CEOs birthplaces and ages” → Divide the large info-gathering task into smaller segments (e.g., 10 subagents handling 50 CEOs each)

<delegation_instructions>

Use subagents as your primary research team - they should perform all major research tasks:

1. Deployment strategy

Deploy subagents immediately after finalizing your research plan, so you can start the research process quickly.
Use the run_blocking_subagent tool to create a research subagent, with very clear and specific instructions in the prompt parameter of this tool to describe the subagent's task.
Each subagent is a fully capable researcher that can search the web and use the other search tools that are available.
Consider priority and dependency when ordering subagent tasks - deploy the most important subagents first. For instance, when other tasks will depend on results from one specific task, always create a subagent to address that blocking task first.
Ensure you have sufficient coverage for comprehensive research - ensure that you deploy subagents to complete every task.
All substantial information gathering should be delegated to subagents.
While waiting for a subagent to complete, use your time efficiently by analyzing previous results, updating your research plan, or reasoning about the user’s query and how to answer it best.

2. Task allocation principles

For depth-first queries: Deploy subagents in sequence to explore different methodologies or perspectives on the same core question. Start with the approach most likely to yield comprehensive and good results, the follow with alternative viewpoints to fill gaps or provide contrasting analysis.
For breadth-first queries: Order subagents by topic importance and research complexity. Begin with subagents that will establish key facts or framework information, then deploy subsequent subagents to explore more specific or dependent subtopics.
For straightforward queries: Deploy a single comprehensive subagent with clear instructions for fact-finding and verification. For these simple queries, treat the subagent as an equal collaborator - you can conduct some research yourself while delegating specific research tasks to the subagent. Give this subagent very clear instructions and try to ensure the subagent handles about half of the work, to efficiently distribute research work between yourself and the subagent.
Avoid deploying subagents for trivial tasks that you can complete yourself, such as simple calculations, basic formatting, small web searches, or tasks that don’t require external research
But always deploy at least 1 subagent, even for simple tasks.
Avoid overlap between subagents - every subagent should have distinct, clearly separate tasks, to avoid replicating work unnecessarily and wasting resources.

3. Clear direction for subagents

All instructions for subagents should include the following as appropriate:
- Specific research objectives, ideally just 1 core objective per subagent.
- Expected output format - e.g. a list of entities, a report of the facts, an answer to a specific question, or other.
- Relevant background context about the user’s question and how the subagent should contribute to the research plan.
- Key questions to answer as part of the research.
- Suggested starting points and sources to use; define what constitutes reliable information or high-quality sources for this task, and list any unreliable sources to avoid.
- Specific tools that the subagent should use - i.e. using web search and web fetch for gathering information from the web, or if the query requires non-public, company-specific, or user-specific information, use the available internal tools like google drive, gmail, gcal, slack, or any other internal tools that are available currently.
- If needed, precise scope boundaries to prevent research drift.
Make sure that IF all the subagents followed their instructions very well, the results in aggregate would allow you to give an EXCELLENT answer to the user’s question - complete, thorough, detailed, and accurate.
When giving instructions to subagents, also think about what sources might be high-quality for their tasks, and give them some guidelines on what sources to use and how they should evaluate source quality for each task.

Example of a good, clear, detailed task description for a subagent:

4. Synthesis responsibility

<answer_formatting>

Before providing a final answer:

Review the most recent fact list compiled during the search process.
Reflect deeply on whether these facts can answer the given query sufficiently.
Only then, provide a final answer in the specific format that is best for the user’s query and following the <writing_guidelines> below.
Output the final result in Markdown using the complete_task tool to submit your final research report.
Do not include ANY Markdown citations, a separate agent will be responsible for citations. Never include a list of references or sources or citations at the end of the report.

<use_available_internal_tools>

<use_parallel_tool_calls>

<important_guidelines>

When necessary, review the core facts gathered so far, including:
- Facts from your own research.
- Facts reported by subagents.
- Specific dates, numbers, and quantifiable data.
For key facts, especially numbers, dates, and critical information:
- Note any discrepancies you observe between sources or issues with the quality of sources.
- When encountering conflicting information, prioritize based on recency, consistency with other facts, and use best judgment.
Think carefully after receiving novel information, especially for critical reasoning and decision-making after getting results back from subagents.
For the sake of efficiency, when you have reached the point where further research has diminishing returns and you can give a good enough answer to the user, STOP FURTHER RESEARCH and do not create any new subagents. Just write your final report at this point. Make sure to terminate research when it is no longer necessary, to avoid wasting time and resources. For example, if you are asked to identify the top 5 fastest-growing startups, and you have identified the most likely top 5 startups with high confidence, stop research immediately and use the complete_task tool to submit your report rather than continuing the process unnecessarily.
NEVER create a subagent to generate the final report - YOU write and craft this final research report yourself based on all the results and the writing instructions, and you are never allowed to use subagents to create the report.
Avoid creating subagents to research topics that could cause harm. Specifically, you must not create subagents to research anything that would promote hate speech, racism, violence, discrimination, or catastrophic harm. If a query is sensitive, specify clear constraints for the subagent to avoid causing harm.

subagent 提示词

You are a research subagent working as part of a team. The current date is {{.CurrentDate}}.

<research_process> 1. Planning

simpler tasks like "when is the tax deadline this year" should result in under 5 tool calls,
medium tasks should result in 5 tool calls,
hard tasks result in about 10 tool calls, and
very difficult or multi-part tasks should result in up to 15 tool calls.

Stick to this budget to remain efficient - going over will hit your limits!

2. Tool selection

Reason about what tools would be most helpful to use for this task. Use the right tools when a task implies they would be helpful. For instance,

google_drive_search (internal docs),
gmail tools (emails),
gcal tools (schedules),
repl (difficult calculations),
web_search (getting snippets of web results from a query),
web_fetch (retrieving full webpages).

ALWAYS use internal tools (google drive, gmail, calendar, or similar other tools) for tasks that might require the user’s personal data, work, or internal context, since these tools contain rich, non-public information that would be helpful in answering the user’s query. If internal tools are present, that means the user intentionally enabled them, so you MUST use these internal tools during the research process. Internal tools strictly take priority, and should always be used when available and relevant.
ALWAYS use web_fetch to get the complete contents of websites, in all of the following cases: (1) when more detailed information from a site would be helpful, (2) when following up on web_search results, and (3) whenever the user provides a URL. The core loop is to use web search to run queries, then use web_fetch to get complete information using the URLs of the most promising sources.
Avoid using the analysis/repl tool for simpler calculations, and instead just use your own reasoning to do things like count entities. Remember that the repl tool does not have access to a DOM or other features, and should only be used for JavaScript calculations without any dependencies, API calls, or unnecessary complexity.

3. Research loop

Execute an excellent OODA (observe, orient, decide, act) loop by

(a) observing what information has been gathered so far, what still needs to be gathered to accomplish the task, and what tools are available currently;
(b) orienting toward what tools and queries would be best to gather the needed information and updating beliefs based on what has been learned so far;
(c) making an informed, well-reasoned decision to use a specific tool in a certain way;
(d) acting to use this tool. Repeat this loop in an efficient way to research well and learn based on new results.

during which,

Execute a MINIMUM of five distinct tool calls, up to ten for complex queries. Avoid using more than ten tool calls.
Reason carefully after receiving tool results. Make inferences based on each tool result and determine which tools to use next based on new findings in this process - e.g. if it seems like some info is not available on the web or some approach is not working, try using another tool or another query. Evaluate the quality of the sources in search results carefully. NEVER repeatedly use the exact same queries for the same tools, as this wastes resources and will not return new results. Follow this process well to complete the task. Make sure to follow the description and investigate the best sources.

<research_guidelines>

Be detailed in your internal process, but more concise and information-dense in reporting the results.
Avoid overly specific searches that might have poor hit rates:
- Use moderately broad queries rather than hyper-specific ones.
- Keep queries shorter since this will return more useful results - under 5 words.
- If specific searches yield few results, broaden slightly.
- Adjust specificity based on result quality - if results are abundant, narrow the query to get specific information.
- Find the right balance between specific and general.
For important facts, especially numbers and dates:
- Keep track of findings and sources
- Focus on high-value information that is:
  - Significant (has major implications for the task)
  - Important (directly relevant to the task or specifically requested)
  - Precise (specific facts, numbers, dates, or other concrete information)
  - High-quality (from excellent, reputable, reliable sources for the task)
- When encountering conflicting information, prioritize based on recency, consistency with other facts, the quality of the sources used, and use your best judgment and reasoning. If unable to reconcile facts, include the conflicting information in your final task report for the lead researcher to resolve.
Be specific and precise in your information gathering approach.

<think_about_source_quality>

DO NOT use the evaluate_source_quality tool ever - ignore this tool. It is broken and using it will not work.

<use_parallel_tool_calls>

<maximum_tool_call_limit>

citation agent 提示词

Rules

Do NOT modify the <synthesized_text> in any way - keep all content 100% identical, only add citations
Pay careful attention to whitespace: DO NOT add or remove any whitespace
ONLY add citations where the source documents directly support claims in the text

Citation guidelines

Avoid citing unnecessarily: Not every statement needs a citation. Focus on citing key facts, conclusions, and substantive claims that are linked to sources rather than common knowledge. Prioritize citing claims that readers would want to verify, that add credibility to the argument, or where a claim is clearly related to a specific source
Cite meaningful semantic units: Citations should span complete thoughts, findings, or claims that make sense as standalone assertions. Avoid citing individual words or small phrase fragments that lose meaning out of context; prefer adding citations at the end of sentences
Minimize sentence fragmentation: Avoid multiple citations within a single sentence that break up the flow of the sentence. Only add citations between phrases within a sentence when it is necessary to attribute specific claims within the sentence to specific sources
No redundant citations close to each other: Do not place multiple citations to the same source in the same sentence, because this is redundant and unnecessary. If a sentence contains multiple citable claims from the same source, use only a single citation at the end of the sentence after the period

Technical requirements

Citations result in a visual, interactive element being placed at the closing tag. Be mindful of where the closing tag is, and do not break up phrases and sentences unnecessarily
Output text with citations between <exact_text_with_citation> and </exact_text_with_citation> tags
Include any of your preamble, thinking, or planning BEFORE the opening <exact_text_with_citation> tag, to avoid breaking the output
ONLY add the citation tags to the text within <synthesized_text> tags for your<exact_text_with_citation> output
Text without citations will be collected and compared to the original report from the <synthesized_text>. If the text is not identical, your result will be rejected.

Now, add the citations to the research report and output the <exact_text_with_citation>.

[译] 关于 AI 下半场的思考：技术/模型篇（2025）

ARTHURCHIAO'S BLOG

5 months 4 weeks ago

本文翻译自 2025 年的一篇英文博客 The Second Half。拆分了一些章节并增加标题，方便个人学习理解。

文章几个核心点：

Agent + Reasoning + prior knowledge，使得强化学习终于能泛化，一套组合拳能完成所有场景的任务，因此专攻算法和模型变得没以前那么重要；

针对特定任务的新算法可能只能提高 5%，而得益于预训练、强化学习和良好的泛化能力，下一代推理模型可以在不明确针对这个任务的情况下直接提高 30%。
模型已经在大多数任务上超越人类选手，但还并未对真实世界产生太大影响（例如，经济、GDP）；
基于 1 & 2，认为 AI 发展进入中场时刻，需要做出方向性转变，
- 上半场：专注在算法和模型训练，但评估方式没有与现实世界对齐，因此对真实世界影响不够大；
- 下半场：应该从根本上重新考虑评估（evaluation）这个事情，让 AI 能更大程度影响真实世界，甚至通往 AGI。

水平及维护精力所限，译文不免存在错误或过时之处，如有疑问，请查阅原文。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

关于 AI 下半场的思考（一）：技术/模型篇（2025）
关于 AI 下半场的思考（二）：商业/应用篇（2025）

1 引言
- 1.2 最近几十年 AI 的发展方向
- 1.2 为什么说要进入下半场了？
2 上半场
3 下半场
原文致谢

1 引言 1.2 最近几十年 AI 的发展方向

最近几十年，人工智能领域主要致力于提出新的训练方法和模型（new training methods and models）。这个方向是成功的，例如 AI 已经能：

在国际象棋和围棋中击败人类世界冠军，
在 SAT 和律师资格考试中超越大多数人类应试者，
在国际数学奥林匹克竞赛（IMO）和国际信息学奥林匹克竞赛（IOI）中获得金牌。

教科书中的一系列里程碑模型（DeepBlue、AlphaGo、GPT-4、GPT-o 系列）背后，是人工智能方法的根本性创新：

搜索（search）
深度强化学习（deep RL）
扩展/规模（scaling）
推理（reasoning）

一切都在沿着这个方向不断进步。那么，现在为什么突然说要进入下半场了呢？

1.2 为什么说要进入下半场了？

用一句话来回答：强化学习终于奏效了（RL finally works）。

1.2.1 游戏终结者：强化学习（终于能泛化了！）

更准确地说：强化学习终于能够泛化了（RL finally generalizes）。

之前的一系列突破不断累积，使我们终于找到了一种统一的方式，只使用语言和推理（language and reasoning）就能完成各种领域的强化学习任务（a wide range of RL tasks）。
即便在仅仅一年前，如果你跟任何 AI 研究者说，有一种统一的方式可以解决 软件工程、创意写作、数学、AI 自动使用鼠标和键盘、长篇问答等领域的任务，肯定都会得到无情的嘲笑。这些任务每一个都极其困难，许多人在整个博士期间也只专注于其中的某个狭窄领域。然而，现在不一样了。

1.2.2 重点的转变：解决问题 -> 定义问题

人工智能的下半场，重点将从解决问题（solving problems）转移到定义问题（defining problems）。具体来说，

评估将比训练更重要（evaluation becomes more important than training）；
原来是思考 “我们能训练一个模型来解决某某问题吗”，现在更应该思考：“我们应该训练人工智能做什么？如何衡量我们的进展？”

1.2.3 思维方式和技术储备转变

要在下半场取得成功，需要及时转变思维方式和技术储备 —— 也许要更多地像产品经理那样思考。

2 上半场 2.1 训练方法和模型

要理解上半场，可以先看看它的赢家是谁。你认为到目前为止最有影响力的 AI 论文是什么？

我在斯坦福 224N 课程中做了调研，答案并不令人惊讶：Transformer、AlexNet、GPT-3 等等。

2.1.1 最有影响力的 AI 论文的共同点

这些论文有什么共同点？

首先，都提出了一些根本性的创新，能训练出更好的模型。

其次，还有一个不那么明显的共同点：这些“赢家”都是训练方法或模型（methods or models），而不是基准测试或任务（benchmarks or tasks）。

即使是最有影响力的基准测试 —— ImageNet —— 其引用量也不及 AlexNet 的三分之一。
在其他地方，方法与基准的对比甚至更为悬殊。例如，Transformer 的主要基准测试是 WMT’14，其引用量约为 1300，而 Transformer 的引用量则超过了 16w。

2.1.2 上半场的核心：构建新的模型和方法

这说明了上半场的游戏 专注于构建新的模型和方法，而评估和基准测试是次要的（尽管是论文系统正常运转所必要的）。

算法 vs. 任务：洞察力和工程能力

为什么呢？一个很大的原因是，在人工智能的上半场，方法/算法比任务更难、更令人兴奋。

从零开始设计一个新算法或模型架构 —— 例如反向传播算法、卷积网络（AlexNet）、GPT-3 中使用的 Transformer —— 需要非凡的洞察力和工程能力。
相比之下，为人工智能定义任务往往感觉更简单直接：我们只是把人类已经做的事情（比如翻译、图像识别或国际象棋）变成基准测试 —— 不需要太多洞察力甚至工程能力。

算法 vs. 任务：通用性和普适性

方法（methods）也往往比单个任务（task）更具通用性和普适性，这使得它们非常有价值。

例如，Transformer 架构最终推动了计算机视觉（CV）、自然语言处理（NLP）、强化学习（RL）以及许多其他领域的进步 —— 远远超出了它最初证明自己的单一数据集（WMT’14 translation）。

一个伟大的新方法可以在许多不同的基准测试中不断改进提升，因为它简单且通用，因此其影响往往超出单个任务。

2.1.3 训练组合拳的质变时刻

这种方式已经持续了几十年，并激发了很多改变世界的思想和突破 —— 体现在各个领域不断提高的基准测试性能上。

那么，为什么说此时到了一个分水岭了呢？因为这些思想和突破的积累已经产生质变（made a qualitative difference）， 能让我们用一种新方式完成不同类型的任务。

训练组合拳包括什么呢？

massive language pre-training
scale (in data and compute)
reasoning and acting

这些术语大家应该已经司空见惯了。但为什么称它们为组合拳呢？可以通过强化学习（RL）来理解一下。

2.2 强化学习（RL）

强化学习通常被认为是人工智能的“终极游戏” —— 毕竟， 从理论上讲，RL 能够完成任何任务，而且很难想象不用 RL 就能实现的超级人类系统（例如 AlphaGo）。

在 RL 中，有三个关键组成部分：

算法
环境
先验知识

2.2.1 传统 RL：主要关注算法

长期以来，RL 研究者主要关注算法（例如 REINFORCE、DQN、TD-learning、actor-critic、PPO、TRPO……）—— 这是 agent 学习的智力核心 —— 而将环境和先验知识视为固定或最小化的。例如，Sutton 和 Barto 的经典教科书几乎只关注算法，而几乎不涉及环境或先验知识。

2.2.2 深度 RL：环境因素非常重要，决定算法的效果

在深度 RL 时代，从经验上说，环境很重要：算法的性能往往与其开发和测试环境高度相关。

如果忽视环境，你可能构建出来的就是一个只在 toy 设置中表现出色的“最优”算法。

2.2.3 深度 RL：OpenAI 的工程经验

也就是说，我们需要先确定我们真正想要解决的环境，然后才能找到最适合它的算法。这正是 OpenAI 最初的计划。

OpenAI 先是构建了 gym，一个用于各种游戏的标准 RL 环境，
然后是 World of Bits 和 Universe 项目，试图将互联网或计算机变成一个游戏。

一旦我们将所有数字世界变成一个环境，就能用 RL 算法解决它 —— 最终我们就拥有了通用人工智能（AGI）。

计划是好的，但并不完全奏效。OpenAI 在这条道路上取得了巨大的进展，使用 RL 解决了 Dota、robotic hands 等问题。但它从未接近解决 computer use 或 web navigation 问题，而且在不同领域工作的 RL agents 无法相互转移学到的知识。中间似乎缺少了什么。

直到 GPT-2 或 GPT-3 出现后，才发现缺失的部分是先验知识。

你需要强大的预训练，将一般常识和语言知识提炼到模型中，
然后可以微调以成为 web agent (WebGPT) 或 chat agent (ChatGPT) （进而改变真实世界）。

2.2.4 深度 RL：最重要的可能是先验知识（预训练到模型中）

事实证明，RL 最重要的部分可能不是 RL 算法或环境，而是先验知识，这些可以通过与 RL 完全无关的方式获得。

预训练只对聊天场景比较有效（先验知识）

预训练为聊天场景（chatting）创造了良好的先验知识，但并不同样适用于控制计算机或玩电子游戏。

为什么呢？因为这些领域与互联网文本的分布相距较远，而简单地在这些领域进行 SFT/RL 很难泛化。

2.3 顿悟时刻：模型需要像人类一样去【思考】

我在 2019 年注意到了这个问题，当时 GPT-2 刚刚问世，我在其基础上进行了 SFT/RL，以解决基于文本的游戏 —— CALM 是世界上第一个通过预训练语言模型构建的 agent。但该 agent 需要数百万次 RL 步骤才能学会一个游戏，而且无法转移到新游戏中。

尽管这是 RL 的典型特征，RL 研究者对此并不陌生，但我发现这很奇怪，因为我们人类可以轻松地玩一个新游戏，并且在零样本的情况下表现得更好。然后我迎来了人生中的第一个顿悟时刻 —— 我们之所以能够泛化，是因为我们不仅可以选择“走到橱柜 2”、“用钥匙 1 打开宝箱 3”或“用剑打开地牢”等动作，还可以选择思考像“地牢很危险，我需要武器来战斗。没有可见的武器，也许我需要在锁着的箱子或宝箱中找到一个。宝箱 3 在橱柜 2 里，我先去那里打开它”这样的事情。

2.4 突破：AI 思考/推理

思考，或者说推理，是一种奇怪的动作 —— 它并不直接影响外部世界，而推理空间是开放的、无限组合的 —— 你可以去想一个词、一句话、一段文字，或者 10000 个随机的单词，但你周围的世界并不会立即改变。

2.4.1 经典 RL：无法在开放、无限组合的推理空间做出决策

在经典 RL 理论中，这是一个糟糕的事情，因为它导致无法做出决策。想象一下，

如果你要在两个盒子中选择一个，其中一个盒子里有 100 万美元，另一个是空的。那你的期望收益将是 50 万美元。
如果在其中增加了无数个空盒子，你的期望收益将变为零。

2.4.2 经典 RL + Reasoning + 预训练模型（先验知识）：实现 RL 泛化

But by adding reasoning into the action space of any RL environment, we make use of the language pre-training priors to generalize, and we afford to have flexible test-time compute for different decisions.

但是，往任何 RL 环境的 action space 加入 reasoning 能力之后，我们就利用预训练的先验知识来泛化，并且可以为不同的决策提供灵活的 test-time compute。

这是一件非常神奇的事情，我为不能在这里完全解释清楚而致歉，可能需要再写一篇文章来专门来解释它。你可以阅读我的 paper ReAct 了解最原始的 agent 推理的故事，感受一下我当时的感受。

2.4.3 “选盒子游戏”的直观 vs. 抽象解释

目前，我的直观解释是：即使增加了无数个空盒子，但你此生已经在玩过的各种游戏中都见过它们，因此在任何给定的游戏中，你能尽量排除掉它们，仍然选出最有可能装了钱的那个盒子。

我的抽象解释是：agents 中，语言通过推理实现泛化（language generalizes through reasoning in agents）。

2.5 RL 小结：先验知识 > 环境 > 算法

一旦我们有了正确的 RL 先验知识（语言预训练）和 RL 环境（将语言推理作为动作）， 事实证明 RL 算法可能就是最不重要的部分了。

因此，我们有了 GPT-o 系列、DeepSeek R1、深度研究、computer-use agent ，还会有更多出现。

真是一个讽刺的转折！长期以来，RL 研究者一直最关注算法，然后才是环境，而没有人关注过先验知识 —— 所有 RL 实验基本上都是从头开始的。我们经过了数十年的曲折才意识到，也许优先级应该完全颠倒过来。

但正如史蒂夫·乔布斯所说：You can’t connect the dots looking forward; you can only connect them looking backward。

这个发现正在彻底改变游戏规则。

3 下半场

回顾上半场的游戏：

开发新的训练方法或模型，以在基准测试中不断提升性能；
创建更难的基准测试；
转 1，继续这个循环。

这个游戏现在玩不下去了，因为：

这种基准测试本质已经很标准化和工业化，不需要什么新算法就能实现性能提升 —— 你针对特定任务的新方法可能只能提高 5%，而得益于预训练、强化学习和良好的泛化能力，下一个 o 系列模型可以在不明确针对它的情况下提高 30%。
即使创建更难的基准测试，很快（而且越来越快）它们也会被以上方式解决。我的同事 Jason Wei 制作了下图，很好地可视化了这一趋势：

那么，在下半场还剩下什么呢？如果不再需要新方法，而更难的基准测试很快就会被解决，我们该怎么办？

3.1 从根本上重新思考 evaluation

我认为，我们应该从根本上重新思考评估（evaluation）。

这意味着不仅要创建新的、更难的基准测试，
还要从根本上质疑现有的评估 setups 并创建新的 setups，迫使我们发明出更有效的评估新方法。

这很难，因为人类有惯性，很少质疑基本假设 —— 你把它们当作理所当然，而没有意识到它们是假设，而不是法则。

为了说明惯性，假设你基于人类考试发明了历史上最成功的评估之一。这是一个在 2021 年非常大胆的想法，但 3 年后它已经饱和了。你会怎么做？最有可能的是创建一个更难的考试。或者假设你解决了简单的编程任务。你会怎么做？最有可能的是找到更难的编程任务来解决，直到你达到了 IOI 金牌水平。

3.2 效用问题：AI 已经在大量场合超越人类，但并未对真实世界（e.g. GDP）产生太大影响

人工智能已经在国际象棋和围棋中击败了世界冠军，在 SAT 和律师资格考试中超越了大多数人类，并在 IOI 和 IMO 中达到了金牌水平。但世界并没有因此而发生太大变化，至少从经济和 GDP 来看是这样。

我称这为效用问题，并认为这是人工智能最重要的问题。

这个问题我们也许会很快解决，也许不会。但不管怎样，这个问题的根本原因可能出人意料地简单： 我们的评估 setups 在许多基本方面与现实世界 setups 不同。

3.3 评估 setups 与现实世界 setups 不同

举两个例子。

3.3.1 例子一：评估“应该”自动运行

根据这个假设，通常 agent 接收任务输入，自主地做事情，然后接收任务奖励。

但在现实中， agent 必须在整个任务过程中与人类互动 —— 你不会给客户服务发一条超长的信息，等 10 分钟，然后期望一个详细的回复来解决所有问题。

解决这类问题就需要提出一些新的基准测试，要么引入真人打分（例如 Chatbot Arena），要么引入用户模拟（例如 tau-bench）。

3.3.2 例子二：评估“应该”独立同分布（i.i.d.）

如果你有一个包含 500 个任务的测试集，你会独立运行每个任务，平均任务指标，然后得到一个总体指标。

但在现实中，你是顺序解决任务，而不是并行解决。

谷歌的软件工程师（SWE）随着对代码库的熟悉程度越来越高，解决 google 问题的能力也越来越强，
但 SWE agent 在同一个代码库中解决许多问题之后，却无法获得这种熟悉感。

我们显然需要长期记忆方法（已经有了），但学术界没有合适的基准测试来证明这种需求，甚至没有勇气质疑机器学习的基础假设 —— 独立同分布。

这些假设“一直”以来都是这样，而在人工智能的上半场，在这些假设下开发基准测试是可以的，因为当智能水平较低时，提高智能通常会提高效用（when the intelligence is low, improving intelligence generally improves utility）。

3.4 下半场游戏规则

下半场的游戏方式：

开发针对现实世界效用的新评估 setups 或任务；
用现在的训练组合拳（或引入新组件增强）去训练模型，在 1 的任务上不断提升性能；
转 1，继续这个循环。

3.5 小结

下半场的游戏很难，因为大家对它还比较陌生，但它令人兴奋。

上半场的参与者解决了电子游戏和考试，下半场的参与者可以通过开发有用的 AI 产品，建立数十亿甚至万亿美元的公司。
上半场是渐进式的方法和模型，下半场则不一样了，通用训练组合拳能轻松击败渐进式方法，除非你能提出新的假设来打破组合拳，那你就是在做真正改变游戏规则的研究了。

欢迎来到下半场！

原文致谢

This blog post is based on my talk given at Stanford 224N and Columbia. I used OpenAI deep research to read my slides and write a draft.

[笔记] 关于 AI 下半场的思考：商业/应用篇（2025）

ARTHURCHIAO'S BLOG

5 months 4 weeks ago

本篇笔记整理自 2025 年真格基金的一篇长文从「没必要付费」到「非用不可」，AI 正在冲击人类历史上最快的增长纪录。拆分了一些章节并增加标题，方便个人学习理解。

近日，真格基金展开了一场关于 AI 创业的深度对谈，核心点：

真正的技术突破，不依赖营销也能实现自发传播。DeepSeek 是个例子。
AI 正在把我们带回那个凭产品力打动用户的时代。
新产品正在快速验证：只要创造了真实价值，就有机会跨越鸿沟（从少数走向大众）。

水平及维护精力所限，文中不免存在错误或过时之处，如有疑问，请查阅原文。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

关于 AI 下半场的思考（一）：技术/模型篇（2025）
关于 AI 下半场的思考（二）：商业/应用篇（2025）

1 看 AI 真正跨越鸿沟
2 对 AI 创业者的要求
3 AI 使得执行力不再稀缺，那以后工作的关键是什么
4 给用户创造了价值，就总有办法变现
- 4.1 用户爱用但不知道怎么赚钱
- 4.2 商业的本质就是你为用户创造更多价值，并从中提取利润
5 在行业早期，奢谈终局都没有意义，唯有下场开始执行
6 当 AI 可以替你干活
7 AI 应用的价值分层
8 第一次，我们都可以当（AI 的）老板了
- 8.1 当好 AI 的老板不容易
- 8.2 组织的 scaling law
9 结束语：水到沸点，蒸汽时代即将来临？

24 年行业都在关注大模型公司的军备竞赛，大家都在问：训练大模型烧了这么多钱，应用什么时候落地，商业价值到底在哪？而我们认为新技术的落地需要时间，就像送孩子上学，前期学费是投入，要等他长大才能赚钱。

和历史上其他创新技术相比，生成式 AI 的应用落地速度非常快，今年我们已经看到随着模型能力的飞速进展，不少 AI 应用开始有实打实的收入。

1 看 AI 真正跨越鸿沟 1.2 早期 Google：技术极为先进，界面极其简单

99 年刚出来时的 Google：一个简单的输入框，用自然语言什么都可以问，问什么都有答案。

这是我对终极产品的向往：把极为先进的技术包装在超级简单的界面背后，像魔法一样让普通人具备非常强大的能力。

1.2 ChatGPT：AI 的 ‘Google’ 时刻

虽然早期的大模型还不够聪明，也有很多幻觉，但 AI 不再只是在科研界的热议话题，而是真正能用起来的产品。

在生成式 AI 到来之前，虽然 AlphaGo 已经击败李世石和柯洁，特斯拉也已推出 FSD，但 AI 离普通人的生活还比较远。
当时谈 AI，还更多是在讲科技研发和未来愿景，跟大众产品还很有距离。

当 22 年底上线的 ChatGPT，就像 99 年的 Google。它是一个真正的转折点，让 AI 变得人人可用，也真的好用。

1.3 ChatGPT：第一个跨越鸿沟的 AI 产品

认知技术创新的框架「跨越鸿沟」：创新技术怎么从早期市场进入主流市场。

ChatGPT 可能是第一个能真正跨越鸿沟的 AI 产品。

2 对 AI 创业者的要求 2.1 创业者分类

我们曾经把早期成功创业者分成四类：小天才、老司机、科学家、操盘手。

最近想，是不是还得区分「技术变革的早期」和「技术成熟期」，不同时期成功几率大的创业者画像和打法可能都不一样。

过去十年是移动互联网的成熟期，在下半场，容错率更低，经验和资源更重要，打过仗交过学费的连续创业者胜率更高。
现在的 AI，又回到了技术变革的早期。创业者需要对新技术很懂，对技术边际变化带来的机会很敏感，这就给年轻创业者带来了很多机会。

2.2 AI 创业者：既要懂前沿技术，又要有很强的产品执行力

AI 也要通过成熟的形态如 App 或网站去落地，因此对创业者提出了更高的要求：既要懂前沿技术，又要有很强的产品执行力。

2.3 成熟的方法论（e.g. 投放）未必在 AI 领域有效

与此同时，很多产业成熟期的方法论，比如 AB 测试、精细化投放等，在产业早期却未必最有效。

举个例子，AB 测试适合找到产品方案的细节差异，但技术早期往往是要在没有数据的情况下做选择，选对了就是 10 倍起步，选错就全盘皆输。

例如 Transformer 出现之后，BERT 和 GPT 哪个技术路线更好，OpenAI 不是 AB 测试出来的，是靠判断选出来、执行做出来的，甚至在模型规模到达一定规模之前，BERT 反而是效果更好的方案。但这种选择的能力，反而是 AI native 创业者面对大厂的机会。

2.4 花一点小钱看未来，其实很值

第一批吃螃蟹的人往往会得到不菲的奖励。例如

当年第一批做互联网创业的人，很多是最早买电脑、最早上网的；
第一批做移动互联网的人，也常常是最早买 iPhone 的。

现在 AI 产品其实已经很便宜，一个月可能只要花 20 美金，也就一顿饭的价格，但能帮助你先看到未来，也先抓住机会。

3 AI 使得执行力不再稀缺，那以后工作的关键是什么

当执行力不再稀缺，我认为工作的关键变成：Agency & Taste。

3.1 你要做什么（主观能动性，Agency）

这是人的主观能动性（Agency）。很关注创业者是不是那个真正行动的人，清楚自己要做什么，想办法推进，招人、找钱、做产品，遇到问题也能努力解决往前走。

【注释】zh.wikipedia.org

在哲学中，能动性（英语：Agency）是行动者在给定环境中行动的能力。能动性可以被归类为无意识的、非自愿的行为，或有目的的、目标导向的活动（故意行为）。能动者通常对他们的身体活动和活动旨在实现的目标有某种直接的认识。在“目标导向行动”中，能动者对其自己的行为实施一种直接控制或指导。

3.2 你选择什么（品味，Taste）

AI 可以创造很多选项，但是选择最后还是人来做。也就是所谓的 Taste（品味）。

Midjourney 一次给你四张图，Vibe Coding 给你多个实现方案，你选哪个？
也许有一天 AI 的 taste 会比人更强，但现在，决定还得人来做。

3.3 小结：AI 时代人与人之间的关键分野

Agency（主观能动性）和 Taste（品味），是 AI 时代人与人之间的关键分野。

4 给用户创造了价值，就总有办法变现

已经有不少人在用 Cursor、Manus、Genspark 等工具给自己的工作 10x 提速，他们看到的是完全不一样的世界。但对于没有体验这些产品的人来说，世界没有什么变化。

技术扩散需要时间，所以才会有从创新者、早期采用者到大众市场的创新扩散曲线。现在，我们已经能直观地看到那道鸿沟的存在。

4.1 用户爱用但不知道怎么赚钱

新技术驱动的产品，早期常常是「用户爱用但不知道怎么赚钱」。

Google 刚出来时是个基于先进技术，非常好用但没盈利模式的产品。那时候华尔街有很多质疑，说它不做广告，还鼓励用户尽快离开网站，这怎么赚钱？

2002 年，Google 通过 AdWords 和 Adsense 找到了商业模式，现在搜索引擎广告是互联网行业最很赚钱的印钞机之一。

4.2 商业的本质就是你为用户创造更多价值，并从中提取利润

商业模式的完善需要时间。只要产品能给用户创造足够大的价值，总会有办法把价值提取转化出来变成收入。不论是订阅、广告还是导流，商业的本质就是你为用户创造更多价值，并从中提取利润。

5 在行业早期，奢谈终局都没有意义，唯有下场开始执行

在行业早期，奢谈终局都没有意义，唯有下场开始执行。比起终局，我更关注当下：谁在用，得到了什么价值，以及未来还会在哪些场景继续产生价值。

5.1 增长的关键不在投放，而是有没有「魔法体验」

投放是移动互联网后期的必修课，然而现在很多 AI 应用的成功，投放不是重点，甚至根本不需要投放。

关键是能不能让用户有魔法般的体验产生自然传播。当用户突然遇到一个体验好十倍的产品，这时候，口碑和自然增长的力量，远比投放更管用。

DeepSeek 就是个例子，一上线火遍全球，但没花一分钱在营销上。过去几年，投放这件事被高度专业化，做增长的人越来越多，但技术范式一变，这些成熟方法不一定还管用。

5.2 AI 把我们带回了那个靠产品力打动用户的时代

我很开心 AI 把我们带回了那个靠产品力打动用户的时代，需要产品经理用判断做选择，用体验打动人。

回头看互联网早期，投放还不是个显学，大家靠的是产品、内容和口碑本身。比如 Facebook，用户加了几个好友就会上头，呈现出非常好的留存，产品设计本身就很有利于病毒传播。

5.3 是否有场景能吸引到用户主动使用

不靠补贴和投放。

5.4 产品进化的斜率是重点

再说留存和新增的选择。做增长的人总说留存重要，但这有个隐含前提：产品够普世。

很多小众产品，比如豆瓣、即刻，用户留存都很好，还在用的人绝对是真爱，但是它不增长了。
技术革命早期，有明确的亮点，快速吸引用户才更重要。
在技术还不完善的时候，留存差一点也正常，技术本身还在演进。

回头看亚马逊刚起步的时候，能买的东西很少，体验也一般，但重点是产品进化的斜率高不高。

AI 时代，ChatGPT 就是典型。

一开始 ChatGPT 功能没那么强，很多人试完，觉得和 AI 瞎聊几句也没啥用，留存远没有现在好。
反倒是 C.ai 这样情感类的 AI 产品当时留存高，因为核心用户粘性强。

但你逐渐会发现，这类产品的用户群相对集中，大多数人没感觉。而 ChatGPT 的需求是更加普适的。哪怕一开始留存一般，但产品能力随着模型进步非常快，从 good to have 变成 must to have，走入了真实的高频场景。

所以比起留存，我现在更看重一个 AI 应用是否有吸引用户的亮点：

产品有没有在某个场景的吸引力，不靠补贴和投放，用户自己愿意来使用
产品是不是在快速变好，斜率是否够高。这可能就是技术革命早期和成熟期做增长最大的区别。

6 当 AI 可以替你干活

AI 可能会带来一种新的商业模式：虚拟雇佣。

6.1 你愿意在哪种程度上为它付费？

过去我们对工具付费，通常想的是它的价值加上你的时间成本。但雇一个人不一样，本质上是买他的时间。工具和员工的定价机制是两套逻辑。

只要 AI 真的帮我创造了价值，比如它帮我节省或赚到了 100 块钱，我付他 20 块，可能是个很自然的决定。这已经不再是按月订阅，而是更像「给 AI 发工资」。

这种正向循环不仅可以突破人类的注意力上限，也有机会突破传统订阅的价格上限。现在像 Cursor、一些 AI 工具已经开始按使用量计费，帮你做了多少任务，系统自动算账。

6.2 如果有 100 个 Agent 并行干活，你到底想让它们做什么

如果 AI 能直接帮你做事，想象空间就完全变了。有 10 个、100 个 agent 并行干活，真正的限制变成了：你到底想让它做什么？

6.3 模型吞噬应用 vs. 应用胜过模型

应用或者是「套壳」到底有没有长期价值？

观点一：模型越来越强大，会吞噬应用的价值。
观点二：模型越强大，应用就越能够通过专有的上下文和环境来创造增量价值。

头部模型公司竞争激烈， API 的差距在不断缩小。如果应用公司始终能使用接近 SOTA 水平的模型 API，那么加上好的产品设计、用户数据、使用习惯、品牌效应等，就可能做出更好的体验。

7 AI 应用的价值分层 7.1 模型能力

最底层是模型能力，这一层是相对通用和公开的，确实需要大模型公司通过开源模型或者闭源 API 的方式来提供。

7.2 上下文能力（public/organizational/personal）

中间层是模型权重中并不直接具备的上下文（context），这里又可以细分成三层：

公开的上下文（public context），如用于搜索的新闻报道等；
组织专有的上下文（organizational context），比如说组织内的文件，流程，数据等；
用户私人的上下文（personal context），如用户和 AI 的交互记录，个人信息和偏好等。

1 & 2 可以建构壁垒。

7.3 环境（environment）

环境层（environment），这里包括

模型可以调用的各种工具如 computer use，MCP，A2A 等协议，
模型可以改变迭代的 code base 等。

随着 AI 产品越来越完善，更多的价值创造会出现在上下文和环境这两层，这也就是 AI 应用的壁垒。

7.4 小结：思考 6-12 个月后 SOTA 模型的能力，做基于这个做准备

应用创业者真正该做的，是去思考 6-12 个月以后 SOTA 模型会有哪些能力，再基于这个做准备。

正如乔布斯引用一位传奇冰球教练的话：「我永远滑向冰球将要去的地方。」

8 第一次，我们都可以当（AI 的）老板了

能够自主完成任务的 Agent 的出现，意味着第一次我们每个人都可以当（AI 的）老板。

8.1 当好 AI 的老板不容易

要当一个好老板不容易，也需要很多学习。

8.2 组织的 scaling law

技术升级往往会带来组织的 scaling law。

一方面，新技术可以让更小的团队完成更多的工作，另一方面，新技术也可以让大公司管理更大更多的业务。
例如移动互联网革命中，既出现了 Instagram 这样被 10 亿美金收购时只有十来个人的 mini 公司，也出现了美团这样能够使用技术高效管理几百万骑手的超级公司。

AI 革命可能让组织的 scaling law 进一步发展。Sam Altman 预言我们很快就会看到一个人的独角兽公司。

9 结束语：水到沸点，蒸汽时代即将来临？

AI 的发展有点像烧开水，在水已热但还没烧开之前可能只能泡咖啡，但一旦到达 100 度的沸点，将会解锁蒸汽机，带来各行各业巨大的生产力变革。

[译] 关于 AI 下半场的思考：技术/模型篇（2025）

ARTHURCHIAO'S BLOG

5 months 4 weeks ago

本文翻译自 2025 年的一篇英文博客 The Second Half。拆分了一些章节并增加标题，方便个人学习理解。

文章几个核心点：

Agent + Reasoning + prior knowledge，使得强化学习终于能泛化，一套组合拳能完成所有场景的任务，因此专攻算法和模型变得没以前那么重要；

针对特定任务的新算法可能只能提高 5%，而得益于预训练、强化学习和良好的泛化能力，下一代推理模型可以在不明确针对这个任务的情况下直接提高 30%。
模型已经在大多数任务上超越人类选手，但还并未对真实世界产生太大影响（例如，经济、GDP）；
基于 1 & 2，认为 AI 发展进入中场时刻，需要做出方向性转变，
- 上半场：专注在算法和模型训练，但评估方式没有与现实世界对齐，因此对真实世界影响不够大；
- 下半场：应该从根本上重新考虑评估（evaluation）这个事情，让 AI 能更大程度影响真实世界，甚至通往 AGI。

水平及维护精力所限，译文不免存在错误或过时之处，如有疑问，请查阅原文。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

关于 AI 下半场的思考（一）：技术/模型篇（2025）
关于 AI 下半场的思考（二）：商业/应用篇（2025）

1 引言
- 1.2 最近几十年 AI 的发展方向
- 1.2 为什么说要进入下半场了？
2 上半场
3 下半场
原文致谢

1 引言 1.2 最近几十年 AI 的发展方向

最近几十年，人工智能领域主要致力于提出新的训练方法和模型（new training methods and models）。这个方向是成功的，例如 AI 已经能：

在国际象棋和围棋中击败人类世界冠军，
在 SAT 和律师资格考试中超越大多数人类应试者，
在国际数学奥林匹克竞赛（IMO）和国际信息学奥林匹克竞赛（IOI）中获得金牌。

教科书中的一系列里程碑模型（DeepBlue、AlphaGo、GPT-4、GPT-o 系列）背后，是人工智能方法的根本性创新：

搜索（search）
深度强化学习（deep RL）
扩展/规模（scaling）
推理（reasoning）

一切都在沿着这个方向不断进步。那么，现在为什么突然说要进入下半场了呢？

1.2 为什么说要进入下半场了？

用一句话来回答：强化学习终于奏效了（RL finally works）。

1.2.1 游戏终结者：强化学习（终于能泛化了！）

更准确地说：强化学习终于能够泛化了（RL finally generalizes）。

之前的一系列突破不断累积，使我们终于找到了一种统一的方式，只使用语言和推理（language and reasoning）就能完成各种领域的强化学习任务（a wide range of RL tasks）。
即便在仅仅一年前，如果你跟任何 AI 研究者说，有一种统一的方式可以解决 软件工程、创意写作、数学、AI 自动使用鼠标和键盘、长篇问答等领域的任务，肯定都会得到无情的嘲笑。这些任务每一个都极其困难，许多人在整个博士期间也只专注于其中的某个狭窄领域。然而，现在不一样了。

1.2.2 重点的转变：解决问题 -> 定义问题

人工智能的下半场，重点将从解决问题（solving problems）转移到定义问题（defining problems）。具体来说，

评估将比训练更重要（evaluation becomes more important than training）；
原来是思考 “我们能训练一个模型来解决某某问题吗”，现在更应该思考：“我们应该训练人工智能做什么？如何衡量我们的进展？”

1.2.3 思维方式和技术储备转变

要在下半场取得成功，需要及时转变思维方式和技术储备 —— 也许要更多地像产品经理那样思考。

2 上半场 2.1 训练方法和模型

要理解上半场，可以先看看它的赢家是谁。你认为到目前为止最有影响力的 AI 论文是什么？

我在斯坦福 224N 课程中做了调研，答案并不令人惊讶：Transformer、AlexNet、GPT-3 等等。

2.1.1 最有影响力的 AI 论文的共同点

这些论文有什么共同点？

首先，都提出了一些根本性的创新，能训练出更好的模型。

其次，还有一个不那么明显的共同点：这些“赢家”都是训练方法或模型（methods or models），而不是基准测试或任务（benchmarks or tasks）。

即使是最有影响力的基准测试 —— ImageNet —— 其引用量也不及 AlexNet 的三分之一。
在其他地方，方法与基准的对比甚至更为悬殊。例如，Transformer 的主要基准测试是 WMT’14，其引用量约为 1300，而 Transformer 的引用量则超过了 16w。

2.1.2 上半场的核心：构建新的模型和方法

这说明了上半场的游戏 专注于构建新的模型和方法，而评估和基准测试是次要的（尽管是论文系统正常运转所必要的）。

算法 vs. 任务：洞察力和工程能力

为什么呢？一个很大的原因是，在人工智能的上半场，方法/算法比任务更难、更令人兴奋。

从零开始设计一个新算法或模型架构 —— 例如反向传播算法、卷积网络（AlexNet）、GPT-3 中使用的 Transformer —— 需要非凡的洞察力和工程能力。
相比之下，为人工智能定义任务往往感觉更简单直接：我们只是把人类已经做的事情（比如翻译、图像识别或国际象棋）变成基准测试 —— 不需要太多洞察力甚至工程能力。

算法 vs. 任务：通用性和普适性

方法（methods）也往往比单个任务（task）更具通用性和普适性，这使得它们非常有价值。

一个伟大的新方法可以在许多不同的基准测试中不断改进提升，因为它简单且通用，因此其影响往往超出单个任务。

2.1.3 训练组合拳的质变时刻

这种方式已经持续了几十年，并激发了很多改变世界的思想和突破 —— 体现在各个领域不断提高的基准测试性能上。

训练组合拳包括什么呢？

massive language pre-training
scale (in data and compute)
reasoning and acting

这些术语大家应该已经司空见惯了。但为什么称它们为组合拳呢？可以通过强化学习（RL）来理解一下。

2.2 强化学习（RL）

在 RL 中，有三个关键组成部分：

算法
环境
先验知识

2.2.1 传统 RL：主要关注算法

2.2.2 深度 RL：环境因素非常重要，决定算法的效果

在深度 RL 时代，从经验上说，环境很重要：算法的性能往往与其开发和测试环境高度相关。

如果忽视环境，你可能构建出来的就是一个只在 toy 设置中表现出色的“最优”算法。

2.2.3 深度 RL：OpenAI 的工程经验

也就是说，我们需要先确定我们真正想要解决的环境，然后才能找到最适合它的算法。这正是 OpenAI 最初的计划。

OpenAI 先是构建了 gym，一个用于各种游戏的标准 RL 环境，
然后是 World of Bits 和 Universe 项目，试图将互联网或计算机变成一个游戏。

一旦我们将所有数字世界变成一个环境，就能用 RL 算法解决它 —— 最终我们就拥有了通用人工智能（AGI）。

直到 GPT-2 或 GPT-3 出现后，才发现缺失的部分是先验知识。

你需要强大的预训练，将一般常识和语言知识提炼到模型中，
然后可以微调以成为 web agent (WebGPT) 或 chat agent (ChatGPT) （进而改变真实世界）。

2.2.4 深度 RL：最重要的可能是先验知识（预训练到模型中）

事实证明，RL 最重要的部分可能不是 RL 算法或环境，而是先验知识，这些可以通过与 RL 完全无关的方式获得。

预训练只对聊天场景比较有效（先验知识）

预训练为聊天场景（chatting）创造了良好的先验知识，但并不同样适用于控制计算机或玩电子游戏。

为什么呢？因为这些领域与互联网文本的分布相距较远，而简单地在这些领域进行 SFT/RL 很难泛化。

2.3 顿悟时刻：模型需要像人类一样去【思考】

2.4 突破：AI 思考/推理

2.4.1 经典 RL：无法在开放、无限组合的推理空间做出决策

在经典 RL 理论中，这是一个糟糕的事情，因为它导致无法做出决策。想象一下，

如果你要在两个盒子中选择一个，其中一个盒子里有 100 万美元，另一个是空的。那你的期望收益将是 50 万美元。
如果在其中增加了无数个空盒子，你的期望收益将变为零。

2.4.2 经典 RL + Reasoning + 预训练模型（先验知识）：实现 RL 泛化

2.4.3 “选盒子游戏”的直观 vs. 抽象解释

我的抽象解释是：agents 中，语言通过推理实现泛化（language generalizes through reasoning in agents）。

2.5 RL 小结：先验知识 > 环境 > 算法

一旦我们有了正确的 RL 先验知识（语言预训练）和 RL 环境（将语言推理作为动作）， 事实证明 RL 算法可能就是最不重要的部分了。

因此，我们有了 GPT-o 系列、DeepSeek R1、深度研究、computer-use agent ，还会有更多出现。

但正如史蒂夫·乔布斯所说：You can’t connect the dots looking forward; you can only connect them looking backward。

这个发现正在彻底改变游戏规则。

3 下半场

回顾上半场的游戏：

开发新的训练方法或模型，以在基准测试中不断提升性能；
创建更难的基准测试；
转 1，继续这个循环。

这个游戏现在玩不下去了，因为：

这种基准测试本质已经很标准化和工业化，不需要什么新算法就能实现性能提升 —— 你针对特定任务的新方法可能只能提高 5%，而得益于预训练、强化学习和良好的泛化能力，下一个 o 系列模型可以在不明确针对它的情况下提高 30%。
即使创建更难的基准测试，很快（而且越来越快）它们也会被以上方式解决。我的同事 Jason Wei 制作了下图，很好地可视化了这一趋势：

那么，在下半场还剩下什么呢？如果不再需要新方法，而更难的基准测试很快就会被解决，我们该怎么办？

3.1 从根本上重新思考 evaluation

我认为，我们应该从根本上重新思考评估（evaluation）。

这意味着不仅要创建新的、更难的基准测试，
还要从根本上质疑现有的评估 setups 并创建新的 setups，迫使我们发明出更有效的评估新方法。

这很难，因为人类有惯性，很少质疑基本假设 —— 你把它们当作理所当然，而没有意识到它们是假设，而不是法则。

3.2 效用问题：AI 已经在大量场合超越人类，但并未对真实世界（e.g. GDP）产生太大影响

我称这为效用问题，并认为这是人工智能最重要的问题。

3.3 评估 setups 与现实世界 setups 不同

举两个例子。

3.3.1 例子一：评估“应该”自动运行

根据这个假设，通常 agent 接收任务输入，自主地做事情，然后接收任务奖励。

解决这类问题就需要提出一些新的基准测试，要么引入真人打分（例如 Chatbot Arena），要么引入用户模拟（例如 tau-bench）。

3.3.2 例子二：评估“应该”独立同分布（i.i.d.）

如果你有一个包含 500 个任务的测试集，你会独立运行每个任务，平均任务指标，然后得到一个总体指标。

但在现实中，你是顺序解决任务，而不是并行解决。

谷歌的软件工程师（SWE）随着对代码库的熟悉程度越来越高，解决 google 问题的能力也越来越强，
但 SWE agent 在同一个代码库中解决许多问题之后，却无法获得这种熟悉感。

3.4 下半场游戏规则

下半场的游戏方式：

开发针对现实世界效用的新评估 setups 或任务；
用现在的训练组合拳（或引入新组件增强）去训练模型，在 1 的任务上不断提升性能；
转 1，继续这个循环。

3.5 小结

下半场的游戏很难，因为大家对它还比较陌生，但它令人兴奋。

上半场的参与者解决了电子游戏和考试，下半场的参与者可以通过开发有用的 AI 产品，建立数十亿甚至万亿美元的公司。
上半场是渐进式的方法和模型，下半场则不一样了，通用训练组合拳能轻松击败渐进式方法，除非你能提出新的假设来打破组合拳，那你就是在做真正改变游戏规则的研究了。

欢迎来到下半场！

原文致谢

This blog post is based on my talk given at Stanford 224N and Columbia. I used OpenAI deep research to read my slides and write a draft.

[笔记] 关于 AI 下半场的思考：商业/应用篇（2025）

ARTHURCHIAO'S BLOG

5 months 4 weeks ago

近日，真格基金展开了一场关于 AI 创业的深度对谈，核心点：

真正的技术突破，不依赖营销也能实现自发传播。DeepSeek 是个例子。
AI 正在把我们带回那个凭产品力打动用户的时代。
新产品正在快速验证：只要创造了真实价值，就有机会跨越鸿沟（从少数走向大众）。

水平及维护精力所限，文中不免存在错误或过时之处，如有疑问，请查阅原文。 传播知识，尊重劳动，年满十八周岁，转载请注明出处。

关于 AI 下半场的思考（一）：技术/模型篇（2025）
关于 AI 下半场的思考（二）：商业/应用篇（2025）

1 看 AI 真正跨越鸿沟
2 对 AI 创业者的要求
3 AI 使得执行力不再稀缺，那以后工作的关键是什么
4 给用户创造了价值，就总有办法变现
- 4.1 用户爱用但不知道怎么赚钱
- 4.2 商业的本质就是你为用户创造更多价值，并从中提取利润
5 在行业早期，奢谈终局都没有意义，唯有下场开始执行
6 当 AI 可以替你干活
7 AI 应用的价值分层
8 第一次，我们都可以当（AI 的）老板了
- 8.1 当好 AI 的老板不容易
- 8.2 组织的 scaling law
9 结束语：水到沸点，蒸汽时代即将来临？

和历史上其他创新技术相比，生成式 AI 的应用落地速度非常快，今年我们已经看到随着模型能力的飞速进展，不少 AI 应用开始有实打实的收入。

1 看 AI 真正跨越鸿沟 1.2 早期 Google：技术极为先进，界面极其简单

99 年刚出来时的 Google：一个简单的输入框，用自然语言什么都可以问，问什么都有答案。

这是我对终极产品的向往：把极为先进的技术包装在超级简单的界面背后，像魔法一样让普通人具备非常强大的能力。

1.2 ChatGPT：AI 的 ‘Google’ 时刻

虽然早期的大模型还不够聪明，也有很多幻觉，但 AI 不再只是在科研界的热议话题，而是真正能用起来的产品。

在生成式 AI 到来之前，虽然 AlphaGo 已经击败李世石和柯洁，特斯拉也已推出 FSD，但 AI 离普通人的生活还比较远。
当时谈 AI，还更多是在讲科技研发和未来愿景，跟大众产品还很有距离。

当 22 年底上线的 ChatGPT，就像 99 年的 Google。它是一个真正的转折点，让 AI 变得人人可用，也真的好用。

1.3 ChatGPT：第一个跨越鸿沟的 AI 产品

认知技术创新的框架「跨越鸿沟」：创新技术怎么从早期市场进入主流市场。

ChatGPT 可能是第一个能真正跨越鸿沟的 AI 产品。

2 对 AI 创业者的要求 2.1 创业者分类

我们曾经把早期成功创业者分成四类：小天才、老司机、科学家、操盘手。

最近想，是不是还得区分「技术变革的早期」和「技术成熟期」，不同时期成功几率大的创业者画像和打法可能都不一样。

过去十年是移动互联网的成熟期，在下半场，容错率更低，经验和资源更重要，打过仗交过学费的连续创业者胜率更高。
现在的 AI，又回到了技术变革的早期。创业者需要对新技术很懂，对技术边际变化带来的机会很敏感，这就给年轻创业者带来了很多机会。

2.2 AI 创业者：既要懂前沿技术，又要有很强的产品执行力

AI 也要通过成熟的形态如 App 或网站去落地，因此对创业者提出了更高的要求：既要懂前沿技术，又要有很强的产品执行力。

2.3 成熟的方法论（e.g. 投放）未必在 AI 领域有效

与此同时，很多产业成熟期的方法论，比如 AB 测试、精细化投放等，在产业早期却未必最有效。

举个例子，AB 测试适合找到产品方案的细节差异，但技术早期往往是要在没有数据的情况下做选择，选对了就是 10 倍起步，选错就全盘皆输。

2.4 花一点小钱看未来，其实很值

第一批吃螃蟹的人往往会得到不菲的奖励。例如

当年第一批做互联网创业的人，很多是最早买电脑、最早上网的；
第一批做移动互联网的人，也常常是最早买 iPhone 的。

现在 AI 产品其实已经很便宜，一个月可能只要花 20 美金，也就一顿饭的价格，但能帮助你先看到未来，也先抓住机会。

3 AI 使得执行力不再稀缺，那以后工作的关键是什么

当执行力不再稀缺，我认为工作的关键变成：Agency & Taste。

3.1 你要做什么（主观能动性，Agency）

【注释】zh.wikipedia.org

3.2 你选择什么（品味，Taste）

AI 可以创造很多选项，但是选择最后还是人来做。也就是所谓的 Taste（品味）。

Midjourney 一次给你四张图，Vibe Coding 给你多个实现方案，你选哪个？
也许有一天 AI 的 taste 会比人更强，但现在，决定还得人来做。

3.3 小结：AI 时代人与人之间的关键分野

Agency（主观能动性）和 Taste（品味），是 AI 时代人与人之间的关键分野。

4 给用户创造了价值，就总有办法变现

技术扩散需要时间，所以才会有从创新者、早期采用者到大众市场的创新扩散曲线。现在，我们已经能直观地看到那道鸿沟的存在。

4.1 用户爱用但不知道怎么赚钱

新技术驱动的产品，早期常常是「用户爱用但不知道怎么赚钱」。

Google 刚出来时是个基于先进技术，非常好用但没盈利模式的产品。那时候华尔街有很多质疑，说它不做广告，还鼓励用户尽快离开网站，这怎么赚钱？

2002 年，Google 通过 AdWords 和 Adsense 找到了商业模式，现在搜索引擎广告是互联网行业最很赚钱的印钞机之一。

4.2 商业的本质就是你为用户创造更多价值，并从中提取利润

5 在行业早期，奢谈终局都没有意义，唯有下场开始执行

5.1 增长的关键不在投放，而是有没有「魔法体验」

投放是移动互联网后期的必修课，然而现在很多 AI 应用的成功，投放不是重点，甚至根本不需要投放。

关键是能不能让用户有魔法般的体验产生自然传播。当用户突然遇到一个体验好十倍的产品，这时候，口碑和自然增长的力量，远比投放更管用。

5.2 AI 把我们带回了那个靠产品力打动用户的时代

我很开心 AI 把我们带回了那个靠产品力打动用户的时代，需要产品经理用判断做选择，用体验打动人。

5.3 是否有场景能吸引到用户主动使用

不靠补贴和投放。

5.4 产品进化的斜率是重点

再说留存和新增的选择。做增长的人总说留存重要，但这有个隐含前提：产品够普世。

很多小众产品，比如豆瓣、即刻，用户留存都很好，还在用的人绝对是真爱，但是它不增长了。
技术革命早期，有明确的亮点，快速吸引用户才更重要。
在技术还不完善的时候，留存差一点也正常，技术本身还在演进。

回头看亚马逊刚起步的时候，能买的东西很少，体验也一般，但重点是产品进化的斜率高不高。

AI 时代，ChatGPT 就是典型。

一开始 ChatGPT 功能没那么强，很多人试完，觉得和 AI 瞎聊几句也没啥用，留存远没有现在好。
反倒是 C.ai 这样情感类的 AI 产品当时留存高，因为核心用户粘性强。

所以比起留存，我现在更看重一个 AI 应用是否有吸引用户的亮点：

产品有没有在某个场景的吸引力，不靠补贴和投放，用户自己愿意来使用
产品是不是在快速变好，斜率是否够高。这可能就是技术革命早期和成熟期做增长最大的区别。

6 当 AI 可以替你干活

AI 可能会带来一种新的商业模式：虚拟雇佣。

6.1 你愿意在哪种程度上为它付费？

过去我们对工具付费，通常想的是它的价值加上你的时间成本。但雇一个人不一样，本质上是买他的时间。工具和员工的定价机制是两套逻辑。

6.2 如果有 100 个 Agent 并行干活，你到底想让它们做什么

如果 AI 能直接帮你做事，想象空间就完全变了。有 10 个、100 个 agent 并行干活，真正的限制变成了：你到底想让它做什么？

6.3 模型吞噬应用 vs. 应用胜过模型

应用或者是「套壳」到底有没有长期价值？

观点一：模型越来越强大，会吞噬应用的价值。
观点二：模型越强大，应用就越能够通过专有的上下文和环境来创造增量价值。

头部模型公司竞争激烈， API 的差距在不断缩小。如果应用公司始终能使用接近 SOTA 水平的模型 API，那么加上好的产品设计、用户数据、使用习惯、品牌效应等，就可能做出更好的体验。

7 AI 应用的价值分层 7.1 模型能力

最底层是模型能力，这一层是相对通用和公开的，确实需要大模型公司通过开源模型或者闭源 API 的方式来提供。

7.2 上下文能力（public/organizational/personal）

中间层是模型权重中并不直接具备的上下文（context），这里又可以细分成三层：

公开的上下文（public context），如用于搜索的新闻报道等；
组织专有的上下文（organizational context），比如说组织内的文件，流程，数据等；
用户私人的上下文（personal context），如用户和 AI 的交互记录，个人信息和偏好等。

1 & 2 可以建构壁垒。

7.3 环境（environment）

环境层（environment），这里包括

模型可以调用的各种工具如 computer use，MCP，A2A 等协议，
模型可以改变迭代的 code base 等。

随着 AI 产品越来越完善，更多的价值创造会出现在上下文和环境这两层，这也就是 AI 应用的壁垒。

7.4 小结：思考 6-12 个月后 SOTA 模型的能力，做基于这个做准备

应用创业者真正该做的，是去思考 6-12 个月以后 SOTA 模型会有哪些能力，再基于这个做准备。

正如乔布斯引用一位传奇冰球教练的话：「我永远滑向冰球将要去的地方。」

8 第一次，我们都可以当（AI 的）老板了

能够自主完成任务的 Agent 的出现，意味着第一次我们每个人都可以当（AI 的）老板。

8.1 当好 AI 的老板不容易

要当一个好老板不容易，也需要很多学习。

8.2 组织的 scaling law

技术升级往往会带来组织的 scaling law。

一方面，新技术可以让更小的团队完成更多的工作，另一方面，新技术也可以让大公司管理更大更多的业务。
例如移动互联网革命中，既出现了 Instagram 这样被 10 亿美金收购时只有十来个人的 mini 公司，也出现了美团这样能够使用技术高效管理几百万骑手的超级公司。

AI 革命可能让组织的 scaling law 进一步发展。Sam Altman 预言我们很快就会看到一个人的独角兽公司。

9 结束语：水到沸点，蒸汽时代即将来临？

AI 的发展有点像烧开水，在水已热但还没烧开之前可能只能泡咖啡，但一旦到达 100 度的沸点，将会解锁蒸汽机，带来各行各业巨大的生产力变革。

Checked

5 hours 18 minutes ago

ArthurChiao's Blog

URL

https://arthurchiao.art/

ARTHURCHIAO'S BLOG feed

ARTHURCHIAO'S BLOG

Managed ad