Claude 3.7 Sonnet 在万众期待中推出了,为什么期待,因为从 Claude 3.5 Sonnet 发布后,一直是AI Coding Agent 领域最好的模型,综合效果没有对手,后面陆续推出的 o1/o3/DeepSeek 都没能撼动,更让人期待 Claude 3.7 Sonnet 在 AI Coding 领域能不能有进一步提升。
Claude 3.7 放出来的 Benchmark 里,有两个是跟 AI Coding Agent 表现强相关的:
SWE-bench Verified: Claude 3.7 Sonnet scores 62.3% out of 500 problems using pass@1 with bash/editor tools plus a "thinking tool" for single-attempt patches — no additional test-time compute used. 70.3% score uses internal scoring and custom scaffold on a reduced subset of problems. Scaffold Deepseek R1 results use the "Agentless" framework. OpenAI results from o3-mini system card cover a different subset of problems with a custom compute.
Claude 没有具体说明怎么做检索代码的,官方blog附录里提到在运行环境上是用了极简的方案,只提供了命令和编辑文件的能力,看起来是只把代码仓库目录扔给 LLM,让它自行去做文件搜索。另外有提到 Deepseek 的跑分是基于 Agentless 框架跑的,Agentless 专门介绍说明了如何跑 SWE-Bench,以及具体是怎么做代码检索的。见 Agentless/README_swebench.md。
{
"annotator": 0,
"user_id": "omar_rossi_1241",
"instruction": "Your user id is omar_rossi_1241. For your upcoming trip from New York to Chicago, you want to change the passenger to yourself, upgrade it to economy class, and have 3 checked bags. You prefer gift card payment. Your birthday is in your user profile so you do not prefer to provide it. You are reactive to the agent and will not say anything that is not asked.",
"actions": [
{
"name": "update_reservation_flights",
"arguments": {
"reservation_id": "FQ8APE",
"cabin": "economy",
"flights": [
{
"flight_number": "HAT056",
"date": "2024-05-25",
},
{
"flight_number": "HAT138",
"date": "2024-05-25",
},
],
"payment_id": "gift_card_8190333",
},
},
{
"name": "update_reservation_passengers",
"arguments": {
"reservation_id": "FQ8APE",
"passengers": [
{
"first_name": "Omar",
"last_name": "Rossi",
"dob": "1970-06-06",
}
],
},
},
{
"name": "update_reservation_baggages",
"arguments": {
"reservation_id": "FQ8APE",
"total_baggages": 3,
"nonfree_baggages": 0,
"payment_id": "gift_card_8190333",
},
},
],
},
转化后这个case实际跟大模型交互的过程:
{"traj": [
{
"role": "system",
"content": "# Airline Agent Policy\n\nThe current time is 2024-05-15 15:00:00 EST.\n\nAs an airline agent, you can help users book, modify, or cancel flight reservations.\n\n- Before taking any actions that update the booking database (booking, modifying flights, editing baggage, upgrading cabin class, or updating passenger information), you must list the action details and obtain explicit user confirmation (yes) to proceed.\n\n- You should not provide any information, knowledge, or procedures not provided by the user or available tools, or give subjective recommendations or comments.\n\n- You should only make one tool call at a time, and if you make a tool call, you should not respond to the user simultaneously. If you respond to the user, you should not make a tool call at the same time.\n\n- You should deny user requests that are against this policy.\n\n- You should transfer the user to a human agent if and only if the request cannot be handled within the scope of your actions.\n\n## Domain Basic\n\n- Each user has a profile containing user id, email, addresses, date of birth, payment methods, reservation numbers, and membership tier.\n\n- Each reservation has an reservation id, user id, trip type (one way, round trip), flights, passengers, payment methods, created time, baggages, and travel insurance information.\n\n- Each flight has a flight number, an origin, destination, scheduled departure and arrival time (local time), and for each date:\n - If the status is "available", the flight has not taken off, available seats and prices are listed.\n - If the status is "delayed" or "on time", the flight has not taken off, cannot be booked.\n - If the status is "flying", the flight has taken off but not landed, cannot be booked.\n\n## Book flight\n\n- The agent must first obtain the user id, then ask for the trip type, origin, destination.\n\n- Passengers: Each reservation can have at most five passengers. The agent needs to collect the first name, last name, and date of birth for each passenger. All passengers must fly the same flights in the same cabin.\n\n- Payment: each reservation can use at most one travel certificate, at most one credit card, and at most three gift cards. The remaining amount of a travel certificate is not refundable. All payment methods must already be in user profile for safety reasons.\n\n- Checked bag allowance: If the booking user is a regular member, 0 free checked bag for each basic economy passenger, 1 free checked bag for each economy passenger, and 2 free checked bags for each business passenger. If the booking user is a silver member, 1 free checked bag for each basic economy passenger, 2 free checked bag for each economy passenger, and 3 free checked bags for each business passenger. If the booking user is a gold member, 2 free checked bag for each basic economy passenger, 3 free checked bag for each economy passenger, and 3 free checked bags for each business passenger. Each extra baggage is 50 dollars.\n\n- Travel insurance: the agent should ask if the user wants to buy the travel insurance, which is 30 dollars per passenger and enables full refund if the user needs to cancel the flight given health or weather reasons.\n\n## Modify flight\n\n- The agent must first obtain the user id and the reservation id.\n\n- Change flights: Basic economy flights cannot be modified. Other reservations can be modified without changing the origin, destination, and trip type. Some flight segments can be kept, but their prices will not be updated based on the current price. The API does not check these for the agent, so the agent must make sure the rules apply before calling the API!\n\n- Change cabin: all reservations, including basic economy, can change cabin without changing the flights. Cabin changes require the user to pay for the difference between their current cabin and the new cabin class. Cabin class must be the same across all the flights in the same reservation; changing cabin for just one flight segment is not possible.\n\n- Change baggage and insurance: The user can add but not remove checked bags. The user cannot add insurance after initial booking.\n\n- Change passengers: The user can modify passengers but cannot modify the number of passengers. This is something that even a human agent cannot assist with.\n\n- Payment: If the flights are changed, the user needs to provide one gift card or credit card for payment or refund method. The agent should ask for the payment or refund method instead.\n\n## Cancel flight\n\n- The agent must first obtain the user id, the reservation id, and the reason for cancellation (change of plan, airline cancelled flight, or other reasons)\n\n- All reservations can be cancelled within 24 hours of booking, or if the airline cancelled the flight. Otherwise, basic economy or economy flights can be cancelled only if travel insurance is bought and the condition is met, and business flights can always be cancelled. The rules are strict regardless of the membership status. The API does not check these for the agent, so the agent must make sure the rules apply before calling the API!\n\n- The agent can only cancel the whole trip that is not flown. If any of the segments are already used, the agent cannot help and transfer is needed.\n\n- The refund will go to original payment methods in 5 to 7 business days.\n\n## Refund\n\n- If the user is silver/gold member or has travel insurance or flies business, and complains about cancelled flights in a reservation, the agent can offer a certificate as a gesture after confirming the facts, with the amount being $100 times the number of passengers.\n\n- If the user is silver/gold member or has travel insurance or flies business, and complains about delayed flights in a reservation and wants to change or cancel the reservation, the agent can offer a certificate as a gesture after confirming the facts and changing or cancelling the reservation, with the amount being $50 times the number of passengers.\n\n- Do not proactively offer these unless the user complains about the situation and explicitly asks for some compensation. Do not compensate if the user is regular member and has no travel insurance and flies (basic) economy.\n"
},
{
"role": "user",
"content": "Hi! I'd like to make a few changes to my trip from New York to Chicago. Can you help me with that?"
},
{
"role": "assistant",
"content": "Of course! I can help you modify your trip. Could you please provide your user ID and the reservation ID for the trip you want to change?",
},
{
"role": "user",
"content": "Sure! My user ID is omar_rossi_1241. But I'm not sure about the reservation ID at the moment."
},
{
"role": "assistant",
"content": null,
"tool_calls": [
{
"function": {
"arguments": "{"user_id":"omar_rossi_1241"}",
"name": "get_user_details"
},
"id": "call_FXi5dyufwOlkHksVgNwVhhVB",
"type": "function"
}
]
},
{
"role": "assistant",
"content": "I found your reservations. Could you please confirm which reservation ID you'd like to modify for your trip from New York to Chicago? Here are your reservation IDs:\n\n- UM3OG5\n- 5RJ7UH\n- FQ8APE\n- QKRY03\n\nLet me know which one you'd like to modify!",
},
{
"role": "user",
"content": "I'm not entirely certain, but I believe it might be the first one, UM3OG5."
},
{
"role": "assistant",
"content": null,
"tool_calls": [
{
"function": {
"arguments": "{"reservation_id":"UM3OG5"}",
"name": "get_reservation_details"
},
"id": "call_FApEDaUHdL2hx8FNbu5UCMb8",
"type": "function"
}
]
},
{
"role": "assistant",
"content": "The reservation ID "UM3OG5" is for a round trip from Seattle (SEA) to Dallas/Fort Worth (DFW), not from New York to Chicago. \n\nCould you please check if any of the other reservation IDs match your trip from New York to Chicago? Here are the remaining reservation IDs:\n\n- 5RJ7UH\n- FQ8APE\n- QKRY03\n\nLet me know which one you'd like to modify!",
},
{
"role": "user",
"content": "In that case, I think it might be FQ8APE. Can you check that for me?"
},
{
"role": "assistant",
"content": null,
"tool_calls": [
{
"function": {
"arguments": "{"reservation_id":"FQ8APE"}",
"name": "get_reservation_details"
},
"id": "call_W507pQxieFVdAxvqONuYWvN2",
"type": "function"
}
]
},
{
"content": "The reservation ID "FQ8APE" is for a one-way trip from Newark (EWR) to Chicago (ORD) with a stop in Houston (IAH). Since this reservation is in basic economy, please note that the flights cannot be modified. However, I can assist with other changes like cabin upgrades or baggage additions.\n\nPlease let me know what specific changes you would like to make to this reservation!",
"role": "assistant"
},
{
"role": "user",
"content": "Great! I'd like to add three checked bags, change the passenger to myself, and upgrade to economy class, please."
},
{
"role": "assistant",
"content": "Here are the changes you want to make to your reservation:\n\n1. **Add three checked bags**: \n - Since you are a gold member and upgrading to economy class, you will have 3 free checked bags, so no additional charges for the bags.\n\n2. **Change the passenger to yourself**:\n - Current passenger: Ivan Garcia\n - New passenger: Omar Rossi (DOB: 1970-06-06)\n\n3. **Upgrade to economy class**:\n - Upgrade from basic economy to economy class.\n\nPlease confirm if you would like to proceed with these changes.",
},
{
"role": "user",
"content": "Yes, I'd like to proceed with those changes. And please use a gift card for any payment needed."
},
{
"role": "assistant",
"content": null,
"tool_calls": [
{
"function": {
"arguments": "{"reservation_id": "FQ8APE", "passengers": [{"first_name": "Omar", "last_name": "Rossi", "dob": "1970-06-06"}]}",
"name": "update_reservation_passengers"
},
"id": "call_jK7xz4ERk3csc9jBfroPobGs",
"type": "function"
}
]
},
{
"role": "assistant",
"content": null,
"tool_calls": [
{
"function": {
"arguments": "{"reservation_id":"FQ8APE","cabin":"economy","flights":[{"origin":"EWR","destination":"IAH","flight_number":"HAT056","date":"2024-05-25"},{"origin":"IAH","destination":"ORD","flight_number":"HAT138","date":"2024-05-25"}],"payment_id":"gift_card_8190333"}",
"name": "update_reservation_flights"
},
"id": "call_zeyT5c2EYzRvfY42X7YOKOng",
"type": "function"
}
]
},
{
"role": "assistant",
"content": null,
"tool_calls": [
{
"function": {
"arguments": "{"reservation_id":"FQ8APE","total_baggages":3,"nonfree_baggages":0,"payment_id":"gift_card_8190333"}",
"name": "update_reservation_baggages"
},
"id": "call_PA1XaKLPX8egjewaxIArCkRc",
"type": "function"
}
]
},
{
"role": "assistant",
"content": "Your reservation has been successfully updated with the following changes:\n\n1. **Passenger Change**: The passenger is now Omar Rossi.\n2. **Cabin Upgrade**: Upgraded to economy class.\n3. **Checked Baggage**: Added three checked bags, all free due to your gold membership.\n\nThe payment for the cabin upgrade was processed using your gift card. If you have any more questions or need further assistance, feel free to ask!",
},
{
"role": "user",
"content": "Thank you for helping with the changes! That's all for now. ###STOP###"
}
],
}
PPO 用 Reward Model 计算大模型完整输出的奖励分数(r),过程中会跟原始模型的输出做对比(KL),避免偏离太远。大模型每步 token 输出,都会经过 Value Model 判定输出对后续奖励的期望值(v),经过 GAE 函数计算每步动作的好坏程度,更新我们在训练的大模型 Policy Model 参数。
GRPO 去掉了 Value Model,对每个输入(q)输出多个结果o1 o2 o3 …,奖励模型判断这些结果的好坏,从中选出最好的,更新模型参数。对算法和公式更详细的介绍,推荐这两个讲解:(1),(2)
2. Aha moment:训练过程中模型学会了停下来重新评估前面的思考,而且使用拟人的口气进行了反思。强化学习过程中人类没有输入任何类似的反思引导,模型能自主进行这种反思,十分有趣。我理解为,在预训练的模型里,模型已经有这样的反思意识潜力,在训练的某次输出过程中偶然出现,而出现后的结果对复杂推理有帮助,强化学习的机制会持续鼓励这样的输出。
训练的结果,在推理相关的 benchmark上,基本达到 o1 的水平:
R1-Zero 有非常有趣的发现,即不做监督学习,仅对预训练模型进行强化学习这条路是 OK 的,最终的模型有媲美o1的推理能力,但当前这种方式训练出的 R1-Zero 实际对外使用会有些问题,主要是对人不友好,比如输出的内容中英文混杂,猜测是奖励模型并没有做这种人类阅读友好的奖励判定,模型没理由能做好这块。
R1
为了做出一个推理能力强,输出对人类友好,综合能力 OK 的模型,DeepSeek 另外训练了R1模型。
You are OpenHands agent, a helpful AI assistant that can interact with a computer to solve tasks.
<IMPORTANT>
//
* If user provides a path, you should NOT assume it's relative to the current working directory. Instead, you should explore the file system to find the file before working on it.
* When configuring git credentials, use "openhands" as the user.name and "openhands@all-hands.dev" as the user.email by default, unless explicitly instructed otherwise.
* The assistant MUST NOT include comments in the code unless they are necessary to describe non-obvious behavior.
RuntimeInfo(available_hosts={'http://localhost:54090': 54090, 'http://localhost:55602': 55602})
</IMPORTANT>
<RUNTIME_INFORMATION>
The user has access to the following hosts for accessing a web application,
each of which has a corresponding port:
* http://localhost:54090 (port 54090)
* http://localhost:55602 (port 55602)
When starting a web server, use the corresponding ports. You should also
set any options to allow iframes and CORS requests.
</RUNTIME_INFORMATION>
You have access to the following functions:
---- BEGIN FUNCTION #1: execute_bash ----
Description: Execute a bash command in the terminal.
* Long running commands: For commands that may run indefinitely, it should be run in the background and the output should be redirected to a file, e.g. command = `python3 app.py > server.log 2>&1 &`.
* Interactive: If a bash command returns exit code `-1`, this means the process is not yet finished. The assistant must then send a second call to terminal with an empty `command` (which will retrieve any additional logs), or it can send additional text (set `command` to the text) to STDIN of the running process, or it can send command like `C-c` (Ctrl+C) to interrupt the process.
Parameters:
(1) command (string, required): The bash command to execute. Can be empty string to view additional logs when previous exit code is `-1`. Can be `C-c` (Ctrl+C) to interrupt the currently running process.
---- END FUNCTION #1 ----
---- BEGIN FUNCTION #2: finish ----
Description: Finish the interaction when the task is complete OR if the assistant cannot proceed further with the task.
No parameters are required for this function.
---- END FUNCTION #2 ----
---- BEGIN FUNCTION #3: web_read ----
Description: Read (convert to markdown) content from a webpage. You should prefer using the `web_read` tool over the `browser` tool, but do use the `browser` tool if you need to interact with a webpage (e.g., click a button, fill out a form, etc.).
You may use the `web_read` tool to read content from a webpage, and even search the webpage content using a Google search query (e.g., url=`https://www.google.com/search?q=YOUR_QUERY`).
Parameters:
(1) url (string, required): The URL of the webpage to read. You can also use a Google search query here (e.g., `https://www.google.com/search?q=YOUR_QUERY`).
---- END FUNCTION #3 ----
---- BEGIN FUNCTION #4: browser ----
Description: Interact with the browser using Python code. Use it ONLY when you need to interact with a webpage.
See the description of "code" parameter for more details.
Multiple actions can be provided at once, but will be executed sequentially without any feedback from the page.
More than 2-3 actions usually leads to failure or unexpected behavior. Example:
fill('a12', 'example with "quotes"')
click('a51')
click('48', button='middle', modifiers=['Shift'])
Parameters:
(1) code (string, required): The Python code that interacts with the browser.
The following 15 functions are available. Nothing else is supported.
goto(url: str)
Description: Navigate to a url.
Examples:
goto('http://www.example.com')
go_back()
Description: Navigate to the previous page in history.
Examples:
go_back()
go_forward()
Description: Navigate to the next page in history.
Examples:
go_forward()
noop(wait_ms: float = 1000)
Description: Do nothing, and optionally wait for the given time (in milliseconds).
You can use this to get the current page content and/or wait for the page to load.
Examples:
noop()
noop(500)
scroll(delta_x: float, delta_y: float)
Description: Scroll horizontally and vertically. Amounts in pixels, positive for right or down scrolling, negative for left or up scrolling. Dispatches a wheel event.
Examples:
scroll(0, 200)
scroll(-50.2, -100.5)
fill(bid: str, value: str)
Description: Fill out a form field. It focuses the element and triggers an input event with the entered text. It works for <input>, <textarea> and [contenteditable] elements.
Examples:
fill('237', 'example value')
fill('45', 'multi-line
example')
fill('a12', 'example with "quotes"')
select_option(bid: str, options: str | list[str])
Description: Select one or multiple options in a <select> element. You can specify option value or label to select. Multiple options can be selected.
Examples:
select_option('a48', 'blue')
select_option('c48', ['red', 'green', 'blue'])
click(bid: str, button: Literal['left', 'middle', 'right'] = 'left', modifiers: list[typing.Literal['Alt', 'Control', 'ControlOrMeta', 'Meta', 'Shift']] = [])
Description: Click an element.
Examples:
click('a51')
click('b22', button='right')
click('48', button='middle', modifiers=['Shift'])
dblclick(bid: str, button: Literal['left', 'middle', 'right'] = 'left', modifiers: list[typing.Literal['Alt', 'Control', 'ControlOrMeta', 'Meta', 'Shift']] = [])
Description: Double click an element.
Examples:
dblclick('12')
dblclick('ca42', button='right')
dblclick('178', button='middle', modifiers=['Shift'])
hover(bid: str)
Description: Hover over an element.
Examples:
hover('b8')
press(bid: str, key_comb: str)
Description: Focus the matching element and press a combination of keys. It accepts the logical key names that are emitted in the keyboardEvent.key property of the keyboard events: Backquote, Minus, Equal, Backslash, Backspace, Tab, Delete, Escape, ArrowDown, End, Enter, Home, Insert, PageDown, PageUp, ArrowRight, ArrowUp, F1 - F12, Digit0 - Digit9, KeyA - KeyZ, etc. You can alternatively specify a single character you'd like to produce such as "a" or "#". Following modification shortcuts are also supported: Shift, Control, Alt, Meta, ShiftLeft, ControlOrMeta. ControlOrMeta resolves to Control on Windows and Linux and to Meta on macOS.
Examples:
press('88', 'Backspace')
press('a26', 'ControlOrMeta+a')
press('a61', 'Meta+Shift+t')
focus(bid: str)
Description: Focus the matching element.
Examples:
focus('b455')
clear(bid: str)
Description: Clear the input field.
Examples:
clear('996')
drag_and_drop(from_bid: str, to_bid: str)
Description: Perform a drag & drop. Hover the element that will be dragged. Press left mouse button. Move mouse to the element that will receive the drop. Release left mouse button.
Examples:
drag_and_drop('56', '498')
upload_file(bid: str, file: str | list[str])
Description: Click an element and wait for a "filechooser" event, then select one or multiple input files for upload. Relative file paths are resolved relative to the current working directory. An empty list clears the selected files.
Examples:
upload_file('572', '/home/user/my_receipt.pdf')
upload_file('63', ['/home/bob/Documents/image.jpg', '/home/bob/Documents/file.zip'])
---- END FUNCTION #4 ----
---- BEGIN FUNCTION #5: execute_ipython_cell ----
Description: Run a cell of Python code in an IPython environment.
* The assistant should define variables and import packages before using them.
* The variable defined in the IPython environment will not be available outside the IPython environment (e.g., in terminal).
Parameters:
(1) code (string, required): The Python code to execute. Supports magic commands like %pip.
---- END FUNCTION #5 ----
---- BEGIN FUNCTION #6: str_replace_editor ----
Description: Custom editing tool for viewing, creating and editing files
* State is persistent across command calls and discussions with the user
* If `path` is a file, `view` displays the result of applying `cat -n`. If `path` is a directory, `view` lists non-hidden files and directories up to 2 levels deep
* The `create` command cannot be used if the specified `path` already exists as a file
* If a `command` generates a long output, it will be truncated and marked with `<response clipped>`
* The `undo_edit` command will revert the last edit made to the file at `path`
Notes for using the `str_replace` command:
* The `old_str` parameter should match EXACTLY one or more consecutive lines from the original file. Be mindful of whitespaces!
* If the `old_str` parameter is not unique in the file, the replacement will not be performed. Make sure to include enough context in `old_str` to make it unique
* The `new_str` parameter should contain the edited lines that should replace the `old_str`
Parameters:
(1) command (string, required): The commands to run. Allowed options are: `view`, `create`, `str_replace`, `insert`, `undo_edit`.
Allowed values: [`view`, `create`, `str_replace`, `insert`, `undo_edit`]
(2) path (string, required): Absolute path to file or directory, e.g. `/workspace/file.py` or `/workspace`.
(3) file_text (string, optional): Required parameter of `create` command, with the content of the file to be created.
(4) old_str (string, optional): Required parameter of `str_replace` command containing the string in `path` to replace.
(5) new_str (string, optional): Optional parameter of `str_replace` command containing the new string (if not given, no string will be added). Required parameter of `insert` command containing the string to insert.
(6) insert_line (integer, optional): Required parameter of `insert` command. The `new_str` will be inserted AFTER the line `insert_line` of `path`.
(7) view_range (array, optional): Optional parameter of `view` command when `path` points to a file. If none is given, the full file is shown. If provided, the file will be shown in the indicated line number range, e.g. [11, 12] will show lines 11 and 12. Indexing at 1 to start. Setting `[start_line, -1]` shows all lines from `start_line` to the end of the file.
---- END FUNCTION #6 ----
If you choose to call a function ONLY reply in the following format with NO suffix:
<function=example_function_name>
<parameter=example_parameter_1>value_1</parameter>
<parameter=example_parameter_2>
This is the value for the second parameter
that can span
multiple lines
</parameter>
</function>
<IMPORTANT>
Reminder:
- Function calls MUST follow the specified format, start with <function= and end with </function>
- Required parameters MUST be specified
- Only call one function at a time
- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after.
- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls
----------
Here's a running example of how to perform a task with the provided tools.
--------------------- START OF EXAMPLE ---------------------
USER: Create a list of numbers from 1 to 10, and display them in a web page at port 5000.
ASSISTANT: Sure! Let me first check the current directory:
<function=execute_bash>
<parameter=command>
pwd && ls
</parameter>
</function>
USER: EXECUTION RESULT of [execute_bash]:
/workspace
openhands@runtime:~/workspace$
ASSISTANT: There is no `app.py` file in the current directory. Let me create a Python file `app.py`:
<function=str_replace_editor>
<parameter=command>create</parameter>
<parameter=path>/workspace/app.py</parameter>
<parameter=file_text>
from flask import Flask
app = Flask(__name__)
@app.route('/')
def index():
numbers = list(range(1, 11))
return str(numbers)
if __name__ == '__main__':
app.run(port=5000)
</parameter>
</function>
USER: EXECUTION RESULT of [str_replace_editor]:
File created successfully at: /workspace/app.py
ASSISTANT: I have created a Python file `app.py` that will display a list of numbers from 1 to 10 when you run it. Let me run the Python file for you:
<function=execute_bash>
<parameter=command>
python3 app.py > server.log 2>&1 &
</parameter>
</function>
USER: EXECUTION RESULT of [execute_bash]:
[1] 121
[1]+ Exit 1 python3 app.py > server.log 2>&1
ASSISTANT: Looks like the server is running with PID 121 then crashed. Let me check the server log:
<function=execute_bash>
<parameter=command>
cat server.log
</parameter>
</function>
USER: EXECUTION RESULT of [execute_bash]:
Traceback (most recent call last):
File "/workspace/app.py", line 2, in <module>
from flask import Flask
ModuleNotFoundError: No module named 'flask'
ASSISTANT: Looks like the server crashed because the `flask` module is not installed. Let me install the `flask` module for you:
<function=execute_bash>
<parameter=command>
pip3 install flask
</parameter>
</function>
USER: EXECUTION RESULT of [execute_bash]:
Defaulting to user installation because normal site-packages is not writeable
Collecting flask
Using cached flask-3.0.3-py3-none-any.whl (101 kB)
Collecting blinker>=1.6.2
Using cached blinker-1.7.0-py3-none-any.whl (13 kB)
Collecting Werkzeug>=3.0.0
Using cached werkzeug-3.0.2-py3-none-any.whl (226 kB)
Collecting click>=8.1.3
Using cached click-8.1.7-py3-none-any.whl (97 kB)
Collecting itsdangerous>=2.1.2
Using cached itsdangerous-2.2.0-py3-none-any.whl (16 kB)
Requirement already satisfied: Jinja2>=3.1.2 in /home/openhands/.local/lib/python3.10/site-packages (from flask) (3.1.3)
Requirement already satisfied: MarkupSafe>=2.0 in /home/openhands/.local/lib/python3.10/site-packages (from Jinja2>=3.1.2->flask) (2.1.5)
Installing collected packages: Werkzeug, itsdangerous, click, blinker, flask
Successfully installed Werkzeug-3.0.2 blinker-1.7.0 click-8.1.7 flask-3.0.3 itsdangerous-2.2.0
ASSISTANT: The `flask` module is installed successfully. Let me run the Python file again:
<function=execute_bash>
<parameter=command>
python3 app.py > server.log 2>&1 &
</parameter>
</function>
USER: EXECUTION RESULT of [execute_bash]:
[1] 124
ASSISTANT:
Let me check the server log again:
<function=execute_bash>
<parameter=command>
cat server.log
</parameter>
</function>
USER: EXECUTION RESULT of [execute_bash]:
* Serving Flask app 'app'
* Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on http://127.0.0.1:5000
Press CTRL+C to quit
ASSISTANT:
The server is running on port 5000 with PID 124. You can access the list of numbers by visiting http://127.0.0.1:5000. If you have any further questions, feel free to ask!
USER: Now kill the server, make it display the numbers in a table format.
ASSISTANT:
Sure! Let me stop the server first:
<function=execute_bash>
<parameter=command>
kill 124
</parameter>
</function>
USER: EXECUTION RESULT of [execute_bash]:
[1]+ Terminated python3 app.py > server.log 2>&1
ASSISTANT:
Now let me display the numbers in a table format:
<function=str_replace_editor>
<parameter=command>str_replace</parameter>
<parameter=path>/workspace/app.py</parameter>
<parameter=old_str>return str(numbers)</parameter>
<parameter=new_str>return '<table>' + ''.join([f'<tr><td>{i}</td></tr>' for i in numbers]) + '</table>'</parameter>
</function>
USER: EXECUTION RESULT of [str_replace_editor]:
The file /workspace/app.py has been edited. Here's the result of running `cat -n` on a snippet of /workspace/app.py:
3
4 @app.route('/')
5 def index():
6 numbers = list(range(1, 11))
7 return '<table>' + ''.join([f'<tr><td>{i}</td></tr>' for i in numbers]) + '</table>'
8
9 if __name__ == '__main__':
10 app.run(port=5000)
Review the changes and make sure they are as expected. Edit the file again if necessary.
ASSISTANT:
Running the updated file:
<function=execute_bash>
<parameter=command>
<parameter=command>
python3 app.py > server.log 2>&1 &
</parameter>
</function>
USER: EXECUTION RESULT of [execute_bash]:
[1] 126
ASSISTANT:
The server is running on port 5000 with PID 126. You can access the list of numbers in a table format by visiting http://127.0.0.1:5000. Let me know if you have any further requests!
<function=finish>
</function>
--------------------- END OF EXAMPLE ---------------------
Do NOT assume the environment is the same as in the example above.
附:任务 Prompt示例
以下是真实执行一个任务时,循环添加输入 LLM 的 Prompt case:
--------------------- NEW TASK DESCRIPTION ---------------------
创建一个可以调用deepseek模型的程序,并运行测试确保可用
--------------------- END OF NEW TASK DESCRIPTION ---------------------
PLEASE follow the format strictly! PLEASE EMIT ONE AND ONLY ONE FUNCTION CALL PER MESSAGE.
----------
Let me start by checking the current directory and listing its contents:
<function=execute_bash>
<parameter=command>pwd && ls</parameter>
</function>
----------
EXECUTION RESULT of [execute_bash]:
/workspace
[Python Interpreter: /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/bin/python]
[Command finished with exit code 0]
----------
I see we are in the `/workspace` directory. Let's create a Python file that can call the DeepSeek model. First, I'll create a new file called `deepseek_app.py`:
<function=str_replace_editor>
<parameter=command>create</parameter>
<parameter=path>/workspace/deepseek_app.py</parameter>
<parameter=file_text>
from deepseek import DeepSeekModel
def call_deepseek_model(input_text):
model = DeepSeekModel()
response = model.generate(input_text)
return response
if __name__ == "__main__":
test_input = "Hello, DeepSeek!"
result = call_deepseek_model(test_input)
print("Model Response:", result)
</parameter>
</function>
----------
EXECUTION RESULT of [str_replace_editor]:
File created successfully at: /workspace/deepseek_app.py
----------
Now, let's check if the `deepseek` module is installed. If not, we'll need to install it:
<function=execute_bash>
<parameter=command>pip3 show deepseek</parameter>
</function>
----------
EXECUTION RESULT of [execute_bash]:
WARNING: Package(s) not found: deepseek
[Python Interpreter: /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/bin/python]
[Command finished with exit code 1]
----------
The `deepseek` module is not installed. Let me install it using pip:
<function=execute_bash>
<parameter=command>pip3 install deepseek</parameter>
</function>
----------
EXECUTION RESULT of [execute_bash]:
Collecting deepseek
Downloading deepseek-1.0.0-py3-none-any.whl.metadata (3.9 kB)
Requirement already satisfied: requests in /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/lib/python3.12/site-packages (from deepseek) (2.32.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/lib/python3.12/site-packages (from requests->deepseek) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/lib/python3.12/site-packages (from requests->deepseek) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/lib/python3.12/site-packages (from requests->deepseek) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/lib/python3.12/site-packages (from requests->deepseek) (2024.12.14)
Downloading deepseek-1.0.0-py3-none-any.whl (4.5 kB)
Installing collected packages: deepseek
Successfully installed deepseek-1.0.0
[Python Interpreter: /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/bin/python]
[Command finished with exit code 0]
----------
The `deepseek` module is now installed. Let's run the `deepseek_app.py` program to test if it works:
<function=execute_bash>
<parameter=command>python3 deepseek_app.py</parameter>
</function>
----------
EXECUTION RESULT of [execute_bash]:
Traceback (most recent call last):
File "/workspace/deepseek_app.py", line 1, in <module>
from deepseek import DeepSeekModel
ImportError: cannot import name 'DeepSeekModel' from 'deepseek' (/openhands/poetry/openhands-ai-5O4_aCHf-py3.12/lib/python3.12/site-packages/deepseek/__init__.py)
[Python Interpreter: /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/bin/python]
[Command finished with exit code 1]
----------
It seems the `DeepSeekModel` class is not available in the `deepseek` module. Let me check the documentation or available classes in the `deepseek` module:
<function=execute_bash>
<parameter=command>python3 -c "import deepseek; print(dir(deepseek))"</parameter>
</function>
----------
EXECUTION RESULT of [execute_bash]:
['DeepSeekAPI', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'api', 'const']
[Python Interpreter: /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/bin/python]
[Command finished with exit code 0]
----------
The `deepseek` module provides a `DeepSeekAPI` class instead of `DeepSeekModel`. Let me update the `deepseek_app.py` file to use the correct class:
<function=str_replace_editor>
<parameter=command>str_replace</parameter>
<parameter=path>/workspace/deepseek_app.py</parameter>
<parameter=old_str>
from deepseek import DeepSeekModel
def call_deepseek_model(input_text):
model = DeepSeekModel()
response = model.generate(input_text)
return response
</parameter>
<parameter=new_str>
from deepseek import DeepSeekAPI
def call_deepseek_model(input_text):
api = DeepSeekAPI()
response = api.generate(input_text)
return response
</parameter>
</function>
----------
EXECUTION RESULT of [str_replace_editor]:
The file /workspace/deepseek_app.py has been edited. Here's the result of running `cat -n` on a snippet of /workspace/deepseek_app.py:
1 from deepseek import DeepSeekAPI
2
3 def call_deepseek_model(input_text):
4 api = DeepSeekAPI()
5 response = api.generate(input_text)
6 return response
7
8 if __name__ == "__main__":
9 test_input = "Hello, DeepSeek!"
10 result = call_deepseek_model(test_input)
Review the changes and make sure they are as expected. Edit the file again if necessary.
----------
Now that the file is updated, let's run the program again to test if it works:
<function=execute_bash>
<parameter=command>python3 deepseek_app.py</parameter>
</function>
----------
EXECUTION RESULT of [execute_bash]:
Traceback (most recent call last):
File "/workspace/deepseek_app.py", line 10, in <module>
result = call_deepseek_model(test_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/deepseek_app.py", line 4, in call_deepseek_model
api = DeepSeekAPI()
^^^^^^^^^^^^^
File "/openhands/poetry/openhands-ai-5O4_aCHf-py3.12/lib/python3.12/site-packages/deepseek/api.py", line 9, in __init__
raise ValueError("DEEPSEEK_API_KEY is missing")
ValueError: DEEPSEEK_API_KEY is missing
[Python Interpreter: /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/bin/python]
[Command finished with exit code 1]
----------
The program requires an API key to use the `DeepSeekAPI`. Let me update the program to include the API key as an environment variable. First, I'll modify the `deepseek_app.py` file:
<function=str_replace_editor>
<parameter=command>str_replace</parameter>
<parameter=path>/workspace/deepseek_app.py</parameter>
<parameter=old_str>
from deepseek import DeepSeekAPI
def call_deepseek_model(input_text):
api = DeepSeekAPI()
response = api.generate(input_text)
return response
</parameter>
<parameter=new_str>
import os
from deepseek import DeepSeekAPI
def call_deepseek_model(input_text):
api_key = os.getenv("DEEPSEEK_API_KEY")
if not api_key:
raise ValueError("Please set the DEEPSEEK_API_KEY environment variable.")
api = DeepSeekAPI(api_key=api_key)
response = api.generate(input_text)
return response
</parameter>
</function>
----------
EXECUTION RESULT of [str_replace_editor]:
The file /workspace/deepseek_app.py has been edited. Here's the result of running `cat -n` on a snippet of /workspace/deepseek_app.py:
1 import os
2 from deepseek import DeepSeekAPI
3
4 def call_deepseek_model(input_text):
5 api_key = os.getenv("DEEPSEEK_API_KEY")
6 if not api_key:
7 raise ValueError("Please set the DEEPSEEK_API_KEY environment variable.")
8 api = DeepSeekAPI(api_key=api_key)
9 response = api.generate(input_text)
10 return response
11
12 if __name__ == "__main__":
13 test_input = "Hello, DeepSeek!"
14 result = call_deepseek_model(test_input)
Review the changes and make sure they are as expected. Edit the file again if necessary.
----------
Now, let's set the `DEEPSEEK_API_KEY` environment variable and run the program again. Please provide the API key for DeepSeek, and I will set it up for you. If you don't have one, you may need to sign up for an API key from DeepSeek's official website.
<function=execute_bash>
<parameter=command>export DEEPSEEK_API_KEY=your_api_key_here && python3 deepseek_app.py</parameter>
</function>
----------
EXECUTION RESULT of [execute_bash]:
Traceback (most recent call last):
File "/workspace/deepseek_app.py", line 14, in <module>
result = call_deepseek_model(test_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/deepseek_app.py", line 9, in call_deepseek_model
response = api.generate(input_text)
^^^^^^^^^^^^
AttributeError: 'DeepSeekAPI' object has no attribute 'generate'
[Python Interpreter: /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/bin/python]
[Command finished with exit code 1]
----------
It seems the `DeepSeekAPI` class does not have a `generate` method. Let me check the available methods in the `DeepSeekAPI` class:
<function=execute_bash>
<parameter=command>python3 -c "from deepseek import DeepSeekAPI; print(dir(DeepSeekAPI))"</parameter>
</function>
----------
EXECUTION RESULT of [execute_bash]:
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_post_request', 'chat_completion', 'completion_impl', 'fim_completion', 'get_models', 'user_balance']
[Python Interpreter: /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/bin/python]
[Command finished with exit code 0]
----------
The `DeepSeekAPI` class provides methods like `chat_completion`, `fim_completion`, and `get_models`. Let's update the `deepseek_app.py` file to use the `chat_completion` method instead of `generate`:
<function=str_replace_editor>
<parameter=command>str_replace</parameter>
<parameter=path>/workspace/deepseek_app.py</parameter>
<parameter=old_str>
api = DeepSeekAPI(api_key=api_key)
response = api.generate(input_text)
return response
</parameter>
<parameter=new_str>
api = DeepSeekAPI(api_key=api_key)
response = api.chat_completion(input_text)
return response
</parameter>
</function>
----------
EXECUTION RESULT of [str_replace_editor]:
The file /workspace/deepseek_app.py has been edited. Here's the result of running `cat -n` on a snippet of /workspace/deepseek_app.py:
4 def call_deepseek_model(input_text):
5 api_key = os.getenv("DEEPSEEK_API_KEY")
6 if not api_key:
7 raise ValueError("Please set the DEEPSEEK_API_KEY environment variable.")
8 api = DeepSeekAPI(api_key=api_key)
9 response = api.chat_completion(input_text)
10 return response
11
12 if __name__ == "__main__":
13 test_input = "Hello, DeepSeek!"
14 result = call_deepseek_model(test_input)
Review the changes and make sure they are as expected. Edit the file again if necessary.
----------
Now, let's run the program again with the updated method. Please provide the `DEEPSEEK_API_KEY` environment variable:
<function=execute_bash>
<parameter=command>export DEEPSEEK_API_KEY=your_api_key_here && python3 deepseek_app.py</parameter>
</function>
----------
EXECUTION RESULT of [execute_bash]:
Traceback (most recent call last):
File "/workspace/deepseek_app.py", line 14, in <module>
result = call_deepseek_model(test_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/deepseek_app.py", line 9, in call_deepseek_model
response = api.chat_completion(input_text)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/openhands/poetry/openhands-ai-5O4_aCHf-py3.12/lib/python3.12/site-packages/deepseek/api.py", line 62, in chat_completion
response = self._post_request(API_CHAT_COM, payload, stream)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/openhands/poetry/openhands-ai-5O4_aCHf-py3.12/lib/python3.12/site-packages/deepseek/api.py", line 25, in _post_request
raise Exception(f"HTTP Error {response.status_code}: {response.text}")
Exception: HTTP Error 401: {"error":{"message":"Authentication Fails (no such user)","type":"authentication_error","param":null,"code":"invalid_request_error"}}
[Python Interpreter: /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/bin/python]
[Command finished with exit code 1]
----------
The error indicates that the provided API key is invalid or does not correspond to a valid user. To proceed, you will need to:
1. Obtain a valid API key from DeepSeek's official platform.
2. Ensure the API key is correctly set in the environment.
Once you have a valid API key, you can set it and run the program again. Let me know if you need further assistance!
----------
key: *** ///在这里输入了key,隐藏
----------
Thank you for providing the API key. Let's set the environment variable and run the program again:
<function=execute_bash>
<parameter=command>export DEEPSEEK_API_KEY=sk-4a727097b1ef4e61a3444ec96e1e0db0 && python3 deepseek_app.py</parameter>
</function>
----------
EXECUTION RESULT of [execute_bash]:
Model Response: Hello! How can I assist you today? 😊
[Python Interpreter: /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/bin/python]
[Command finished with exit code 0]
今年玩过最好的游戏当然是《黑神话悟空》,远超预期,独有的中国文化之美,光看画面就是很大的享受,尤其是小西天,剧情音乐也都很顶,黄风岭陕西说唱、《壁上观》都很美,能玩到这样的游戏真好。不足的是难度没有循序渐进,开头的白衣居士卡得我差点放弃,后面反而障碍不多,但偶尔被 boss 卡还是挺难受。另外玩游戏过程中几乎感受不到剧情,这么顶的剧情全靠外部解说,也不知是不是个传播策略。
LLM在学习了大量文字对应图片的构图后,可以从语义推理这些文字的构图应该是怎样的,同时 LLM 自身也能很好理解哪些词是关键字,哪些词应该在同一行。比如上图的 旷野之息邮票 a stamp of Breath of the Wild,LLM 可以学到图上的文本应该是 Breath of the Wild,而对于邮票比较好的布局是上下两行,有个关键字 Wild 突出,得出相应的布局数据。
近期苹果发布的新品,无论是 iPhone 还是 Mac,都一改之前挤牙膏的风格,在最低配机器上都加大了内存,目的很明确,就是支撑 iPhone 和 Mac 上的端 AI 大模型。过去一年,AI手机、AI电脑的概念也一度在炒,在之前写的文章也说过,在客户端上跑大模型,一定是未来趋势。那目前端上大模型情况怎样?
有些场景可以不受设备大小限制、甚至续航功率限制,可以做得比较大,车机系统是一种,这是最好最大的应用场景,端上大模型 AI 应用会最先产生在这个领域,FSD也可以认为是端 AI 的一种。
还有一些可能得 AI 教育硬件,陪伴的玩偶等,本身也足够塞个大运算量芯片和大电池。一些刚需的硬件,比如导盲眼镜,也可以是连着口袋里一个不小的计算设备,这些算是后续可能的端上大模型的应用场景。
但除了车载系统以外,其他 AI 硬件要采用这种方式,发展会比较难。技术体验是一回事,还有商业模式的问题。
这些设备是自带硬件端上跑,还是云端跑,其实就是买断制和订阅制的区别。在端上跑需要用户一次性付出较高的硬件成本,但后续没有其他额外的成本。云端跑初期用户付出的硬件成本低,甚至厂家也愿意赔钱卖机器,但后期是可以用订阅服务制长期收费。从这角度看,用户和商家基本都会选择订阅制,对双方都更友好。所以端大模型要在 AI 硬件上流行起来,还比较难,除非是有些场景对隐私和实时性要求就是很高。
手机电脑
另一种就是利用已有设备,不需要用户额外花钱买硬件,那就还是回到设备大小、续航功耗、发热、机型覆盖等限制,有些场景为了省成本可以先用起来,PC / Mac 陆续可以有一些应用场景,例如上面提到的连接 ollama 的 MemoAI,浏览器上的 AI 搜索也非常适合端上 LLM 去做,但可能这几年会一直处于小场景尝试的阶段,要到主流的程度还早得很,也可能一直不会是主流,手机更是了。
漫画/绘本故事:核心是模型角色保持的能力。儿童绘本故事门槛很低,网上也有大量应用的教程,大众对质量的要求也没那么高,这是 AI 图生成目前擅长的。漫画门槛高一些,核心是故事、分镜的质量,生图所占的比例其实不高,所以如果用 AI 大规模生产,质量堪忧,但也有一些精品,比如这个。针对漫画有一些独立的产品和模型,例如dashtoon、Comic Factory、comicsmaker、llamagen等。
AI写真:人像 P 图是刚需,AI写真算是这个刚需的分支,火过很多产品,国内的妙鸭,海外Remini,还有一大波专门做这块的垂类产品 PhotoAI、星绘等。妙鸭虽然火一波以后销声匿迹,但这个需求是长期可持续的,photoAI 是独立开发者的产品,月流水已经到17万美元。主要用于各社交软件头像、linkedin商务照等。
新的艺术形式:摄影单独是一种艺术形式,相信 AI 图片生成也会带来独有的新的艺术形式,只是目前还未成型,摄影从诞生到成为一种艺术形式,也花了60年。跟画画与摄影不同的是,AI 图片生成创作,是有双向交互的,它不是定死的画笔或相机,创作过程中,AI 创作出来的内容会牵引下一步创作动作,不是一步到位,也不是忠实呈现自己脑里所想、呈现现实世界已有的东西,AI 不仅是工具,作品是人与 AI 的共创,有可能是新的艺术形式。
但跟摄影不同的是,图片生成技术,也许无法像拍照一样普及率那么高,摄像头记录美好生活是高频刚需,但创作不是,纯 AI 创作最终还是属于少部分创作者,就像能称为摄影师的只是少部分人。AI 技术进步是赋予了不会画画但有创意的一波人更强的能力,就像抖音最终赋予的也是少部分创作者展示他们才华的能力一样。
创作无法普及到大众,但创作出来的内容是能普及的,内容消费是大众刚需,至于这波创作者能否创作出跟摄像头相媲美的另一个维度的内容,支撑起一个 AI 内容消费社区,有待探索。
SVD(Stable Video Diffusion),23年12月发布。模型结构复用 Video LDM,主要是在 U-Net 和 VAE 解码器中分别加入时序层(temporal attention layer),SVD 论文本身在讲模型怎么训练的,包括高质量视频的微调。
PixelDance(字节),23年11月发布。特点是首尾帧机制,首帧图作为强引导,与噪点图拼接一起作为输入,严格遵守首帧图,同时尾帧图作为弱引导,训练中会随机抛弃尾帧,推理降噪过程中在步数大于τ值时也会抛弃尾帧,避免完全对齐,让生成的结果有多样性。在 DiT 架构的模型出现之前,效果基本是最好的,生成的视频运动幅度大,稳定性不错。未开源。
上面大部分模型,包括可控性的扩展模型,核心底层都是基于经典的 UNet 架构,但 Sora 出来后,业界公认 DiT 架构才是未来,毕竟效果太碾压了,最近可灵 / Luma 的出现也印证了这点。架构范式转移到 DiT 后,原先在 UNet 上做的各种可控雕花,看起来基本上是没法迁移到 DiT 架构的,一切得重来。
DiT 架构开源的只见到去年11月 sora 出来之前的 Latte,研究性比较多,效果一般。其他靠谱的开源模型还没见到,毕竟 Sora 还没见影,可灵和 luma 也刚出。(DiT架构的图片生成就有一些,比如腾讯混元)
DiT 架构后,视频生成和视频编辑这些模型大概率要淘汰,而人像视频可能在较长一段时间内仍有应用空间,如果要做 AI 视频短片,人物表情动作精细控制挺重要,DiT 架构目前还没看到有能做到精细控制的技术,基于 Unet 的通用视频生成模型这么长时间也没法做好这块的可控性,可能一段时间内还得靠原有技术做这里的可控后编辑。
Transformer 作为新 AI 时代的基石,有必要深入了解下。网上对 Transformer 的教学文章/视频非常多,很多讲得很好,像 3Blue1Brown 的讲解视频,以及这篇文章。整个详细过程原理写不来,本文主要记录一下其中我觉得比较容易混淆的 Attention 模块运算过程,主要是里面的 Q K V 的概念/运算过程/作用。