Video2Tasks: Automatically Splitting Multi-Task Robot Videos into VLA Training Data

Wang YinXi · 2026-01-31

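🎬 An open-source tool for one of the most tedious data-preprocessing problems in VLA model training

Background: The Pain Point of VLA Training Data

If you work on Embodied AI, and especially if you train VLA (Vision-Language-Action) models such as:

π₀ (pi-zero) - Physical Intelligence's generalist robot policy
OpenVLA - an open-source vision-language-action model
ACT - Action Chunking with Transformers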

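you have almost certainly run into this problem:

Training needs "single-task video clips + natural-language instructions", but what you collected is "long videos containing several consecutive tasks".

For example, in a single teleoperation video the operator performs, back to back: pick up the cup → put down the cup → pick up the fork → place it on the plate → pick up the spoon → …

That means you have to:

Watch the video by hand and find the start and end frame of every task
Write an instruction for each clip by hand (e.g. "Pick up the cup")
Repeat a few hundred times…

The process is painful and error-prone.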

Video2Tasks: An Automated Solution

Video2Tasks is an open-source tool that uses a vision-language model (VLM) to automate this preprocessing pipeline:

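Input: a long video containing multiple tasks (unlabeled)
           ┃
           ▼
     ┌─────────────────────────────────────────┐
     │  Video2Tasks                            │
     │  • VLM detects task boundaries          │
     │  • Auto-generates language instructions │
     │  • Distributed runs over large datasets │
     └─────────────────────────────────────────┘
           ┃
           ▼
Output: single-task clips + instruction labels → ready for VLA training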

Example Output

A 4501-frame video is automatically split into 16 single-task segments:

{
  "video_id": "1765279974654",
  "nframes": 4501,
  "segments": [
    {"seg_id": 0,  "start_frame": 0,    "end_frame": 373,  "instruction": "Pick up and manipulate the tote bag"},
    {"seg_id": 1,  "start_frame": 373,  "end_frame": 542,  "instruction": "Retrieve and adjust the white face mask"},
    {"seg_id": 2,  "start_frame": 542,  "end_frame": 703,  "instruction": "Open and place items into the bag"},
    {"seg_id": 3,  "start_frame": 703,  "end_frame": 912,  "instruction": "Place the first black object into the tote bag"},
    {"seg_id": 4,  "start_frame": 912,  "end_frame": 1214, "instruction": "Place the second black object into the tote bag"},
    {"seg_id": 5,  "start_frame": 1214, "end_frame": 1375, "instruction": "Place the white cup on the table"},
    {"seg_id": 6,  "start_frame": 1375, "end_frame": 1524, "instruction": "Move the cup to the right table"},
    {"seg_id": 7,  "start_frame": 1524, "end_frame": 1784, "instruction": "Connect the power adapter to the cable"},
    {"seg_id": 8,  "start_frame": 1784, "end_frame": 2991, "instruction": "Plug the device into the power strip"},
    {"seg_id": 9,  "start_frame": 2991, "end_frame": 3135, "instruction": "Interact with black object on coffee table"},
    {"seg_id": 10, "start_frame": 3135, "end_frame": 3238, "instruction": "Adjust the ashtray"},
    {"seg_id": 11, "start_frame": 3238, "end_frame": 3359, "instruction": "Interact with the white mug"},
    {"seg_id": 12, "start_frame": 3359, "end_frame": 3478, "instruction": "Move the black rectangular object and cup"},
    {"seg_id": 13, "start_frame": 3478, "end_frame": 3711, "instruction": "Pick up the ashtray"},
    {"seg_id": 14, "start_frame": 3711, "end_frame": 4095, "instruction": "Move the white slippers from the shoe rack"},
    {"seg_id": 15, "start_frame": 4095, "end_frame": 4501, "instruction": "Raise the window blind"}
  ]
}

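Each segment comes with an exact frame range and an automatically generated English instruction, so it can be used directly for VLA model training.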
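Before feeding such a file into a training pipeline, it can be worth checking that the segments tile the video with no gaps or overlaps. The following is a minimal sketch, assuming the JSON layout shown above and reading the ranges as half-open `[start_frame, end_frame)` (an assumption inferred from adjacent segments sharing a boundary frame). `check_segments` and `toy_result` are hypothetical names for illustration, not part of Video2Tasks.

```python
def check_segments(result: dict) -> list[tuple[int, int, str]]:
    """Verify that segments cover [0, nframes) contiguously.

    Returns (start_frame, end_frame, instruction) triples in order.
    Assumes half-open ranges, which is an interpretation of the output format.
    """
    segments = sorted(result["segments"], key=lambda s: s["start_frame"])
    expected_start = 0
    triples = []
    for seg in segments:
        start, end = seg["start_frame"], seg["end_frame"]
        if start != expected_start:
            raise ValueError(f"gap or overlap before segment {seg['seg_id']}")
        if end <= start:
            raise ValueError(f"empty segment {seg['seg_id']}")
        triples.append((start, end, seg["instruction"]))
        expected_start = end
    if expected_start != result["nframes"]:
        raise ValueError("segments do not cover the whole video")
    return triples


# Toy data in the same shape as the output above (not real results).
toy_result = {
    "video_id": "toy",
    "nframes": 10,
    "segments": [
        {"seg_id": 0, "start_frame": 0, "end_frame": 4, "instruction": "Pick up the cup"},
        {"seg_id": 1, "start_frame": 4, "end_frame": 10, "instruction": "Put down the cup"},
    ],
}

clips = check_segments(toy_result)
```

Each returned triple can then be used to cut the clip and pair it with its instruction.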

Technical Architecture

Distributed Server-Worker Design

This is more than a simple script: it is a distributed system built for large-scale runs:

┌─────────────────┐         ┌─────────────────┐         ┌─────────────────┐
│                 │         │                 │         │                 │
│     Server      │────────▶│   Job Queue     │◀────────│     Worker      │
│    (FastAPI)    │         │                 │         │     (VLM)       │
│                 │         │                 │         │                 │
└────────┬────────┘         └─────────────────┘         └────────┬────────┘
         │                                                       │
         ▼                                                       ▼
┌─────────────────┐                                     ┌─────────────────┐
│   Video Files   │                                     │    VLM Model    │
└─────────────────┘                                     └─────────────────┘

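Server: reads videos, splits them into windows and samples frames, manages the job queue, and aggregates results
Worker: runs VLM inference, detects task-switch points, and generates instructions

You can run the Server on a single RTX 4090 machine and attach 10 more machines as Workers to process large volumes of data in parallel.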
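The Server's "split into windows and sample frames" step can be sketched roughly as follows. This is only an illustration under stated assumptions: the window length, stride, and number of sampled frames are hypothetical values, and `make_windows` and `sample_frames` are made-up helper names rather than functions from the Video2Tasks code base.

```python
def make_windows(nframes: int, window: int, stride: int) -> list[tuple[int, int]]:
    """Split [0, nframes) into overlapping half-open windows.

    With stride < window, consecutive windows overlap, so every cut point
    is seen by more than one window.
    """
    if nframes <= 0 or window <= 0 or stride <= 0:
        raise ValueError("nframes, window and stride must be positive")
    windows = []
    start = 0
    while True:
        end = min(start + window, nframes)
        windows.append((start, end))
        if end >= nframes:
            break
        start += stride
    return windows


def sample_frames(start: int, end: int, k: int) -> list[int]:
    """Pick up to k evenly spaced frame indices from [start, end)."""
    if k <= 0:
        raise ValueError("k must be positive")
    span = end - start
    if span <= 0:
        return []
    k = min(k, span)
    return [start + (i * span) // k for i in range(k)]


# Hypothetical settings: 4-frame windows with 50% overlap on a 10-frame video.
windows = make_windows(nframes=10, window=4, stride=2)
frames = sample_frames(0, 10, 5)
```

Each window, together with its sampled frames, would then become one job in the queue for a Worker to pick up.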

Per-Window VLM Inference

The VLM analyzes each video window and returns its detailed reasoning along with the result:

{
  "task_id": "LongData601-1189::1765279974654_w3",
  "window_id": 3,
  "vlm_json": {
    "thought": "Frames 0-2: The robot's left hand reaches for and grasps a small black object from the left table. The right hand holds a white tote bag. Frames 3-5: The left hand places the black object into the tote bag. Frames 6-7: The left hand releases the black object into the bag and then reaches back to pick up another small black object. This is a clear switch: the robot completes interaction with the first black object and starts interacting with a second, distinct black object. Frame 15: The robot reaches for the white kettle on the left table. This marks a new interaction with a different object (the kettle). Therefore, switches are detected at frame 6 and frame 15.",
    "transitions": [6, 15],
    "instructions": ["Place the first black object into the tote bag", "Place the second black object into the tote bag", "Pick up the white kettle"]
  }
}
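Because this record comes from a language model, it is sensible to validate it before aggregation. The sketch below assumes the field names shown above and assumes that a window with N transitions should carry N + 1 instructions (which matches the example: 2 transitions, 3 instructions). `parse_window_result` is a hypothetical helper, not the project's own parser.

```python
import json


def parse_window_result(raw: str) -> tuple[list[int], list[str]]:
    """Parse one Worker record and check its internal consistency."""
    payload = json.loads(raw)["vlm_json"]
    transitions = [int(t) for t in payload.get("transitions", [])]
    instructions = [str(s) for s in payload.get("instructions", [])]
    # Transition indices should be unique and strictly increasing.
    if transitions != sorted(set(transitions)):
        raise ValueError("transitions must be strictly increasing")
    # N cut points split a window into N + 1 pieces (assumed convention).
    if len(instructions) != len(transitions) + 1:
        raise ValueError("expected one instruction per piece of the window")
    return transitions, instructions


# The record above, with the long "thought" field omitted for brevity.
raw_record = json.dumps({
    "task_id": "LongData601-1189::1765279974654_w3",
    "window_id": 3,
    "vlm_json": {
        "transitions": [6, 15],
        "instructions": [
            "Place the first black object into the tote bag",
            "Place the second black object into the tote bag",
            "Pick up the white kettle",
        ],
    },
})

transitions, instructions = parse_window_result(raw_record)
```

Records that fail these checks could be retried or dropped rather than passed on to the voting stage.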

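Smart Segmentation Algorithm

The VLM outputs are not simply concatenated. build_segments_via_cuts uses:

Weighted voting: results from multiple overlapping windows are aggregated with weights
Hanning window: down-weights predictions near window edges, addressing the classic problem of unstable detections at window boundaries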
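The idea of Hann-weighted voting can be illustrated with a short sketch. This is not the actual `build_segments_via_cuts` implementation: the threshold `min_score`, the suppression radius `merge_radius`, and the function names are assumptions chosen for the example, and the clustering step is deliberately simplified.

```python
import math


def hann_weight(i: int, n: int) -> float:
    """Hann window value at position i of n samples (near 0 at edges, near 1 mid-window)."""
    if n <= 1:
        return 1.0
    return 0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1))


def vote_cuts(windows, min_score: float = 0.5, merge_radius: int = 2) -> list[int]:
    """Aggregate cut votes from overlapping windows.

    windows: iterable of (start, end, cuts), with cuts given as global frame
    indices inside [start, end). A vote counts more when it falls near the
    middle of its window and less when it falls near an edge.
    """
    score: dict[int, float] = {}
    for start, end, cuts in windows:
        n = end - start
        for c in cuts:
            if start <= c < end:
                score[c] = score.get(c, 0.0) + hann_weight(c - start, n)
    chosen: list[int] = []
    # Strongest votes first; suppress weaker candidates close to a chosen cut.
    for frame, s in sorted(score.items(), key=lambda kv: (-kv[1], kv[0])):
        if s < min_score:
            break
        if all(abs(frame - f) > merge_radius for f in chosen):
            chosen.append(frame)
    return sorted(chosen)


def cuts_to_segments(nframes: int, cuts: list[int]) -> list[tuple[int, int]]:
    """Turn sorted cut frames into half-open segments covering [0, nframes)."""
    bounds = [0] + [c for c in cuts if 0 < c < nframes] + [nframes]
    return list(zip(bounds[:-1], bounds[1:]))


# Toy votes: frame 98 sits at the edge of the first window and gets almost no
# weight, while the second window sees the same event mid-window at frame 99.
toy_windows = [(0, 100, [50, 98]), (50, 150, [99])]
cuts = vote_cuts(toy_windows)
segments = cuts_to_segments(150, cuts)
```

In the toy data the edge vote at frame 98 is effectively ignored, and the mid-window vote at frame 99 from the overlapping window decides the cut, which is the behavior the edge weighting is meant to produce.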

Purpose-Built Prompt Design

prompt_switch_detection explicitly distinguishes:

True Switch: switching to a new object (e.g. from the cup to the fork)
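False Switch: a different operation on the same object that does not start a new task. By contrast, picking up the cup and then putting it down are two tasks, not a False Switch.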

This design targets a common pain point of manipulation datasets and noticeably reduces both over-segmentation and missed cuts.

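Engineering for Fault Tolerance

Large-scale jobs need to run reliably:

⏱️ In-flight timeout with automatic re-dispatch
🔄 A cap on retries for failed jobs
📍 .DONE markers for resuming interrupted runs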
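The three mechanisms above can be sketched together in a few lines. This is an illustrative toy, not the Video2Tasks queue: the class and method names, the default timeout and retry values, and the choice of a `.DONE` file inside each output directory are all assumptions.

```python
import time
from collections import deque
from pathlib import Path


class JobQueue:
    """Toy in-memory queue with in-flight timeout and a retry cap."""

    def __init__(self, jobs, timeout_s=60.0, max_retries=3, clock=time.monotonic):
        self.pending = deque(jobs)
        self.inflight = {}                      # job -> dispatch time
        self.attempts = {job: 0 for job in jobs}
        self.failed = []
        self.done = []
        self.timeout_s = timeout_s
        self.max_retries = max_retries
        self.clock = clock

    def _requeue_expired(self):
        now = self.clock()
        for job, sent_at in list(self.inflight.items()):
            if now - sent_at > self.timeout_s:
                del self.inflight[job]
                if self.attempts[job] >= self.max_retries:
                    self.failed.append(job)     # give up after the retry cap
                else:
                    self.pending.append(job)    # re-dispatch later

    def next_job(self):
        """Hand out the next job, or None if nothing is currently available."""
        self._requeue_expired()
        if not self.pending:
            return None
        job = self.pending.popleft()
        self.attempts[job] += 1
        self.inflight[job] = self.clock()
        return job

    def complete(self, job):
        if self.inflight.pop(job, None) is not None:
            self.done.append(job)


def is_done(out_dir: Path) -> bool:
    """A finished video is recognised by a marker file (assumed layout)."""
    return (out_dir / ".DONE").exists()


def mark_done(out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / ".DONE").touch()
```

On restart, a server built this way would skip every video whose output directory already passes `is_done`, so an interrupted run resumes where it stopped.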

Quick Start

Installation

git clone https://github.com/ly-geming/video2tasks.git
cd video2tasks

pip install -e .

# If you use Qwen3-VL (requires a GPU)
pip install -e ".[qwen3vl]"

Configuration

cp config.example.yaml config.yaml
# Edit the config file

Run

# Terminal 1 - start the Server
v2t-server --config config.yaml

# Terminal 2 - start a Worker
v2t-worker --config config.yaml

You can start multiple Workers to process in parallel!

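Supported VLM Backends

Backend	Description
Dummy	Lightweight test backend; does not load a model
Qwen3-VL	Locally deployed Qwen3-VL-32B-Instruct
Remote API	Calls a remote VLM API
Custom	Implement the `VLMBackend` interface for your own backend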
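To give a feel for what a custom backend might look like, here is a purely hypothetical sketch. The real `VLMBackend` interface is defined in the repository and may differ; the protocol name `VLMBackendSketch`, the `infer` method, and its signature are assumptions made only for this example. The returned dictionary mirrors the `vlm_json` fields shown earlier.

```python
from typing import Protocol, Sequence


class VLMBackendSketch(Protocol):
    """Hypothetical stand-in for the project's VLMBackend interface."""

    def infer(self, frames: Sequence[bytes], prompt: str) -> dict:
        ...


class FixedCutBackend:
    """Toy backend that reports one switch in the middle of every window.

    Useful only for exercising the pipeline without loading a model.
    """

    def infer(self, frames: Sequence[bytes], prompt: str) -> dict:
        if len(frames) < 2:
            return {
                "thought": "Too few frames to detect a switch.",
                "transitions": [],
                "instructions": ["Unknown task"],
            }
        mid = len(frames) // 2
        return {
            "thought": f"Placeholder reasoning: assume a switch at frame {mid}.",
            "transitions": [mid],
            "instructions": ["First task", "Second task"],
        }


backend: VLMBackendSketch = FixedCutBackend()
result = backend.infer([b""] * 16, "detect task switches")
```

Check the repository for the actual method names and return types before writing a real backend.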

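Use Cases

🤖 Teleoperation data preprocessing
🎮 Annotating data from simulation environments
📹 Splitting videos recorded on real robots
🧪 Preparing data for Imitation Learning
🔬 Manipulation research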

Source Code

GitHub: https://github.com/ly-geming/video2tasks

Stars ⭐, forks, issues, and PRs are welcome!

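Summary

Video2Tasks starts from a real research pain point and addresses the most tedious step of VLA training-data preprocessing:

Before	After
Finding cut points by watching videos manually	VLM detects task boundaries automatically
Writing instructions by hand	Natural-language instructions generated automatically
200 videos = two weeks	200 videos = one night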