TaskSeg: Unsupervised Deep Instruction Tuning for Few-Shot Object Segmentation

In Submission

Olin College of Engineering · Yale University · Carnegie Mellon University

TaskSeg enables unsupervised novel object segmentation for downstream robotic manipulation policies.

Abstract

Instance segmentation is crucial for many manipulation policies that rely on object-centric reasoning (e.g., modeling an object’s pose or geometry). The appropriate segmentation varies depending on the task being performed; moving a clock, for example, would require a different segmentation compared to adjusting the hands of the clock.

Obtaining sufficient annotated data to train task-specific segmentation models is generally infeasible; off-the-shelf foundation models, on the other hand, can perform poorly on out-of-distribution object instances or task descriptions. To address this issue, we propose TaskSeg, a novel system for the unsupervised learning of task-specific segmentations from unlabeled video demonstrations. We leverage the insight that in demonstrations used to train manipulation policies, the object being manipulated typically undergoes the most motion.

For an arbitrary object in an arbitrary task, we can extract pseudo-ground-truth segmentations using optical flow and finetune a foundation model for few-shot object segmentation at policy deployment. We demonstrate our method’s ability to generate high-quality segmentations both on a suite of manipulation tasks in simulation and on human demonstrations collected in the real world.

Video

Motivation

Off-the-shelf foundation models can perform poorly on out-of-distribution object instances, and cannot reliably adapt to the semantic ambiguity of task-specific segmentation. Manually training task-specific segmentation models, on the other hand, is infeasible due to the lack of annotated data.

Method Overview: Task-Specific Finetuning

To address this issue, we propose TaskSeg, a novel method for the unsupervised extraction of pseudo-ground-truth segmentations from unlabeled video demonstrations. With TaskSeg, we can finetune SAM through Deep Instruction Tuning for few-shot object segmentation in arbitrary manipulation task settings, without the need for annotated data.
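As a rough illustration of this finetuning stage, the sketch below freezes the pretrained model and optimizes only a small set of learnable instruction tokens against flow-derived pseudo-labels. The `model` forward interface, the `dataset.sample()` helper, and all hyperparameters here are hypothetical placeholders, not the real SAM API; the paper's actual Deep Instruction Tuning procedure may differ.

```python
# Schematic sketch of instruction-token finetuning with a frozen backbone.
# `model(image, instruction_tokens=...)` and `dataset.sample()` are
# hypothetical placeholder interfaces, not the real SAM API.
import torch
import torch.nn.functional as F

def finetune_instruction_tokens(model, dataset, num_tokens=8, dim=256,
                                steps=1000, lr=1e-3):
    for p in model.parameters():          # freeze the pretrained weights
        p.requires_grad_(False)
    tokens = torch.nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
    opt = torch.optim.AdamW([tokens], lr=lr)
    for _ in range(steps):
        image, pseudo_mask = dataset.sample()   # flow-derived pseudo-GT mask
        pred = model(image, instruction_tokens=tokens)  # mask logits
        loss = F.binary_cross_entropy_with_logits(pred, pseudo_mask.float())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return tokens
```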

Method Overview: TaskSeg Data Generation

Our method, TaskSeg, uses optical flow to extract segmentations from video demonstrations. Since the object being manipulated typically undergoes the most motion in a demonstration, TaskSeg can reliably extract high-quality object masks in arbitrary manipulation settings. Moreover, these videos are often already available as a prerequisite for training manipulation policies.
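To make the flow-based extraction concrete, here is a minimal sketch of the core idea, assuming a demonstration is available as a sequence of grayscale frames. It uses OpenCV's Farneback optical flow; the threshold and largest-component heuristics are illustrative choices, not the paper's exact pipeline.

```python
# Minimal sketch: accumulate per-pixel flow magnitude over a demo, then
# threshold to isolate the most-moved object as a pseudo-ground-truth mask.
# `motion_thresh` and the flow parameters below are illustrative placeholders.
import cv2
import numpy as np

def pseudo_mask_from_flow(frames, motion_thresh=1.0):
    accum = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        accum += np.linalg.norm(flow, axis=2)   # per-pixel motion magnitude
    mask = (accum > motion_thresh * accum.mean()).astype(np.uint8)
    # Keep the largest connected component as the object mask.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n > 1:
        largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
        mask = (labels == largest).astype(np.uint8)
    return mask
```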

Real World Generalization

Using TaskSeg, we are able to perform few-shot object segmentation in the real world, including on out-of-distribution objects (e.g., a "trailer tie plate") and across novel, cluttered environments.