TaskSeg: Unsupervised Deep Instruction Tuning for Few-Shot Object Segmentation

In Submission

Olin College of Engineering · Yale University · Carnegie Mellon University

TaskSeg enables unsupervised novel object segmentation for downstream robotic manipulation policies.

Abstract

Instance segmentation is crucial for many manipulation policies that rely on object-centric reasoning (e.g., modeling an object’s pose or geometry). The appropriate segmentation varies depending on the task being performed; moving a clock, for example, would require a different segmentation compared to adjusting the hands of the clock.

Obtaining sufficient annotated data to train task-specific segmentation models is generally infeasible; off-the-shelf foundation models, on the other hand, can perform poorly on out-of-distribution object instances or task descriptions. To address this issue, we propose TaskSeg, a novel system for the unsupervised learning of task-specific segmentations from unlabeled video demonstrations. We leverage the insight that in demonstrations used to train manipulation policies, the object being manipulated typically undergoes the most motion.

For an arbitrary object in an arbitrary task, we can extract pseudo-ground-truth segmentations using optical flow and finetune a foundation model for few-shot object segmentation at policy deployment. We demonstrate our method’s ability to generate high-quality segmentations both on a suite of manipulation tasks in simulation and on human demonstrations collected in the real world.

Video

Motivation

Off-the-shelf foundation models can perform poorly on out-of-distribution object instances, and cannot reliably adapt to the semantic ambiguity of task-specific segmentation. Manually training task-specific segmentation models, on the other hand, is infeasible due to the lack of annotated data.

Method Overview: Task-Specific Finetuning

To address this issue, we propose TaskSeg, a novel method for the unsupervised extraction of pseudo-ground-truth segmentations from unlabeled video demonstrations. With TaskSeg, we can finetune SAM through Deep Instruction Tuning for few-shot object segmentation in arbitrary manipulation task settings, without the need for annotated data.
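As a rough illustration of this finetuning stage, the sketch below freezes the pretrained model and optimizes only a small set of learnable instruction tokens against flow-derived pseudo-labels. The `model` forward interface, the `dataset.sample()` helper, and all hyperparameters here are hypothetical placeholders, not the real SAM API; the paper's actual Deep Instruction Tuning procedure may differ.

```python
# Schematic sketch of instruction-token finetuning with a frozen backbone.
# `model(image, instruction_tokens=...)` and `dataset.sample()` are
# hypothetical placeholder interfaces, not the real SAM API.
import torch
import torch.nn.functional as F

def finetune_instruction_tokens(model, dataset, num_tokens=8, dim=256,
                                steps=1000, lr=1e-3):
    for p in model.parameters():          # freeze the pretrained weights
        p.requires_grad_(False)
    tokens = torch.nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
    opt = torch.optim.AdamW([tokens], lr=lr)
    for _ in range(steps):
        image, pseudo_mask = dataset.sample()   # flow-derived pseudo-GT mask
        pred = model(image, instruction_tokens=tokens)  # mask logits
        loss = F.binary_cross_entropy_with_logits(pred, pseudo_mask.float())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return tokens
```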

Method Overview: TaskSeg Data Generation

Our method, TaskSeg, uses optical flow to extract segmentations from video demonstrations. Since the object being manipulated typically undergoes the most motion in a demonstration, TaskSeg can reliably extract high-quality object masks in arbitrary manipulation settings. Moreover, these videos are often already available as a prerequisite for training manipulation policies.
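To make the flow-based extraction concrete, here is a minimal sketch of the core idea, assuming a demonstration is available as a sequence of grayscale frames. It uses OpenCV's Farneback optical flow; the threshold and largest-component heuristics are illustrative choices, not the paper's exact pipeline.

```python
# Minimal sketch: accumulate per-pixel flow magnitude over a demo, then
# threshold to isolate the most-moved object as a pseudo-ground-truth mask.
# `motion_thresh` and the flow parameters below are illustrative placeholders.
import cv2
import numpy as np

def pseudo_mask_from_flow(frames, motion_thresh=1.0):
    accum = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        accum += np.linalg.norm(flow, axis=2)   # per-pixel motion magnitude
    mask = (accum > motion_thresh * accum.mean()).astype(np.uint8)
    # Keep the largest connected component as the object mask.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n > 1:
        largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
        mask = (labels == largest).astype(np.uint8)
    return mask
```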

Real World Generalization

Using TaskSeg, we are able to perform few-shot object segmentation in the real world, including on out-of-distribution objects (e.g., a "trailer tie plate") and across novel, cluttered environments.