DNF: Unconditional 4D Generation with Dictionary-based Neural Fields

1Technical University of Munich

2Tsinghua University

We propose DNF, a dictionary-based representation for the unconditional generation of 4D deforming shapes, with a transformer-based diffusion model. Our method is capable of generating motions with superior shape quality and temporal consistency.

Abstract

While remarkable success has been achieved through diffusion-based 3D generative models for shapes, 4D generative modeling remains challenging due to the complexity of object deformations over time. We propose DNF, a new 4D representation for unconditional generative modeling that efficiently models deformable shapes with disentangled shape and motion while capturing high-fidelity details in the deforming objects. To achieve this, we propose a dictionary learning approach to disentangle 4D motion from shape as neural fields. Both shape and motion are represented as learned latent spaces, where each deformable shape is represented by its global shape and motion latent codes, shape-specific coefficient vectors, and shared dictionary information. This captures both shape-specific detail and global shared information in the learned dictionary. Our dictionary-based representation balances fidelity, contiguity and compression well. Combined with a transformer-based diffusion model, our method is able to generate effective, high-fidelity 4D animations.

Video

Overview

We first pre-train disentangled shape and motion MLPs with per-instance latents. We then decompose the pre-trained MLPs using SVD and conduct dictionary-based fine-tuning of the singular values for each training instance, in order to capture local object detail more expressively. This yields, for each training instance, its latent shape and motion codes as well as its coefficient vectors, along with a globally shared dictionary. This effectively balances quality, contiguity and compression in the learned representation space.
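The decomposition described above can be illustrated for a single weight matrix. This is a minimal NumPy sketch, not the authors' implementation: the layer size, the number of instances, and the random perturbation that stands in for gradient-based fine-tuning are all assumptions made for illustration, and `layer_weight` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one pre-trained MLP layer weight (hypothetical size).
W_pretrained = rng.standard_normal((64, 32))

# Thin SVD: W = U @ diag(sigma0) @ Vt. U and Vt play the role of the
# shared dictionary; sigma0 holds the pre-trained singular values.
U, sigma0, Vt = np.linalg.svd(W_pretrained, full_matrices=False)

def layer_weight(sigma):
    """Rebuild a layer weight from the shared (U, Vt) and one coefficient vector."""
    return (U * sigma) @ Vt  # same as U @ np.diag(sigma) @ Vt

# Every instance starts from the pre-trained singular values ...
num_instances = 3
sigmas = np.tile(sigma0, (num_instances, 1))
# ... and would then be fine-tuned per instance. A small random
# perturbation stands in for those gradient updates here (assumption).
sigmas += 0.01 * rng.standard_normal(sigmas.shape)

weights = [layer_weight(s) for s in sigmas]

# Storage: the dictionary is paid for once, each instance adds only sigma.
shared_params = U.size + Vt.size
per_instance_params = sigma0.size
print(shared_params, per_instance_params)
```

In this sketch only the coefficient vector differs between instances, which is why the per-instance cost stays small while the shared factors carry the information common to all of them.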

Training and generation of our DNF for unconditional 4D synthesis. We employ transformer-based diffusion models to model the coefficient vectors \(\boldsymbol{\sigma}\) that modulate the shape and motion MLPs, along with the shape and motion codes. At inference time, newly generated samples are decoded into shape and motion to form a 4D deforming sequence.
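To make concrete what a diffusion model operates on here, the sketch below packs one latent code and a set of coefficient vectors \(\boldsymbol{\sigma}\) into a single sample and applies the standard DDPM forward (noising) process. Everything specific is an assumption for illustration: the dimensions, the linear noise schedule, the flat concatenation, and the choice of DDPM itself are not taken from the paper, and `q_sample` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: one latent code plus coefficient vectors for 4 MLP layers.
code_dim, num_layers, sigma_dim = 16, 4, 32
latent_code = rng.standard_normal(code_dim)
sigmas = rng.standard_normal((num_layers, sigma_dim))

# Clean sample x0: the latent code followed by the flattened coefficient vectors.
x0 = np.concatenate([latent_code, sigmas.ravel()])

# Linear beta schedule (an assumption) and cumulative signal-retention factors.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, noise):
    """Standard DDPM forward process: noise x0 to timestep t."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

noise = rng.standard_normal(x0.shape)
x_early = q_sample(x0, 0, noise)      # almost identical to x0
x_late = q_sample(x0, T - 1, noise)   # almost pure noise
```

Under these assumptions, a denoising network (a transformer in DNF) would be trained to undo this noising; at inference, a sample denoised from pure noise would be split back into a code and coefficient vectors before decoding.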

Unconditional Motion Generation

Generation for Unseen Species

More Results

BibTeX

@misc{zhang2024dnfunconditional4dgeneration,
      title={DNF: Unconditional 4D Generation with Dictionary-based Neural Fields}, 
      author={Xinyi Zhang and Naiqi Li and Angela Dai},
      year={2024},
      eprint={2412.05161},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.05161}, 
  }