Computer Vision , AI

[One-page summary] Make-A-Video: Text-to-Video Generation without Text-Video Data (ICLR 2023) by Singer et al.



Elune001 2024. 1. 16. 01:02

● Summary: Text-to-video generation trained on paired text-image data and unlabeled videos, with no paired text-video data

 

● Approach highlight

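  • Text-to-Image Model: follows the DALL-E 2 design (a prior, a diffusion decoder, and super-resolution networks), trained on text-image pairs
  • Spatiotemporal layers: the U-Net-based diffusion decoder is extended with temporal convolution and attention layers, so it denoises a short sequence of frames (16 frames at 64×64 in the paper) from noise rather than a single image
  • Frame interpolation network: raises the frame rate by generating the frames that fall between the decoder's outputs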
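
The spatiotemporal layers can be illustrated with a toy sketch of a factorized ("pseudo-3D") convolution: a 2D spatial convolution applied to each frame, followed by a 1D convolution along the time axis. The paper initializes the temporal convolution as the identity, so the extended network starts out behaving like the pretrained image model. The code below is a pure-Python illustration under those assumptions, not the authors' implementation; the function names, the single-channel frames, and the zero padding are hypothetical choices made for brevity.

```python
def spatial_conv(frame, kernel):
    """2D cross-correlation (as in deep learning "conv" layers) on one frame.

    frame: H x W list of floats; kernel: odd-sized square list of floats.
    Zero padding keeps the output the same size as the input.
    """
    h, w = len(frame), len(frame[0])
    k = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-k, k + 1):
                for dj in range(-k, k + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di + k][dj + k] * frame[y][x]
            out[i][j] = acc
    return out


def temporal_conv(video, kernel):
    """1D cross-correlation along the frame axis, independently per pixel.

    video: T x H x W nested list; kernel: odd-length list of floats.
    """
    t, h, w = len(video), len(video[0]), len(video[0][0])
    k = len(kernel) // 2
    out = [[[0.0] * w for _ in range(h)] for _ in range(t)]
    for f in range(t):
        for df in range(-k, k + 1):
            src = f + df
            if 0 <= src < t:
                weight = kernel[df + k]
                for i in range(h):
                    for j in range(w):
                        out[f][i][j] += weight * video[src][i][j]
    return out


def pseudo_3d_conv(video, spatial_kernel, temporal_kernel):
    """Spatial conv on every frame, then temporal conv across frames."""
    per_frame = [spatial_conv(frame, spatial_kernel) for frame in video]
    return temporal_conv(per_frame, temporal_kernel)


# Identity temporal kernel: each output frame depends only on the same
# input frame, so the layer initially acts as a per-frame image layer.
IDENTITY_TEMPORAL = [0.0, 1.0, 0.0]
```

With `IDENTITY_TEMPORAL`, `pseudo_3d_conv` returns exactly the per-frame spatial result. Training on video would then move the temporal kernel away from the identity so that information is mixed across neighboring frames.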

 

● Main Results:

 

● Discussion

  • How does the spatiotemporal decoder generate temporally coherent frames?
  • How can the model learn text-action relationships that can only be inferred from video (e.g., a person waving their hand left to right versus right to left)?
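
On the first question, one concrete piece is the frame interpolation network from the approach: the decoder's frames are spread over a longer timeline, and the network fills in the empty slots. The sketch below only builds that masked input and counts the resulting frames (with a frame skip of 5, 16 frames become (16 - 1) × 5 + 1 = 76). It is a minimal illustration, not the authors' code; the function names and the mask convention (1 = given frame, 0 = frame to generate) are assumptions.

```python
def interpolated_length(num_frames, frame_skip):
    """Number of frames after interpolation with the given frame skip."""
    return (num_frames - 1) * frame_skip + 1


def build_masked_input(key_frames, frame_skip):
    """Place key frames every `frame_skip` steps and zero-fill the gaps.

    key_frames: list of H x W frames (nested lists of floats).
    Returns (frames, mask). mask[t] is 1 where frames[t] is a given key
    frame and 0 where the interpolation network would have to generate it
    (hypothetical convention chosen for this sketch).
    """
    total = interpolated_length(len(key_frames), frame_skip)
    h, w = len(key_frames[0]), len(key_frames[0][0])
    frames, mask = [], []
    for t in range(total):
        if t % frame_skip == 0:
            frames.append(key_frames[t // frame_skip])
            mask.append(1)
        else:
            # Placeholder frame of zeros for a slot to be generated.
            frames.append([[0.0] * w for _ in range(h)])
            mask.append(0)
    return frames, mask
```

For example, `build_masked_input(key_frames, 5)` on 16 key frames yields a 76-slot timeline in which every fifth slot is a given frame and the rest are zero placeholders.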