Computer Vision , AI


[One-page summary] MetaFormer Is Actually What You Need for Vision (CVPR 2022) by Yu et al.

● Summary: A transformer's performance comes mainly from its overall architecture, not from the attention module ● Approach highlight MetaFormer: the structure of the transformer plays a bigger role in performance than the type of token mixer PoolFormer: show that the MetaFormer structure has a greater impact on transformer performance by replacing the token mixer with a pooling layer to val..

Paper_review[short] 2024. 1. 16. 00:58
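The PoolFormer idea in the entry above can be illustrated with a minimal sketch. It assumes a 1D token sequence and average pooling over a window of 3; the names `pool_mixer` and `metaformer_block` are hypothetical, and normalization and the channel MLP of a full block are left out, so this is not the paper's implementation.

```python
def pool_mixer(tokens, window=3):
    """Token mixer with no learned parameters: average each token with its neighbours."""
    half = window // 2
    dim = len(tokens[0])
    mixed = []
    for i in range(len(tokens)):
        lo, hi = max(0, i - half), min(len(tokens), i + half + 1)
        neighbours = tokens[lo:hi]
        mixed.append([sum(t[d] for t in neighbours) / len(neighbours) for d in range(dim)])
    return mixed


def metaformer_block(tokens, token_mixer):
    """Residual token-mixing step x + mixer(x); norm and channel MLP are omitted here."""
    mixed = token_mixer(tokens)
    return [[x + m for x, m in zip(tok, mix)] for tok, mix in zip(tokens, mixed)]


if __name__ == "__main__":
    tokens = [[0.0], [3.0], [6.0]]
    print(metaformer_block(tokens, pool_mixer))
```

Because `metaformer_block` takes the mixer as an argument, swapping pooling for attention or any other mixer leaves the surrounding structure unchanged, which is the point the summary makes.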

[One-page summary] A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation (CVPR 2022) by Chen et al.

● Summary: Addresses the shortage of labeled sign language translation data with progressive pretraining ● Approach highlight Improve sign language translation performance with a pretrained language model S3D backbone-based visual encoder V-L mapper for end-to-end training: a simple fully connected MLP with 2 layers ● Main Results ● Discussion Can a simple 2-layer MLP (V-L Mapper) efficiently represent..

Paper_review[short] 2024. 1. 16. 00:52
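As a rough illustration of what a 2-layer fully connected V-L mapper does, the sketch below maps one visual feature vector into a language embedding space. The function names, the ReLU activation, and the toy weights are assumptions made for this example and are not taken from the paper.

```python
def linear(x, weight, bias):
    """Fully connected layer: y = W x + b for a single feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def vl_mapper(visual_feature, w1, b1, w2, b2):
    """Two fully connected layers; the ReLU in between is an assumed activation."""
    return linear(relu(linear(visual_feature, w1, b1)), w2, b2)


if __name__ == "__main__":
    # Toy weights: 2-dim visual feature -> 3-dim hidden -> 2-dim language embedding.
    w1, b1 = [[1, 0], [0, 1], [1, 1]], [0.0, 0.0, 0.5]
    w2, b2 = [[2, 0, 0], [0, 1, 1]], [0.0, 1.0]
    print(vl_mapper([1.0, -2.0], w1, b1, w2, b2))
```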

[One-page summary] Emergence of Maps in the Memories of Blind Navigation Agents (ICLR 2023) by Wijmans et al.

● Summary: Asks whether AI navigation agents build implicit (or ‘mental’) maps like animals do ● Approach highlight Blind vs. Clairvoyant Bug algorithm: effective navigation with only egomotion sensing Memory enables the blind agent’s effective performance ● Main Results ● Discussion Can it work in more complex environments? (Is it possible to understand moving objects?)

Paper_review[short] 2024. 1. 16. 00:48

[One-page summary] See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation (CoRL 2022) by Li et al.

● Summary: Multisensory robot learning with visual, acoustic, and tactile inputs ● Approach highlight Modality-temporal feature fusion with self-attention: 1. cross-modality attention 2. cross-time attention 3. cross-modality and cross-time attention ● Main Results ● Discussion The experimental environment is too restrictive (only works in easy and limited settings)

Paper_review[short] 2024. 1. 16. 00:44
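One way to read the three attention variants listed above is as different rules for which (modality, time step) tokens may attend to each other. The sketch below encodes that reading; the grouping rules and the name `attention_pairs` are assumptions for illustration and may differ from the paper's actual design.

```python
from itertools import product


def attention_pairs(modalities, num_steps, mode):
    """List the (query, key) token pairs allowed under an assumed grouping rule."""
    tokens = list(product(modalities, range(num_steps)))  # (modality, time step)

    def allowed(query, key):
        if mode == "cross_modality":  # mix modalities within one time step
            return query[1] == key[1]
        if mode == "cross_time":      # mix time steps within one modality
            return query[0] == key[0]
        if mode == "both":            # every token attends to every token
            return True
        raise ValueError(f"unknown mode: {mode}")

    return [(q, k) for q in tokens for k in tokens if allowed(q, k)]


if __name__ == "__main__":
    for mode in ("cross_modality", "cross_time", "both"):
        pairs = attention_pairs(["vision", "audio", "touch"], 2, mode)
        print(mode, len(pairs))
```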

[One-page summary] MimicPlay: Long-Horizon Imitation Learning by Watching Human Play by Wang et al.

● Summary: Plan from human play, control from teleoperated demonstrations ● Approach highlight High-level planner and low-level control policy: learn high-level plans from data of humans playing with their hands, and learn low-level control from a small number of teleoperated demonstrations ● Main Results ● Discussion Efficiency in sophisticated finger-driven tasks (can the high-level planner work?) What is..

Paper_review[short] 2024. 1. 16. 00:41

[One-page summary] Text-To-4D Dynamic Scene Generation by Singer et al.

● Summary: Zero-shot 4D generation (time + 3D) from a text prompt ● Approach highlight HexPlane: represents a 4D scene with six planes of feature vectors spanning all pairs of axes in {X,Y,Z,T}. Scene optimization: project the rendered result, denoise it conditioned on the text embedding, and use the resulting loss to train the scene representation ● Main Results ● Discussion Limitations of representing complex or ..

Paper_review[short] 2024. 1. 16. 00:29
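The "six planes" in the HexPlane description above follow from taking every unordered pair of the four axes. A small sketch, with the hypothetical helpers `plane_axes` and `project`, shows the pairs and the 2D coordinates a 4D point would use to look up each plane; feature interpolation and fusion are not modeled.

```python
from itertools import combinations

AXES = ("X", "Y", "Z", "T")


def plane_axes(axes=AXES):
    """All unordered axis pairs; four axes give C(4, 2) = 6 planes."""
    return list(combinations(axes, 2))


def project(point, planes):
    """Map a 4D point (dict keyed by axis name) to its 2D coordinates on each plane."""
    return {a + b: (point[a], point[b]) for a, b in planes}


if __name__ == "__main__":
    planes = plane_axes()
    print(planes)
    print(project({"X": 1, "Y": 2, "Z": 3, "T": 4}, planes))
```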

[One-page summary] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation by Wu et al.

● Summary: Text-to-video (T2V) generation model built on a text-to-image (T2I) diffusion model ● Approach highlight Spatio-temporal attention for efficiency: attend only to selected frames (the first frame and the previous frame) T2V generation by fine-tuning a T2I model: update only the attention blocks in the fine-tuning stage ● Main Results ● Discussion Lack of ability to represent multiple object interactions due to limitations o..

Paper_review[short] 2024. 1. 16. 00:27
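To make the efficiency argument above concrete, the sketch below lists which frames each frame attends to when keys come only from the first and the previous frame. The function names are hypothetical, and clamping the "previous" frame of frame 0 to frame 0 itself is an assumption of this example.

```python
def key_frame_indices(i):
    """Frames that frame i attends to: the first frame and the previous frame."""
    if i < 0:
        raise ValueError("frame index must be non-negative")
    return [0, max(i - 1, 0)]  # frame 0 has no previous frame, so it reuses itself


def attended_frame_pairs(num_frames):
    """All (query frame, key frame) pairs under the sparse scheme."""
    return [(i, k) for i in range(num_frames) for k in key_frame_indices(i)]


if __name__ == "__main__":
    n = 8
    print(len(attended_frame_pairs(n)), "sparse pairs vs", n * n, "for full attention")
```

The number of frame pairs grows linearly with the number of frames (2 per frame) instead of quadratically, which is where the saving comes from.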

[One-page summary] Zero-1-to-3: Zero-shot One Image to 3D Object by Liu et al.

● Summary: Diffusion model for NeRF ● Approach highlight Viewpoint-conditioned image translation model using a conditional latent diffusion model $\hat{X}_{R,T}=f(x,R,T)$ Score Jacobian Chaining (SJC) for 3D representation: randomly sample viewpoints, perform volumetric rendering, perturb the resulting images with Gaussian noise $\epsilon$, denoise them by applying the U-Net $\epsilon_{\theta}$ conditioned on..

Paper_review[short] 2024. 1. 16. 00:24

[One-page summary] Monocular Depth Estimation using Diffusion Models by Saxena et al.

● Summary: Monocular depth estimation using a diffusion model trained on noisy and incomplete depth maps ● Approach highlight Fill missing depth: for the diffusion process, fill missing indoor depth (windows, mirrors) by nearest-neighbor interpolation and fill missing outdoor depth (sky) with a maximum depth Step-Unrolled Denoising Diffusion ● Main Results ● Discussion To fill the outdoor missing depth m..

Paper_review[short] 2024. 1. 16. 00:15
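The depth-filling rule summarized above can be sketched on a tiny depth map where `None` marks a missing pixel. The function name, the brute-force nearest-pixel search, and the `max_depth=80.0` default are assumptions for this example, not values from the paper.

```python
def fill_missing_depth(depth, scene, max_depth=80.0):
    """Fill None pixels: nearest valid pixel for indoor scenes, max depth for outdoor."""
    valid = [(r, c) for r, row in enumerate(depth)
             for c, v in enumerate(row) if v is not None]
    filled = [row[:] for row in depth]  # copy so the input map is not modified
    for r, row in enumerate(depth):
        for c, value in enumerate(row):
            if value is not None:
                continue
            if scene == "outdoor":
                filled[r][c] = max_depth  # e.g. sky pixels
            elif scene == "indoor":
                if not valid:
                    raise ValueError("no valid depth to interpolate from")
                # Brute-force nearest neighbour by squared pixel distance.
                nr, nc = min(valid, key=lambda p: (p[0] - r) ** 2 + (p[1] - c) ** 2)
                filled[r][c] = depth[nr][nc]
            else:
                raise ValueError(f"unknown scene type: {scene}")
    return filled


if __name__ == "__main__":
    depth = [[1.0, None, None, 5.0]]
    print(fill_missing_depth(depth, "indoor"))
    print(fill_missing_depth(depth, "outdoor"))
```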

[One-page summary] NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion by Gu et al.

● Summary: Use a diffusion model to generate multi-view images for NeRF ● Approach highlight Training phase: train a diffusion model to generate images corresponding to multiple views of an object. Finetuning phase: the diffusion model learned in the training phase is used to train the NeRF by generating multiple views of the image ● Main Results ● Discussion In my opinion, ..

Paper_review[short] 2024. 1. 16. 00:10
