Stale Diffusion、Drag Your Noise、PhysReaction、CityGaussian

2024-04-06 23:52

本文主要是介绍Stale Diffusion、Drag Your Noise、PhysReaction、CityGaussian,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

本文首发于公众号:机器感知

Stale Diffusion、Drag Your Noise、PhysReaction、CityGaussian

Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic  Propagation

图片

Point-based interactive editing serves as an essential tool to complement the controllability of existing generative models. A concurrent work, DragDiffusion, updates the diffusion latent map in response to user inputs, causing global latent map alterations. This results in imprecise preservation of the original content and unsuccessful editing due to gradient vanishing. In contrast, we present DragNoise, offering robust and accelerated editing without retracing the latent map. The core rationale of DragNoise lies in utilizing the predicted noise output of each U-Net as a semantic editor. This approach is grounded in two critical observations: firstly, the bottleneck features of U-Net inherently possess semantically rich features ideal for interactive editing; secondly, high-level semantics, established early in the denoising process, show minimal variation in subsequent stages. Leveraging these insights, DragNoise edits diffusion semantics in a single denoising step and effi......

Stale Diffusion: Hyper-realistic 5D Movie Generation Using Old-school  Methods

图片

Two years ago, Stable Diffusion achieved super-human performance at generating images with super-human numbers of fingers. Following the steady decline of its technical novelty, we propose Stale Diffusion, a method that solidifies and ossifies Stable Diffusion in a maximum-entropy state. Stable Diffusion works analogously to a barn (the Stable) from which an infinite set of horses have escaped (the Diffusion). As the horses have long left the barn, our proposal may be seen as antiquated and irrelevant. Nevertheless, we vigorously defend our claim of novelty by identifying as early adopters of the Slow Science Movement, which will produce extremely important pearls of wisdom in the future. Our speed of contributions can also be seen as a quasi-static implementation of the recent call to pause AI experiments, which we wholeheartedly support. As a result of a careful archaeological expedition to 18-months-old Git commit histories, we found that naturally-accumulating errors have......

PhysReaction: Physically Plausible Real-Time Humanoid Reaction Synthesis  via Forward Dynamics Guided 4D Imitation

图片

Humanoid Reaction Synthesis is pivotal for creating highly interactive and empathetic robots that can seamlessly integrate into human environments, enhancing the way we live, work, and communicate. However, it is difficult to learn the diverse interaction patterns of multiple humans and generate physically plausible reactions. The kinematics-based approaches face challenges, including issues like floating feet, sliding, penetration, and other problems that defy physical plausibility. The existing physics-based method often relies on kinematics-based methods to generate reference states, which struggle with the challenges posed by kinematic noise during action execution. Constrained by their reliance on diffusion models, these methods are unable to achieve real-time inference. In this work, we propose a Forward Dynamics Guided 4D Imitation method to generate physically plausible human-like reactions. The learned policy is capable of generating physically plausible and human-li......

HairFastGAN: Realistic and Robust Hair Transfer with a Fast  Encoder-Based Approach

图片

Our paper addresses the complex task of transferring a hairstyle from a reference image to an input photo for virtual hair try-on. This task is challenging due to the need to adapt to various photo poses, the sensitivity of hairstyles, and the lack of objective metrics. The current state of the art hairstyle transfer methods use an optimization process for different parts of the approach, making them inexcusably slow. At the same time, faster encoder-based models are of very low quality because they either operate in StyleGAN's W+ space or use other low-dimensional image generators. Additionally, both approaches have a problem with hairstyle transfer when the source pose is very different from the target pose, because they either don't consider the pose at all or deal with it inefficiently. In our paper, we present the HairFast model, which uniquely solves these problems and achieves high resolution, near real-time performance, and superior reconstruction compared to optimiza......

CityGaussian: Real-time High-quality Large-Scale Scene Rendering with  Gaussians

图片

The advancement of real-time 3D scene reconstruction and novel view synthesis has been significantly propelled by 3D Gaussian Splatting (3DGS). However, effectively training large-scale 3DGS and rendering it in real-time across various scales remains challenging. This paper introduces CityGaussian (CityGS), which employs a novel divide-and-conquer training approach and Level-of-Detail (LoD) strategy for efficient large-scale 3DGS training and rendering. Specifically, the global scene prior and adaptive training data selection enables efficient training and seamless fusion. Based on fused Gaussian primitives, we generate different detail levels through compression, and realize fast rendering across various scales through the proposed block-wise detail levels selection and aggregation strategy. Extensive experimental results on large-scale scenes demonstrate that our approach attains state-of-theart rendering quality, enabling consistent real-time rendering of largescale scenes......

Condition-Aware Neural Network for Controlled Image Generation

图片

We present Condition-Aware Neural Network (CAN), a new method for adding control to image generative models. In parallel to prior conditional control methods, CAN controls the image generation process by dynamically manipulating the weight of the neural network. This is achieved by introducing a condition-aware weight generation module that generates conditional weight for convolution/linear layers based on the input condition. We test CAN on class-conditional image generation on ImageNet and text-to-image generation on COCO. CAN consistently delivers significant improvements for diffusion transformer models, including DiT and UViT. In particular, CAN combined with EfficientViT (CaT) achieves 2.78 FID on ImageNet 512x512, surpassing DiT-XL/2 while requiring 52x fewer MACs per sampling step. ......

Uncovering the Text Embedding in Text-to-Image Diffusion Models

图片

The correspondence between input text and the generated image exhibits opacity, wherein minor textual modifications can induce substantial deviations in the generated image. While, text embedding, as the pivotal intermediary between text and images, remains relatively underexplored. In this paper, we address this research gap by delving into the text embedding space, unleashing its capacity for controllable image editing and explicable semantic direction attributes within a learning-free framework. Specifically, we identify two critical insights regarding the importance of per-word embedding and their contextual correlations within text embedding, providing instructive principles for learning-free image editing. Additionally, we find that text embedding inherently possesses diverse semantic potentials, and further reveal this property through the lens of singular value decomposition (SVD). These uncovered properties offer practical utility for image editing and semantic disco......

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

图片

One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. First, we find that current vision-language datasets do not represent spatial relationships well enough; to alleviate this bottleneck, we create SPRIGHT, the first spatially-focused, large scale dataset, by re-captioning 6 million images from 4 widely used vision datasets. Through a 3-fold evaluation and analysis pipeline, we find that SPRIGHT largely improves upon existing datasets in capturing spatial relationships. To demonstrate its efficacy, we leverage only ~0.25% of SPRIGHT and achieve a 22% improvement in generating spatially accurate images while also improving the FID and CMMD scores. Secondly, we find that training......

Video Interpolation with Diffusion Models

图片

We present VIDIM, a generative model for video interpolation, which creates short videos given a start and end frame. In order to achieve high fidelity and generate motions unseen in the input data, VIDIM uses cascaded diffusion models to first generate the target video at low resolution, and then generate the high-resolution video conditioned on the low-resolution generated video. We compare VIDIM to previous state-of-the-art methods on video interpolation, and demonstrate how such works fail in most settings where the underlying motion is complex, nonlinear, or ambiguous while VIDIM can easily handle such cases. We additionally demonstrate how classifier-free guidance on the start and end frame and conditioning the super-resolution model on the original high-resolution frames without additional parameters unlocks high-fidelity results. VIDIM is fast to sample from as it jointly denoises all the frames to be generated, requires less than a billion parameters per diffusion mo......

StructLDM: Structured Latent Diffusion for 3D Human Generation

图片

Recent 3D human generative models have achieved remarkable progress by learning 3D-aware GANs from 2D images. However, existing 3D human generative methods model humans in a compact 1D latent space, ignoring the articulated structure and semantics of human body topology. In this paper, we explore more expressive and higher-dimensional latent space for 3D human modeling and propose StructLDM, a diffusion-based unconditional 3D human generative model, which is learned from 2D images. StructLDM solves the challenges imposed due to the high-dimensional growth of latent space with three key designs: 1) A semantic structured latent space defined on the dense surface manifold of a statistical human body template. 2) A structured 3D-aware auto-decoder that factorizes the global latent space into several semantic body parts parameterized by a set of conditional structured local NeRFs anchored to the body template, which embeds the properties learned from the 2D training data and can b......

A Unified and Interpretable Emotion Representation and Expression  Generation

图片

Canonical emotions, such as happy, sad, and fearful, are easy to understand and annotate. However, emotions are often compound, e.g. happily surprised, and can be mapped to the action units (AUs) used for expressing emotions, and trivially to the canonical ones. Intuitively, emotions are continuous as represented by the arousal-valence (AV) model. An interpretable unification of these four modalities - namely, Canonical, Compound, AUs, and AV - is highly desirable, for a better representation and understanding of emotions. However, such unification remains to be unknown in the current literature. In this work, we propose an interpretable and unified emotion model, referred as C2A2. We also develop a method that leverages labels of the non-unified models to annotate the novel unified one. Finally, we modify the text-conditional diffusion models to understand continuous numbers, which are then used to generate continuous expressions using our unified emotion model. Through quan......

Large Motion Model for Unified Multi-Modal Motion Generation

图片

Human motion generation, a cornerstone technique in animation and video production, has widespread applications in various tasks like text-to-motion and music-to-dance. Previous works focus on developing specialist models tailored for each task without scalability. In this work, we present Large Motion Model (LMM), a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model. A unified motion model is appealing since it can leverage a wide range of motion data to achieve broad generalization beyond a single task. However, it is also challenging due to the heterogeneous nature of substantially different motion data and tasks. LMM tackles these challenges from three principled aspects: 1) Data: We consolidate datasets with different modalities, formats and tasks into a comprehensive yet unified motion generation dataset, MotionVerse, comprising 10 tasks, 16 datasets, a total of 320k sequences, and 100 million frames. 2) Archite......

CosmicMan: A Text-to-Image Foundation Model for Humans

图片

We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and text-image misalignment for humans, CosmicMan enables generating photo-realistic human images with meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions. At the heart of CosmicMan's success are the new reflections and perspectives on data and models: (1) We found that data quality and a scalable data production flow are essential for the final results from trained models. Hence, we propose a new data production paradigm, Annotate Anyone, which serves as a perpetual data flywheel to produce high-quality data with accurate yet cost-effective annotations over time. Based on this, we constructed a large-scale dataset, CosmicMan-HQ 1.0, with 6 Million high-quality real-world human images in a mean resolution of 1488......

MagicMirror: Fast and High-Quality Avatar Generation with a Constrained  Search Space

图片

We introduce a novel framework for 3D human avatar generation and personalization, leveraging text prompts to enhance user engagement and customization. Central to our approach are key innovations aimed at overcoming the challenges in photo-realistic avatar synthesis. Firstly, we utilize a conditional Neural Radiance Fields (NeRF) model, trained on a large-scale unannotated multi-view dataset, to create a versatile initial solution space that accelerates and diversifies avatar generation. Secondly, we develop a geometric prior, leveraging the capabilities of Text-to-Image Diffusion Models, to ensure superior view invariance and enable direct optimization of avatar geometry. These foundational ideas are complemented by our optimization pipeline built on Variational Score Distillation (VSD), which mitigates texture loss and over-saturation issues. As supported by our extensive experiments, these strategies collectively enable the creation of custom avatars with unparalleled vis......

这篇关于Stale Diffusion、Drag Your Noise、PhysReaction、CityGaussian的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/881159

相关文章

使用亚马逊Bedrock的Stable Diffusion XL模型实现文本到图像生成:探索AI的无限创意

引言 什么是Amazon Bedrock? Amazon Bedrock是亚马逊云服务(AWS)推出的一项旗舰服务,旨在推动生成式人工智能(AI)在各行业的广泛应用。它的核心功能是提供由顶尖AI公司(如AI21 Labs、Anthropic、Cohere、Meta、Mistral AI、Stability AI以及亚马逊自身)开发的多种基础模型(Foundation Models,简称FMs)。

Differential Diffusion,赋予每个像素它应有的力量,以及在comfyui中的测试效果

🥽原论文要点 首先是原论文地址:https://differential-diffusion.github.io/paper.pdf 其次是git介绍地址:GitHub - exx8/differential-diffusion 感兴趣的朋友们可以自行阅读。 首先,论文开篇就给了一个例子: 我们的方法根据给定的图片和文本提示,以不同的程度改变图像的不同区域。这种可控性允许我们再现

diffusion model 合集

diffusion model 整理 DDPM: 前向一步到位,从数据集里的图片加噪声,根据随机到的 t t t 决定混合的比例,反向要慢慢迭代,DDPM是用了1000步迭代。模型的输入是带噪声图和 t,t 先生成embedding后,用通道和的方式加到每一层中间去: 训练过程是对每个样本分配一个随机的t,采样一个高斯噪声 ϵ \epsilon ϵ,然后根据 t 对图片和噪声进行混合,将加噪

如何在算家云搭建模型Stable-diffusion-webUI(AI绘画)

一、Stable Diffusion WebUI简介 Stable Diffusion WebUI 是一个网页版的 AI 绘画工具,基于强大的绘画模型Stable Diffusion ,可以实现文生图、图生图等。 二、模型搭建流程 1.选择主机和镜像 (1)进入算家云的“应用社区”,点击搜索或者找到"stable-diffusion-webui,进入详情页后,点击“创建应用”

Stable Diffusion【提示词】【居家设计】:AI绘画给你的客厅带来前所未有的视觉盛宴!

前言 参数设置大模型:RealVisXL V4.0 Lightning采样器:DPM++ SDE Karras采样迭代步数:5CFG:2图片宽高:1024*1024反向提示词:(octane render, render, drawing, anime, bad photo, bad photography:1.3),(worst quality, low quality, blurry:1.2

StyleGAN和Diffusion结合能擦出什么火花?PreciseControl:实现文本到图像生成中的精确属性控制!

之前给大家介绍过CycleGAN和Diffusion结合的一项优秀的工作,感兴趣的小伙伴可以点击以下链接阅读~ 图像转换新风尚!当CycleGAN遇到Diffusion能擦出什么火花?CycleGAN-Turbo来了! 今天给大家介绍StyleGAN和Diffusion结合的一项工作PreciseControl,通过结合扩散模型和 StyleGAN 实现文本到图像生成中的精确属性控制,该文章已

VideoCrafter1:Open Diffusion models for high-quality video generation

https://zhuanlan.zhihu.com/p/677918122https://zhuanlan.zhihu.com/p/677918122 视频生成无论是文生视频,还是图生视频,图生视频这块普遍的操作还是将图片作为一个模态crossattention进unet进行去噪,这一步是需要训练的,svd除此之外,还将图片和noise做拼接,这一步,很多文生视频的方式通过通过这一步来扩展其成

24全网最全stable diffusion模型讲解!快来!!新手必收藏!!

前言 手把手教你入门绘图超强的AI绘画程序Stable Diffusion,用户只需要输入一段图片的文字描述,即可生成精美的绘画。给大家带来了全新Stable Diffusion保姆级教程资料包(文末可获取) AI模型最新展现出的图像生成能力远远超出人们的预期,直接根据文字描述就能创造出具有惊人视觉效果的图像,其背后的运行机制显得十分神秘与神奇,但确实影响了人类创造艺术的方式。 AI模型最新

ondrag 事件_html5拖放(Drag和Drop)

二、相关重点 DataTransfer 对象:退拽对象用来传递的媒介,使用一般为Event.dataTransfer。 draggable 属性:就是标签元素要设置draggable=true,否则不会有效果,例如: <div title="拖拽我" draggable="true">列表1</div> ondragstart 事件:当拖拽元素开始被拖拽的时候触发的事件,此事件作用在被拖曳元

Stable Diffusion之提示词指南(三)

在上一篇的文章中,我们讲解了Stable Diffusion提示词的高级用法,对于一些高级属性有了了解。如果有不记得的,可以再去看看———Stable Diffusion之提示词指南(二)。今天我们讲解一下负提示词。 负提示词 负向提示词:简单说就是告诉AI你想不要绘制什么,不要在画面中出现的内容。 可以看到在Web UI页面中负提示词也是和正提示词一样,有一个输入框,一般我们不输入也是