Style Transfer、Move Anything、InstantMesh、RealmDreamer、GoodDrag

2024-04-11 19:44

本文主要是介绍Style Transfer、Move Anything、InstantMesh、RealmDreamer、GoodDrag,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

本文首发于公众号:机器感知

https://mp.weixin.qq.com/s/Jm8oc_WIjKRRb0IBOgte5gicon-default.png?t=N7T8https://mp.weixin.qq.com/s/Jm8oc_WIjKRRb0IBOgte5g

GenCHiP: Generating Robot Policy Code for High-Precision and  Contact-Rich Manipulation Tasks

图片

Large Language Models (LLMs) have been successful at generating robot policy code, but so far these results have been limited to high-level tasks that do not require precise movement. It is an open question how well such approaches work for tasks that require reasoning over contact forces and working within tight success tolerances. We find that, with the right action space, LLMs are capable of successfully generating policies for a variety of contact-rich and high-precision manipulation tasks, even under noisy conditions, such as perceptual errors or grasping inaccuracies. Specifically, we reparameterize the action space to include compliance with constraints on the interaction forces and stiffnesses involved in reaching a target pose. We validate this approach on subtasks derived from the Functional Manipulation Benchmark (FMB) and NIST Task Board Benchmarks. Exposing this action space alongside methods for estimating object poses improves policy generation with an LLM by g......

RULER: What's the Real Context Size of Your Long-Context Language  Models?

图片

The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark RULER with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context. We evaluate ten long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, all models exhibit large performance drops as the context length increases. ......

VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving  Zero-Shot Voice Editing

图片

We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre. Previous works have been constrained to specialized models that can only edit these attributes individually and suffer from the following pitfalls: the magnitude of the conversion effect is weak, there is no zero-shot capability for out-of-distribution speakers, or the synthesized outputs exhibit timbre leakage which changes the speaker's perceived identity. Our work proposes solutions for each of these issues in a simple modular framework based on a conditional diffusion backbone model with optional normalizing flow-based and sequence-to-sequence speaker attribute-editing modules, whose components can be combined or removed during inference to meet a wide array of tasks without additional model finetuning. Audio samples are available at https://voiceshopai.g......

DiffusionDialog: A Diffusion Model for Diverse Dialog Generation with  Latent Space

图片

In real-life conversations, the content is diverse, and there exists the one-to-many problem that requires diverse generation. Previous studies attempted to introduce discrete or Gaussian-based continuous latent variables to address the one-to-many problem, but the diversity is limited. Recently, diffusion models have made breakthroughs in computer vision, and some attempts have been made in natural language processing. In this paper, we propose DiffusionDialog, a novel approach to enhance the diversity of dialogue generation with the help of diffusion model. In our approach, we introduce continuous latent variables into the diffusion model. The problem of using latent variables in the dialog task is how to build both an effective prior of the latent space and an inferring process to obtain the proper latent given the context. By combining the encoder and latent-based diffusion model, we encode the response's latent representation in a continuous space as the prior, instead o......

Adapting LLaMA Decoder to Vision Transformer

图片

This work examines whether decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a casual mask to the self-attention brings an attention collapse issue, resulting in the failure to the network training. We suggest to reposition the class token behind the image tokens with a post-sequence class token technique to overcome this challenge, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces a casual mask to the self-attention at the onset of training to facilitate the optimization behavior. The tailored model, dubbed as image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct supervised learning. Its causal self-attention boosts computational efficie......

Efficient and Scalable Chinese Vector Font Generation via Component  Composition

图片

Chinese vector font generation is challenging due to the complex structure and huge amount of Chinese characters. Recent advances remain limited to generating a small set of characters with simple structure. In this work, we first observe that most Chinese characters can be disassembled into frequently-reused components. Therefore, we introduce the first efficient and scalable Chinese vector font generation approach via component composition, allowing generating numerous vector characters from a small set of components. To achieve this, we collect a large-scale dataset that contains over \textit{90K} Chinese characters with their components and layout information. Upon the dataset, we propose a simple yet effective framework based on spatial transformer networks (STN) and multiple losses tailored to font characteristics to learn the affine transformation of the components, which can be directly applied to the B\'ezier curves, resulting in Chinese characters in vector format. ......

Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior

图片

Text-to-3D generation has achieved remarkable success via large-scale text-to-image diffusion models. Nevertheless, there is no paradigm for scaling up the methodology to urban scale. Urban scenes, characterized by numerous elements, intricate arrangement relationships, and vast scale, present a formidable barrier to the interpretability of ambiguous textual descriptions for effective model optimization. In this work, we surmount the limitations by introducing a compositional 3D layout representation into text-to-3D paradigm, serving as an additional prior. It comprises a set of semantic primitives with simple geometric structures and explicit arrangement relationships, complementing textual descriptions and enabling steerable generation. Upon this, we propose two modifications -- (1) We introduce Layout-Guided Variational Score Distillation to address model optimization inadequacies. It conditions the score distillation sampling process with geometric and semantic constraint......

Tuning-Free Adaptive Style Incorporation for Structure-Consistent  Text-Driven Style Transfer

图片

In this work, we target the task of text-driven style transfer in the context of text-to-image (T2I) diffusion models. The main challenge is consistent structure preservation while enabling effective style transfer effects. The past approaches in this field directly concatenate the content and style prompts for a prompt-level style injection, leading to unavoidable structure distortions. In this work, we propose a novel solution to the text-driven style transfer task, namely, Adaptive Style Incorporation~(ASI), to achieve fine-grained feature-level style incorporation. It consists of the Siamese Cross-Attention~(SiCA) to decouple the single-track cross-attention to a dual-track structure to obtain separate content and style features, and the Adaptive Content-Style Blending (AdaBlending) module to couple the content and style information from a structure-consistent manner. Experimentally, our method exhibits much better performance in both structure preservation and stylized e......

DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic  Gaussian Splatting

图片

The increasing demand for virtual reality applications has highlighted the significance of crafting immersive 3D assets. We present a text-to-3D 360$^{\circ}$ scene generation pipeline that facilitates the creation of comprehensive 360$^{\circ}$ scenes for in-the-wild environments in a matter of minutes. Our approach utilizes the generative power of a 2D diffusion model and prompt self-refinement to create a high-quality and globally coherent panoramic image. This image acts as a preliminary "flat" (2D) scene representation. Subsequently, it is lifted into 3D Gaussians, employing splatting techniques to enable real-time exploration. To produce consistent 3D geometry, our pipeline constructs a spatially coherent structure by aligning the 2D monocular depth into a globally optimized point cloud. This point cloud serves as the initial state for the centroids of 3D Gaussians. In order to address invisible issues inherent in single-view inputs, we impose semantic and geometric con......

Gaussian-LIC: Photo-realistic LiDAR-Inertial-Camera SLAM with 3D  Gaussian Splatting

图片

We present a real-time LiDAR-Inertial-Camera SLAM system with 3D Gaussian Splatting as the mapping backend. Leveraging robust pose estimates from our LiDAR-Inertial-Camera odometry, Coco-LIC, an incremental photo-realistic mapping system is proposed in this paper. We initialize 3D Gaussians from colorized LiDAR points and optimize them using differentiable rendering powered by 3D Gaussian Splatting. Meticulously designed strategies are employed to incrementally expand the Gaussian map and adaptively control its density, ensuring high-quality mapping with real-time capability. Experiments conducted in diverse scenarios demonstrate the superior performance of our method compared to existing radiance-field-based SLAM systems. ......

Move Anything with Layered Scene Diffusion

图片

Diffusion models generate images with an unprecedented level of quality, but how can we freely rearrange image layouts? Recent works generate controllable scenes via learning spatially disentangled latent codes, but these methods do not apply to diffusion models due to their fixed forward process. In this work, we propose SceneDiffusion to optimize a layered scene representation during the diffusion sampling process. Our key insight is that spatial disentanglement can be obtained by jointly denoising scene renderings at different spatial layouts. Our generated scenes support a wide range of spatial editing operations, including moving, resizing, cloning, and layer-wise appearance editing operations, including object restyling and replacing. Moreover, a scene can be generated conditioned on a reference image, thus enabling object moving for in-the-wild images. Notably, this approach is training-free, compatible with general text-to-image diffusion models, and responsive in les......

InstantMesh: Efficient 3D Mesh Generation from a Single Image with  Sparse-view Large Reconstruction Models

图片

We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability. By synergizing the strengths of an off-the-shelf multiview diffusion model and a sparse-view reconstruction model based on the LRM architecture, InstantMesh is able to create diverse 3D assets within 10 seconds. To enhance the training efficiency and exploit more geometric supervisions, e.g, depths and normals, we integrate a differentiable iso-surface extraction module into our framework and directly optimize on the mesh representation. Experimental results on public datasets demonstrate that InstantMesh significantly outperforms other latest image-to-3D baselines, both qualitatively and quantitatively. We release all the code, weights, and demo of InstantMesh, with the intention that it can make substantial contributions to the community of 3D generative AI and empower both researchers and co......

RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth  Diffusion

图片

We introduce RealmDreamer, a technique for generation of general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by utilizing the state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model by conditioning on the samples from the inpainting model, giving rich geometric structure. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require video or multi-view data and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image. ......

GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models

图片

In this paper, we introduce GoodDrag, a novel approach to improve the stability and image quality of drag editing. Unlike existing methods that struggle with accumulated perturbations and often result in distortions, GoodDrag introduces an AlDD framework that alternates between drag and denoising operations within the diffusion process, effectively improving the fidelity of the result. We also propose an information-preserving motion supervision operation that maintains the original features of the starting point for precise manipulation and artifact reduction. In addition, we contribute to the benchmarking of drag editing by introducing a new dataset, Drag100, and developing dedicated quality assessment metrics, Dragging Accuracy Index and Gemini Score, utilizing Large Multimodal Models. Extensive experiments demonstrate that the proposed GoodDrag compares favorably against the state-of-the-art approaches both qualitatively and quantitatively. The project page is https://goo......

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like  Multi-talker Conversations

图片

Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge in the field. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix is capable of first converting dialogue text into multiple streams of discrete tokens, with each token stream representing semantic information for individual talkers. These token streams are then fed into a flow-matching based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced using a HiFi-GAN model. Furthermore, we devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation. Our experimental results show that CoVoMix can generate dialogues tha......

这篇关于Style Transfer、Move Anything、InstantMesh、RealmDreamer、GoodDrag的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/895013

相关文章

论文阅读笔记: Segment Anything

文章目录 Segment Anything摘要引言任务模型数据引擎数据集负责任的人工智能 Segment Anything Model图像编码器提示编码器mask解码器解决歧义损失和训练 Segment Anything 论文地址: https://arxiv.org/abs/2304.02643 代码地址:https://github.com/facebookresear

Unity3D 运动之Move函数和translate

CharacterController.Move 移动 function Move (motion : Vector3) : CollisionFlags Description描述 A more complex move function taking absolute movement deltas. 一个更加复杂的运动函数,每次都绝对运动。 Attempts to

12C 新特性,MOVE DATAFILE 在线移动 包括system, 附带改名 NID ,cdb_data_files视图坏了

ALTER DATABASE MOVE DATAFILE  可以改名 可以move file,全部一个命令。 resue 可以重用,keep好像不生效!!! system照移动不误-------- SQL> select file_name, status, online_status from dba_data_files where tablespace_name='SYSTEM'

WebAPI (一)DOM树、DOM对象,操作元素样式(style className,classList)。表单元素属性。自定义属性。间歇函数定时器

文章目录 Web API基本认知一、 变量声明二、 DOM1. DOM 树2. DOM对象3. 获取DOM对象(1)、选择匹配的第一个元素(2)、选择匹配多个元素 三、 操作元素1. 操作元素内容2. 操作元素属性(1)、常用属性(href之类的)(2)、通过style属性操作CSS(3)、通过类名(className)操作CSS(4)、通过classList操作控制CSS(5)、操作表单

Midjourney 随机风格 (Style Random),开启奇幻视觉之旅

作者:老余捞鱼 原创不易,转载请标明出处及原作者。 写在前面的话:       Midjourney 最近推出了 "Style Random"(随机风格),这项功能可以让我们使用独特的随机 sref 代码创建图像,从而每次都能获得不同的美感。通过对这些功能的探索和尝试,我发现了一些很棒的风格,我很高兴能与大家分享,这样可以节省大家的时间,不用自己动手测试。在本文中,我将展示十个M

Android style(样式), theme(主题)资源

本文内容摘自《疯狂Android讲义 第三版-李刚著作》 样式和主题资源都用于对Android应用进行“美化”,只要充分利用Android应用的样式和主题资源,开发者就可以开发出各种风格的Android应用。 样式资源(style): 如果我们经常需要对某个类型的组件指定大致相似的格式,比如字体,颜色,背景色等,如果次都要为View组件重复指定这些属性,无疑会有大量的工作量,而且不利于项目后

10Python的Pandas:样式Style

Pandas 提供了多种样式选项,可以让你对数据框的显示进行格式化。这些样式可以帮助突出显示数据中的某些元素、设置颜色、格式化数字等。以下是一些常用的 Pandas 样式示例: 1. 基本样式设置 要为整个数据框应用样式,可以使用 style 属性。例如,你可以为所有的数值设置显示格式: import pandas as pd# 创建示例数据框df = pd.DataFrame({'A':

【Python报错已解决】`AttributeError: move_to requires a WebElement`

🎬 鸽芷咕:个人主页  🔥 个人专栏: 《C++干货基地》《粉丝福利》 ⛺️生活的理想,就是为了理想的生活! 文章目录 引言:一、问题描述:1.1 报错示例:1.2 报错分析:1.3 解决思路: 二、解决方法:2.1 方法一:正确获取并传递WebElement对象2.2 步骤二:使用find_element_by_*方法直接获取元素 三、其

后起之秀 | MySQL Binlog增量同步工具go-mysql-transfer实现详解

点击上方蓝色字体,选择“设为星标” 回复”资源“获取更多资源 一、 概述 工作需要研究了下阿里开源的MySQL Binlog增量订阅消费组件canal,其功能强大、运行稳定,但是有些方面不是太符合需求,主要有如下三点: 需要自己编写客户端来消费canal解析到的数据server-client模式,需要同时部署server和client两个组件,我们的项目中有6个业务数据库要实时同步到redis

Failed to find style 'vpiCirclePageIndicatorStyle' in current theme

使用 ViewPagerIndicator 时 , 布局文件报的错 Missing styles. Is the correct theme chosen for this layout? Use the Theme combo box above the layout to choose a different layout, or fix the theme style refe