VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models


In recent years, a class of generative models known as diffusion models has come to dominate text-to-image synthesis, quickly becoming popular due to their impressive results. This post first reviews how diffusion works and how it has revolutionized the text-to-image task, then turns to VectorFusion, which extends it to vector graphics.

Diffusion models are trained by gradually corrupting images with Gaussian noise and learning to reverse that corruption. During training, the model sees images from a large captioned dataset at many noise levels and learns to predict the noise that was added, conditioned on the image's text description.

Generation then runs the process in reverse: starting from pure random noise, the model repeatedly predicts and removes a small amount of noise at each step. Over hundreds of such denoising steps, the noise gradually resolves into a coherent image that matches the text prompt.
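As a concrete illustration, here is a minimal sketch of that reverse (sampling) loop in the style of DDPM. The noise-prediction network `eps_model` and its call signature are placeholders for a trained text-conditioned model, not any particular library's API:

```python
import torch

# Illustrative DDPM-style sampling loop (a sketch, not the paper's code).
# `eps_model` is a hypothetical trained network that predicts the noise
# in a noisy image, conditioned on a text embedding `cond`.

T = 1000                                  # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)     # standard linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(eps_model, cond, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = eps_model(x, t, cond)       # predict the noise at step t
        # Remove the predicted noise (DDPM posterior mean).
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:                         # add fresh noise except at the last step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                              # generated image matching the text
```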

The advantages of diffusion models over earlier text-to-image approaches are sample quality and generality. They produce detailed, diverse images, and because they are trained on broad captioned datasets they are not limited to a specific domain: the same model can render essentially any subject its training data covers.

The results of diffusion have been nothing short of impressive. Many text-to-image synthesis methods have been proposed over the years, but few match the quality of diffusion samples. For example, earlier GAN-based methods such as GAN-INT often produce blurry images that lack the detail of images generated by diffusion models.

Diffusion has quickly become the go-to method for text-to-image synthesis, and diffusion-based models are currently the strongest text-to-image systems available.

Text to Vector

All of the models above are raster-based, so their output loses clarity when zoomed in. Vector images (Scalable Vector Graphics, or SVGs), by contrast, are described by geometric paths that are re-rendered at any resolution, so they stay sharp at every zoom level.
Ajay Jain et al.'s new method, "VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models", achieves exactly this text-to-vector generation.


Abstract

Diffusion models have shown impressive results in text-to-image synthesis. Using massive datasets of captioned images, diffusion models learn to generate raster images of highly diverse objects and scenes. However, designers frequently use vector representations of images like Scalable Vector Graphics (SVGs) for digital icons, graphics, and stickers. Vector graphics can be scaled to any size, and are compact. In this work, we show that a text-conditioned diffusion model trained on pixel representations of images can be used to generate SVG-exportable vector graphics. We do so without access to large datasets of captioned SVGs. Instead, inspired by recent work on text-to-3D synthesis, we vectorize a text-to-image diffusion sample and fine-tune it with a Score Distillation Sampling loss. By optimizing a differentiable vector graphics rasterizer, our method distills abstract semantic knowledge out of a pretrained diffusion model. By constraining the vector representation, we can also generate coherent pixel art and sketches. Our approach, VectorFusion, produces more coherent graphics than prior works that optimize CLIP, a contrastive image-text model.

Examples of vector graphics generated by VectorFusion.

Prompts

"a train. minimal flat 2d vector icon. lineal color. on a white background. trending on artstation."

"Vector graphics are compact but can be scaled to arbitrary size while staying sharp, so the output is infinitely scalable"

One-stage text-to-SVG generation

We optimize vector graphics against an image-text loss based on Score Distillation Sampling (SDS). VectorFusion uses an inverse graphics approach, enabled by the DiffVG differentiable SVG renderer: gradients of the loss flow through the rasterized image back into the SVG path parameters.
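Concretely, SDS turns the pretrained diffusion model into a critic: the current SVG is rasterized, noised, and the model's noise-prediction error is injected as a gradient back through the rasterizer into the path parameters. The sketch below works in pixel space for clarity (VectorFusion actually computes this loss in the latent space of Stable Diffusion's autoencoder); `render` and `eps_model` are placeholder names, with `render` standing in for a differentiable rasterizer such as DiffVG:

```python
import torch

# Sketch of one Score Distillation Sampling (SDS) step over SVG parameters.
# `render(params)` is a stand-in for a differentiable rasterizer like DiffVG,
# and `eps_model` for a pretrained text-conditioned diffusion U-Net; both
# names are illustrative, not the actual VectorFusion API.

def sds_step(params, optimizer, eps_model, cond, alpha_bars):
    image = render(params)                      # rasterize current SVG paths
    t = torch.randint(50, 950, (1,)).item()     # sample a diffusion timestep
    noise = torch.randn_like(image)
    a_bar = alpha_bars[t]
    noisy = torch.sqrt(a_bar) * image + torch.sqrt(1 - a_bar) * noise
    with torch.no_grad():                       # never backprop through the U-Net
        eps_hat = eps_model(noisy, t, cond)
    # SDS treats (eps_hat - noise) as the gradient of the loss w.r.t. the image,
    # so we inject it directly and let autograd carry it back to the SVG params.
    grad = eps_hat - noise
    loss = (grad.detach() * image).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```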

Generating 128 SVG paths from scratch.

Fine-tuning for better quality and speed

VectorFusion also supports a more efficient, higher-quality multi-stage setting. First, our method samples raster images from the Stable Diffusion text-to-image diffusion model. VectorFusion then traces those samples automatically with LIVE. However, these samples are often hard to represent as vector graphics, look dull, or miss details of the text. Fine-tuning with Score Distillation Sampling improves vibrancy and consistency with the text.
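Put together, the multi-stage pipeline looks roughly like the sketch below. `sample_stable_diffusion`, `live_trace`, and `encode_text` are hypothetical helper names (a Stable Diffusion sampler, the LIVE vectorizer, and a text encoder), and `sds_step` is the update sketched earlier; none of these is the project's actual API:

```python
import torch

# High-level sketch of the multi-stage text-to-SVG pipeline described above.
# All helper names are placeholders standing in for: a Stable Diffusion
# sampler, the LIVE vectorizer, a text encoder, and the SDS update.

def text_to_svg(prompt, eps_model, alpha_bars, num_paths=64, steps=500):
    raster = sample_stable_diffusion(prompt)       # stage 1: sample a raster image
    params = live_trace(raster, num_paths)         # stage 2: vectorize it with LIVE
    cond = encode_text(prompt)
    optimizer = torch.optim.Adam(params, lr=0.01)  # params: list of path tensors
    for _ in range(steps):                         # stage 3: SDS fine-tuning
        sds_step(params, optimizer, eps_model, cond, alpha_bars)
    return params                                  # refined SVG path parameters
```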

Generating 64 SVG paths, initialized from a diffusion sample.

Pixel Art

By restricting SVG paths to be squares on a grid, following Pixray, VectorFusion can generate a retro video-game pixel art style.
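A minimal way to express that constraint, under the assumption that each grid cell is a fixed axis-aligned square whose only free parameter is its color, is sketched below; `init_pixel_grid` and `render_pixel_grid` are illustrative names, and nearest-neighbor upsampling stands in for rendering the square paths:

```python
import torch

# Sketch of the pixel-art constraint: each "path" is a fixed square on an
# n x n grid, so only the per-cell RGB colors are free parameters. The SDS
# update then optimizes colors alone; the grid geometry never changes.

def init_pixel_grid(n=32):
    colors = torch.rand(n, n, 3, requires_grad=True)   # one color per cell
    return colors

def render_pixel_grid(colors, out_size=256):
    # Nearest-neighbor upsampling reproduces the hard square edges of the
    # retro pixel-art look, standing in for rasterizing square SVG paths.
    img = colors.permute(2, 0, 1).unsqueeze(0)          # (1, 3, n, n)
    return torch.nn.functional.interpolate(img, size=out_size, mode="nearest")
```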

Pixel art generated with VectorFusion, initialized by pixelating a diffusion sample.

Sketches

It's simple to extend our method to support text-to-sketch generation. We start by drawing 16 random strokes, then optimize our latent Score Distillation Sampling loss to learn an abstract line drawing that reflects the user-supplied text.
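A sketch of that stroke initialization, assuming each stroke is a cubic Bézier curve with four control points that the SDS loss then optimizes (the function name, canvas size, and initialization scale are illustrative):

```python
import torch

# Sketch of stroke initialization for text-to-sketch: 16 cubic Bézier curves
# with random control points. The control points are the only optimized
# parameters; stroke count and width stay fixed during optimization.

def init_strokes(num_strokes=16, canvas=224):
    strokes = []
    for _ in range(num_strokes):
        start = torch.rand(2) * canvas                  # random start point
        # Three more control points near the start keep initial strokes short.
        ctrl = start + (torch.rand(3, 2) - 0.5) * 0.2 * canvas
        points = torch.cat([start.unsqueeze(0), ctrl], dim=0)
        points.requires_grad_(True)                     # optimized by SDS
        strokes.append(points)
    return strokes
```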

Sketches generated with VectorFusion.
The full VectorFusion gallery of generated vectors can be accessed on the project page.
