Text-to-Image Synthesis Using a Conditional VAE
Exploring how Conditional Variational Autoencoders generate images from textual descriptions.
Imagine a model that can transform words into images, turning a simple description into a vivid visual representation. In this project, we explored the power of Conditional Variational Autoencoders (CVAEs) for generating images from text. By training on datasets like Fashion-MNIST and MS-COCO, we examined how well these models capture semantic meaning and produce images that reflect the nuances of both short labels and detailed captions.
Our approach integrated advanced text feature extraction techniques (e.g., CLIP and a Seq2Seq encoder with gated fusion) to condition the generation process, while also addressing challenges such as the blurriness typical of VAE outputs and the loss of fine-grained detail. This project lays the groundwork for future improvements in generative models, particularly in applications where blending textual and visual data is essential.
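To make the conditioning mechanism concrete, here is a minimal PyTorch sketch of a CVAE whose encoder and decoder both receive a text embedding alongside the image. It assumes 28x28 grayscale inputs (Fashion-MNIST) and a precomputed text feature such as a CLIP embedding; the layer sizes, dimensions, and names are illustrative placeholders, not the exact architecture from the report.

```python
# Minimal text-conditional VAE sketch. Assumptions: flattened 28x28 inputs,
# a 512-dim precomputed text embedding, and MLP encoder/decoder; all sizes
# here are illustrative, not the report's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalVAE(nn.Module):
    def __init__(self, img_dim=28 * 28, cond_dim=512, latent_dim=32, hidden=400):
        super().__init__()
        # Encoder: image concatenated with text condition -> latent stats
        self.enc = nn.Linear(img_dim + cond_dim, hidden)
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        # Decoder: latent sample concatenated with the same condition -> image
        self.dec = nn.Linear(latent_dim + cond_dim, hidden)
        self.out = nn.Linear(hidden, img_dim)

    def forward(self, x, c):
        h = F.relu(self.enc(torch.cat([x, c], dim=1)))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        h = F.relu(self.dec(torch.cat([z, c], dim=1)))
        return torch.sigmoid(self.out(h)), mu, logvar

def cvae_loss(recon, x, mu, logvar):
    # ELBO objective: reconstruction term plus KL divergence to N(0, I)
    bce = F.binary_cross_entropy(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld
```

At generation time, the encoder is discarded: one samples z from the standard normal prior and decodes it together with the embedding of the desired caption, which is what lets the same latent code produce different images under different text conditions.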
Find all the details in the full project report below: