Abstract:
In this paper, we propose a novel multimodal Retrieval-Augmented Generation (RAG) algorithm designed to efficiently convert a purely textual large language model (LLM) into a multimodal system without additional resource-intensive multimodal training. Our approach leverages compact text-based LLMs by incorporating external knowledge retrieved from multimodal inputs (text, images, and audio). This reduces computational cost while maintaining competitive performance, offering a cost-effective alternative to traditional heavyweight multimodal models.
Our proposed solution employs a modular architecture comprising four main components: retrieval, recognition, matching, and generation. The retrieval component uses a trimodal general representation model (ONE-PEACE) combined with Principal Component Analysis (PCA) for dimensionality reduction, enabling efficient retrieval from a large-scale Wikipedia-based knowledge database via Approximate Nearest Neighbor search (Annoy). The recognition component integrates specialized unimodal models (BLIP for image captioning, AST for audio classification, and Whisper for speech recognition) to generate concise textual descriptions. These descriptions are then semantically matched against the retrieved data using sentence embeddings from MPNet, ensuring that relevant context is integrated.
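As a rough illustration of how the four stages fit together, the following Python sketch wires retrieval, recognition, matching, and generation into one function. The ONE-PEACE embedding call, the recognition helper, the final LLM call, and the index/passage file names are hypothetical placeholders, not the paper's actual code; the Annoy and MPNet (sentence-transformers) calls use the real library APIs, and the PCA projection is assumed to have been fit offline on the knowledge-base embeddings.

```python
# Minimal sketch of the retrieval -> recognition -> matching -> generation pipeline.
# embed_with_one_peace(), describe_input(), and generate_answer() are hypothetical
# stand-ins for ONE-PEACE, BLIP/AST/Whisper, and the compact text-only LLM.

import numpy as np
from annoy import AnnoyIndex
from sentence_transformers import SentenceTransformer, util

PCA_DIM = 256  # assumed reduced dimensionality after PCA

def embed_with_one_peace(query) -> np.ndarray:
    """Placeholder: trimodal ONE-PEACE embedding of the raw query (text/image/audio)."""
    return np.zeros(1536, dtype=np.float32)

def describe_input(query) -> str:
    """Placeholder: BLIP caption, AST audio label, or Whisper transcript of the query."""
    return "a short textual description of the input"

def generate_answer(prompt: str) -> str:
    """Placeholder: call to the compact text-based LLM."""
    return "answer"

# 1) Retrieval: PCA-reduced ONE-PEACE vectors indexed with Annoy over the Wikipedia KB.
index = AnnoyIndex(PCA_DIM, "angular")
index.load("wikipedia_kb.ann")              # assumed prebuilt index file
kb_passages = ["passage 0", "passage 1"]    # assumed passage store aligned with the index

# Projection parameters would come from fitting PCA on the KB embeddings offline.
pca_mean = np.zeros(1536, dtype=np.float32)
pca_components = np.zeros((PCA_DIM, 1536), dtype=np.float32)

matcher = SentenceTransformer("all-mpnet-base-v2")  # MPNet sentence embeddings

def answer(query, question: str) -> str:
    # Retrieve candidate passages by approximate nearest-neighbor search.
    vec = (embed_with_one_peace(query) - pca_mean) @ pca_components.T
    ids = index.get_nns_by_vector(vec, 10)
    candidates = [kb_passages[i] for i in ids if i < len(kb_passages)]

    # 2) Recognition: concise textual description from the unimodal expert models.
    description = describe_input(query)

    # 3) Matching: keep the retrieved passages closest to the description.
    sims = util.cos_sim(matcher.encode(description), matcher.encode(candidates))[0]
    context = [candidates[i] for i in sims.argsort(descending=True)[:3]]

    # 4) Generation: the text-only LLM answers from the description plus matched context.
    context_block = "\n".join(context)
    prompt = f"Context:\n{context_block}\n\nInput: {description}\nQuestion: {question}"
    return generate_answer(prompt)
```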
We extensively evaluate our system using the MMBench and Tiny LVLM benchmarks, demonstrating its capabilities in visual reasoning, perception, knowledge acquisition, and visual commonsense tasks. Benchmarking results indicate that our approach achieves competitive or superior performance compared to several existing multimodal LLM architectures, particularly excelling in minimizing hallucinations and effectively handling cross-modal commonsense reasoning tasks.
Key words and phrases: LLM, multimodality, NLP, RAG.