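Study materials on key technologies for multimodal large models: multimodal instruction tuning, multimodal chain of thought, LLM-aided visual reasoning, and multimodal in-context learning
1. Multimodal Instruction Tuning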
Visual Instruction Tuning. pdf 
Visual Instruction Tuning with Polite Flamingo. pdf 
X-LLM Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages. pdf 
DetGPT Detect What You Need via Reasoning. pdf 
Video-ChatGPT Towards Detailed Video Understanding via Large Vision and Language Models. pdf 
VisionLLM Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks. pdf 
VideoChat Chat-Centric Video Understanding. pdf 
Video-LLaMA An Instruction-tuned Audio-Visual Language Model for Video Understanding. pdf 
Shikra Unleashing Multimodal LLM's Referential Dialogue Magic. pdf 
PMC-VQA Visual Instruction Tuning for Medical Visual Question Answering. pdf 
PandaGPT One Model To Instruction-Follow Them All. pdf 
mPLUG-Owl Modularization Empowers Large Language Models with Multimodality. pdf 
MultiInstruct Improving Multi-Modal Zero-Shot Learning via Instruction Tuning. pdf 
MultiModal-GPT A Vision and Language Model for Dialogue with Humans. pdf 
LMEye An Interactive Perception Network for Large Language Models. pdf 
MiniGPT-4 Enhancing Vision-Language Understanding with Advanced Large Language Models. pdf 
LLaVAR Enhanced Visual Instruction Tuning for Text-Rich Image Understanding. pdf 
MIMIC-IT Multi-Modal In-Context Instruction Tuning. pdf 
Macaw-LLM Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration. pdf 
M3IT A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning. pdf 
LLaVA-Med Training a Large Language-and-Vision Assistant for Biomedicine in One Day. pdf 
Listen, Think, and Understand. pdf 
LLaMA-Adapter Efficient Fine-tuning of Language Models with Zero-init Attention. pdf 
LLaMA-Adapter V2 Parameter-Efficient Visual Instruction Model. pdf 
InstructBLIP Towards General-purpose Vision-Language Models with Instruction Tuning. pdf 
LAMM Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark. pdf
...
2. Multimodal Chain of Thought
Visual Programming Compositional visual reasoning without training. pdf 
MM-REACT Prompting ChatGPT for Multimodal Reasoning and Action. pdf 
Learn to Explain Multimodal Reasoning via Thought Chains for Science Question Answering. pdf 
Visual Chain of Thought Bridging Logical Gaps with Multimodal Infillings. pdf 
Visual ChatGPT Talking, Drawing and Editing with Visual Foundation Models. pdf 
Let's Think Frame by Frame Evaluating Video Chain of Thought with Video Infilling and Prediction. pdf 
Multimodal Chain-of-Thought Reasoning in Language Models. pdf
Chain of Thought Prompt Tuning in Vision Language Models. pdf
EmbodiedGPT Vision-Language Pre-Training via Embodied Chain of Thought. pdf 
Caption Anything Interactive Image Description with Diverse Multimodal Controls. pdf 
Chameleon Plug-and-Play Compositional Reasoning with Large Language Models. pdf 
Explainable Multimodal Emotion Reasoning. pdf
3. LLM-Aided Visual Reasoning
ViperGPT Visual Inference via Python Execution for Reasoning. pdf 
Visual Programming Compositional visual reasoning without training. pdf 
SuS-X Training-Free Name-Only Transfer of Vision-Language Models. pdf 
Mindstorms in Natural Language-Based Societies of Mind. pdf 
Visual ChatGPT Talking, Drawing and Editing with Visual Foundation Models. pdf 
LayoutGPT Compositional Visual Planning and Generation with Large Language Models. pdf 
Socratic Models Composing Zero-Shot Multimodal Reasoning with Language. pdf
MM-REACT Prompting ChatGPT for Multimodal Reasoning and Action. pdf 
Retrieving-to-Answer Zero-Shot Video Question Answering with Frozen Large Language Models. pdf 
Prompt, Generate, then Cache Cascade of Foundation Models makes Strong Few-shot Learners. pdf 
PointCLIP V2 Adapting CLIP for Powerful 3D Open-world Learning. pdf 
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation. pdf 
ChatGPT Asks BLIP-2 Answers Automatic Questioning Towards Enriched Visual Descriptions. pdf 
GPT4Tools Teaching Large Language Model to Use Tools via Self-instruction. pdf 
HuggingGPT Solving AI Tasks with ChatGPT and its Friends in HuggingFace. pdf 
IdealGPT Iteratively Decomposing Vision and Language Reasoning via Large Language Models. pdf 
Caption Anything Interactive Image Description with Diverse Multimodal Controls. pdf 
AssistGPT A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn. pdf 
Chameleon Plug-and-Play Compositional Reasoning with Large Language Models. pdf
4. Multimodal In-Context Learning
Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition. pdf 
Link-Context Learning for Multimodal LLMs. pdf 
Multimodal Foundation Models For Echocardiogram Interpretation. pdf 
Proactive Human-Robot Interaction using Visuo-Lingual Transformers. pdf 
Lightweight In-Context Tuning for Multimodal Unified Models. pdf 
MMHQA-ICL Multimodal In-context Learning for Hybrid Question Answering over Text, Tables and Images. pdf 
HowToCaption Prompting LLMs to Transform Video Annotations at Scale. pdf 
Large Language Models are Visual Reasoning Coordinators. pdf 
Language as the Medium Multimodal Video Classification through Text Only. pdf