Mama-2022

2024-07-01

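Study materials on key techniques for multimodal large models: multimodal instruction tuning, multimodal chain of thought, LLM-aided visual reasoning, and multimodal in-context learning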

1. Multimodal Instruction Tuning
Visual Instruction Tuning.pdf
Visual Instruction Tuning with Polite Flamingo.pdf
X-LLM Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages.pdf
DetGPT Detect What You Need via Reasoning.pdf
Video-ChatGPT Towards Detailed Video Understanding via Large Vision and Language Models.pdf
VisionLLM Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks.pdf
VideoChat Chat-Centric Video Understanding.pdf
Video-LLaMA An Instruction-tuned Audio-Visual Language Model for Video Understanding.pdf
Shikra Unleashing Multimodal LLM's Referential Dialogue Magic.pdf
PMC-VQA Visual Instruction Tuning for Medical Visual Question Answering.pdf
PandaGPT One Model To Instruction-Follow Them All.pdf
mPLUG-Owl Modularization Empowers Large Language Models with Multimodality.pdf
MultiInstruct Improving Multi-Modal Zero-Shot Learning via Instruction Tuning.pdf
MultiModal-GPT A Vision and Language Model for Dialogue with Humans.pdf
LMEye An Interactive Perception Network for Large Language Models.pdf
MiniGPT-4 Enhancing Vision-Language Understanding with Advanced Large Language Models.pdf
LLaVAR Enhanced Visual Instruction Tuning for Text-Rich Image Understanding.pdf
MIMIC-IT Multi-Modal In-Context Instruction Tuning.pdf
Macaw-LLM Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration.pdf
M3IT A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning.pdf
LLaVA-Med Training a Large Language-and-Vision Assistant for Biomedicine in One Day.pdf
Listen, Think, and Understand.pdf
LLaMA-Adapter Efficient Fine-tuning of Language Models with Zero-init Attention.pdf
LLaMA-Adapter V2 Parameter-Efficient Visual Instruction Model.pdf
InstructBLIP Towards General-purpose Vision-Language Models with Instruction Tuning.pdf
LAMM Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark.pdf
...
2. Multimodal Chain of Thought
Visual Programming Compositional visual reasoning without training.pdf
MM-REACT Prompting ChatGPT for Multimodal Reasoning and Action.pdf
Learn to Explain Multimodal Reasoning via Thought Chains for Science Question Answering.pdf
Visual Chain of Thought Bridging Logical Gaps with Multimodal Infillings.pdf
Visual ChatGPT Talking, Drawing and Editing with Visual Foundation Models.pdf
Let's Think Frame by Frame Evaluating Video Chain of Thought with Video Infilling and Prediction.pdf
Multimodal Chain-of-Thought Reasoning in Language Models.pdf
Chain of Thought Prompt Tuning in Vision Language Models.pdf
EmbodiedGPT Vision-Language Pre-Training via Embodied Chain of Thought.pdf
Caption Anything Interactive Image Description with Diverse Multimodal Controls.pdf
Chameleon Plug-and-Play Compositional Reasoning with Large Language Models.pdf
Explainable Multimodal Emotion Reasoning.pdf

3. LLM-Aided Visual Reasoning
ViperGPT Visual Inference via Python Execution for Reasoning.pdf
Visual Programming Compositional visual reasoning without training.pdf
SuS-X Training-Free Name-Only Transfer of Vision-Language Models.pdf
Mindstorms in Natural Language-Based Societies of Mind.pdf
Visual ChatGPT Talking, Drawing and Editing with Visual Foundation Models.pdf
LayoutGPT Compositional Visual Planning and Generation with Large Language Models.pdf
Socratic Models Composing Zero-Shot Multimodal Reasoning with Language.pdf
MM-REACT Prompting ChatGPT for Multimodal Reasoning and Action.pdf
Retrieving-to-Answer Zero-Shot Video Question Answering with Frozen Large Language Models.pdf
Prompt, Generate, then Cache Cascade of Foundation Models makes Strong Few-shot Learners.pdf
PointCLIP V2 Adapting CLIP for Powerful 3D Open-world Learning.pdf
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation.pdf
ChatGPT Asks BLIP-2 Answers Automatic Questioning Towards Enriched Visual Descriptions.pdf
GPT4Tools Teaching Large Language Model to Use Tools via Self-instruction.pdf
HuggingGPT Solving AI Tasks with ChatGPT and its Friends in HuggingFace.pdf
IdealGPT Iteratively Decomposing Vision and Language Reasoning via Large Language Models.pdf
Caption Anything Interactive Image Description with Diverse Multimodal Controls.pdf
AssistGPT A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn.pdf
Chameleon Plug-and-Play Compositional Reasoning with Large Language Models.pdf

4. Multimodal In-Context Learning

Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition.pdf
Link-Context Learning for Multimodal LLMs.pdf
Multimodal Foundation Models For Echocardiogram Interpretation.pdf
Proactive Human-Robot Interaction using Visuo-Lingual Transformers.pdf
Lightweight In-Context Tuning for Multimodal Unified Models.pdf
MMHQA-ICL Multimodal In-context Learning for Hybrid Question Answering over Text, Tables and Images.pdf
HowToCaption Prompting LLMs to Transform Video Annotations at Scale.pdf
Large Language Models are Visual Reasoning Coordinators.pdf
Language as the Medium Multimodal Video Classification through text only.pdf