🔥 What is Trending in AI Research?: CoVe Method + MUTEX + GPT-4V’s Vision Integration and Its Impact + LLM-Grounder...

This newsletter brings you AI research news that is more technical than most resources but still digestible and applicable.

Hey Folks!

This newsletter will discuss some cool AI research papers and AI tools. Happy learning!

👉 What is Trending in AI/ML Research? 

How can we keep large language models from generating plausible but incorrect information? This paper from Meta tackles hallucination in language models with the Chain-of-Verification (CoVe) method. In CoVe, the model (i) drafts an initial answer, (ii) plans verification questions that fact-check the draft, (iii) answers those questions independently, so the answers are not biased by the draft, and (iv) produces a final, verified answer. In the paper's experiments, CoVe reduced hallucinations across several tasks: list-based questions from Wikidata, closed-book MultiSpanQA, and long-form text generation.
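
The four CoVe steps map naturally onto four rounds of prompting. Here is a minimal sketch of that loop, not the paper's implementation: the `llm` callable, the prompt wording, and the `toy_llm` stand-in are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def chain_of_verification(query: str, llm: LLM) -> str:
    # (i) Draft an initial answer.
    draft = llm(f"Answer the question.\nQuestion: {query}")

    # (ii) Plan verification questions that fact-check the draft.
    plan = llm(
        "List one verification question per line for this answer.\n"
        f"Question: {query}\nAnswer: {draft}"
    )
    questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # (iii) Answer each question in a fresh prompt that leaves the draft out,
    # so the verification answers cannot simply repeat the draft's mistakes.
    checks = [(q, llm(f"Answer concisely.\nQuestion: {q}")) for q in questions]

    # (iv) Produce the final answer, revised against the verification results.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in checks)
    return llm(
        "Revise the draft so that it agrees with the verification results.\n"
        f"Question: {query}\nDraft: {draft}\nVerification:\n{evidence}"
    )


def toy_llm(prompt: str) -> str:
    """Canned stand-in for a real model, for demonstration only."""
    if prompt.startswith("Answer the question."):
        return "DRAFT"
    if prompt.startswith("List one verification"):
        return "Check A?\nCheck B?"
    if prompt.startswith("Answer concisely."):
        return "verified"
    return "FINAL"


if __name__ == "__main__":
    print(chain_of_verification("Name some politicians born in Boston.", toy_llm))
```

The detail worth noticing is step (iii): the verification prompts contain only the question, never the draft, which is what keeps the checks independent.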

This paper from DeepMind argues that large language models can be viewed as powerful general-purpose compressors. Looking at prediction through a compression lens, the authors report that Chinchilla 70B, although trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors such as PNG (58.5%) and FLAC (30.3%). The same equivalence between prediction and compression runs in the other direction too: the authors show that a traditional compressor can be turned into a conditional generative model.
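
The equivalence between prediction and compression comes down to one formula: a symbol that a model predicts with probability p costs about -log2(p) bits under arithmetic coding, so better prediction means shorter codes. The sketch below measures that ideal code length, assuming a simple adaptive byte-frequency model as a stand-in for a language model. The function name `code_length_bits` and the choice of model are illustrative assumptions, not the paper's setup.

```python
import math
from collections import Counter


def code_length_bits(data: bytes, alphabet_size: int = 256) -> float:
    """Ideal code length of `data`, in bits, under an adaptive order-0 model.

    An arithmetic coder driven by the same predictions would come within
    a few bits of this total.
    """
    counts = Counter()
    total_bits = 0.0
    for i, symbol in enumerate(data):
        # Laplace-smoothed probability of this byte given the bytes seen so far.
        p = (counts[symbol] + 1) / (i + alphabet_size)
        total_bits += -math.log2(p)
        counts[symbol] += 1  # update the model only after "encoding" the symbol
    return total_bits


if __name__ == "__main__":
    sample = b"abababababababababababababababab"
    print(f"raw: {8 * len(sample)} bits, model: {code_length_bits(sample):.1f} bits")
```

Swapping the frequency counter for a stronger predictor, such as a language model's next-token distribution, is the step that makes an LLM act as a compressor.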

How can robots understand and follow tasks that humans specify in different communication modalities? This study from UT Austin introduces MUTEX, an approach that learns robotic policies from multimodal task specifications. MUTEX uses a transformer-based architecture built for cross-modal reasoning and is trained in two stages that combine masked modeling and cross-modal matching objectives. Once trained, MUTEX can follow a task specified in any one, or any combination, of six modalities: video demonstrations, goal images, text goals, text instructions, speech goals, and speech instructions. Tests on a new dataset show MUTEX outperforming methods trained for a single modality.

GPT-4 with vision, known as GPT-4V, lets users instruct the model to analyze images they provide. Bringing image analysis into large language models (LLMs) is a significant step, and it is now being made widely accessible. Some consider adding further modalities, such as image inputs, to LLMs a key frontier in AI research and development. Multimodal LLMs can extend language-only systems with new interfaces and capabilities, allowing them to take on new tasks and offer new experiences to their users.

Like GPT-4, GPT-4V finished training in 2022, and early access began in March 2023. Its training process was similar to GPT-4's: the model was first trained to predict the next word in text, using a large dataset of text and image data from the internet and licensed sources. It was then fine-tuned with reinforcement learning from human feedback (RLHF) so that its outputs better align with human preferences.

How can household robots understand and respond to complex language queries about their environment without extensive labeled data? LLM-Grounder is a zero-shot, open-vocabulary 3D visual grounding system built around Large Language Models (LLMs). It first uses an LLM to decompose a complex natural language query into semantic elements. Visual grounding tools such as OpenScene or LERF then locate candidate objects in the 3D scene. Finally, the LLM evaluates the spatial and commonsense relations among those candidates to make the final grounding decision. Because it needs no labeled training data, the method generalizes to new 3D scenes and arbitrary text queries. On the ScanRefer benchmark, LLM-Grounder achieves state-of-the-art zero-shot grounding accuracy, with the largest gains on complex language queries.
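
The decompose, locate, and reason flow can be sketched in a few lines. This is only an illustration under stated assumptions: `decompose` stands in for the LLM's query parsing, `locate` stands in for a grounding tool such as OpenScene or LERF, and the LLM's spatial reasoning is replaced by a plain nearest-landmark rule. None of these names or interfaces come from the paper.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class Candidate:
    label: str
    center: Point  # center of the object's 3D bounding box


# Hypothetical interfaces, for illustration only.
Decomposer = Callable[[str], Dict[str, str]]  # query -> {"target": ..., "landmark": ...}
Grounder = Callable[[str], List[Candidate]]   # noun phrase -> candidate objects


def ground(query: str, decompose: Decomposer, locate: Grounder) -> Candidate:
    # Step 1: break the query into a target object and a landmark object.
    parts = decompose(query)
    # Step 2: ask the grounding tool for candidates matching each phrase.
    targets = locate(parts["target"])
    landmarks = locate(parts["landmark"])
    if not targets:
        raise ValueError("no candidates found for the target phrase")
    if not landmarks:
        return targets[0]
    # Step 3: stand-in for the LLM's spatial reasoning. Here we simply
    # pick the target candidate closest to any landmark candidate.
    return min(
        targets,
        key=lambda t: min(math.dist(t.center, lm.center) for lm in landmarks),
    )


# Toy scene and toy tools, made up for the demo.
SCENE = [
    Candidate("chair", (0.0, 0.0, 0.0)),
    Candidate("chair", (4.0, 1.0, 0.0)),
    Candidate("window", (5.0, 1.0, 1.0)),
]


def toy_decompose(query: str) -> Dict[str, str]:
    return {"target": "chair", "landmark": "window"}


def toy_locate(phrase: str) -> List[Candidate]:
    return [c for c in SCENE if c.label == phrase]


if __name__ == "__main__":
    print(ground("the chair near the window", toy_decompose, toy_locate))
```

In this toy scene the query resolves to the chair at (4.0, 1.0, 0.0), since it sits much closer to the window than the other chair does.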

How can we accelerate progress in multimodal applications for histopathology, given how little data is available in this domain? This paper introduces "Quilt", a dataset sourced from YouTube that provides 768,826 image-text pairs drawn from expert histopathology videos. That is far larger than previous histopathology collections, which topped out at around 200K samples. Combining Quilt with additional data from sources such as Twitter and research papers yields "Quilt-1M", the largest vision-language histopathology dataset to date, with 1 million paired samples. Using this resource, the researchers fine-tuned a CLIP model that outperforms prior models on histopathology image classification and cross-modal retrieval tasks.
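
For readers new to CLIP-style models: both classification and cross-modal retrieval reduce to comparing an image embedding against text embeddings by cosine similarity. The sketch below shows that comparison with made-up three-dimensional vectors. The labels and numbers are toy values chosen for illustration, not data or results from the paper, and a real model would produce the embeddings with its image and text encoders.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(image_emb: Sequence[float], text_embs: Dict[str, Sequence[float]]) -> str:
    """Return the label whose text embedding is most similar to the image."""
    return max(text_embs, key=lambda label: cosine(image_emb, text_embs[label]))


if __name__ == "__main__":
    # Toy embeddings, invented for the demo.
    prompts = {
        "tumor": [1.0, 0.0, 0.0],
        "normal": [0.0, 1.0, 0.0],
    }
    image = [0.9, 0.1, 0.0]
    print(classify(image, prompts))
```

Retrieval works the same way in reverse: rank a collection of images by their similarity to one text embedding and return the top matches.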

👉 What is Trending in AI Tools? 

  • AdCreative AI: Boost your advertising and social media game with AdCreative.ai, an AI tool for generating ad creatives. [Marketing and Sales]

  • Parsio (OCR + AI chat): Automate your data extraction with an AI-powered document parser. [Productivity]

  • Notion: A feature-rich note-taking and project management tool that serves as an all-in-one workspace for teams and individuals alike. [Project Management]

  • Decktopus: The ultimate online presentation tool that harnesses the power of AI to help you craft captivating presentations effortlessly. [Presentation]

  • Rask AI: A one-stop-shop localization tool that lets content creators and companies translate their videos into 130+ languages quickly and efficiently. [Speech and Translation]

  • Aragon: Get stunning professional headshots effortlessly with Aragon. [Profile]