AI News: What happens if you run a Transformer model with an optical neural network?; Amazon's Multimodal-CoT outperforms GPT-3.5; Stanford Human Preferences Dataset; T2I-Adapter, and more!

This newsletter brings you AI research news that is more technical than most resources but still digestible and applicable.

Hi there! Today we will share research updates including "Auditing LLMs: A 3-Layer Approach," "Did you know that you can use Neural Radiance Fields (NeRFs) to take an optimal selfie?," "You Sing, I Play! Meet SingSong," "AI Systems Can Optimize Their Own Code," "Meet pix2pix-zero," "The Problem of Gender Presentation Differences in a Fine-Grained Pattern," and many other cool updates. So, let's start...

What happens if you run a Transformer model with an optical neural network?: Optical neural networks have the potential to be far more energy-efficient than electronic chips, but how well suited are they to running Transformer models? To find out, a research team from Cornell University performed small-scale optical experiments with a prototype accelerator, demonstrating that Transformer operations can run on optical hardware despite noise and errors. Their paper also investigates the energy-efficiency advantage that executing state-of-the-art Transformer models on optical hardware could deliver, concluding that for large enough models an energy-efficiency advantage of >8,000x over current electronic hardware (GPUs) should be possible.
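For intuition, here is a minimal, hypothetical sketch (not the authors' experimental setup) of what running a single Transformer projection on analog optical hardware amounts to: the matrix multiply picks up signal-dependent noise, yet the layer's output remains a close approximation of the exact result.

```python
import torch

def noisy_optical_matmul(x, w, rel_noise=0.02):
    """Simulate an analog (optical) matrix multiply: the exact product is
    perturbed by noise roughly proportional to the signal magnitude."""
    y = x @ w
    return y + rel_noise * y.abs().mean() * torch.randn_like(y)

# Toy "attention projection": compare exact vs. simulated analog execution.
torch.manual_seed(0)
x = torch.randn(4, 16, 64)            # (batch, tokens, d_model)
w_q = torch.randn(64, 64) / 64**0.5   # query projection weights

q_exact = x @ w_q
q_noisy = noisy_optical_matmul(x, w_q)

rel_err = (q_noisy - q_exact).norm() / q_exact.norm()
print(f"relative error from simulated analog noise: {rel_err:.3f}")
```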

Amazon's Multimodal-CoT outperforms GPT-3.5 by 16%: Amazon's new framework, Multimodal-CoT, with under 1 billion parameters, has surpassed the previous state-of-the-art LLM (GPT-3.5) by roughly 16 percentage points on the ScienceQA benchmark, reaching 91.68% accuracy versus GPT-3.5's 75.17%. The framework divides the reasoning process into two phases: rationale generation and answer inference. By incorporating vision features in both stages, the model produces better rationales, which in turn lead to more precise answer inference. This work is the first to study CoT reasoning across different modalities.
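As a rough illustration of the two-stage design, here is a minimal sketch; the StubModel class, its generate interface, and the function names are hypothetical stand-ins, not Amazon's actual code.

```python
# Stage 1 generates a free-text rationale from the question plus vision
# features; stage 2 infers the answer conditioned on that rationale.

class StubModel:
    """Stand-in for a multimodal seq2seq model (hypothetical interface)."""
    def generate(self, text, vision):
        return "stub output"

def generate_rationale(model, question, image_features):
    # Stage 1: fuse text and vision features, decode a rationale.
    prompt = f"Question: {question}\nRationale:"
    return model.generate(text=prompt, vision=image_features)

def infer_answer(model, question, rationale, image_features, choices):
    # Stage 2: condition on the generated rationale to pick an answer.
    prompt = (f"Question: {question}\nRationale: {rationale}\n"
              f"Choices: {choices}\nAnswer:")
    return model.generate(text=prompt, vision=image_features)

def multimodal_cot(model, question, image_features, choices):
    rationale = generate_rationale(model, question, image_features)
    return infer_answer(model, question, rationale, image_features, choices)

print(multimodal_cot(StubModel(), "Which layer reflects sunlight?", None,
                     ["(a) crust", "(b) atmosphere"]))
```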

Stanford Human Preferences Dataset (SHP): Models like ChatGPT are trained on large amounts of human feedback, but collecting it is expensive. To address this, researchers have released the Stanford Human Preferences Dataset (SHP), a collection of 385K naturally occurring collective human preferences over text. Given some context and two possible responses, SHP preferences reflect the helpfulness of one response over another. The preferences cover responses to questions/instructions in 18 domains, from cooking to legal advice, drawn from Reddit. They were inferred from a simple observation: if comment A was written after B but received a higher score despite getting less visibility, then ostensibly A > B. If A was written before B, we can't conclude this, since the higher score could simply reflect greater visibility. Alongside SHP, the research team is also releasing a pair of preference models called SteamSHP, finetuned to predict which response will be more helpful. They can be used out of the box for NLG evaluation or as a reward model for RLHF.
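The labeling rule is easy to express in code. The sketch below uses illustrative field names rather than the dataset's actual schema.

```python
from datetime import datetime

# Comment A is preferred over comment B only if A was posted *after* B yet
# received a higher score, so its advantage cannot be explained by A simply
# having been visible for longer.

def prefer_a_over_b(comment_a, comment_b):
    posted_later = comment_a["created"] > comment_b["created"]
    higher_score = comment_a["score"] > comment_b["score"]
    return posted_later and higher_score

a = {"created": datetime(2022, 5, 2, 14, 0), "score": 120, "text": "..."}
b = {"created": datetime(2022, 5, 2, 9, 30), "score": 45,  "text": "..."}

print(prefer_a_over_b(a, b))  # True: A came later but still scored higher
```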

Can we use Stable Diffusion for data augmentation?: A recent research paper introduces DA-Fusion, a data augmentation strategy that uses pretrained diffusion models to semantically modify image attributes. It complements existing data augmentation and improves few-shot learning. The research team fine-tunes Stable Diffusion for data augmentation with textual inversion: they insert a new token into the text encoder for each class to be augmented and train its embedding using a handful of labeled training images.
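For a sense of what the textual-inversion step looks like in practice, here is a minimal sketch using Hugging Face transformers; the model ID, placeholder token, and initializer word are assumptions, not the paper's exact configuration.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# A new pseudo-token is added per class, and only its embedding is trained;
# the rest of Stable Diffusion stays frozen.
model_id = "runwayml/stable-diffusion-v1-5"   # assumed base model
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

placeholder = "<my-class>"                    # one new token per class
tokenizer.add_tokens(placeholder)
text_encoder.resize_token_embeddings(len(tokenizer))

# Initialize the new embedding from a related word (here "dog") so training
# starts from a sensible point.
embeds = text_encoder.get_input_embeddings().weight
new_id = tokenizer.convert_tokens_to_ids(placeholder)
init_id = tokenizer.encode("dog", add_special_tokens=False)[0]
with torch.no_grad():
    embeds[new_id] = embeds[init_id].clone()

# During fine-tuning, prompts like f"a photo of a {placeholder}" are paired
# with the handful of labeled images, and gradients flow only to embeds[new_id].
```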

T2I-Adapter: Researchers propose T2I-Adapter, a simple and small network (~70M parameters, ~300MB of storage) that provides extra guidance to pre-trained text-to-image models while the original large text-to-image model stays frozen. T2I-Adapter aligns the internal knowledge of T2I models with external control signals. Different adapters can be trained for different conditions, enabling rich control and editing effects.
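Conceptually, the adapter is just a small network whose multi-scale outputs are added to the frozen T2I model's intermediate features. The sketch below is illustrative; the layer sizes are not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class TinyAdapter(nn.Module):
    """Maps a control signal (e.g. a sketch) to multi-scale guidance features."""
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        blocks, in_ch = [], 3
        for out_ch in channels:
            blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                nn.SiLU(),
            ))
            in_ch = out_ch
        self.blocks = nn.ModuleList(blocks)

    def forward(self, condition):
        feats, x = [], condition
        for block in self.blocks:
            x = block(x)
            feats.append(x)          # one guidance map per resolution level
        return feats

sketch = torch.randn(1, 3, 256, 256)      # control image (e.g. an edge map)
guidance = TinyAdapter()(sketch)
print([f.shape for f in guidance])
# These features would be added to the corresponding frozen UNet encoder
# features during diffusion sampling; only the adapter itself is trained.
```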

OpenAI is now helping Coca-Cola improve its marketing and operations: OpenAI's recent collaboration with Coca-Cola signals a strategic shift toward capturing value from verticals, a departure from its previous role as a horizontal provider of services like ChatGPT, DALL-E, and Codex. Thin-wrapper startups should be concerned: OpenAI is well equipped to compete for their business using privileged internal models and far more effective campaigns. Notably, the services OpenAI is providing Coca-Cola closely resemble those offered by Jasper, CopyAI, and other startups that have large market shares but limited proprietary technology.

Thinking about an LLM caching service?: Although LangChain already ships with its own caching, there has been no equivalent for other setups. The Helicone team has now added caching support to Helicone: you can cache your OpenAI requests through their proxy so that duplicate requests don't drive up your bill. This also makes testing and development easier, faster, and cheaper.
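Here is a minimal sketch of what proxy-based caching looks like with the pre-1.0 openai Python client; the proxy URL and Helicone header names are assumptions, so check Helicone's documentation for the exact values.

```python
import openai

openai.api_key = "sk-..."                        # your OpenAI key
openai.api_base = "https://oai.hconeai.com/v1"   # assumed Helicone proxy URL

# Requests go through the proxy instead of api.openai.com; identical requests
# can then be served from the cache instead of being billed again.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize the T2I-Adapter paper."}],
    headers={
        "Helicone-Auth": "Bearer <HELICONE_API_KEY>",   # assumed auth header
        "Helicone-Cache-Enabled": "true",               # assumed cache flag
    },
)
print(response.choices[0].message["content"])
```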

BigVGAN: A universal audio synthesis model that, despite being trained only on clean speech (LibriTTS), works in out-of-distribution scenarios: it achieves state-of-the-art performance under various zero-shot conditions, including unseen speakers, languages, recording environments, singing voices, music, and instrumental audio.

Did you know Marktechpost has a community of 1.5 Million+ AI professionals and engineers?