We have firewalls for your bits and firewalls for your apps; now you can get firewalls for your AI. Despite the humorous title, LLM abuse is a major concern for businesses deploying RAG apps and LLM chatbots. Hosting LLM apps carries the risk of brand damage, or even direct financial damage, as Air Canada found out recently[1]. Cloudflare has released a Firewall for AI, which is essentially an extension of its existing WAF offering: it currently supports rate limiting and sensitive data detection, with more features such as prompt validation on the way. ...
The Impact of Input Length on the Reasoning Performance of Large Language Models
Mind your prompt lengths. A new paper explores the relationship between input length and the performance of large language models. The study found that performance can begin to degrade significantly with as few as 3,000 tokens. The tokens in question are extraneous to the input needed to answer the question, which highlights the importance of managing your context. This has broad applicability in RAG applications, where different information retrieval (IR) technologies are used to return relevant content that the LLM uses to answer your question. Choosing IR methods that maximize signal to noise can be critical for the performance of your LLM, to say nothing of the cost reduction from using fewer tokens if you are using an API. ...
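As a rough illustration of managing that budget (my sketch, not anything from the paper), the snippet below keeps only the highest-scoring retrieved chunks until a fixed token budget is reached; the 3,000-token cap and the relevance scores are assumptions for the example.

```python
# A minimal sketch of capping retrieved RAG context at a token budget so that
# extraneous text does not push the prompt into the degradation range the paper
# reports. The budget value and the relevance scores are illustrative assumptions.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def build_context(scored_chunks: list[tuple[float, str]], token_budget: int = 3000) -> str:
    """Keep the most relevant chunks, stopping before the budget is exceeded."""
    kept, used = [], 0
    for score, text in sorted(scored_chunks, key=lambda c: c[0], reverse=True):
        cost = len(enc.encode(text))
        if used + cost > token_budget:
            break
        kept.append(text)
        used += cost
    return "\n\n".join(kept)
```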
Guiding Instruction-based Image Editing via Multimodal Large Language Models
A new paper has been published, this time by Apple, discussing the use of a Multimodal Large Language Model (MLLM) to enhance instruction-based image editing. If I am reading this correctly, instruction-based image editing can sometimes struggle when humans give ambiguous or brief instructions; this approach uses an MLLM to “translate”, or enhance, the given instructions into instructions that will achieve the desired result from the instruction-based editing models. ...
RAPTOR Recursive Abstractive Processing for Tree-Organized Retrieval
A new information retrieval paper was published recently. RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) represents a leap forward in the domain of retrieval-augmented language models. Developed by a team from Stanford University, RAPTOR addresses a critical limitation of existing models: they struggle to incorporate comprehensive document context during retrieval, which hinders their ability to adapt to new information and access detailed knowledge. RAPTOR introduces a novel method that recursively embeds, clusters, and summarizes text chunks, constructing a hierarchical tree that captures information at various levels of abstraction. This tree structure, rich in layered summaries, allows the model to efficiently retrieve information that spans a document, ensuring that even complex, multi-step reasoning tasks benefit from a holistic understanding of the content. The paper summarizes it thus: ...
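For intuition only, here is a highly simplified sketch of that recursive embed, cluster, and summarize loop; it is not the authors' code, and embed, cluster, and summarize are placeholders for an embedding model, a clustering step, and an LLM summarizer that you supply.

```python
# A simplified sketch of RAPTOR-style tree construction: each pass embeds the
# current level, clusters similar chunks, and summarizes each cluster to form
# the next, more abstract level. The three callables are placeholder stand-ins.
def build_raptor_tree(chunks, embed, cluster, summarize, max_levels=3):
    tree = [chunks]              # level 0: the raw leaf chunks
    level = chunks
    for _ in range(max_levels):
        if len(level) <= 1:      # nothing left to merge
            break
        vectors = [embed(text) for text in level]
        groups = cluster(vectors, level)                 # lists of related chunks
        level = [summarize(group) for group in groups]   # one summary per cluster
        tree.append(level)
    return tree  # retrieval can then search leaves and summaries at every level
```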
Automatic Agent Learning from Scratch via Self-Planning
A new paper, “AUTOACT: Automatic Agent Learning from Scratch via Self-Planning”[1], by Shuofei Qiao et al., introduces the AUTOACT automatic agent learning framework. The framework stands out by eschewing the traditional reliance on extensive annotated data and on synthetic trajectories from closed-source models such as GPT-4. AUTOACT’s strength lies in its ability to synthesize its own planning trajectories and to implement a division-of-labor strategy. This facilitates the creation of sub-agent groups that work in tandem, showing promise in complex question-answering tasks and potentially surpassing the capabilities of established models like GPT-3.5-Turbo. ...
Glaze and Nightshade
As generative AI models continue to grow and progress, the need for new content to train them on increases. This insatiable appetite for data clashes with the creative spirit of individual artists, though it’s not only individual artists being affected, as we see in the New York Times v. OpenAI case. The somewhat digestive nature of model creation renders traditional techniques for protecting images, such as watermarks, ineffective. Glaze and Nightshade, two new tools built at the University of Chicago, aim to restore some of the balance that has been lost between content creators and model creators. ...
A Structured Approach to Developing RAG Applications
Integrating Retrieval-Augmented Generation (RAG) into your business can significantly enhance how you interact with data and respond to queries. Here’s a guide to help you integrate this technology effectively:

Identify Assets: Start by identifying all your data sources, including databases, internal documents, and web content. Knowing the breadth of your data is crucial for a targeted RAG deployment.

Identify Questions: Clearly define the types of queries your RAG system will address. Distinct categorization helps customize your RAG application for various needs, from customer inquiries to complex analytical tasks. ...
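As an entirely hypothetical illustration of those first two steps, the sketch below pairs question categories with the data assets inventoried to answer them; the category names and sources are assumptions, not part of the guide.

```python
# A hypothetical sketch pairing "Identify Assets" with "Identify Questions":
# each question category is routed to the data sources inventoried for it.
# Category names and source lists are illustrative assumptions.
ASSETS_BY_QUESTION_TYPE = {
    "customer_inquiry": ["support_tickets_db", "product_docs"],
    "analytics": ["sales_warehouse", "quarterly_reports"],
    "policy": ["hr_handbook", "compliance_wiki"],
}

def sources_for(question_type: str) -> list[str]:
    """Return the data assets a RAG pipeline should search for this question type."""
    return ASSETS_BY_QUESTION_TYPE.get(question_type, [])

print(sources_for("customer_inquiry"))  # ['support_tickets_db', 'product_docs']
```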
Data Sovereignty and Architectural Choices in RAG Applications
Selecting an appropriate architecture for RAG applications involves balancing data sovereignty with operational efficiency. This section outlines various architectural choices, each with its unique implications for data control and processing capabilities:

Cloud-Based Data and LLM (e.g., ChatGPT Assistant API):
Pros: Benefits from scalability, easy integration, and access to advanced AI models.
Cons: Introduces concerns about data privacy in the cloud and dependence on external services.

Local Data with Cloud LLM (Vector Database and OpenAI API): ...
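As a rough sketch of that second option, local data with a cloud LLM, the snippet below keeps documents in an in-process Chroma collection and sends only the retrieved snippets, plus the question, to the hosted model. It assumes the chromadb and openai packages; the collection name, documents, model, and prompt wording are illustrative, not taken from the post.

```python
# A minimal sketch of the "local data with cloud LLM" architecture: the vector
# index lives on your own infrastructure, and only retrieved snippets leave it.
import chromadb
from openai import OpenAI

chroma = chromadb.Client()                        # local, in-process vector store
docs = chroma.create_collection("internal_docs")
docs.add(
    ids=["policy-1", "policy-2"],
    documents=["Refunds are processed within 14 days.",
               "Support hours are 9am to 5pm Eastern."],
)

def answer(question: str) -> str:
    hits = docs.query(query_texts=[question], n_results=2)
    context = "\n".join(hits["documents"][0])     # only this text is sent to the cloud
    client = OpenAI()
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return reply.choices[0].message.content
```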
Self-taught models are gaining steam
A new research paper, “Self-Rewarding Language Models”, explores a novel approach to LLM training. Unlike traditional models, these models generate and evaluate their own training data, enabling continuous self-improvement beyond initial training limits[1]. This is another step along the path to potentially realizing AGI. Data quality has been, and remains, one of the key challenges for LLM technology. This method reminds me of the approach Microsoft used for Phi-2’s static training: in that case, GPT-3.5 was used to generate synthetic textbook data. However, in this case, the model under training is doing the generation[2]. ...
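To give a feel for the loop (a sketch under my own assumptions, not the authors' code), one round might look like the following, where generate, judge_score, and dpo_train are placeholders for sampling, LLM-as-a-judge scoring, and preference training.

```python
# A highly simplified sketch of one self-rewarding round: the model answers its
# own prompts, scores the candidates itself, and the best/worst pairs become
# preference data for the next round. All three callables are placeholders.
def self_rewarding_round(model, prompts, generate, judge_score, dpo_train,
                         n_candidates=4):
    pairs = []
    for prompt in prompts:
        candidates = [generate(model, prompt) for _ in range(n_candidates)]
        ranked = sorted(candidates, key=lambda c: judge_score(model, prompt, c))
        pairs.append({"prompt": prompt,
                      "chosen": ranked[-1],      # highest self-assigned reward
                      "rejected": ranked[0]})    # lowest self-assigned reward
    return dpo_train(model, pairs)               # the improved model seeds the next round
```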
Reduce the latency and cost of LLM inference with prompt compression
A new paper from Microsoft proposes using small models to compress prompts before passing them to larger models like GPT-4. The researchers were able to achieve either up to a 20x reduction in prompt tokens with some performance loss, or a 4x reduction with a performance increase. Performance in this case means producing the desired output[1]. Usage is straightforward (the snippet assumes prompt_complex, the original few-shot prompt, and question are already defined):

```python
import openai
from llmlingua import PromptCompressor

# prompt_complex (the original few-shot prompt) and question are assumed to be
# defined earlier; the compressor squeezes the examples down to ~200 tokens.
llm_lingua = PromptCompressor()
compressed_prompt = llm_lingua.compress_prompt(
    prompt_complex.split("\n\n"),
    instruction="",
    question="",
    target_token=200,
    context_budget="*1.5",
    iterative_size=100,
)

instruction = "Please reference the following examples to answer the math question,\n"
prompt = instruction + compressed_prompt["compressed_prompt"] + "\n\nQuestion: " + question

# The compressed prompt is then sent to the larger model as usual.
request_data = {
    "prompt": prompt,
    "max_tokens": 400,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
    "stop": "\r\n",
}
response = openai.Completion.create(
    model="gpt-3.5-turbo-0301",
    **request_data,
)
```

There are four big challenges to deploying LLMs in production: performance, cost, latency, and security. This project hits three of the four, though it is possible that this approach could even help mitigate prompt injection if a small model were trained to recognize and strip prompt injection attempts. ...