A new paper from Microsoft proposes using small models to compress prompts before passing them to larger models like GPT-4. The researchers achieved up to a 20x reduction in prompt tokens with some performance loss, or a 4x reduction with a performance increase. Performance here means producing the desired output[1].
Usage is straightforward:
```python
import openai
from llmlingua import PromptCompressor

# prompt_complex is the original, uncompressed few-shot prompt and
# question is the math question to answer; both are assumed to be defined above.
llm_lingua = PromptCompressor()
compressed_prompt = llm_lingua.compress_prompt(
    prompt_complex.split("\n\n"),
    instruction="",
    question="",
    target_token=200,
    context_budget="*1.5",
    iterative_size=100,
)

# Rebuild the final prompt from the compressed context.
instruction = "Please reference the following examples to answer the math question,\n"
prompt = instruction + compressed_prompt["compressed_prompt"] + "\n\nQuestion: " + question

request_data = {
    "prompt": prompt,
    "max_tokens": 400,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "stream": False,
    "stop": "\r\n",
}
response = openai.Completion.create(
    model="gpt-3.5-turbo-0301",
    **request_data,
)
```

There are 4 big challenges to deploying LLMs in production: performance, cost, latency, and security. This project hits 3 of the 4, and it is possible the approach could even help mitigate prompt injection if a small model were trained to recognize and strip injection attempts.
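A rough sketch of that last idea, which is not something LLMLingua provides: run each context chunk through a small injection classifier and drop anything flagged before compressing. The classifier checkpoint name below is a placeholder, and the `"INJECTION"` label is an assumption about how such a model would be trained.

```python
# Sketch only: filter suspected prompt-injection chunks with a small
# classifier, then compress what remains with LLMLingua.
from llmlingua import PromptCompressor
from transformers import pipeline

# Hypothetical checkpoint -- substitute any small classifier fine-tuned
# to label text as "INJECTION" vs. "SAFE".
injection_detector = pipeline(
    "text-classification",
    model="your-org/prompt-injection-detector",  # placeholder name
)

llm_lingua = PromptCompressor()

def compress_safely(context_chunks, question, target_token=200):
    """Drop chunks the classifier flags as injection attempts, then compress."""
    safe_chunks = [
        chunk
        for chunk in context_chunks
        # crude truncation so very long chunks fit the classifier
        if injection_detector(chunk[:512])[0]["label"] != "INJECTION"
    ]
    return llm_lingua.compress_prompt(
        safe_chunks,
        question=question,
        target_token=target_token,
    )
```

Whether a small model can reliably catch injection attempts is an open question; the sketch only shows where such a filter would sit relative to the compressor.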
...