One weird trick to drastically reduce your token usage / by Chris Shaffer

I use a set of techniques to trim my token usage for reusable LLM workflows.

For one workflow I benchmarked, ~40% of the output tokens. (that might represent 20-25% of the costs, depending on vendor, and there are other benefits besides)

This is an attempt to distill them down to more “science” and less “art”.

High-level description of the workflow in question

  1. Use an LLM to figure out what the user is asking

  2. Based on that, build a query using one of [set of query-building tools]

  3. Execute the query

  4. Format the results & return to the user

Low-hanging fruit

These are pretty basic token wastage, but I’ve seen them in the wild:

  • A generate-query tool and an execute-query tool: one tool writes a query, which is then read by the LLM simply to be regurgitated into the next tool

  • Using the LLM directly to format the results rather than traditional code - there’s no reason for an LLM to do this, when the LLM could write simple Python to do it more quickly, cheaply, and reliably

The more common scenario

Typically, the waste bites on the last step. The agent returns the result.

These days, I have every tool curry itself and stash its output rather than return it directly.

def curry_build_query_for_action_tool(
    useful_lists=[],
    tool_results={}
):
    @tool
    def build_query_for_action_tool(action: str) -> str:
        """Build a Mongo query from scratch for a given action."""
        tool_results['build_query_for_action'] = build_query_for_action(
            action,
            useful_lists,
        )
    return build_query_for_action_tool

The LLM now just says "I executed tool: xyz" but it doesn't know exactly what the tool returned.

The code wrapping the agent workflow can then format it, return it to the user, or pass it to another agent as required. If the agent for any reason needs the tool’s output, it can ask for it with another tool call.

But IRL, a pretty good chunk of agentic workflows, (a majority, in my anecdotal experience) have at least one step with a bulky output that is simply regurgitated to the user, without the agent ever looking at it again - often without the agent ever being used again.

Side Benefits

This isn’t purely about cost reduction. Any time you’re reducing token usage, you’re also:

  • Making your workflow more deterministic, and therefore more reliable day-to-day, less prone to breakage, and more portable between models

  • Keeping context de-cluttered, improving future behavior of the agent in question

  • Improving speed and responsiveness

  • Allows you to sharpen your prompts (e.g., for a tool that returns code, you don't have to beg the LLM to pretty please not wrap its answer in backticks, or explain it, or introduce it, or put JavaScript comments into JSON with a cherry on top)