Starting to evaluate agent usefulness

Nathan Butters

I recently talked with a friend and collaborator about agents and the usefulness of products such as Anthropic’s Claude (including Cowork and Claude Code). We did not agree on much during the conversation: I sounded rigid and curmudgeonly, and he took an optimistic view I did not expect. I respect this person a great deal, so as he continued to argue for the benefits and usefulness of these systems in his life, I felt a desire to better understand where he was coming from. Towards the end of the conversation he gave me a blanket challenge:

Use agents for tasks you’d never consider using them for. See how they perform. It may not change your mind and, if it doesn’t, at least then you’d be able to say you tested your biases.

Implied in his challenge was something I find very hard to do: set aside my biases and give something a proper chance when I’ve already made up my mind. Still, I took up the challenge, and this article is the first of a handful of pages I’ll be writing on my observations from using third-party agents (mostly Claude), Ollama, and LangGraph to bring agents into my day-to-day. I don’t know whether the findings will be consistent throughout, so I hope you understand that I expect my thinking to change. I just don’t know whether I will become more optimistic or want to pull the ripcord and eject out of tech altogether.

Overview of what I plan to do

I recognize I will have to use AI when it’s incorporated into products I use, or when the people and companies I interact with leverage it. Does that mean I have to use it for myself? What changes if I choose to use it? What changes if I don’t? These questions rattle around my head all the time. They take up more space than I would like. Every time I see a new “report” or “finding” about how AI is helping people, or how it’s hurting them, I find myself questioning the motivations of the authors. I try to think through who benefits from the perspective shared, who is funding the research, and how it matches larger narratives around the technology. There are many people thinking deeply about these questions, who have more time and resources to devote to them than I do, so I will not focus on addressing them directly. I will focus, instead, on what I see as a risk management professional who has to exist in a world permeated with AI.

I feel a lot of friction about using agents because I don’t know whether they provide enough value to warrant integrating them into my life and workflows. I am working on a project to explore this more directly, using this great paper on the uses of LLMs for assurance arguments. For our current purposes, the critical point is that its authors set out 14 questions someone should have a plan to answer BEFORE integrating LLMs into their work. I agree with their findings, so I tend to spend a lot of time thinking through my goals and expectations before touching the technology. My friend told me I needed to let go and focus more on discovering the usefulness of agents through their usage. He has a point. Trying a different methodology might help me put my biases to the side.

Tasks I’ve identified for exploration

| Potential Use Case | What is the tech | What I hope to do | Effort or skill required | Value to me | Outcome | My current judgment |
| --- | --- | --- | --- | --- | --- | --- |
| Design a custom logo/wordmark | Claude Cowork | Take a simple sketch I have of a wordmark and generate a logo and wordmark from it | Low | Medium | It eventually created a usable placeholder logo and wordmark | Design does not appear to be a strength of these systems, especially when making small modifications based on abstract concepts like balance |
| Resume rewrites for specific jobs | Claude Cowork | Assess my resume’s fit for a role and then use it, and my LinkedIn profile, to generate a role-specific resume | Low | High | It has generated a resume for each role I’ve used it for | The text generated feels largely made up and not reflective of my experience; no positive responses yet for modified resumes |
| Draft a business plan | Claude Cowork | Take a set of goals and a website and turn it into a business plan | Low | High | It provided a structured document with some rough estimates for revenue and expenses | The structure seemed fine; the numbers did not match across the document |
| Create a Mod for a video game | Claude Cowork | Create a mod for Banner Lord 2 to add a new story and more depth | Medium | High | Unknown | Not Started |
| Oxalate Estimator | VLM + Agent | Estimating the amount of oxalates in food based on images, recipes, or labels | Medium | High | Initial POC showed promise but continually broke when given a recipe | Just Started |
| Irrigation System | Agent | Using an agent to gather upcoming weather information from the internet and determine if an automated watering system should run | Medium | Medium | Unknown | Not Started |
| Triage bugs in live code | Claude Cowork | Assisting me by identifying issues with dependencies and configuration | Low | High | It helped me identify and resolve a bug that came about after an update to a dependency | The system appears to handle this task well |
| Demo Creation | Claude Cowork and Claude Code | | Medium | High | The demo came out great | This tech is useful for mocking a front end, with custom functionality, quickly |
| Creating an Agent | Ollama + Copilot | | Medium | Low | I finished the agent in the time allotted | It could be very useful, but it’s also dangerous to use too much |

Of the use cases above, three (Triage bugs in live code, Demo Creation, Creating an Agent) fall directly within what I believe agents to be useful for. I will still test them so I do not end up focusing only on peripheral use cases of the systems. The creators of this technology want people to believe it can support or replace human work across many domains. I consider this exploration an extension, at a more personal level, of the years of risk management work I’ve been doing in this space.

Update Log

| Potential Use Case | Last Activity Date | Current Status | Notes |
| --- | --- | --- | --- |
| Design a custom logo/wordmark | June 17, 2026 | | This took me the better part of a week’s worth of Claude Credits on Free Tier. |
| Resume rewrites for specific jobs | July 2, 2026 | | I am unsure whether the changes made are useful. |
| Draft a business plan | May 30, 2026 | | The framework was helpful, the financials… not so much. |
| Create a Mod for a video game | June 22, 2026 | 🆕 | |
| Oxalate Estimator | May 26, 2026 | 🔜 | |
| Irrigation System | June 1, 2026 | 🔜 | |
| Triage bugs in live code | June 3, 2026 | | Definitely a place worth exploring more. |
| Demo Creation | June 24, 2026 | | This helped me build quickly once I figured out how best to shape my own experience. |
| Creating an Agent | June 2, 2026 | | It’s tough to code in front of people in 30 minutes, especially when they expect you to prompt Claude and then just chat… |