"Can't you just use Claude for that?"

4 min read
Apr 20, 2026
Dylan Keating, PhD

We get asked this more than you'd think. Here's our honest answer.

It's a fair question. AI tools have gotten genuinely good at a lot of things, and transaction categorization sounds like exactly the kind of structured, pattern-matching task a large language model should handle well. Paste in some transactions, ask for categories, get clean output. Simple enough.

We've had enough customers ask us about this that we decided to actually test it — not to win an argument, but because if a general-purpose AI could do what we do, that would be worth knowing.

Here's what we found.

For low volumes, it kind of works

If you're building something small — a personal finance tool, an internal prototype, a demo — pasting transactions into ChatGPT or Claude and getting reasonable categories back is genuinely viable. General models are good at reading merchant names, inferring context, and producing structured output. For 20 or 50 transactions, you'll get something usable.

We want to be upfront about that, because the rest of this only makes sense in context: we're not built for that use case. We're built for what comes next.

At scale, the math stops working

Our median customer processes around 10,000 transactions per day. There's no practical way to run that volume through a chat interface — that's not what chat interfaces are designed for. 

Customers who say "I tried it and it works" have tried it on a handful of transactions. That's not a workflow.

If you wanted to do it properly via API with a real pipeline, you're looking at roughly $115/month in token costs alone for a median-volume customer, and over $1,700/month for high-volume users. That's before the engineering time to build async job queues, retry logic, and failure handling — because at that volume, a 2–3 minute response time per batch isn't a minor inconvenience, it's an architectural problem.
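The back-of-envelope math is easy to reproduce. In the sketch below, only the 10,000-transactions-per-day figure comes from this post; the per-transaction token count and per-token price are illustrative assumptions, not actual vendor pricing.

```python
# Back-of-envelope monthly token cost for LLM-based categorization.
# TOKENS_PER_TX and PRICE_PER_M_TOKENS are assumed values for
# illustration; real costs depend on the model and prompt design.

TX_PER_DAY = 10_000          # median customer volume (from the post)
TOKENS_PER_TX = 250          # assumption: prompt + completion tokens per transaction
PRICE_PER_M_TOKENS = 1.50    # assumption: blended $ per 1M tokens

def monthly_token_cost(tx_per_day: int,
                       tokens_per_tx: int = TOKENS_PER_TX,
                       price_per_m_tokens: float = PRICE_PER_M_TOKENS,
                       days: int = 30) -> float:
    """Estimated monthly API spend in dollars."""
    total_tokens = tx_per_day * tokens_per_tx * days
    return total_tokens / 1_000_000 * price_per_m_tokens

print(f"${monthly_token_cost(TX_PER_DAY):.2f}/month")  # lands in the ~$115/month ballpark
```

Under these same assumptions, a customer running roughly 15x median volume lands in the same order of magnitude as the $1,700/month figure, and that's tokens only, before any engineering cost.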

Our system returns results in 5 seconds. That's not a marginal improvement. It's a different category of tool.

The harder problem: what general models don't know

Speed and cost are fixable, in theory. The domain knowledge gap isn't — at least not without rebuilding what we've spent years accumulating.

A few examples that came out of our testing:

We asked a leading AI model: "What kind of transaction is BR.0072?" It answered confidently: "Most likely a Branch Transaction / Internal Transfer." That's wrong. BR.0072 is an NSF fee — a specific internal code used by Canadian banks. Our system catches it correctly every time, because we've mapped the actual codes Canadian financial institutions use in the wild.

On gambling detection: the well-known platforms aren't the problem. BetMGM, FanDuel, OLG — a general model handles those fine. The hard cases are the payment processors and voucher networks that gambling platforms route transactions through. GIGADAT. FLEXEPIN. ILIXIUM. BAYTREE. To a general model, these look like generic fintech companies. We know they're gambling-adjacent. That distinction matters enormously for affordability assessments and risk scoring, and it can't be fixed by prompting harder — it requires maintained, curated knowledge about how Canadian financial transactions actually flow.

The same applies to micro-lenders. We track dozens of Canadian payday and short-term lenders by name — iCash, Money Mart, GoDay, Lend Direct, Prêts Alpha, and more — and actively update that list as new entrants appear. A general model trained on broad internet data isn't going to have current coverage of the Canadian lending landscape.
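At its core, the curated knowledge described above is maintained lookup data. Here is a minimal sketch using only the names mentioned in this post; the category labels, data structures, and matching logic are hypothetical, not Flinks's actual schema.

```python
# Hypothetical curated-knowledge lookup. The entries come from the
# examples in this post; the labels and matching logic are
# illustrative, not the actual Flinks implementation.

BANK_CODES = {
    "BR.0072": "FEE_NSF",  # NSF fee code used by Canadian banks
}

GAMBLING_ADJACENT = {"GIGADAT", "FLEXEPIN", "ILIXIUM", "BAYTREE"}
MICRO_LENDERS = {"ICASH", "MONEY MART", "GODAY", "LEND DIRECT", "PRÊTS ALPHA"}

def categorize(description: str) -> str:
    desc = description.strip().upper()
    if desc in BANK_CODES:
        return BANK_CODES[desc]
    if any(name in desc for name in GAMBLING_ADJACENT):
        return "GAMBLING"
    if any(name in desc for name in MICRO_LENDERS):
        return "LOAN_MICRO"
    return "OTHER"

print(categorize("BR.0072"))              # FEE_NSF
print(categorize("E-TRANSFER GIGADAT"))   # GAMBLING
```

The hard part isn't the lookup; it's the maintenance. The value lives in keeping these tables current as new processors, voucher networks, and lenders appear.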

Consistency matters more than it sounds

LLMs are non-deterministic by design. The same transaction description can produce different categories on different runs — especially through chat interfaces where temperature settings aren't controlled. For most use cases, that's fine. For financial data, it isn't.

A transaction shouldn't be classified as LOAN_MICRO on Tuesday and OTHER on Wednesday. That inconsistency isn't just a data quality problem — it's a risk and compliance problem. Any system using categorized transaction data for underwriting, affordability assessments, or fraud detection needs to be able to explain and reproduce its outputs. Non-deterministic categorization makes that nearly impossible.

Our model is deterministic. Same input, same output, every time. The only thing that changes a prediction is a deliberate model update or a customer correction — both of which are tracked, verified, and intentional.

The system gets better. A prompt doesn't.

When a customer submits a correction through their dashboard, it's applied in real time and feeds back into model retraining. The longer a customer uses Flinks, the more accurate their categorization gets — for their specific transaction patterns, and across the platform.
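One way to picture this: a deterministic base model plus a tracked override layer, where a customer correction wins immediately and is logged for retraining. This is a conceptual sketch under those assumptions, not the actual Flinks pipeline.

```python
# Conceptual sketch of deterministic categorization with tracked
# customer corrections. Not the actual Flinks implementation.
from datetime import datetime, timezone

class Categorizer:
    def __init__(self, base_model):
        self.base_model = base_model  # deterministic: same input -> same output
        self.overrides = {}           # description -> corrected category
        self.audit_log = []           # every change is tracked

    def categorize(self, description: str) -> str:
        # Corrections take precedence over the base model, immediately.
        return self.overrides.get(description, self.base_model(description))

    def correct(self, description: str, category: str, customer: str) -> None:
        self.overrides[description] = category
        # Tracked and intentional: who changed what, and when.
        self.audit_log.append(
            (datetime.now(timezone.utc), customer, description, category)
        )
        # In a real system the correction would also feed model retraining here.

cat = Categorizer(base_model=lambda d: "OTHER")
cat.correct("GIGADAT INTERAC", "GAMBLING", customer="acme-lending")
print(cat.categorize("GIGADAT INTERAC"))  # GAMBLING, on every run
```

Because both layers are deterministic, the same input yields the same output on every run, and the audit log explains exactly why any prediction changed.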

A general AI prompt doesn't self-improve. Someone has to notice the errors, diagnose them, update the logic, and redeploy. At scale, that's a part-time job.

The short version

General AI is good at reading 20 transactions. We're built for 300,000 transactions a month — with Canadian-specific domain knowledge, deterministic output, 5-second response times, and a model that improves every time you use it.

If you're evaluating transaction categorization infrastructure and want to talk through your specific volume and use case, we're easy to reach.

