Large language models (LLMs) can’t read the news, check your bank balance, or book a flight on their own. They don’t have live access to the world outside their training data. But that’s changing. With function calling, LLMs can now ask for help - and get it - without breaking the conversation. This isn’t magic. It’s a structured way for models to talk to real tools and APIs, turning them from static text predictors into active problem solvers.
How Function Calling Actually Works
Function calling lets an LLM decide when to use an external tool. Instead of guessing the current stock price or the weather in Tokyo, it generates a clean JSON request like this:
```json
{
  "name": "get_weather",
  "arguments": "{\"city\": \"Tokyo\", \"unit\": \"celsius\"}"
}
```
The model doesn’t run the code. It just tells your system: “Here’s what to do.” Your app then calls the weather API, gets the answer, and feeds it back to the model. The LLM then summarizes the result in plain language - like a helpful assistant who just looked something up for you.
This process has four clear steps:
- Analysis: The model reads your question and decides if it needs a tool.
- Selection: It picks the right function from a list you provided.
- Execution: Your system runs the function with the extracted parameters.
- Synthesis: The model uses the result to give a natural-language answer.
It’s not just about retrieving data. You can use it to calculate mortgage payments, pull customer records from a database, or even trigger a workflow in your CRM. The key is that the model stays in control of the conversation - it just gets smarter by reaching outside itself.
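Here's what that round trip can look like in code. The following is a minimal Python sketch using the OpenAI SDK; the model name, the get_weather stub, and its fake return value are placeholders, not a production implementation:

```python
import json
from openai import OpenAI  # official OpenAI Python SDK (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Describe the tool so the model knows it exists.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

def get_weather(city: str, unit: str = "celsius") -> dict:
    # Placeholder: call your real weather API here.
    return {"city": city, "temp": 21, "unit": unit}

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# Steps 1-2 (Analysis, Selection): the model decides whether to call a tool.
response = client.chat.completions.create(
    model="gpt-4-turbo", messages=messages, tools=tools
)
msg = response.choices[0].message

if msg.tool_calls:
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)

    # Step 3 (Execution): your code runs the function, not the model.
    result = get_weather(**args)

    # Step 4 (Synthesis): feed the result back for a natural-language answer.
    messages.append(msg)
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": json.dumps(result),
    })
    final = client.chat.completions.create(model="gpt-4-turbo", messages=messages)
    print(final.choices[0].message.content)
```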
Major Models and How They Compare
Not all LLMs handle function calling the same way. Here’s how the big players stack up as of late 2025:
| Model | Accuracy on Ambiguous Inputs | Multi-Tool Success Rate | Third-Party Tools Supported | Unique Feature |
|---|---|---|---|---|
| GPT-4 Turbo | 88.7% | 87.2% | 2,400+ | Best ecosystem integration |
| Claude 3.5 Sonnet | 94.3% | 85.6% | 850 | Best at understanding fuzzy requests |
| Gemini 1.5 Pro | 86.1% | 83.9% | 1,100 | Best at chaining multiple steps |
| Qwen3-8B | 89.5% | 81.4% | 320 | Shows reasoning before calling tools |
OpenAI’s GPT-4 Turbo leads in tool availability. If you want to plug into 2,400+ pre-built tools - from Stripe to Slack - it’s your best bet. But it’s also the strictest. If you miss a single required field in the JSON, the call fails. No warnings. No guesses.
Claude 3.5 Sonnet is the opposite. It’s more forgiving. If you say “What’s the weather like today?” without specifying a city, Claude will often infer it from context - like your last message or user profile. That’s why it scores higher on ambiguous inputs.
Gemini 1.5 Pro shines when you need to chain tools together. Imagine asking: “Find my last order from Amazon, check the delivery status, and email me the tracking number.” Gemini handles that flow better than anyone else. But it’s slower - 27% longer response times due to its multi-turn refinement.
Qwen3 is the quiet innovator. Before calling any tool, it writes out its reasoning: “I need to check the weather because the user asked about outdoor plans. I’ll use the get_weather function with city=Tokyo.” That transparency builds trust, especially in enterprise settings where users want to understand why the AI did what it did.
Why Function Calling Matters
Before function calling, LLMs were prone to hallucinations - especially with time-sensitive or personal data. Ask a model when your last invoice was paid, and it might make up a date. Ask about today’s stock price, and it’ll give you data from 2023.
With function calling, those errors drop by up to 68%, according to UC Berkeley’s Pieter Abbeel. Why? Because the model isn’t guessing anymore. It’s using real data.
Businesses are seeing real results:
- Customer service bots resolve issues 53% faster by pulling data directly from support systems.
- Financial assistants give accurate loan estimates by connecting to internal calculators.
- HR chatbots answer policy questions by accessing internal databases instead of relying on outdated training.
Gartner reports that 78% of enterprises now use function calling in their LLM apps - up from just 28% in mid-2024. The global market for this tech hit $2.4 billion in 2025 and is expected to grow to nearly $10 billion by 2028.
What You Need to Build It
Getting started isn’t hard, but it’s not plug-and-play either. Here’s what you actually need:
- A function schema: Define each tool with a name, description, and parameters in JSON Schema format. Example:
```json
{
  "name": "get_user_email",
  "description": "Fetch the email address of a user by their ID",
  "parameters": {
    "type": "object",
    "properties": {
      "user_id": {
        "type": "string",
        "description": "The unique ID of the user"
      }
    },
    "required": ["user_id"]
  }
}
```
- A function router: Your code needs to listen for the model’s JSON output and execute the matching function. This is usually a dictionary lookup or switch-case in your backend (see the sketch after this list).
- Error handling: What if the API is down? What if the user ID doesn’t exist? You need fallback responses - developers consistently report that this is where much of their debugging time goes.
- Conversation flow: Design how the model will respond after getting a result. Should it say, “I found your email: jane@example.com”? Or “I couldn’t find your account - can you confirm your ID?”
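For the router and the error handling, a plain dictionary lookup usually does the job. Here is a minimal sketch; the handler functions are illustrative stubs:

```python
import json

def get_user_email(user_id: str) -> dict:
    # Placeholder: query your real user store here.
    return {"email": f"user-{user_id}@example.com"}

def get_weather(city: str, unit: str = "celsius") -> dict:
    return {"city": city, "temp": 21, "unit": unit}

# The router: function name -> handler. Adding a tool means adding one entry.
FUNCTION_ROUTER = {
    "get_user_email": get_user_email,
    "get_weather": get_weather,
}

def dispatch(tool_call) -> str:
    """Run the function a tool call names; always return a JSON string."""
    handler = FUNCTION_ROUTER.get(tool_call.function.name)
    if handler is None:
        # The model asked for a tool you never defined - return an error the
        # model can relay instead of crashing the conversation.
        return json.dumps({"error": f"Unknown function: {tool_call.function.name}"})
    try:
        args = json.loads(tool_call.function.arguments)
        return json.dumps(handler(**args))
    except Exception as exc:  # API down, unknown user ID, malformed arguments
        return json.dumps({"error": str(exc)})
```

Returning the error as the tool result, rather than raising, lets the model apologize or ask a clarifying question instead of leaving the user hanging.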
Experts recommend using four-shot prompting: show the model four clear examples of how to use the tool before asking it to act (see the sketch below). This boosts accuracy more than any other single technique.
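One way to apply this is to bake the four examples into the system prompt. A hedged sketch - the phrasings here are invented, and seeding the exemplars as prior conversation turns works as well:

```python
# Four worked examples of the mapping from user intent to tool call,
# shown to the model before it sees the real question.
FOUR_SHOT_PROMPT = """You can call get_weather(city, unit). Examples:
1. "Is it cold in Oslo right now?" -> get_weather(city="Oslo", unit="celsius")
2. "Do I need an umbrella in Seattle?" -> get_weather(city="Seattle", unit="fahrenheit")
3. "Temperature in Tokyo, in Fahrenheit?" -> get_weather(city="Tokyo", unit="fahrenheit")
4. "Will it be nice out in Paris today?" -> get_weather(city="Paris", unit="celsius")
Do NOT call it for general climate questions like "Is Norway cold in winter?"."""

messages = [
    {"role": "system", "content": FOUR_SHOT_PROMPT},
    {"role": "user", "content": "Should I bring a jacket in Berlin?"},
]
```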
Common Pitfalls and How to Avoid Them
Most teams hit roadblocks. Here are the top three issues - and how to fix them:
1. Parameter Validation Errors
OpenAI’s system demands perfect JSON. If you pass a string where it expects a number, or forget a required field, the call fails silently. Developers report spending 53% of their debugging time on this alone.
Solution: Use a schema validator in your dev environment. Test every parameter type. Add logging to capture failed calls. Don’t assume the model will get it right.
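In practice, that can mean checking the model’s arguments against the same JSON Schema you registered, before anything executes. A sketch using the third-party jsonschema package:

```python
import json
import logging

from jsonschema import ValidationError, validate  # pip install jsonschema

# The same schema you registered for the tool (see the example above).
GET_USER_EMAIL_PARAMS = {
    "type": "object",
    "properties": {
        "user_id": {"type": "string", "description": "The unique ID of the user"},
    },
    "required": ["user_id"],
}

def safe_parse_arguments(raw_arguments: str, schema: dict) -> dict | None:
    """Parse and validate tool arguments; log failures instead of swallowing them."""
    try:
        args = json.loads(raw_arguments)
        validate(instance=args, schema=schema)
        return args
    except (json.JSONDecodeError, ValidationError) as exc:
        # Logging rejected calls turns silent failures into fixable bugs.
        logging.error("Rejected tool arguments %r: %s", raw_arguments, exc)
        return None
```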
2. Ambiguous Requests
“Show me my orders.” Which user? Which time period? Which system? Models often pick the wrong context.
Solution: Build clarification loops. If the model can’t be sure, it should ask: “Which account would you like me to check?” Don’t force it to guess.
3. Infinite Loops
What if the model keeps calling the same tool over and over? “What’s the weather?” → “I checked the weather.” → “What’s the weather?” → …
Solution: Set a max_turns limit. Most platforms allow you to cap function calls at 5-10 steps. That’s usually enough.
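If your platform doesn’t expose such a cap directly, you can enforce one in your own loop. A minimal sketch, reusing the dispatch router from the earlier example:

```python
MAX_TOOL_TURNS = 5  # arbitrary cap; 5-10 steps is usually enough

def run_conversation(client, messages, tools):
    for _ in range(MAX_TOOL_TURNS):
        response = client.chat.completions.create(
            model="gpt-4-turbo", messages=messages, tools=tools
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # the model answered in prose; we're done
        messages.append(msg)
        for call in msg.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": dispatch(call),  # router from the earlier sketch
            })
    # Hit the cap: stop calling tools rather than loop forever.
    return "Sorry, I couldn't complete that request - please try rephrasing."
```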
Real-World Limits
Function calling isn’t a cure-all. It won’t help if you don’t have the right tools.
A Johns Hopkins study found that without a medical diagnostic API, LLMs failed 41% of the time on basic diagnosis questions. The model could describe symptoms, but couldn’t interpret lab results or access patient history.
And there’s a hidden risk: parameter injection. If a user types “Show me all orders from user_id=123; DROP TABLE users;”, and your system doesn’t sanitize input, you’ve opened a security hole. Dr. Percy Liang from Stanford found 37% of early implementations were vulnerable to this.
Always validate and sanitize inputs. Treat function parameters like user form data - never trust them.
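Concretely, that means whitelisting parameter formats and never splicing values into query strings. A sketch using Python’s built-in sqlite3 module; the orders table is hypothetical:

```python
import re
import sqlite3

# Whitelist the shape of a valid ID; don't try to blacklist "bad" characters.
USER_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")

def get_orders(conn: sqlite3.Connection, user_id: str) -> list:
    if not USER_ID_PATTERN.fullmatch(user_id):
        # "123; DROP TABLE users;" dies here, before it reaches the database.
        raise ValueError(f"Invalid user_id: {user_id!r}")
    # Parameterized query: the driver escapes the value, so even a hostile
    # string cannot change the structure of the SQL statement.
    cursor = conn.execute(
        "SELECT id, status FROM orders WHERE user_id = ?", (user_id,)
    )
    return cursor.fetchall()
```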
What’s Next?
The field is moving fast. In October 2025, OpenAI launched GPT-5 with “adaptive parameter validation” - it now suggests corrections if a parameter is wrong. Claude 3.5 added “tool chaining,” letting models automatically sequence multiple calls without human input.
Google’s “tool grounding” feature, released in December 2025, cross-checks function results against the model’s own knowledge. If the weather API says it’s 80°F but the model knows it’s snowing in that city, it flags the conflict.
By 2027, Forrester predicts 92% of enterprise LLM apps will use function calling. The big question isn’t whether you should use it - it’s how well you’ll build it.
The future belongs to models that don’t just talk - they act. And function calling is how they do it.
What is function calling in large language models?
Function calling is a feature that lets large language models (LLMs) request actions from external tools or APIs by generating structured JSON output. The model doesn’t execute the code itself - it tells your application what to do, and your system runs the function. The result is then fed back to the model to generate a natural-language response. This allows LLMs to access real-time data, perform calculations, or interact with databases without needing to be retrained.
How is function calling different from fine-tuning?
Fine-tuning changes the model’s internal weights to improve performance on a specific task using labeled examples. Function calling doesn’t change the model at all. Instead, it adds an external layer: the model learns to recognize when to call a tool, what parameters to send, and how to interpret the result. It’s like giving the model a calculator instead of teaching it arithmetic.
Do I need to code to use function calling?
Yes. While some platforms offer visual builders, function calling requires you to define function schemas in JSON, write code to handle the API calls, and build error handling. You need to understand JSON Schema, API integration, and basic backend logic. Most developers spend 35-60 hours learning the full workflow before deploying it reliably.
Which LLM is best for function calling?
It depends on your needs. OpenAI’s GPT-4 Turbo has the most tools and integrations. Claude 3.5 Sonnet handles vague or incomplete requests better. Gemini 1.5 Pro excels at multi-step workflows. Qwen3 offers transparency by showing its reasoning. If you’re building for enterprise, consider reliability, error handling, and security - not just accuracy scores.
Can function calling make LLMs hallucinate less?
Yes - significantly. When models can pull real-time data from trusted sources instead of guessing from training data, hallucination rates drop by up to 68%. For example, asking about today’s stock price or current weather becomes accurate because the answer comes from an API, not the model’s memory. This is why businesses see such strong results in customer service and data-driven applications.
What are the biggest risks of using function calling?
The biggest risks are security vulnerabilities and poor error handling. If you don’t validate input parameters, attackers can inject malicious code through function calls. Also, if your system doesn’t handle failed API calls gracefully, the conversation breaks down. Many teams report silent failures that confuse users. Always sanitize inputs, set call limits, and design fallback responses.
Is function calling worth the effort for small projects?
Only if you need real-time data. If your app just answers general questions - like “What’s the capital of France?” - you don’t need it. But if you’re building a personal assistant that checks your calendar, books appointments, or pulls your latest bank balance, then yes. The complexity is real, but the payoff in accuracy and usefulness is huge for targeted use cases.
Next Steps
If you’re ready to try function calling:
- Start with one simple tool - like a weather API or a calculator.
- Define the function schema exactly as the API expects.
- Build a basic router that prints the model’s JSON output before calling anything.
- Test with 10-20 real user questions. Watch where it fails.
- Use four-shot prompting to improve accuracy.
- Log every failed call. Fix the most common errors first.
Don’t try to automate everything at once. Start small. Get the basics right. Then scale.
Sibusiso Ernest Masilela
9 December 2025, 19:56
This is the most overhyped garbage I've seen all year. Function calling? Please. LLMs are still just fancy autocomplete engines pretending to be agents. You think this 'structured JSON' nonsense makes them intelligent? It's just a glorified API wrapper wrapped in buzzword soup. Real intelligence doesn't need to beg for help every five seconds. This isn't progress-it's a crutch for lazy engineers who can't even write proper logic.
And don't get me started on that 'Qwen3 transparency' nonsense. Who cares if it writes out its reasoning? I don't want an AI therapist-I want results. You're turning AI into a nervous intern who needs to justify every coffee run.
78% of enterprises using this? Yeah, because they're being sold snake oil by vendors who don't understand the difference between automation and agency. This isn't the future. It's the AI equivalent of using a paperclip to fix your car.
Stop celebrating mediocrity. We're not building agents-we're building glorified chatbots with a side of JSON schema anxiety.
Daniel Kennedy
9 December 2025, 20:59
Hey Sibusiso, I hear your frustration-and I’ve been there. But function calling isn’t about pretending AI is sentient. It’s about making it useful. Think of it like a doctor who knows when to order a blood test instead of guessing your diagnosis from memory.
Yes, the JSON schema is annoying. Yes, parameter validation is a nightmare. But when your customer service bot stops giving fake invoice dates and actually pulls real data from your CRM? That’s magic. Real, measurable magic.
And Qwen3’s transparency? It’s not therapy-it’s auditability. In finance or healthcare, you need to know why the AI did something. Not just that it did it.
Start small. One tool. One use case. You’ll be surprised how much faster your team solves problems when the AI isn’t hallucinating your user’s last order date.
It’s not perfect. But it’s a hell of a lot better than pretending the model knows what’s happening right now.
Taylor Hayes
10 December 2025, 17:06
Daniel’s point is spot-on. I’ve built this stuff for a fintech startup, and honestly? The first time our AI pulled the right balance from our backend instead of making up a number? We all cheered. Like, actual cheers.
It’s not about making AI 'smart.' It’s about making it reliable. The real win isn’t the 88% accuracy scores-it’s the drop in support tickets. Our users stopped saying, 'You’re wrong again,' and started saying, 'How did you know I needed that?'
And yes, the setup is a pain. Schema validation, error handling, timeouts-it’s a mess. But once you get past the first 20 hours of debugging, it becomes invisible. Like electricity. You don’t think about it until it’s gone.
Start with a calculator function. Then a calendar lookup. Then a database query. Build up. Don’t try to automate your entire business on day one.
And if you’re worried about security? Sanitize inputs like you would any form field. Treat every parameter like it’s from a stranger on the internet. Because it is.
Sanjay Mittal
11 December 2025, 03:02
From India, we’ve been using Qwen3 in our HR chatbot for policy queries. It’s not the fastest, but the reasoning step is a game-changer for compliance. HR managers actually trust it now because they can see the logic: 'I checked the leave policy section 4.2, which states...'
Function calling isn’t magic. It’s plumbing. And plumbing is boring. But without it, your whole system floods.
Biggest win? We cut down 70% of 'What’s the policy on remote work?' tickets. No more outdated PDFs. No more confused admins.
Just make sure your API endpoints are stable. Nothing worse than an AI confidently saying 'Your vacation was approved'... then the API returns 504.
Mike Zhong
12 December 2025, 05:30
Let’s be honest: function calling is just a bandage on the real problem-LLMs have no model of the world. They don’t understand time, causality, or context. They just pattern-match. This isn't augmentation. It’s a hack.
Why not train models on live data streams? Why not build architectures that evolve with the world instead of begging for scraps via JSON?
This feels like giving a toddler a flashlight to see in the dark instead of teaching them to open their eyes.
And the market projections? $10 billion? That’s not innovation-it’s corporate FOMO. Everyone’s scared of being left behind, so they slap on function calling like a sticker on a broken phone.
True AGI won’t need tools. It will understand them.
We’re building a house of cards and calling it a skyscraper.
Jamie Roman
12 December 2025, 07:54
I’ve spent the last six months building a personal assistant with function calling, and honestly? It’s been a rollercoaster. The first time it successfully booked my dentist appointment after I said 'I need to fix my tooth'-I cried a little. Not because it was smart, but because it finally understood what I meant.
But man, the debugging. Oh god, the debugging. One missing comma in the JSON schema and the whole thing just... dies. Silent. No error. No hint. Just nothing.
I spent three days tracking down why it kept calling 'get_weather' when I asked about my calendar. Turns out, the model thought 'I need to know if it’s raining' meant 'what’s the weather?' because I’d used that phrasing in one of my four-shot examples. Four-shot prompting saved me, but it also made me realize how fragile this all is.
And the security stuff? Yeah, I got hit with a parameter injection attempt last week. Someone tried to pass 'user_id=123; DROP TABLE users;' in a chat. My validator caught it, but I didn’t even know that was a thing until I read this post.
It’s messy. It’s clunky. But when it works? It feels like giving your AI a pair of eyes and ears. And that’s worth the chaos.
Salomi Cummingham
12 December 2025, 16:07
I just want to say-thank you for writing this. As a woman in tech who’s been told 'AI isn’t for you' more times than I can count, seeing a clear, thoughtful breakdown of function calling? It’s rare. And it matters.
I built a tool for survivors of domestic violence that uses function calling to pull local shelter availability, legal aid hours, and crisis line numbers in real time. Before this, the AI would give outdated info. Now? It works. Lives are being changed.
Yes, the schema is a pain. Yes, the errors are frustrating. But this isn’t just code. It’s safety. It’s dignity. It’s someone getting help when they’re terrified and alone.
So yes, the tech is messy. But the impact? Pure gold.
To everyone who thinks this is just 'another AI gimmick'-you’re missing the point. This isn’t about what the model can do. It’s about what it can do for someone who needs it.
I’m not just building an app. I’m building a lifeline. And I’m so glad I didn’t give up when the JSON broke the third time.