Video: Build Hour: Agentic Tool Calling | Duration: 55:16 | Summary: Build Hour: Agentic Tool Calling | Chapters: Welcome to Build Hour (0:02), 2025 Feature Recap (1:02), Agenda and Overview (2:02), Agentic Tool Calling (3:39), Agentic Tool Calling (6:29), Implementing Agentic Tasks (9:03), Building Agent Tools (15:19), Backend Integration Process (22:00), Implementing Task Architecture (26:49), Implementing Task Progress (35:32), Background Task Delegation (43:49), Conclusion and Resources (49:09)
Transcript for "Build Hour: Agentic Tool Calling": Hey, everyone. Welcome back for another build hour. This is actually our first one of 2025, and we're really excited to be with you here today. My name is Sarah Urbanis, and I lead startup marketing here at OpenAI. And I am joined by Ilan. Yeah. I'm Ilan. I'm on the developer experience team. So we always like to start build hour with the goal of why we're here. And it's really to empower you with the best practices, tools, and AI expertise to scale your company using our APIs and models. Now this series is really for all of you. So we take your feedback into what you want to hear more of, what you're building with, and, hopefully, this is a really valuable hour of your week, so that after this, you can accelerate what you're building and what you're creating with OpenAI. We have a new page here, webinar.openai.com/buildhours, I wanted to plug. We heard your feedback that you wanted a centralized place for all upcoming build hours. So now you can actually see all of the topics that we have upcoming. So we've been busy in 2025. As the kids say, we've been cooking. We had a lot that we've shipped this year and just wanted to show a quick high level of what we shipped, what it does, and maybe what it means for you. We've had the Responses API, GPT-4.1, o3, o4-mini, the Codex CLI, Codex. You can take a screenshot of this. We, unfortunately, don't have time to go through all of it in this hour. We might be here for a while. But we're going to try to show you as many new features and models as possible today. So, obviously, there's one big one that is missing from this list that we launched, which is ImageGen. And we know that that got a lot of attention from you all. Next week, we're actually doing a build hour on ImageGen where we'll talk about how you can leverage the API for whatever you're building. And, yes, we might make a Studio Ghibli image or two during that build hour.
But today's session is really going to focus on the flagship models, increased context, and also Codex and how you can start building with it today. So for our agenda, we're first going to start out with some core concepts. We will talk about what's new, maybe set up a task with Codex, and then go into some core concepts and talk about reasoning, agentic tool calling, and tasks. As always with build hour, we really want this to be hands-on and get you into our code base. So we're actually going to anchor most of today's session on demos. And I think that we shared the code repo in the chat so you can follow along with what Ilan is building and build it yourselves after this, hopefully. So we'll first do a demo on how to implement tasks and live-build a task system, interface, and back end, then talk a little bit about delegation, and just share some directional guidance for evals. There's a lot we could dig into here, but, hopefully, we'll give you a few tips to get started. And then finally, we will end with Q&A as always. This is a new platform that we're using today, so just wanted to share: if you click Q&A, there should be a text field that pops up to ask questions. We have a couple folks in the room here who are going to be answering, and then Ilan will answer live at the end. Yeah. Absolutely. Alright. You ready to do some coding? Let's do it. Alright. Over to you. Cool. Hey, everyone. Yeah. So we have a lot of stuff that's new. And to start off, we can just go over a few of the agents that we launched this year. Right? So earlier this year, we launched Deep Research, o3, and Codex. You know, 2025 is the year of agents, and we're really seeing it happen. And to start off, let's just do a little quick demo of Codex. I know some of you might be interested in using this soon. So, just to kick things off: here, I actually have the repo for the Codex CLI, which is not what I'll be showing. I will instead be showing how to use Codex in ChatGPT.
However, the reason I have this open is just to grab, like, a sample task and show you how you could use Codex. So here, we have this issue that someone pointed out that Codex does not respect the API keys set through environment variables for other vendors. Right? So, you know, this isn't great. So what I'm gonna do is just, like, highlight the whole issue and just drop it into Codex. Right? So I'll say, like, you know, can you fix this? And then, you know, I'll give it a bit of context. I'll give it the title. And as you can see, I have Codex enabled. And so what it'll do is, it has a local copy of this repo that it can, like, run tests and actually write code in. And so I'm just gonna kick this off, and off it goes. And in the background, it's just going to go through, check all of the different parts of this issue, and then do the changes it can. And so if you take a look at what's happening in this log, you know, it's downloading the repo. And we're not gonna watch the whole thing happen. Right? But this is kind of the interface that you'll expect with Codex, where you have this, like, long-term task where you just wanna give it the end state. You just say, this is what I want you to do, figure out a way, and off it goes. And it's using the environment, and we'll get back to it at the very end. Right? But this is just a sneak peek of Codex. You can already use it right now. Now the reason this is relevant to our build hour today is because you will be implementing your own agents soon. Right? And part of the reason for bringing these agents up is to give you a background for, like, what yours can look like. Because the technology that we use internally for o1, o3, Codex, everything, is actually mostly out in the open for you to build as well. So this is how you can build your agents. The Responses API is an incredibly powerful API, especially as of yesterday when we launched hosted tools, MCP, so much stuff.
Unfortunately, we don't have a lot of time to get into everything that we launched. But, essentially, with a single API call, you can now set off this, like, entire sequence of events where the agent can query files, call MCP servers, etcetera. The Agents SDK is a really handy way to implement this same loop, but where you can do local function calling. And it has a couple other features like handoffs, and, of course, hosted tools and MCP. These are some tools that you can use to build. So today, we're gonna talk about agentic tool calling, right, and how all the things that we talked about really come into this one idea. So, what is agentic tool calling? Really, it comes down to reasoning and tools. Right? So you can think of something like Deep Research, Codex, or o3 as just going through reasoning with tools. But, you know, why is this interesting? Right? And the big part of this is really reasoning. Right? So, last year, we trained o1, where we really taught models to reason for the first time. And what this meant is instead of showing them, like, here's how you do a task step by step and hoping that it learns from our examples, we instead let it figure out how to get to solutions as it reasoned, and we would just grade on whether it was correct or incorrect. Right? So through reinforcement learning, o1 learns to hone its own chain of thought to refine the strategies it uses. And so this is the first component of agentic tool calling: reasoning. Right? We train the models on solutions, not the steps that make them up. They figure out the steps, and reasoning emerges. Right? And then you can take function calling or tool calling, where you take actions and fetch information. And when you combine them, that's when you get agentic tool calling.
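The reasoning-plus-tools loop being described here can be sketched in miniature. This is a toy stand-in, not the actual model or API: the "model" below is a scripted stub that decides whether to call a tool or finish, which is the same shape of loop a reasoning model drives when you give it tools.

```python
# Toy sketch of the agentic tool-calling loop. In a real system, `model`
# would be a reasoning model called via the API; here it's any callable
# that looks at the history and returns either a tool call or a final answer.
def run_agent_loop(goal, tools, model, max_turns=10):
    history = [{"role": "user", "content": goal}]
    for _ in range(max_turns):
        action = model(history)
        if action["type"] == "final":
            return action["content"]
        # Execute the requested tool and feed the result back into the
        # history, so the model can course-correct on failures.
        result = tools[action["name"]](**action.get("args", {}))
        history.append({"role": "tool", "name": action["name"], "content": result})
    raise RuntimeError("max turns exceeded")
```

The key property, as described in the talk, is that the loop is specified by the goal and the available tools, not by a scripted sequence of steps.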
And so what you can see in this diagram is pretty much what happens: now within this reasoning, this chain of thought that the model is doing where it figures out how to think about things, we now also give it access to tools. So it can figure out not just how to think about things, but how to do things as well. And so the paradigm is very similar. We didn't train it on, like, this chain of thought, on specific steps that we wanted it to take. We just train it on results, and it learns to take the actions to actually achieve those results, which is the really powerful part of RL that we're now bringing into models, not just for thinking, but for doing as well. And what you see is this long-term agency that is really, really powerful. And so this is what we're calling agentic tool calling. And what results is a model that is goal-oriented and resourceful. Right? It'll figure out a way to get you what you asked for. It's very robust at recovery. So if it gets failures during the tools, it can actually course-correct and get to the end. And it's really consistent over long-horizon tasks. Right? You can actually have, like, tens, hundreds of function calls in a row. Codex, I think, does on the order of this, and it stays consistent. And so this is the power of agentic tool calling. So with this agentic tool calling capability, we can actually start thinking about a new primitive or a new abstraction, which is tasks. Right? Everything used to be very chat-based, and now we're entering this world of long-horizon tasks. So what goes into a long-horizon task? Like, what do you have to consider? And this is where we start to go a little bit from theory more into the actual practical end of, like, what does it take to put together a task? So, of course, you have the agent. I guess, to talk about these: you have the agent, which describes what the task does. You have infrastructure, which is how do you actually run it?
How does it do it? You have your product, which is how does a user use it or interact with it? And then evaluation, you know, did it do what you wanted it to do? How well did it do it? So for agents, you're gonna be thinking about goal specification, which is kinda different from before. Instead of specifying step by step what you want to happen, you have to specify what the end state that you want is. You also want to specify the tools, which will give it access to the different resources that you have. And this way, your agent will be able to interact with your own systems. This is also where you might wanna use delegation, where you're no longer talking to just one agent and waiting for it to be done. It might be able to kick off other tasks as well, and how you choose to interact with, like, async long-running function calls and human-in-the-loop. Now infrastructure is where you start to think about these, like, parallel tasks and parallel execution, and how you manage the state between, like, one agent and your product and your back end, as well as the runtime environment. So Codex has a runtime environment for each of the different, like, repos that it has, and you might wanna set up one for you. And then this is also where you choose how to handle failures and retries. This is like the nitty-gritty of, like, okay, we have agents, we have these concepts. How do you actually run them, and what does it look like with code and with your, like, back-end architecture? Then you have your product, where you can choose how the user interacts and what they get to see. And so this is where you can surface progress. Like, how do you keep them informed of what the agents are doing? This is also where you work with the user, essentially, to provide the agent all the context it needs to actually accomplish the tasks.
And this might be explicit, by asking the user, or it might be implicit, like how we might do it in a second, where it is just gathered from the application context itself. And this is also where you can choose, like, how you visualize tasks. Right? Like, Codex has this, like, list of tasks below, and then when you open it, you can see, like, all the things that it's doing with these nice animations. And so this is where you can actually have fun with it. Right? How do you keep the user entertained and give them insight into what is going on? Because I forget off the top of my head where this is from, but, like, I think it's pretty intuitive: a user who just, like, doesn't know what is happening behind the scenes and is just waiting will, a, get more, like, stressed and impatient, and, b, will actually trust your output less if they don't see what is actually going on. So, finally, you have evaluation. You know, this is where evals come in, where you wanna collect examples and define how you wanna grade. And so this is a little bit different from before, where you really wanted to evaluate, like, turn by turn. In a chat conversation, this is, like, the most common way that we used to do it. Now with tasks, you actually are more interested in the end result, and maybe less interested in what each of the turns the model took was. So you might wanna set up, like, these graders that are now often, like, LLM-based graders, where you give it a rubric and you describe what criteria you're looking for. I like to sit with people when we've worked with companies and just ask them, like, okay, is this good? And then have them say yes or no, and then it's like, why? And then just ask them and go through and really break down a task and what makes something good. Maybe a little sneak peek here is if you have a few examples, you can actually fine-tune a model to be your grader.
And this is actually a great case for reinforcement fine-tuning that we launched, where if you have a good golden set of examples, you can train on that. And then finally, tracing. To do all this, you wanna remember that evaluation is not just about running evals. It's also about, like, monitoring, evaluating it online during its interactions with the user. So these are the parts that really go into tasks and what you consider when you build them. So I think this is enough. Oh, and, yeah, I put things in different groups here, but everything is really connected. Right? Like, you know, async and human-in-the-loop might actually have to do with, like, continuing long-running tasks on the infrastructure, and, like, the product might include delegation. So, you know, these are, let's say, the four corners, but there's no clear lines between them. So with that said, it's time to start coding. So we're gonna take a stab at implementing tasks. And today's prompt is, let's say we have a lot of, like, a backlog of tickets. Right? Customer feedback, etcetera. Let's build an agentic task system that can actually take them on and resolve them. This is kinda broad, but it's kinda to show how broadly these agents can work in practice. And so, let's ignore the issue; this is roughly what we're aiming for. Right? Let's say we have this customer service portal or, like, this feedback portal with a few different tasks. What we wanna implement is this system that can actually take these tasks and operate on them. And this is gonna require, like, defining your agent, defining some infrastructure to run it, and also defining, like, what interactions the user has and how they see this. So let's get started with the agent in a pretty simple way. Right? So this is actually a very powerful API call already.
When you specify a response, you can actually give it hosted tools and have that run in the background. I added this because yesterday, we launched background mode, which actually takes a lot of this and, like, makes it a little bit easier if you're not using local function calls and all you're doing is MCP and hosted tools. I'm gonna skip this for now because there's a lot to get through. So the first thing is we're gonna use the Assistants, I'm sorry, the Agents SDK. And what this SDK does is it wraps this, like, loop that we actually implemented in my first build hour, which was implementing an agent. It's very similar, and it's based on Swarm, if you're familiar with that. So here we have the simplest agent, and let's just run it real quick to see what the interaction is. So, hi, you know, can you say hi to everyone watching today? And as you can see, we can stream the results, and we can see the reasoning. Okay. o3 is pretty friendly. So it's funny because getting to this point last time actually took a lot of work. But now, because we have the Agents SDK, it really is just throwing together a quick agent. Now let's start adding some tools. Right. So if we go back to this, we might wanna take a look at one of the tasks, and it says, you know, customer reports being charged twice for their monthly subscription. So let's start building some tools for the agent to use here. Now we're gonna have to think about tools and also the prompt and how we specify the goal. So I have some code ready, but I like doing this live with you all. So let's just start typing up some functions. So, you know, Cursor is very helpful here. It sorta knows already what I wanna do. Here, we're not gonna search the web. Instead, let's say we wanna give it access to, I guess, get user data. Right? And we can take a username, and then we can just return some mock data.
And then let's, you know, ask it to also include recent order history. Cool. So now it can query for a user. And similar to Swarm, just by specifying a function with Python, it'll actually take this and turn it into the right schema, go to the model, and execute it. So a super nice way to do this. Cool. So we have this first function. Let's try this out really fast and maybe give it some instructions. Say, great. Thanks, Cursor. We can say, no, I'm Ilan, and I'm upset because, actually, this isn't quite a goal. Let's pick a goal. Right? So we have the user data, and then maybe let's add another function for, like, refund. Refund. Cool. Yeah. Let's mock it out. And let's make one for get order details. Okay. Cursor is actually amazing. So what we set up here is we wanna have this ability to, like, get recent orders. Now we can then take those orders and return some details on them, including the price. And so for this flow, if I want the model to be able to perform a refund, I wanna be able to specify, like, the functions and my prompt, and have it figure it all out. So this is all live. So anything can happen, but let's see. I can say, like, I really didn't like the last thing I got. I want a refund. So the model's reasoning. Can you just check my recent orders? So now it uses the first function to get the orders. It shows them to me. So what we're seeing here is still pretty interactive. This is kinda the normal approach. But let's add an instruction. This is why specifying the end state is more important than specifying the steps. Let's say, you know, get all the context you need upfront, then execute the task to completion without asking for more. And let's add one more and just say, like, you know, if you have everything you need, just go.
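The mocked tools typed up live here aren't shown in full, so this is a sketch of what they might look like; the exact names and fields are assumptions based on the walkthrough. In the Agents SDK, plain Python functions like these (with type hints and docstrings) get converted into tool schemas automatically, typically via a decorator.

```python
# Mock backend tools, as sketched live in the demo. Field names are
# illustrative; in the Agents SDK you'd wrap each one so its signature
# becomes the tool schema the model sees.
def get_user_data(username: str) -> dict:
    """Look up a user and their recent order history (mocked)."""
    return {
        "username": username,
        "email": f"{username}@example.com",
        "recent_orders": ["order_123", "order_456"],
    }

def get_order_details(order_id: str) -> dict:
    """Return details for one order, including the price (mocked)."""
    return {"order_id": order_id, "item": "Widget", "price_usd": 19.99}

def process_refund(order_id: str) -> dict:
    """Issue a refund for an order (mocked)."""
    return {"order_id": order_id, "status": "refunded"}
```

With just these three tools and a goal like "refund my last order," the model can chain user lookup, order details, and the refund on its own.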
You know, this is not entirely what you might wanna do, but this is just to show what I can say, like, you know, I'm Ilan, and I want a refund on my last order. So let's see. Okay. It's getting the user data, it's getting the order details, it's updating the refund, and that's the end. And so this is kinda to show that by specifying the end state, you can actually have the model figure out the steps required to get there based just on the tools. Cool. So this was a simple start. Now let's move on to product and actually integrating it. So, if we have this front end, we actually wanna connect it to a back end. Right? And in order to keep using the Agents SDK, which right now is in Python, JavaScript one coming soon, we'll wanna set up, like, a very simple Flask setup. Right? So I'll do this one by hand just to show you guys how to do it. Let's do server. I guess I have one right here. Right. So I'll start off with, you know, from Flask. Okay. Very, very basic Flask server. Now let's say we have, like, a task endpoint. Here's where we're gonna wanna run our agent to actually perform the task and then stream back the results. So this pattern, I don't really know a good name for it. Maybe, like, a foreground task, but, essentially, it means that when a front end connects to it, the connection is what keeps the task alive. And if we go back to our slides, this is what this would be, where each new connection actually starts its own task, and, like, the connections are managed from the front end. And so this is useful because, first of all, it's very simple to implement. And second of all, if you implement it this way, then when you connect, you're essentially gonna be getting the stream of the events for that task. So, actually, let's start with the simple server here. I'll walk through it. I think it's better than watching me type everything.
This is slightly more code than I usually do for these live build hours. So let's take it slow. First, we define our endpoint, which is gonna be an SSE endpoint. We grab from the body the input items and the previous response ID, and we use the Agents SDK's streaming runner. Right? And what that does is it gives us a stream of all the events that the agent's doing. We're passing in this agent that we can actually import from the one that we defined. So if we defined it, where was I? I think it was in one agent. So I'll import it from there. Yes. Fifty fifty. Oh, it's not happy with that import name. That's fine. I have another one implemented. We'll be using that one. I'll show you in a second. But, essentially, we have this runner. We can run it. And then for each of the events, we actually get different kinds of events from the Agents SDK. We wanna get the raw response events, which represent the actual events coming from the Responses API, because the Agents SDK is by default backed by the Responses API. And what we do, in this event stream that we're defining, is run it, and then we just yield them back, and we encode them with SSE. This is mostly just piping, but I think it's important to show what kind of thing might go into this. And then at the end, we'll yield, like, a done, and then we return a stream. Now already with this, we have implemented something that looks like this, where from a front end, we can make a connection to a back end, run the event, and then wait for everything to come back. So this is a good place to start if you are just prototyping or making something. But in reality, you might wanna be able to handle a task that is created in the background and then disconnected. And you might see this in ChatGPT, where you can type something and close it, like, completely close it, and then come back later, and it'll be finished. Right?
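The piping part of the foreground endpoint just described can be sketched framework-agnostically. This is a minimal sketch, assuming each agent event is already a JSON-serializable dict; the SSE framing and the terminal "done" event follow the pattern in the walkthrough.

```python
import json

def sse(data: dict) -> str:
    """Encode one event as a Server-Sent Events frame."""
    return f"data: {json.dumps(data)}\n\n"

def event_stream(events):
    """Foreground pattern: the open connection drives the task.
    Yield each raw response event as an SSE frame, then a 'done' frame."""
    for event in events:
        yield sse(event)
    yield sse({"type": "done"})
```

In a Flask or FastAPI route, you'd return this generator as a streaming response; the front end decodes each `data:` frame as it arrives.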
So this is implemented with a different approach, with a background task queue. So let's take a second to just walk through this architecture, because this is what we're gonna be using in practice for the front end that we just created. So we have our front end, which needs to be kept in check with the back end that has the tasks. What we can do is open an events connection to the back end that just receives all of the new task events from the back end. What are these events? This is, like, adding new items to the tasks, updating to dos, pretty much anything that modifies the state of these tasks. Now how do we actually kick this off? We can use a task endpoint. So here, I've labeled in yellow what is, like, an SSE endpoint that is meant to stay open, and I've labeled in gray anything that is just a post request. And this has the shape of a background task, right, where you just make a post request, which starts the stream, and that's it, and you disconnect. And you could actually start this task from anywhere. So this architecture actually can scale nicely. What you can implement this way is, once you have your back end, you can send tasks to a task queue, which is actually responsible for running your agents. And then you can take any events that come out of that, associate them with which task they're related to, and stream them back. So, this is the main architecture. Let me walk you through what the implementation can actually look like. Okeydoke. We have a lot going on, so I'll go slowly. This is the task object that we are defining. We just want it to have an ID. This items list represents the items that actually make up the conversation history or, like, the action history of an agent. We will get into to dos in a second. Let me delete this for now. And we have this status.
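The task object just described can be sketched as a small dataclass. This is an illustrative shape, assuming the fields named in the walkthrough (ID, items, to dos, status); the real demo code may differ in details.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Todo:
    text: str
    done: bool = False

@dataclass
class Task:
    # Unique ID the front end uses to associate events with this task.
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    # Conversation / action history items produced by the agent run.
    items: list = field(default_factory=list)
    # Surfaced progress; populated later by the to-do tools.
    todos: list = field(default_factory=list)
    # e.g. "in_progress" | "done" (values assumed for illustration).
    status: str = "in_progress"
```

In the demo these live in an in-memory dict keyed by ID; in production you'd persist them.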
Now, the global variables that we have here, which you might actually wanna persist somewhere, but we are not for this demo, are the actual tasks mapped from their ID, this asyncio event queue, which just lets us run things in what we're considering the background, and then our FastAPI app. Now let's go all the way to the bottom, and I'll show you the two routes that we have. So here's the two routes that I showed you in the diagram earlier. The first one is the events route. And so what this does, you can see it's a pretty simple one. It's maybe a little bit hard to follow. But, essentially, what it does is, forever, right? As soon as you start a connection, it'll just forever stream these events: wait for this events queue to receive something and then forward it to the front end. And that's all it's doing. Right? It's just taking from this events queue and sending it to the front end. And you'll see why this is important. Because when we have this events queue, if we have this endpoint that a front end can connect to, it'll just guarantee that anything we push to this events queue gets routed to the front end. And what we can do is push updates from any of the agents up to this events queue. So that's the events endpoint. Now let's take a look at the tasks endpoint. Just like before, we take the body, we parse it, we grab out the items, we grab out the previous response ID, which, if you're not familiar, is a very convenient way to specify what the previous requests were so that you don't have to pass the context each time. And this becomes really important when you wanna do this chain-of-thought tool calling, because you wanna make sure that when the model is calling a function, it actually has in its history the rest of the chain of thought. So it really is function calls within a chain of thought. Now we can create this task object and save it to our tasks. And then here is where the important part is.
We can publish, and I'll show you this function in a second, that we have created a task, and then what the task ID is. And so if we scroll up to publish, all we're doing is taking the events queue that we were talking about before and giving it the event itself, encoding it as SSE so that the front end can take it and decode it. Cool. So, finally, once we create the task, the last thing left to do is to actually kick off our worker, kick off our agent, our task. And this word task is actually a bit overloaded. This is just a function that is part of the asyncio library. It considers a task, like, a background task, which actually fits very nicely with our analogy. So we can use asyncio, give it a worker function that we'll get into in a second, kick it off in the background, and then return the task ID for the front end to be able to have it. But, so this is kinda where the juicy bit happens. Right? What is happening in the worker? Well, it's very similar to what we had before. All we're doing is we're gonna take our runner, give it the agent, which includes the prompt and the tools that are defined, give it the input items, which is the input from the user, the previous response ID so it can stay consistent with the previous conversation, and give it the task in a context variable. And so this is important later, for it to be able to modify its own task. And this is a nice pattern in the Agents SDK where you can supply objects or context to an agent runner so that when an agent is running, they're not in the LLM's context, but they are accessible by function calls. So the model doesn't see it by default, but it can use it like a memory bank, where it doesn't see the memory, but it can make changes to it. It's just a convenient way. It's a way to represent, like, a closure or just any kind of state that is tracked along with this run.
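The publish function and the background kickoff just described can be sketched with stdlib asyncio. This is a simplified sketch of the pattern under discussion, not the demo repo's exact code: `publish` puts SSE-encoded frames on a shared queue (which the events endpoint drains), and `asyncio.create_task` kicks off the worker without blocking the request.

```python
import asyncio
import json

events_queue: asyncio.Queue = asyncio.Queue()
tasks: dict = {}  # in-memory; you'd persist these in production

def publish(event: dict) -> None:
    """Encode an event as an SSE frame and enqueue it. The /events endpoint
    drains this queue and forwards every frame to the connected front end."""
    events_queue.put_nowait(f"data: {json.dumps(event)}\n\n")

def create_and_start_task(worker) -> str:
    """Create a task record, tell the front end, then kick off the worker as
    an asyncio background task and return the ID immediately."""
    task = {"id": f"task_{len(tasks) + 1}", "status": "in_progress"}
    tasks[task["id"]] = task
    publish({"type": "task.created", "task_id": task["id"]})
    # asyncio's "task" fits the analogy: the worker runs in the background
    # while this function returns right away.
    asyncio.create_task(worker(task))
    return task["id"]
```

Note that `create_task` must be called from inside a running event loop, which is the case inside a FastAPI route handler.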
And then, like we had before, for all the events, we filter out all of them that are response events and publish them down to the front end so it can actually render them. And we also set max turns to a hundred. You know, you can set this to any value you want. Great. And so this will run until the agent is finished, and then we'll just set the task to done and push down an update. And you'll see this pattern a lot, where I make an update to the task object, and then I push that same update down to the front end. And that's just to keep them synchronized, so the tasks that we represent in the back end are the same as the ones we represent in the front end. So this was a lot. Right? But this is actually everything that we have. We have the tasks endpoint, which actually kicks off a task and returns the task ID. We have the events endpoint, which connects the front end to the back end and streams all of the events from all the tasks. And we have the actual worker, which takes the runner, runs it in streaming mode, and then publishes all the events so the front end can have them. Now these are the components. I'm gonna go back to the diagram so we can see what this looks like. This is what we have made. Right? We have made a system where you can start tasks, give them to the task queue, they'll run the agent, and then we'll forward the events back down. So let's take a look. I'm just gonna refresh everything just so we have the best shot. Here on the left, I'm running the front end. So I'm gonna run it again. And here we have the back end. Let me just make sure I'm not lying to you and I run the actual one. Server. And right now, I'm importing an agent, but since we already defined one that was, you know, so nice, let's just use this one. So I can really just grab this and pull it in. Now, this isn't the best practice, but I'll change the name to, you know, my agent. And now from here, I can import it. Amazing.
So, you know, this is all live. It might work. It might not. But, essentially, what we did here is take the agent that we defined earlier, run it, and see it in the front end. So, server queue. Did I also bring in the, ah, let's kill this. Let's rerun that. Go back to our front end. Rerun this. Okay. Now if everything happened correctly, and I wired up the front end separately. I'm not gonna be going through the front end because that code is kinda all over the place, and I don't wanna be switching languages. But, essentially, when I connected to the website, we can actually see in the servers that it sent a get-events call from the website to the back end, so it starts receiving events. And right now, we haven't streamed anything. But now when I hit start investigation, it'll supply the context, start a task, and you can see that the events are actually streaming from the back end. So we're gonna get user data because we supplied everything together, and then it's probably gonna tell me it can't do something because we didn't give it the right tools. Yeah. It gave me some response because, again, we didn't design this agent for this website. So let's actually design an agent for this website. And for this, I'm gonna go into this agent that I've defined earlier. So if we just go through the prompt: you know, you're a helpful assistant. We're gonna ignore the to dos for now. Running in noninteractive mode, final output. Great. What we're saying here is just keep going until you're finished. And we're gonna delete anything about to dos. I'll get to that in a second. But now if we refresh everything, so sorry for all the motion sickness going back and forth, then we should actually be using this new agent. And so what can this agent do? You know, get weather, search open tickets, read document, get runbook by category, search policies, get emails, add ticket, write document.
We specify these functions up here, and I made a mock API that essentially just returns mock data for all of them. This is just an extension of what we were doing before, and in reality, obviously, instead of hitting a mock API, you'd hit a real API; everything would work just the same. So let's give this a shot. If we start here, it starts a task. Ignore the progress for now; that's the next thing we're going to be doing. But we should be able to see it reasoning. And since we're streaming all the events, we can actually run multiple tasks in parallel. So here we can see the reasoning happening: get user data, it's loading it. And for this one, it's going to go through a different process. So, is it going to give up immediately? It probably will. Oh, did I not switch it out? Oops, my bad. Let's go here, and let's go back from using this one to the server agent. Once again, I'm just going to do this one. Great. Okay. So now it's actually going through the functions that we defined, going step by step. And we can see some progress, right? But it would be nice to have more insight. So here's where we get back into the product sense: how do we surface progress to the user? Chain-of-thought events are a nice way to do it, and there are other ways to render it. But there's this cool pattern where you make to-dos, that is, you give the model a function to surface progress to the user. So how would we implement this? Well, we can take the task object and add a to-dos field. Once we add this to-dos field, for now, it just does nothing. But the magic happens when we give the model functions to update this to-dos field. So this is a new function just like the ones we were implementing before, but it's a little meta: it actually uses the task context we passed before, and the model can supply the actual to-dos that we want to run.
And so for each of the items in the to-dos, we create a new to-do, add it to the actual task, and then publish it back down. And then we also give it a function to just check off to-dos. So what have we done? We've declared these two additional functions, and then we're going to add them to our agent: add to-dos and set to-dos. This is a very tiny thing, but it actually feels very magical when it works. So let's take a look at that. And maybe we want to add back the thing I removed here. Cool. Now everything about to-dos is there: always create a plan with to-dos, always start by setting the to-dos, and then check them off as you go. Cool. Fingers crossed. So let's try the same task again. Now, if we're lucky and the model does what we want it to do... the suspense is killing me... ta-da. Now we have some to-dos, and we can see them in the front end. They can actually drive our progress in the UI. And as the model goes on, we get this kind of magical insight into how far along it is, without having to build any monitoring system or any other pieces. We can just have the model go through and check them off. And just like before, we can kick off multiple in parallel. I'm actually not the biggest fan of chat interfaces; I love just clicking. And so this way of just grabbing all the context and shoving it in, so that the model has everything it needs, is, in my opinion, a very delightful user experience. Anyway, now we can see that it's gone through. Now, you may be wondering why you don't see the function calls that check off the to-dos. The answer is that I am explicitly filtering them out in the front end. And this is how you keep some of the magic, right? If you just don't show the user how it's doing it, it'll just keep going, and it'll look more natural without the checking-off of the to-dos.
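A sketch of that to-do pattern, assuming a simple `Task` object and a `publish` helper. In the demo these are Agents SDK function tools that pull the task out of the run context; here the task is passed explicitly (and `publish` just records updates) so the sketch stays self-contained, and the names are my own.

```python
import uuid

UPDATES = []  # stand-in for pushing updates down to the front end

def publish(update):
    UPDATES.append(update)

class Task:
    def __init__(self):
        self.id = str(uuid.uuid4())
        self.todos = []  # the new field that drives the progress UI

def set_todos(task, items):
    """Function tool: the model supplies its plan as a list of to-do texts."""
    for text in items:
        task.todos.append({"id": str(uuid.uuid4()), "text": text, "done": False})
    publish({"task_id": task.id, "todos": task.todos})  # mutate, then sync the front end
    return "todos set"

def check_off_todo(task, todo_id):
    """Function tool: the model checks an item off as it completes it."""
    for todo in task.todos:
        if todo["id"] == todo_id:
            todo["done"] = True
    publish({"task_id": task.id, "todos": task.todos})
    return "checked off"
```

Registering both functions as tools, plus a prompt line like "always start by setting the to-dos, and check them off as you go", is all it takes: the progress view is driven entirely by the model calling these two tools.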
But, yeah, I want you to maybe take a beat and take this in. What does this mean for your product? If you have anywhere that would benefit from these long-horizon tasks, anywhere you want goals fulfilled, or anything that requires multiple, more open-ended steps, this is actually quite useful; these patterns of running tasks in the background and showing progress really come in handy. And so if we go back, there's a whole other bit about delegation. I'm not going to go into it too much, but maybe we can talk through it really fast. Delegation is the notion that, right now, I am kicking off the tasks by hand. We saw that before: we have this task. What if we want the model to kick them off? Instead of clicking ourselves (which I actually love, don't get me wrong), what if we want ChatGPT, or your agent, to kick off tasks while we keep talking to it? We can implement this pattern where you front-load the context gathering: it asks you follow-up questions. If you've used deep research, this is what it feels like. Before it starts off on a long task, it makes sure it has enough information. It starts that task with a function call, which returns immediately, and you can keep talking to it. Meanwhile, the task is running in the background with an architecture similar to the one we built. And you can keep chatting with it; it's non-blocking. This is optional, but it's an approach I quite like. And then once it's done, it can come back and actually update you on it. I included an example of delegation, and I think we have enough time to go through it really fast, so why don't we do that? This one is, once again, text-based; I didn't build out the UI for this. But (I said I wasn't going to talk about it; maybe I will) we need a system to run this in the background, right?
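The delegation pattern just described (start a task via a function call that returns immediately, then check on it later) might look roughly like this. The slow model call is simulated with a sleep, and all the names (`start_task`, `get_task`, `slow_model_call`) are placeholders for your own tools.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# The chat agent gets two function tools: start_task() returns a task ID at
# once, so the conversation stays non-blocking while the heavy work (in the
# demo, an o3 call) runs in the background; get_task() checks up on it.
_pool = ThreadPoolExecutor()
_futures = {}

def slow_model_call(prompt):
    time.sleep(0.2)  # pretend this is a long-running o3 request
    return f"answer to: {prompt}"

def start_task(prompt):
    """Function tool: delegate a hard prompt to the background; returns at once."""
    task_id = str(len(_futures))
    _futures[task_id] = _pool.submit(slow_model_call, prompt)
    return task_id

def get_task(task_id):
    """Function tool: check on a delegated task."""
    future = _futures[task_id]
    if future.done():
        return {"status": "done", "result": future.result()}
    return {"status": "running"}
```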
Like, if I just make a function call that calls o3, for example, while on the front end I'm talking to GPT-4.1 mini. If I don't have this enabled, and, for example, I go here and I say... and just to remind you what I have on screen: I have this agent with no tools set right now, and it's just 4.1. Maybe let's set these two tools. Right, this is all I have. What are these tools doing? This function just calls the responses API with our input, and then get task can retrieve it. So right now, this is not super interesting, because I can ask it to do something hard, like: write a poem where every word starts with the next prime number, if it was indexing into the alphabet, something like that. If I ask 4.1 mini this, it's probably not going to be able to do it. But I'll say, you know, start a task with this. And by starting a task with this, if we go back, it's actually going to call o3. So let's see if this works. Oh, it really just tried. Okay, I'm going to add to the instructions: if the user asks you something too hard, start a task with it. Cool. So let's try that again. I don't want to type this out, so I'm going to copy it. Oh, it actually... maybe it did a good job. Wow, 4.1 mini is a good model. But for this example, we really just wanted it to call the function. So it starts the task. And now what is this task doing? We are waiting for a response from o3, and o3 is going to do this very, very carefully. Oh, it actually did some of the work for it. Wow, 4.1 is a really good model. But we're blocking. We're just sitting here; we can't keep talking to it until o3 is done. However, if I enable background mode, then I can do a very similar thing, but now it should return immediately. And so I'll say this. It should start the task, and then we can... why is it not happy with me? A 500.
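The blocking-versus-background distinction being demoed comes down to one flag on the Responses API call. As a sketch, here are the two request shapes, built as payloads only (nothing is sent, so no API key is needed); the polling helper takes the `retrieve` function as a parameter so it can be exercised without the network, and the field names reflect my understanding of the Responses API.

```python
import time

def blocking_request(prompt: str) -> dict:
    # client.responses.create(**blocking_request(...)) would block until o3
    # finishes, which is the "we're just sitting here" case above.
    return {"model": "o3", "input": prompt}

def background_request(prompt: str) -> dict:
    # With background=True the call returns a queued response immediately;
    # you poll for the result with client.responses.retrieve(response_id).
    return {"model": "o3", "input": prompt, "background": True}

def wait_for(retrieve, response_id, poll_s=2.0):
    """Poll a background response until it leaves the queued/in_progress states."""
    while True:
        resp = retrieve(response_id)
        if resp["status"] not in ("queued", "in_progress"):
            return resp
        time.sleep(poll_s)
```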
Okay, we launched yesterday; we're going to ignore this for now. But I think what I want to show you here is: if you have a background system, whether you depend on OpenAI for it or you implement it yourself like we just did with the Agents SDK, then you can use function calls to essentially, not hand off, but delegate a task that will happen in the background, and then just check up on it or push updates to your main agent. And so the experience for the user can be non-blocking: they can just keep talking with it, and it'll keep serving them. Sadly, I couldn't show you that right now. Maybe let's see, maybe I do have a simple server. Should we try this? Oh, no. Sorry, this one blocks. We could do it with the other one, but I don't really want to fumble around too much live; I've done that a bit too much. We would have to implement the task. It's fine, we'll skip it. But hopefully you can believe me: we have this back-end system, and so what you can do is have a function to start a task, which returns immediately, and you'll be off and running. So I want to end this here so we can get straight to questions. There are a couple of things we wanted to talk about that we didn't get to; we'll be sharing more resources after. But yeah, shall we jump into questions? Let's do it. If you bring the deck back over, I actually dropped them in for you. Can you update it? Refresh? Yeah. So: what are the most efficient ways to orchestrate sequential and conditional tool calling? I get this question sometimes, and it's always an interesting one, because the whole point of agentic tool calling is that you don't have an expectation of something sequential or conditional. You want the model to do whatever it needs; you just want it to figure out how to get to a solution. That being said, if you do want something conditional or sequential, just use Python, right?
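To illustrate that "just use Python" answer: instead of prompting the model to make three calls in a fixed order, expose one composite tool that runs the sequence itself. The three step functions below are invented stand-ins for your own APIs.

```python
# Three steps the model would otherwise have to orchestrate over three
# round trips (higher latency, three chances to go off-script).
def fetch_user(user_id):
    return {"id": user_id, "name": "Ada"}

def fetch_open_tickets(user):
    return [{"ticket": 1, "user": user["id"]}]

def summarize(user, tickets):
    return f"{user['name']} has {len(tickets)} open ticket(s)"

def investigate_user(user_id: str) -> str:
    """One tool call that runs the fixed sequence deterministically:
    fetch user -> fetch tickets -> summarize."""
    user = fetch_user(user_id)
    tickets = fetch_open_tickets(user)
    return summarize(user, tickets)
```

The model now sees a single `investigate_user` tool; the sequencing and any conditionals live in ordinary code, where they are cheap and reliable.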
Code is an amazing way to express sequential and conditional things. If you want o3 to always call three functions in a row, instead of hoping it calls those three functions (which is also higher latency and more expensive for you), just put them all in one function and have that function do the three things you wanted. So that's kind of my answer there: Python is very powerful; use code when you want this sort of task. You can also enumerate the sequence in the prompt; o3 is really good at following those instructions. How should memory be managed in agents handling long-horizon tasks? Another good question. There are many, many different ways. Its own context is a pretty decent memory bank. You can also have explicit memory systems. In a very similar way to how we were creating and checking off to-dos, for state that isn't in context and the model can't otherwise see, you might want to implement a way for the model to remember things: to save facts and then recall them later if necessary, either with a vector store or something similar. We actually did something very similar in the original version of ChatGPT's memory, where it could explicitly choose to save a fact based on what you said, and then, at a later time, bring up facts similar to a prompt. And the way you can do that is just with these vector comparisons, embedding comparisons. I won't get too much into it, but there are many, many ways, and an external memory store is useful. How many tools is too many tools? You can take it up a lot. I try to stay under 20-ish; that's a very rough heuristic. But at that point, it's not so much about whether the model can even handle it; it's more like, what are you really describing? And so the Agents SDK, which I didn't get into here, has this notion of handoffs.
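A toy version of that save-fact / recall-facts memory idea: a real system would embed text with an embeddings model and keep the vectors in a vector store, but a bag-of-words vector and cosine similarity keep the sketch self-contained; all names are invented.

```python
import math
from collections import Counter

MEMORY = []  # list of (fact, vector) pairs; a vector store in a real system

def embed(text):
    """Toy embedding: a bag-of-words count vector (not a real embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def save_fact(fact):
    """Function tool: the model explicitly chooses to remember something."""
    MEMORY.append((fact, embed(fact)))

def recall(prompt, k=1):
    """Bring the saved facts most similar to a new prompt back into context."""
    query = embed(prompt)
    ranked = sorted(MEMORY, key=lambda item: cosine(query, item[1]), reverse=True)
    return [fact for fact, _ in ranked[:k]]
```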
And if you check out the agents and assistants build hours, we get into it more. But essentially, it lets you specify multiple different agents that can pass off, or hand off, a conversation between each other, where the one that has the most appropriate tools gets it, and gets it in a way that doesn't have to be rerouted each time. So: 20-ish, maybe. Can you use OpenAI-hosted tools together with your own custom functions? Yeah, of course. This is actually a really great pattern. I didn't implement it here, but if you want, for example, to use code interpreter to analyze results from database queries, you can absolutely do that. You can have functions that retrieve certain information, or load in certain numbers or some table, and then the model can choose to do that, use code interpreter to run the analysis, and then give you the results. So yes, it's actually highly recommended to use these together. Does the responses API natively support MCP? As of yesterday, yes. As of yesterday, you can add any remote MCP server to the responses API, and it'll make those remote calls. And this is where the magic of background mode actually comes in. Usually, the only reason the responses API has to come back to you before the agent is done is when it wants you to run your own local functions. But if you have no local functions, if you've implemented everything with a remote MCP server, then you can actually just run it, set background to true, and just kind of forget it and check in later. That can be essentially one very long responses API call that does all these different things: MCP, file search, image generation, etcetera. Okeydokey. Oh, I just realized this one. This is cute. Cool, I had to add that in. This is really sweet, and everybody in the room answering questions, this made them smile.
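Putting that together, a single background Responses API call that leans on a remote MCP server plus a hosted tool might be configured roughly like this. This is a payload only (nothing is sent); the server URL and label are placeholders, and the exact field names reflect my best understanding of the API's MCP tool type.

```python
def build_request(prompt: str, mcp_url: str) -> dict:
    """Build kwargs for client.responses.create(...) with no local function
    tools, so the whole run can happen without coming back to your process."""
    return {
        "model": "o3",
        "input": prompt,
        "background": True,  # fire and forget; check in later via retrieve
        "tools": [
            {
                "type": "mcp",               # remote MCP server (assumed fields)
                "server_label": "my_server",
                "server_url": mcp_url,
                "require_approval": "never",
            },
            {"type": "image_generation"},    # a hosted tool alongside MCP
        ],
    }
```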
So I'm glad that this has been helpful, and I really appreciate you taking the time to "teach us how to agent." I love that. This is cute, I like it. Are there any more? Was that the last one? That was the last one. And we are running perfectly on time. So if you go to the next slide there, on resources. Thank you, Ilan, for blessing us with this wonderful hour of building and doing lots of things live that we haven't built with before. No, we're blessed by everyone building all of this; this is all possible because of everyone at OpenAI. Yeah, this is true. What a happy note to end on. We're going to follow up with some resources. We're going to share the GitHub, which has all of the repos from prior build hours. You can see upcoming build hours on the landing page, as well as recorded build hours. If you want to spend more time with Ilan, you can watch his previous build hours, where he talks about assistants and agents. And then we'll also send out a link to a practical guide for building agents; a few questions that came in through the chat would find a really good starting point there. Our next build hour is going to be next week, and it's all going to be about ImageGen and the API. We have lots of really exciting demos to go through, and a customer story that we'll share, and we're looking forward to seeing you in more build hours. So thanks for tuning in, and happy building.