Video: Build Hour: AgentKit | Duration: 2788s | Summary: Build Hour: AgentKit | Chapters: Welcome to OpenAI (2.8799999s), Agent Classification System (256.72s), Prompt Generation Tools (449.09s), Information Gathering Agent (780.29s), Agent Workflow Overview (1132.235s), Deploying Workflow Interfaces (1224.76s), Customizing Chat Interface (1301.245s), Evaluating Agent Performance (1414.355s), Evaluating Agent Performance (1544.45s), Evaluating Multi-Agent Systems (1859.84s), Eval Best Practices (2009.1001s), Real-World Agent Examples (2113.95s), Q&A: AgentKit Features (2390.75s), MCP Server Implementation (2518.195s), Branching Agent Logic (2591.6s), Multimodal Agent Capabilities (2672.7002s), Conclusion and Resources (2707.525s)
Transcript for "Build Hour: AgentKit": Alright. Hi, everyone. Welcome to OpenAI Build Hours. I'm Tasha, product marketing manager on the platform team. Really excited to introduce our speakers for today. So myself kicking things off, Samarth from our applied AI team on the startup side, and Henry, who runs product for the platform team. Awesome. So as a reminder, our goal here with Build Hours is to empower you builders with the best practices, tools, and AI expertise to scale your company, your products, and your vision with OpenAI's APIs and models. You can see the schedule down here at the link below, openai.com/buildhours. Awesome. Our agenda for today. So I will quickly go over AgentKit, which we launched just a couple weeks ago at DevDay, then hand it off to Samarth for an AgentKit demo. Henry will then run through evals, which really help bring those agents to life and let us trust them at scale. If we have time, we'll go over a couple of real-world examples, and then we're definitely leaving time for Q&A at the end. So feel free to add your questions as we go through. Awesome. So let's do a quick snapshot of what building with agents was like for the last several months or even a year. It used to be super complex. Orchestration was hard. You had to write it all in code. If you wanted to update the version, it would sometimes introduce breaking changes. If you wanted to connect tools securely, you had to write custom code to do so. And then running evals required you to manually extract data from one system into a separate evals platform, daisy-chaining all of these separate systems together to make sure you could actually trust those agents at scale. Prompt optimization was slow and manual. And then on top of all of that, you had to build UI to bring those agents to life, and that takes another several weeks or months to build. So, basically, it was in massive need of a huge upgrade, which is what we're doing here.
So with AgentKit, we hope that we've made some incremental improvements to how you can build agents. Now workflows can be built visually with a visual workflow builder. It's versioned, so no breaking changes are introduced. There's an admin center called the connector registry where you can safely connect data and tools. And we have built evals into the platform, which even include third-party model support. As Samarth will show us in a bit, there's an automated prompt optimization tool as well, which makes it really easy to perfect those prompts automatically rather than by manual trial and error. And then finally, we have ChatKit, which is a customizable UI. Cool. So bringing it all together, this is sort of the AgentKit tech stack. At the bottom, we have Agent Builder, where you can choose which models to deploy the agents with, connect tools, write, automate, and optimize those prompts, and add guardrails so that the agents perform as you would expect them to even when they get unexpected queries. Deploy that to ChatKit, which you can host yourself or with OpenAI, and then optimize those agents at scale in the real world with real-world data from real humans by observing and optimizing how they perform through our evals platform. Cool. So we're already seeing a bunch of startups and Fortune 500s and everything in between using agents to build a breadth of use cases. Some of the more popular ones that we're seeing are things like customer support agents to triage and answer chat-based customer support tickets, sales assistants similar to the one that we'll actually demo today, internal productivity tools like the ones that we use at OpenAI to help teams across the board work smarter and faster and reduce duplicate work, knowledge assistants, and even doing research like document research or general research.
And the screenshot here on the right is just a few templates that we have in Agent Builder that show some of the major use cases that we're already powering. Okay. So let's make this all real with a real-world example. A common challenge that businesses face is driving and increasing revenue. Let's say that your sales team is too busy outbounding to prospects, building relationships, meeting with customers. We want to build a go-to-market assistant to help save sales time and increase revenue. And with that, I'll kick it over to Samarth to show us how to do it. Great. One of the biggest questions that we get at OpenAI is how do we use OpenAI within OpenAI? And hopefully, this pulls the curtain back a little so you can take a peek at how we actually build some of our go-to-market assistants. We'll cover a few different topics today, like agents that are capable of doing data analysis, lead qualification, as well as outbound email generation. So what I'll do here is move over and share. Great. So we're actually on our Atlas browser. Feel free to download that. I've had a fantastic time using it these past few weeks, and I think it's saved me hours, if not days, worth of time sometimes. And I'm a big fan. Okay. So we'll get started. And when we get into the Agent Builder platform, the first thing that we really see is a start node and the agent node. You can think of the agent as the atomic particle within the workflow that you go in and construct. And behind it is the Agents SDK, which actually powers the entirety of Agent Builder. Whenever we build these Agent Builder workflows, they don't have to live within the OpenAI platform. You can copy this code, host this on your own, and you might even want to take this beyond traditional chat applications and do things like being able to trigger these via webhooks.
So for this example, we have three agents in mind that we're looking to build out. The data analysis one, where we'll pull from Databricks, a lead qualification one, where we'll scour the Internet for additional details, and outbound email generation, where we want to maybe qualify an email with things on a product or a marketing campaign that we're launching. Sound good? Sounds great. I'm on board. Okay. Great. So we'll get started by building our first agent here. Since we have three different types of use cases in mind for what we're actually trying to build, what we wanna do is use a very traditional architectural pattern using a triage agent. So the way that we think about this is that agents are really good at doing specialized tasks. So if we break this question down to the proper sub-agent, we might be able to get better responses. So for this first agent, let's call this a question classifier. Typing is hard. I'll copy over the prompt that we've put in here. I'll just take a quick peek at what this looks like. And really what we're doing here is asking the model to classify a question as either a qualification, a data, or an email type of question. Really, the idea here is that we can then route this query depending on what the model selected as its output. And rather than having a traditional text output, what we want to do here is actually force the model to output in a schema that we recognize and can use for the rest of the workflow. So let's call the variable that the model will output category and select the type as enum. What this means is the model will only output a selection from the list that we provide here. So, from my prompt, I had the email agent, the data agent, and the qualification agent. Great. And real quick, how did you write the prompt? Did you write that all yourself? I know the importance of prompts in steering the agent. How did you come up with that?
I think writing prompts is one of the most cumbersome things that we can do. There's a lot of time spent spinning wheels on it. One of the key ways that I write prompts myself is to use ChatGPT and GPT-5 to create my v0 of the prompts. Within Agent Builder itself, you can actually go in and edit the prompt or create prompts from scratch to use as the bare bones for what you might spin on in the future for your agent workflows. For now, we'll leave it as the one that we pasted in here, but in the rest of this workflow, we'll take a peek at what using that actually looks like. Great. So now that we've actually got the output, Agent Builder allows us to make this very stateful. So for example, I have a set state icon here. Sorry. Again, dragging and dropping can also be difficult. So what we wanna do here is take that output value from the previous stage and assign that to a new variable such that the rest of this workflow is able to reference it. We'll call this category again and assign no default value for now. Using that same value, I can now conditionally branch to either the data analysis agent or the rest of my workflow to handle additional steps I might want to do prior to executing the email or the customer qualification use case. What we'll do here is drag this agent in, and we'll set the conditional statement here to say, if the state category is equal to data, let's see. Oh, it looks like I spelled it wrong. What did I put in? Great. As you can see, there are helpful hints where we're actually able to see what went wrong and really quickly go back and debug that. So here in this case, we wanna see if it's a data question, and if so, we'll route to that separate data agent.
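The classify, set state, and branch steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Agent Builder's or the Agents SDK's actual API: `Category`, `WorkflowState`, `set_category`, `first_branch`, and the node names are all hypothetical, and the enum values are guesses at what the demo's list contains.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(str, Enum):
    """Allowed classifier outputs (assumed values, mirroring the demo's enum list)."""
    DATA = "data"
    EMAIL = "email"
    QUALIFICATION = "qualification"


@dataclass
class WorkflowState:
    """Shared state that later nodes can read, like the 'set state' step."""
    category: Optional[Category] = None  # no default value, as in the demo


def set_category(state: WorkflowState, raw_output: str) -> None:
    """Validate the classifier's raw output against the enum and store it.

    Raises ValueError for anything outside the allowed list, which is the
    kind of mistake a misspelled branch condition would otherwise hide.
    """
    state.category = Category(raw_output.strip().lower())


def first_branch(state: WorkflowState) -> str:
    """First conditional: data questions go to the data agent, the rest continue."""
    if state.category == Category.DATA:
        return "data_analysis_agent"
    return "information_gathering_agent"


state = WorkflowState()
set_category(state, "data")
print(first_branch(state))
```

Constraining the output to an enum means the branch condition compares against a closed set of values, so a typo in either the classifier output or the condition fails loudly instead of silently falling through to the wrong agent.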
And if it's not, we'll probably use additional logic to go in and scour the Internet for those inbound leads that we want to qualify or an email that we want to write. Let's stick with the data analysis agent for now and go over what it's like to actually go in and connect to external sources within Agent Builder and, more broadly, the Agents SDK. What I wanna do here is instruct the model on how to use Databricks and create queries that it can use in concert with an MCP server. So what we've done here is added a tool for the model to be able to go and access this MCP server and query Databricks however it sees fit. If my query is really hard, it might require, you know, joins. Databricks and GPT-5 would be able to use those together to then create a concise query. So since I built my own server, for now I'll add it here. I'll add my URL first. I'll call this the Databricks MCP server. And what I'll do here is choose the authentication pattern. You can also select no authentication, but for things that are protected resources or might live within authenticated platforms, you might wanna use something like a personal access token to do that last mile of federation. So in this case, I'll use a personal access token I created within my Databricks instance and hit create here. Let's give it a second to pull up the tools, and we can see that a fetch tool actually shows up here. What this allows us to do is select a subset of the functions that the MCP server exposes, so the model doesn't get overwhelmed with the choices of potential actions that it can take. So I have that tool there. And I'll go back. One thing I might have missed here is actually setting the model. If I wanted to make this really snappy, what I could do is choose a non-reasoning model there.
But for this one, I really want the model to iterate on these queries and react to the way the results were actually returned to the agent. And so what we'll do here is a quick test query to make sure the piping works. So maybe I'll say, show me the top 10 accounts. That should be good enough. And what we can see is the model actually stepping through the individual stages of this workflow. So in the beginning, you can see that it classified this question as a data question, saved that state, and then routed it. We can see that when it reached that agent and decided to use that tool, it actually asked us for consent to go and take that action. You can configure that logic on the front end to handle how to show the user that the model wants to go and take an action there. With MCP, you're able to do both read and write actions, and we have a few of these MCP servers out of the box. Think, like, Gmail. We have a ton more out of the box that you're able to connect to. SharePoint. SharePoint. Totally. And so here we can see that the model is actually thinking about how to construct that query, and we can see a response here. We didn't ask the model to really format this result for us, but we can really quickly do that with this agent itself, just by asking the model and saying, I would like the results to be in natural language. And just by spinning on the generate button within Agent Builder itself, you're able to make these inline changes depending on the results that you see in real time. Okay. Cool. So the next thing I wanna do is create another agent to do some of that research that we mentioned might be useful for something like generating an email or qualifying a lead. So we'll call this the information gathering agent. Looks like it's stuck here.
I might have to give it a quick refresh in a moment. Let's see. Platform's a bit funny. Great. Cool. So we're at this information gathering agent, and what we wanna do is tell the model how to actually go and search the Internet for the leads that we want. Particularly, we're looking for a subset of the information that might be publicly available for a company. So think about, like, the company legal name, the number of employees they have, the company description, maybe their annual revenue, as well as their geography. And what we wanna do here again is use a structured output to define what our output should look like when the model goes in and searches the Internet for this. This gives us a good mapping, lets the model itself know what to look for when it's writing these queries, and lets us instruct the model on the way that it should search across the Internet. Great. What we wanna do here is also change the output format to the schema that we want to enter. We want to put the fields that we just showed into a structured output format. You can also add descriptions in the properties, but for now, we're gonna leave those blank. Great. So now when the model goes to this information gathering agent, it will search the Internet and output in the format that we're looking for. Cool. Since we saved the state of the query routing in the beginning, we can go ahead and reference this again when we're going to route to the email agent or to the lead enhancement agent. So what we'll do here is set this equal to email, and then, otherwise, we'll just route it to the other agent. Awesome. Yeah.
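As a rough sketch of the structured output and second routing step described above, the following Python validates a JSON reply against the five fields mentioned in the demo and then branches on the saved category. The field names, their types, `CompanyProfile`, `parse_profile`, `second_branch`, and the node names are assumptions made for illustration; the demo does not show the actual schema, and the sample company is invented.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class CompanyProfile:
    """Fields the information gathering agent is asked to return (assumed names)."""
    legal_name: str
    employee_count: int
    description: str
    annual_revenue: str
    geography: str


def parse_profile(raw_json: str) -> CompanyProfile:
    """Parse the agent's JSON output and reject missing or unexpected keys.

    Only the key set is checked here; value types are not enforced.
    """
    data = json.loads(raw_json)
    expected = {f.name for f in fields(CompanyProfile)}
    if set(data) != expected:
        raise ValueError(f"missing or unexpected fields: {sorted(set(data) ^ expected)}")
    return CompanyProfile(**data)


def second_branch(category: str) -> str:
    """Second conditional: email questions go to the email agent, the rest to lead enhancement."""
    return "email_agent" if category == "email" else "lead_enhancement_agent"


# Made-up example reply; not real company data.
reply = json.dumps({
    "legal_name": "Example Corp",
    "employee_count": 250,
    "description": "A fictional software company.",
    "annual_revenue": "unknown",
    "geography": "North America",
})
profile = parse_profile(reply)
print(profile.legal_name, "->", second_branch("email"))
```

Fixing the schema up front gives the downstream email and lead enhancement steps a predictable shape to read from, whichever branch runs next.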
And the sub-agent architecture is great because it means that you get better quality results a bit faster than you would just using one general-purpose agent, which is helpful for actually having impact and helping the sales team be more productive. What we'll do here is paste in a prompt for this email agent. But really the highlight for the email agent is that we're looking to generate emails that are not just from information from the query or from the Internet, but we also wanna upload files that might map to the way that we're actually thinking about building emails in general for marketing campaigns. So what you may have in this case is something like PDFs that contain information on what the campaign is. Maybe you have other PDFs that contain information on how you should write emails. All of this is really useful information for the model in order to spec out what that email should actually look like. So what we'll do here is add a tool to go and search these files. You can attach vector stores that you may already have to the workflow and use those out of the box. You're also able to add these via API. But for now, what we'll do is just drag in a couple of files that we have. We have one that's a standard operating procedure for how to write emails, and we have another document on a potential promotion that the sample company has. And what we've done is allowed the model to then go in and search the vector store for this type of information in order to generate that email. On the lead enhancement agent, instead of writing a prompt ourselves, let's pretend we have a general segmentation of the market that we want to assign various account executives to. So in this case, what we want to do is essentially output a quick schematic of how we're gonna do that assignment process depending on the information that was gathered from the Internet.
And without writing a prompt, Agent Builder will be able to output an entire version of that prompt as a starting point. Super cool. Great. Before I move away from Agent Builder and show this working end to end, what I wanted to show is that Agent Builder doesn't just support text and structured output formats. We also support really rich widgets. So what this looks like in practice is that, instead of outputting text or JSON, we can upload a widget, and I'll show you in a little bit what it looks like to actually create a widget and use a widget. But we can actually go in and upload a widget file itself. So I'll drag this in here. Or maybe I have to. Great. So we can see a quick preview of what this widget looks like. Rather than just outputting text, where traditionally, like in ChatGPT, what you see is a markdown-formatted result, we wanna maybe render something richer such that if you do host this on your own website, you're able to have that multimodal component as well. So what we'll do here is create this component. And now if I say, draft an email to, should we use OpenAI? OpenAI. Sure. Great. So you can see that it went to the information gathering agent. Since we've given access to the web search tool, did we do that? Let me make sure that I did that. I may have skipped that step. Here we go. Great. So again, sorry? I was just gonna say, I love that you can test the workflow live here and debug it like we're doing before going to production. Totally. And the really nice thing is as you run questions through this workflow, we save the traces of exactly how the model has executed various queries and then, more holistically, the way that the workflow has orchestrated. So this is really rich information as you continue to iterate on your workflow.
And, Henry, you'll touch on this a ton, but the ability to really peel back the curtain and see how the model is thinking about this and then assign graders, I think, really allows you to scale out this process of evaluations as well. Yeah. So great. Looks like here it's searching for the monthly. We'll let this run for a little bit and see what happens at the end. Okay. It looks like it might take a little bit to do that. We'll get back to that one. End to end, what we've built here is essentially an agent that allows you to do three different things. The first one is that it allows you to go and query Databricks to pull in that in-VPC information, information that might live behind some form of information wall, and be able to pull that within the agent workflow itself. And then alternatively, being able to write emails and also qualify inbound leads that you might get from customers. All this lives within a workflow that you can then host within ChatKit, which we'll cover, or you can take this out and use it in your own code base to handle what those chat workflows actually look like. Super cool. One of the questions I was wondering was, what's the difference between pulling a tool from the left-hand sidebar, like dragging and dropping that in as a node, as opposed to adding that tool into the agent node specifically? Totally. Great question. So when I added the search tool to the information gathering agent, I allowed the model to determine if it should actually go in and use that tool. Sometimes I always want the tool to run prior to an agent actually getting that information. So I can add one of these nodes to ensure that this action happens prior to the agent actually receiving that information. Makes a ton of sense. Yeah.
So AgentKit then, I feel like, is a good combination of deterministic and, if you want, somewhat non-deterministic outcomes. Yeah. Cool. Great. I wanna pivot. So we built this amazing workflow. Now we want to go in and deploy it. I think one of the most fantastic things that we released at our most recent DevDay was the ability to go in and host these workflows that you've built. So using the workflow ID that we've gone in and built, we're able to power these chat interfaces that might otherwise require a ton of engineering to support things like reasoning models, as well as complex agent architectures and the handoffs that you might want to show to users. What this looks like in production is that you're able to match the entirety of your brand guidelines to the actual chat interface that you're building, and we'll take a peek at how some of our real customers are using this today. But really, I wanted to highlight the fact that you can entirely customize this: the color scheme, the font families, as well as the starter prompts that your users might go in and use. Say, for example, we have a workflow that looks at our utility bills, where we might want it to go and connect to an MCP server, pull up your billing history, analyze those past bills, and then show a really rich widget to the user. The entirety of that process and the customization of what the user sees is entirely configurable through ChatKit. So here, for the question, how's my energy usage, rather than just showing a traditional text response, we see a really rich graph that allows you to visualize the output. This is super cool. Yeah. And I think for our use case example, just to drive it home, one of the widgets that we have available that maybe you'll show us shortly is an email widget.
So if you wanted the agent to actually draft an email to OpenAI, which I think it's still researching information for because there's so much public information out there, then sales can just click that button to have that email sent to the customer. Totally. Yeah. Let's take a look at what a few of those widgets could be. So we've released a gallery where you can take a peek at some of the ones that we think are really cool. You can also click into these and see what the code is to actually build these. But what I think is really cool is being able to generate these through natural language. Like, for example, if I wanted to mock up an email component or widget that follows some specific brand guidelines or formatting that really appeals to my brand, I'm totally able to do that via natural language. And so using this, you can then export that into Agent Builder and then show that UI when Agent Builder invokes that widget in ChatKit. Okay. Great. Before moving on to Henry, I wanted to show an example of what this looks like in real life. We have a website here that renders a globe, or a picture of the earth. And what we wanna do is be able to control this globe through natural language. So where should we go today, Tasha? Well, I think our next DevDay Exchange is in Bangalore, so I'm gonna say India. Let's go to India. So what we should see here is another Agent Builder powered workflow, but we can see that not only did a widget populate on the right side, we were actually able to control the JavaScript that was rendered on the actual website itself. So being able to have this customizability and portability into the websites and browsers that you use every day is something that we find really fascinating with ChatKit. That's the fastest trip to India I've ever taken. Totally. Awesome. So we covered the build side as well as the deploy side with ChatKit.
What really is the most important part, and the hardest part of building a lot of agents, is the evaluate part. Yep. That's how we know that we can trust the agents in real-world scenarios, in production, at scale, with all of the glorious and weird edge cases that come up. So with that, I'd love to hand it over to our friend in the UK, Henry, who can walk us through an evals demo. Thank you so much, Tasha and Samarth. And hi, everyone. I'm Henry. I'm one of the product managers who worked on AgentKit. And so today, once you've built that agent, once you've got that workflow and you've defined it in the visual builder, I wanna talk about how you can test it. I wanna talk first about how you can test an individual node and get confident that that specific agent or that specific node is gonna perform as you want it to. Because ultimately your agent is only as good as its weakest link. You need every single component to be dialed in and performing how you want it to. Once you've got every one of those nodes in a place you're comfortable with, you then wanna be able to assess the end-to-end performance. After that you can look at traces, but traces are hard to interpret. And so now we have a trace grading experience too that allows you to take those traces and evaluate them at scale. So let me pull up my screen and start talking you through a bit of a demo and show how we can do this. So here you can see an agent that I built. This is based on a real example from one of our financial services customers. This takes an input of a company name, it assesses whether this is a public or private company, and it completes a series of analyses on that company before ultimately writing a report for one of the professional investors of that company to review. So as I mentioned, you have a whole bunch of agents here, and every single one of these agents needs to perform well and needs to perform as you want it to.
And so, how would you get confident in the performance that it's gonna do that? How do you get visibility and, and kind of transparency into how it's gonna perform? So when you're defining this agent and you're looking into one of these nodes, you can see there's an evaluate button here on the bottom right. So you click that evaluate button. That's gonna take that agent node which has a prompt, it has tools, it has a model assigned and it's gonna open it in a dataset. So here you can see this data set UI and this allows you to visually build a simple eval. And so I'm going to now attach, just a couple of rows of data into this eval. You can see a company name and then you can see some ground truth revenue and income figures as well. So I've imported that to this data set and that's gonna allow us to run this eval. So here you can see everything that was passed through from the visual builder. You've got the model, you've got the tool of web search, you've got the system prompt and the user message that we had assigned and then you can additionally see this data that I uploaded. So this is just three rows, a couple of company names and then some ground truth values for the revenue and income figures that our web search tool should return for those, those companies. So what I can do now, I can run the generation. So this is obviously the first stage of any eval is to run generation and then once you completed the generation then you complete the evaluation stage. So while that generation is running, I wanna show how we can attach columns. And so here we can add new columns for, let's say, ratings where we can attach a thumbs up and thumbs down rating. And then let's additionally add columns for free text feedback. So this is where I can attach kind of a free text annotation. Maybe I'm happy with something, maybe I want to attach some kind of longer form feedback on that data as well. And so what you can see now is that this output is coming through. 
And if I click into this, I can tab through these generations that have been completed. So you can see here I was asked to complete some analysis of Amazon, of Apple, and then of Meta as well is still running. And I can scroll through that and I can see the generation that was completed. So what I can then do is I can attach these free text labels or attach these annotations, sorry, that I just created. So I can say this one's good. So maybe this one's bad. I can say maybe this one's good. And then I can attach feedback. I can maybe say, this is too long, for example. Now once I've done those annotations, I can also add graders. So let me add a grader here. I'm gonna just create a simple grader that's gonna evaluate a financial analysis and it's gonna require that this financial analysis contains upside and downside arguments that it considers competitors but it ends with buy, sell, or hold rating. So I'm gonna save that and I'm gonna run it and this is now gonna run through, in fact let me just let me just change that. Okay. Let's just leave that. So that's now gonna run through and complete those kind of grader ratings. So that's gonna take a little while to run through because we've got a lot of data in there. So I'm gonna tab over to a dataset that I created earlier where you can see these graders have now completed. If I click into these, I can see the rationale. I can see why the grader has given the result that it has done. So here you can see for example, this grader has failed because there's no explicit recommendation and there's no competitor comparison. So what we could do at this point, now just maybe recap even where we are. Here we've got those generations that have been completed. We've got all those annotations and we've got all these grader outputs. What do you do at this point? How do you make your agent better? 
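To make the grader criteria above concrete, here is a deliberately simple stand-in written in Python. The grader in the demo is an LLM grader configured in the platform; this sketch swaps in plain keyword checks so it can run on its own, and `GradeResult` and `grade_financial_analysis` are hypothetical names. It checks the same criteria (upside and downside arguments, a competitor comparison, a closing buy, sell, or hold rating) and returns a rationale for each failure, similar in spirit to the rationale shown in the UI.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class GradeResult:
    passed: bool
    reasons: List[str]  # one entry per failed criterion


def grade_financial_analysis(text: str) -> GradeResult:
    """Keyword-based approximation of the rubric; not an LLM grader."""
    lowered = text.lower()
    reasons = []
    if "upside" not in lowered:
        reasons.append("no upside argument")
    if "downside" not in lowered:
        reasons.append("no downside argument")
    if "competitor" not in lowered:
        reasons.append("no competitor comparison")
    # The rating is expected at the end, so only the last non-empty line is checked.
    lines = [line for line in lowered.splitlines() if line.strip()]
    last_line = lines[-1] if lines else ""
    if not re.search(r"\b(buy|sell|hold)\b", last_line):
        reasons.append("no explicit buy/sell/hold rating at the end")
    return GradeResult(passed=not reasons, reasons=reasons)


# Made-up sample output, for illustration only.
sample = "Upside: strong cloud growth.\nDownside: regulatory risk."
result = grade_financial_analysis(sample)
print(result.passed, result.reasons)
```

A keyword check like this is brittle compared with a model-based grader, but it shows the shape of the output: a pass or fail flag plus the specific criteria that were missed.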
So one thing you can do is just do some manual prompt engineering, try to find patterns in that data, and then try to rewrite your prompt. That obviously takes a long time and requires you to find those patterns and to spend a bunch of time trying to solve them. What we see as a better solution is automated prompt optimization. So you can see here there's this new optimize button. If I click that, it's going to open a new prompt tab in this dataset, and that's where we're gonna automate the rewriting of the prompt. And this is how you save yourself having to do that manual prompt engineering every time. So this is where we're taking those annotations, we're taking those grader outputs, and we're taking the prompt itself, and we're using that to suggest a new prompt. And again, this will take a minute or two to run through. So I'm going to tab here to one that I made earlier, and you can see here the rewritten prompt that completes a fundamental financial analysis but is much more thorough and complete than the initial, pretty scrappy and rough prompt that I had written. So that's an overview of how you can take that single node from Agent Builder and how you can robustly evaluate that single agent. But we're not building a single agent here. This is a multi-agent system, and we wanna test every one of the nodes individually, but ultimately what we care about is that end-to-end performance. So how do we get confident in that? How do we test that? So as Samarth mentioned, these agents emit traces. And here you can see some example traces from when I previously ran this agent. So clicking through this, I can see every span, I can click into every span, and I can start to identify what happened when this agent ran. Now as I'm clicking through this, I might start to notice problems. For example, here you can see there's a bunch of sources that have been pulled by the web search tool. For example, CNBC and Barron's.
Maybe we don't want these third party sources to be cited. Maybe we want only first party authoritative sources. So we should say web search sources should be first party only. Let's just run that with GPT-5 nano so it's nice and fast. And then as I click through more of these, I might find additional problems. Let's say we identify another pattern, that the end result doesn't contain a buy, sell, hold rating. So we say end result needs to contain a clear buy, sell, hold rating. And again, I'm building up these requirements that I can then run over specific traces. And now this set of requirements, you can think of as like a grader rubric. And this grader rubric is built up with a series of criteria that define a good agent. And then once I've got that set of criteria built up and I've tested it on a couple of traces, I can then click this grade all button at the top here. And this is gonna export the set of traces that I've scoped this to. So in this example, just these five traces. And it's gonna take the set of graders that I've defined on the right. And it's gonna open that in a new eval. And this allows you to assess a very large number of traces at scale, because clicking through every one of these traces and trying to find problems doesn't work that well. It takes a lot of time. It doesn't scale well. But instead, you can run these trace graders over a very large number of traces, and that will help you identify just the spans that are problematic and just the traces that you wanna dive into. So that was an overview of how we have this kind of embedded eval experience tightly integrated with the Agent Builder. I also just wanted to flash a couple of best practices that we've seen from work with a large number of customers now on this platform, and a couple of lessons that we've learned. First, starting simple. Don't overcomplicate things, but do start early. Have a handful of inputs and a simple grader you define right at the start of the project.
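As a rough illustration of the rubric idea above, here is a small Python sketch that runs a set of named criteria over a handful of traces and reports only the ones that fail something. Everything in it is an assumption made for the example: the trace dictionaries, the `example.com` and `example.org` URLs, and the function names. In the product these criteria are written in natural language and judged by a model, not by rule-based checks like these.

```python
import re
from urllib.parse import urlparse

# Hypothetical trace shape: the company's own domain, the URLs the web search
# tool cited, and the agent's final output.
TRACES = [
    {"id": "t1", "company_domain": "example.com",
     "sources": ["https://investor.example.com/q3"],
     "final_output": "Strong cash flow. Rating: Buy"},
    {"id": "t2", "company_domain": "example.com",
     "sources": ["https://example.com/press", "https://news.example.org/article"],
     "final_output": "Mixed signals. Rating: Hold"},
    {"id": "t3", "company_domain": "example.com",
     "sources": ["https://example.com/annual-report"],
     "final_output": "The company looks strong overall."},
]


def sources_first_party(trace) -> bool:
    """True when every cited URL is on the company's domain or a subdomain."""
    allowed = trace["company_domain"]
    for url in trace["sources"]:
        host = urlparse(url).hostname or ""
        if host != allowed and not host.endswith("." + allowed):
            return False
    return True


def has_clear_rating(trace) -> bool:
    """True when the final output contains the word buy, sell, or hold."""
    pattern = r"\b(buy|sell|hold)\b"
    return re.search(pattern, trace["final_output"], re.IGNORECASE) is not None


# The rubric: criterion text mapped to a check (a stand-in for an LLM grader).
RUBRIC = {
    "web search sources should be first party only": sources_first_party,
    "end result needs a clear buy/sell/hold rating": has_clear_rating,
}


def grade_all(traces, rubric):
    """Return {trace_id: [failed criteria]} for traces that fail anything."""
    failures = {}
    for trace in traces:
        failed = [name for name, check in rubric.items() if not check(trace)]
        if failed:
            failures[trace["id"]] = failed
    return failures


for trace_id, failed in grade_all(TRACES, RUBRIC).items():
    print(trace_id, "failed:", failed)
```

The useful part is the output shape: instead of clicking through every trace, you get a short list of the traces worth opening and the criterion each one missed.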
Instead of leaving evals right to the last minute, as in, I'm just about to ship this thing, I better do some testing, which I know some people do, it's much better to start early, embed evals, and do kind of eval-driven development where you're rigorously testing your prototypes, finding problems in the prototypes, and then quantitatively measuring your improvement as you hill climb against your eval. Much better way to build a product and likely to result in higher performance. Secondly, using human data. It's really hard just coming up with hypothetical inputs or using LLMs to generate synthetic inputs. You'll probably get much better performance if you get real user data, real inputs from real end users, because that captures all the messiness of the real world. And then finally, make sure you invest a bunch of time annotating generations and aligning your LLM graders. Because this is how you make sure that your subject matter expertise is really encoded into the system, so that your graders are actually representing what you want your product to do. So that was a high level overview of our evals product. This is all in GA, so we'd love for you to give it a spin, and please let us know any feedback at all. And with that, I'll pass back over to Tasha and Samarth. Thanks, Henry. I feel like we could do a whole hour session on evals. That was awesome. One quick question for you actually, before you step out: how large of an eval dataset do you recommend? We got this from chat. Is it a 100, a 10? How do you know what the right dataset size is to get the results you want? Yeah. So the best thing to do is to get started early. And so even, like, 10 to 20 examples goes a long way. And having that set of data in there to just test your application against is really helpful. So even just, you know, ten, twenty, a couple of dozen rows is helpful. And then as you get closer to production, clearly, the more the better.
But, you know, I wouldn't think of it as a question of just how many rows, because there's kind of a quality times quantity multiplier that you have to consider here. Having, you know, 50 rows of really high quality inputs that are very representative of a large set of user problems, and then having graders that are really aligned with the behavior you wanna see, that can perform phenomenally. But if you use an LLM to generate a thousand rows of synthetic inputs, it's not gonna be that helpful. So I'd say the quality is almost more important than just the quantity. That makes a lot of sense. Yeah. Yeah. And just to add on top of that, like, one of the questions that we get a ton is, like, how do we create a diverse dataset to run evals from? Especially if you haven't put a lot of this tooling into production already. When we're building our go to market assistant, our engineering team that actually supports those workflows sits right next to our go to market team to understand what subject matter experts are actually asking or curious about. This allows us to build a good, diverse set of questions, so that on every iteration that we continue to optimize, we're capturing the nuances and the real queries that people are actually interacting with. Super cool. Awesome. Well, thanks, Sunny. So with that, I'd love to cover a couple real world examples, and then we'll leave some time for our q and a at the end. So our first one here is a short video of a procurement agent that Ramp built. So they used ChatKit to actually visualize this UI to the person requesting software. They used Agent Builder on the back end to actually orchestrate the agent flow, and they used evals to make sure that it would work at scale in production. So while this isn't live on our platform yet, we hope that it will be in the near future, and that was a quick run through of what they actually built in the prototype. Awesome.
So Ramp, with the AgentKit stack, was able to build this prototype 70% faster, which I think is pretty amazing. Equivalent to, like, two engineering sprints instead of two quarters. Rippling, I actually think you worked on this project a little bit. Do you wanna maybe share what they built and how it went? Yeah. Totally. We were initially thinking about, like, how we can spec this out through the Agents SDK. And one of the hard challenges was, like, getting that alignment between subject matter experts as well as, you know, the ability to build workflows that were logically sound. And so we really sat with them to understand what their real go to market use cases were and be able to work backwards from there. Chatting with their team, I think it was a pleasure to use a tool like Agent Builder, and we got a ton of really good feedback on next versions that we're looking to roll out. That's awesome. Similarly, HubSpot, who has been doing a lot of amazing work in the AI space, they used ChatKit to enhance their Breeze AI assistant. If you wanna actually advance. Awesome. Thanks. All good. So, yeah, they saved weeks of front-end time, like we mentioned at the start. Building agents from start to finish is super time consuming because of each of the complex steps involved. So if we can even help with just one of those numerous steps, the UI aspect in this case, that's a useful lift. And then finally, Carlyle and Bain, which were two amazing eval customers of ours. So they were able to see a 25% efficiency gain in their eval dataset, which is fantastic. Cool. Okay. So maybe to round it out before we go over to q and a, when we launched AgentKit, these are some of our early customers who built on the product. And you'll see that AgentKit's currently powering tech stocks, startups, Fortune 500s, everything in between. These are the different types of agents.
There's a real breadth of use cases here, from work assistants to a procurement agent, policy agents. Albertsons, the large grocery retailer, has a merchandising intelligence agent. Bain, code modernization. So really cool to see just the wide range of use cases here. Awesome. With that, we can go to q and a from the chat. Maybe trying to go to the next one. Cool. Okay. So how can I add for loop blocks? Samarth, do you wanna take that? Yeah. Good question. So we don't have a for loop, but we do have a while loop that's available within Agent Builder. You're able to conditionally, continuously run different agent workflows depending on whether the completion criteria has been met. Obviously, with the Agents SDK, you can take it out into a code base and then orchestrate that on your own. Maybe, like, our interpretation of that is, like, v zero. But instead of a for loop, we do support while loops, such that you're able to actually iterate throughout the workflow until that end criteria has been met. Hopefully, that helps. What else do we got? How does AgentKit compare to the Agents SDK? I would say that AgentKit so far is, well, I'll back up a bit. AgentKit is a suite of products that we've tried to opinionate as to the most useful tools that we at OpenAI find from our day to day as we build agents. Agents SDK powers the entirety of AgentKit, and most of everything that you can do within AgentKit, you're also able to do within the Agents SDK, or it's available via an API. So far, we're continuing to roll out a ton of these changes to make that parity happen a little bit closer. But we imagine in the future that AgentKit will also contain, you know, some features that allow you to extend the ability to host these workflows on the cloud. And so rather than using, like, traditional ChatKit implementations, you could also trigger these workflows via an API as well.
This allows you to essentially host the Agents SDK on the cloud. Yeah. Very cool. Yeah. And I would say Agent Builder is like the equivalent of the Agents SDK functionally, but it's the canvas, visual-based way to actually orchestrate those agents, whereas Agents SDK is like the jump straight into the code version of it. So yeah. Very cool. Great. How do you build out of the box MCP servers versus building your own? Yeah. Totally. So we have a few MCP servers. So we support remote MCP servers, which means that the MCP server has to be hosted on the cloud, or hosted on the publicly available Internet to some degree. When we're building our own MCP servers, a lot of the considerations that we have around authentication require us to build our own MCP servers. That said, a lot of the providers that you use every day, like, think Gmail, your calendar, etcetera, those all have out of the box connections, likely, where you're able to just paste in an API key and get started with all the tools that we support. Some of these, I think, you know, we don't have full capabilities to do things like write. So for example, if you want to write an email via the Gmail API, I don't believe that is currently supported, so you might wanna spin up your own MCP server there. The thing I really like about MCP is that it allows for that authentication and black-boxes what that flow actually looks like. So whether you wanna bring your own personal access token, or go through something like OAuth and then pass in that last token that you get to the MCP server, both are totally great options to be able to authenticate to secured sources. Do you have any more questions? Yes. When do you recommend the classifier agent with branching logic to different agents? Yeah. I think this is a great question. It's one that we get a ton because, as you add more tooling and instructions to a model, what we've seen is that the performance generally deteriorates.
Imagine a world where you have 100 tools. Right? Allowing the model to select which one of those 100 tools becomes increasingly difficult. More realistically, you might not have 100 tools, but you might have 20. And each agent, or each use case for an agent, might use those tools in entirely different ways. So one way that I like to think about agents is that I like to stratify the logic: what is a core competency for this agent, what is the set of tools that I want this agent to use, and only in that specific type of way. The moment I start confusing the model on how to invoke these tools, how to interpret the instruction with the context of those tools, I like to branch off to a different agent. So in the cases that we had, where, you know, we're looking at three different GTM use cases, maybe the email agent that we're building, you know, that outputs a widget, is not the best one to also do lead qualification. So those use cases where you're maybe using the same tools, but you wanna structure the outputs a little bit differently, you want the model to interpret the outputs a little differently, it's good to branch out to different agents. Cool. Alrighty. Can we use AgentKit for a multimodal use case, especially for analyzing images and files? Totally. So this is a great use case for AgentKit. We do support file inputs within that preview section that we covered. You're able to even play around in the playground with uploading files. What I find really interesting is that, like, we propagate this behavior to ChatKit as well, or ChatKit propagates that behavior to Agent Builder as well, where if you upload files within ChatKit, that is also passed into hosted Agent Builder back ends. Oh, super cool. Yeah. Okay. So we are at the end here. We would love to leave you with a few resources if you're interested in exploring more. AgentKit docs, super helpful place to get started.
We also released a cookbook the other week that walks you through a very similar use case to the one that we showed today, in a bit more detail even. ChatKit Studio, if you wanna play around with ChatKit and see how you can customize it. And then finally, to learn more about upcoming build hours and past build hours, the Build Hours repo on GitHub. Awesome. And with that, I think we're at a close. If you wanna, right. Okay. Upcoming build hours, we have two. Agent RFT, so building on what we talked about today, how do you actually customize models for tool calling and custom graders and things like that? That will be November 5. So really excited to build on today's session with that next session. And then on December 3, agent memory patterns. So hope to see you at both of those. You can get more information about registering at this link. Awesome. Well, that's it. Thank you so much for putting this awesome demo together. It was super fun. Yeah. Yeah. Thank you all for watching, and I hope you have fun building agents.