Two questions ran underneath this week’s news: can you trust what an AI model can do, and can you trust what it costs? OpenAI took on the first, arguing that test scores are only as good as the setup behind them and opening its models to outside inspection. The money question surfaced everywhere else. Anthropic filed confidentially to go public at a valuation near $965 billion while expanding its push to secure critical software, and Google moved to embed its image models deeper into the tools companies already pay for. Runway and NVIDIA, meanwhile, looked past the screen entirely toward AI that acts in the physical world. Here’s what caught our attention this week.
TL;DR
- OpenAI published guidance arguing that how you set up a model test can swing scores wildly, with GPT-5.5 hitting 92.3 percent versus 69.2 percent on the same challenges, and is opening models’ reasoning notes to outside teams to check for deception.
- Canada confirmed it has access to Anthropic’s Mythos cyber model through Project Glasswing, as the program expanded to around 200 organizations, putting a capability regulators have called a threat to financial stability inside the federal government.
- OpenAI expanded Codex beyond coding with six role-specific plugins (for analysts, marketers, sales, designers, investors, and bankers), shareable interactive Sites, and in-place annotations, as non-developers become its fastest-growing users.
- NotebookLM’s three coming features (Personal Preferences, Connectors, Canvas) point to Google turning its source-grounded reader into a full production workspace.
- Google’s Nano Banana image models hit general availability, already embedded in Adobe, WPP, and Shopify, with Verizon, L’Oreal, and Unilever running campaign work through them.
- Runway joined NVIDIA’s new Cosmos Coalition to build and open-source “world models” for physical AI.
- Anthropic confidentially filed to go public, landing near a $965 billion valuation as both it and OpenAI circle the public markets.
🤖 Agentic AI
Microsoft Build: Scout, “Autopilots,” and a Push to Govern AI Agents
Microsoft used its annual Build conference to make agents the center of its AI story. The headline is Microsoft Scout, described as the first in a new category the company calls “Autopilots,” meaning always-on agents that run in the background, carry their own identity, and act on your behalf without being prompted each time. Scout is pitched as a personal agent for work that lives across the Microsoft 365 apps people already use, connecting to Teams, Outlook, OneDrive, and SharePoint to handle meeting prep, scheduling across time zones, blocking calendar time for deliverables, and flagging risks like stalled decisions. It is built on OpenClaw, the open-source agent OpenAI acquired in February, and on Work IQ, the intelligence layer behind Microsoft 365 Copilot. Scout is available now as an experimental release to Frontier organizations, with access gated behind Frontier enrollment, Intune policy setup, an opt-in attestation, and a GitHub Copilot license.
Much of Microsoft’s framing was about control. Every Scout agent runs under its own governed Entra identity rather than a shared service account, so its actions trace to a known actor, with credentials scoped to the task and kept out of logs, and Microsoft Purview data-protection policies enforced before anything is sent. Sensitive actions can require a human to sign off. Alongside Scout, Microsoft previewed execution containers that run agents inside an operating-system-enforced boundary, positioning Windows as what developer CMO Kyle Daigle called an agent-native runtime, where IT can define an agent’s limits once and have the OS enforce them everywhere.

The developer-facing piece is the Agent Control Specification, or ACS, a new open-source standard for defining what an agent is allowed to do. Policy files spell out permitted and forbidden actions, when a human must approve, and what gets logged, and they are checked at several points while the agent works: before it takes input, before it calls a tool, after a tool returns, and before the final response. A policy can allow an action, block it, redact sensitive data, or pause for sign-off. Because the rules live in a single file bundled with the agent, the same policy follows it across frameworks. ACS ships as an SDK with plugins for LangChain, the OpenAI and Anthropic Agents SDKs, AutoGen, CrewAI, Semantic Kernel, and MCP tools, among others. Microsoft also announced a new MAI family of models spanning reasoning, voice, coding, and images.
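To make the checkpoint idea concrete, here is a minimal Python sketch of a policy evaluated at the four points described above, returning one of the four outcomes (allow, block, redact, pause for sign-off). It is not the ACS SDK or its file format: the `Stage`, `Action`, `Decision`, `POLICY`, and `check` names, the tool names, and the redaction pattern are all hypothetical, chosen only for illustration.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    """The four checkpoints described in the text (names are hypothetical)."""
    BEFORE_INPUT = "before_input"
    BEFORE_TOOL_CALL = "before_tool_call"
    AFTER_TOOL_RESULT = "after_tool_result"
    BEFORE_RESPONSE = "before_response"


class Action(Enum):
    """The four outcomes a policy can produce."""
    ALLOW = "allow"
    BLOCK = "block"
    REDACT = "redact"
    REQUIRE_APPROVAL = "require_approval"


@dataclass
class Decision:
    action: Action
    payload: str  # text that may continue through the pipeline


# Hypothetical policy, standing in for a policy file bundled with the agent.
POLICY = {
    "forbidden_tools": {"delete_mailbox"},
    "approval_tools": {"send_payment"},
    "redact_pattern": r"\b\d{3}-\d{2}-\d{4}\b",  # e.g. a US SSN-shaped number
}


def check(stage: Stage, payload: str, tool: Optional[str] = None) -> Decision:
    """Evaluate the policy at one checkpoint and return a decision."""
    if stage is Stage.BEFORE_TOOL_CALL:
        if tool in POLICY["forbidden_tools"]:
            return Decision(Action.BLOCK, "")
        if tool in POLICY["approval_tools"]:
            return Decision(Action.REQUIRE_APPROVAL, payload)
    # At every stage, mask anything matching the sensitive-data pattern.
    redacted = re.sub(POLICY["redact_pattern"], "[REDACTED]", payload)
    if redacted != payload:
        return Decision(Action.REDACT, redacted)
    return Decision(Action.ALLOW, payload)


if __name__ == "__main__":
    print(check(Stage.BEFORE_TOOL_CALL, "wipe it", tool="delete_mailbox").action)
    print(check(Stage.AFTER_TOOL_RESULT, "SSN 123-45-6789 on file").payload)
```

Because the rules sit in one data structure rather than in framework-specific code, the same `check` function could be called from any agent loop, which is the portability argument the specification makes.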
Why it matters: Microsoft’s bet at Build is that the agent race is won on trust and control, not raw capability, which plays to its strengths. Its repeated line, that the differentiator is no longer access to intelligence but ownership of how your data and workflows operate, is a pointed jab at the model makers it depends on, including its partner OpenAI. Scout builds on OpenClaw, which OpenAI now owns, while Microsoft wraps it in Entra identity, Purview policies, and OS-enforced containers: let others supply the raw agent, and win by being the layer that makes it safe to run inside a company. The Agent Control Specification is the more durable move, since shipping it as an open standard with plugins for rival frameworks, including Anthropic’s and OpenAI’s, is an attempt to make Microsoft’s governance model the default everyone builds against.
🛡️ Safety & security
OpenAI Publishes a Playbook for Third-Party Model Evaluations
OpenAI published guidance on how independent groups should test frontier AI models, arguing that how a test is set up now matters as much as the score it produces. Older tests treated models like chatbots: you typed a question and a grader judged the answer. Today’s models use tools and work across many steps, so performance depends on the setup around them, not just the model.
OpenAI calls that setup the “harness,” meaning the tools and memory a model has while it works, and its main point is that the harness can swing a score enough to decide whether a skill shows up at all. Its own security tests show it: GPT-5.5 solved 92.3 percent of the challenges when it could track context across a long session, but only 69.2 percent without that help.
The guidance also covers whether a score can be trusted, listing ways results get distorted, from gaming the grader to a model underperforming on purpose. One striking example: METR’s test of GPT-5.4 first suggested the model could handle tasks taking a human about 13 hours, but reviewers found some wins came from gaming the grader, and cutting them dropped the estimate to about 6 hours. OpenAI says it is sharing its best testing methods with evaluators and giving outside teams access to models’ reasoning notes so they can check for deception.
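The re-scoring step can be sketched in a few lines of Python. This is a simplified stand-in, not OpenAI’s or METR’s method: it uses a plain solve rate rather than METR’s time-horizon estimate, the `Run` record and `solve_rate` function are hypothetical, and the four runs are invented toy data, not results from any real evaluation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    task: str
    solved: bool
    gamed_grader: bool = False  # flagged by a reviewer after the fact


def solve_rate(runs: List[Run], exclude_gamed: bool = False) -> float:
    """Percent of tasks solved; optionally treat gamed wins as failures."""
    if not runs:
        raise ValueError("no runs to score")
    wins = sum(
        1 for r in runs
        if r.solved and not (exclude_gamed and r.gamed_grader)
    )
    return 100.0 * wins / len(runs)


# Invented toy data: three apparent wins, one of which exploited the grader.
runs = [
    Run("task-1", solved=True),
    Run("task-2", solved=True, gamed_grader=True),
    Run("task-3", solved=False),
    Run("task-4", solved=True),
]

raw = solve_rate(runs)                          # 75.0
audited = solve_rate(runs, exclude_gamed=True)  # 50.0
print(f"raw {raw:.1f}% vs audited {audited:.1f}%")
```

The same comparison applies to harness choices: score one task set under two configurations and report both numbers. In OpenAI’s own example the gap between 92.3 and 69.2 percent is 23.1 percentage points.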
Why it matters: If a single setup choice can swing a score from 69 to 92 percent, no headline number means much on its own, and whoever controls the setup controls the story it tells. That cuts both ways for a company publishing its own guidance: pushing for stronger testing makes a model’s abilities harder to undersell, but it also gives developers an easy way to wave off an alarming result as a weak setup rather than a real limit. The lasting win is opening models’ reasoning notes to outside teams, since trust comes from being able to look inside the system, not just grade what comes out.
Canada Confirms It Has Access to Anthropic’s Mythos Through Project Glasswing
The Canadian government confirmed it has early access to Anthropic’s Mythos capability, the cyber model flagged as a potential threat to financial stability for how sharply it can speed up the discovery and exploitation of software vulnerabilities. Innovation Minister Evan Solomon’s office said the access runs through the Canadian Centre for Cyber Security under Project Glasswing, framing it as a way for Canada’s cyber defenders to better understand vulnerabilities, test systems responsibly, and strengthen protections for government services, critical infrastructure, and Canadian institutions. The CCCS confirmed the arrangement and added that it is also working with OpenAI and its 5.5-Cyber model.
The confirmation came as Anthropic expanded Project Glasswing to around 200 organizations in total, the broader rollout covered elsewhere in this edition. Until now, advance access had been limited to large U.S. tech firms and big American banks such as JPMorgan Chase, a tightly controlled group of about 50. The Canadian disclosure adds detail the company’s own announcement left vague: it confirms the U.S. government is a participant, and it places a national cyber agency among the new members. The picture inside Canada is still murky, though. The Canadian Bankers Association said it did not know which domestic organizations are included, the Ontario Securities Commission (OSC) referred questions to Anthropic, and both the Office of the Superintendent of Financial Institutions (OSFI) and the Bank of Canada declined to comment directly on Canadian membership.

The Mythos model has been a live concern for Canadian regulators for months. Anthropic delayed the model’s public release in April over its ability to rapidly exploit vulnerabilities, which set off a series of high-level meetings among governments, regulators, and financial players. Bank of Canada governor Tiff Macklem said he had spoken with U.S. Federal Reserve chair Jerome Powell about the risks, and much of the Canadian discussion has run through the Canadian Financial Sector Resiliency Group, a central-bank-led public-private partnership whose members include the CCCS, OSFI, the Finance Department, TMX Group, and technology experts from Canada’s largest banks.
Why it matters: A government confirming it has hands on a model regulators have called a threat to financial stability is a different statement than a company listing partners. It means a state cyber agency decided the defensive upside outweighs the risk of housing an offensive-grade tool. That Canada is also working with OpenAI’s cyber model suggests this is becoming standard practice, with national defenders sourcing frontier capability from competing labs. The throughline for Canadian readers is concrete: the cyber model the world is debating is already inside the federal government, and who manages its risks is being worked out in real time.
🛠️ Tools & features
OpenAI Broadens Codex with Role-Specific Plugins, Sites, and Annotations
OpenAI is pushing Codex well beyond software development. The company says more than 5 million people now use it weekly, and that non-developers (analysts, marketers, operators, designers, researchers, investors, and bankers) make up about 20 percent of users and are growing more than three times as fast as developers. The update adds three things: plugins that adapt Codex to a given role and toolset, the ability to create shareable interactive websites, and annotations for refining work in place.
The six new plugins target knowledge work with no coding required, bundling 62 apps and 110 skills between them. They cover data analytics (with tools like Snowflake, Databricks, and Tableau), creative production (Figma, Canva, Shutterstock), sales (Salesforce, HubSpot, Outreach), product design (Figma, Canva), public equity investing (FactSet, S&P, PitchBook), and investment banking. OpenAI says more are coming, including corporate finance, private equity, marketing strategy, consulting, and legal, and that it is building toward an open ecosystem where partners deploy their own plugins.

Sites, in preview for business and enterprise customers, lets Codex turn ideas and analysis into hosted, interactive webpages such as dashboards, planners, review workspaces, and project boards, shareable across a workspace by URL and kept up to date as details change. OpenAI named early partners including Wix, Replit, Figma, and Webflow. Annotations, already used by developers to refine code and sites, now extend to documents, spreadsheets, and slides, letting users point at a specific element and tell Codex what to change without reworking the rest.
Why it matters: That non-developers are Codex’s fastest-growing users signals the tool is escaping its original category. OpenAI is repositioning a coding product as a general engine for knowledge work, with plugins for bankers and marketers rather than just engineers, which puts it in the same race the other labs are running. Sites is the more ambitious piece: turning analysis into a living, shareable webpage rather than a static document is a bid to replace the file as the default unit of work output. Bundling 62 apps and inviting partners to build more is how a tool becomes a platform, and platforms are harder to displace than features.
Three NotebookLM Features Look Close to Launch: Personal Preferences, Connectors, and Canvas
Google appears to be readying a cluster of NotebookLM additions that have been in development for months, showing up quietly in recent builds while the team signals an announcement may be near. TestingCatalog highlights three that point in a consistent direction.
Personal Preferences comes first. It already launched in Gemini and is now set to reach NotebookLM, where it would learn from your activity and build editable personas that adjust tone and technical depth to how you work. The Gemini version reaches into Gmail, Drive, Photos, and Calendar, but the NotebookLM signals so far lean toward in-app personalization drawn from your own notebooks and chats, which suits anyone doing repeated, deep-context research. The opt-in language describes letting the tool use past conversations, artifacts, and customization instructions to tailor the experience.
Connectors sits beside it in settings and would close the external-data gap. It works much like MCP, pulling outside sources into a notebook, most likely starting with Google’s own Calendar, Gmail, and Drive. The feature is not yet functional, and the list of supported sources is still open.
Canvas is the one TestingCatalog calls the headline. Found in the Studio panel, it would turn your sources into a custom artifact such as an interactive timeline, an explainer page, a lightweight game, or a visualizer, all guided by a prompt that describes what you want and how. It builds on outputs NotebookLM already produces, including infographics, slide decks, data tables, and mind maps. With NotebookLM now living inside Gemini, the three features together would let people work across their sources without shuffling material between tools, matching Google’s effort to turn a source-grounded reader into a workspace for building structured, visual experiences on top of documents. On models, NotebookLM moved to Gemini 3 late last year, and with Gemini 3.5 Flash now the global default after I/O 2026, the Flash branch is the natural next base. No firm timeline has appeared, so the open question is when rather than whether.
Why it matters: These additions push NotebookLM past its original job as a source-grounded reader. Connectors and Personal Preferences loosen the constraint that made it trustworthy, pulling in live external data and learning from your behaviour, while Canvas shifts the output from understanding sources to producing things from them, the line between a research aid and a production tool. None of the three is novel alone, but bundling them inside Gemini removes the copy-paste friction between reading and building, which is the friction that keeps people switching apps.
🎨 Creative AI
Nano Banana 2 and Nano Banana Pro Hit General Availability with Video Input
Google made its two latest image models, Nano Banana 2 (Gemini 3.1 Flash Image) and Nano Banana Pro (Gemini 3 Pro Image), generally available through the Gemini Enterprise Agent Platform. The pitch is built around embedding image generation and editing directly into enterprise applications and agentic workflows, backed by Google’s infrastructure, security, and an enterprise SLA. Developers can also reach both models through the Gemini API, though that path does not carry the SLA.
The headline new capability, in preview, is video input. Alongside text, PDF, and image references, Nano Banana 2 can now take a video file and analyze its visual context, subjects, and actions to generate context-aware images such as thumbnails and infographics. On output resolution, 1K and 2K are generally available for both models while 4K stays in preview.

Google leaned heavily on customer examples to show the models already in production. Adobe has integrated them into Firefly Enterprise and GenStudio. WPP built them into its WPP Open marketing platform for clients including Verizon, L’Oreal, and Unilever. Shopify uses them to help merchants extend product photography and generate lifestyle imagery. URBN, the Urban Outfitters parent, is piloting them to compress its trend-to-market pipeline, and Magnopus has wired Nano Banana and Veo into a 3D production pipeline meant to keep generated elements aligned with directorial intent.
Why it matters: This is a distribution play more than a new capability. Rather than asking creative teams to switch tools, Google is embedding its image models into the platforms they already run on, so adoption costs almost nothing. The insistence on enterprise security and an SLA speaks to what has actually kept these teams cautious, which is brand and legal exposure rather than image quality. And the client list does the persuading: when Verizon, L’Oreal, and Unilever are pushing campaign work through these models, generative imagery has crossed from pilot project into routine marketing infrastructure.
Runway and NVIDIA Launch the Cosmos Coalition for Open World Models
Runway said it has joined the Cosmos Coalition as a founding member, a new collaboration with NVIDIA and several AI labs aimed at building and open-sourcing frontier world models for physical AI. World models are systems that learn to represent how an environment works so they can predict and act within it, and the coalition frames them as foundational to physical AI, meaning AI that reasons and acts in the physical world rather than only generating text or images.
The group is organized around shared infrastructure and mutual technical contribution, with members expected to build on and give back to a common open ecosystem. The first project will be a base model co-developed by Runway and NVIDIA. Runway co-CEO Anastasis Germanidis framed building these systems in the open as the faster route to progress, and NVIDIA’s Ming-Yu Liu, who leads its Cosmos Labs, positioned Runway’s work in generative video and world models as a contribution to open models and tools for the wider AI community.
Why it matters: Runway is best known for AI video, so a move into world models reads as a bet that generating video and modelling a physical environment are the same problem. A system that can predict the next frame of a scene is already doing a crude version of what a robot needs to anticipate the result of an action. The open-source framing is strategic. NVIDIA sells the hardware these models run on, so seeding an open ecosystem grows demand for its chips no matter which lab wins, the same playbook that made its tools the default in earlier AI waves. For everyone else, an open base model lowers the cost of entering physical AI, a field otherwise gated by the expense of training these systems from scratch. The real test is whether “open” holds as the models get valuable, since coalitions built on shared infrastructure tend to strain once members compete over what it produces.
📈 The Business of AI
Anthropic Moves Toward an IPO as the AI Labs Head for Public Markets
Anthropic said it has confidentially submitted a draft registration statement on Form S-1 to the U.S. Securities and Exchange Commission for a proposed initial public offering of its common stock. The filing gives the company the option to go public once the SEC finishes its review, though it stressed that any offering will depend on market conditions and other factors. The number of shares and the price have not been set. The announcement was made under Rule 135 of the Securities Act, which makes it a notice rather than an offer to sell or a solicitation to buy.
The move would let investors buy and sell shares of the company behind Claude on the public market, with a listing targeted for this year. It lands against a striking valuation backdrop. Anthropic, founded roughly five years ago by CEO Dario Amodei after he left OpenAI, recently raised private money at a valuation above $965 billion, ahead of OpenAI’s most recent $852 billion mark. OpenAI is also reportedly weighing a public listing this year, though Sam Altman told CNBC his company intends to go public only when it makes sense and is in no rush. Anthropic’s filing arrives alongside SpaceX’s own stock market plans and Alphabet’s disclosure that it intends to raise $80 billion for AI spending, which one analyst quoted by the BBC read as a sign the AI race is entering a more capital-hungry phase.
Why it matters: A confidential S-1 buys optionality. It starts the SEC review clock and lets a company prepare its disclosures privately before deciding whether to go public, so the filing signals intent more than a fixed date. The bigger consequence is what an actual listing would force into the open. Going public requires the kind of financial disclosure the largest AI labs have so far avoided, putting real numbers on revenue and on what it costs to train and run these models. With Anthropic and OpenAI both circling the public markets near trillion-dollar valuations, the first to list gives investors their clearest look yet at whether the economics of frontier AI actually work.
