A video still of the Project Astra demo at the Google I/O conference keynote in Mountain View on May 14, 2024.


Following OpenAI’s announcement of GPT-4o, a model that OpenAI says can understand video content and converse about it, Google unveiled Project Astra, a research prototype with similar video comprehension capabilities. Google DeepMind CEO Demis Hassabis introduced Project Astra on Tuesday during the Google I/O conference keynote in Mountain View, California.

Hassabis described Astra as “a versatile agent assisting in daily activities.” In a demonstration, the AI model showed off its skills by identifying objects that make sound, offering inventive word combinations, explaining code displayed on a screen, and locating misplaced items. The AI assistant also showed its potential in wearable devices such as smart glasses, where it could analyze diagrams, suggest improvements, and come up with amusing responses to visual cues.

Google says that Astra uses the camera and microphone on a user’s device to help with everyday tasks. By continuously processing and encoding video frames and speech input, Astra builds a timeline of events and stores the information for quick recall. This process lets the AI identify objects, answer questions, and remember items that are no longer within the camera’s view.

Project Astra: Google’s vision for the future of AI assistants.

While Project Astra remains in its early stages with no specific release plans, Google hinted that some of its features might be integrated into products like the Gemini app later this year (under the name “Gemini Live”), which would mark a significant step in the evolution of helpful AI assistants. According to Google CEO Sundar Pichai, the goal is to create an agent that can think proactively, strategize, and plan on behalf of users.

More from Google AI: 2 million tokens

During Google I/O, the company disclosed multiple AI-related updates, some of which may be covered in separate posts later. For now, here’s a brief summary.

At the start of the keynote, Pichai highlighted an “enhanced” version of Gemini 1.5 Pro, the model first announced in February (oddly, it keeps the same version number), that is set for release soon. The new version will offer a 2 million-token context window, letting it process large collections of documents or long encoded videos in one go. Tokens are fragments of data that AI language models use to process information, and the context window determines the maximum number of tokens a model can process at once. Currently, 1.5 Pro supports up to 1 million tokens (compared to 128,000 tokens for OpenAI’s GPT-4 Turbo).
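To make the context window comparison concrete, here is a minimal Python sketch that checks which of the windows mentioned above could hold a prompt of a given size. The window sizes come from the figures in this article; the dictionary labels and the function name `models_that_fit` are illustrative, and the check ignores details such as output tokens that a real API would also count.

```python
# Context window sizes (in tokens) as quoted in the article.
CONTEXT_WINDOWS = {
    "GPT-4 Turbo": 128_000,
    "Gemini 1.5 Pro (current)": 1_000_000,
    "Gemini 1.5 Pro (upcoming)": 2_000_000,
}


def models_that_fit(prompt_tokens: int) -> list[str]:
    """Return the labels of the windows large enough to hold the prompt."""
    return [
        name
        for name, window in CONTEXT_WINDOWS.items()
        if prompt_tokens <= window
    ]


# A 1.5 million-token prompt only fits in the upcoming 2 million-token window.
print(models_that_fit(1_500_000))
```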

Asked for his take on the context window upgrade, AI researcher Simon Willison, who attended the keynote but does not work for Google, responded via text: “The prospect of two million tokens is thrilling. But remember, at $7 per million tokens, a single query could amount to $14!” Through its API, Google charges $7 per million input tokens on 1.5 Pro for prompts exceeding 128,000 tokens.
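Willison’s $14 figure is simple arithmetic: a prompt that fills the full 2 million-token window, billed at $7 per million input tokens. A small Python sketch of that calculation follows. The function name `prompt_cost_usd` is illustrative, and the sketch assumes a flat per-million rate with no output-token charges.

```python
def prompt_cost_usd(input_tokens: int, usd_per_million: float) -> float:
    """Input cost of one prompt at a flat per-million-token rate (assumed)."""
    return input_tokens / 1_000_000 * usd_per_million


# A prompt that fills the upcoming 2 million-token window at $7 per million.
full_window_cost = prompt_cost_usd(2_000_000, 7.0)
print(f"${full_window_cost:.2f}")  # prints $14.00
```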

During the Google I/O 2024 keynote, Google said Gemini Advanced has the “longest context window in the world” at 1 million tokens, soon to be 2 million.


Relatedly, Google confirmed that the previously announced 1 million-token context window for Gemini 1.5 Pro, until now available only through the API, is being extended to Gemini Advanced subscribers.

Google also introduced a new AI model named Gemini 1.5 Flash, positioned as a lighter, faster, and cheaper version of Gemini 1.5. According to Google, “1.5 Flash marks the newest inclusion to the Gemini family of models and the fastest Gemini model accessible via the API. It’s optimized for high-volume, high-frequency tasks on a large scale.”

On the topic of tokens, Willison commented on Flash’s debut, noting, “The introduction of Gemini Flash is promising; it’s engineered to provide up to 2 million tokens at a reduced cost.” Flash is priced at $0.35 per million tokens for prompts up to 128,000 tokens and at $0.70 per million tokens for longer prompts. That makes it about one-tenth the price of 1.5 Pro.
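The tiered pricing and the one-tenth comparison can be checked with a short Python sketch. The rates are the ones quoted in this article. The function names are illustrative, and the sketch assumes the higher rate applies to the entire prompt once it crosses the 128,000-token threshold, which the article does not spell out.

```python
# Per-million-token input rates in USD, as quoted in the article.
FLASH_SHORT_RATE = 0.35  # Flash, prompts up to 128,000 tokens
FLASH_LONG_RATE = 0.70   # Flash, longer prompts
PRO_LONG_RATE = 7.00     # 1.5 Pro, long prompts
TIER_LIMIT = 128_000


def flash_input_cost_usd(input_tokens: int) -> float:
    # Assumption: the higher rate covers the whole prompt, not just the excess.
    rate = FLASH_SHORT_RATE if input_tokens <= TIER_LIMIT else FLASH_LONG_RATE
    return input_tokens / 1_000_000 * rate


def pro_long_prompt_cost_usd(input_tokens: int) -> float:
    return input_tokens / 1_000_000 * PRO_LONG_RATE


for tokens in (100_000, 1_000_000):
    print(f"{tokens:>9,} tokens on Flash: ${flash_input_cost_usd(tokens):.3f}")

# Compare the two models on the same long prompt.
long_prompt = 1_000_000
ratio = flash_input_cost_usd(long_prompt) / pro_long_prompt_cost_usd(long_prompt)
print(f"Flash / Pro cost ratio for a long prompt: {ratio:.2f}")
```

Under these assumptions, a 1 million-token prompt costs about 70 cents on Flash versus $7 on 1.5 Pro, which matches the one-tenth figure.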

“35 cents per million tokens! In my opinion, that’s the biggest highlight today,” shared Willison.

Google also unveiled Gems, which appear to be Google’s equivalent of OpenAI’s GPTs. Gems are custom versions of the Google Gemini chatbot that take on specific roles defined by users, allowing for personalized interactions with Gemini in various capacities. Google suggests roles like “a workout companion, culinary assistant, programming collaborator, or an imaginative writing mentor” as potential Gems.

New generative AI models

An illustration of the Google Imagen 3 website.


During the Google I/O keynote, Google also unveiled an array of new generative AI models for creating images, audio, and video. Imagen 3 is the latest in its line of image generation models, touted as its “top-tier text-to-image model, adept at producing images with superior intricacy, enhanced lighting, and fewer distracting imperfections compared to their previous iterations.”

Google also showcased its Music AI Sandbox, which it describes as “a set of artificial intelligence instruments to revolutionize music creation.” It combines the company’s YouTube music initiative with its Lyria AI music generator to give musicians more advanced tools.

Google also introduced Google Veo, a text-to-video generator that produces 1080p videos from prompts, apparently at a quality comparable to OpenAI’s Sora. The company revealed a collaboration with actor Donald Glover on an AI-generated demonstration film that is expected to premiere soon. Veo is not Google’s first attempt at AI video generation, but it appears to be its most capable one to date.

One sample video provided by Google depicts “A solitary cowboy traveling on his horse across an expansive plain at a stunning sunset, embraced by warm colors and soft lighting.”

Google says its latest generative AI tools are now available to a limited group of creators in a private preview, with waitlists open for others who want to use them.