Google I/O 2024: Between Gemini’s Future and the Fiercer Race for the Crown in AI

Maybe it’s a coincidence that OpenAI launched GPT-4o the same week Google I/O took place. Maybe it isn’t. Either way, the race for the crown in the AI industry is fiercer than ever.

One thing is certain: the two companies’ recent launches have a lot in common. Both stressed that their models are natively multimodal, capable of processing and understanding text, images, and audio simultaneously.

The main course in both launches was the demo of their AI agents, which surprised us all with their capabilities reminiscent of sci-fi movies. Remember Scarlett Johansson in Her? It’s pretty similar.  

These demos suggest that AI agents are the natural next step for LLMs, and multimodality is what makes that step possible. AI agents go well beyond generating text: they perform actions. Imagine having an AI that not only answers your questions but also books your flights, manages your calendar, helps you find dog walkers, swaps out ill-fitting shoes for the correct size, and drafts your emails.

This transformation is pivotal for both personal and professional efficiency.  

For instance, Google’s Project Astra, using just the camera on your phone, can identify objects in real time, comment on them, assist in complex tasks like code reviews, and help you find your missing glasses.

But similarities aside, let’s see what Google I/O brings us as a novelty and what the Mountain View company has in store for us. 

Google I/O 2024: The Gemini Era

Led by CEO Sundar Pichai, the keynote put Gemini 1.5 Pro front and center. Google pitches this latest iteration of the model as more intuitive and more capable than its predecessors, with uses that range from complex problem-solving to everyday interactions. Its multimodal design lets it process and understand text, images, and audio together, which helps it give relevant, context-aware responses.

Ask Photos

A fascinating Gemini-powered feature is “Ask Photos,” which lets users query their Google Photos libraries using natural language. This AI-driven tool can identify people, locations, and objects within photos, making it easier than ever to find and organize images.

For example, you could ask, “Show me how Lucia’s swimming has progressed,” and Gemini will recognize the different contexts involved, from Lucia doing laps in the pool to snorkeling in the ocean to the text and dates on her swimming certificates. Photos will package it all together in a summary, allowing you to relive those memories all over again. Ask Photos is set to roll out this summer, with more capabilities to come.

Gemini in Google Workspace

Two technical advances underpin Gemini 1.5 Pro: multimodality and a long context window. Each is powerful on its own, but together they unlock deeper capabilities, and Google is now bringing them to Workspace. People frequently search their emails in Gmail, and Gemini makes that search much more powerful. For instance, as a parent, you want to stay informed about everything going on with your child’s school. Gemini can help you keep up.

You can ask Gemini to summarize all recent emails from the school. In the background, it identifies relevant emails and even analyzes attachments like PDFs. You get a summary of key points and action items. If you missed a PTA meeting because you were traveling, Gemini can summarize the hour-long Google Meet recording for you. There’s a parents’ group looking for volunteers, and you’re free that day—Gemini can draft a reply for you. This makes life easier and more organized. Gemini 1.5 Pro is available now in Workspace Labs, with more capabilities to come. 

NotebookLM

Another exciting update is the addition of audio output to NotebookLM, Google’s AI-powered note-taking and research app. In the demo, Gemini 1.5 Pro turned uploaded source material into a spoken, conversational discussion that users can listen to and steer with their own questions, making it easier to revisit and share key points.

AI Agents, the Next Step for LLMs?

AI agents represent the next evolution in intelligent systems, showcasing advanced reasoning, planning, and memory capabilities. These agents are designed to think multiple steps ahead and work across various software and systems to accomplish tasks on your behalf, all under your supervision. 

While it’s still early days, the potential use cases Google is exploring are already impressive.

Simplifying everyday tasks

Shopping: Imagine if Gemini could handle all the steps involved in returning a pair of shoes that don’t fit. It could: 

  • Search your inbox for the receipt
  • Locate the order number from your email
  • Fill out the return form
  • Schedule a UPS pickup

This automation makes the process hassle-free, allowing you to focus on more enjoyable activities. 
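To make the idea concrete, here is a minimal sketch of how such a supervised, multi-step workflow could be structured. It is purely illustrative and is not how Gemini works internally: every name in it (ReturnRequest, run_agent, the step functions, the sample receipt) is hypothetical, and each step is a stub standing in for a real integration such as an email search or a carrier booking. The one idea it tries to capture is that the agent asks for approval before each action, so the user stays in control.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ReturnRequest:
    """State a hypothetical agent accumulates while handling the return."""
    receipt: Optional[str] = None
    order_number: Optional[str] = None
    form_submitted: bool = False
    pickup_scheduled: bool = False
    log: List[str] = field(default_factory=list)


# Each function below is a stub; a real agent would call external services.
def search_inbox_for_receipt(req: ReturnRequest) -> None:
    # Made-up sample email text, used only for illustration.
    req.receipt = "Receipt for order #A1234: running shoes, size 9"


def locate_order_number(req: ReturnRequest) -> None:
    match = re.search(r"#(\w+)", req.receipt or "")
    if match is None:
        raise ValueError("no order number found in receipt")
    req.order_number = match.group(1)


def fill_out_return_form(req: ReturnRequest) -> None:
    if req.order_number is None:
        raise ValueError("cannot fill the form without an order number")
    req.form_submitted = True


def schedule_pickup(req: ReturnRequest) -> None:
    req.pickup_scheduled = True


STEPS: List[Tuple[str, Callable[[ReturnRequest], None]]] = [
    ("Search your inbox for the receipt", search_inbox_for_receipt),
    ("Locate the order number from your email", locate_order_number),
    ("Fill out the return form", fill_out_return_form),
    ("Schedule a UPS pickup", schedule_pickup),
]


def run_agent(req: ReturnRequest, approve: Callable[[str], bool]) -> ReturnRequest:
    """Run the steps in order, asking `approve` before each one."""
    for description, step in STEPS:
        if not approve(description):
            req.log.append(f"stopped before: {description}")
            break
        step(req)
        req.log.append(f"done: {description}")
    return req
```

Calling run_agent(ReturnRequest(), approve=lambda step: True) runs all four steps, while an approve callback that rejects a step halts the workflow at that point and records where it stopped.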

Moving to a New City: Moving can be overwhelming, but AI agents like Gemini can assist in numerous ways. For instance, if you move to Chicago, Gemini and Chrome can work together to: 

  • Explore the city and find local services like dry cleaners and dog walkers
  • Update your new address across multiple websites

Throughout these processes, Gemini will prompt you for necessary information, ensuring you remain in control and the tasks are completed efficiently. 

Project Astra: The Future of AI Assistants

Project Astra was one of the most groundbreaking announcements at Google I/O 2024.  

This real-time, multimodal AI assistant aims to be the universal helper we’ve always dreamed of. Demis Hassabis, the head of Google DeepMind and the leader of Google’s AI efforts, described Astra as a vision realized after decades of work. Astra can see the world, identify objects, and assist with tasks in real time, all through a conversational interface.

In the demo, Astra was shown identifying parts of a speaker, finding missing glasses, reviewing code, and more, all in real time. Astra’s design is multimodal, meaning you can interact with it through text, voice, images, and video, making it a versatile assistant for any situation.

Hassabis emphasized that the future of AI is not just about models but about what these models can do for you autonomously.

Veo and Imagen 3: Visual and Creative AI

Veo

Google unveiled Veo, its most capable generative video model to date. Building on Google’s earlier video generation research, Veo can create 1080p clips that run longer than a minute. The model demonstrates a deep understanding of natural language and semantics, producing videos that accurately reflect user requests. Veo can understand and apply cinematic terms like “timelapse” or “aerial shots of landscapes,” meaning it not only creates the content you ask for but also applies the techniques and styles you want. Veo also makes a significant leap in simulating real-world physics and rendering high-definition sequences, making it a promising tool for video production.

Imagen 3

Imagen 3, Google’s latest image generation model, offers unprecedented levels of detail and realism. This AI can create stunning images from text descriptions, providing a powerful tool for artists, designers, and content creators.  

Trillium: Advancements in AI Hardware

Google also introduced Trillium, the sixth generation of its Tensor Processing Units (TPUs). Trillium is the company’s most performant and efficient TPU to date, offering a 4.7x improvement in compute performance per chip over the previous generation, TPU v5e. These TPUs will be available to Google Cloud customers in late 2024.

Alongside TPUs, Google also offers CPUs and GPUs to support any workload. These include the new Axion processor, Google’s first custom Arm-based CPU, and Nvidia’s cutting-edge Blackwell GPUs, which will be available in early 2025. These advancements are part of Google’s AI Hypercomputer architecture, a supercomputer system designed to tackle complex challenges with more than twice the efficiency of traditional hardware.

Conclusion

The recent announcements from Google and OpenAI highlight the intensifying competition in the AI industry. Both companies showcased remarkably similar advancements, with a focus on multimodal AI agents that transcend text generation and delve into real-world actions. These developments hint at a future filled with AI assistants that seamlessly integrate with our lives, boosting personal and professional efficiency. 

Stay tuned for further insights! Follow Inclusion Cloud for more tech news and analysis as we explore the exciting possibilities that lie ahead in the realm of AI. 

Inclusion Cloud: We have over 15 years of experience in helping clients build and accelerate their digital transformation. Our mission is to support companies by providing them with agile, top-notch solutions so they can reliably streamline their processes.