
Upcoming Events
Thu, Jan 25th: ⚙️ MLOps Community x GenAI Collective 🧠 Tech Talks 🎙️
Tue, Jan 30th: 🧠 GenAI Collective 🧠 Inaugural Marin Meetup! 🍻
🗓️ Hungry for even more AI events? Check out Cerebral Valley’s spreadsheet!
If you didn’t get a chance to listen to our most recent episode of the Collective Intelligence Community Podcast, go listen to our very own Thomas Joshi and Pierce Kelaita’s interview with OctoAI CTO Jared Roesch, in which they discuss everything from Seattle’s AI scene to the future of AI infrastructure in 2024.
Apple Enters the LLM Conversation
Apple’s Ferret: Revolutionizing the Generative AI Ecosystem
For the better part of a year, everyone has been eager to see how Apple would enter the Generative AI arms race—a question its big tech rivals Microsoft and Google have been asking as well. In late October 2023, with help from researchers at Columbia, Apple quietly released a new, open-source LLM called Ferret, which it describes as an end-to-end MLLM (Multimodal Large Language Model) that can refer and ground anything anywhere at any granularity (read the research paper here). The model is available in two sizes–7B and 13B parameters–and was trained on 8 Nvidia A100 GPUs, each with 80GB of HBM2e memory. This release, a departure from Apple’s traditionally tight-lipped launch process, even included a rare acknowledgement of Nvidia. With it, Apple made a clear statement to the AI community: bigger isn’t always better. Many believe the model is intended to redefine what Generative AI capabilities the community thought were possible on a smartphone or a Vision Pro 😉. Apple also doubled down on the viability of running LLMs on edge devices by releasing another pair of research papers in late 2023–Efficient Large Language Model Inference with Limited Memory and Human Gaussian Splats.
Innovative Approach to Visual and Textual Fusion
Ferret is a quantum leap in AI's capability to understand and integrate visual and textual data on the edge. Ferret uses a Contrastive Language-Image Pre-training vision transformer (CLIP ViT) to analyze images and encode visual information into representations the language model can reason over. The model identifies fine-grained visual details in an image (e.g. one produced by an iPhone camera) and uses them to produce text-based insights, responses, and descriptions. Unlike its predecessors, Ferret excels in two critical areas: referring and grounding. Referring means the model can understand the semantics of specific image regions, while grounding involves localizing objects based on textual descriptions.
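To make the CLIP-style backbone concrete, here is a minimal sketch of image-text matching over a cropped region, loosely mimicking “referring.” It uses the open-source CLIP ViT from Hugging Face transformers (a real checkpoint), not Apple’s Ferret weights; the image file, crop coordinates, and candidate labels are illustrative assumptions.

```python
# Sketch: CLIP-style matching between an image region and text candidates.
# This is NOT Ferret itself -- just the kind of image-text alignment its
# CLIP ViT encoder builds on. File name, crop box, and labels are made up.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("kitchen.jpg")            # hypothetical photo
region = image.crop((120, 80, 360, 300))     # "refer" to a sub-region (x1, y1, x2, y2)

candidates = ["a coffee mug", "a laptop", "a potted plant"]
inputs = processor(text=candidates, images=region, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image: similarity of the cropped region to each text candidate
probs = outputs.logits_per_image.softmax(dim=-1)
best = candidates[probs.argmax().item()]
print(f"The referred region most likely shows: {best}")
```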

What sets Ferret apart is its groundbreaking approach to fusing visual and textual information. By employing a unique combination of hybrid region representation and spatial-aware visual sampling, backed by a vast instruction-tuning dataset called GRIT (>1.1M diverse image samples with rich spatial context), Ferret has redefined the benchmarks for spatial comprehension in AI.
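The core idea behind the hybrid region representation is that a region is encoded both discretely (coordinates) and continuously (features pooled from points inside the region). The toy sketch below captures that intuition; the shapes, random-point pooling, and function name are our simplifications, not Apple’s exact implementation.

```python
# Toy sketch of a "hybrid region representation": a region becomes a
# discrete bounding box PLUS a continuous feature pooled from points
# sampled inside the region mask (a crude stand-in for Ferret's
# spatial-aware visual sampling). All shapes here are illustrative.
import torch

def hybrid_region_embedding(feature_map, mask, num_points=32):
    """
    feature_map: (C, H, W) visual features from the image encoder (e.g. CLIP ViT)
    mask:        (H, W) boolean mask of a free-form region
    Returns the discrete bounding box plus a pooled continuous feature.
    """
    ys, xs = torch.nonzero(mask, as_tuple=True)
    # Discrete part: bounding-box coordinates of the region
    box = (xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item())
    # Continuous part: sample points inside the mask and average their features
    idx = torch.randperm(len(xs))[:num_points]
    region_feat = feature_map[:, ys[idx], xs[idx]].mean(dim=-1)  # shape: (C,)
    return box, region_feat

# Usage with dummy tensors
feats = torch.randn(768, 24, 24)               # fake ViT feature grid
mask = torch.zeros(24, 24, dtype=torch.bool)
mask[5:15, 8:20] = True                        # fake free-form region
box, feat = hybrid_region_embedding(feats, mask)
print(box, feat.shape)                         # (8, 5, 19, 14) torch.Size([768])
```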

Benchmarking Against GPT-4: A New Industry Standard
When benchmarked against models like GPT-4 (via the Ferret-Bench evaluation framework), Ferret stands out, particularly in understanding and interacting with regions of interest within crowded and complex scenes. Its superior performance in accurately identifying and describing small, specific regions in complex images marks a significant advancement over existing multimodal models. The model can provide detailed descriptions of objects and the interactions between them with high precision, and it can even learn to anticipate user prompts from the context of the scene. This capability underlines Apple's commitment to pushing the boundaries of what's possible when running multimodal Generative AI models on the edge and sets a new industry standard.
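In practice, grounded responses like these interleave text with region coordinates. The snippet below shows how one might parse such an output; the exact prompt and coordinate format are our assumption for illustration, not Ferret’s published I/O specification.

```python
# Illustrative parsing of a grounding-style response in which the model
# embeds bounding boxes inline in its text. The "[x1, y1, x2, y2]" format
# shown here is an assumption, not Ferret's documented output format.
import re

response = ("The mug [120, 80, 360, 300] sits to the left of "
            "the laptop [400, 60, 880, 520].")

# Extract each "[x1, y1, x2, y2]" box alongside the word preceding it
pattern = re.compile(r"(\w+)\s*\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")
for obj, x1, y1, x2, y2 in pattern.findall(response):
    print(f"{obj}: box=({x1}, {y1}, {x2}, {y2})")
# mug: box=(120, 80, 360, 300)
# laptop: box=(400, 60, 880, 520)
```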
Implications in Generative AI: A Leap Towards Human-Like Understanding
Ferret's introduction is not just a testament to Apple's commitment to Generative AI (some predict Apple will spend >$4B in the sector in 2024) but a giant step towards achieving a human-like understanding in edge AI systems. Ferret’s ability to process and integrate multimodal data – visual and textual – in a way that mirrors human cognitive abilities unlocks a new frontier for generative technologies. This advancement could pave the way for more intuitive, efficient, and human-centric AI applications, fundamentally altering how we interact with technology. Rumors have already surfaced about Ferret’s potential integration with other Apple products including Spotlight, Siri, the iPhone camera, CarPlay, and Vision Pro. Some potential applications below:
Image-Based Siri Interactions: Ferret's ability to understand and analyze images can revolutionize Siri's functionalities, allowing users to interact using images as well as language.
Advanced Visual Search Functionality: The model can enhance visual search capabilities in Apple's ecosystem, enabling more nuanced and accurate searches that combine textual and visual information.
Augmented User Assistance for Accessibility: Ferret's capabilities can be leveraged to provide enhanced assistance to users with disabilities, offering a more contextual and inclusive experience.
Further, Ferret’s transformative impact extends beyond Apple’s ecosystem, potentially revolutionizing how industries leverage novel language models for spatial understanding. Its precision in handling diverse image regions – from discrete coordinates to free-form shapes – makes it an invaluable tool across various applications. For instance, in augmented reality, Ferret can enhance user experiences by providing more accurate and contextually relevant information. In autonomous driving, its ability to understand complex scenes could lead to safer and more efficient navigation.
Apple’s Vision for the Future of AI
Apple's unveiling of the Ferret model is more than just the release of a new model architecture; it's a declaration of the company’s vision for the future of Generative AI. They’ve made the conscious choice to open-source their research in order to bring transparency and trust to their AI system, while fostering further innovation and collaboration. Additionally, by integrating this novel architecture into its product suite, Apple is not only building defensibility around its current offerings but also reshaping the landscape of AR application development–a clear growth vector that Apple intends to dominate.
With that, I hope you enjoyed the newsletter and would love to hear your perspective on how you see Apple disrupting the ecosystem. If you have any ideas or insights you would like to share, hit up Eric on the community Slack or reach out via email at [email protected] for a feature in our next newsletter!
Events Recap
What a powerful start to 2024! We want to thank our partners from DataStax–Patrick McFadin and Aaron Ploetz–for the EXCEPTIONAL keynotes and for providing their wonderful HQ in Santa Clara for the AGI Enablers Summit last week! 🤩 As always, we also want to thank all of our community members who joined us for another night of thought-provoking conversations and intimate relationship building!

About Eric Fett
Eric joined The GenAI Collective in early September to lead the development of the newsletter. He is currently an investor at NGP Capital where he focuses on Series A/B investments across enterprise AI, cybersecurity, and industrial technology. He’s passionate about working with early-stage visionaries on their quest to create a better future. When not working, you can find him on a soccer field or at a sushi bar! 🍣