Pixtral is REALLY Good - Open-Source Vision Model

Mistral AI's new Pixtral 12B model is a game-changer for vision tasks, excelling in image recognition and description with impressive speed and accuracy.

Mistral AI has just released Pixtral 12B, a brand new open-source vision model. This is a multimodal model, and today we will be testing it. First, I'll provide some details about the model, and then we'll proceed with the testing. I've loaded the model on Vultr, which is also the sponsor of this video. Thank you to Vultr. Vultr is the easiest way to rent GPUs in the cloud. They offer Nvidia GPUs, virtual CPUs, bare metal, Kubernetes, storage, and networking solutions. Check out the link in the description for $300 of free credit by using code Berman300.

In true Mistral AI form, the original announcement was basically just a torrent link with no additional information. However, after downloading it, we discovered that it is a vision model called Pixtral. Today, they provided more information through a blog post announcing Pixtral 12B, the first-ever multimodal Mistral AI model released under an Apache 2.0 license. It is natively multimodal, trained with interleaved image and text data, and shows strong performance on multimodal tasks. It excels in instruction following and offers state-of-the-art performance on text-only benchmarks. The model is a 12-billion-parameter multimodal decoder based on Mistral Nemo. It supports variable image sizes and aspect ratios, as well as multiple images in a long context window of 128,000 tokens.

In the provided chart, you can see some benchmarks. Pixtral 12B is represented in a yellowish color, alongside other models: LLaVA in red, Qwen in green, Gemini Flash 8B in light blue, and Claude 3 Haiku in darker blue. Across the board, Pixtral 12B appears to be the best model. We will test it on various vision tasks and some text tasks to see how it performs. I've loaded it up on Vultr, which was very simple. We are hosting it on an Nvidia L40 with 48 GB of VRAM and exposing it through an OpenAI-compatible API. For the front end, we are using Open WebUI, hosting it locally and plugging it into the cloud GPU on Vultr. It works seamlessly.
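To make that setup concrete, here is a minimal sketch of what a client-side call to an OpenAI-compatible endpoint serving Pixtral might look like. The base URL, API key, model id, and image URL are placeholder assumptions; substitute whatever your own deployment reports (for example, via its /v1/models route).

```python
# Minimal sketch: calling a Pixtral deployment through an OpenAI-compatible API.
# All endpoint details below are placeholders, not values from the video.
from openai import OpenAI

client = OpenAI(
    base_url="http://YOUR-GPU-HOST:8000/v1",  # hypothetical endpoint on the cloud GPU
    api_key="placeholder-key",                # many self-hosted servers ignore the key
)

response = client.chat.completions.create(
    model="mistralai/Pixtral-12B-2409",       # assumed model id; check /v1/models on your server
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/llama.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Open WebUI talks to the same endpoint, so the chat tests below could also be reproduced with calls like this.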

Let's start with a non-vision task to see its performance. We'll ask it to write the game Tetris in Python. The response is quite fast, although I don't have high hopes: it is a relatively small model, and most models other than the most cutting-edge ones have gotten this wrong. After copying and pasting the code, an attribute error occurs, indicating that the model couldn't complete the task in one go. This is not surprising, as the model isn't designed for logic and reasoning tasks.

Next, let's test its vision capabilities, which is its primary strength. We'll start with a simple task: describing an image of a llama. The response is impressively fast and accurate: "The image depicts a llama lying down in a grassy field. The llama appears to be resting calmly with its body fully extended along the ground." This is a perfect answer.

Now, let's see if it can recognize celebrities. We'll use an image of Bill Gates. Most models have struggled with this task, but Pixtral 12B successfully identifies him: "The person in the image is Bill Gates, co-founder of Microsoft and a notable philanthropist. He's wearing glasses, a gray sweater over a collared shirt." This is very impressive and cool.

Let's continue testing...

=> 00:04:59

Super impressed by how fast and accurate the model is at identifying images, solving captchas, and analyzing storage details!



Okay, that's great. Next, let's see if it can solve CAPTCHAs. A few models have attempted this, but not really succeeded; some of them just straight up refuse. Let's see if we can even get the model to do it, and if we can, whether it will do it successfully. I was originally going to ask, "What letters do you see in this image?" but instead I'm going to directly say, "Solve this CAPTCHA." And it's so fast: "This image shows a distorted CAPTCHA challenge displaying the letters V4XBG." Perfect, unbelievable, really nice. So, CAPTCHAs are basically done.

Alright, next, I took a screenshot of my iPhone storage, and I'm going to ask it a bunch of questions about it. How much storage does my phone have in total? According to the image, we have 128 GB. Great, let's keep asking. How much storage is currently being used? 119.7 GB. This is flawless. I am super impressed; the Llama model did not do nearly as well. Which app is using the most storage? There we go: the app using the most storage is Photos. I am beyond impressed with this.

Let's give it something a little bit harder. Which of the apps is not actually downloaded on the phone? That would be CamScanner, and as you can see, there's the little cloud download icon next to it. Let's see: the app that is not actually downloaded on the phone is Instagram. Okay, so that is not correct. It doesn't really understand that the little cloud download icon means the app isn't actually downloaded.

Alright, last question: give me a list of all the apps you see and how much storage they are using. Let's see and compare: Photos, WhatsApp, Gmail, Health, Mail, LinkedIn, Tuneblast, iCloud Drive, CamScanner, Instagram. Very, very nice. This is an absolute pass.

Next, we're going to ask it to explain a meme. We have startups on the left and big companies on the right. Let's see what it says. Explain the meme: "The image is a comparison between startups and big companies depicted through a construction analogy. On the left, labeled startups, a group of workers wearing high-visibility vests and helmets are actively engaged in digging and working on a construction site: dynamic, hands-on, in the nitty-gritty details of the work. On the right, labeled big companies, a group of workers in blue uniforms are gathered around a hole, seemingly observing and discussing without actively participating in the digging." Yep, that is perfect.

And then, what is funny about it? The meme humorously illustrates the difference in work culture and approach between startups, which are typically more hands-on and involved, and big companies, which tend to be more process-oriented and focused on oversight. Now obviously, yeah, that's definitely a pass.


=> 00:08:52

The future of AI is in using specialized models for specific tasks to maximize efficiency and performance.

I want to take a moment to discuss what I believe is the future of models. I think what we're going to have are a lot of smaller specialized models. For instance, we might use Pixtral for vision, Llama 3.1 for logic, reasoning, and math, and o1 for really complex queries. Even though this model isn't necessarily performing well on logic, reasoning, and coding, its vision is phenomenal, which means that's what I want to use it for. We want to use the best, cheapest, most efficient, and lowest-latency model for each use case.
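As a toy illustration of that routing idea, here is a minimal sketch that picks a model per request; the model names and routing rules are illustrative assumptions, not a recommended setup.

```python
# Toy router: send each request to the cheapest model that can handle it.
# The model names and rules below are illustrative assumptions.
def pick_model(has_image: bool, needs_deep_reasoning: bool) -> str:
    if has_image:
        return "pixtral-12b"    # vision tasks go to the vision specialist
    if needs_deep_reasoning:
        return "o1"             # reserve the slow, expensive model for hard queries
    return "llama-3.1-70b"      # default for everyday logic, math, and chat

print(pick_model(has_image=True, needs_deep_reasoning=False))   # -> pixtral-12b
print(pick_model(has_image=False, needs_deep_reasoning=True))   # -> o1
```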

Quickly, I just want to say thank you again to Vultr for sponsoring this video and hosting the Pixtral model. It was so easy to do; I just got an endpoint, plugged it right into Open WebUI, and we were off to the races. If you want to load up models that can't necessarily fit on your local machine, or if you're starting a company and need GPUs, or you're scaling up your AI application and need a bunch of GPUs, Vultr is a great option. I'll drop a link in the description below. Use code Berman300 and get $300 in credit, which is a ton of credit to try out any of the GPUs that you find on Vultr.

Next, I'm going to provide a QR code and ask what the URL is. No model has gotten this right so far; I don't even know if it's possible. Let's see what URL this QR code resolves to. "However, without scanning the QR code or having the specific data embedded in it, it is not possible to determine what the URL resolves to." So yeah, it's not able to do that, but that's okay; no model has been able to do this so far.

Next, we have a screenshot of a table with headers and all the data filled out. We're going to ask it to convert this to CSV. Let's see. Okay, the columns are there: medal, name, sport, event, date. Yeah, this looks perfect, really well done.
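If you wanted to script that table-to-CSV step instead of doing it in the chat UI, a rough sketch might look like the following. It reuses the hypothetical `client` from the earlier snippet, sends the screenshot as a base64 data URL, and parses the reply with Python's csv module; the filename and model id are assumptions.

```python
# Sketch: ask the model to turn a table screenshot into CSV, then parse the reply.
# Reuses the `client` from the earlier snippet; filename and model id are assumptions.
import base64
import csv
import io

with open("medal_table.png", "rb") as f:   # hypothetical screenshot of the table
    image_b64 = base64.b64encode(f.read()).decode()

reply = client.chat.completions.create(
    model="mistralai/Pixtral-12B-2409",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Convert this table to CSV. Output only the CSV."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)

# Parse the returned text so it can be checked or written to disk.
rows = list(csv.reader(io.StringIO(reply.choices[0].message.content.strip())))
for row in rows:
    print(row)
```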

Now, we're going to take this crudely drawn mockup of a potential app or website and ask it to output the code for it. Let's see if it actually works. Give me the HTML code for this. Here we go; the code looks accurate. I think we could actually do something a little bit more complicated next, but let's see. Here it is, very simple: you have vanilla as a dropdown option, you can click next, and it does submit that as the flavor. Again, very simple, but it seems to work fine.

Now, on the Vultr homepage, I'm going to take a screenshot of one little section of the page, and we're going to load it up and see if it can reproduce it. Give me the HTML for this. Okay, here it is. Definitely a lot more code this time. We'll see if it's actually able to get the right colors, the right column sizes, and everything like that. It's definitely reading the text properly. Copy the code, and there it is. Not exactly what I wanted, but we really only asked for one little area of the page. It read a lot of it correctly, but of course it doesn't have the images, and everything is laid out vertically rather than horizontally. It did okay, not really great.

Next, for a brand new test, we are going to find Waldo. We have a picture of a "Where's Waldo?" scene. Let's load it up. I'm not going to describe what Waldo looks like; I'm simply going to say, "Where's Waldo in this picture?" "In the image, 'Where's Waldo?' (or 'Where's Wally?') refers to a character in the popular series of puzzle books where readers are challenged to find Waldo, or Wally, among large crowds of people." Great. It didn't actually tell me where Waldo is; it told me how to find Waldo. So I'm going to say, "Look in the image I just gave you. Tell me where to look for Waldo." "Waldo is located towards the right side of the picture, near the middle of the beach scene. Here are more specific steps to find him: start by looking at the right half of the image, and focus on the area where there are several groups of people, particularly where you see people walking and standing." That's everyone, so let's get more specific. Tell me more specifically where Waldo is in the picture using a coordinate system. Let's see if that works. With the top left at (0, 0) and the bottom right at (1, 1), it says to look at about 0.65 across and 0.45 down. 0.65 across would be right about here. This was actually a hard one because the picture is kind of low resolution, but it was right: he's right there at (0.65, 0.45), and so we were able to find Waldo. This was definitely more difficult than I thought it would be.
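For reference, translating that normalized answer back onto the image is just a multiplication; here is a tiny sketch, with the image dimensions as example values.

```python
# Convert normalized coordinates (top left = (0, 0), bottom right = (1, 1))
# into pixel coordinates for a specific image. Dimensions here are example values.
def to_pixels(x_norm: float, y_norm: float, width: int, height: int) -> tuple[int, int]:
    return round(x_norm * width), round(y_norm * height)

# The model's answer from above: roughly 0.65 across, 0.45 down.
print(to_pixels(0.65, 0.45, width=2000, height=1200))   # -> (1300, 540)
```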

Alright, that's it. This is an extremely capable vision model by Mistral, and I definitely encourage you to check it out. Thanks once more to Vultr; I'll drop links in the description below. Check them out: $300 of free credit with code Berman300. Thank you to Vultr again. If you enjoyed this video, please consider giving it a like and subscribing, and I'll see you in the next one.