Microsoft brings out a small language model that can look at pictures


Phi-3-vision is a multimodal model — meaning it can process both text and images — and is designed to run on mobile devices. Microsoft says Phi-3-vision, now available in preview, is a 4.2 billion parameter model (parameter count is a rough measure of a model's size and capacity) that can handle general visual reasoning tasks, such as answering questions about charts or images.

But Phi-3-vision is far smaller than image-focused AI models like OpenAI's DALL-E or Stability AI's Stable Diffusion. Unlike those models, Phi-3-vision doesn't generate images; instead, it can understand what's in an image and analyze it for a user.

Microsoft announced Phi-3 in April with the release of Phi-3-mini, the smallest Phi-3 model at 3.8 billion parameters. The Phi-3 family has two other members: Phi-3-small (7 billion parameters) and Phi-3-medium (14 billion parameters).

AI model developers have been releasing small, lightweight models like Phi-3 as demand grows for more cost-effective, less compute-intensive AI services. Small models can power AI features on devices like phones and laptops without taking up too much memory. Microsoft has already released other small models in addition to Phi-3 and its predecessor, Phi-2. Its math problem-solving model, Orca-Math, reportedly answers math questions better than larger counterparts such as Google's Gemini Pro.

Phi-3-vision is currently in preview, while the other members of the Phi-3 family — Phi-3-mini, Phi-3-small, and Phi-3-medium — are available through Azure's model library.


