AI Models Guide

vpick integrates multiple AI models covering image, video, voice, music, and lipsync generation. This guide describes each model and compares cost, duration, and features.


Image Models

Nano Banana 2 (Default)

Item Details
Cost $0.16 / image
Features High-quality general-purpose image generation, multi-reference image support

Nano Banana 2 is vpick's default image model, offering a good balance of quality, speed, and price. It accepts multiple reference images, which makes it a good fit for product photos, creative designs, and similar work.

Supported aspect ratios: 1:1, 16:9, 9:16, 4:3, 3:4

Grok Imagine

Item Details
Cost $0.06 / call (text-to-image: 6 images, image-to-image: 2 images)
Features Great value, multiple outputs per call

Grok Imagine is budget-friendly: one text-to-image call produces 6 images, and one image-to-image call (which requires connected reference images) produces 2 images. It suits creative brainstorming where you want many variations.

Seedream

Item Details
Cost $0.0825 / image
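Features Low per-image price, supports multiple aspect ratios

Seedream costs about half as much per image as Nano Banana 2 ($0.0825 vs. $0.16), which makes it a good fit for bulk generation or budget-conscious projects. Counted per image, Grok Imagine's multi-image calls are cheaper still.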

Supported aspect ratios: 1:1, 16:9, 9:16, 4:3, 3:4, 21:9

Two resolution options: Seedream (2K) and Seedream HD (3K), both priced at $0.0825 per call.


Video Models

Veo 3.1 Fast (Default)

Item Details
Cost $0.90 / video (fixed 8 seconds)
Duration 8 seconds (fixed)
Sound Supported
Features Top-tier quality, stunning visuals, built-in sound

Veo 3.1 Fast is vpick's default video model with exceptional visual quality and built-in sound generation. Ideal for high-quality showcase content.

Kling 3.0

Item Details
Cost Standard $0.30/sec, Pro $0.405/sec
Duration 3-15 seconds
Modes Standard (720p) / Pro (1080p)
Sound Supported
Features Stable, start/end frame support, sound, flexible duration

Kling 3.0 is reliable with flexible duration from 3 to 15 seconds, plus start/end frame control and sound generation. Pro mode outputs 1080p.

Supported aspect ratios: 1:1, 16:9, 9:16
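
Because Kling 3.0 is priced per second, the cost of a clip depends on its length and mode. The sketch below estimates that cost, assuming billing is simply rate times duration in whole seconds. The function name `kling_cost` is hypothetical and is not part of vpick.

```python
# Per-second rates in USD, copied from the Kling 3.0 table above.
KLING_RATES = {"standard": 0.30, "pro": 0.405}


def kling_cost(duration_s: int, mode: str = "standard") -> float:
    """Estimate the price in USD of one Kling 3.0 clip.

    Assumption: billing is linear (rate x seconds); the guide does not
    state how partial seconds are handled.
    """
    if mode not in KLING_RATES:
        raise ValueError(f"unknown mode: {mode!r}")
    if not 3 <= duration_s <= 15:
        raise ValueError("duration must be between 3 and 15 seconds")
    return round(KLING_RATES[mode] * duration_s, 4)


print(kling_cost(5))          # 5 seconds in Standard mode
print(kling_cost(10, "pro"))  # 10 seconds in Pro mode
```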

Advanced: MultiShot

MultiShot mode lets you split one video into 1-5 segments, each with an independent prompt and duration (1-12 seconds), totaling 3-15 seconds. Sound is automatically enabled when MultiShot is active. Requires a connected start frame image.
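
The MultiShot limits interact: every segment can be valid on its own while the total still falls outside the allowed range. A minimal sketch of a pre-flight check follows. `validate_multishot` is a hypothetical helper, not a vpick function, and it assumes durations are whole seconds.

```python
def validate_multishot(segment_durations: list[int]) -> None:
    """Raise ValueError if the segments break the MultiShot limits above."""
    if not 1 <= len(segment_durations) <= 5:
        raise ValueError("MultiShot needs 1 to 5 segments")
    for i, seconds in enumerate(segment_durations, start=1):
        if not 1 <= seconds <= 12:
            raise ValueError(f"segment {i}: duration must be 1 to 12 seconds")
    total = sum(segment_durations)
    if not 3 <= total <= 15:
        raise ValueError(f"total duration is {total}s, must be 3 to 15 seconds")


validate_multishot([4, 6, 5])      # passes: 3 segments, 15 seconds total
try:
    validate_multishot([12, 12])   # each segment is valid, but the total is 24 seconds
except ValueError as err:
    print(err)
```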

Advanced: Elements

Elements lets you define character or object references with images, then reference them in prompts using @element_name. Each element can have 2-50 reference images, and the AI will maintain consistent character appearance. If no start frame image is connected, the first element's images are used automatically.
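
To catch typos before running a generation, it can help to check that every @element_name in a prompt matches a defined element and that each element has 2 to 50 reference images. The sketch below is illustrative only: `check_elements` is a hypothetical helper, and it assumes element names consist of letters, digits, and underscores, which the guide does not specify.

```python
import re

# Assumption: element names are made of letters, digits, and underscores.
ELEMENT_REF = re.compile(r"@(\w+)")


def check_elements(prompt: str, elements: dict[str, list[str]]) -> list[str]:
    """Return the element names referenced in the prompt after validating them."""
    for name, images in elements.items():
        if not 2 <= len(images) <= 50:
            raise ValueError(f"element {name!r} needs 2 to 50 reference images")
    used = ELEMENT_REF.findall(prompt)
    missing = [name for name in used if name not in elements]
    if missing:
        raise ValueError(f"prompt references undefined elements: {missing}")
    return used


# Hypothetical element with two reference images.
elements = {"hero": ["hero_front.png", "hero_side.png"]}
print(check_elements("@hero walks through a neon-lit street", elements))
```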

Grok Video

Item Details
Cost 480p: $0.15-$0.45 / 720p: $0.30-$0.60 (by duration)
Duration 6, 10, or 15 seconds
Modes 480p / 720p
Sound Supported
Features Multiple durations, great value, sound support

Grok Video offers three durations (6, 10, or 15 seconds), 480p or 720p output, and sound generation. Its pricing makes it a good fit for social media content.

Generation modes: Fun (creative), Normal (balanced), Spicy (dynamic)

Runway Gen4

Item Details
Cost 720p-5s: $0.18 / 720p-10s: $0.45 / 1080p-5s: $0.45
Duration 5 seconds (720p/1080p), 10 seconds (720p only)
Sound Not supported
Features Excellent video quality, 1080p support

Runway Gen4 is known for outstanding video quality and is one of the few models supporting 1080p. 5-second clips are the most reliable.

Supported aspect ratios: 16:9, 4:3, 1:1, 3:4, 9:16
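
Runway Gen4 prices are fixed per resolution and duration, and 10-second clips are listed only for 720p. A lookup table captures both facts, as in the sketch below. `runway_cost` is a hypothetical helper, not a vpick function.

```python
# Prices in USD copied from the Runway Gen4 table above,
# keyed by (resolution, duration in seconds).
RUNWAY_PRICES = {
    ("720p", 5): 0.18,
    ("720p", 10): 0.45,
    ("1080p", 5): 0.45,
}


def runway_cost(resolution: str, duration_s: int) -> float:
    """Look up the price, rejecting combinations the table does not list."""
    try:
        return RUNWAY_PRICES[(resolution, duration_s)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {resolution} at {duration_s}s"
        ) from None


print(runway_cost("720p", 5))
try:
    runway_cost("1080p", 10)  # 10-second clips are 720p only
except ValueError as err:
    print(err)
```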


Video Model Comparison

Model          Cost               Duration   Sound   Quality
Veo 3.1 Fast   $0.90/video        8s         Yes     Best
Kling 3.0      $0.30-0.405/s      3-15s      Yes     Great
Grok Video     $0.15-0.60/video   6/10/15s   Yes     Good
Runway Gen4    $0.18-0.45/video   5/10s      No      Great

Recommendations:


Voice Model

ElevenLabs Text-to-Dialogue V3

Item Details
Cost $0.21 / call
Features Multiple voices, multi-language, high-quality TTS

ElevenLabs V3 is one of the most natural AI voice synthesis models available.

Available voices:

Name      Gender   Style
Roger     Male     Calm and clear
Sarah     Female   Natural and friendly
Brian     Male     Energetic
Adam      Male     Deep and authoritative
Lily      Female   Soft and soothing
Bill      Male     Mature and steady
Laura     Female   Bright and vivid
Chris     Male     Versatile
Jessica   Female   Warm and friendly

Supported languages: Auto-detect, English, Chinese, Japanese, Korean, Spanish, French, German, Portuguese, Italian

Parameters:


Music Model

Suno V4.5

Item Details
Cost $0.10 / song
Features High-quality AI music generation, custom mode support

Suno V4.5 generates complete music tracks from text descriptions with professional-grade quality.

Modes:

Options:


Lipsync Model

Kling Avatar

Item Details
Cost Standard $0.12/sec, Pro $0.24/sec
Features Turns a static portrait into a talking video

Kling Avatar combines a portrait photo and an audio clip into a "talking video" where the person's lip movements precisely match the audio.

How to use:

  1. Connect a portrait photo (image-in)
  2. Connect an audio clip (audio-in)
  3. Run lipsync generation
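
Since Kling Avatar is billed per second, the length of the audio clip largely determines the price. The following sketch estimates it, assuming the output video is as long as the audio and that billing is linear in seconds. Neither assumption is stated in this guide, and `avatar_cost` is a hypothetical helper.

```python
# Per-second rates in USD, copied from the Kling Avatar table above.
AVATAR_RATES = {"standard": 0.12, "pro": 0.24}


def avatar_cost(audio_seconds: float, mode: str = "standard") -> float:
    """Estimate the lipsync price in USD for an audio clip of the given length.

    Assumption: the video length equals the audio length and the price is
    rate x seconds, with no rounding to whole seconds.
    """
    if mode not in AVATAR_RATES:
        raise ValueError(f"unknown mode: {mode!r}")
    if audio_seconds <= 0:
        raise ValueError("audio length must be positive")
    return round(AVATAR_RATES[mode] * audio_seconds, 4)


print(avatar_cost(20))         # 20-second clip in Standard mode
print(avatar_cost(20, "pro"))  # same clip in Pro mode
```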

Best practices:


Pricing Overview

Images

Model           Cost             Output
Nano Banana 2   $0.16 / call     1 image
Grok Imagine    $0.06 / call     6 images (text-to-image) / 2 images (image-to-image)
Seedream        $0.0825 / call   1 image (2K)
Seedream HD     $0.0825 / call   1 image (3K)
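
Per-call prices are hard to compare directly because the models return different numbers of images. As a rough illustration, the sketch below divides each price by the images returned per call, using the figures from the table. The dictionary layout and function name are illustrative only.

```python
# (price per call in USD, images returned per call), from the table above.
IMAGE_MODELS = {
    "Nano Banana 2": (0.16, 1),
    "Grok Imagine (text-to-image)": (0.06, 6),
    "Grok Imagine (image-to-image)": (0.06, 2),
    "Seedream": (0.0825, 1),
}


def cost_per_image(model: str) -> float:
    """Price of a single image in USD, rounded to four decimals."""
    price, images = IMAGE_MODELS[model]
    return round(price / images, 4)


# List the models from cheapest to most expensive per image.
for name in sorted(IMAGE_MODELS, key=cost_per_image):
    print(f"{name}: ${cost_per_image(name):.4f} per image")
```

This counts every returned image as equally useful. If only one of Grok Imagine's outputs is kept, the effective price is the full per-call price.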

Videos

Model Cost
Veo 3.1 Fast $0.90 / video (8s)
Kling 3.0 $0.30-$0.405 / sec (3-15s)
Grok Video $0.15-$0.60 / video (6/10/15s)
Runway Gen4 $0.18-$0.45 / video (5/10s)
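
To compare fixed-price clips with Kling 3.0's per-second pricing, each fixed price can be converted to an effective rate per second. The sketch below does this with the prices listed in this guide. Grok Video is left out because the guide gives only price ranges, not the price for each duration. The names used are illustrative.

```python
# (price in USD, clip length in seconds) for the fixed-price options in this guide.
FIXED_PRICE_CLIPS = {
    "Veo 3.1 Fast": (0.90, 8),
    "Runway Gen4 720p 5s": (0.18, 5),
    "Runway Gen4 720p 10s": (0.45, 10),
    "Runway Gen4 1080p 5s": (0.45, 5),
}

# Kling 3.0 is already priced per second.
PER_SECOND = {"Kling 3.0 Standard": 0.30, "Kling 3.0 Pro": 0.405}


def per_second_rates() -> dict[str, float]:
    """Return the effective price per second in USD for each option."""
    rates = {
        name: round(price / seconds, 4)
        for name, (price, seconds) in FIXED_PRICE_CLIPS.items()
    }
    rates.update(PER_SECOND)
    return rates


for name, rate in sorted(per_second_rates().items(), key=lambda item: item[1]):
    print(f"{name}: ${rate:.4f} per second")
```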

Audio and Lipsync

Model Cost
ElevenLabs V3 $0.21 / call
Suno V4.5 $0.10 / song
Kling Avatar (Standard) $0.12 / sec
Kling Avatar (Pro) $0.24 / sec