AI Models Guide
vpick integrates multiple AI models for image, video, voice, music, and lipsync generation. This guide describes each model and compares their pricing and capabilities.
Image Models
Nano Banana 2 (Default)
| Item | Details |
|---|---|
| Cost | $0.16 / image |
| Features | High-quality general-purpose image generation, multi-reference image support |
Nano Banana 2 is vpick's default image model, offering a good balance of quality, speed, and price. It accepts multiple reference images, which makes it well suited to product photos, creative designs, and similar work.
Supported aspect ratios: 1:1, 16:9, 9:16, 4:3, 3:4
Grok Imagine
| Item | Details |
|---|---|
| Cost | $0.06 / call (text-to-image: 6 images, image-to-image: 2 images) |
| Features | Great value, multiple outputs per call |
Grok Imagine is budget-friendly: text-to-image produces 6 images per call (about $0.01 per image), and image-to-image, which requires connected reference images, produces 2 (about $0.03 per image). It is well suited to creative brainstorming that needs many variations.
Seedream
| Item | Details |
|---|---|
| Cost | $0.0825 / image |
| Features | Best value, supports multiple aspect ratios |
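Seedream is the lowest-priced model that generates a single image per call; only Grok Imagine, which always returns a batch, costs less per image. It is ideal for bulk generation or budget-conscious projects.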
Supported aspect ratios: 1:1, 16:9, 9:16, 4:3, 3:4, 21:9
Two resolution options:
- Seedream (2K): Standard resolution
- Seedream HD (3K): High resolution
Video Models
Veo 3.1 Fast (Default)
| Item | Details |
|---|---|
| Cost | $0.90 / video (fixed 8 seconds) |
| Duration | 8 seconds (fixed) |
| Sound | Supported |
| Features | Top-tier quality, stunning visuals, built-in sound |
Veo 3.1 Fast is vpick's default video model with exceptional visual quality and built-in sound generation. Ideal for high-quality showcase content.
Kling 3.0
| Item | Details |
|---|---|
| Cost | Standard $0.30/sec, Pro $0.405/sec |
| Duration | 3-15 seconds |
| Modes | Standard (720p) / Pro (1080p) |
| Sound | Supported |
| Features | Stable, start/end frame support, sound, flexible duration |
Kling 3.0 is reliable with flexible duration from 3 to 15 seconds, plus start/end frame control and sound generation. Pro mode outputs 1080p.
Supported aspect ratios: 1:1, 16:9, 9:16
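Because Kling 3.0 is priced per second, the cost of a clip depends on its length and mode. The following is a minimal sketch of that arithmetic using the rates from the table above. It assumes billing is linear in clip length and that durations are whole seconds; the function name `kling_cost` is hypothetical and not part of vpick.

```python
from decimal import Decimal

# Per-second rates from the Kling 3.0 table, in USD.
KLING_RATES = {"standard": Decimal("0.30"), "pro": Decimal("0.405")}

def kling_cost(seconds: int, mode: str = "standard") -> Decimal:
    """Estimate the cost of one clip, assuming cost = rate x seconds."""
    if mode not in KLING_RATES:
        raise ValueError(f"unknown mode: {mode!r}")
    if not 3 <= seconds <= 15:
        raise ValueError("Kling 3.0 clips run from 3 to 15 seconds")
    return KLING_RATES[mode] * seconds

print(kling_cost(5))          # 1.50
print(kling_cost(10, "pro"))  # 4.050
```

Under these assumptions, a 10-second Pro clip costs more than four times as much as a fixed-price Veo 3.1 Fast clip, so shorter durations matter when budget is a concern.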
Advanced: MultiShot
MultiShot mode lets you split one video into 1-5 segments, each with an independent prompt and duration (1-12 seconds), totaling 3-15 seconds. Sound is automatically enabled when MultiShot is active. Requires a connected start frame image.
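The MultiShot limits above can be checked before submitting a job. This is a sketch only: the function name `validate_multishot` and its arguments are hypothetical, and it assumes segment durations are given in whole seconds.

```python
def validate_multishot(durations: list[int], has_start_frame: bool) -> list[str]:
    """Return a list of rule violations; an empty list means the plan is valid."""
    errors = []
    if not 1 <= len(durations) <= 5:
        errors.append("MultiShot needs 1 to 5 segments")
    if any(not 1 <= d <= 12 for d in durations):
        errors.append("each segment must last 1 to 12 seconds")
    if not 3 <= sum(durations) <= 15:
        errors.append("total duration must be 3 to 15 seconds")
    if not has_start_frame:
        errors.append("a start frame image must be connected")
    return errors

print(validate_multishot([4, 5, 6], has_start_frame=True))  # []
print(validate_multishot([12, 12], has_start_frame=False))
```

The second call reports two problems: the total of 24 seconds exceeds the 15-second limit, and no start frame is connected.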
Advanced: Elements
Elements lets you define character or object references with images, then reference them in prompts using @element_name. Each element can have 2-50 reference images, and the AI will maintain consistent character appearance. If no start frame image is connected, the first element's images are used automatically.
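To illustrate how @element_name references relate to element definitions, the sketch below scans a prompt for references and checks each element's image count against the 2-50 limit. The helper `check_elements`, the dictionary layout, and the assumption that element names contain only letters, digits, and underscores are all hypothetical; vpick's actual naming rules may differ.

```python
import re

# Assumption: element names consist of letters, digits, and underscores.
ELEMENT_REF = re.compile(r"@(\w+)")

def check_elements(prompt: str, elements: dict[str, list[str]]) -> list[str]:
    """Report undefined references and elements with an invalid image count."""
    problems = []
    for name in ELEMENT_REF.findall(prompt):
        if name not in elements:
            problems.append(f"@{name} is not defined")
    for name, images in elements.items():
        if not 2 <= len(images) <= 50:
            problems.append(f"{name} has {len(images)} reference image(s); expected 2 to 50")
    return problems

elements = {"hero": ["hero_front.png", "hero_side.png"], "dog": ["dog.png"]}
print(check_elements("@hero walks the @dog past a @robot", elements))
```

Here `@robot` is flagged as undefined and `dog` is flagged for having only one reference image.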
Grok Video
| Item | Details |
|---|---|
| Cost | 480p: $0.15-$0.45 / 720p: $0.30-$0.60 (by duration) |
| Duration | 6, 10, or 15 seconds |
| Modes | 480p / 720p |
| Sound | Supported |
| Features | Multiple durations, great value, sound support |
Grok Video offers flexible duration options (6/10/15 seconds), 480p and 720p quality, and sound generation. Reasonably priced, great for social media content.
Generation modes: Fun (creative), Normal (balanced), Spicy (dynamic)
Runway Gen4
| Item | Details |
|---|---|
| Cost | 720p-5s: $0.18 / 720p-10s: $0.45 / 1080p-5s: $0.45 |
| Duration | 5 seconds (720p/1080p), 10 seconds (720p only) |
| Sound | Not supported |
| Features | Excellent video quality, 1080p support |
Runway Gen4 is known for outstanding video quality and is one of the few models supporting 1080p. 5-second clips are the most reliable.
Supported aspect ratios: 16:9, 4:3, 1:1, 3:4, 9:16
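Runway Gen4 is priced per resolution and duration combination rather than per second, and 1080p is not offered at 10 seconds. A lookup table captures this more directly than a formula. The sketch below uses the prices from the table above; the name `runway_price` is hypothetical.

```python
# Prices from the Runway Gen4 table, keyed by (resolution, seconds), in USD.
RUNWAY_PRICES = {
    ("720p", 5): 0.18,
    ("720p", 10): 0.45,
    ("1080p", 5): 0.45,
}

def runway_price(resolution: str, seconds: int) -> float:
    """Look up the price of a clip, rejecting combinations that are not offered."""
    try:
        return RUNWAY_PRICES[(resolution, seconds)]
    except KeyError:
        raise ValueError(
            f"Runway Gen4 does not offer {resolution} at {seconds}s"
        ) from None

print(runway_price("720p", 5))   # 0.18
print(runway_price("1080p", 5))  # 0.45
```

Calling `runway_price("1080p", 10)` raises a `ValueError`, mirroring the 720p-only restriction on 10-second clips.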
Video Model Comparison
| Model | Cost | Duration | Sound | Quality |
|---|---|---|---|---|
| Veo 3.1 Fast | $0.90/video | 8s | Yes | Best |
| Kling 3.0 | $0.30-$0.405/sec | 3-15s | Yes | Great |
| Grok Video | $0.15-$0.60/video | 6/10/15s | Yes | Good |
| Runway Gen4 | $0.18-$0.45/video | 5/10s | No | Great |
Recommendations:
- Best quality: Veo 3.1 Fast (default)
- Need sound: Veo 3.1 Fast, Kling 3.0, or Grok Video
- Best value: Grok Video (480p-6s $0.15) or Runway Gen4 (720p-5s $0.18)
- Longer videos: Kling 3.0 (up to 15s) or Grok Video (up to 15s)
- High resolution: Kling 3.0 Pro or Runway Gen4 (1080p)
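The recommendations above amount to filtering the comparison table by requirement. As a rough sketch, the code below filters by clip length and sound support. The `VideoModel` class and `pick_models` function are hypothetical, and treating Kling 3.0 as accepting every whole second from 3 to 15 is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VideoModel:
    name: str
    durations: tuple[int, ...]  # supported clip lengths, in seconds
    sound: bool

# Values copied from the comparison table. Treating Kling 3.0 as accepting
# every whole second from 3 to 15 is an assumption.
VIDEO_MODELS = [
    VideoModel("Veo 3.1 Fast", (8,), True),
    VideoModel("Kling 3.0", tuple(range(3, 16)), True),
    VideoModel("Grok Video", (6, 10, 15), True),
    VideoModel("Runway Gen4", (5, 10), False),
]

def pick_models(seconds: int, need_sound: bool) -> list[str]:
    """Return the names of models that support the requested length and sound."""
    return [
        m.name
        for m in VIDEO_MODELS
        if seconds in m.durations and (m.sound or not need_sound)
    ]

print(pick_models(10, need_sound=True))   # ['Kling 3.0', 'Grok Video']
print(pick_models(5, need_sound=False))   # ['Kling 3.0', 'Runway Gen4']
```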
Voice Model
ElevenLabs Text-to-Dialogue V3
| Item | Details |
|---|---|
| Cost | $0.21 / call |
| Features | Multiple voices, multi-language, high-quality TTS |
ElevenLabs V3 is one of the most natural AI voice synthesis models available.
Available voices:
| Name | Gender | Style |
|---|---|---|
| Roger | Male | Calm and clear |
| Sarah | Female | Natural and friendly |
| Brian | Male | Energetic |
| Adam | Male | Deep and authoritative |
| Lily | Female | Soft and soothing |
| Bill | Male | Mature and steady |
| Laura | Female | Bright and vivid |
| Chris | Male | Versatile |
| Jessica | Female | Warm and friendly |
Supported languages: Auto-detect, English, Chinese, Japanese, Korean, Spanish, French, German, Portuguese, Italian
Parameters:
- Stability: 0.0-1.0. Higher values produce a more consistent voice; lower values add more expressiveness and emotion.
Music Model
Suno V4.5
| Item | Details |
|---|---|
| Cost | $0.10 / song |
| Features | High-quality AI music generation, custom mode support |
Suno V4.5 generates complete music tracks from text descriptions with professional-grade quality.
Modes:
- Simple mode: Enter a text description, and Suno automatically decides the style and lyrics
- Custom mode: Specify music style and song title
Options:
- Instrumental: Enable to generate vocal-free music, perfect for background music
- Vocal Gender: Auto / Male / Female
Lipsync Model
Kling Avatar
| Item | Details |
|---|---|
| Cost | Standard $0.12/sec, Pro $0.24/sec |
| Features | Turns a static portrait into a talking video |
Kling Avatar combines a portrait photo and an audio clip into a "talking video" where the person's lip movements precisely match the audio.
How to use:
- Connect a portrait photo (image-in)
- Connect an audio clip (audio-in)
- Run lipsync generation
Best practices:
- Use a front-facing, clear portrait photo
- The person's mouth should preferably be closed
- Higher audio quality produces better lip-sync results
Pricing Overview
Images
| Model | Cost | Output |
|---|---|---|
| Nano Banana 2 | $0.16 / call | 1 image |
| Grok Imagine | $0.06 / call | 6 images (text) / 2 images (image-to-image) |
| Seedream | $0.0825 / call | 1 image (2K) |
| Seedream HD | $0.0825 / call | 1 image (3K) |
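Since the image models return different numbers of images per call, per-call prices are not directly comparable. The sketch below divides each price by its output count to get an effective per-image cost, using only the figures from the table above. The names `IMAGE_MODELS` and `cost_per_image` are illustrative.

```python
# (price per call in USD, images returned per call), copied from the table above.
IMAGE_MODELS = {
    "Nano Banana 2": (0.16, 1),
    "Grok Imagine (text-to-image)": (0.06, 6),
    "Grok Imagine (image-to-image)": (0.06, 2),
    "Seedream": (0.0825, 1),
    "Seedream HD": (0.0825, 1),
}

def cost_per_image(model: str) -> float:
    """Effective cost of a single image for the given model."""
    price, images = IMAGE_MODELS[model]
    return price / images

# List the models from cheapest to most expensive per image.
for name in sorted(IMAGE_MODELS, key=cost_per_image):
    print(f"{name}: ${cost_per_image(name):.4f} per image")
```

The per-image figure only matters if every image in a batch is useful; when a single image is needed, the per-call price is the better comparison.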
Videos
| Model | Cost |
|---|---|
| Veo 3.1 Fast | $0.90 / video (8s) |
| Kling 3.0 | $0.30-$0.405 / sec (3-15s) |
| Grok Video | $0.15-$0.60 / video (6/10/15s) |
| Runway Gen4 | $0.18-$0.45 / video (5/10s) |
Audio
| Model | Cost |
|---|---|
| ElevenLabs V3 | $0.21 / call |
| Suno V4.5 | $0.10 / song |
| Kling Avatar (Standard) | $0.12 / sec |
| Kling Avatar (Pro) | $0.24 / sec |