Gemini Multimodal Understanding
Gemini models can understand multiple content modalities, including images, video, and audio.
POST /v1beta/models/{model}:generateContent
Image Understanding
curl "https://crazyrouter.com/v1beta/models/gemini-2.5-flash:generateContent?key=YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {
        "role": "user",
        "parts": [
          {"text": "Describe the content of this image in detail"},
          {
            "inlineData": {
              "mimeType": "image/jpeg",
              "data": "/9j/4AAQSkZJRgABAQAA..."
            }
          }
        ]
      }
    ]
  }'
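The same request can be assembled from a local file in Python. The following is a minimal sketch using only the standard library. The helper names `build_image_request`, `generate`, and `extract_text` are hypothetical, and the response shape (`candidates[0].content.parts[].text`) is assumed to follow the standard Gemini `generateContent` response rather than confirmed for this endpoint.

```python
import base64
import json
import urllib.request

BASE_URL = "https://crazyrouter.com/v1beta/models"  # host taken from the curl example


def build_image_request(prompt: str, image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build a generateContent body with one text part and one inline image part."""
    return {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"text": prompt},
                    {
                        "inlineData": {
                            "mimeType": mime_type,
                            # inlineData.data carries base64 text, not raw bytes
                            "data": base64.b64encode(image_bytes).decode("ascii"),
                        }
                    },
                ],
            }
        ]
    }


def generate(payload: dict, api_key: str, model: str = "gemini-2.5-flash") -> dict:
    """POST the payload to the generateContent endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        f"{BASE_URL}/{model}:generateContent?key={api_key}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_text(result: dict) -> str:
    """Join the text parts of the first candidate (assumed response shape)."""
    parts = result["candidates"][0]["content"]["parts"]
    return "".join(part.get("text", "") for part in parts)
```

A typical call would read the image bytes from disk, pass them to `build_image_request`, send the result with `generate`, and print `extract_text` of the response.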
Video Understanding
Send video via inline data or file URI:
curl "https://crazyrouter.com/v1beta/models/gemini-2.5-flash:generateContent?key=YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {
        "role": "user",
        "parts": [
          {"text": "Describe the content of this video and list the key scenes"},
          {
            "inlineData": {
              "mimeType": "video/mp4",
              "data": "AAAAIGZ0eXBpc29t..."
            }
          }
        ]
      }
    ]
  }'
Audio Understanding
curl "https://crazyrouter.com/v1beta/models/gemini-2.5-flash:generateContent?key=YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {
        "role": "user",
        "parts": [
          {"text": "Transcribe this audio and summarize the main points"},
          {
            "inlineData": {
              "mimeType": "audio/mp3",
              "data": "SUQzBAAAAAAAI1RTU0..."
            }
          }
        ]
      }
    ]
  }'
Multi-Image Comparison
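# Assumes the google-generativeai SDK; image1_data, image2_data, and image3_data hold raw image bytes.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # pointing the SDK at this endpoint is not shown here
model = genai.GenerativeModel("gemini-2.5-flash")

response = model.generate_content([
    "Compare these three product images and analyze the design features, pros, and cons of each",
    {"mime_type": "image/jpeg", "data": image1_data},
    {"mime_type": "image/jpeg", "data": image2_data},
    {"mime_type": "image/jpeg", "data": image3_data},
])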
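For callers using the REST endpoint directly rather than an SDK, the same comparison maps onto a single `parts` array. The following is a sketch of that conversion, assuming the image values are raw bytes. `to_rest_parts` is a hypothetical helper, not part of any SDK, and the placeholder byte strings stand in for real image files.

```python
import base64


def to_rest_parts(items: list) -> list:
    """Convert SDK-style content items into REST `parts` entries.

    Strings become text parts; {"mime_type", "data"} dicts holding raw bytes
    become inlineData parts with base64-encoded data.
    """
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append({"text": item})
        else:
            parts.append({
                "inlineData": {
                    "mimeType": item["mime_type"],
                    "data": base64.b64encode(item["data"]).decode("ascii"),
                }
            })
    return parts


# Placeholder bytes stand in for the contents of three real image files.
images = [b"image-one", b"image-two", b"image-three"]

body = {
    "contents": [{
        "role": "user",
        "parts": to_rest_parts(
            ["Compare these three product images"]
            + [{"mime_type": "image/jpeg", "data": data} for data in images]
        ),
    }]
}
```

The resulting `body` has one text part followed by three `inlineData` parts, matching the structure of the single-image curl request.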
| Type | Supported Formats |
|---|---|
| Image | JPEG, PNG, GIF, WebP, BMP |
| Video | MP4, AVI, MOV, MKV, WebM |
| Audio | MP3, WAV, FLAC, AAC, OGG |
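A client can check a file against this table before sending it. The sketch below maps file extensions to the three types. The extension spellings (for example `.jpg` alongside `.jpeg`) and the helper name `media_type_for` are assumptions; the table remains the authority on what is accepted.

```python
from pathlib import Path
from typing import Optional

# Extensions derived from the supported-formats table (spellings are assumed).
SUPPORTED_EXTENSIONS = {
    "Image": {".jpeg", ".jpg", ".png", ".gif", ".webp", ".bmp"},
    "Video": {".mp4", ".avi", ".mov", ".mkv", ".webm"},
    "Audio": {".mp3", ".wav", ".flac", ".aac", ".ogg"},
}


def media_type_for(filename: str) -> Optional[str]:
    """Return "Image", "Video", or "Audio" for a supported file, else None."""
    suffix = Path(filename).suffix.lower()
    for media_type, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return media_type
    return None
```

Returning `None` for an unknown extension lets the caller reject the file locally instead of spending a request on it.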
When video or audio is sent as inline data, the file size is bounded by the request body size limit. For large files, upload the file to an accessible URL first, then reference it via fileData.
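One way to apply this is to choose between `inlineData` and `fileData` when building the media part. In the sketch below, `media_part` is a hypothetical helper and the 20 MB default threshold is an assumed placeholder, since the actual request body limit is not stated here. The `fileUri` field name follows the standard Gemini `fileData` schema.

```python
import base64
from typing import Optional

ASSUMED_INLINE_LIMIT = 20 * 1024 * 1024  # placeholder value; the real limit is not documented here


def media_part(
    mime_type: str,
    data: Optional[bytes] = None,
    file_uri: Optional[str] = None,
    inline_limit: int = ASSUMED_INLINE_LIMIT,
) -> dict:
    """Build a fileData part from a URL, or an inlineData part from raw bytes."""
    if file_uri is not None:
        # Reference an already-uploaded file instead of embedding it.
        return {"fileData": {"mimeType": mime_type, "fileUri": file_uri}}
    if data is None:
        raise ValueError("provide either data or file_uri")
    encoded = base64.b64encode(data).decode("ascii")
    # Check the encoded length: base64 adds roughly a third to the raw size.
    if len(encoded) > inline_limit:
        raise ValueError("media too large for inline data; upload it and pass file_uri")
    return {"inlineData": {"mimeType": mime_type, "data": encoded}}
```

For example, `media_part("video/mp4", file_uri="https://example.com/clip.mp4")` yields a `fileData` part, where the URL is a placeholder for wherever the file was uploaded.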
Video and audio processing consumes far more tokens than plain text. One minute of video can consume thousands of tokens.
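As a rough illustration of the scale, the estimate below uses per-second rates of 263 tokens for video and 32 tokens for audio. These figures are assumptions taken from Google's published Gemini token-counting guidance; actual usage through this endpoint may differ with the model and media resolution.

```python
# Assumed per-second token rates; not confirmed for this endpoint.
ASSUMED_TOKENS_PER_SECOND = {"video": 263, "audio": 32}


def estimate_media_tokens(duration_seconds: float, kind: str) -> int:
    """Estimate input tokens for a clip of the given duration and kind."""
    return round(duration_seconds * ASSUMED_TOKENS_PER_SECOND[kind])


print(estimate_media_tokens(60, "video"))  # 15780
print(estimate_media_tokens(60, "audio"))  # 1920
```

Under these assumed rates, a ten-minute video would come to about 157,800 input tokens before any prompt text is counted.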