DeepSeek VL2 Tiny

    DeepSeek VL2 Tiny is a compact vision-language model whose performance is comparable to, or even exceeds, that of much larger models.

    API USAGE

    API IDENTIFIER

    deepseek/deepseek-vl2-tiny/fp-16

    import OpenAI from "openai";
    
    const openai = new OpenAI({
      baseURL: "https://api.inference.net/v1",
      apiKey: process.env.INFERENCE_API_KEY,
    });
    
    const completion = await openai.chat.completions.create({
      model: "deepseek/deepseek-vl2-tiny/fp-16",
      messages: [
        {
          role: "user",
          content: "What is the meaning of life?"
        }
      ],
      stream: true,
    });
    
    for await (const chunk of completion) {
      // Some chunks carry no text; fall back to an empty string instead of writing undefined.
      process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
    }
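For Python callers, the same streaming request can be sketched with the `openai` Python package pointed at the same base URL. This is a minimal sketch, assuming the `openai` package (v1 or later) is installed and `INFERENCE_API_KEY` is set in the environment; `build_request` and `stream_completion` are illustrative helper names, not part of any SDK.

```python
import os
import sys

BASE_URL = "https://api.inference.net/v1"
MODEL_ID = "deepseek/deepseek-vl2-tiny/fp-16"


def build_request(prompt: str, stream: bool = True) -> dict:
    """Build keyword arguments for chat.completions.create (hypothetical helper)."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }


def stream_completion(prompt: str) -> None:
    """Stream a reply to stdout, skipping chunks that carry no text."""
    # Imported lazily so this module loads even when the package is absent.
    from openai import OpenAI

    client = OpenAI(base_url=BASE_URL, api_key=os.environ["INFERENCE_API_KEY"])
    for chunk in client.chat.completions.create(**build_request(prompt)):
        # Some chunks have no choices or an empty delta.
        if chunk.choices and chunk.choices[0].delta.content:
            sys.stdout.write(chunk.choices[0].delta.content)
    sys.stdout.write("\n")
```

Calling `stream_completion("What is the meaning of life?")` mirrors the TypeScript example above.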

    MODEL PROVIDER: DeepSeek
    TYPE: Text to Text
    PARAMETERS: 3B
    QUANTIZATION: FP16
    CONTEXT LENGTH: 4K
    PRICING: Input $0.05 / Million Tokens, Output $0.10 / Million Tokens
    JSON MODE
    TOOL CALLING
    DEPLOYMENT: Serverless, Batch
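The listed per-token prices translate into a request cost as follows. This is a small arithmetic sketch using only the two prices quoted above; the function name is illustrative, and actual billing may round or meter differently.

```python
INPUT_USD_PER_MILLION = 0.05   # input price quoted above
OUTPUT_USD_PER_MILLION = 0.10  # output price quoted above


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in US dollars."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    )
    return total / 1_000_000
```

For example, a request with 3,000 input tokens and 1,000 output tokens costs about $0.00025 under these prices.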

    1. Introduction

    Introducing DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL. DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models.

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    GitHub Repository

    Zhiyu Wu*, Xiaokang Chen*, Zizheng Pan*, Xingchao Liu*, Wen Liu**, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan*** (* Equal Contribution, ** Project Lead, *** Corresponding author)

    2. Model Summary

    DeepSeek-VL2-Tiny is built on DeepSeekMoE-3B and has 1.0B activated parameters in total.
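In a Mixture-of-Experts model, all experts are typically kept in memory even though only a subset is activated per token, so weight memory tracks the total parameter count rather than the activated count. The sketch below is a rough estimate only: it assumes a round 3B total parameters (taken from the base model's name) at 2 bytes per parameter (FP16 or BF16), and ignores activations, the KV cache, and framework overhead.

```python
BYTES_PER_PARAM_16BIT = 2  # FP16 and BF16 both use 2 bytes per parameter


def weight_memory_gib(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_16BIT) -> float:
    """Approximate memory needed to hold the weights, in GiB."""
    return num_params * bytes_per_param / 2**30


def activated_fraction(activated_params: float, total_params: float) -> float:
    """Share of the parameters used for each token."""
    return activated_params / total_params


total = 3.0e9      # assumed round figure for total parameters
activated = 1.0e9  # activated parameters stated above

print(f"weights: ~{weight_memory_gib(total):.1f} GiB")
print(f"activated per token: ~{activated_fraction(activated, total):.0%}")
```

Under these assumptions the weights need roughly 5.6 GiB, while about a third of the parameters take part in each forward pass.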

    3. Quick Start

    Installation

    With Python >= 3.8, install the necessary dependencies by running the following command from the repository root:

    pip install -e .
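Before installing, it can help to confirm that the interpreter meets the stated Python >= 3.8 requirement. This is a small standard-library sketch; the helper names are illustrative, and the package list passed to `missing_packages` is an assumption to adjust to your setup.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum version stated above


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Return True when the major.minor version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if not python_is_supported():
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required")
```

After installation, `missing_packages(["torch", "transformers"])` should return an empty list if both packages were installed.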
    

    Notifications

    1. We suggest using a temperature T <= 0.7 when sampling. We observe that a larger temperature decreases generation quality.
    2. To keep the number of tokens manageable in the context window, we apply a dynamic tiling strategy when there are <= 2 images. When there are >= 3 images, we pad the images directly to 384x384 as inputs without tiling.
    3. The main difference between DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2 is the base LLM.
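The image-handling rule in note 2 can be sketched as follows. This illustrates the stated rule only and is not the repository's preprocessing code; the function names, the `text_only` case, and the choice to scale an image to fit before centering it on the canvas are all assumptions.

```python
from typing import Tuple

PAD_SIZE = 384        # side length stated above for the >= 3 image case
MAX_TILED_IMAGES = 2  # dynamic tiling applies to at most this many images


def choose_strategy(num_images: int) -> str:
    """Pick the preprocessing mode from the number of images."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    if num_images == 0:
        return "text_only"
    return "dynamic_tiling" if num_images <= MAX_TILED_IMAGES else "pad_to_384"


def pad_geometry(width: int, height: int, size: int = PAD_SIZE) -> Tuple[int, int, int, int]:
    """Return (new_width, new_height, left, top) for a size x size canvas."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = size / max(width, height)        # fit the longer side to the canvas
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    left = (size - new_w) // 2               # center horizontally
    top = (size - new_h) // 2                # center vertically
    return new_w, new_h, left, top
```

With this sketch, a 768x384 image in a three-image prompt would be scaled to 384x192 and placed 96 pixels from the top of the padded canvas.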

    Simple Inference Example

    import torch
    from transformers import AutoModelForCausalLM
    
    # the DeepSeek-VL2 repository installs its package as `deepseek_vl2`
    from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
    from deepseek_vl2.utils.io import load_pil_images
    
    
    # specify the path to the model
    model_path = "deepseek-ai/deepseek-vl2-tiny"
    vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path)
    tokenizer = vl_chat_processor.tokenizer
    
    vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
    
    ## single image conversation example
    conversation = [
        {
            "role": "<|User|>",
            "content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.",
            "images": ["./images/visual_grounding.jpeg"],
        },
        {"role": "<|Assistant|>", "content": ""},
    ]
    
    ## multiple images (or in-context learning) conversation example
    ## (role names and image tags match the single image example above)
    # conversation = [
    #     {
    #         "role": "<|User|>",
    #         "content": "<image>A dog wearing nothing in the foreground, "
    #                    "<image>a dog wearing a santa hat, "
    #                    "<image>a dog wearing a wizard outfit, and "
    #                    "<image>what's the dog wearing?",
    #         "images": [
    #             "images/dog_a.png",
    #             "images/dog_b.png",
    #             "images/dog_c.png",
    #             "images/dog_d.png",
    #         ],
    #     },
    #     {"role": "<|Assistant|>", "content": ""}
    # ]
    
    # load images and prepare for inputs
    pil_images = load_pil_images(conversation)
    prepare_inputs = vl_chat_processor(
        conversations=conversation,
        images=pil_images,
        force_batchify=True,
        system_prompt=""
    ).to(vl_gpt.device)
    
    # run image encoder to get the image embeddings
    inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
    
    # run the model to get the response
    outputs = vl_gpt.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=prepare_inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        do_sample=False,
        use_cache=True
    )
    
    answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
    print(f"{prepare_inputs['sft_format'][0]}", answer)
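The example above decodes greedily (`do_sample=False`). To sample instead, the temperature recommendation from the notes (T <= 0.7) can be applied when building the generation arguments. This is a small sketch: `sampling_kwargs` is a hypothetical helper, while `do_sample`, `temperature`, `max_new_tokens`, and `use_cache` are standard `generate` arguments in Hugging Face Transformers.

```python
MAX_RECOMMENDED_TEMPERATURE = 0.7  # upper bound suggested in the notes


def sampling_kwargs(temperature: float = 0.7, max_new_tokens: int = 512) -> dict:
    """Build generate() arguments for sampling, capping the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use do_sample=False for greedy decoding")
    return {
        "do_sample": True,
        "temperature": min(temperature, MAX_RECOMMENDED_TEMPERATURE),
        "max_new_tokens": max_new_tokens,
        "use_cache": True,
    }
```

These arguments would replace `max_new_tokens`, `do_sample`, and `use_cache` in the `generate` call above, for example by passing `**sampling_kwargs(0.5)` alongside the embedding, mask, and token-id arguments.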
    

    Gradio Demo (TODO)

    4. License

    This code repository is licensed under the MIT License. Use of the DeepSeek-VL2 models is subject to the DeepSeek Model License. The DeepSeek-VL2 series supports commercial use.

    5. Citation

    @misc{wu2024deepseekvl2mixtureofexpertsvisionlanguagemodels,
          title={DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding}, 
          author={Zhiyu Wu and Xiaokang Chen and Zizheng Pan and Xingchao Liu and Wen Liu and Damai Dai and Huazuo Gao and Yiyang Ma and Chengyue Wu and Bingxuan Wang and Zhenda Xie and Yu Wu and Kai Hu and Jiawei Wang and Yaofeng Sun and Yukun Li and Yishi Piao and Kang Guan and Aixin Liu and Xin Xie and Yuxiang You and Kai Dong and Xingkai Yu and Haowei Zhang and Liang Zhao and Yisong Wang and Chong Ruan},
          year={2024},
          eprint={2412.10302},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2412.10302}, 
    }
    

    6. Contact

    If you have any questions, please raise an issue or contact us at [email protected].
