You are a helpful assistant that classifies AI models and returns JSON descriptions. Here's the model to classify:

## Basic model info

Model name: wan-2.1-i2v-480p
Model description: Accelerated inference for Wan 2.1 14B image-to-video, a comprehensive and open suite of video foundation models that pushes the boundaries of video generation.

## Model inputs

- seed: Random seed. Leave blank for random (integer)
- image: Input image to start generating from (string)
- prompt: Prompt for video generation (string)
- max_area: Maximum area of generated image. The input image will shrink to fit these dimensions (undefined)
- fast_mode: Speed up generation with different levels of acceleration. Faster modes may degrade quality somewhat. The speedup depends on the content, so different videos may see different speedups. (undefined)
- lora_scale: Determines how strongly the main LoRA should be applied. Sane results between 0 and 1 for base inference. For go_fast we apply a 1.5x multiplier to this value; we've generally seen good performance when scaling the base value by that amount. You may still need to experiment to find the best value for your particular LoRA. (number)
- num_frames: Number of video frames. 81 frames give the best results (integer)
- lora_weights: Load LoRA weights. Supports Replicate models in the format <owner>/<model> or <owner>/<model>/<version>, HuggingFace URLs in the format huggingface.co/<owner>/<model>, CivitAI URLs in the format civitai.com/models/<id>[/<model-name>], or arbitrary .safetensors URLs from the Internet. For example, 'fofr/flux-pixar-cars' (string)
- sample_shift: Sample shift factor (number)
- sample_steps: Number of generation steps. Fewer steps means faster generation, at the expense of output quality. 30 steps is sufficient for most prompts (integer)
- frames_per_second: Frames per second. Note that the pricing of this model is based on the video duration at 16 fps (integer)
- sample_guide_scale: Higher guide scale makes prompt adherence better, but can reduce variation (number)

## Model output schema

{
  "type": "string",
  "title": "Output",
  "format": "uri"
}

If the input or output schema includes a format of URI, it is referring to a file.
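For illustration only (not part of the classification task): a minimal sketch of how these inputs might be passed to this model through the Replicate Python client. The `wavespeedai/` owner prefix and the input image URL are assumptions, not taken from this document; the remaining parameter values mirror the examples below.

```python
import replicate  # Replicate's official Python client

# Hypothetical call; "wavespeedai" is an assumed owner prefix -- verify it.
output = replicate.run(
    "wavespeedai/wan-2.1-i2v-480p",
    input={
        "image": "https://example.com/start-frame.png",  # hypothetical image URL
        "prompt": "A cat is sitting on a laptop, it is kneading",
        "max_area": "832x480",
        "num_frames": 81,            # 81 frames give the best results
        "sample_shift": 3,
        "sample_steps": 30,          # sufficient for most prompts
        "sample_guide_scale": 5,
        "frames_per_second": 16,     # pricing is based on duration at 16 fps
    },
)
print(output)  # per the output schema, this resolves to a URI for an .mp4 file
```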
## Example inputs and outputs

Use these example outputs to better understand the types of inputs the model accepts, and the types of outputs the model returns:

Example 1:

Input:
image: https://replicate.delivery/xezq/hoce7uTrrfhLeJraWTX7fODmW0rmfM7QREAcgOCS333LIyfGF/out-0.webp
prompt: In the video, a miniature cat is presented. The cat is held in a person's hands. The person then presses on the cat, causing a sq41sh squish effect. The person keeps pressing down on the cat, further showing the sq41sh squish effect.
max_area: 832x480
fast_mode: Balanced
lora_scale: 1
num_frames: 81
lora_weights: https://huggingface.co/Remade-AI/Squish/resolve/main/squish_18.safetensors
sample_shift: 3
sample_steps: 30
frames_per_second: 16
sample_guide_scale: 5

Output:
"https://replicate.delivery/xezq/aWTfAuK8om0WMCZLoDSJ2n0neiGaUjmPlfqxtmh9nvvhzO4oA/output.mp4"

---------------

Example 2:

Input:
image: https://replicate.delivery/pbxt/MZZyui7brAbh1d2AsyPtgPIByUwzSv6Uou8objC7zXEjLySc/1a8nt7yw5drm80cn05r89mjce0.png
prompt: A woman is talking
max_area: 832x480
fast_mode: Balanced
num_frames: 81
sample_shift: 3
sample_steps: 30
frames_per_second: 16
sample_guide_scale: 5

Output:
"https://replicate.delivery/xezq/B08EdKGBIAK8E9rbNTX9jWO9ScVNbFivMaeXZM9ZUb5HAaKKA/output.mp4"

---------------

Example 3:

Input:
image: https://replicate.delivery/pbxt/MZa482Ysmabp3R1SS1sFnkZdYveZASXa5ncAELe9OdIFeuSt/5bjq05hg8srma0cn3cp8egraew.png
prompt: A cat is sitting on a laptop, it is kneading
max_area: 832x480
num_frames: 81
sample_shift: 3
sample_steps: 30
frames_per_second: 16
sample_guide_scale: 5

Output:
"https://replicate.delivery/xezq/UyajaOSJCu4GD9lCtmytldnSWx0ynkPfedtaTyfRLnijBInoA/output.mp4"
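For illustration only: since the output schema is a URI referring to a file, consuming a result is a plain HTTP download. This sketch fetches Example 3's output, copied verbatim from above.

```python
import urllib.request

# Example 3's output URI, copied verbatim from the examples above.
output_uri = "https://replicate.delivery/xezq/UyajaOSJCu4GD9lCtmytldnSWx0ynkPfedtaTyfRLnijBInoA/output.mp4"

# Download the generated video to a local file.
urllib.request.urlretrieve(output_uri, "output.mp4")
```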
## Task Classification

Based on the information above, please classify the model into one of the following tasks:

- any-to-any:
- audio-classification: Audio classification is the task of assigning a label or class to a given audio. It can be used for recognizing which command a user is giving or the emotion of a statement, as well as identifying a speaker.
- audio-to-audio: Audio-to-Audio is a family of tasks in which the input is an audio and the output is one or multiple generated audios. Some example tasks are speech enhancement and source separation.
- audio-text-to-text:
- automatic-speech-recognition: Automatic Speech Recognition (ASR), also known as Speech to Text (STT), is the task of transcribing a given audio to text. It has many applications, such as voice user interfaces.
- depth-estimation: Depth estimation is the task of predicting depth of the objects present in an image.
- document-question-answering: Document Question Answering (also known as Document Visual Question Answering) is the task of answering questions on document images. Document question answering models take a (document, question) pair as input and return an answer in natural language. Models usually rely on multi-modal features, combining text, position of words (bounding-boxes) and image.
- visual-document-retrieval:
- feature-extraction: Feature extraction is the task of extracting features learnt in a model.
- fill-mask: Masked language modeling is the task of masking some of the words in a sentence and predicting which words should replace those masks. These models are useful when we want to get a statistical understanding of the language the model was trained on.
- graph-ml: undefined
- image-classification: Image classification is the task of assigning a label or class to an entire image. Images are expected to have only one class for each image. Image classification models take an image as input and return a prediction about which class the image belongs to.
- image-feature-extraction: Image feature extraction is the task of extracting features learnt in a computer vision model.
- image-segmentation: Image Segmentation divides an image into segments where each pixel in the image is mapped to an object. This task has multiple variants such as instance segmentation, panoptic segmentation and semantic segmentation.
- image-to-image: Image-to-image is the task of transforming an input image through a variety of possible manipulations and enhancements, such as super-resolution, image inpainting, colorization, and more.
- image-text-to-text: Image-text-to-text models take in an image and text prompt and output text. These models are also called vision-language models, or VLMs. The difference from image-to-text models is that these models take an additional text input, not restricting the model to certain use cases like image captioning, and may also be trained to accept a conversation as input.
- image-to-text: Image to text models output a text from a given image. Image captioning or optical character recognition can be considered as the most common applications of image to text.
- image-to-video: Image-to-video models take in an image (optionally with a text prompt) and generate a video.
- keypoint-detection: Keypoint detection is the task of identifying meaningful distinctive points or features in an image.
- mask-generation: Mask generation is the task of generating masks that identify a specific object or region of interest in a given image. Masks are often used in segmentation tasks, where they provide a precise way to isolate the object of interest for further processing or analysis.
- multiple-choice: undefined
- object-detection: Object Detection models allow users to identify objects of certain defined classes. Object detection models receive an image as input and output the images with bounding boxes and labels on detected objects.
- video-classification: Video classification is the task of assigning a label or class to an entire video. Videos are expected to have only one class for each video. Video classification models take a video as input and return a prediction about which class the video belongs to.
- other: undefined
- question-answering: Question Answering models can retrieve the answer to a question from a given text, which is useful for searching for an answer in a document. Some question answering models can generate answers without context!
- reinforcement-learning: Reinforcement learning is the computational approach of learning from actions by interacting with an environment through trial and error and receiving rewards (negative or positive) as feedback.
- robotics: undefined
- sentence-similarity: Sentence Similarity is the task of determining how similar two texts are. Sentence similarity models convert input texts into vectors (embeddings) that capture semantic information and calculate how close (similar) they are. This task is particularly useful for information retrieval and clustering/grouping.
- summarization: Summarization is the task of producing a shorter version of a document while preserving its important information. Some models can extract text from the original input, while other models can generate entirely new text.
- table-question-answering: Table Question Answering (Table QA) is the task of answering a question about information in a given table.
- table-to-text: undefined
- tabular-classification: Tabular classification is the task of classifying a target category (a group) based on a set of attributes.
- tabular-regression: Tabular regression is the task of predicting a numerical value given a set of attributes.
- tabular-to-text: undefined
- text-classification: Text Classification is the task of assigning a label or class to a given text. Some use cases are sentiment analysis, natural language inference, and assessing grammatical correctness.
- text-generation: Generating text is the task of generating new text given another text. These models can, for example, fill in incomplete text or paraphrase.
- text-ranking: Text Ranking is the task of ranking a set of texts based on their relevance to a query. Text ranking models are trained on large datasets of queries and relevant documents to learn how to rank documents based on their relevance to the query. This task is particularly useful for search engines and information retrieval systems.
- text-retrieval: undefined
- text-to-image: Text-to-image is the task of generating images from input text. These pipelines can also be used to modify and edit images based on text prompts.
- text-to-speech: Text-to-Speech (TTS) is the task of generating natural sounding speech given text input. TTS models can be extended to have a single model that generates speech for multiple speakers and multiple languages.
- text-to-audio: undefined
- text-to-video: Text-to-video models can be used in any application that requires generating a consistent sequence of images from text.
- text2text-generation: undefined
- time-series-forecasting: undefined
- token-classification: Token classification is a natural language understanding task in which a label is assigned to some tokens in a text. Some popular token classification subtasks are Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. NER models could be trained to identify specific entities in a text, such as dates, individuals and places; and PoS tagging would identify, for example, which words in a text are verbs, nouns, and punctuation marks.
- translation: Translation is the task of converting text from one language to another.
- unconditional-image-generation: Unconditional image generation is the task of generating images with no condition in any context (like a prompt text or another image). Once trained, the model will create images that resemble its training data distribution.
- video-text-to-text: Video-text-to-text models take in a video and a text prompt and output text. These models are also called video-language models.
- visual-question-answering: Visual Question Answering is the task of answering open-ended questions based on an image. These models output natural language responses to natural language questions.
- voice-activity-detection: undefined
- zero-shot-classification: Zero-shot text classification is a task in natural language processing where a model is trained on a set of labeled examples but is then able to classify new examples from previously unseen classes.
- zero-shot-image-classification: Zero-shot image classification is the task of classifying previously unseen classes during training of a model.
- zero-shot-object-detection: Zero-shot object detection is a computer vision task to detect objects and their classes in images, without any prior training or knowledge of the classes. Zero-shot object detection models receive an image as input, as well as a list of candidate classes, and output the bounding boxes and labels where the objects have been detected.
- text-to-3d: Text-to-3D models take in text input and produce 3D output.
- image-to-3d: Image-to-3D models take in image input and produce 3D output.
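For illustration only: the task value must match one of the names above exactly. A downstream consumer might validate a prediction against this controlled vocabulary with a sketch like the following; the set is abridged here, and a real check would include every task name listed.

```python
# Abridged vocabulary; a real check would include every task name above.
ALLOWED_TASKS = {
    "image-to-video", "text-to-video", "image-to-image", "text-to-image",
    "video-classification", "image-text-to-text", "other",
    # ...remaining task names from the list above
}

def validate_task(task: str) -> str:
    """Return the task unchanged if it is in the allowed vocabulary."""
    if task not in ALLOWED_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return task

validate_task("image-to-video")  # a plausible label for this model
```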
## Categories

Categories are like tags that describe the model.

Examples: audio cinematic colorization denoising diffusion facial-landmark-detection image inpainting object-segmentation ocr physics pose-estimation prompt-conditioning restoration speech-synthesis speech-to-text stabilization style-transfer super-resolution text typography upscaling vector video

## Use cases

Based on the information above, please provide a list of use cases for the model. Examples:

- Denoise audio recordings
- Colorize black-and-white photos
- Summarize long documents
- Transcribe podcasts to text
- Detect objects in images
- Generate text-to-speech audio
- Add captions to videos
- Stitch multiple videos together
- Animate a still photo
- Convert sketches to realistic images
- Generate music from text prompts
- Create 3D models from 2D images
- Translate spoken language in real time
- Convert handwriting to digital text
- Classify email as spam or not spam
- Detect plagiarism in text
- Create art from a style prompt
- Identify plants from photos
- Diagnose medical images (like X-rays)
- Predict customer churn
- Generate resumes from a set of achievements
- Fix blurry images
- Identify key moments in video footage

## Output format

Return a JSON object with the following fields:

- summary: A short summary of what the model does in 10 words or less. This should not be a sales pitch.
- inputTypes: An array of the types of inputs the model accepts. Allowable values are "text", "image", "audio", "speech", "video", "3d"
- outputTypes: An array of the types of outputs the model returns. Allowable values are "text", "image", "audio", "speech", "video", "3d"
- task: The task the model performs. This should be one of the Hugging Face task names.
- categories: An array of categories the model belongs to. Generate 5 categories for the model.
- useCases: An array of 10 use cases for the model. Each one should be a single sentence of 8 words or less.

Do not include any other text in your response. Do not explain your reasoning. Just return the JSON object. No code fencing. No markdown. No backticks. No triple backticks. No code blocks.
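For illustration only: a sketch of a response that satisfies this output format, expressed as a Python dict alongside the shape checks a consumer might apply. The field values are plausible guesses for this model, not a canonical answer.

```python
import json

# Illustrative response for this model; values are guesses that follow the
# format rules above (summary <= 10 words, 5 categories, 10 short use cases).
example_response = {
    "summary": "Generates video from an input image and prompt",
    "inputTypes": ["image", "text"],
    "outputTypes": ["video"],
    "task": "image-to-video",
    "categories": ["video", "image", "diffusion", "cinematic", "prompt-conditioning"],
    "useCases": [
        "Animate a still photo",
        "Turn artwork into short video clips",
        "Create talking-head videos from portraits",
        "Bring product photos to life",
        "Animate characters from concept art",
        "Generate cinematic motion from landscapes",
        "Make social media videos from images",
        "Visualize storyboards as moving scenes",
        "Apply LoRA effects to animated images",
        "Prototype video ideas from single frames",
    ],
}

# Shape checks mirroring the output-format requirements:
assert len(example_response["summary"].split()) <= 10
assert len(example_response["categories"]) == 5
assert len(example_response["useCases"]) == 10
assert all(len(u.split()) <= 8 for u in example_response["useCases"])
print(json.dumps(example_response))
```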