Medical Sora: Towards A Medical World Simulator with Generalist Medical Video Generation
[Paper] [Code] [Model] [Data]
Benyou Wang Team
The Chinese University of Hong Kong, Shenzhen
Abstract: In this work, we present Medical Sora, a general-purpose video generation model designed for the medical field. Medical Sora aims to assist medical education, clinical training, and medical simulations by generating realistic medical videos. The model leverages state-of-the-art deep learning technologies and is capable of automatically generating videos of various medical scenarios, surgical procedures, diagnostic workflows, and more, covering a broad range of medical knowledge and skills. Trained on a large dataset of medical information, Medical Sora not only generates high-quality videos but also ensures the accuracy and professionalism of the medical content, enabling healthcare professionals to practice and learn repeatedly without the need for hands-on procedures.
Contents
[Video gallery: 5 clips with audio. Please turn on the sound.]
[Video gallery: 2 clips with audio and 3 clips with audio and speech. Please turn on the sound.]
Data
Main Results
VideoScore
VideoScore is an end-to-end evaluation framework for generated video, trained on the curated VideoFeedback dataset. It scores videos along five dimensions: visual quality, temporal consistency, dynamic degree, text-to-video alignment, and factual consistency. In multiple benchmark tests, VideoScore significantly outperforms baseline models such as GPT-4 and Gemini 1.5. As an automated tool, it both assesses video quality and simulates human feedback on generated videos. Although many models achieve high image quality, their videos often show noticeable distortions, so we also report Warping Error, which is determined through human assessment (lower is better).
Model | Average Score | Visual Quality ↑ | Temporal Consistency ↑ | Dynamic Degree ↑ | Text-to-Video Alignment ↑ | Factual Consistency ↑ | Warping Error ↓ |
---|---|---|---|---|---|---|---|
Open-source video generation model | |||||||
CogVideoX-2B | 2.77 | 3.35 | 3.21 | 3.36 | 3.13 | 3.09 | 88.00 |
CogVideoX-5B | 2.78 | 3.30 | 3.13 | 3.33 | 3.07 | 3.03 | 80.00 |
Wan2.1-T2V-1.3B | 2.69 | 3.11 | 2.89 | 3.37 | 2.99 | 2.86 | 77.50 |
Open Sora (v1.2) | 2.57 | 3.21 | 3.04 | 3.24 | 2.97 | 2.93 | 99.50 |
Open Sora Plan (v1.3) | 2.76 | 3.39 | 3.29 | 3.32 | 3.13 | 3.18 | 93.03 |
VideoCrafter-2 | 2.39 | 2.91 | 2.78 | 3.10 | 2.80 | 2.64 | 97.00 |
ModelScope | 2.34 | 2.86 | 2.72 | 3.10 | 2.77 | 2.58 | 100.00 |
Latte-1 | 2.48 | 3.08 | 2.87 | 3.17 | 2.91 | 2.76 | 97.50 |
Vchitect-2.0 | 2.40 | 2.94 | 2.80 | 3.14 | 2.82 | 2.67 | 99.50 |
Pyramid-Flow | 2.35 | 2.87 | 2.74 | 3.08 | 2.76 | 2.60 | 98.50 |
Allegro | 2.62 | 3.28 | 3.12 | 3.17 | 2.82 | 3.06 | 93.50 |
Mochi-1-preview | 2.54 | 3.03 | 2.84 | 3.13 | 2.84 | 2.73 | 83.50 |
LTX-Video | 2.41 | 3.01 | 2.83 | 3.10 | 2.83 | 2.71 | 100.00 |
HunyuanVideo | 2.85 | 3.16 | 3.01 | 3.02 | 2.82 | 2.91 | 45.00 |
Wan2.1-T2V-14B | 2.78 | 2.99 | 2.77 | 3.21 | 2.93 | 2.68 | 50.00 |
Commercial video generation model | |||||||
Hailuo (video-01) | 2.73 | 2.99 | 2.94 | 2.71 | 2.63 | 2.78 | 42.27 |
Pika (v2.0) | 2.80 | 2.57 | 2.85 | 2.33 | 2.53 | 2.63 | 42.29 |
Sora | 3.58 | 3.97 | 3.85 | 3.75 | 3.24 | 3.82 | 28.40 |
Our model | |||||||
Medical Sora (ours) | - | - | - | - | - | - | - |
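The page does not state how the Average Score column is computed. The sketch below is an inference from the table, not a documented formula. It assumes Warping Error is rescaled to a 0 to 4 "higher is better" term and averaged with the five VideoScore dimensions. The function name `videoscore_average` and the rescaling are assumptions. The formula reproduces most reported averages but not every row (for example, Pika (v2.0)), so treat it as a guess.

```python
def videoscore_average(dimension_scores, warping_error_pct):
    """Assumed formula: mean of the five dimension scores and a warping term.

    The warping term maps Warping Error (0-100, lower is better) onto a
    0-4 scale where higher is better. This mapping is inferred from the
    table and is NOT confirmed by the authors.
    """
    warping_term = 4.0 * (1.0 - warping_error_pct / 100.0)
    return (sum(dimension_scores) + warping_term) / (len(dimension_scores) + 1)


# Rows copied from the table: five dimension scores, then Warping Error.
rows = {
    "CogVideoX-2B": ([3.35, 3.21, 3.36, 3.13, 3.09], 88.00),
    "HunyuanVideo": ([3.16, 3.01, 3.02, 2.82, 2.91], 45.00),
    "Sora": ([3.97, 3.85, 3.75, 3.24, 3.82], 28.40),
}

for name, (scores, warping_error) in rows.items():
    print(name, round(videoscore_average(scores, warping_error), 2))
```

For these three rows the computed values agree with the reported averages (2.77, 2.85, 3.58) to two decimals.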
VBench
To evaluate text-to-video generation, we adopt several metrics from VBench that align with human perception: (1) Subject Consistency, the consistency of the subject's appearance across frames; (2) Background Consistency, the temporal consistency of the background; (3) Motion Smoothness, the smoothness of the generated motion; (4) Dynamic Degree, whether the video contains large-scale movement; (5) Aesthetic Quality, the aesthetic appeal of the video; and (6) Imaging Quality, the overall imaging quality of the video. Although many models achieve high image quality, their videos often show noticeable distortions, so we also report Warping Error, which is determined through human assessment. Because medical videos usually lack aesthetic appeal, Aesthetic Quality is reported but omitted from the Average Score.
Model | Average Score | Subject Consistency ↑ | Background Consistency ↑ | Motion Smoothness ↑ | Dynamic Degree ↑ | Aesthetic Quality ↑ | Imaging Quality ↑ | Warping Error ↓ |
---|---|---|---|---|---|---|---|---|
Open-source video generation model | ||||||||
CogVideoX-2B | 71.97 | 94.03 | 95.01 | 98.04 | 71.50 | 44.17 | 61.22 | 88.00 |
CogVideoX-5B | 71.20 | 93.92 | 96.04 | 98.13 | 63.00 | 42.61 | 56.10 | 80.00 |
Wan2.1-T2V-1.3B | 72.43 | 90.48 | 92.75 | 98.03 | 77.00 | 44.34 | 53.81 | 77.50 |
Open Sora (v1.2) | 63.77 | 95.40 | 96.46 | 99.40 | 26.00 | 46.25 | 64.88 | 99.50 |
Open Sora Plan (v1.3) | 65.06 | 97.08 | 97.38 | 99.33 | 22.00 | 45.96 | 67.60 | 93.03 |
VideoCrafter-2 | 65.27 | 98.13 | 98.42 | 98.53 | 23.00 | 54.48 | 70.52 | 97.00 |
ModelScope | 65.86 | 92.99 | 96.24 | 96.43 | 45.50 | 42.20 | 63.99 | 100.00 |
Latte-1 | 68.52 | 96.41 | 96.75 | 98.09 | 47.50 | 49.39 | 69.87 | 97.50 |
Vchitect-2.0 | 69.83 | 88.07 | 93.50 | 93.89 | 80.00 | 42.01 | 63.02 | 99.50 |
Pyramid-Flow | 66.85 | 91.54 | 94.75 | 99.38 | 46.50 | 45.71 | 67.41 | 98.50 |
Allegro | 69.23 | 92.96 | 95.16 | 99.01 | 49.50 | 47.60 | 72.22 | 93.50 |
Mochi-1-preview | 72.76 | 92.51 | 94.60 | 99.08 | 77.50 | 44.33 | 56.38 | 83.50 |
LTX-Video | 59.95 | 97.44 | 95.82 | 99.60 | 4.97 | 43.62 | 61.85 | 100.00 |
HunyuanVideo | 78.39 | 91.47 | 95.82 | 99.20 | 63.00 | 48.67 | 65.85 | 45.00 |
Wan2.1-T2V-14B | 77.13 | 91.86 | 94.13 | 98.51 | 72.00 | 47.43 | 56.27 | 50.00 |
Commercial video generation model | ||||||||
Hailuo (video-01) | 79.54 | 94.55 | 95.24 | 99.35 | 60.51 | 48.61 | 69.84 | 42.27 |
Pika (v2.0) | 73.11 | 97.59 | 97.26 | 99.57 | 16.92 | 53.72 | 69.62 | 42.29 |
Sora | 85.91 | 93.40 | 95.36 | 99.21 | 83.95 | 51.23 | 71.96 | 28.40 |
Our model | ||||||||
Medical Sora (ours) | - | - | - | - | - | - | - | - |
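The text says Aesthetic Quality is left out of the Average Score but does not give the formula. A minimal sketch of one reading that is consistent with the rows above: average the five remaining VBench metrics together with `100 - Warping Error`. The function name `vbench_average`, the dictionary keys, and the inversion of Warping Error are assumptions inferred from the table.

```python
def vbench_average(metrics):
    """Assumed formula: mean of five VBench metrics and (100 - warping error).

    `metrics` maps metric names to percentages. Aesthetic Quality is ignored,
    as the text states; treating Warping Error as (100 - value) is an
    inference from the reported numbers, not a documented rule.
    """
    kept = ["subject", "background", "motion", "dynamic", "imaging"]
    terms = [metrics[key] for key in kept]
    terms.append(100.0 - metrics["warping_error"])
    return sum(terms) / len(terms)


# Row copied from the table above (CogVideoX-2B).
cogvideox_2b = {
    "subject": 94.03, "background": 95.01, "motion": 98.04,
    "dynamic": 71.50, "aesthetic": 44.17, "imaging": 61.22,
    "warping_error": 88.00,
}

print(round(vbench_average(cogvideox_2b), 2))
```

Under this assumption the CogVideoX-2B row evaluates to 71.97, the value reported in the table.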
Comparison
[Video comparison grid: 7 rows of clips, each row showing outputs from the same 18 models: Allegro, CogVideoX-2B, CogVideoX-5B, Latte-1, LTX-Video, Mochi-1-Preview, ModelScope, Open-Sora (v1.2), Open-Sora-Plan (v1.3), Pyramid-Flow, Vchitect, VideoCrafter-2, HunyuanVideo, Hailuo (video-01), Pika (v2.0), Wan2.1-1.3B, Wan2.1-14B, Sora.]
Empirical Studies
Synthetic Medical Video Data for Enhancing Supervised Models
Medical Video Classification
Training Data | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) |
---|---|---|---|---|
D_ori | 66.05 | 69.44 | 67.18 | 65.35 |
D_hunyuan_gen | 66.05 | 70.46 | 67.33 | 65.00 |
D_medsora_gen | - | - | - | - |
D_ori + D_hunyuan_gen | 69.75 | 70.91 | 70.36 | 69.66 |
D_ori + D_medsora_gen | - | - | - | - |
Medical Surgical Operations and Disease Classification
Training Data | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) |
---|---|---|---|---|
D_ori | 28.57 | 9.52 | 33.33 | 14.81 |
D_hunyuan_gen | 14.29 | 5.56 | 16.67 | 8.33 |
D_medsora_gen | - | - | - | - |
D_ori + D_hunyuan_gen | 42.86 | 33.33 | 38.89 | 35.71 |
D_ori + D_medsora_gen | - | - | - | - |
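The page does not say how precision, recall, and F1 are averaged across classes. A minimal sketch of macro averaging, one common choice, is given below. The labels are hypothetical and are not the authors' test set. They were chosen only because a 3-class, 7-sample split in which every sample is predicted as one class yields the same numbers as the first row of the table above (28.57 / 9.52 / 33.33 / 14.81). That agreement is an observation and does not confirm the evaluation setup.

```python
def macro_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1) as fractions.

    A class with no predicted or no true samples contributes 0 to the
    corresponding average (the usual zero-division convention).
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    n = len(labels)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical labels: 7 samples, 3 classes, every sample predicted as "a".
y_true = ["a", "a", "b", "b", "b", "c", "c"]
y_pred = ["a"] * 7

print([round(100 * value, 2) for value in macro_metrics(y_true, y_pred)])
```

With so few test samples, a single changed prediction moves accuracy by about 14 points, so differences between rows in such a table carry limited statistical weight.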
Training Data | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) |
---|---|---|---|---|
D_ori | 55.56 | 37.50 | 37.50 | 33.33 |
D_hunyuan_gen | 11.11 | 2.78 | 25.00 | 5.00 |
D_medsora_gen | - | - | - | - |
D_ori + D_hunyuan_gen | 55.56 | 26.79 | 50.00 | 34.85 |
D_ori + D_medsora_gen | - | - | - | - |
Patient Portrait Generation Driven by Patient Health Record
Video 1
Video 2
Video 3
Video 4
Video 5
Citation
If you found our work useful in your research, please consider starring ⭐ us on GitHub and citing 📚 us in your research!
bib is here.