🔊 The work is still under construction.

Medical Sora: Towards A Medical World Simulator with Generalist Medical Video Generation

[Paper] [Code] [Model] [Data]


Benyou Wang Team

The Chinese University of Hong Kong, Shenzhen

Abstract: In this work, we present Medical Sora, a general-purpose video generation model designed for the medical field. Medical Sora aims to support medical education, clinical training, and medical simulation by generating realistic medical videos. The model is built on state-of-the-art deep learning techniques and can automatically generate videos of medical scenarios, surgical procedures, diagnostic workflows, and more, covering a broad range of medical knowledge and skills. Trained on a large dataset of medical information, Medical Sora is designed to produce high-quality videos while preserving the accuracy and professionalism of the medical content, so that healthcare professionals can practice and learn repeatedly without the need for hands-on procedures.

Contents

[Video samples: five clips with audio, followed by two clips with audio and three clips with audio and speech. Please turn on the sound.]

Data

Main Results

VideoScore

VideoScore is an end-to-end video generation evaluation framework trained on the carefully curated VideoFeedback dataset. It scores videos along five dimensions: visual quality, temporal consistency, dynamic degree, text-to-video alignment, and factual consistency. In multiple benchmark tests, VideoScore significantly outperforms baseline models such as GPT-4 and Gemini 1.5. As an automated evaluation tool, it both assesses video quality and approximates human feedback on generated videos. Many models achieve high image quality yet still produce noticeable distortions, so we additionally report Warping Error, which is determined through human assessment.

| Model | Average Score | Visual Quality ↑ | Temporal Consistency ↑ | Dynamic Degree ↑ | Text-to-Video Alignment ↑ | Factual Consistency ↑ | Warping Error ↓ |
|---|---|---|---|---|---|---|---|
| Open-source video generation models | | | | | | | |
| CogVideoX-2B | 2.77 | 3.35 | 3.21 | 3.36 | 3.13 | 3.09 | 88.00 |
| CogVideoX-5B | 2.78 | 3.30 | 3.13 | 3.33 | 3.07 | 3.03 | 80.00 |
| Wan2.1-T2V-1.3B | 2.69 | 3.11 | 2.89 | 3.37 | 2.99 | 2.86 | 77.50 |
| Open Sora (v1.2) | 2.57 | 3.21 | 3.04 | 3.24 | 2.97 | 2.93 | 99.50 |
| Open Sora Plan (v1.3) | 2.76 | 3.39 | 3.29 | 3.32 | 3.13 | 3.18 | 93.03 |
| VideoCrafter-2 | 2.39 | 2.91 | 2.78 | 3.10 | 2.80 | 2.64 | 97.00 |
| ModelScope | 2.34 | 2.86 | 2.72 | 3.10 | 2.77 | 2.58 | 100.00 |
| Latte-1 | 2.48 | 3.08 | 2.87 | 3.17 | 2.91 | 2.76 | 97.50 |
| Vchitect-2.0 | 2.40 | 2.94 | 2.80 | 3.14 | 2.82 | 2.67 | 99.50 |
| Pyramid-Flow | 2.35 | 2.87 | 2.74 | 3.08 | 2.76 | 2.60 | 98.50 |
| Allegro | 2.62 | 3.28 | 3.12 | 3.17 | 2.82 | 3.06 | 93.50 |
| Mochi-1-preview | 2.54 | 3.03 | 2.84 | 3.13 | 2.84 | 2.73 | 83.50 |
| LTX-Video | 2.41 | 3.01 | 2.83 | 3.10 | 2.83 | 2.71 | 100.00 |
| HunyuanVideo | 2.85 | 3.16 | 3.01 | 3.02 | 2.82 | 2.91 | 45.00 |
| Wan2.1-T2V-14B | 2.78 | 2.99 | 2.77 | 3.21 | 2.93 | 2.68 | 50.00 |
| Commercial video generation models | | | | | | | |
| Hailuo (video-01) | 2.73 | 2.99 | 2.94 | 2.71 | 2.63 | 2.78 | 42.27 |
| Pika (v2.0) | 2.80 | 2.57 | 2.85 | 2.33 | 2.53 | 2.63 | 42.29 |
| Sora | 3.58 | 3.97 | 3.85 | 3.75 | 3.24 | 3.82 | 28.40 |
| Our model | | | | | | | |
| Medical Sora (ours) | - | - | - | - | - | - | - |
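
The page does not say how the Average Score column is computed. The sketch below is an inference from the numbers in the table above, not the authors' stated formula: it assumes Warping Error is on a 0-100 scale, rescales `100 - warping_error` to the 0-4 range, and averages it with the five VideoScore dimensions. The function name `videoscore_average` is hypothetical. Under this assumption most rows are reproduced to two decimals, but the Wan2.1-T2V-14B and Pika (v2.0) rows come out as 2.76 and 2.54 rather than the listed 2.78 and 2.80, so the formula should be treated as a plausible reading only.

```python
def videoscore_average(dimension_scores, warping_error):
    """Assumed aggregation: mean of the five dimension scores and a
    warping term, where the warping term is (100 - error) mapped to 0-4."""
    warping_term = (100.0 - warping_error) / 100.0 * 4.0
    return (sum(dimension_scores) + warping_term) / (len(dimension_scores) + 1)


# Rows copied from the table above:
# (visual, temporal, dynamic, alignment, factual), warping error
ROWS = {
    "CogVideoX-2B": ([3.35, 3.21, 3.36, 3.13, 3.09], 88.00),
    "HunyuanVideo": ([3.16, 3.01, 3.02, 2.82, 2.91], 45.00),
    "Sora": ([3.97, 3.85, 3.75, 3.24, 3.82], 28.40),
}

for name, (scores, warping_error) in ROWS.items():
    print(f"{name}: {videoscore_average(scores, warping_error):.2f}")
```

Folding Warping Error in as `100 - error` keeps every term "higher is better", which is why a model with strong per-dimension scores but heavy distortion can still receive a low average.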

VBench

To evaluate text-to-video generation, we adopt several metrics from VBench that align with human perception: (1) Subject Consistency, the consistency of the subject's appearance across the video; (2) Background Consistency, the temporal consistency of the background; (3) Motion Smoothness, the smoothness of the generated motion; (4) Dynamic Degree, whether the video contains large-scale movement; (5) Aesthetic Quality, the aesthetic quality of the video; and (6) Imaging Quality, the overall imaging quality of the video. Many models achieve high image quality yet still produce noticeable distortions, so we also report Warping Error, which is determined through human assessment. Because medical videos usually lack aesthetic appeal, Aesthetic Quality is reported but omitted from the average score.

| Model | Average Score | Subject Consistency ↑ | Background Consistency ↑ | Motion Smoothness ↑ | Dynamic Degree ↑ | Aesthetic Quality ↑ | Imaging Quality ↑ | Warping Error ↓ |
|---|---|---|---|---|---|---|---|---|
| Open-source video generation models | | | | | | | | |
| CogVideoX-2B | 71.97 | 94.03 | 95.01 | 98.04 | 71.50 | 44.17 | 61.22 | 88.00 |
| CogVideoX-5B | 71.20 | 93.92 | 96.04 | 98.13 | 63.00 | 42.61 | 56.10 | 80.00 |
| Wan2.1-T2V-1.3B | 72.43 | 90.48 | 92.75 | 98.03 | 77.00 | 44.34 | 53.81 | 77.50 |
| Open Sora (v1.2) | 63.77 | 95.40 | 96.46 | 99.40 | 26.00 | 46.25 | 64.88 | 99.50 |
| Open Sora Plan (v1.3) | 65.06 | 97.08 | 97.38 | 99.33 | 22.00 | 45.96 | 67.60 | 93.03 |
| VideoCrafter-2 | 65.27 | 98.13 | 98.42 | 98.53 | 23.00 | 54.48 | 70.52 | 97.00 |
| ModelScope | 65.86 | 92.99 | 96.24 | 96.43 | 45.50 | 42.20 | 63.99 | 100.00 |
| Latte-1 | 68.52 | 96.41 | 96.75 | 98.09 | 47.50 | 49.39 | 69.87 | 97.50 |
| Vchitect-2.0 | 69.83 | 88.07 | 93.50 | 93.89 | 80.00 | 42.01 | 63.02 | 99.50 |
| Pyramid-Flow | 66.85 | 91.54 | 94.75 | 99.38 | 46.50 | 45.71 | 67.41 | 98.50 |
| Allegro | 69.23 | 92.96 | 95.16 | 99.01 | 49.50 | 47.60 | 72.22 | 93.50 |
| Mochi-1-preview | 72.76 | 92.51 | 94.60 | 99.08 | 77.50 | 44.33 | 56.38 | 83.50 |
| LTX-Video | 59.95 | 97.44 | 95.82 | 99.60 | 4.97 | 43.62 | 61.85 | 100.00 |
| HunyuanVideo | 78.39 | 91.47 | 95.82 | 99.20 | 63.00 | 48.67 | 65.85 | 45.00 |
| Wan2.1-T2V-14B | 77.13 | 91.86 | 94.13 | 98.51 | 72.00 | 47.43 | 56.27 | 50.00 |
| Commercial video generation models | | | | | | | | |
| Hailuo (video-01) | 79.54 | 94.55 | 95.24 | 99.35 | 60.51 | 48.61 | 69.84 | 42.27 |
| Pika (v2.0) | 73.11 | 97.59 | 97.26 | 99.57 | 16.92 | 53.72 | 69.62 | 42.29 |
| Sora | 85.91 | 93.40 | 95.36 | 99.21 | 83.95 | 51.23 | 71.96 | 28.40 |
| Our model | | | | | | | | |
| Medical Sora (ours) | - | - | - | - | - | - | - | - |
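
The text states that Aesthetic Quality is left out of the average, but not how Warping Error enters it. As a minimal sketch, assuming Warping Error is a 0-100 value folded in as `100 - warping_error` alongside the five remaining VBench dimensions, the Average Score values in the table above can be recomputed as follows. This aggregation is inferred from the reported numbers rather than stated by the authors, and the function and dictionary keys are hypothetical names.

```python
def vbench_average(row):
    """Assumed aggregation: mean of five VBench dimensions (Aesthetic
    Quality excluded) and the complement of Warping Error."""
    kept = [
        row["subject_consistency"],
        row["background_consistency"],
        row["motion_smoothness"],
        row["dynamic_degree"],
        row["imaging_quality"],
    ]
    return (sum(kept) + (100.0 - row["warping_error"])) / (len(kept) + 1)


# Two rows copied from the table above.
ROWS = {
    "CogVideoX-2B": {
        "subject_consistency": 94.03, "background_consistency": 95.01,
        "motion_smoothness": 98.04, "dynamic_degree": 71.50,
        "aesthetic_quality": 44.17, "imaging_quality": 61.22,
        "warping_error": 88.00,
    },
    "Sora": {
        "subject_consistency": 93.40, "background_consistency": 95.36,
        "motion_smoothness": 99.21, "dynamic_degree": 83.95,
        "aesthetic_quality": 51.23, "imaging_quality": 71.96,
        "warping_error": 28.40,
    },
}

for name, row in ROWS.items():
    print(f"{name}: {vbench_average(row):.2f}")
```

Because Dynamic Degree and Warping Error vary far more across models than the consistency and smoothness scores, these two terms drive most of the spread in the average under this reading.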

Comparison

Models compared (one generated video per model for each prompt below): Allegro, CogVideoX-2B, CogVideoX-5B, Latte-1, LTX-Video, Mochi-1-Preview, ModelScope, Open-Sora (v1.2), Open-Sora-Plan (v1.3), Pyramid-Flow, Vchitect, VideoCrafter-2, HunyuanVideo, Hailuo (video-01), Pika (v2.0), Wan2.1-1.3B, Wan2.1-14B, Sora.
The short video depicts a medical procedure being performed in a clinical setting. The scene is set in a treatment room equipped with medical supplies and equipment. Two healthcare professionals, dressed in blue scrubs and wearing gloves, are attending to a patient who is lying face down on a treatment table. The patient has visible skin lesions or moles on their back, which are the focus of the procedure. The healthcare professionals are engaged in a detailed process involving the patient's back. One of them is seen handling a syringe, likely preparing to administer an injection or perform a biopsy. The other professional assists by holding a piece of gauze or cloth, possibly to manage any bleeding or to apply pressure to the treated area. The procedure appears to be meticulous and involves careful handling of medical instruments and materials. The healthcare professionals work in a coordinated manner, ensuring that the patient's back is properly treated. The video captures the steps involved in the medical procedure, highlighting the precision and care taken by the medical staff to address the patient's skin condition.
The short video depicts a scene inside a medical facility, likely a hospital or a specialized clinic. The setting is a sterile, well-equipped room with various medical instruments and equipment. The focal point of the video is a complex medical device, possibly a heart-lung machine, used in cardiac surgeries. This device is covered with numerous tubes and wires, indicating its critical role in supporting the patient's circulatory and respiratory functions during surgery. Two medical professionals, dressed in surgical scrubs and masks, are seen working with the machine. One of them is actively adjusting or monitoring the device, while the other observes and possibly assists. The room is filled with various medical supplies, monitors, and other equipment essential for the procedure. The monitors display vital signs and other critical information, ensuring that the patient's condition is continuously monitored. The overall atmosphere is one of precision and care, highlighting the complexity and critical nature of the medical procedure being performed. The presence of advanced medical technology and the focused efforts of the medical staff underscore the seriousness and professionalism of the setting.
The short video depicts a clinical setting where a healthcare professional is attending to a patient's knee. The patient is lying on an examination table, with their leg elevated and supported. The healthcare professional, dressed in a blue sweater and dark pants, is seen manipulating the patient's foot and ankle, likely to assess or treat a condition related to the knee. The professional is also handling a small medical device, possibly a laser or similar equipment, which is positioned near the patient's knee. The room is equipped with various medical and exercise equipment, indicating a rehabilitation or physical therapy environment. The focus is on the hands-on examination and treatment of the patient's knee, highlighting the careful and precise nature of the medical care being provided.
The short video features a person seated in a chair, holding a model of a human heart. The individual is dressed in a black shirt and appears to be explaining or discussing aspects of heart anatomy or health. The setting is a calm, indoor environment with a neutral-colored wall, a potted plant on a table, and a piece of abstract art hanging on the wall. The person's gestures and the presence of the heart model suggest that the video is educational, likely focusing on topics related to heart health, anatomy, or medical conditions affecting the heart. The overall tone is informative and professional.
The short video depicts a medical procedure involving the administration of an injection. The scene is focused on a specific area of the skin, which appears to be a localized swelling or lesion. The individual performing the procedure is wearing blue medical gloves, ensuring a sterile environment. A syringe is being used to inject a substance into the affected area. The needle is carefully inserted into the skin, and the syringe is gradually filled with a liquid, likely a medication or anesthetic. The procedure is conducted with precision, and the area around the injection site shows signs of inflammation or infection, indicated by redness and swelling. Throughout the video, the syringe is steadily inserted into the lesion, and the liquid is slowly administered. The close-up shots emphasize the careful technique and the medical nature of the procedure, highlighting the importance of precision and care in handling such conditions.
The short video appears to depict a medical procedure involving the use of ultrasound imaging. The video shows a close-up of an ultrasound monitor displaying real-time imaging of internal body structures. The monitor is equipped with various settings and parameters, such as frequency (12.5 MHz), gain, and depth, which are adjusted to capture clear images. The ultrasound images reveal layers of tissue, likely muscles or other soft tissues, as indicated by the striated patterns visible on the screen. The video does not show any external medical equipment or personnel, focusing solely on the ultrasound screen and the images it displays. The setting appears to be a clinical environment, possibly a hospital or a medical facility, as suggested by the presence of medical equipment and the professional setup. The video does not provide any additional context or explanation beyond the visual display of the ultrasound images.
The short video depicts a surgical procedure focused on the eye, specifically involving the scleral region. The scene is highly magnified, showing a close-up of the eye with various surgical instruments in use. The sclera, or white part of the eye, appears to be the primary area of focus, with visible blood and incisions. Throughout the video, the surgeon is seen making precise incisions and manipulating the scleral tissue. The surgical instruments include forceps and scissors, which are used to hold and cut through the tissue. The scleral pocket is being dissected, and relaxing incisions are being placed on either edge of the incision site. The surgeon carefully manages the tissue, ensuring that the pocket is adequately formed and the incisions are correctly placed. The video provides a detailed view of the meticulous steps involved in this surgical technique, highlighting the precision and care required in such procedures. The presence of blood and the use of specialized surgical tools are evident, emphasizing the complexity and technical nature of the operation.

Empirical Studies

Synthetic Medical Video Data for Enhancing Supervised Models

Medical Video Classification

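| Training data | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| D_ori | 66.05 | 69.44 | 67.18 | 65.35 |
| D_hunyuan_gen | 66.05 | 70.46 | 67.33 | 65.00 |
| D_medsora_gen | - | - | - | - |
| D_ori + D_hunyuan_gen | 69.75 | 70.91 | 70.36 | 69.66 |
| D_ori + D_medsora_gen | - | - | - | - |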

Medical Surgical Operations and Disease Classification

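| Training data | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| D_ori | 28.57 | 9.52 | 33.33 | 14.81 |
| D_hunyuan_gen | 14.29 | 5.56 | 16.67 | 8.33 |
| D_medsora_gen | - | - | - | - |
| D_ori + D_hunyuan_gen | 42.86 | 33.33 | 38.89 | 35.71 |
| D_ori + D_medsora_gen | - | - | - | - |

| Training data | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| D_ori | 55.56 | 37.50 | 37.50 | 33.33 |
| D_hunyuan_gen | 11.11 | 2.78 | 25.00 | 5.00 |
| D_medsora_gen | - | - | - | - |
| D_ori + D_hunyuan_gen | 55.56 | 26.79 | 50.00 | 34.85 |
| D_ori + D_medsora_gen | - | - | - | - |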
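
The page does not state how precision, recall, and F1-score are averaged across classes. The sketch below assumes macro averaging (an unweighted mean of per-class scores, with undefined ratios counted as 0) and uses hypothetical toy labels chosen only so that the arithmetic can be checked by hand. With 7 samples over 3 classes and a classifier that always predicts one class, macro averaging gives 28.57 / 9.52 / 33.33 / 14.81, which coincides with the original-data-only row of the first table under this heading. That is consistent with macro averaging on a very small test set, but it is an inference and not something the page confirms.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy and macro-averaged precision, recall, F1, in percent.
    Ratios with a zero denominator are counted as 0."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    k = len(labels)
    return {
        "accuracy": 100 * accuracy,
        "precision": 100 * sum(precisions) / k,
        "recall": 100 * sum(recalls) / k,
        "f1": 100 * sum(f1s) / k,
    }


# Hypothetical labels: 3 classes, 7 samples, constant prediction of class 0.
y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0] * 7
print({name: round(value, 2) for name, value in macro_metrics(y_true, y_pred).items()})
```

On test sets this small, a single changed prediction moves accuracy by more than ten points, so differences between rows should be read with that granularity in mind.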

Patient Portrait Generation Driven by Patient Health Record

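Video 1
Pure white background, a full-body view of a man with a stomachache; his face shows that he is in great discomfort.
Video 2
Pure white background, a full-body view of a pregnant woman with a stomachache; her face shows that she is in great discomfort.
Video 3
Pure white background, a half-body view of an elderly woman with a headache; she touches her head with one hand and looks very uncomfortable.
Video 4
Pure white background, a young person coughing, with a slightly flushed face.
Video 5
Pure white background, a baby with many red pimples on the face.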
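
The five prompts above share a pattern: a pure white background, an optional framing (full-body or half-body), a subject, a symptom, and an optional visible sign. As a minimal sketch of how such prompts could be assembled from structured health-record fields, the code below uses a hypothetical `PatientRecord` type and `build_portrait_prompt` helper. The field names are assumptions for illustration, and the page does not describe how the authors actually derive prompts from records.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    # Hypothetical fields; a real health record would need mapping onto these.
    subject: str          # e.g. "an elderly woman"
    symptom: str          # e.g. "with a headache"
    framing: str = ""     # e.g. "full-body" or "half-body"; optional
    appearance: str = ""  # visible sign, e.g. "face slightly flushed"; optional


def build_portrait_prompt(record: PatientRecord) -> str:
    """Compose a text-to-video prompt following the pattern of the examples."""
    parts = ["pure white background"]
    if record.framing:
        parts.append(f"{record.framing} view of {record.subject}")
    else:
        parts.append(record.subject)
    parts.append(record.symptom)
    if record.appearance:
        parts.append(record.appearance)
    return ", ".join(parts)


record = PatientRecord(
    subject="a young person",
    symptom="coughing",
    appearance="face slightly flushed",
)
print(build_portrait_prompt(record))
```

Keeping the background and framing fixed in the template makes the generated portraits comparable across patients, so that only the record-driven fields change between prompts.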

Citation

If you find our work useful in your research, please consider starring ⭐ us on GitHub and citing 📚 us in your research!


	bib is here.
	

Acknowledgement

This website is adapted from FunAudioLLM.