Paper Title
Task-Aware Asynchronous Multi-Task Model with Class Incremental Contrastive Learning for Surgical Scene Understanding
Paper Authors
Paper Abstract
Purpose: Surgical scene understanding with tool-tissue interaction recognition and automatic report generation can play an important role in intra-operative guidance, decision-making, and postoperative analysis in robotic surgery. However, domain shifts between different surgeries, with inter- and intra-patient variation and the appearance of novel instruments, degrade model prediction performance. Moreover, performing these tasks requires outputs from multiple models, which can be computationally expensive and affect real-time performance.

Methodology: A multi-task learning (MTL) model is proposed for surgical report generation and tool-tissue interaction prediction that deals with the domain shift problem. The model consists of a shared feature extractor, a mesh-transformer branch for captioning, and a graph-attention branch for tool-tissue interaction prediction. The shared feature extractor employs class incremental contrastive learning (CICL) to tackle intensity shift and the appearance of novel classes in the target domain. We integrate Laplacian-of-Gaussian (LoG) based curriculum learning into both the shared and task-specific branches to enhance model learning. We incorporate a task-aware asynchronous MTL optimization technique to fine-tune the shared weights and converge both tasks optimally.

Results: The proposed MTL model, trained using task-aware optimization and fine-tuning techniques, achieved balanced performance on both tasks in the target domain (a BLEU score of 0.4049 for scene captioning and an accuracy of 0.3508 for interaction detection) and performed on par with single-task models in domain adaptation.

Conclusion: The proposed multi-task model was able to adapt to domain shifts, incorporate novel instruments in the target domain, and perform tool-tissue interaction detection and report generation on par with single-task models.
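The class incremental contrastive learning in the shared feature extractor builds on a supervised contrastive objective: embeddings of same-class samples are pulled together while other classes are pushed apart. As a rough, self-contained sketch (not the paper's implementation; the function names, temperature, and data are illustrative, and the class-incremental handling of novel classes is omitted), a SupCon-style loss over labelled embeddings looks like this:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (assumes nonzero norms)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss: for each anchor, same-label samples
    are positives and everything else is a negative. A CICL setup would
    additionally treat novel target-domain classes specially; that is
    omitted in this toy sketch."""
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:  # anchors without positives contribute nothing
            continue
        sims = {j: math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
                for j in range(n) if j != i}
        denom = sum(sims.values())
        total += -sum(math.log(sims[j] / denom) for j in pos) / len(pos)
        anchors += 1
    return total / anchors
```

Well-clustered embeddings (same-class vectors nearly parallel) yield a much lower loss than embeddings where each sample sits closer to the other class, which is exactly the gradient signal that adapts the shared extractor to the shifted domain.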
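LoG-based curriculum learning orders training samples from easy to hard, using the Laplacian-of-Gaussian filter response as a proxy for visual difficulty (more edge/texture energy = harder sample). A minimal sketch under that assumption (illustrative only; kernel size, sigma, and the mean-absolute-response score are not taken from the paper):

```python
import math

def log_kernel(size=5, sigma=1.0):
    """Build a zero-mean Laplacian-of-Gaussian kernel."""
    half = size // 2
    k = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            v = -(1.0 / (math.pi * sigma ** 4)) * (1 - r2 / (2 * sigma ** 2)) \
                * math.exp(-r2 / (2 * sigma ** 2))
            row.append(v)
        k.append(row)
    mean = sum(sum(r) for r in k) / (size * size)
    # Subtract the mean so flat (textureless) regions score exactly zero.
    return [[v - mean for v in r] for r in k]

def difficulty(img, kernel):
    """Mean absolute LoG response over valid positions: higher = harder."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    total, count = 0.0, 0
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            acc = sum(kernel[a][b] * img[i + a][j + b]
                      for a in range(kh) for b in range(kw))
            total += abs(acc)
            count += 1
    return total / count

def curriculum_order(images, kernel):
    """Return sample indices sorted easy-to-hard by LoG difficulty."""
    scores = [difficulty(img, kernel) for img in images]
    return sorted(range(len(images)), key=lambda i: scores[i])
```

For example, a flat 8x8 image scores near zero and is scheduled before a high-frequency checkerboard; feeding batches in this order gives the shared and task-specific branches the easy-to-hard progression the abstract describes.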