Paper Title

TCBERT: A Technical Report for Chinese Topic Classification BERT

Authors

Ting Han, Kunhao Pan, Xinyu Chen, Dingjie Song, Yuchen Fan, Xinyu Gao, Ruyi Gan, Jiaxing Zhang

Abstract

Bidirectional Encoder Representations from Transformers, or BERT~\cite{devlin-etal-2019-bert}, has been one of the base models for various NLP tasks due to its remarkable performance. Variants customized for different languages and tasks have been proposed to further improve performance. In this work, we investigate supervised continued pre-training~\cite{gururangan-etal-2020-dont} on BERT for the Chinese topic classification task. Specifically, we incorporate prompt-based learning and contrastive learning into the pre-training. To adapt to the task of Chinese topic classification, we collect around 2.1M Chinese examples spanning various topics. The pre-trained Chinese Topic Classification BERTs (TCBERTs) with different parameter sizes are open-sourced at \url{https://huggingface.co/IDEA-CCNL}.
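To make the prompt-based setup concrete, below is a minimal sketch (not the authors' released code) of how a TCBERT checkpoint could be used for zero-shot topic classification with a masked-LM prompt. The model ID and the prompt template are assumptions for illustration; consult the IDEA-CCNL page on Hugging Face for the actual released checkpoints and templates.

```python
# Minimal sketch: prompt-based topic classification with a masked LM.
# Assumptions (not from the paper): the model ID below and the prompt
# template "这是一则关于...的新闻:" are illustrative placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "IDEA-CCNL/Erlangshen-TCBert-110M-Classification-Chinese"  # assumed ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

labels = ["体育", "财经", "科技"]  # candidate topic words (same length here)
text = "昨晚的比赛中主队以三比一获胜。"

# Score each candidate label by the masked-LM log-probability of its
# characters in the masked label slot of the prompt.
best_label, best_score = None, float("-inf")
for label in labels:
    prompt = f"这是一则关于{tokenizer.mask_token * len(label)}的新闻:{text}"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    label_ids = tokenizer.convert_tokens_to_ids(list(label))
    log_probs = logits[0, mask_positions].log_softmax(dim=-1)
    score = sum(log_probs[i, tid].item() for i, tid in enumerate(label_ids))
    if score > best_score:
        best_label, best_score = label, score

print(best_label)  # expected: 体育
```

Note that summing per-character log-probabilities only compares fairly across labels of equal length, as in this example; for mixed-length label sets, a length-normalized score would be the natural adjustment.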
