Paper Title

PANDORA Talks: Personality and Demographics on Reddit

Authors

Matej Gjurković, Mladen Karan, Iva Vukojević, Mihaela Bošnjak, Jan Šnajder

Abstract

Personality and demographics are important variables in social sciences, while in NLP they can aid in interpretability and removal of societal biases. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first large-scale dataset of Reddit comments labeled with three personality models (including the well-established Big 5 model) and demographics (age, gender, and location) for more than 10k users. We showcase the usefulness of this dataset on three experiments, where we leverage the more readily available data from other personality models to predict the Big 5 traits, analyze gender classification biases arising from psycho-demographic variables, and carry out a confirmatory and exploratory analysis based on psychological theories. Finally, we present benchmark prediction models for all personality and demographic variables.
