Advantage Analysis and Methodological Approaches for Constructing High Quality Datasets in Publishing Industry

WANG Jun1, WANG Biao2, LI Suhang3

1. Shanghai Jiao Tong University ICCI, 200240, Shanghai, China
2. Chinese Academy of Press and Publication, 100073, Beijing, China
3. University College London, WC1E 6BT, London, UK

Abstract: Nations worldwide are actively developing and leveraging data resources. These resources exhibit economic characteristics, including externalities, non-rivalry, and non-excludability, alongside sociological attributes such as shareability, spatiotemporal relevance, and public accessibility. Data quality is a critical determinant of model performance in generative artificial intelligence (GenAI) systems, and the lack of high-quality training datasets remains a significant challenge across sectors. While previous research on data elements has focused on implementation, this study examines the underlying rationale and methodologies. This paper establishes the connotation and extension of high-quality datasets, identifying quality dimensions within a three-dimensional, six-tier analytical framework: (1) the structural dimension; (2) the spatiotemporal dimension; and (3) the security dimension. The core requirements for constructing high-quality datasets are grouped into four categories: (1) the data unit level; (2) the dataset level; (3) the social benefit perspective; and (4) the economic benefit perspective. This framework integrates technical specifications with governance principles, addressing both operational efficiency and societal value creation. The analysis examines industry-specific characteristics and resource endowments to show why the publishing sector bears a distinctive social responsibility in constructing high-quality datasets. Publishing data has four inherent advantages: (1) quantity: rich diversity of types and abundant reserves; (2) quality: rigorous supply mechanisms and strict review processes; (3) externality: traceable ownership and privacy clearance; and (4) standardization: technical support and cross-referencing capabilities. At the data unit level, publishing data undergoes comprehensive peer review and expert verification, which makes it more accurate and reliable than alternative data sources, and its broad industry coverage gives it substantial completeness and richness. At the dataset level, professional editorial teams carry out secondary knowledge production during data aggregation. They integrate technology with publishing workflows in processes such as packaging, delivery, error correction, and iterative updates, establishing sustainable version control mechanisms. In terms of benefits, publishing data is inherently desensitized and aligned with mainstream ideological values, which helps balance data protection against public accessibility. Moreover, the publishing industry's established ownership tracing and benefit distribution mechanisms provide a foundation for business evolution, supporting trust networks and incentive-compatible business models between data providers and users. From a meso-theoretical perspective, this study takes a best-practice approach, using mature image databases in the digital copyright trading industry as case studies. It analyzes principles and methodologies for constructing high-quality datasets, proposes operational and training recommendations, and aligns theory with practice. The marginal contributions of this paper are threefold: first, clarifying the scope and definition of high-quality datasets; second, analyzing the publishing industry's characteristics and advantages to identify key stakeholders; and third, recommending standards, operational principles, and construction methods for high-quality datasets.

Published: 09 July 2025
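
| Side | Level | High-quality dataset requirement | Publishing industry characteristics | Publishing resource endowments |
| --- | --- | --- | --- | --- |
| Supply side | Data unit | Fragmentation | Authoritative position in the content production ecosystem | Massive and diverse body of creators; historically accumulated resources |
| Supply side | Data unit | Structuring | Content review process | Serious researchers; rigorous review process |
| Supply side | Dataset | Standardization | Peer review system; content citation system | Professional editorial teams; academic community ecosystem |
| Supply side | Dataset | Flexibility | Content citation system; feedback system | Industry coverage; information technology support ecosystem |
| Application side | Social benefit | Openness | Knowledge gatekeeper and disseminator | Existing publishing and distribution system |
| Application side | Social benefit | Security | Academic ethics | Existing ownership tracing system |
| Application side | Economic benefit | Publicness | Knowledge gatekeeper and disseminator | Existing publishing and distribution system |
| Application side | Economic benefit | Sustainability | Business model | Existing benefit distribution system |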