Building Publishing Content Corpora: Logical Premises, Current State, and Practical Pathways

Science-Technology & Publication

2026, Vol. 45

Issue (5): 93-102

Column: Copyright Horizons


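FAN Ye¹,²

1. Center for Studies of Intellectual Property Rights, Zhongnan University of Economics and Law, 430073, Wuhan, China
2. Faculty of Law, University of Cologne, 50923, Cologne, Germany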

Building Publishing Content Corpora for the AI Age: Logical Premises, Current Challenges, and Institutional Pathways

FAN Ye¹,²

1. Center for Studies of Intellectual Property Rights, Zhongnan University of Economics and Law, 430073, Wuhan, China
2. Faculty of Law, University of Cologne, 50674, Cologne, Germany


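Abstract:

High-quality Chinese-language corpora are in short supply. Against this backdrop, converting publishing content, which is premium corpus material, into data assets has become an important task in supporting the digital-intelligent transformation of the publishing industry and the development of the digital cultural industry. The publishing industry has so far explored three corpus-building models (independent development, large-model integration, and cooperative co-construction) and is gradually moving toward fine-grained governance. It nonetheless remains constrained by vaguely defined rights and interests, a missing standards system, and insufficient incentives for sharing. In response, first, it should be made clear that publishing entities hold the right to hold publishing content data resources, the right to process and use them, and the right to operate data products, and suitable operating models should be explored. Second, a government-industry-academia-research alliance should be formed to build a standard-setting system that is guided by government, joined by industry, and coordinated among multiple parties. Finally, rules for confirming and registering data asset rights should be established, a data trust model introduced, and a data supply incentive mechanism built, so as to activate the value cycle of publishing content data and offer a feasible path for integrating the publishing industry deeply into the development strategy of the artificial intelligence industry.

Keywords: Chinese language corpus, publishing content data, large AI models, digital-intelligent transformation of publishing, data elements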

Abstract:

Against the backdrop of a growing shortage of high-quality Chinese corpora, transforming published content into usable data assets has become critical to supporting the digital and intelligent transformation of the publishing industry, as well as the broader development of digital cultural industries. Using literature analysis, normative analysis, and case studies, this study maps current corpus development practices and diagnoses the systemic barriers impeding progress. Three primary models of corpus development have emerged in practice: independent construction, integration with large language models (LLMs), and cooperative construction. In the independent model, publishers leverage proprietary content resources to build vertical corpora. The LLM integration model connects content with external AI capabilities, while the cooperative model combines publishers' editorial resources with the technical expertise of technology companies and universities. Although these models reflect progress toward refined data governance, three core challenges persist: poorly defined licensing rights and value distribution, technical friction caused by fragmented formatting and annotation standards, and weak data-sharing incentives stemming from low trust and ambiguous revenue models.

To address these challenges, this paper proposes three integrated solutions. (1) Regarding the authorization and operation of corpus resources, the legal rights of publishing entities must be formally recognized: the right to hold data resources, the right to process and use content, and the right to operate data products. The rights to hold and process data are grounded in the legal authorization of property rights within publishing contracts, while the right to operate and profit from data products depends on the publishers' substantive processing of these resources. Publishers should also select operational models that align with their content advantages.

(2) To resolve the fragmentation of standards, a collaborative alliance involving government, industry, and research institutions should be established. This body would lead the development of a standard-setting system that is guided by government, driven by industry participation, and coordinated among multiple stakeholders. Such an approach is intended to make corpus standards foundational, practical, and widely adoptable across the industry.

(3) To facilitate data circulation and reuse, the paper outlines three specific mechanisms. The first is to establish rules for the registration and confirmation of data asset rights. These rules would provide preliminary evidence for resolving ownership disputes and serve as essential credentials for balance-sheet recognition and market trading. The second is to explore data trust models for publishing content, using informed consent and implied license rules as institutional tools for orderly sharing. Specifically, a dedicated data trust management body should be established to build "data pools", drawing on the operational experience of patent pools in the intellectual property field. The third is to build a multi-dimensional incentive system. Economic incentives should follow the contribution principle and create a profit-sharing framework that covers all stakeholders in the data value chain. Technical incentives should focus on reducing participation costs and quantifying data value through innovation. Managerial incentives should include incorporating corpus construction into national financial support programs, providing research subsidies, and implementing tax preferences for participating institutions.

Key words: Chinese language corpus, publishing content data, large AI models, digital-intelligent transformation of publishing, data elements

Published: 2026-06-15

Funding: National Social Science Fund of China project "Research on the Legal Regulation of Algorithmic Unfair Competition" (25CFX046)


Cite this article:

FAN Ye. Building Publishing Content Corpora for the AI Age: Logical Premises, Current Challenges, and Institutional Pathways. Science-Technology & Publication, 2026, 45(5): 93-102.

Link to this article:

http://kjycb.tsinghuajournals.com/CN/ or http://kjycb.tsinghuajournals.com/CN/Y2026/V45/I5/93


