Extracting a word list from a real corpus involves many technical and theoretical difficulties, yet it has special value. The "General Corpus" (通用语料库) is a large-scale corpus developed under the State Language Commission (国家语委), and it broadly reflects the state of modern Chinese. Extracting a word list from it is therefore significant in its process, its methods, and the resulting list. Machine word segmentation raises several issues: the correctness of the segmentation, the tolerable level of processing precision, the forced nature of machine segmentation, and its limitations. A word list derived from a real corpus clearly shows that the lexicon of a given period (断代词汇) consists of two levels, words of language (语言词) and words of speech (言语词), and that words at the two levels permeate each other. Such a word list also shows differences between written and spoken language, nonstandard usages are fairly common in it, and in terms of how general its words are it remains at a considerable distance from the lexicon of the period.