In a networked environment, helping users quickly discover the data they need is a long-standing challenge for geoscience data sharing platforms. This paper extracts user search behavior and dataset access behavior from the web server logs of the National Earth System Science Data Sharing Platform, mines user behavior patterns with a clustering algorithm, and develops online search and access prediction algorithms based on the resulting session clusters. In the data preprocessing stage, the raw server logs are cleaned, and users, user sessions, and search terms are identified and extracted. In the pattern mining stage, sessions are clustered with the DBSCAN algorithm. Because session vectors are binary, the distance used in clustering is computed with the Jaccard distance function. The set of search terms in each session cluster is treated as one document and the set of all users' historical search terms as the corpus, and the TF-IDF value of each search term is computed for each cluster. For online search recommendation, the search term is looked up in the TF-IDF values of every cluster, the cluster in which the term has the highest TF-IDF value is returned, and the high-frequency items of that cluster are given as recommendations. For online access recommendation, the user's real-time access vector serves as the query vector, and its distance to each cluster center is computed. The clusters are ranked by this distance, the nearest cluster is selected, and its high-frequency items are given as recommendations. Experimental results show that search recommendation based on TF-IDF and clustering achieves high accuracy and recall, and that access recommendation improves considerably over recommendation based on high-frequency statistics.
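The session clustering and access recommendation steps can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each session is represented as the set of dataset IDs it accessed (equivalent to a binary vector), uses a small pure-Python DBSCAN for self-containment, and takes a cluster "center" to be the items that occur in at least half of the cluster's sessions. The dataset names, `eps`, `min_pts`, and the function names are all hypothetical.

```python
from collections import Counter, deque


def jaccard_distance(a, b):
    """Jaccard distance 1 - |a & b| / |a | b| between two sets of dataset IDs."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dbscan(sessions, eps, min_pts):
    """Minimal DBSCAN over sets; returns one label per session (-1 means noise)."""
    labels = [None] * len(sessions)

    def neighbors(i):
        # All sessions within eps of session i (including i itself).
        return [j for j, s in enumerate(sessions)
                if jaccard_distance(sessions[i], s) <= eps]

    cluster = -1
    for i in range(len(sessions)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, keep expanding
                queue.extend(nbrs)
    return labels


def recommend_by_access(current, sessions, labels, top_n=3):
    """Recommend frequent items of the cluster whose center is nearest to `current`."""
    clusters = {}
    for s, lab in zip(sessions, labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(s)
    if not clusters:
        return []

    def center(members):
        # Assumption: the center holds items seen in at least half the sessions.
        counts = Counter(item for s in members for item in s)
        return {item for item, c in counts.items() if 2 * c >= len(members)}

    best = min(clusters,
               key=lambda lab: jaccard_distance(current, center(clusters[lab])))
    counts = Counter(item for s in clusters[best] for item in s)
    return [item for item, _ in counts.most_common()
            if item not in current][:top_n]


# Hypothetical sessions: each is the set of datasets accessed in one visit.
sessions = [
    {"soil_map", "landuse_2000"},
    {"soil_map", "landuse_2000", "dem_90m"},
    {"soil_map", "landuse_2000"},
    {"rainfall_daily", "temperature_daily"},
    {"rainfall_daily", "temperature_daily", "evaporation"},
    {"rainfall_daily", "temperature_daily"},
    {"glacier_inventory"},
]
labels = dbscan(sessions, eps=0.5, min_pts=2)
suggestions = recommend_by_access({"rainfall_daily"}, sessions, labels)
```

Jaccard distance ignores items that are absent from both sessions, which suits sparse binary session vectors where most datasets are never touched in a given visit.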
The following conclusions can be drawn: (1) the access and search behavior of users of the geoscience data sharing network reflects their professional focus, and it is more predictable than that of users of general websites; (2) predicting the behavior of geoscience data sharing users requires a clear definition of user behavior and an appropriate distance function to describe behavioral similarity; (3) predicting users' data needs from the TF-IDF values of search terms is feasible, and the resulting recommendations can supplement search results. This study can support the construction of data sharing platforms in the geosciences and improve the quality of sharing services, and it can also offer a technical reference for scientific data sharing in other fields.
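The TF-IDF search recommendation can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: each cluster's search terms form one document, term frequency is the relative count within the cluster, and inverse document frequency is `log(N / df)` over the `N` clusters. The exact weighting variant is not given in the text, and the cluster names, search terms, and dataset IDs are hypothetical.

```python
import math
from collections import Counter


def tfidf_by_cluster(cluster_terms):
    """Map {cluster_id: [search term, ...]} to {cluster_id: {term: tf-idf}}."""
    n_clusters = len(cluster_terms)
    # Document frequency: in how many clusters each term appears.
    doc_freq = Counter()
    for terms in cluster_terms.values():
        doc_freq.update(set(terms))
    scores = {}
    for cid, terms in cluster_terms.items():
        counts = Counter(terms)
        total = len(terms)
        scores[cid] = {
            term: (count / total) * math.log(n_clusters / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


def recommend_by_search(term, scores, cluster_items, top_n=3):
    """Return frequent items of the cluster where `term` has the highest TF-IDF."""
    best, best_score = None, 0.0
    for cid, table in scores.items():
        score = table.get(term, 0.0)
        if score > best_score:
            best, best_score = cid, score
    if best is None:  # unseen term, or a term common to every cluster
        return []
    return [item for item, _ in Counter(cluster_items[best]).most_common(top_n)]


# Hypothetical search terms and accessed datasets per session cluster.
cluster_terms = {
    "soil": ["soil", "soil", "land use", "china"],
    "climate": ["rainfall", "temperature", "rainfall", "china"],
    "glacier": ["glacier", "china", "temperature"],
}
cluster_items = {
    "soil": ["soil_map"] * 3 + ["landuse_2000"] * 2 + ["dem_90m"],
    "climate": ["rainfall_daily"] * 3 + ["temperature_daily"] * 2 + ["evaporation"],
    "glacier": ["glacier_inventory"] * 2 + ["snow_cover"],
}
scores = tfidf_by_cluster(cluster_terms)
climate_picks = recommend_by_search("rainfall", scores, cluster_items, top_n=2)
```

With this weighting, a term that appears in every cluster (here "china") scores zero everywhere and yields no recommendation, which fits the idea of using these recommendations as a supplement to ordinary search results rather than a replacement.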