Global data mining: An empirical study of current trends, future forecasts and technology diffusions

Summary of Chinese-English Terminology in Artificial Intelligence

名词解释中英文对比<using_information_sources> social networks 社会网络abductive reasoning 溯因推理action recognition(行为识别)active learning(主动学习)adaptive systems 自适应系统adverse drugs reactions(药物不良反应)algorithm design and analysis(算法设计与分析) algorithm(算法)artificial intelligence 人工智能association rule(关联规则)attribute value taxonomy 属性分类规范automomous agent 自动代理automomous systems 自动系统background knowledge 背景知识bayes methods(贝叶斯方法)bayesian inference(贝叶斯推断)bayesian methods(bayes 方法)belief propagation(置信传播)better understanding 内涵理解big data 大数据big data(大数据)biological network(生物网络)biological sciences(生物科学)biomedical domain 生物医学领域biomedical research(生物医学研究)biomedical text(生物医学文本)boltzmann machine(玻尔兹曼机)bootstrapping method 拔靴法case based reasoning 实例推理causual models 因果模型citation matching (引文匹配)classification (分类)classification algorithms(分类算法)clistering algorithms 聚类算法cloud computing(云计算)cluster-based retrieval (聚类检索)clustering (聚类)clustering algorithms(聚类算法)clustering 聚类cognitive science 认知科学collaborative filtering (协同过滤)collaborative filtering(协同过滤)collabrative ontology development 联合本体开发collabrative ontology engineering 联合本体工程commonsense knowledge 常识communication networks(通讯网络)community detection(社区发现)complex data(复杂数据)complex dynamical networks(复杂动态网络)complex network(复杂网络)complex network(复杂网络)computational biology 计算生物学computational biology(计算生物学)computational complexity(计算复杂性) computational intelligence 智能计算computational modeling(计算模型)computer animation(计算机动画)computer networks(计算机网络)computer science 计算机科学concept clustering 概念聚类concept formation 概念形成concept learning 概念学习concept map 概念图concept model 概念模型concept modelling 概念模型conceptual model 概念模型conditional random field(条件随机场模型) conjunctive quries 合取查询constrained least squares (约束最小二乘) convex programming(凸规划)convolutional neural networks(卷积神经网络) customer relationship management(客户关系管理) data analysis(数据分析)data analysis(数据分析)data center(数据中心)data clustering (数据聚类)data compression(数据压缩)data envelopment analysis (数据包络分析)data fusion 数据融合data 
generation(数据生成)data handling(数据处理)data hierarchy (数据层次)data integration(数据整合)data integrity 数据完整性data intensive computing(数据密集型计算)data management 数据管理data management(数据管理)data management(数据管理)data miningdata mining 数据挖掘data model 数据模型data models(数据模型)data partitioning 数据划分data point(数据点)data privacy(数据隐私)data security(数据安全)data stream(数据流)data streams(数据流)data structure( 数据结构)data structure(数据结构)data visualisation(数据可视化)data visualization 数据可视化data visualization(数据可视化)data warehouse(数据仓库)data warehouses(数据仓库)data warehousing(数据仓库)database management systems(数据库管理系统)database management(数据库管理)date interlinking 日期互联date linking 日期链接Decision analysis(决策分析)decision maker 决策者decision making (决策)decision models 决策模型decision models 决策模型decision rule 决策规则decision support system 决策支持系统decision support systems (决策支持系统) decision tree(决策树)decission tree 决策树deep belief network(深度信念网络)deep learning(深度学习)defult reasoning 默认推理density estimation(密度估计)design methodology 设计方法论dimension reduction(降维) dimensionality reduction(降维)directed graph(有向图)disaster management 灾害管理disastrous event(灾难性事件)discovery(知识发现)dissimilarity (相异性)distributed databases 分布式数据库distributed databases(分布式数据库) distributed query 分布式查询document clustering (文档聚类)domain experts 领域专家domain knowledge 领域知识domain specific language 领域专用语言dynamic databases(动态数据库)dynamic logic 动态逻辑dynamic network(动态网络)dynamic system(动态系统)earth mover's distance(EMD 距离) education 教育efficient algorithm(有效算法)electric commerce 电子商务electronic health records(电子健康档案) entity disambiguation 实体消歧entity recognition 实体识别entity recognition(实体识别)entity resolution 实体解析event detection 事件检测event detection(事件检测)event extraction 事件抽取event identificaton 事件识别exhaustive indexing 完整索引expert system 专家系统expert systems(专家系统)explanation based learning 解释学习factor graph(因子图)feature extraction 特征提取feature extraction(特征提取)feature extraction(特征提取)feature selection (特征选择)feature selection 特征选择feature selection(特征选择)feature space 特征空间first order logic 一阶逻辑formal logic 
形式逻辑formal meaning prepresentation 形式意义表示formal semantics 形式语义formal specification 形式描述frame based system 框为本的系统frequent itemsets(频繁项目集)frequent pattern(频繁模式)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy clustering (模糊聚类)fuzzy data mining(模糊数据挖掘)fuzzy logic 模糊逻辑fuzzy set theory(模糊集合论)fuzzy set(模糊集)fuzzy sets 模糊集合fuzzy systems 模糊系统gaussian processes(高斯过程)gene expression data 基因表达数据gene expression(基因表达)generative model(生成模型)generative model(生成模型)genetic algorithm 遗传算法genome wide association study(全基因组关联分析) graph classification(图分类)graph classification(图分类)graph clustering(图聚类)graph data(图数据)graph data(图形数据)graph database 图数据库graph database(图数据库)graph mining(图挖掘)graph mining(图挖掘)graph partitioning 图划分graph query 图查询graph structure(图结构)graph theory(图论)graph theory(图论)graph theory(图论)graph theroy 图论graph visualization(图形可视化)graphical user interface 图形用户界面graphical user interfaces(图形用户界面)health care 卫生保健health care(卫生保健)heterogeneous data source 异构数据源heterogeneous data(异构数据)heterogeneous database 异构数据库heterogeneous information network(异构信息网络) heterogeneous network(异构网络)heterogenous ontology 异构本体heuristic rule 启发式规则hidden markov model(隐马尔可夫模型)hidden markov model(隐马尔可夫模型)hidden markov models(隐马尔可夫模型) hierarchical clustering (层次聚类) homogeneous network(同构网络)human centered computing 人机交互技术human computer interaction 人机交互human interaction 人机交互human robot interaction 人机交互image classification(图像分类)image clustering (图像聚类)image mining( 图像挖掘)image reconstruction(图像重建)image retrieval (图像检索)image segmentation(图像分割)inconsistent ontology 本体不一致incremental learning(增量学习)inductive learning (归纳学习)inference mechanisms 推理机制inference mechanisms(推理机制)inference rule 推理规则information cascades(信息追随)information diffusion(信息扩散)information extraction 信息提取information filtering(信息过滤)information filtering(信息过滤)information integration(信息集成)information network analysis(信息网络分析) information network mining(信息网络挖掘) information network(信息网络)information processing 信息处理information processing 信息处理information 
resource management (信息资源管理) information retrieval models(信息检索模型) information retrieval 信息检索information retrieval(信息检索)information retrieval(信息检索)information science 情报科学information sources 信息源information system( 信息系统)information system(信息系统)information technology(信息技术)information visualization(信息可视化)instance matching 实例匹配intelligent assistant 智能辅助intelligent systems 智能系统interaction network(交互网络)interactive visualization(交互式可视化)kernel function(核函数)kernel operator (核算子)keyword search(关键字检索)knowledege reuse 知识再利用knowledgeknowledgeknowledge acquisitionknowledge base 知识库knowledge based system 知识系统knowledge building 知识建构knowledge capture 知识获取knowledge construction 知识建构knowledge discovery(知识发现)knowledge extraction 知识提取knowledge fusion 知识融合knowledge integrationknowledge management systems 知识管理系统knowledge management 知识管理knowledge management(知识管理)knowledge model 知识模型knowledge reasoningknowledge representationknowledge representation(知识表达) knowledge sharing 知识共享knowledge storageknowledge technology 知识技术knowledge verification 知识验证language model(语言模型)language modeling approach(语言模型方法) large graph(大图)large graph(大图)learning(无监督学习)life science 生命科学linear programming(线性规划)link analysis (链接分析)link prediction(链接预测)link prediction(链接预测)link prediction(链接预测)linked data(关联数据)location based service(基于位置的服务) loclation based services(基于位置的服务) logic programming 逻辑编程logical implication 逻辑蕴涵logistic regression(logistic 回归)machine learning 机器学习machine translation(机器翻译)management system(管理系统)management( 知识管理)manifold learning(流形学习)markov chains 马尔可夫链markov processes(马尔可夫过程)matching function 匹配函数matrix decomposition(矩阵分解)matrix decomposition(矩阵分解)maximum likelihood estimation(最大似然估计)medical research(医学研究)mixture of gaussians(混合高斯模型)mobile computing(移动计算)multi agnet systems 多智能体系统multiagent systems 多智能体系统multimedia 多媒体natural language processing 自然语言处理natural language processing(自然语言处理) nearest neighbor (近邻)network analysis( 网络分析)network analysis(网络分析)network analysis(网络分析)network 
formation(组网)network structure(网络结构)network theory(网络理论)network topology(网络拓扑)network visualization(网络可视化)neural network(神经网络)neural networks (神经网络)neural networks(神经网络)nonlinear dynamics(非线性动力学)nonmonotonic reasoning 非单调推理nonnegative matrix factorization (非负矩阵分解) nonnegative matrix factorization(非负矩阵分解) object detection(目标检测)object oriented 面向对象object recognition(目标识别)object recognition(目标识别)online community(网络社区)online social network(在线社交网络)online social networks(在线社交网络)ontology alignment 本体映射ontology development 本体开发ontology engineering 本体工程ontology evolution 本体演化ontology extraction 本体抽取ontology interoperablity 互用性本体ontology language 本体语言ontology mapping 本体映射ontology matching 本体匹配ontology versioning 本体版本ontology 本体论open government data 政府公开数据opinion analysis(舆情分析)opinion mining(意见挖掘)opinion mining(意见挖掘)outlier detection(孤立点检测)parallel processing(并行处理)patient care(病人医疗护理)pattern classification(模式分类)pattern matching(模式匹配)pattern mining(模式挖掘)pattern recognition 模式识别pattern recognition(模式识别)pattern recognition(模式识别)personal data(个人数据)prediction algorithms(预测算法)predictive model 预测模型predictive models(预测模型)privacy preservation(隐私保护)probabilistic logic(概率逻辑)probabilistic logic(概率逻辑)probabilistic model(概率模型)probabilistic model(概率模型)probability distribution(概率分布)probability distribution(概率分布)project management(项目管理)pruning technique(修剪技术)quality management 质量管理query expansion(查询扩展)query language 查询语言query language(查询语言)query processing(查询处理)query rewrite 查询重写question answering system 问答系统random forest(随机森林)random graph(随机图)random processes(随机过程)random walk(随机游走)range query(范围查询)RDF database 资源描述框架数据库RDF query 资源描述框架查询RDF repository 资源描述框架存储库RDF storge 资源描述框架存储real time(实时)recommender system(推荐系统)recommender system(推荐系统)recommender systems 推荐系统recommender systems(推荐系统)record linkage 记录链接recurrent neural network(递归神经网络) regression(回归)reinforcement learning 强化学习reinforcement learning(强化学习)relation extraction 关系抽取relational database 关系数据库relational learning 关系学习relevance 
feedback (相关反馈)resource description framework 资源描述框架restricted boltzmann machines(受限玻尔兹曼机) retrieval models(检索模型)rough set theroy 粗糙集理论rough set 粗糙集rule based system 基于规则系统rule based 基于规则rule induction (规则归纳)rule learning (规则学习)rule learning 规则学习schema mapping 模式映射schema matching 模式匹配scientific domain 科学域search problems(搜索问题)semantic (web) technology 语义技术semantic analysis 语义分析semantic annotation 语义标注semantic computing 语义计算semantic integration 语义集成semantic interpretation 语义解释semantic model 语义模型semantic network 语义网络semantic relatedness 语义相关性semantic relation learning 语义关系学习semantic search 语义检索semantic similarity 语义相似度semantic similarity(语义相似度)semantic web rule language 语义网规则语言semantic web 语义网semantic web(语义网)semantic workflow 语义工作流semi supervised learning(半监督学习)sensor data(传感器数据)sensor networks(传感器网络)sentiment analysis(情感分析)sentiment analysis(情感分析)sequential pattern(序列模式)service oriented architecture 面向服务的体系结构shortest path(最短路径)similar kernel function(相似核函数)similarity measure(相似性度量)similarity relationship (相似关系)similarity search(相似搜索)similarity(相似性)situation aware 情境感知social behavior(社交行为)social influence(社会影响)social interaction(社交互动)social interaction(社交互动)social learning(社会学习)social life networks(社交生活网络)social machine 社交机器social media(社交媒体)social media(社交媒体)social media(社交媒体)social network analysis 社会网络分析social network analysis(社交网络分析)social network(社交网络)social network(社交网络)social science(社会科学)social tagging system(社交标签系统)social tagging(社交标签)social web(社交网页)sparse coding(稀疏编码)sparse matrices(稀疏矩阵)sparse representation(稀疏表示)spatial database(空间数据库)spatial reasoning 空间推理statistical analysis(统计分析)statistical model 统计模型string matching(串匹配)structural risk minimization (结构风险最小化) structured data 结构化数据subgraph matching 子图匹配subspace clustering(子空间聚类)supervised learning( 有support vector machine 支持向量机support vector machines(支持向量机)system dynamics(系统动力学)tag recommendation(标签推荐)taxonmy induction 感应规范temporal logic 时态逻辑temporal reasoning 时序推理text analysis(文本分析)text anaylsis 
文本分析text classification (文本分类)text data(文本数据)text mining technique(文本挖掘技术)text mining 文本挖掘text mining(文本挖掘)text summarization(文本摘要)thesaurus alignment 同义对齐time frequency analysis(时频分析)time series analysis( 时time series data(时间序列数据)time series data(时间序列数据)time series(时间序列)topic model(主题模型)topic modeling(主题模型)transfer learning 迁移学习triple store 三元组存储uncertainty reasoning 不精确推理undirected graph(无向图)unified modeling language 统一建模语言unsupervisedupper bound(上界)user behavior(用户行为)user generated content(用户生成内容)utility mining(效用挖掘)visual analytics(可视化分析)visual content(视觉内容)visual representation(视觉表征)visualisation(可视化)visualization technique(可视化技术) visualization tool(可视化工具)web 2.0(网络2.0)web forum(web 论坛)web mining(网络挖掘)web of data 数据网web ontology lanuage 网络本体语言web pages(web 页面)web resource 网络资源web science 万维科学web search (网络检索)web usage mining(web 使用挖掘)wireless networks 无线网络world knowledge 世界知识world wide web 万维网world wide web(万维网)xml database 可扩展标志语言数据库附录 2 Data Mining 知识图谱(共包含二级节点15 个,三级节点93 个)间序列分析)监督学习)领域 二级分类 三级分类。

Data Management Policies for International Global Change Research (III) (1)

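2 Regional networks

The Asia-Pacific Network for Global Change Research (APN) aims to establish an intergovernmental collaborative network that promotes global change research in the countries of the Asia-Pacific region and strengthens each country's capacity to deal with global environmental change. The network strongly emphasizes the need to introduce and strengthen electronic and other communication systems in the Asia-Pacific region, so as to promote the exchange of data and information within the region and to resolve issues such as the development of data policy, data standardization, and quality assurance. A joint or common data set should be developed.

The overall goal of the European Network for Research in Global Change (ENRICH) is to make a significant European contribution to international global change research programmes. Taking account of existing activities in EU member states, ENRICH aims to provide a knowledge base for the development of EU policy objectives. This is to be achieved by acting as a forum for information exchange and by promoting cooperation in research and capacity building. A major ENRICH effort is to set up an ENRICH server on the Internet and to develop an advanced trans-European communication network (high bandwidth, high resolution, interactive multimedia services), in particular to link universities and research centres across Europe.

The Inter-American Institute for Global Change Research (IAI) covers the Americas, Europe-Africa, and the Far East-Southwest Pacific region. Its main objectives are to: (1) guide and support basic research; (2) collect and manage data; (3) promote human resource development; and (4) contribute to the formulation of public policy related to global change. Its basic principle is to promote the exchange of standardized data and information.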

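2.3.3 National networks

The US Long Term Ecological Research (LTER) Network programme is funded by the US National Science Foundation (NSF) and was formally launched in 1980. It was the world's first research network focused mainly on long-term ecological phenomena. It has since become the largest national long-term ecological research network in the world, with the highest level of research.

LTER emphasizes the comparability of data sets and the standardization of methods and equipment. Data set comparability covers at least statistics and real-time records. Equipment standardization also covers measurements, methods, and computers; the standard hardware and software for communications, data control, and analysis had already been selected by 1988.

The UK Environmental Change Network (ECN), established in 1992, is an integrated environmental monitoring network. The network aims to collect, store, analyse, and interpret long-term data based on a set of key variables.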

Analysis of Research Hotspots and Frontiers in International Data Security

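security and related research; the research frontiers include big data security technology and privacy protection, data sharing, and Internet of Things data security.
Keywords: data security; research hotspots; research frontiers; visual analysis; CiteSpace
CLC number: G353.11
Document code: A
DOI: 10.3969/j.issn.1003-8256.2021.03.009
Open Science (Resource Service) Identifier (OSID):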
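Table 4. Statistics on the distribution of data security research by country

Country/region    Publications
China             511
United States     342
India             136
England           104
Australia         98
Germany           96
South Korea       81
Canada            67
France            38
Italy             29

Country/region    Burst strength
China             7.47
Germany           4.68
United States     4.16
Belgium           3.55
Switzerland       2.95
Romania           2.63
Spain             2.55
Jordan            2.53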

Table 3. Distribution of core authors in data security (N > 5)

No.   Author
1     ZHANG Yinghui
2     LIU Ximeng
3     LI Hui
4     YANG Yixian
5     NOMAN Mohammed
6     DENG HUA
7     GUNASEKARAN MANOGARAN
8     YI Xun
9     HUANG Qinlong
10    JIANG Xiaoqian
11    DENG Robert H
12    WANG Shangping
13    ZHENG Dong
14    MA Jianfeng
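2 Bibliometric analysis of the state of data security research
2.1 Trend in publication volume
Studying the number and growth rate of publications in the data security field can reveal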

Big Data Research in China and Abroad: Current Status and Development Trends

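In the information age, big data has become an important resource and technology. Its emergence has not only changed how people live and how businesses operate, but has also advanced scientific research. This article analyses the current state of big data research in China and abroad, as well as its future development.

I. Current state of international big data research

Big data research has made considerable progress internationally. First, in data storage, cloud computing is widely used to store and manage massive data sets, for example Amazon S3 and Google Bigtable. Second, in data processing, distributed and parallel computing are used to speed up big data processing, for example MapReduce and Spark. In addition, data mining and machine learning have become important directions in big data research: analysing and learning from large volumes of data reveals the association patterns and regularities within them.

II. Current state of big data research in China

In China, big data research is also developing vigorously. First, with government support, major universities and research institutions have launched big data research projects. Second, in industry, fields such as finance, healthcare, and logistics have begun to use big data to improve efficiency and service quality. In addition, some Internet companies, such as Alibaba and Baidu, have explored big data analysis and algorithm development in depth.

III. International developments in big data research

Internationally, big data research is becoming both deeper and broader. First, as Internet of Things technology evolves, the large volume of sensor data it generates will drive demand for data storage and analysis. Second, in artificial intelligence, the rise of deep learning has given big data research new methods and ideas. In addition, cross-disciplinary research has become a trend in the big data field, for example combining big data with social science, medicine, and other disciplines to explore new research directions and methods.

IV. Developments in big data research in China

In China, big data research also continues to advance and break new ground. First, the government has increased its support for big data research, introducing a series of development policies and funding schemes. Second, cooperation and exchange between academia and industry are increasingly frequent, accelerating the spread and application of big data technology. In addition, emerging fields such as artificial intelligence and blockchain will bring new opportunities and challenges to big data research.

V. International trends in big data research

Internationally, big data research is trending toward diversified and combined development.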

Data Analysis English Test Questions and Answers

I. Multiple choice (2 points each, 10 points total)

1. Which of the following is not a common data type in data analysis?
   A. Numerical  B. Categorical  C. Textual  D. Binary
2. What is the process of transforming raw data into an understandable format called?
   A. Data cleaning  B. Data transformation  C. Data mining  D. Data visualization
3. In data analysis, what does the term "variance" refer to?
   A. The average of the data points  B. The spread of the data points around the mean  C. The sum of the data points  D. The highest value in the data set
4. Which statistical measure is used to determine the central tendency of a data set?
   A. Mode  B. Median  C. Mean  D. All of the above
5. What is the purpose of using a correlation coefficient in data analysis?
   A. To measure the strength and direction of a linear relationship between two variables  B. To calculate the mean of the data points  C. To identify outliers in the data set  D. To predict future data points

II. Fill in the blanks (2 points each, 10 points total)

6. The process of identifying and correcting (or removing) errors and inconsistencies in data is known as ________.
7. A type of data that can be ordered or ranked is called ________ data.
8. The ________ is a statistical measure that shows the average of a data set.
9. A ________ is a graphical representation of data that uses bars to show comparisons among categories.
10. When two variables move in opposite directions, the correlation between them is ________.

III. Short answer (5 points each, 20 points total)

11. Explain the difference between descriptive and inferential statistics.
12. What is the significance of a p-value in hypothesis testing?
13. Describe the concept of data normalization and its importance in data analysis.
14. How can data visualization help in understanding complex data sets?

IV. Calculation (10 points each, 20 points total)

15. Given a data set with the following values: 10, 12, 15, 18, 20, calculate the mean and standard deviation.
16. If a data analyst wants to compare the performance of two different marketing campaigns, what type of statistical test might they use and why?

V. Case analysis (15 points each, 30 points total)

17. A company wants to analyze the sales data of its products over the last year. What steps should the data analyst take to prepare the data for analysis?
18. Discuss the ethical considerations a data analyst should keep in mind when handling sensitive customer data.

Answers:

I. Multiple choice
1. D  2. B  3. B  4. D  5. A

II. Fill in the blanks
6. Data cleaning  7. Ordinal  8. Mean  9. Bar chart  10. Negative

III. Short answer
11. Descriptive statistics summarize and describe the features of a data set, while inferential statistics make predictions or inferences about a population based on a sample.
12. A p-value indicates the probability of observing the data, or something more extreme, if the null hypothesis is true. A small p-value suggests that the observed data is unlikely under the null hypothesis, leading to its rejection.
13. Data normalization is the process of scaling data to a common scale. It is important because it allows for meaningful comparisons between variables and can improve the performance of certain algorithms.
14. Data visualization can help in understanding complex data sets by providing a visual representation of the data, making it easier to identify patterns, trends, and outliers.

IV. Calculation
15. Mean = (10 + 12 + 15 + 18 + 20) / 5 = 75 / 5 = 15. Standard deviation = √[Σ(xi - mean)^2 / N] = √[(25 + 9 + 0 + 9 + 25) / 5] = √13.6 ≈ 3.69. (If the values are treated as a sample, dividing by N - 1 = 4 gives √17 ≈ 4.12.)
16. A t-test or ANOVA might be used to compare the means of the two campaigns, as these tests can determine if there is a statistically significant difference between the groups.

V. Case analysis
17. The data analyst should first clean the data by removing any errors or inconsistencies. Then, they should transform the data into a suitable format for analysis, such as creating a time series for monthly sales. They might also normalize the data if necessary and perform exploratory data analysis to identify any patterns or trends.
18. A data analyst should ensure the confidentiality and privacy of customer data, comply with relevant data protection laws, and obtain consent where required. They should also be transparent about how the data will be used and take steps to prevent any potential misuse of the data.
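
The mean and standard deviation asked for in question 15 can be checked with a few lines of Python. This is a minimal sketch that uses only the standard `statistics` module; the variable names are illustrative, and both the population form (dividing by N) and the sample form (dividing by N - 1) are shown because the question does not say which one is intended.

```python
import math
import statistics

values = [10, 12, 15, 18, 20]

mean = statistics.fmean(values)             # arithmetic mean
population_sd = statistics.pstdev(values)   # divides by N
sample_sd = statistics.stdev(values)        # divides by N - 1

# The population formula written out step by step, for comparison.
squared_deviations = [(x - mean) ** 2 for x in values]
manual_sd = math.sqrt(sum(squared_deviations) / len(values))

print(mean, round(population_sd, 2), round(sample_sd, 2))
# prints: 15.0 3.69 4.12
```

Writing the formula out by hand alongside the library call makes it easy to see which divisor a given answer key has used.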

Global Digital Wealth Survey Report (English)

Introduction

The Global Digital Wealth Survey Report provides an in-depth analysis of the current landscape of digital wealth across the world. The survey explores various aspects such as digital assets, cryptocurrency, online investment platforms, and the adoption of financial technology in managing wealth. This report examines the findings of the survey and provides insights into the global digital wealth landscape.

Methodology

The survey was conducted through an online questionnaire that was distributed to individuals across different age groups and regions. The questionnaire consisted of multiple-choice questions and open-ended questions, allowing respondents to provide detailed insights into their personal experiences with digital wealth.

Key Findings

1. Digital Assets:
a. 73% of respondents reported owning digital assets, with cryptocurrencies being the most popular form.
b. Bitcoin was identified as the most widely held cryptocurrency, followed by Ethereum and Ripple.
c. Ownership of non-fungible tokens (NFTs) was reported by 29% of respondents, primarily in the age group of 18-34.

2. Cryptocurrency Adoption:
a. 59% of respondents view cryptocurrency as a viable investment option.
b. Security concerns and volatility were identified as the main barriers to cryptocurrency adoption.
c. In terms of usage, 41% of respondents have used cryptocurrencies for online transactions and purchases.

3. Online Investment Platforms:
a. 68% of respondents reported using online investment platforms for managing their wealth.
b. Robo-advisors were the most popular type of online investment platform, followed by peer-to-peer lending platforms and crowdfunding platforms.
c. Transparency, ease of use, and lower fees were identified as the key factors influencing the choice of online investment platforms.

4. Financial Technology Adoption:
a. 78% of respondents reported using financial technology tools for managing their wealth.
b. Mobile apps for banking and investment purposes were the most widely used financial technology tools.
c. Accessibility, convenience, and enhanced security were cited as the main advantages of using financial technology tools.

Conclusion

The Global Digital Wealth Survey Report highlights the increasing adoption of digital assets, cryptocurrencies, online investment platforms, and financial technology tools for managing wealth globally. The findings emphasize the importance of addressing security concerns and promoting financial literacy to encourage wider adoption of digital wealth management solutions. As the world becomes more digitally interconnected, it is crucial for individuals to stay informed and adapt to the evolving landscape of digital wealth.

Progress in Geographical Science (English Version)

Geographical science is a multidisciplinary field that studies the Earth's physical features, climate patterns, landforms, ecosystems, human settlements, and their interactions. Over the years, there have been significant advancements in geographical science that have greatly contributed to our understanding of the world. Here are some key areas of progress:

1. Remote Sensing and GIS: Remote sensing technology has revolutionized the way we collect data about the Earth's surface. Satellites and airborne sensors provide high-resolution images that help in mapping and monitoring various phenomena such as land use, vegetation cover, and urban growth. Geographic Information Systems (GIS) enable the storage, analysis, and visualization of spatial data, facilitating advanced spatial modeling and decision-making processes.

2. Climate Change Research: Geographical science plays a crucial role in studying the impacts of climate change. Scientists analyze temperature records, precipitation patterns, and sea level rise to understand the changing climate and its effects on ecosystems, agriculture, and human societies. This research helps in developing strategies for adaptation and mitigation.

3. Geospatial Analysis: Geographical science has seen advancements in geospatial analysis techniques, allowing for more accurate and detailed investigations. Geographic data can be analyzed using statistical methods, spatial interpolation, and geostatistics to identify spatial patterns, trends, and relationships. This aids in solving complex spatial problems, such as disease mapping, urban planning, and transportation optimization.

4. Human Geography: The study of human geography has advanced significantly, focusing on the relationships between people and the environment. It includes analyzing population dynamics, migration patterns, urbanization, cultural landscapes, and socioeconomic inequalities. Understanding these factors is crucial for effective urban planning, resource management, and sustainable development.

5. Geographical Information Science (GIScience): GIScience is an emerging field that combines geographical science with computer science and artificial intelligence. It explores new methods and algorithms for spatial analysis, data integration, and modeling. GIScience contributes to advancements in location-based services, spatial data mining, and geovisualization techniques.

6. Geographical Education: There have been improvements in geographical education, with innovative teaching methods and technologies being adopted. Interactive mapping tools, online data resources, and virtual field trips provide students with hands-on learning experiences and a deeper understanding of geographical concepts.

These are just a few examples of the progress made in geographical science. With ongoing advancements in technology and interdisciplinary collaborations, geographical science continues to evolve and contribute to our knowledge of the world around us.

Students' Academic Performance in Deep-Learning-Based Educational Data Mining... (IJEME-V10-N6-4)

I.J. Education and Management Engineering, 2020, 6, 27-33
Published Online December 2020 in MECS
DOI: 10.5815/ijeme.2020.06.04

Predicting Students' Academic Performance in Educational Data Mining Based on Deep Learning Using TensorFlow

Mussa S. Abubakari *, Fatchul Arifin
Department of Electronics & Informatics Engineering Education, Postgraduate Program, Universitas Negeri Yogyakarta, Yogyakarta 55281, Indonesia
E-mail: abu.mussaside@*, fatchul@uny.ac.id

Gilbert G. Hungilo
Department of Informatics Engineering, Graduate Program, University Atma Jaya Yogyakarta, Yogyakarta 55281, Indonesia
E-mail: gutabagaonline@

Received: 07 May 2020; Accepted: 26 July 2020; Published: 08 December 2020

Abstract: This study aimed to create a model for predicting students' academic performance based on a neural network algorithm. Educational data mining has recently become very helpful for decision making in an educational context and hence for improving students' academic outcomes. This study implemented a Neural Network algorithm as a data mining technique to extract knowledge patterns from a student dataset consisting of 480 instances (students) with 16 attributes for each student. Accuracy is the classification metric used to measure model quality. Accuracy was below 60% when the Adam optimizer was used. However, after applying the Stochastic Gradient Descent optimizer and the dropout technique, accuracy increased to more than 75%. The final stable accuracy obtained was 76.8%, which is a satisfactory result. This indicates that the suggested NN model can be reliable for prediction, especially in social science studies.

Index Terms: Classification, Data Mining Techniques, Educational Data Mining, Neural Network Algorithm, Predictive Model.

1. Introduction

Data mining has become a topic of interest for many researchers in fields such as medicine, engineering, and education.
In the educational context especially, mining students' information has made it easier to make decisions about students' academic performance [1, 2]. Predicting students' performance is vital in education: predicting the future performance of students once they are admitted to a college can identify who is likely to attain poor marks and who is likely to perform well. These results can support efficient admission decisions and hence improve the quality of academic services [3–5].

Analysing educational data with data mining techniques helps extract unique information about students from educational databases and use that hidden information to solve students' academic problems by understanding learners and improving teaching-learning methods and processes [6, 7]. Moreover, these data mining techniques help educational stakeholders make quality decisions that enhance students' outcomes.

Many researchers have used methods such as Decision Tree and Naïve Bayesian to predict learners' academic performance and to decide who needs help immediately [7]. Other researchers have used ensemble methods such as Random Forest (RF), AdaBoost, and Bagging as classification methods [7, 8]. Different data mining methods address different educational problems, such as classification and clustering. The best-known data mining method in prediction models is classification. Various deep learning algorithms, such as Neural Networks, are used for classification [9].

In the current study, a neural network (NN) classification algorithm is implemented to create a model that predicts the academic performance of students in a particular academic institution from students' characteristics and their distinctive demographic data.
A predictive model based on the NN approach can be useful in decision making on students' academic success, thereby enhancing academic management and improving the quality of education.

2. Related Works

Various studies have applied data mining in the educational context to uncover knowledge patterns in students' information and improve students' academic performance. The current study bases its theoretical background on the previous educational data mining research described below.

One study applied different mining techniques to engineering students in order to support academic decisions. Classification rules and association rules for discovering knowledge patterns were used to predict the engineering students' performance. The experiment also clustered the students with the k-means clustering algorithm [10]. In another study, students' performance was evaluated with an association rule algorithm, assessing performance on the basis of different features. The experiment was implemented in Weka on a real-time dataset collected on the school premises [11].

Baradwaj and Pal studied student assessment using a number of data mining methods. Their study helped teachers identify students who need special attention, in order to reduce the failure rate and take appropriate measures for subsequent semesters [3]. Another study developed a classification model to predict student performance using Deep Learning, which learns multiple levels of representation automatically. The authors used an unsupervised learning algorithm to pre-train the hidden feature layers layer by layer with a sparse auto-encoder on unlabeled data, and then used supervised training to fine-tune the parameters.
The resulting model was trained on a relatively large real-world student dataset, and the experimental findings indicate that the proposed method is effective enough to be built into an academic pre-warning mechanism [12].

Other researchers developed models to predict students' university performance from students' personal attributes, university performance, and pre-university characteristics. The studies used data on 10,330 students in Bulgaria, with 20 attributes per student. Algorithms such as k-nearest neighbour (KNN), decision tree, Naive Bayes, and rule learners were applied to classify the students into 5 classes: Excellent, Very Good, Good, Bad, or Average. Overall accuracy was below 69%. However, the decision tree classifier performed best, with the highest overall accuracy, followed by the rule learner [13, 14].

More recently, a study predicted users' intention to use a peer-to-peer (P2P) mobile application for transactions. Logistic regression (LR) analysis and a neural network were used to predict technology adoption. The results indicated that the NN model had higher accuracy than the LR model [15]. Another study proposed a student performance model with behavioral characteristics, which are associated with the student's interactivity with an e-learning platform. Data mining techniques such as Naïve Bayesian and Decision Tree classifiers were used to evaluate the impact of such features on students' academic performance. The results revealed a strong relationship between learner behaviors and academic achievement [16].

In this study, a predictive model is created with a neural network (NN) classification algorithm to predict the academic performance of students, using students' behavioral characteristics and their distinctive demographic data as variables.
A predictive model using NN data mining approach can help in making decisions and conclusions on academic success of students hence enhancing academic management and improve education quality.3. Methodology3.1 Data CollectionThe student data implemented in this project were obtained from educational dataset collected by [16] from learning management system (LMS) in The University of Jordan, Amman, Jordan during the study conducted in 2015. The dataset is available in the kaggle website (https:///aljarah/xAPI-Edu-Data). The dataset comprised of 480 (instances) of student records and their 16 respective attributes. These attributes were grouped into three classes, namely (i) Behavioral attributes include parents answering survey, school satisfaction, opening resources, and raised hand on class, (ii) Academic background attributes including grade Level, educational stage, and section, and (iii) Demographic features including nationality and gender. The dataset also includes 175 females and 305 males. The students have different nationalities including from Kuwait (179), USA (6), Jordan (172), Iraq (22), Lebanon (17), Tunis (12), Saudi Arabia (11), Egypt (9), from Iran, Syria, and Libya were 7 each, Morocco (4), 28 students from Palestine, and one from Venezuela.Another attribute is school attendance having two groups based on days of class absence: 191 students exceeded 7 days and 289 students were absent under 7 days. Moreover, the dataset includes also a new kind of attribute namely parent participation having two sub attributes: Parent School Satisfaction and Parent Answering Survey. 270 parents participated in a survey answering and 210 did not, 292 parents were satisfied from the school and 188 were not. The students arePredicting Students' Academic Performance in Educational Data Mining Based on Deep Learning Using TensorFlow 29 grouped into three classes based on their total grades, namely High-Level, Middle-Level, and Low-Level [8]. 
Appendix A summarizes the students’ attributes and their descriptions.

3.2 Methods and Data Preparation

For this study, the authors used the Anaconda environment for Python together with the Keras machine learning library and, specifically, TensorFlow to create and evaluate the proposed NN classification model [17–19]. Keras is a Python library widely used in deep learning that runs on top of TensorFlow and Theano and provides an intuitive API for building NNs [20, 21]. Since the dataset used in this study contains variables (attributes) of different kinds, they had to be transformed into a numeric form that the NN model can process. The dataset explained above consists of three main categories of variables. The first are nominal variables with two categories, such as gender (male or female) and semester (first or second). The second are variables with numerical values, such as visited resources and raised hand. The third are nominal variables with three or more categories, such as grade level (G-01 to G-12) and topic (English, Math, Chemistry, and so on), as can be seen in Appendix A.

Nominal variables with two categories were transformed using a label encoder, while those with three or more categories were transformed using one-hot encoding (the dummies method). Furthermore, continuous numerical variables were normalized using a min-max scaler, which rescales each variable to a common range.

4. Experiment Process and Results

After the data transformation explained above, the number of inputs increased from 16 to 39, and the classification output has 3 columns, making a total of 42 columns in the NN model.
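The three transformations described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the code from Appendix B: the helper names (`label_encode`, `one_hot`, `min_max`) and the tiny sample values are assumptions made for the example, and the [0, 1] target range of the min-max scaling is the conventional choice rather than something the study states.

```python
def label_encode(values):
    """Map a two-category nominal variable to 0/1 (categories sorted alphabetically)."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values]


def one_hot(values):
    """Expand a nominal variable with three or more categories into dummy columns."""
    categories = sorted(set(values))
    return [[1 if v == cat else 0 for cat in categories] for v in values]


def min_max(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample rows, not taken from the dataset.
gender = ["M", "F", "M"]
section = ["A", "C", "B"]
raised_hand = [0, 50, 100]

gender_encoded = label_encode(gender)
section_encoded = one_hot(section)
raised_hand_scaled = min_max(raised_hand)
```

Because one-hot encoding turns each multi-category attribute into several columns, a transformation of this kind is consistent with the input count growing from 16 attributes to 39 columns.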
After that, the dataset was split into training data and test data, with the test data making up less than 26% of the dataset and the remainder used for training.

The next step was to create a predictive model based on the Artificial Neural Network (ANN) classification technique to evaluate the attributes that directly or indirectly influence students' academic success. The ANN is trained on the input data to achieve the best possible accuracy. A 10-fold cross validation was used to divide the dataset for the training and testing process. The model was then fitted for 200 iterations (epochs) with a batch size of 10, after which the results were evaluated to generate the knowledge representation. The evaluation measure used for classification quality is accuracy, which is the ratio of the number of correct predictions to the total number of predictions.

Fig. 1. The NN Model Structure.

Figure 1 above shows the NN model structure created by Python code, as can be seen in the last code line in Appendix B. The NN predictive model used in this study consists of three layers: (1) an input layer with 39 neurons, (2) a hidden layer with 19 neurons, and (3) an output layer with 3 outputs. The input layer receives the input data derived from the 16 attributes, and the output layer produces one of three grade categories, namely Low (L), Middle (M), and High (H). There is a hidden layer between the input layer and the output layer. Appendix B illustrates the Python code used to create, fit, and validate the NN model.

In this study, we used accuracy as the metric for prediction quality of the developed NN model.
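As a small illustration of the evaluation measure, the sketch below computes accuracy in its usual sense: the number of correct predictions divided by the total number of predictions. The function name and the sample labels are hypothetical and are not taken from the study's data.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one prediction is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels using the three grade classes L, M and H.
y_true = ["L", "M", "H", "M", "L"]
y_pred = ["L", "M", "M", "M", "H"]
score = accuracy(y_true, y_pred)  # 3 of the 5 predictions are correct
```

Accuracy alone does not show which classes are confused with each other, which is why per-class measures such as precision and recall are often reported alongside it.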
Also, only NN algorithm was used for classification of the student dataset.The result of the experiment has two versions due to the implementation of two different model (function) optimizers namely, Adam and Stochastic gradient descent (SGD) as well as due to the introduction of dropout technique to the NN model development to drop (20% of neurons were dropped in this study) loosely connected neuron. The result indicates that when we applied Adam optimization technique the accuracy was below 60%. While, when we applied the SGD optimizer the accuracy improved to more than 76%.Moreover, the dropout technique helped to improve the accuracy value to more than 76.5%. The dropout technique is used to remove the loosely connected neurons as the NN technique performs better with fully connected neurons. The final stable result was 76.8% accuracy.5. Conclusion and Future WorkEducation is a vital element in any community for their social-economic development. Data mining techniques or business intelligence allows extracting knowledge patterns from students’ raw data offering interesting chances for the educational context. Particularly, various studies have implemented machine learning techniques like Decision Tree and Random Forest to enhance the management of college resources and hence improving education quality.In this study, the authors have presented a predictive model using NN technique to learn the patterns from students’ data and predict their academic performance. By applying data mining techniques on students’ database, academic stakeholders can find the important factors which have direct or indirect impacts on the student’s academic success. The knowledge patterns and results discovered in this study after applying NN classification method indicate that different attributes of students have impacts on their learning process as it can be seen in the classification accuracy results. 
The final classification accuracy obtained in this study is 76.9% which is more than satisfactory percentage for our predictive model developed using NN algorithm.Like other studies, this study is with some limitations too. One of which is the dataset can only be applied to the similar context as this study. Also, the results presented here involves the accuracy as the only predictive measure of model quality. Moreover, only one algorithm, NN algorithm was used for classification purpose.For future studies, authors intend to use the localized student data from a particular university in Yogyakarta, especially from Yogyakarta State University. Also, in the future we expect to apply other data mining methods such as RF, DT, and others in the localized dataset. Moreover, future experiments will add more measurement classification qualities such as Precision, sensitivity, and Recall.AcknowledgementsMuch appreciation to my close friends who inspired me to do this work.References[1]S. K. Mohamad and Z. Tasir, “Educational Data Mining: A Review,” Procedia - Soc. Behav. Sci., vol. 97, pp. 320–324, 2013.[2]M. Chalaris, S. Gritzalis, M. Maragoudakis, C. Sgouropoulou, and A. Tsolakidis, “Improving Quality of Educational ProcessesProviding New Knowledge Using Data Mining Techniques,” Procedia - Soc. Behav. Sci., vol. 147, pp. 390–397, 2014.[3] B. Brijesh Kumar and P. Saurabh, “Mining Educational Data to Analyze Students‟ Performance,” Int. J. Adv. Comput. Sci.Appl., vol. 2, no. No. 6, pp. 59–63, 2011.[4]W. F. W. Yaacob, S. A. M. Nasir, W. F. W. Yaacob, and N. M. Sobri, “Supervised data mining approach for predicting studentperformance,” Indones. J. Electr. Eng. Comput. Sci., vol. 16, no. 3, pp. 1584–1592, 2019.[5]H. Aldowah, H. Al-Samarraie, and W. M. Fauzy, “Educational data mining and learning analytics for 21st century highereducation: A review and synthesis,” Telemat. Informatics, vol. 37, pp. 13–49, 2019.[6]S. Hussain, N. A. Dahan, F. M. Ba-Alwib, and N. 
Ribata, “Educational data mining and analysis of students’ academicperformance using WEKA,” Indones. J. Electr. Eng. Comput. Sci., vol. 9, no. 2, pp. 447–459, 2018.[7]S. S. M. Ajibade, N. B. Ahmad, and S. M. Shamsuddin, “A data mining approach to predict academic performance of studentsusing ensemble techniques,” in Advances in Intelligent Systems and Computing, 2020, vol. 940, no. March, pp. 749–760.[8] E. A. Amrieh, T. Hamtini, and I. Aljarah, “Mining Educational Data to Predict Student’s academic Performance using EnsembleMethods,” Int. J. Database Theory Appl., vol. 9, no. 8, pp. 119–136, 2016.[9] A. M. Shahiri, W. Husain, and N. A. Rashid, “A Review on Predicting Student’s Performance Using Data Mining Techniques,”in Procedia Computer Science, 2015, vol. 72, pp. 414–422.[10]R. Singh, “An Empirical Study of Applications of Data Mining Techniques for Predicting Student Performance in HigherEducation,” Int. J. Comput. Sci. Mob. Comput., vol. 2, no. February, pp. 53–57, 2013.Predicting Students' Academic Performance in Educational Data Mining Based on Deep Learning Using TensorFlow 31 [11]S. Borkar and K. Rajeswari, “Predicting students academic performance using education data mining,” Int. J. Comput. Sci. Mob.Comput., vol. 2, no. 7, pp. 273–279, 2013.[12]B. Guo, R. Zhang, G. Xu, C. Shi, and L. Yang, “Predicting Students Performance in Educational Data Mining,” in Proceedings -2015 International Symposium on Educational Technology, ISET 2015, 2016, pp. 125–128.[13]D. Kabakchieva, “Predicting student performance by using data mining methods for classification,” Cybern. Inf. Technol., vol.13, no. 1, pp. 61–72, 2013.[14]D. Kabakchieva, K. Stefanova, and V. Kisimov, “Analyzing university data for determining student profiles and predictingperformance,” in EDM 2011 - Proceedings of the 4th International Conference on Educational Data Mining, 2011, pp. 347–348.[15]J. Lara-Rubio, A. F. Villarejo-Ramos, and F. 
Liébana-Cabanillas, “Explanatory and predictive model of the adoption of P2Ppayment systems,” Behav. Inf. Technol., vol. 0, no. 0, pp. 1–14, 2020.[16]E. A. Amrieh, T. Hamtini, and I. Aljarah, “Preprocessing and analyzing educational data set using X-API for improvingstudent’s performance,” in 2015 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies, AEECT 2015, 2015.[17]P. S. Janardhanan, “Project repositories for machine learning with TensorFlow,” Procedia Comput. Sci., vol. 171, pp. 188–196,2020.[18]L. Hao, S. Liang, J. Ye, and Z. Xu, “TensorD: A tensor decomposition library in TensorFlow,” Neurocomputing, vol. 318, pp.196–200, 2018.[19]R. Orus Perez, “Using TensorFlow-based Neural Network to estimate GNSS single frequency ionospheric delay (IONONet),”Adv. Sp. Res., vol. 63, no. 5, pp. 1607–1618, 2019.[20]V.-H. Nhu et al., “Effectiveness assessment of Keras based deep learning with different robust optimization algorithms forshallow landslide susceptibility mapping at tropical area,” CATENA, vol. 188, p. 104458, 2020.[21]K. Akyol, “Comparing of deep neural networks and extreme learning machines based on growing and pruning approach,”Expert Syst. Appl., vol. 140, p. 112875, 2020.Authors’ ProfilesMussa S. Abubakari was born in Kondoa, Tanzania in 1990. He received the B.Sc. degree inTelecommunications Engineering from the University of Dodoma, Tanzania in 2016. Currently he is thepostgraduate candidate taking master degree in Electronics & Informatics Engineering Education atUniversitas Negeri Yogyakarta, Indonesia. His research interests include technology enhanced learning,human computer interaction, technology acceptance, Internet of Things, mobile technologies, intelligentsystems, and signal processing.Dr. Fatchul Arifin was born on 08 Mei 1972. He received a B.Sc. in Electric Engineering at UniversitasDiponegoro and PH.D. degree in Electric Engineering from Institut Teknologi Surabaya, in 1996 and 2014,respectively. 
Currently he is the lecturer at both undergraduate faculty of engineering and postgraduateprogram at Universitas Negeri Yogyakarta. His research interests include but not limited to intelligentcontrol systems, machine learning, expert systems, and neural-fuzzy system.Gilbert G. Hungilo is a master degree graduate from department of Informatics Engineering at theUniversity Atma Jaya Yogyakarta, Indonesia. He received Bachelor of Science in Computer Science fromthe University of Dar es salaam, Tanzania. His research interests include technology adoption, big dataanalytics, and machine learning.How to cite this paper: Mussa S. Abubakaria, Fatchul Arifin, Gilbert G. Hungilo. "Predicting Students' Academic Performance in Educational Data Mining Based on Deep Learning Using TensorFlow ", International Journal of Education and Management Engineering (IJEME), Vol.10, No.6, pp.27-33, 2020. DOI: 10.5815/ijeme.2020.06.0432 Predicting Students' Academic Performance in Educational Data Mining Based on Deep Learning Using TensorFlowAppendix A. Students’ Attributes [16]SN Attribute Description Variable Type1 Gender Gender of Student: Female or Male. Nominal(binary)2 Nationality Student's Origin: Kuwait, Iraq, Libya Lebanon, Egypt, USA,Morocco, Jordan, Iran, Tunis, Syria, Palestine, Saudi Arabia,Venezuela.Nominal(dummy)3 Birth Place Student's Birth Place: Kuwait, Iraq, Libya Lebanon, Egypt,USA, Morocco, Jordan, Iran, Tunis, Syria, Palestine, SaudiArabia, Venezuela.Nominal(dummy)4 Stage ID Student Educational Level: High School, Middle School,Lower level.Nominal(dummy)5 Grade ID Student Grade: G-01 up to G-12. Nominal(dummy)6 Section ID Classroom student belongs: A, B, C. Nominal(dummy)7 Topic Course Studied: Arabic, Biology, Chemistry, English,Geology, French, Spanish, IT, Math, Science, History, Quran.Nominal(dummy)8 Semester School year semester: First, Second. Nominal(binary)9 Relation Responsible Parent: Mom, Father. 
Nominal(binary)10 Raised hand Frequency of raising hand in classroom: 0-100. Numeric11 VisitedresourcesFrequency of visiting course online content: 0-100. Numeric12 AnnouncementsViewFrequency of checking the new online announcement: 0-100. Numeric13 Discussion Frequency of participating in online discussion forums: 0-100. Numeric14 Parent SurveyAnsweringWhether Parents answered or not the survey: Yes, No. Nominal(binary)15 Parent SchoolSatisfactionWhether a parent is satisfied or not: Yes, No. Nominal(binary)16 Student AbsenceDays The number of absence days a student was absent: Above orUnder 7 days.Nominal(binary)17 Class The grade class: High-Level (H): from 90-100; Middle-Level(M): from 70 to 89; Low-Level (L): from 0 to 69.Nominal(dummy)Predicting Students' Academic Performance in Educational Data Mining Based on Deep Learning Using TensorFlow 33 Appendix B. A Piece of Python Code Used to Create and Validate an NN Model。


Global data mining: An empirical study of current trends, future forecasts and technology diffusions

Hsu-Hao Tsai
Department of Management Information System, National Chengchi University, No. 64, Sec. 2, Zhinan Rd., Wenshan District, Taipei City 11605, Taiwan, ROC

Keywords: Data mining; Research trends and forecasts; Technology diffusions; Bibliometric methodology

Abstract

Using a bibliometric approach, this paper analyzes research trends and forecasts of data mining from 1989 to 2009 by locating the heading “data mining” in the topic field of the SSCI database. The bibliometric analytical technique was used to examine the topic in SSCI journals from 1989 to 2009, and 1181 articles on data mining were found. This paper classifies the data mining articles using the following eight categories (publication year, citation, country/territory, document type, institute name, language, source title and subject area) to examine their distributions, to explore how data mining technologies have developed in this period, and to analyze technology tendencies and forecasts of data mining based on these results. The paper also performs the K-S test to check whether the analysis follows Lotka’s law, and it reviews the historical literature to derive the technology diffusions of data mining. The paper provides a roadmap for future research, abstracts technology trends and forecasts, and facilitates knowledge accumulation, so that data mining researchers can save time because core knowledge will be concentrated in core categories. This implies that the phenomenon “success breeds success” is more common in higher quality publications.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Data mining is an interdisciplinary field that combines artificial intelligence, database management, data visualization, machine learning, mathematical algorithms, and statistics. Data mining, also known as knowledge discovery in databases
(KDD)(Chen,Han,&Yu,1996;Fayyad,Piatetsky-Shapiro,&Smyth,1996a ),is a rapidly emerging field.This technology provides different methodologies for decision-making,problem solving,analysis,planning,diagno-sis,detection,integration,prevention,learning,and innovation This technology is motivated by the need of new techniques to help analyze,understand or even visualize the huge amounts of stored data gathered from business and scientific applications.It is the process of discovering interesting knowledge,such as pat-terns,associations,changes,anomalies and significant structures from large amounts of data stored in databases,data warehouses,or other information repositories.It can be used to help companies to make better decisions to stay competitive in the marketplace.The major data mining functions that are developed in commercial and research communities include summarization,association,classification,prediction and clustering.These functions can be implemented using a variety of technologies,such as database-ori-ented techniques,machine learning and statistical techniques (Fayyad,Piatetsky-Shapiro,&Smyth,1996b ).Data mining was defined by Turban,Aronson,Liang,and Sharda (2007,p.305)as a process that uses statistical,mathematical,arti-ficial intelligence and machine-learning techniques to extract and identify useful information and subsequently gain knowledge from large databases.In an effort to develop new insights into practice-performance relationships,data mining was used to investigate improvement programs,strategic priorities,environmental factors,manufacturing performance dimensions and their interactions (Hajirezaie,Husseini,Barfourosh,et al.,2010).Berson,Smith,and Thearling (2000),Lejeune (2001),Ahmed (2004)and Berry and Lin-off (2004)also defined data mining as the process of extracting or detecting hidden patterns or information from large databases.With an enormous amount of customer data,data mining technol-ogy can provide business intelligence to 
generate new opportuni-ties (Bortiz &Kennedy,1995;Fletcher &Goss,1993;Langley &Simon,1995;Lau,Wong,Hui,&Pun,2003;Salchenberger,Cinar,&Lash,1992;Su,Hsu,&Tsai,2002;Tam &Kiang,1992;Zhang,Hu,Patuwo,&Indro,1999).Recently,a number of data mining applications and prototypes have been developed for a variety of domains (Brachman,Khabaza,Kloesgen,Piatetsky-Shapiro,&Simoudis,1996)including market-ing,banking,finance,manufacturing and health care.In addition,data mining has also been applied to other types of data such as time-series,spatial,telecommunications,web,and multimedia data.In general,the data mining process,and the data mining tech-nique and function to be applied depend very much on the appli-cation domain and the nature of the data available.0957-4174/$-see front matter Ó2012Elsevier Ltd.All rights reserved.doi:10.1016/j.eswa.2012.01.150Tel.:+886227929728;fax:+886229393754.E-mail addresses:simontsai@ ,98356512@.twUsing a bibliometric approach,the paper analyzes technology trends and forecasts of data mining from1989to2009by locating heading‘‘data mining’’in topic in the SSCI database.This paper surveys and classifies data mining articles using the following eight categories–publication year,citation,document type,country/ter-ritory,institute name,language,source title and subject area–for different distribution status in order to explore the difference and how technologies and applications of data mining have developed in this period and to analyze technology trends and forecasts of data mining under the above results.Besides,the analysis also re-views the historical literatures to come out technology diffusions of data mining.The analysis provides a roadmap for future research,abstracts technology trends and forecasts,and facilitates knowledge accu-mulation so that data mining researchers can save some time since core knowledge will be concentrated in core categories.This im-plies that the phenomenon‘‘success breeds success’’is more com-mon in higher quality 
publications.2.Material and methodology2.1.Research materialWeingart(2003,2004)pointed at the very influential role of the monopolist citation data producer ISI(Institute for Scientific Information,now Thomson Scientific)as its commercialization of these data(Adam,2002)rapidly increased the non-expert use of bibliometric analysis such as rankings.The materials used in this study were accessed from the database of the Social Sci-ence Citation Index(SSCI),obtained by subscription from the ISI,Web of Science,Philadelphia,PA,USA.In this study,we dis-cuss the papers published in the period from1989to2009be-cause there was no data prior to that year.The Social Sciences Citation Index is a multidisciplinary index to the journal article of the social sciences.It fully indexes over1950journals across 50social sciences disciplines.It also indexes individually selected, relevant items from over3,300of the world’s leading scientific and technical journals.2.2.Research methodologyPritchard(1969,p.349)defined bibliometrics as‘‘the applica-tion of mathematics and statistical methods to books and other media of communication.’’Broadus(1987,p.376)defined biblio-metrics as‘‘the quantitative study of physical published units,or of bibliographic units,or of the surrogates for either.’’Bibliometric techniques have been used primarily by information scientists to study the growth and distribution of the scientific article. 
Researchers may use bibliometric methods of evaluation to determine the influence of a single writer, for example, or to describe the relationship between two or more writers or works. Moreover, when properly designed and constructed (Moed & Van Leeuwen, 1995; Van Raan, 1996; Van Raan, 2000), bibliometrics can be applied as a powerful support tool for peer review. This is certainly also possible for interdisciplinary research fields (Van Raan & Van Leeuwen, 2002). One common way of conducting bibliometric research is to use the Social Science Citation Index (SSCI), the Science Citation Index (SCI) or the Arts and Humanities Citation Index (A&HCI) to trace citations.

Several studies have used bibliometric methodology to analyze trends and forecasts in areas such as e-commerce, supply chain management, data mining, CRM, and energy management (Chen, Chen, & Lee, 2010; Tsai, 2011; Tsai & Chang, 2011; Tsai & Chi, 2011).

2.2.1. Lotka’s law

Lotka’s law describes the frequency of publication by authors in a given field. It states that “the number (of authors) making n contributions is about $1/n^2$ of those making one; and the proportion of all contributors, that make a single contribution, is about 60%” (Lotka, 1926). Lotka’s law is stated by the following formula: $x^n y = c$, where y is the number of authors with x publications, the exponent n has a suggested value of 2, and the constant c has a suggested value of 0.6079. This means that out of all the authors in a given field, about 60% will have just one publication, about 15% will have two publications ($1/2^2$ times 0.60), about 7% of authors will have three publications ($1/3^2$ times 0.60), and so on.
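The percentages quoted above can be reproduced with a short calculation. The sketch below assumes the classical inverse-square form of Lotka's law, with exponent n = 2 and constant c = 0.6079 (the values suggested later in this section); the function name `lotka_share` is hypothetical.

```python
def lotka_share(x, n=2.0, c=0.6079):
    """Expected share of authors with exactly x publications: y = c / x**n."""
    if x < 1:
        raise ValueError("x must be a positive number of publications")
    return c / x ** n


# Expected shares for authors with one, two and three publications.
shares = {x: lotka_share(x) for x in (1, 2, 3)}
for x, share in shares.items():
    print(f"{x} publication(s): {share:.1%} of authors")
```

With these default values the shares come out at roughly 61%, 15% and 7%, which matches the rounded figures in the text.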
Lotka’s law, when applied to large bodies of articles over a fairly long period of time, can be accurate in general, but it is not statistically exact. It is often used to estimate the frequency with which authors will appear in an online catalog (Potter, 1988).

Lotka’s law is generally used for understanding the productivity patterns of authors in a bibliography (Coille, 1977; Gupta, 1987; Nicholls, 1989; Pao, 1985; Rao, 1980; Vlachy, 1978). In this article, Lotka’s law is used to compare the number of publications with the accumulated number of authors between 1989 and 2009, as an inspection of author productivity that informs research tendencies in the near future. To verify the analysis, the paper implements the K-S test to evaluate whether the result matches Lotka’s law.

2.2.2. Research architecture

Using a bibliometric approach, the paper analyzes technology trends and forecasts of data mining from 1989 to 2009 by locating the heading “data mining” in the topic field of the SSCI database. The bibliometric analytical technique was used to examine the topic in SSCI journals from 1989 to 2009, and 1181 articles on data mining were found. This paper surveys and classifies data mining articles using the following eight categories (publication year, citation, document type, country/territory, institute name, language, source title and subject area) to examine their distributions, to explore how technologies and applications of data mining have developed in this period, and to analyze technology trends and forecasts of data mining based on these results. The analysis also reviews the historical literature to derive the technology diffusions of data mining.

As a verification of its analysis, the paper implements the Kolmogorov-Smirnov (K-S) test by the following steps to check whether the analysis follows Lotka’s law:

(1) Collect data

(2) List the author and article distribution table

(3) Calculate the value of n (slope)

According to Lotka’s law, the generalized formula is $x^n y = c$, and the suggested value of n is 2. The exponent n for the applied field is calculated by the least-squares method using the following formula (Pao, 1985):

$$n = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - \left(\sum X\right)^2} \qquad (1)$$

where N is the number of pairs of data, X is the logarithm of publications (x) and Y is the logarithm of authors (y).

The least-squares method is used to estimate the best value for the slope of a regression line, which is the exponent n for Lotka’s law (Pao, 1985). The slope is usually calculated without the data points representing authors of high productivity. Since the value of the slope changes with the number of points used from the same set of data, we have made several computations of n. The median or the mean value of n can also be identified as the best slope for the observed distribution (Pao, 1985). Different values of n produce different values of the constant c.

H.-H. Tsai / Expert Systems with Applications 39 (2012) 8172–8181

(4) Calculate the value of c

According to Lotka’s law, the generalized formula is $x^n y = c$, and the suggested value of c is 0.6079. The parameter c for the applied field is calculated using the following formula (Pao, 1985):

$$c = \frac{1}{\sum_{x=1}^{p-1} \frac{1}{x^n} + \frac{1}{(n-1)\,p^{\,n-1}} + \frac{1}{2p^n} + \frac{n}{24\,(p-1)^{n+1}}} \qquad (2)$$

where p is 20, n is the value obtained in step (3), and x is the number of publications.

(5) Use the Kolmogorov-Smirnov (K-S) test to evaluate whether the analysis matches Lotka’s law

Pao (1985) suggests the K-S test, a goodness-of-fit statistical test, to assert that the observed author productivity distribution is not significantly different from a theoretical distribution. The hypothesis concerns a comparison between observed and expected frequencies. The test allows the determination of the associated probability that the observed maximum deviation occurs within the limits of chance. The maximum deviation between the cumulative proportions of the observed and theoretical frequencies is determined by the following formula (Pao, 1985):

$$D = \max \left| F_o(x) - S_n(x) \right| \qquad (3)$$

where $F_o(x)$ is the theoretical cumulative frequency and $S_n(x)$ is the observed cumulative frequency. The test is performed at the 0.01 level of significance. When the sample size is greater than 35, the critical value of significance is calculated by the following formula (Pao, 1985):

$$\text{critical value at the 0.01 level of significance} = \frac{1.63}{\sqrt{\sum y}} \qquad (4)$$

where $\sum y$ is the total population under study.

If the maximum deviation falls within the critical value, the null hypothesis that the data set conforms to Lotka’s law can be accepted at a certain level of significance. If it exceeds the critical value, the null hypothesis must be rejected at a certain level of significance, and it is concluded that the observed distribution is significantly different from the theoretical distribution.

The analysis provides a roadmap for future research, abstracts technology trend information and facilitates knowledge accumulation, so that data mining researchers can save time because core knowledge will be concentrated in core categories. This implies that the phenomenon “success breeds success” is more common in higher quality publications.

3. Results

3.1. Distribution by publication year

As Fig. 1 shows, article production on data mining has been rising since 1996. The article distribution can be divided into three segments to show the trends of development: from 1989 to 1998, from 1999 to 2003 and from 2004 to 2009. From 1989 to 1998, data mining did not draw many researchers’ attention. After 1998, the publication productivity per annum steadily increased, with fast growth between 1999 and 2003, very sharp growth in 2006, and a peak in 2009.

3.2. Distribution by citation

From Fig. 1, we can see that the citation distribution of data mining is barely discernible between 1989 and 1999, after which it grows dramatically and peaks in 2009. The result indicates that data mining will remain popular in the future.

3.3. Distribution by country/territory

Table 1 shows the US at the top with 551 (46.66%), followed by England with 108 (9.14%). Taiwan ranks
third with 104 (8.81%). Behind them, Australia, Canada, the PRC and Germany are also major academic providers in the field. Table 1 gives the article distribution of the top 25 countries/territories for data mining. The US leads in the field. Taiwan ranks third starting from 2000 and has risen to second by 2008, indicating its potential to increase production in the near future.

Regarding the relationship between article production and citations, only ten articles on data mining come from Finland, yet they have been cited 474 times (Table 1). The other countries largely follow the article production ranking.

3.4. Distribution by institution name

In Table 1, NIOSH, Pennsylvania State University and the University of Wisconsin share first place among author affiliations in data mining research, with 17 records each (1.44%). The locations of these affiliations show that the US is still the most productive country in the world in data mining research.

Regarding the relationship between article production and citations, only nine articles on data mining come from Yale University, but it has the largest number of citations (717) in the domain (Table 1). The other institutions largely follow the article production ranking.

3.5. Distribution by document type

In Table 2, the distribution of document types from 1989 to 2009 indicates that the most popular publication document type is “Article” (936 articles, 79.25%). The result demonstrates that the article is the dominant document type in data mining research.

3.6. Distribution by language

In Table 2, the majority language for data mining is English, with 1149 articles (97.29%). Clearly, English is still the main language in data mining research.

3.7. Distribution by subject area

Table 3 offers critical information on future research tendencies in data mining, allowing researchers a better understanding of the distribution of the top 25 subjects in future
research. The top three subjects for data mining research are information science & library science (260 articles, 22.02%), followed by computer science, information systems (251 articles, 21.25%) and operations research & management science (168 articles, 14.23%). In addition, this paper's analysis suggests that there are other important research disciplines for data mining article production, such as management; computer science, artificial intelligence; economics; computer science, interdisciplinary applications; public, environmental & occupational health; and engineering, electrical & electronic.

As Table 3 illustrates, data mining citations follow the article production ranking in the top 25 subjects, except for statistics & probability (57.48 average citations per article), social sciences, mathematical methods (32.09), economics (12.26), computer science, artificial intelligence (10.79), engineering, electrical & electronic (9.05) and computer science, information systems (7.73).

3.8. Distribution by source title

Table 3 highlights information on trends for data mining, giving researchers a closer view of the distribution of the top 25 sources for future research. The top three research journals for data mining are Expert Systems with Applications (69 articles, 5.84%), followed by Journal of the American Medical Informatics Association (35 articles, 2.96%) and Journal of the Operational Research Society (26 articles, 2.20%). In addition, there are a significant number of other research sources for data mining article production, such as Journal of the American Society for Information Science and Technology, Information Processing & Management, International Journal of Geographical Information Science, Journal of Information Science, Online Information Review, Information & Management, and Decision Support Systems.

In Table 3, data mining citations follow the article production ranking in the top 25 sources, except for Decision Support Systems (20.00 average citations per article), Information & Management (14.75), Journal of the American Society for Information Science (12.45), International Journal of Geographical Information Science (9.70) and Scientometrics (9.00).

Table 1
Distribution of top 25 countries/territories and institutions from 1989 to 2009.

| Rank | Country/territory | NP | % of 1181 | Citations | Institution name | NP | % of 1181 | Citations | Country |
|---|---|---|---|---|---|---|---|---|---|
| 1 | The US | 551 | 46.66 | 4781 | NIOSH | 17 | 1.44 | 76 | The US |
| 2 | England | 108 | 9.14 | 997 | Pennsylvania State University | 17 | 1.44 | 202 | The US |
| 3 | Taiwan | 104 | 8.81 | 436 | University of Wisconsin | 17 | 1.44 | 122 | The US |
| 4 | Canada | 67 | 5.67 | 547 | University of Illinois | 13 | 1.10 | 125 | The US |
| 5 | The PRC | 54 | 4.57 | 187 | Columbia University | 12 | 1.02 | 65 | The US |
| 6 | Australia | 47 | 3.98 | 350 | National Central University | 12 | 1.02 | 41 | Taiwan |
| 7 | Germany | 32 | 2.71 | 177 | University of Pennsylvania | 12 | 1.02 | 76 | The US |
| 8 | South Korea | 32 | 2.71 | 232 | National Chiao Tung University | 11 | 0.93 | 22 | Taiwan |
| 9 | Spain | 27 | 2.29 | 79 | Purdue University | 11 | 0.93 | 91 | The US |
| 10 | Netherlands | 21 | 1.78 | 135 | Monash University | 10 | 0.85 | 85 | Australia |
| 11 | Belgium | 20 | 1.69 | 96 | University of Texas | 10 | 0.85 | 100 | The US |
| 12 | France | 20 | 1.69 | 105 | Duke University | 9 | 0.76 | 60 | The US |
| 13 | Japan | 18 | 1.52 | 49 | Tamkang University | 9 | 0.76 | 87 | Taiwan |
| 14 | Italy | 17 | 1.44 | 78 | University of North Carolina | 9 | 0.76 | 113 | The US |
| 15 | Brazil | 13 | 1.10 | 33 | University of Western Ontario | 9 | 0.76 | 119 | Canada |
| 16 | Scotland | 13 | 1.10 | 45 | Yale University | 9 | 0.76 | 717 | The US |
| 17 | South Africa | 13 | 1.10 | 69 | Virginia Commonwealth University | 9 | 0.76 | 25 | The US |
| 18 | Sweden | 12 | 1.02 | 11 | City University of Hong Kong | 8 | 0.68 | 15 | The PRC |
| 19 | Turkey | 12 | 1.02 | 53 | Harvard University | 8 | 0.68 | 55 | The US |
| 20 | India | 11 | 0.93 | 30 | Nanyang Technological University | 8 | 0.68 | 90 | Singapore |
| 21 | Slovenia | 11 | 0.93 | 4 | National Sun Yat-Sen University | 8 | 0.68 | 29 | Taiwan |
| 22 | Austria | 10 | 0.85 | 30 | ONR | 8 | 0.68 | 90 | The US |
| 23 | Finland | 10 | 0.85 | 474 | Syracuse University | 8 | 0.68 | 58 | The US |
| 24 | Singapore | 10 | 0.85 | 105 | University of Arizona | 8 | 0.68 | 62 | The US |
| 25 | Wales | 10 | 0.85 | 117 | University of Hong Kong | 8 | 0.68 | 26 | The PRC |

NP = number of publications.

Table 2
Distribution of document type and language from 1989 to 2009.

| Document type | NP | % of 1181 | Language | NP | % of 1181 |
|---|---|---|---|---|---|
| Article | 936 | 79.25 | English | 1149 | 97.29 |
| Proceedings paper | 106 | 8.98 | Spanish | 12 | 1.02 |
| Book review | 50 | 4.23 | German | 5 | 0.42 |
| Review | 41 | 3.472 | Slovak | 4 | 0.34 |
| Meeting abstract | 23 | 1.95 | Japanese | 3 | 0.25 |
| Editorial material | 19 | 1.61 | Czech | 2 | 0.17 |
| News item | 2 | 0.17 | French | 2 | 0.17 |
| Correction | 1 | 0.08 | Portuguese | 2 | 0.17 |
| Note | 1 | 0.08 | Russian | 1 | 0.08 |
| Reprint | 1 | 0.08 | Slovene | 1 | 0.08 |
| Software review | 1 | 0.08 | | | |
| Total | 1181 | 100 | Total | 1181 | 100 |

NP = number of publications.

Table 3
Distribution of top 25 subjects and sources from 1989 to 2009.

| Rank | Subject area | NP | % of 1181 | Citations | Source title | NP | % of 1181 | Citations |
|---|---|---|---|---|---|---|---|---|
| 1 | Information Science & Library Science | 260 | 22.02 | 1508 | Expert Systems with Applications | 69 | 5.84 | 447 |
| 2 | Computer Science, Information Systems | 251 | 21.25 | 1941 | Journal of the American Medical Informatics Association | 35 | 2.96 | 147 |
| 3 | Operations Research & Management Science | 168 | 14.23 | 1096 | Journal of the Operational Research Society | 26 | 2.20 | 44 |
| 4 | Management | 149 | 12.62 | 864 | Journal of the American Society for Information Science and Technology | 22 | 1.86 | 164 |
| 5 | Computer Science, Artificial Intelligence | 132 | 11.18 | 1424 | Information Processing & Management | 21 | 1.78 | 142 |
| 6 | Economics | 112 | 9.48 | 1373 | International Journal of Geographical Information Science | 20 | 1.69 | 194 |
| 7 | Computer Science, Interdisciplinary Applications | 103 | 8.72 | 713 | Journal of Information Science | 19 | 1.61 | 114 |
| 8 | Public, Environmental & Occupational Health | 85 | 7.20 | 588 | Online Information Review | 17 | 1.44 | 12 |
| 9 | Engineering, Electrical & Electronic | 82 | 6.94 | 742 | Information & Management | 16 | 1.35 | 236 |
| 10 | Environmental Studies | 68 | 5.76 | 367 | Decision Support Systems | 15 | 1.27 | 46 |
| 11 | Business | 56 | 4.74 | 350 | Resources Policy | 15 | 1.27 | 300 |
| 12 | Geography | 52 | 4.40 | 348 | Computers & Education | 11 | 0.93 | 52 |
| 13 | Medical Informatics | 49 | 4.15 | 239 | Journal of the American Society for Information Science | 11 | 0.93 | 137 |
| 14 | Environmental Sciences | 38 | 3.22 | 378 | International Journal of Forecasting | 10 | 0.85 | 47 |
| 15 | Social Sciences, Mathematical Methods | 35 | 2.96 | 1123 | Journal of Safety Research | 9 | 0.76 | 26 |
| 16 | Ergonomics | 34 | 2.88 | 146 | Safety Science | 9 | 0.76 | 34 |
| 17 | Engineering, Industrial | 33 | 2.79 | 147 | Scientometrics | 9 | 0.76 | 81 |
| 18 | Planning & Development | 31 | 2.62 | 201 | Society & Natural Resources | 8 | 0.68 | 38 |
| 19 | Education & Educational Research | 30 | 2.54 | 97 | Technological Forecasting and Social Change | 8 | 0.68 | 63 |
| 20 | Social Sciences, Interdisciplinary | 30 | 2.54 | 92 | American Journal of Industrial Medicine | 7 | 0.59 | 56 |
| 21 | Sociology | 30 | 2.54 | 197 | Educational Technology & Society | 7 | 0.59 | 15 |
| 22 | Mathematics, Interdisciplinary Applications | 26 | 2.20 | 221 | Electronic Library | 7 | 0.59 | 15 |
| 23 | Geography, Physical | 24 | 2.03 | 212 | Journal of Biomedical Informatics | 7 | 0.59 | 56 |
| 24 | Computer Science, Cybernetics | 23 | 1.95 | 114 | Social Work in Health Care | 7 | 0.59 | 15 |
| 25 | Statistics & Probability | 21 | 1.78 | 1207 | European Journal of Operational Research | 6 | 0.51 | 14 |

NP = number of publications.

4. Discussion

This section carries out the steps described in Section 2.2.2 to verify whether the distribution of author article production follows Lotka's law in data mining research.

4.1. The literature productivity analysis by Lotka's law

(1) Collect data and (2) List author & article distribution table

Author quantity is calculated by the equality method from the 1181 articles retrieved from the SSCI index. Altogether, 2519 authors on data mining are included. See Table 4 for reference.

Table 4
Calculation of author productivity of data mining.

| NP | Author(s) | (NP) × (Author) | Accumulated record | Accumulated record (%) | Accumulated author(s) | Accumulated author(s) (%) |
|---|---|---|---|---|---|---|
| 9 | 1 | 9 | 9 | 0.31 | 1 | 0.04 |
| 8 | 0 | 0 | 9 | 0.31 | 1 | 0.04 |
| 7 | 2 | 14 | 23 | 0.79 | 3 | 0.12 |
| 6 | 3 | 18 | 41 | 1.42 | 6 | 0.24 |
| 5 | 6 | 30 | 71 | 2.45 | 12 | 0.48 |
| 4 | 12 | 48 | 119 | 4.11 | 24 | 0.95 |
| 3 | 37 | 111 | 230 | 7.95 | 61 | 2.42 |
| 2 | 206 | 412 | 642 | 22.18 | 267 | 10.60 |
| 1 | 2252 | 2252 | 2894 | 100.00 | 2519 | 100.00 |

NP = number of publications.

Table 5
Calculation of the exponent n for data mining.

| x (NP) | y (Author) | X = log(x) | Y = log(y) | XY | XX |
|---|---|---|---|---|---|
| 9 | 1 | 0.95 | 0.00 | 0.00 | 0.91 |
| 8 | 0 | 0.90 | 0.00 | 0.00 | 0.82 |
| 7 | 2 | 0.85 | 0.30 | 0.25 | 0.71 |
| 6 | 3 | 0.78 | 0.48 | 0.37 | 0.61 |
| 5 | 6 | 0.70 | 0.78 | 0.54 | 0.49 |
| 4 | 12 | 0.60 | 1.08 | 0.65 | 0.36 |
| 3 | 37 | 0.48 | 1.57 | 0.75 | 0.23 |
| 2 | 206 | 0.30 | 2.31 | 0.70 | 0.09 |
| 1 | 2252 | 0.00 | 3.35 | 0.00 | 0.00 |
| Total | 2519 | 5.56 | 9.87 | 3.26 | 4.22 |

x = number of publications; y = number of authors; X = logarithm of x; Y = logarithm of y.

(3) Calculation of the value of n (slope)

In Table 5, we list the number of authors and the number of
publications by one author for the calculation of the exponent n, with "data mining" as the topic in the SSCI database. The results of the calculations in Table 5 can be substituted into Eq. (1) to calculate the value of n:

$$n = \frac{9(3.26) - (5.56)(9.87)}{9(4.22) - (5.56)^2} \tag{5}$$

Then we can find n = −3.629488955.

(4) Calculation of the value of c

The value of c is calculated by using Eq. (2), where P = 20, x = 1, 2, 3, 4, 5, 6, 7, 8 and n = 3.629488955; then we can find c = 0.892795157. With n = −3.629488955 and c = 0.892795157, the Lotka's law equation of data mining is:

$$f(x) = \frac{0.892795157}{x^{3.629488955}} \tag{6}$$

When the result is compared to Table 4, we can see that authors with only one article account for 89.40% (100% − 10.60% = 89.40%), which almost matches the primitive c value of 89.28% generated by

Table 6
The K-S test for data mining.

| NP | Author(s) | Data mining (observed) | Sn(x) | Data mining (expected) | Fo(x) | D |
|---|---|---|---|---|---|---|
| 1 | 2252 | 0.8940 | 0.8940 | 0.8928 | 0.8928 | 0.0012 |
| 2 | 206 | 0.0818 | 0.9758 | 0.0721 | 0.9649 | 0.0109 |
| 3 | 37 | 0.0147 | 0.9905 | 0.0166 | 0.9815 | 0.0090 |
| 4 | 12 | 0.0048 | 0.9952 | 0.0058 | 0.9873 | 0.0079 |
| 5 | 6 | 0.0024 | 0.9976 | 0.0026 | 0.9899 | 0.0077 |
| 6 | 3 | 0.0012 | 0.9988 | 0.0013 | 0.9913 | 0.0076 |
| 7 | 2 | 0.0008 | 0.9996 | 0.0008 | 0.9920 | 0.0076 |
| 8 | 0 | 0.0000 | 0.9996 | 0.0005 | 0.9925 | 0.0071 |
| 9 | 1 | 0.0004 | 1.0000 | 0.0003 | 0.9928 | 0.0072 |

NP = number of publications; Data mining = author productivity of data mining; Sn(x) = observed cumulative frequency; Fo(x) = theoretical cumulative frequency; D = maximum deviation.

Table 7
The overview of technology innovations in data mining.

| Innovation | Authors |
|---|---|
| Correcting results | Markowitz et al. (1994) |
| Model uncertainty | Chatfield (1995) |
| Discovery | Trybula (1997) |
| Asymptotic complexity | McSherry (1997) |
| Scientific computing | Kral (1997) |
| DNA intragenic mutation | Evans et al. (1997) |
| Introduction | Raghavan et al. (1998) |
| Hospital infection control and public health surveillance | Brossette et al. (1998) |
| User-guided query construction | Chen and Zhu (1998) |
| Transforming corporate information into value | Cheng and Chang (1998) |
| Organizational learning | Dhar (1998) |
| Natural language understanding | Wilcox and Hripcsak (1998) |

Table 8
The overview of organization adoptions in data mining.

| Adoption | Authors |
|---|---|
| Database marketing | Forcht and Cochran (1999) |
| Interface | Lavington et al. (1999) |
| Semantic indexing | Jiang et al. (1999) |
| Cancer information system | Houston et al. (1999) |
| Customer retention and insurance claim patterns | Smith et al. (2000) |
| Data quality | Feelders et al. (2000); Hand (2000) |
| Customer service support | Hui and Jha (2000) |
| Electroencephalography application | Flexer (2000) |
| Prediction of corporate failure | Lin and McClean (2001) |
| Network intrusion detection | Zhu et al. (2001) |
| Knowledge refinement | Park et al. (2001) |
| Software integration | Chua et al. (2002) |
| Credit card portfolio management | Shi et al. (2001) |
| Knowledge warehouse | Nemati et al. (2002) |
| Grid services | Cannataro et al. (2002) |
| Library material acquisition budget allocation | Wu (2003) |
| Selection of insurance sales agents | Cho and Ngai (2003) |
| Prediction of physical performance | Fielitz and Scott (2003) |
| Library decision making | Nicholson (2003) |
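The Lotka's law check of Section 4.1 can be sketched in code. The Python example below is a minimal illustration under stated assumptions, not the author's implementation. It recomputes the slope n by least squares from the author counts of Table 4, using unrounded base-10 logarithms and, as an assumption read off Table 5, setting Y = 0 for the empty x = 8 class. It then takes the reported c = 0.892795157 as given (it is not recomputed here) and evaluates the K-S maximum deviation against the 1.63/√Σy critical value. All function and variable names are illustrative.

```python
import math

# Author productivity from Table 4: publications per author -> number of authors.
AUTHORS_BY_NP = {1: 2252, 2: 206, 3: 37, 4: 12, 5: 6, 6: 3, 7: 2, 8: 0, 9: 1}

# Constant c as reported in the text; taken as given, not recomputed here.
C_REPORTED = 0.892795157


def log10_or_zero(value):
    """Base-10 log that maps 0 to 0.0 (assumption: mirrors the x = 8 row of Table 5)."""
    return math.log10(value) if value > 0 else 0.0


def lotka_exponent(counts):
    """Least-squares slope of log10(authors) against log10(publications)."""
    xs = [math.log10(x) for x in counts]
    ys = [log10_or_zero(y) for y in counts.values()]
    n_points = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return (n_points * sum_xy - sum_x * sum_y) / (n_points * sum_xx - sum_x ** 2)


def ks_max_deviation(counts, c, exponent):
    """Largest gap between observed and Lotka-expected cumulative author shares."""
    total = sum(counts.values())
    observed_cum = 0.0
    expected_cum = 0.0
    d_max = 0.0
    for x in sorted(counts):
        observed_cum += counts[x] / total   # Sn(x)
        expected_cum += c / x ** exponent   # Fo(x), with f(x) = c / x^n
        d_max = max(d_max, abs(observed_cum - expected_cum))
    return d_max


def ks_critical_value(total_authors):
    """Critical value at the 0.01 level for samples larger than 35."""
    return 1.63 / math.sqrt(total_authors)


slope = lotka_exponent(AUTHORS_BY_NP)
d_max = ks_max_deviation(AUTHORS_BY_NP, C_REPORTED, -slope)
critical = ks_critical_value(sum(AUTHORS_BY_NP.values()))

print(f"slope n = {slope:.6f}")
print(f"maximum deviation D = {d_max:.4f}")
print(f"critical value = {critical:.4f}")
```

Run as a script, this gives a slope close to the reported n = −3.629488955 and a maximum deviation of about 0.011 at x = 2, in line with Table 6; the critical value for 2519 authors is roughly 0.032. Note that substituting the two-decimal totals of Table 5 into Eq. (5) by hand gives a slightly different slope, because the reported value relies on unrounded logarithms.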
