Topic Modeling: Research Papers and Abstracts
compiled by Subasish Das

[1] Hua Xu, Fan Zhang, and Wei Wang. Implicit feature identification in Chinese reviews using explicit topic mining model. Knowledge-Based Systems, 76(0):166 - 175, 2015. [ bib | DOI | http ]
The essential work of feature-specific opinion mining is centered on the product features. Previous related research has often taken explicit features into account but ignored implicit features. However, implicit feature identification, which can help us better understand the reviews, is an essential aspect of feature-specific opinion mining. This paper is mainly centered on implicit feature identification in Chinese product reviews. We argue that, based on the explicit synonymous feature groups and the sentences which contain explicit features, several Support Vector Machine (SVM) classifiers can be established to classify the non-explicit sentences. Nevertheless, instead of simply using traditional feature selection methods, we believe an explicit topic model in which each topic is pre-defined could perform better. In this paper, we first extend a popular topic modeling method, Latent Dirichlet Allocation (LDA), to construct an explicit topic model. Then several types of prior knowledge, such as must-links, cannot-links and relevance-based prior knowledge, are extracted and incorporated into the explicit topic model automatically. Experiments show that the explicit topic model, which incorporates pre-existing knowledge, outperforms traditional feature selection methods and other existing methods by a large margin, allowing the identification task to be completed more effectively.
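
To make the SVM step above concrete, here is a minimal sketch in Python (scikit-learn), with invented toy sentences and feature groups rather than the authors' data: one linear SVM is trained on sentences containing explicit features and then applied to sentences without them.

# Minimal sketch (not the authors' code): a linear SVM trained on sentences
# that mention a feature explicitly, then used to tag sentences with no
# explicit feature mention.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: sentences containing explicit features, labeled by feature group.
explicit_sentences = ["the battery lasts two days", "battery drains fast",
                      "the screen is very sharp", "screen too dim outdoors"]
feature_labels = ["battery", "battery", "screen", "screen"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(explicit_sentences, feature_labels)

# Sentences without explicit feature words; predict the implicit feature they discuss.
implicit_sentences = ["it drains before lunch", "too dim to read in sunlight"]
print(clf.predict(implicit_sentences))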

Keywords: Opinion mining
[2] Xiaobing Sun, Bixin Li, Hareton Leung, Bin Li, and Yun Li. MSR4SM: Using topic models to effectively mine software repositories for software maintenance tasks. Information and Software Technology, 66(0):1 - 12, 2015. [ bib | DOI | http ]
Context: Mining software repositories has emerged as a research direction over the past decade, achieving substantial success in both research and practice to support various software maintenance tasks. Software repositories include bug repositories, communication archives, source control repositories, etc. When using these repositories to support software maintenance, inclusion of irrelevant information in each repository can lead to decreased effectiveness or even wrong results. Objective: This article aims at selecting the relevant information from each of the repositories to improve the effectiveness of software maintenance tasks. Method: For a maintenance task at hand, maintainers need to implement the maintenance request on the current system. In this article, we propose an approach, MSR4SM, to extract the relevant information from each software repository based on the maintenance request and the current system. That is, if the information in a software repository is relevant to either the maintenance request or the current system, this information should be included to perform the current maintenance task. {MSR4SM} uses the topic model to extract the topics from these software repositories. Then, relevant information in each software repository is extracted based on the topics. Results: {MSR4SM} is evaluated for two software maintenance tasks, feature location and change impact analysis, which are based on four subject systems, namely jEdit, ArgoUML, Rhino and KOffice. The empirical results show that the effectiveness of traditional software-repository-based maintenance tasks can be greatly improved by MSR4SM. Conclusions: There is a lot of irrelevant information in software repositories. Before we use them to implement a maintenance task at hand, we need to preprocess them. Then, the effectiveness of the software maintenance tasks can be improved.

Keywords: Software maintenance
[3] Jing Guo, Peng Zhang, Jianlong Tan, and Li Guo. Mining hot topics from Twitter streams. Procedia Computer Science, 9(0):2008 - 2011, 2012. Proceedings of the International Conference on Computational Science, {ICCS} 2012. [ bib | DOI | http ]
Mining hot topics from Twitter streams has attracted a lot of attention in recent years. Traditional hot topic mining from Internet Web pages was mainly based on text clustering. However, compared to the texts in Web pages, Twitter texts are relatively short with sparse attributes. Moreover, Twitter data often increase rapidly with fast spreading speed, which poses a great challenge to existing topic mining models. To this end, we propose, in this paper, a flexible stream mining approach for hot Twitter topic detection. Specifically, we propose to use the Frequent Pattern stream mining algorithm (i.e. FP-stream) to detect hot topics from Twitter streams. Empirical studies on real world Twitter data demonstrate the utility of the proposed method.
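
For illustration only, a simplified stand-in for the FP-stream idea sketched above, assuming tweets arrive as term sets; it counts frequent term pairs over a sliding window rather than maintaining a true FP-stream structure.

# Simplified stand-in for FP-stream (illustration only): count frequent
# term pairs over a sliding window of tweets and report the "hot" patterns.
from collections import Counter, deque
from itertools import combinations

WINDOW = 1000      # number of most recent tweets kept (assumed parameter)
MIN_SUPPORT = 3    # minimal count for a pattern to be reported (assumed)

window = deque()
pattern_counts = Counter()

def update(tweet_terms):
    """Add one tweet (a set of terms) and evict the oldest if the window is full."""
    terms = sorted(set(tweet_terms))
    pairs = list(combinations(terms, 2))
    window.append(pairs)
    pattern_counts.update(pairs)
    if len(window) > WINDOW:
        pattern_counts.subtract(window.popleft())

def hot_topics(k=10):
    """Return the k most frequent term pairs above the support threshold."""
    return [(p, c) for p, c in pattern_counts.most_common(k) if c >= MIN_SUPPORT]

for tweet in [["earthquake", "tokyo"], ["earthquake", "tokyo", "alert"], ["tokyo", "earthquake"]]:
    update(tweet)
print(hot_topics())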

Keywords: Data stream mining
[4] Yanghui Rao, Qing Li, Xudong Mao, and Liu Wenyin. Sentiment topic models for social emotion mining. Information Sciences, 266(0):90 - 100, 2014. [ bib | DOI | http ]
The rapid development of social media services has facilitated the communication of opinions through online news, blogs, microblogs/tweets, instant messages, and so forth. This article concentrates on the mining of readers’ emotions evoked by social media materials. Compared to classical sentiment analysis from the writers’ perspective, sentiment analysis of readers is sometimes more meaningful in social media. We propose two sentiment topic models to associate latent topics with the evoked emotions of readers. The first model, an extension of the existing Supervised Topic Model, first generates a set of topics from words and then samples emotions from each topic. The second model generates topics from social emotions directly. Both models can be applied to social emotion classification and to generating social emotion lexicons. Evaluation on social emotion classification verifies the effectiveness of the proposed models. The generated social emotion lexicon samples further show that our models can discover meaningful latent topics exhibiting emotion focus.

Keywords: Social emotion mining
[5] Ben Marwick. Chapter 3 - discovery of emergent issues and controversies in anthropology using text mining, topic modeling, and social network analysis of microblog content. In Yanchang Zhao and Yonghua Cen, editors, Data Mining Applications with R, pages 63 - 93. Academic Press, Boston, 2014. [ bib | DOI | http ]
R is a convenient tool for analyzing text content to discover emergent issues and controversies in diverse corpora. In this case study, I investigate the use of Twitter at a major conference of professional and academic anthropologists. Using R I identify the demographics of the community, the structure of the community of Twitter-using anthropologists, and the topics that dominate the Twitter messages. I describe a series of statistical methods for handling a large corpus of Twitter messages that might otherwise be impractical to analyze. A key finding is that the transformative effect of Twitter in academia is to easily enable the spontaneous formation of information-sharing communities bound by an interest in an event or topic.

Keywords: Twitter
[6] Jin Wang, Xiangping Sun, Mary F.H. She, Abbas Kouzani, and Saeid Nahavandi. Unsupervised mining of long time series based on latent topic model. Neurocomputing, 103(0):93 - 103, 2013. [ bib | DOI | http ]
This paper presents a novel unsupervised method for mining time series based on two generative topic models, i.e., probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA). The proposed method treats each time series as a text document, and extracts a set of local patterns from the sequence as words by sliding a short temporal window along the sequence. Motivated by the success of latent topic models in text document analysis, latent topic models are extended to find the underlying structure of time series in an unsupervised manner. The clusters or categories of unlabeled time series are automatically discovered by the latent topic models using bag-of-patterns representation. The proposed method was experimentally validated using two sets of time series data extracted from a public Electrocardiography (ECG) database through comparison with the baseline k-means and the Normalized Cuts approaches. In addition, the impact of the bag-of-patterns' parameters was investigated. Experimental results demonstrate that the proposed unsupervised method not only outperforms the baseline k-means and the Normalized Cuts in learning semantic categories of the unlabeled time series, but also is relatively stable with respect to the bag-of-patterns' parameters. To the best of our knowledge, this work is the first attempt to explore latent topic models for unsupervised mining of time series data.
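
A rough Python sketch of the bag-of-patterns step described above, with an invented window length, alphabet and toy signal; the resulting pattern counts per series would be the "documents" fed to pLSA or LDA.

# Rough sketch of a bag-of-patterns representation (assumptions: window
# length, alphabet size and breakpoints are illustrative, not the paper's).
import numpy as np
from collections import Counter

def to_word(segment, bins=np.array([-0.43, 0.43])):
    """Z-normalize a window and map each value to a symbol (3-letter alphabet)."""
    seg = (segment - segment.mean()) / (segment.std() + 1e-8)
    symbols = np.digitize(seg, bins)           # 0, 1 or 2 per sample
    return "".join("abc"[s] for s in symbols)

def bag_of_patterns(series, window=8, step=2):
    """Slide a short window along the series and count the resulting 'words'."""
    words = [to_word(series[i:i + window])
             for i in range(0, len(series) - window + 1, step)]
    return Counter(words)

ecg_like = np.sin(np.linspace(0, 20, 200)) + 0.1 * np.random.randn(200)
print(bag_of_patterns(ecg_like).most_common(5))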

Keywords: {ECG} signals
[7] Flora S. Tsai. A tag-topic model for blog mining. Expert Systems with Applications, 38(5):5330 - 5335, 2011. [ bib | DOI | http ]
Blog mining addresses the problem of mining information from blog data. Although mining blogs may share many similarities to Web and text documents, existing techniques need to be reevaluated and adapted for the multidimensional representation of blog data, which exhibit dimensions not present in traditional documents, such as tags. Blog tags are semantic annotations in blogs which can be valuable sources of additional labels for the myriad of blog documents. In this paper, we present a tag-topic model for blog mining, which is based on the Author-Topic model and Latent Dirichlet Allocation. The tag-topic model determines the most likely tags and words for a given topic in a collection of blog posts. The model has been successfully implemented and evaluated on real-world blog data.

Keywords: Blog mining
[8] Özcan Özyurt and Cemal Köse. Chat mining: Automatically determination of chat conversations’ topic in Turkish text based chat mediums. Expert Systems with Applications, 37(12):8705 - 8710, 2010. [ bib | DOI | http ]
Mostly, the conversations taking place in chat mediums bear important information concerning the speakers. This information can vary across many areas, such as the tendencies, habits, attitudes, guilt situations, and intentions of the speakers. Therefore, analysis and processing of these conversations are of much importance. Many social and semantic inferences can be made from these conversations. Determining the subject of a conversation can serve as a basis for characterizing and analyzing conversations. In this study, chat mining is chosen as an application of text mining, and a study concerning determination of the subject in Turkish text based chat conversations is conducted. Supervised learning methods are used to classify the conversations. As classifiers, Naive Bayes, k-Nearest Neighbor and Support Vector Machine are used. A success rate of 91% is achieved in subject determination.
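
A minimal scikit-learn sketch of the supervised setup described above (Naive Bayes, k-NN and SVM over bag-of-words features); the toy conversations and labels are invented.

# Hedged sketch of the described supervised classification of chat topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

conversations = ["let's watch the match tonight", "the exam was really hard",
                 "our team scored twice", "I failed the physics quiz"]
topics = ["sports", "school", "sports", "school"]

for model in (MultinomialNB(), KNeighborsClassifier(n_neighbors=1), LinearSVC()):
    clf = make_pipeline(CountVectorizer(), model)
    clf.fit(conversations, topics)
    print(type(model).__name__, clf.predict(["who won the game"]))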

Keywords: Chat mining
[10] Jianping Zeng, Chengrong Wu, and Wei Wang. Multi-grain hierarchical topic extraction algorithm for text mining. Expert Systems with Applications, 37(4):3202 - 3208, 2010. [ bib | DOI | http ]
Topic extraction from a text corpus is fundamental to many topic analysis tasks, such as topic trend prediction and opinion extraction. Since hierarchical structure is characteristic of topics, it is preferable for a topic extraction algorithm to output topic descriptions with this kind of structure. However, the hierarchical topic structure extracted by most current topic analysis algorithms cannot provide a meaningful description for all subtopics in the hierarchical tree. Here, we propose a new hierarchical topic extraction algorithm based on topic grain computation. By modeling the distribution of word document frequency as a Gaussian mixture, an EM-like algorithm is employed to determine the best number of mixture components and the mean value of each component. Then the topic grain is defined based on the Gaussian mixture parameters, and feature words are selected for the grain. A clustering algorithm is applied to the converted text set based on the feature words. After repeatedly applying the clustering algorithm to different converted text sets, a multi-grain hierarchical topic structure with different subtopic feature word descriptions is extracted. Experiments on two real world datasets collected from a news website show that the proposed algorithm can generate a more meaningful multi-grain topic structure compared with current hierarchical topic clustering algorithms.
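
A small illustrative sketch of the grain-computation step, assuming scikit-learn's GaussianMixture as the EM fitter and synthetic document-frequency data; the component count is chosen by BIC here, which stands in for the paper's selection criterion.

# Illustrative sketch: fit a Gaussian mixture to (log) document frequencies
# of words and read off the component means, which can serve to define topic
# grains; the data and the model-selection step here are toy examples.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical document frequencies: a few very common words, many rare ones.
doc_freq = np.concatenate([rng.normal(6.0, 0.3, 50), rng.normal(2.0, 0.5, 500)])
X = doc_freq.reshape(-1, 1)

# Pick the number of components by BIC, an EM-like model selection step.
best = min((GaussianMixture(n_components=k, random_state=0).fit(X) for k in (1, 2, 3, 4)),
           key=lambda m: m.bic(X))
print("components:", best.n_components, "means:", best.means_.ravel())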

Keywords: Hierarchical topic
[11] Aurora Pons-Porrata, Rafael Berlanga-Llavori, and José Ruiz-Shulcloper. Topic discovery based on text mining techniques. Information Processing & Management, 43(3):752 - 768, 2007. Special Issue on Heterogeneous and Distributed {IR}. [ bib | DOI | http ]
In this paper, we present a topic discovery system aimed to reveal the implicit knowledge present in news streams. This knowledge is expressed as a hierarchy of topic/subtopics, where each topic contains the set of documents that are related to it and a summary extracted from these documents. Summaries so built are useful to browse and select topics of interest from the generated hierarchies. Our proposal consists of a new incremental hierarchical clustering algorithm, which combines both partitional and agglomerative approaches, taking the main benefits from them. Finally, a new summarization method based on Testor Theory has been proposed to build the topic summaries. Experimental results in the {TDT2} collection demonstrate its usefulness and effectiveness not only as a topic detection system, but also as a classification and summarization tool.

Keywords: Hierarchical clustering
[12] Jianping Zeng, Jiangjiao Duan, Wenjun Cao, and Chengrong Wu. Topics modeling based on selective Zipf distribution. Expert Systems with Applications, 39(7):6541 - 6546, 2012. [ bib | DOI | http ]
Automatically mining topics from a text corpus has become an important foundation of many topic analysis tasks, such as opinion recognition, Web content classification, etc. Although a large number of topic models and topic mining methods have been proposed for different purposes and have shown success in dealing with topic analysis tasks, more accurate models and mining algorithms are still desired for many applications. A general criterion based on Zipf fitness quantity computation is proposed to determine whether a topic description is well-formed or not. Based on this quantity definition, the popular Dirichlet prior on multinomial parameters is found to be unable to always produce well-formed topic descriptions. Hence, topic modeling based on {LDA} with selective Zipf documents as the training dataset is proposed to improve the quality of the generated topic descriptions. Experiments on two standard text corpuses, i.e. the {AP} dataset and Reuters-21578, show that the modeling method based on the selective Zipf distribution can achieve better perplexity, which means better ability in predicting topics. A test of topic extraction on a collection of news documents about the recent financial crisis shows that the descriptive key words in the topics are more meaningful and reasonable than those of the traditional topic mining method.
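
A toy illustration of checking how Zipf-like a word frequency distribution is by fitting a line in log-log rank-frequency space; this is only a stand-in for the paper's Zipf fitness quantity.

# Toy illustration: a distribution is roughly Zipf-like when log-frequency
# falls linearly with log-rank (slope near -1) with small residual error.
import numpy as np
from collections import Counter

def zipf_fit(tokens):
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    residual = np.mean((np.log(freqs) - (slope * np.log(ranks) + intercept)) ** 2)
    return slope, residual

text = "the cat sat on the mat the dog sat on the log the cat ran".split()
print(zipf_fit(text))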

Keywords: Selective Zipf distribution
[13] Daifeng Li, Ying Ding, Xin Shuai, Johan Bollen, Jie Tang, Shanshan Chen, Jiayi Zhu, and Guilherme Rocha. Adding community and dynamic to topic models. Journal of Informetrics, 6(2):237 - 253, 2012. [ bib | DOI | http ]
The detection of communities in large social networks is receiving increasing attention in a variety of research areas. Most existing community detection approaches focus on the topology of social connections (e.g., coauthor, citation, and social conversation) without considering their topic and dynamic features. In this paper, we propose two models to detect communities by considering both topic and dynamic features. First, the Community Topic Model (CTM) can identify communities sharing similar topics. Second, the Dynamic {CTM} (DCTM) can capture the dynamic features of communities and topics based on the Bernoulli distribution that leverages the temporal continuity between consecutive timestamps. Both models were tested on two datasets: ArnetMiner and Twitter. Experiments show that communities with similar topics can be detected and the co-evolution of communities and topics can be observed by these two models, which allow us to better understand the dynamic features of social networks and make improved personalized recommendations.

Keywords: Social network
[14] Hsin-Chang Yang, Chung-Hong Lee, and Han-Wei Hsiao. Incorporating self-organizing map with text mining techniques for text hierarchy generation. Applied Soft Computing, 34(0):251 - 259, 2015. [ bib | DOI | http ]
Self-organizing maps (SOM) have been applied to numerous data clustering and visualization tasks and have received much attention for their success. One major shortcoming of the classical {SOM} learning algorithm is the necessity of a predefined map topology. Furthermore, hierarchical relationships among data are also difficult to find. Several approaches have been devised to overcome these deficiencies. In this work, we propose a novel {SOM} learning algorithm which incorporates several text mining techniques in expanding the map both laterally and hierarchically. When training on a set of text documents, the proposed algorithm will first cluster them using the classical {SOM} algorithm. We then identify the topics of each cluster. These topics are then used to evaluate the criteria for expanding the map. The major characteristic of the proposed approach is that it combines the learning process with the text mining process, which makes it suitable for automatic organization of text documents. We applied the algorithm to the Reuters-21578 dataset in text clustering and categorization tasks. Our method outperforms two comparison models in hierarchy quality according to users’ evaluation. It also receives better F1-scores than two other models in the text categorization task.

Keywords: Text mining
[15] Fedja Hadzic, Michael Hecker, and Andrea Tagarelli. Ordered subtree mining via transactional mapping using a structure-preserving tree database schema. Information Sciences, 310(0):97 - 117, 2015. [ bib | DOI | http ]
Frequent subtree mining is a major research topic in knowledge discovery from tree-structured data, whose importance is witnessed by the pervasiveness of such data in several domains. In this paper, we present a novel approach to discover all the frequent ordered subtrees in a tree-structured database. A key aspect is that the structural aspects of the input tree instances are extracted to generate a transactional format that enables the application of standard itemset mining techniques. In this way, the expensive process of subtree enumeration is avoided, while subtrees can be reconstructed in a post-processing stage. As a result, more structurally complex tree data can be handled and much lower support thresholds can be used. In addition to discovering traditional subtrees, this is the first approach to frequent subtree mining that can discover position-constrained subtrees. Each node in the position-constrained subtree is annotated with its exact occurrence and level of embedding in the original database tree. Also, disconnected subtree associations can be represented via virtual connecting nodes. Experiments conducted on synthetic and real-world datasets confirm the expected advantages of our approach over competing methods in terms of efficiency, mining capabilities, and informativeness of the extracted patterns.

Keywords: Frequent subtree mining
[16] Xiong Zhang and Zhi-Hong Deng. Mining summarization of high utility itemsets. Knowledge-Based Systems, 84(0):67 - 77, 2015. [ bib | DOI | http ]
Mining interesting itemsets from transaction databases has attracted a lot of research interest for decades. In recent years, the high utility itemset (HUI) has emerged as a hot topic in this field. In real applications, the bottleneck of {HUI} mining lies not in efficiency but in interpretability, due to the huge number of itemsets generated by the mining process. Because the downward closure property of itemsets no longer holds for HUIs, the compression or summarization methods for frequent itemsets are not available. With this in mind, considering coverage and diversity, we introduce a novel well-founded approach, called SUIT-miner, for succinctly summarizing {HUIs} with a small collection of itemsets. First, we define the condition under which an itemset can cover another itemset. Then, a greedy algorithm is presented to find the fewest itemsets that cover all of the HUIs, in order to ensure diversity. To enhance efficiency, the greedy algorithm employs several pruning strategies. To evaluate the performance of SUIT-miner, we conduct extensive experiments on real datasets. The experimental results show that SUIT-miner is effective and efficient.
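
A sketch of the greedy covering idea in Python; the cover relation used here (plain subset containment) and the toy itemsets are simplifications, not the paper's exact condition or data.

# Sketch of greedy summarization: repeatedly pick the itemset that covers
# the most still-uncovered HUIs until everything is covered.
def greedy_summary(huis):
    """huis: list of high-utility itemsets, each a frozenset of items."""
    uncovered = set(range(len(huis)))
    summary = []
    while uncovered:
        # Choose the candidate covering the largest number of uncovered itemsets.
        best = max(huis, key=lambda cand: sum(1 for i in uncovered if huis[i] <= cand))
        summary.append(best)
        uncovered -= {i for i in uncovered if huis[i] <= best}
    return summary

huis = [frozenset("ab"), frozenset("abc"), frozenset("bc"), frozenset("d")]
print(greedy_summary(huis))   # e.g. [frozenset({'a','b','c'}), frozenset({'d'})]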

Keywords: Data mining
[17] Gerald Petz, Michał Karpowicz, Harald Fürschuß, Andreas Auinger, Václav Stříteský, and Andreas Holzinger. Reprint of: Computational approaches for mining user’s opinions on the web 2.0. Information Processing & Management, 51(4):510 - 519, 2015. [ bib | DOI | http ]
The emerging research area of opinion mining deals with computational methods in order to find, extract and systematically analyze people’s opinions, attitudes and emotions towards certain topics. While providing interesting market research information, the user generated content existing on the Web 2.0 presents numerous challenges regarding systematic analysis, the differences and unique characteristics of the various social media channels being one of them. This article reports on the determination of such particularities, and deduces their impact on text preprocessing and opinion mining algorithms. The effectiveness of different algorithms is evaluated in order to determine their applicability to the various social media channels. Our research shows that text preprocessing algorithms are mandatory for mining opinions on the Web 2.0 and that part of these algorithms are sensitive to errors and mistakes contained in the user generated content.

Keywords: Opinion mining
[25] Xianghua Fu, Kun Yang, Joshua Zhexue Huang, and Laizhong Cui. Dynamic non-parametric joint sentiment topic mixture model. Knowledge-Based Systems, 82(0):102 - 114, 2015. [ bib | DOI | http ]
The reviews in social media are produced continuously by a large and uncontrolled number of users. Capturing the mixture of sentiment and topics simultaneously in reviews is still a challenging task. In this paper, we present a novel probabilistic model framework based on the non-parametric hierarchical Dirichlet process (HDP) topic model, called the non-parametric joint sentiment topic mixture model (NJST), which adds a sentiment level to the {HDP} topic model and detects sentiment and topics simultaneously from reviews. Then, considering the dynamic nature of social media data, we propose dynamic {NJST} (dNJST), which adds time decay dependencies of historical epochs to the current epochs. Compared with the existing sentiment topic mixture models which are based on latent Dirichlet allocation (LDA), the biggest difference of {NJST} and dNJST is that they can determine the topic number automatically. We implement {NJST} and dNJST with online variational inference algorithms, and incorporate the sentiment priors of words into {NJST} and dNJST with the HowNet lexicon. Experimental results on a Chinese social media dataset show that dNJST can effectively detect and track dynamic sentiment and topics.

Keywords: Topic sentiment analysis
[34] Junze Wang, Zheng Yan, Laurence T. Yang, and Benxiong Huang. An approach to rank reviews by fusing and mining opinions based on review pertinence. Information Fusion, 23(0):3 - 15, 2015. [ bib | DOI | http ]
Fusing and mining opinions from reviews posted on websites or social networks has become a popular research topic in recent years as a way to analyze public opinion on a specific topic or product. Existing research has focused on extraction, classification and summarization of opinions from reviews in news websites, forums and blogs. An important issue that has not been well studied is the degree of relevance between a review and its corresponding article. Prior work simply divides reviews into two classes, spam and non-spam, neglecting that non-spam reviews can have different degrees of relevance to the article. In this paper, we propose the notion of “Review Pertinence” to study the degree of this relevance. Unlike usual methods, we measure the pertinence of a review by considering not only the similarity between a review and its corresponding article, but also the correlation among reviews. Experimental results based on real data sets collected from a number of popular portal sites show the clear effectiveness of our method in ranking reviews based on their pertinence, compared with three baseline methods. Thus, our method can be applied to efficiently retrieve reviews for opinion fusion and mining and to filter review spam in practice.

Keywords: Review pertinence
[37] Hang Yang and Simon Fong. Countering the concept-drift problems in big data by an incrementally optimized stream mining model. Journal of Systems and Software, 102(0):158 - 166, 2015. [ bib | DOI | http ]
Mining the potential value hidden behind big data has been a popular research topic around the world. In an infinite big data scenario, the underlying distribution of newly arrived data may appear different from the old one in the real world. This phenomenon, called the concept-drift problem, commonly exists in big data mining scenarios. In the past decade, decision tree inductions have used multi-tree learning to detect the drift using alternative trees as a solution. However, multi-tree algorithms consume more computing resources than a single tree. This paper proposes a single-tree algorithm with an optimized node-splitting mechanism to detect the drift in a test-then-train tree-building process. In the experiments, we compare the performance of the new method to some state-of-the-art single-tree and multi-tree algorithms. Results show that the new algorithm performs with good accuracy while producing a more compact model size and using less memory than the others.

Keywords: Concept drift
[39] Vijay Kotu and Bala Deshpande. Chapter 9 - text mining. In Vijay Kotu and Bala Deshpande, editors, Predictive Analytics and Data Mining, pages 275 - 303. Morgan Kaufmann, Boston, 2015. [ bib | DOI | http ]
This chapter provides a detailed look into the emerging area of text mining and text analytics. It starts with a background of the origins of text mining and provides the motivation for this fascinating topic using the example of IBM's Watson, the Jeopardy!-winning computer program that was built almost entirely using concepts from text and data mining. The chapter introduces some key concepts important in the area of text analytics such as TF-IDF scores. Finally it describes two hands-on case studies in which the reader is shown how to use RapidMiner to address problems like document clustering and automatic gender classification based on text content.
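
A minimal TF-IDF example in the spirit of the chapter's key concept, using scikit-learn's default weighting on an invented three-document corpus.

# Minimal TF-IDF sketch: terms that are frequent in one document but rare
# across the corpus receive the highest scores.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["text mining finds structure in text",
        "data mining finds patterns in data",
        "watson answered jeopardy questions"]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)
# Top-weighted terms of the first document.
for term, score in sorted(zip(vec.get_feature_names_out(), tfidf[0].toarray()[0]),
                          key=lambda t: -t[1])[:5]:
    print(f"{term}: {score:.3f}")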

Keywords: Inverse document frequency
[40] Metin Turan and Coskun Sönmez. Automatize document topic and subtopic detection with support of a corpus. Procedia - Social and Behavioral Sciences, 177(0):169 - 177, 2015. First Global Conference on Contemporary Issues in Education (GLOBE-EDU 2014) 12-14 July 2014, Las Vegas, {USA}. [ bib | DOI | http ]
In this article, we propose a new automatic topic and subtopic detection method for documents called paragraph extension. In paragraph extension, a document is considered as a set of paragraphs, and a paragraph merging technique is used to merge similar consecutive paragraphs until no similar consecutive paragraphs are left. Following this, similar word counts in the merged paragraphs are summed up to construct subtopic scores by using a corpus which is designed so that we can find words related to a subtopic. The paragraph vectors are represented by subtopics instead of words. The subtopic of a paragraph is the most frequent one in the paragraph vector, while the topic of the document is the most dispersive subtopic in the document. An experimental topic/subtopic corpus is constructed for the sport and education topics. We also supplemented the corpus with WordNet to obtain synonymous words. We evaluate the proposed method on a data set containing 40 randomly selected documents from the education and sport topics. The experimental results show that the average topic detection success ratio is about 83% and the subtopic detection success ratio is about 68%.
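
A rough sketch of the paragraph-merging step, assuming cosine similarity over bag-of-words vectors and an invented threshold; the corpus-based subtopic scoring is not shown.

# Merge consecutive paragraphs whose bag-of-words vectors are similar enough.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def merge_paragraphs(paragraphs, threshold=0.3):
    changed = True
    while changed and len(paragraphs) > 1:
        changed = False
        vectors = CountVectorizer().fit_transform(paragraphs)
        for i in range(len(paragraphs) - 1):
            if cosine_similarity(vectors[i], vectors[i + 1])[0, 0] >= threshold:
                paragraphs = (paragraphs[:i] + [paragraphs[i] + " " + paragraphs[i + 1]]
                              + paragraphs[i + 2:])
                changed = True
                break
    return paragraphs

paras = ["football training improves stamina", "stamina matters in football matches",
         "exams assess what students learned"]
print(merge_paragraphs(paras))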

Keywords: topic detection
[46] Sérgio Moro, Paulo Cortez, and Paulo Rita. Business intelligence in banking: A literature analysis from 2002 to 2013 using text mining and latent Dirichlet allocation. Expert Systems with Applications, 42(3):1314 - 1324, 2015. [ bib | DOI | http ]
This paper analyzes recent literature in the search for trends in business intelligence applications for the banking industry. Searches were performed in relevant journals, resulting in 219 articles published between 2002 and 2013. To analyze such a large number of manuscripts, text mining techniques were used in pursuit of relevant terms in both the business intelligence and banking domains. Moreover, latent Dirichlet allocation modeling was used to group articles into several relevant topics. The analysis was conducted using a dictionary of terms belonging to both the banking and business intelligence domains. This procedure allowed for the identification of relationships between terms and the topics grouping articles, enabling hypotheses regarding research directions to emerge. To confirm these hypotheses, relevant articles were collected and scrutinized, allowing the text mining procedure to be validated. The results show that credit in banking is clearly the main application trend, particularly predicting risk and thus supporting credit approval or denial. There is also relevant interest in bankruptcy and fraud prediction. Customer retention seems to be associated, although weakly, with targeting, justifying bank offers to reduce churn. In addition, a large number of articles focused more on business intelligence techniques and their applications, using the banking industry merely for evaluation, and thus not clearly claiming benefits for the banking business. By identifying these current research topics, this study also highlights opportunities for future research.
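
An illustrative scikit-learn sketch of grouping article abstracts into topics with latent Dirichlet allocation; the mini-corpus, topic count and resulting terms are invented.

# Group a handful of toy abstracts into two LDA topics and print top terms.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = ["credit risk scoring for loan approval",
             "predicting bankruptcy and fraud in banks",
             "customer churn and retention campaigns",
             "credit default prediction with neural networks"]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[::-1][:4]]
    print(f"topic {k}:", ", ".join(top))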

Keywords: Banking
[47] Jing Wang, Zhaojun Liu, Wei Li, and Xiongfei Li. Research on a frequent maximal induced subtrees mining method based on the compression tree sequence. Expert Systems with Applications, 42(1):94 - 100, 2015. [ bib | DOI | http ]
Most complex data structures can be represented by a tree or graph structure, but tree structure mining is easier than graph structure mining. With the extensive application of semi-structured data, frequent tree pattern mining has become a hot topic. This paper proposes a compression tree sequence (CTS) to construct a compression tree model and save the information of the original tree in the compression tree. As any subsequence of the {CTS} corresponds to a subtree of the original tree, it is efficient for mining subtrees. Furthermore, this paper proposes a frequent maximal induced subtree mining method based on the compression tree sequence, {CFMIS} (compressed frequent maximal induced subtrees). The algorithm is performed in four stages: first, the original data set is constructed as a compression tree model; then, a cut-edge preprocess is run for the edges whose edge frequency is less than the threshold; next, the tree is compressed after the cut-edge step based on the different frequent edge degrees; and, last, maximal processing is run on the frequent subtree sets so that we can obtain the frequent maximal induced subtree set of the original data set. For each iteration, compression can reduce the size of the data set; thus, the traversal speed is faster than that of other algorithms. Experiments demonstrate that our algorithm can mine more frequent maximal induced subtrees in less time.

Keywords: Data mining
[49] Kun Chen, Gang Kou, Jennifer Shang, and Yang Chen. Visualizing market structure through online product reviews: Integrate topic modeling, TOPSIS, and multi-dimensional scaling approaches. Electronic Commerce Research and Applications, 14(1):58 - 74, 2015. [ bib | DOI | http ]
Studies have shown that perceptual maps derived from online consumer-generated data are effective for depicting market structure such as demonstrating positioning of competitive brands. However, most text mining algorithms would require manual reading to merge extracted product features with synonyms. In response, Topic modeling is introduced to group synonyms together under a topic automatically, leading to convenient and accurate evaluation of brands based on consumers’ online reviews. To ensure the feasibility of employing Topic modeling in assessing competitive brands, we developed a unique and novel framework named {WVAP} (Weights from Valid Posterior Probability) based on Scree plot technique. {WVAP} can filter the noises in posterior distribution obtained from Topic modeling, and improve accuracy in brand evaluation. A case study exploring online reviews of mobile phones is conducted. We extract topics to reflect the features of the cell phones with a qualified validity. In addition to perceptual maps derived by multi-dimensional scaling (MDS) for product positioning, we also rank these products by {TOPSIS} (Technique for Order Performance by Similarity to Ideal Solution) so as to visualize the market structure from different perspectives. Our case study of cell phones shows that the proposed framework is effective in mining online reviews and providing insights into the competitive landscape.
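
A compact TOPSIS sketch for ranking alternatives by closeness to the ideal solution; the brand-by-topic decision matrix and weights are hypothetical, and all criteria are treated as benefit criteria.

# TOPSIS: normalize, weight, measure distances to ideal and anti-ideal
# solutions, and rank by relative closeness (higher is better).
import numpy as np

def topsis(matrix, weights):
    norm = matrix / np.sqrt((matrix ** 2).sum(axis=0))       # vector-normalize columns
    weighted = norm * weights
    ideal, anti = weighted.max(axis=0), weighted.min(axis=0) # benefit criteria assumed
    d_pos = np.sqrt(((weighted - ideal) ** 2).sum(axis=1))
    d_neg = np.sqrt(((weighted - anti) ** 2).sum(axis=1))
    return d_neg / (d_pos + d_neg)

scores = np.array([[0.8, 0.6, 0.7],    # brand A: screen, battery, camera topic scores
                   [0.5, 0.9, 0.6],    # brand B
                   [0.7, 0.7, 0.9]])   # brand C
print(topsis(scores, np.array([0.4, 0.3, 0.3])))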

Keywords: Market structure
[63] Lu Liu, Feida Zhu, Lei Zhang, and Shiqiang Yang. A probabilistic graphical model for topic and preference discovery on social media. Neurocomputing, 95(0):78 - 88, 2012. Learning from Social Media Network. [ bib | DOI | http ]
Many web applications today thrive on offering services for large-scale multimedia data, e.g., Flickr for photos and YouTube for videos. However, these data, while rich in content, are usually sparse in textual descriptive information. For example, a video clip is often associated with only a few tags. Moreover, the textual descriptions are often overly specific to the video content. Such characteristics make it very challenging to discover topics at a satisfactory granularity on this kind of data. In this paper, we propose a generative probabilistic model named the Preference-Topic Model (PTM) to introduce the dimension of user preferences to enhance the insufficient textual information. {PTM} is a unified framework that combines the tasks of user preference discovery and document topic mining. Through modeling user-document interactions, {PTM} can not only discover topics and preferences simultaneously, but also enable them to inform and benefit each other in a unified framework. As a result, {PTM} can extract better topics and preferences from sparse data. The experimental results on real-life video application data show that {PTM} is superior to {LDA} in discovering informative topics and preferences in terms of clustering-based evaluations. Furthermore, the experimental results on {DBLP} data demonstrate that {PTM} is a general model which can be applied to other kinds of user–document interactions.

Keywords: Social media mining
[67] Matias Nicoletti, Silvia Schiaffino, and Daniela Godoy. Mining interests for user profiling in electronic conversations. Expert Systems with Applications, 40(2):638 - 645, 2013. [ bib | DOI | http ]
The increasing amount of Web-based tasks is currently requiring personalization strategies to improve the user experience. However, building user profiles is a hard task, since users do not usually give explicit information about their interests. Therefore, interests must be mined implicitly from electronic sources, such as chat and discussion forums. In this work, we present a novel method for topic detection from online informal conversations. Our approach combines: (i) Wikipedia, an extensive source of knowledge, (ii) a concept association strategy, and (iii) a variety of text-mining techniques, such as {POS} tagging and named entities recognition. We performed a comparative evaluation procedure for searching the optimal combination of techniques, achieving encouraging results.

Keywords: Topic identification
[73] Ivan Vulić, Wim De Smet, Jie Tang, and Marie-Francine Moens. Probabilistic topic modeling in multilingual settings: An overview of its methodology and applications. Information Processing & Management, 51(1):111 - 147, 2015. [ bib | DOI | http ]
Probabilistic topic models are unsupervised generative models which model document content as a two-step generation process, that is, documents are observed as mixtures of latent concepts or topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested into transferring the probabilistic topic modeling concept from monolingual to multilingual settings. Novel topic models have been designed to work with parallel and comparable texts. We define multilingual probabilistic topic modeling (MuPTM) and present the first full overview of the current research, methodology, advantages and limitations in MuPTM. As a representative example, we choose a natural extension of the omnipresent {LDA} model to multilingual settings called bilingual {LDA} (BiLDA). We provide a thorough overview of this representative multilingual model from its high-level modeling assumptions down to its mathematical foundations. We demonstrate how to use the data representation by means of output sets of (i) per-topic word distributions and (ii) per-document topic distributions coming from a multilingual probabilistic topic model in various real-life cross-lingual tasks involving different languages, without any external language pair dependent translation resource: (1) cross-lingual event-centered news clustering, (2) cross-lingual document classification, (3) cross-lingual semantic similarity, and (4) cross-lingual information retrieval. We also briefly review several other applications present in the relevant literature, and introduce and illustrate two related modeling concepts: topic smoothing and topic pruning. In summary, this article encompasses the current research in multilingual probabilistic topic modeling. By presenting a series of potential applications, we reveal the importance of the language-independent and language pair independent data representations by means of MuPTM. We provide clear directions for future research in the field by providing a systematic overview of how to link and transfer aspect knowledge across corpora written in different languages via the shared space of latent cross-lingual topics, that is, how to effectively employ learned per-topic word distributions and per-document topic distributions of any multilingual probabilistic topic model in various cross-lingual applications.

Keywords: Multilingual probabilistic topic models
[74] Gerald Petz, Michał Karpowicz, Harald Fürschuß, Andreas Auinger, Václav Stříteský, and Andreas Holzinger. Computational approaches for mining user’s opinions on the web 2.0. Information Processing & Management, 50(6):899 - 908, 2014. [ bib | DOI | http ]
The emerging research area of opinion mining deals with computational methods in order to find, extract and systematically analyze people’s opinions, attitudes and emotions towards certain topics. While providing interesting market research information, the user generated content existing on the Web 2.0 presents numerous challenges regarding systematic analysis, the differences and unique characteristics of the various social media channels being one of them. This article reports on the determination of such particularities, and deduces their impact on text preprocessing and opinion mining algorithms. The effectiveness of different algorithms is evaluated in order to determine their applicability to the various social media channels. Our research shows that text preprocessing algorithms are mandatory for mining opinions on the Web 2.0 and that part of these algorithms are sensitive to errors and mistakes contained in the user generated content.

Keywords: Opinion mining
[77] Arman Khadjeh Nassirtoussi, Saeed Aghabozorgi, Teh Ying Wah, and David Chek Ling Ngo. Text mining for market prediction: A systematic review. Expert Systems with Applications, 41(16):7653 - 7670, 2014. [ bib | DOI | http ]
The quality of the interpretation of the sentiment in the online buzz in the social media and the online news can determine the predictability of financial markets and cause huge gains or losses. That is why a number of researchers have turned their full attention to the different aspects of this problem lately. However, there is no well-rounded theoretical and technical framework for approaching the problem to the best of our knowledge. We believe the existing lack of such clarity on the topic is due to its interdisciplinary nature that involves at its core both behavioral-economic topics as well as artificial intelligence. We dive deeper into the interdisciplinary nature and contribute to the formation of a clear frame of discussion. We review the related works that are about market prediction based on online-text-mining and produce a picture of the generic components that they all have. We, furthermore, compare each system with the rest and identify their main differentiating factors. Our comparative analysis of the systems expands onto the theoretical and technical foundations behind each. This work should help the research community to structure this emerging field and identify the exact aspects which require further research and are of special significance.

Keywords: Online sentiment analysis
[78] Franck Assous and Joël Chaskalovic. Mathematical and numerical methods for Vlasov–Maxwell equations: The contribution of data mining. Comptes Rendus Mécanique, 342(10–11):560 - 569, 2014. Theoretical and numerical approaches for Vlasov-Maxwell equations. [ bib | DOI | http ]
This paper deals with the applications of data mining techniques in the evaluation of numerical solutions of Vlasov–Maxwell models. This is part of the topic of characterizing the model and approximation errors via learning techniques. We give two examples of application. The first one aims at comparing two Vlasov–Maxwell approximate models. In the second one, a scheme based on data mining techniques is proposed to characterize the errors between a P1 and a P2 finite element Particle-In-Cell approach. Beyond these examples, this original approach should operate in all cases where intricate numerical simulations, like those for the Vlasov–Maxwell equations, take a central part.

Keywords: Data mining
[81] R. Mythily, Aisha Banu, and Shriram Raghunathan. Clustering models for data stream mining. Procedia Computer Science, 46(0):619 - 626, 2015. Proceedings of the International Conference on Information and Communication Technologies, {ICICT} 2014, 3-5 December 2014 at Bolgatty Palace & Island Resort, Kochi, India. [ bib | DOI | http ]
The scope of this research is to aggregate news contents that exist in data streams. A data stream may cover several research issues, and a user may only be interested in a subset of them; there could also be many different research issues from multiple streams which discuss a similar topic from different perspectives. A user may be interested in a topic but not know how to collect all feeds related to this topic. The objective is to cluster all stories in the data streams into a hierarchical structure to better serve readers. The work utilizes segment-wise distributional clustering, which shows its effectiveness on data streams. To better serve news readers, advanced data organization is highly desired. Once they catch a glimpse of a topic, users can browse the returned hierarchy and find other stories/feeds on the internet talking about the same topic. Because stories change dynamically, the segment-wise distributional clustering algorithm needs the capability to process information incrementally.

Keywords: Data streams
[83] Tarique Anwar and Muhammad Abulaish. A social graph based text mining framework for chat log investigation. Digital Investigation, 11(4):349 - 362, 2014. [ bib | DOI | http ]
This paper presents a unified social graph based text mining framework to identify digital evidences from chat logs data. It considers both users' conversation and interaction data in group-chats to discover overlapping users' interests and their social ties. The proposed framework applies n-gram technique in association with a self-customized hyperlink-induced topic search (HITS) algorithm to identify key-terms representing users' interests, key-users, and key-sessions. We propose a social graph generation technique to model users' interactions, where ties (edges) between a pair of users (nodes) are established only if they participate in at least one common group-chat session, and weights are assigned to the ties based on the degree of overlap in users' interests and interactions. Finally, we present three possible cyber-crime investigation scenarios and a user-group identification method for each of them. We present our experimental results on a data set comprising 1100 chat logs of 11,143 chat sessions continued over a period of 29 months from January 2010 to May 2012. Experimental results suggest that the proposed framework is able to identify key-terms, key-users, key-sessions, and user-groups from chat logs data, all of which are crucial for cyber-crime investigation. Though the chat logs are recovered from a single computer, it is very likely that the logs are collected from multiple computers in real scenario. In this case, logs collected from multiple computers can be combined together to generate more enriched social graph. However, our experiments show that the objectives can be achieved even with logs recovered from a single computer by using group-chats data to draw relationships between every pair of users.
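
A small power-iteration HITS sketch over a toy user-interaction matrix; it implements plain HITS, not the authors' self-customized variant, and the adjacency data are invented.

# Plain HITS by power iteration: hub and authority scores over a directed graph.
import numpy as np

def hits(adjacency, iterations=50):
    n = adjacency.shape[0]
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(iterations):
        auths = adjacency.T @ hubs
        auths /= np.linalg.norm(auths)
        hubs = adjacency @ auths
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

# Hypothetical directed interactions between four chat users.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
hubs, auths = hits(A)
print("key users by authority:", np.argsort(-auths))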

Keywords: Text mining
[88] Isidro Peñalver-Martinez, Francisco Garcia-Sanchez, Rafael Valencia-Garcia, Miguel Ángel Rodríguez-García, Valentín Moreno, Anabel Fraga, and Jose Luis Sánchez-Cervantes. Feature-based opinion mining through ontologies. Expert Systems with Applications, 41(13):5995 - 6008, 2014. [ bib | DOI | http ]
The idiosyncrasy of the Web has, in the last few years, been altered by Web 2.0 technologies and applications and the advent of the so-called Social Web. While users were merely information consumers in the traditional Web, they play a much more active role in the Social Web since they are now also data providers. The mass involved in the process of creating Web content has led many public and private organizations to focus their attention on analyzing this content in order to ascertain the general public’s opinions as regards a number of topics. Given the current Web size and growth rate, automated techniques are essential if practical and scalable solutions are to be obtained. Opinion mining is a highly active research field that comprises natural language processing, computational linguistics and text analysis techniques with the aim of extracting various kinds of added-value and informational elements from users’ opinions. However, current opinion mining approaches are hampered by a number of drawbacks such as the absence of semantic relations between concepts in feature search processes or the lack of advanced mathematical methods in sentiment analysis processes. In this paper we propose an innovative opinion mining methodology that takes advantage of new Semantic Web-guided solutions to enhance the results obtained with traditional natural language processing techniques and sentiment analysis processes. The main goals of the proposed methodology are: (1) to improve feature-based opinion mining by using ontologies at the feature selection stage, and (2) to provide a new vector analysis-based method for sentiment analysis. The methodology has been implemented and thoroughly tested in a real-world movie review-themed scenario, yielding very promising results when compared with other conventional approaches.

Keywords: Opinion mining
[95] Chung-Hong Lee. Mining spatio-temporal information on microblogging streams using a density-based online clustering method. Expert Systems with Applications, 39(10):9623 - 9641, 2012. [ bib | DOI | http ]
Social networks have been regarded as a timely and cost-effective source of spatio-temporal information for many fields of application. However, while some research groups have successfully developed topic detection methods from the text streams for a while, and even some popular microblogging services such as Twitter did provide information of top trending topics for selection, it is still unable to fully support users for picking up all of the real-time event topics with a comprehensive spatio-temporal viewpoint to satisfy their information needs. This paper aims to investigate how microblogging social networks (i.e. Twitter) can be used as a reliable information source of emerging events by extracting their spatio-temporal features from the messages to enhance event awareness. In this work, we applied a density-based online clustering method for mining microblogging text streams, in order to obtain temporal and geospatial features of real-world events. By analyzing the events detected by our system, the temporal and spatial impacts of the emerging events can be estimated, for achieving the goals of situational awareness and risk management.
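
A sketch of density-based clustering of geotagged messages using scikit-learn's batch DBSCAN as a stand-in for the paper's online variant; the coordinates and eps value are toy choices.

# Cluster hypothetical (latitude, longitude) points of event-related tweets;
# -1 in the output marks noise, other labels are detected spatial clusters.
import numpy as np
from sklearn.cluster import DBSCAN

coords = np.array([[25.03, 121.56], [25.04, 121.55], [25.03, 121.57],   # dense area
                   [24.15, 120.67],                                      # isolated point
                   [22.63, 120.30], [22.64, 120.31]])                    # second cluster
labels = DBSCAN(eps=0.05, min_samples=2).fit_predict(coords)
print(labels)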

Keywords: Topic detection
[97] Peng Tang and Tommy W.S. Chow. Mining language variation using word using and collocation characteristics. Expert Systems with Applications, 41(17):7805 - 7819, 2014. [ bib | DOI | http ]
Two textual metrics, “Frequency Rank” (FR) and “Intimacy”, are proposed in this paper to measure word usage and collocation characteristics, which are two important aspects of text style. The FR, derived from the local index numbers of terms in a sentence ordered by the global frequency of terms, provides single-term-level information. The Intimacy metric models the relationship between a word and others, i.e. the closeness of a term to other terms in the same sentence. Two textual features, “Frequency Rank Ratio (FRR)” and “Overall Intimacy (OI)”, for capturing language variation are derived by employing the two proposed textual metrics. Using the derived features, language variation among documents can be visualized in a text space. Three corpora consisting of documents of diverse topics, genres, regions, and dates of writing are designed and collected to evaluate the proposed algorithms. Extensive simulations are conducted to verify the feasibility and performance of our implementation. Both theoretical analyses based on entropy and the simulations demonstrate the feasibility of our method. We also show the proposed algorithm can be used for visualizing the closeness of several western languages. Variation of modern English over time is also recognizable when using our analysis method. Finally, our method is compared to conventional text classification implementations. The comparative results indicate our method outperforms the others.

Keywords: Language variation
[98] Zhi-Hong Deng. Fast mining top-rank-k frequent patterns by using node-lists. Expert Systems with Applications, 41(4, Part 2):1763 - 1768, 2014. [ bib | DOI | http ]
Mining Top-Rank-k frequent patterns has been an emerging topic in frequent pattern mining in recent years. In this paper, we propose a new mining algorithm, NTK, to mine Top-Rank-k frequent patterns. The {NTK} algorithm employs a data structure, the Node-list, to represent patterns. The Node-list structure makes the mining process much more efficient. We have experimentally evaluated our algorithm against two representative algorithms on four real datasets. The experimental results show that the {NTK} algorithm is efficient and is at least two orders of magnitude faster than the {FAE} algorithm and also remarkably faster than the {VTK} algorithm, the recently reported state-of-the-art algorithm for mining Top-Rank-k frequent patterns.

Keywords: Data mining
[101] Do-Heon Jeong and Min Song. Time gap analysis by the topic model-based temporal technique. Journal of Informetrics, 8(3):776 - 790, 2014. [ bib | DOI | http ]
This study proposes a temporal analysis method to utilize heterogeneous resources such as papers, patents, and web news articles in an integrated manner. We analyzed the time gap phenomena between three resources and two academic areas by conducting text mining-based content analysis. To this end, a topic modeling technique, Latent Dirichlet Allocation (LDA), was used to estimate the optimal time gaps among the three resources (papers, patents, and web news articles) in two research domains. The contributions of this study are summarized as follows: firstly, we propose a new temporal analysis method to understand the content characteristics and trends of heterogeneous multiple resources in an integrated manner. We applied it to measure the exact time intervals between academic areas by examining the time gap phenomena. The results of the temporal analysis showed that the resources of the medical field were more up to date than those of the computer field and were thus disclosed to the public more promptly. Secondly, we adopted a power-law exponent measurement and content analysis to evaluate the proposed method. With the proposed method, we demonstrate how to analyze heterogeneous resources more precisely and comprehensively.

Keywords: Text mining
[108] Xiaolin Zheng, Zhen Lin, Xiaowei Wang, Kwei-Jay Lin, and Meina Song. Incorporating appraisal expression patterns into topic modeling for aspect and sentiment word identification. Knowledge-Based Systems, 61(0):29 - 47, 2014. [ bib | DOI | http ]
With the considerable growth of user-generated content, online reviews are becoming extremely valuable sources for mining customers’ opinions on products and services. However, most of the traditional opinion mining methods are coarse-grained and cannot understand natural languages. Thus, aspect-based opinion mining and summarization are of great interest in academic and industrial research. In this paper, we study an approach to extract product and service aspect words, as well as sentiment words, automatically from reviews. An unsupervised dependency analysis-based approach is presented to extract Appraisal Expression Patterns (AEPs) from reviews, which represent the manner in which people express opinions regarding products or services and can be regarded as a condensed representation of the syntactic relationship between aspect and sentiment words. {AEPs} are high-level, domain-independent types of information, and have excellent domain adaptability. An AEP-based Latent Dirichlet Allocation (AEP-LDA) model is also proposed. This is a sentence-level, probabilistic generative model which assumes that all words in a sentence are drawn from one topic – a generally true assumption, based on our observation. The model also assumes that every review corpus is composed of several mutually corresponding aspect and sentiment topics, as well as a background word topic. The {AEP} information is incorporated into the AEP-LDA model for mining aspect and sentiment words simultaneously. The experimental results on reviews of restaurants, hotels, {MP3} players, and cameras show that the AEP-LDA model outperforms other approaches in identifying aspect and sentiment words.

Keywords: Opinion mining
[119] Oğuz Mustapaşa, Adem Karahoca, Dilek Karahoca, and Hüseyin Uzunboylu. “hello world”, web mining for e-learning. Procedia Computer Science, 3(0):1381 - 1387, 2011. World Conference on Information Technology. [ bib | DOI | http ]
As the internet and mobile applications take on an increasingly important role in our lives, mobile services have also found a place in the educational field, since the internet is widespread; this is usually called “e-learning” or “distance learning”. A known issue with e-learning is that all of the content is online and there is less face-to-face communication than in traditional learning; this brings the problem of tracking students’ success, and advising and managing students’ way of studying. Hence, a recent hot topic, data mining, can be applied to students’ data left on e-learning portals to guide instructors and advisors in helping students become more successful. Recent research on this topic has shown that e-learning combined with data mining, generally referred to as semantic web mining, can decrease the gap between e-learning and traditional learning.

Keywords: Semantic web
[120] Zhenxing Xu, Ling Chen, and Gencai Chen. Topic based context-aware travel recommendation method exploiting geotagged photos. Neurocomputing, 155(0):99 - 107, 2015. [ bib | DOI | http ]
The popularity of camera phones and photo sharing websites, e.g. Flickr and Panoramio, has led to huge volumes of community-contributed geotagged photos (CCGPs) available on the Internet, which could be regarded as digital footprints of photo takers. In this paper, we propose a method to recommend travel locations in a city for a user, based on topic distribution of his travel histories in other cities and the given context (i.e., season and weather). A topic model is used to mine the interest distribution of users, which is then exploited to build the user–user similarity model and make travel recommendations. The season and weather context information is considered during the mining and the recommendation processes. Our method is evaluated on a Flickr dataset, which contains photos taken in 11 cities of China. Experimental results show the effectiveness of the proposed method in terms of the precision of travel behavior prediction.
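As a rough illustration of the topic-based user–user similarity idea above (not the authors' exact model), the following Python sketch ranks candidate locations for a target user by weighting other users' visit counts with the cosine similarity of their topic distributions; all user names, topic vectors and locations are hypothetical.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical topic distributions inferred from each user's travel histories in
# other cities (rows sum to 1), and visit counts of other users in the target city.
user_topics = {
    "u1": np.array([0.7, 0.2, 0.1]),
    "u2": np.array([0.6, 0.3, 0.1]),
    "u3": np.array([0.1, 0.1, 0.8]),
}
visits = {"u2": {"West Lake": 4, "Old Town": 1}, "u3": {"Night Market": 5}}

def recommend(target, k=2):
    scores = {}
    for other, locations in visits.items():
        sim = cosine(user_topics[target], user_topics[other])
        for loc, cnt in locations.items():
            scores[loc] = scores.get(loc, 0.0) + sim * cnt
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("u1"))   # locations favoured by users with similar topic profiles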

Keywords: Spatiotemporal data mining
[129] Duen-Ren Liu and Chin-Hui Lai. Mining group-based knowledge flows for sharing task knowledge. Decision Support Systems, 50(2):370 - 386, 2011. [ bib | DOI | http ]
In an organization, knowledge is the most important resource in the creation of core competitive advantages. It is circulated and accumulated by knowledge flows (KFs) in the organization to support workers' task needs. Because workers accumulate knowledge of different domains, they may cooperate and participate in several task-based groups to satisfy their needs. In this paper, we propose algorithms that integrate information retrieval and data mining techniques to mine and construct group-based {KFs} (GKFs) for task-based groups. A {GKF} is expressed as a directed knowledge graph which represents the knowledge referencing behavior, or knowledge flow, of a group of workers with similar task needs. Task-related knowledge topics and their relationships (flows) can be identified from the knowledge graph so as to fulfill workers' task needs and promote knowledge sharing for collaboration of group members. Moreover, the frequent knowledge referencing path can be identified from the knowledge graph to indicate the frequent knowledge flow of the workers. To demonstrate the efficacy of the proposed methods, we implement a prototype of the {GKF} mining system. Our {GKF} mining methods can enhance organizational learning and facilitate knowledge management, sharing, and reuse in an environment where collaboration and teamwork are essential.

Keywords: Knowledge flow
[149] Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein. Studying software evolution using topic models. Science of Computer Programming, 80, Part B(0):457 - 479, 2014. [ bib | DOI | http ]
Topic models are generative probabilistic models which have been applied to information retrieval to automatically organize and provide structure to a text corpus. Topic models discover topics in the corpus, which represent real world concepts by frequently co-occurring words. Recently, researchers found topics to be effective tools for structuring various software artifacts, such as source code, requirements documents, and bug reports. This research also hypothesized that using topics to describe the evolution of software repositories could be useful for maintenance and understanding tasks. However, research has yet to determine whether these automatically discovered topic evolutions describe the evolution of source code in a way that is relevant or meaningful to project stakeholders, and thus it is not clear whether topic models are a suitable tool for this task. In this paper, we take a first step towards evaluating topic models in the analysis of software evolution by performing a detailed manual analysis on the source code histories of two well-known and well-documented systems, {JHotDraw} and jEdit. We define and compute various metrics on the discovered topic evolutions and manually investigate how and why the metrics evolve over time. We find that the large majority (87%–89%) of topic evolutions correspond well with actual code change activities by developers. We are thus encouraged to use topic models as tools for studying the evolution of a software system.

Keywords: Software evolution
[150] Di Jiang, Jan Vosecky, Kenneth Wai-Ting Leung, Lingxiao Yang, and Wilfred Ng. Sg-wstd: A framework for scalable geographic web search topic discovery. Knowledge-Based Systems, 84(0):18 - 33, 2015. [ bib | DOI | http ]
Search engine query logs are recognized as an important information source that contains millions of users’ web search needs. Discovering Geographic Web Search Topics (G-WSTs) from a query log can support a variety of downstream web applications such as finding commonality between locations and profiling search engine users. However, the task of discovering G-WSTs is nontrivial, not only because of the diversity of the information in web search but also due to the sheer size of query logs. In this paper, we propose a new framework, Scalable Geographic Web Search Topic Discovery (SG-WSTD), which contains highly scalable functionalities such as search session derivation, geographic information extraction and geographic web search topic discovery to discover G-WSTs from query logs. Within SG-WSTD, two probabilistic topic models are proposed to discover G-WSTs from two complementary perspectives. The first one is the Discrete Search Topic Model (DSTM), which discovers G-WSTs that capture the commonalities between discrete locations. The second is the Regional Search Topic Model (RSTM), which focuses on a specific geographic region on the map and discovers G-WSTs that demonstrate geographic locality. Since query logs are typically voluminous, we implement the functionalities in SG-WSTD based on the MapReduce paradigm to solve the efficiency bottleneck. We evaluate SG-WSTD against several strong baselines on a real-life query log from AOL. The proposed framework demonstrates significantly improved data interpretability, better prediction performance, higher topic distinctiveness and superior scalability in the experimentation.

Keywords: Topic model
[152] He Feng and Xueming Qian. Mining user-contributed photos for personalized product recommendation. Neurocomputing, 129(0):409 - 420, 2014. [ bib | DOI | http ]
With the advent and popularity of social media, users are willing to share their experiences by photos, reviews, blogs, and so on. The social media contents shared by these users reveal potential shopping needs. Product recommendation is not limited to e-commerce sites; it can also be extended to social media sites. In this paper, we propose a novel hierarchical user interest mining (Huim) approach for personalized product recommendation. The input of our approach consists of user-contributed photos and user generated content (UGC), which include user-annotated photo tags and the comments from others in a social site. The proposed approach consists of four steps. First, we make full use of the visual information and {UGC} of the user's photos to mine the user's interests. Second, we represent user interest by a topic distribution vector, and apply our proposed Huim to enhance interest-related topics. Third, we also represent each product by a topic distribution vector. Then, we measure the relevance of user and product in the topic space and determine the rank of each product for the user. We conduct a series of experiments on Flickr users and the products from Bing Shopping. Experimental results show the effectiveness of the proposed approach.

Keywords: Products recommender
[158] Ali Khwaldeh, Amani Tahat, Jordi Marti, and Mofleh Tahat. Atomic data mining numerical methods, source code {SQlite} with python. Procedia - Social and Behavioral Sciences, 73(0):232 - 239, 2013. Proceedings of the 2nd International Conference on Integrated Information (IC-ININFO 2012), Budapest, Hungary, August 30 – September 3, 2012. [ bib | DOI | http ]
This paper introduces a recently published Python data mining book (chapters, topics, samples of Python source code written by its authors) to be used in data mining via the world wide web and any specific database in several disciplines (economics, physics, education, marketing, etc.). The book starts with an introduction to data mining, explaining some of the data mining tasks involved, such as classification, dependence modelling, clustering and the discovery of association rules. The book notes that using Python in data mining has been gaining interest from the data mining community because it is an open-source, general-purpose programming and web scripting language; furthermore, it is cross-platform and can run on a wide variety of operating systems such as Linux, Windows, FreeBSD, Macintosh, Solaris, OS/2, Amiga, AROS, AS/400, BeOS, OS/390, z/OS, Palm OS, QNX, VMS, Psion, Acorn {RISC} OS, VxWorks, PlayStation, Sharp Zaurus, Windows {CE} and even PocketPC. Finally, this book can be considered a teaching textbook for data mining, in which several methods such as machine learning and statistics are used to extract high-level knowledge from real-world datasets.

Keywords: Python
[161] Meng-Sung Wu. Modeling query-document dependencies with topic language models for information retrieval. Information Sciences, 312(0):1 - 12, 2015. [ bib | DOI | http ]
This paper addresses deficiencies in current information retrieval models by integrating the concept of relevance into the generation model using various topical aspects of the query. The models are adapted from the latent Dirichlet allocation model, but differ in the way that the notion of query-document relevance is introduced in the modeling framework. In the first method, query terms are added to relevant documents in the training of the latent Dirichlet allocation model. In the second method, the latent Dirichlet allocation model is expanded to deal with relevant query terms. The topic of each term within a given document may be sampled using either the normal document-specific mixture weights in {LDA} or query-specific mixture weights. We also developed an efficient method based on the Gibbs sampling technique for parameter estimation. Experimental results based on the Text {REtrieval} Conference Corpus (TREC) demonstrate the superiority of the proposed models.
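The core scoring idea of such topic language models can be written as P(q|d) = prod_w sum_z P(w|z) P(z|d). A minimal sketch of that computation, assuming already-estimated LDA parameters theta (document-topic) and phi (topic-word) with invented values, is shown below; the paper's specific relevance-based estimators are not reproduced.

import numpy as np

# Hypothetical LDA parameters: theta[d] = P(topic|doc), phi[z] = P(word|topic).
vocab = {"retrieval": 0, "topic": 1, "model": 2, "query": 3}
theta = np.array([[0.8, 0.2],              # doc 0
                  [0.3, 0.7]])             # doc 1
phi = np.array([[0.40, 0.35, 0.20, 0.05],  # topic 0
                [0.05, 0.25, 0.30, 0.40]]) # topic 1

def log_query_likelihood(doc, query):
    """log P(query|doc) = sum over query words of log sum_z P(w|z) P(z|doc)."""
    score = 0.0
    for w in query:
        p_w = float(theta[doc] @ phi[:, vocab[w]])
        score += np.log(p_w + 1e-12)
    return score

query = ["topic", "retrieval"]
ranking = sorted(range(len(theta)), key=lambda d: log_query_likelihood(d, query), reverse=True)
print(ranking)   # documents ordered by topic-smoothed query likelihood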

Keywords: Topic model
[162] You Chen, Joydeep Ghosh, Cosmin Adrian Bejan, Carl A. Gunter, Siddharth Gupta, Abel Kho, David Liebovitz, Jimeng Sun, Joshua Denny, and Bradley Malin. Building bridges across electronic health record systems through inferred phenotypic topics. Journal of Biomedical Informatics, 55(0):82 - 93, 2015. [ bib | DOI | http ]
Data in electronic health records (EHRs) is being increasingly leveraged for secondary uses, ranging from biomedical association studies to comparative effectiveness. To perform studies at scale and transfer knowledge from one institution to another in a meaningful way, we need to harmonize the phenotypes in such systems. Traditionally, this has been accomplished through expert specification of phenotypes via standardized terminologies, such as billing codes. However, this approach may be biased by the experience and expectations of the experts, as well as the vocabulary used to describe such patients. The goal of this work is to develop a data-driven strategy to (1) infer phenotypic topics within patient populations and (2) assess the degree to which such topics facilitate a mapping across populations in disparate healthcare systems. Methods We adapt a generative topic modeling strategy, based on latent Dirichlet allocation, to infer phenotypic topics. We utilize a variance analysis to assess the projection of a patient population from one healthcare system onto the topics learned from another system. The consistency of learned phenotypic topics was evaluated using (1) the similarity of topics, (2) the stability of a patient population across topics, and (3) the transferability of a topic across sites. We evaluated our approaches using four months of inpatient data from two geographically distinct healthcare systems: (1) Northwestern Memorial Hospital (NMH) and (2) Vanderbilt University Medical Center (VUMC). Results The method learned 25 phenotypic topics from each healthcare system. The average cosine similarity between matched topics across the two sites was 0.39, a remarkably high value given the very high dimensionality of the feature space. The average stability of {VUMC} and {NMH} patients across the topics of the two sites was 0.988 and 0.812, respectively, as measured by the Pearson correlation coefficient. Also, the {VUMC} and {NMH} topics showed smaller variance in characterizing the patient populations of the two sites than standard clinical terminologies (e.g., ICD9), suggesting they may be more reliably transferred across hospital systems. Conclusions Phenotypic topics learned from {EHR} data can be more stable and transferable than billing codes for characterizing the general status of a patient population. This suggests that EHR-based research may be able to leverage such phenotypic topics as variables when pooling patient populations in predictive models.
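The matched-topic similarity reported above can be illustrated with a small sketch that computes cosine similarities between topic-word vectors learned at two sites and greedily pairs each topic with its most similar unused counterpart; the topic matrices are randomly generated stand-ins, and the paper's actual matching and variance analysis are not reproduced.

import numpy as np

# Hypothetical topic-word distributions learned independently at two sites
# (rows are topics over a shared vocabulary of five clinical features).
site_a = np.random.default_rng(0).dirichlet(np.ones(5), size=3)
site_b = np.random.default_rng(1).dirichlet(np.ones(5), size=3)

def cosine_matrix(A, B):
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

sims = cosine_matrix(site_a, site_b)

# Greedily pair each topic from site A with its most similar unused topic from site B.
matched, available = [], set(range(site_b.shape[0]))
for i in range(site_a.shape[0]):
    j = max(available, key=lambda c: sims[i, c])
    available.remove(j)
    matched.append(float(sims[i, j]))
print(round(sum(matched) / len(matched), 3))   # average similarity of matched topic pairs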

Keywords: Clinical phenotype modeling
[163] Ming jian Zhou and Xue jiao Chen. An outlier mining algorithm based on dissimilarity. Procedia Environmental Sciences, 12, Part B(0):810 - 814, 2012. 2011 International Conference of Environmental Science and Engineering. [ bib | DOI | http ]
Outlier mining is a hot topic in data mining. After studying the commonly used outlier mining methods, this paper presents an outlier mining algorithm, OMABD (Outlier Mining Algorithm Based on Dissimilarity). The algorithm first constructs a dissimilarity matrix from the pairwise dissimilarities of the objects in the data set, then computes the dissimilarity degree of each object from the dissimilarity matrix, and finally detects outliers by comparing the dissimilarity degree with a dissimilarity threshold. The experimental results show that this algorithm can detect outliers efficiently.
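A minimal Python sketch of the dissimilarity-based idea, assuming Euclidean distance as the dissimilarity measure and the mean row of the dissimilarity matrix as each object's dissimilarity degree; the data and the threshold are illustrative rather than the exact OMABD definitions.

import numpy as np

def dissimilarity_outliers(X, threshold):
    """Flag objects whose average dissimilarity to all other objects exceeds a threshold."""
    # Pairwise Euclidean dissimilarity matrix.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    # Dissimilarity degree of each object: mean dissimilarity to the other objects.
    degree = D.sum(axis=1) / (len(X) - 1)
    return np.where(degree > threshold)[0], degree

X = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [8.0, 8.5]])  # last row is an outlier
idx, degree = dissimilarity_outliers(X, threshold=5.0)
print(idx, degree.round(2))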

Keywords: Outlier
[164] Aleksey Panasyuk, Edmund Szu-Li Yu, and Kishan G. Mehrotra. Controversial topic discovery on members of congress with twitter. Procedia Computer Science, 36(0):160 - 167, 2014. Complex Adaptive Systems Philadelphia, {PA} November 3-5, 2014. [ bib | DOI | http ]
This paper addresses how Twitter can be used for identifying conflict between communities of users. We aggregate documents by topic and by community and perform sentiment analysis, which allows us to analyze the overall opinion of each community about each topic. We rank the topics with opposing views (negative for one community and positive for the other). To illustrate the proposed methodology, we chose a problem whose results can be evaluated using news articles. We look at tweets for Republican and Democratic congress members for the 112th House of Representatives from September to December 2013 and demonstrate that our approach is successful by comparing against articles in the news media.
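The aggregation-and-ranking step can be sketched as follows, assuming document-level sentiment scores have already been computed by some upstream classifier; the communities, topics and scores are invented, and the ranking criterion (the product of opposite-signed community averages) is a simplification of the authors' procedure.

from collections import defaultdict

# Hypothetical (community, topic, sentiment) records, where sentiment is a
# document-level polarity score in [-1, 1] produced upstream.
records = [
    ("rep", "budget", -0.6), ("rep", "budget", -0.4), ("dem", "budget", 0.7),
    ("rep", "energy", 0.5), ("dem", "energy", 0.4),
    ("rep", "healthcare", -0.8), ("dem", "healthcare", 0.9),
]

totals = defaultdict(list)
for community, topic, score in records:
    totals[(community, topic)].append(score)
avg = {k: sum(v) / len(v) for k, v in totals.items()}

topics = {t for _, t in avg}
def opposition(t):
    a, b = avg.get(("rep", t), 0.0), avg.get(("dem", t), 0.0)
    # Large when the two communities hold opinions of opposite sign.
    return -(a * b) if a * b < 0 else 0.0

for t in sorted(topics, key=opposition, reverse=True):
    print(t, round(opposition(t), 2))   # most controversial topics first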

Keywords: Twitter
[170] Hsu-Hao Tsai. Knowledge management vs. data mining: Research trend, forecast and citation approach. Expert Systems with Applications, 40(8):3160 - 3173, 2013. [ bib | DOI | http ]
Knowledge management (KM) and data mining (DM) have become more important today; however, there is little comprehensive research and there are few categorization schemes discussing the characteristics of both. Using a bibliometric approach, this paper analyzes {KM} and {DM} research trends, forecasts and citations from 1989 to 2009 by locating the headings “knowledge management” and “data mining” in topics in the {SSCI} database. The bibliometric analytical technique was used to examine these two topics in {SSCI} journals from 1989 to 2009; we found 1393 articles with {KM} and 1181 articles with DM. This paper implemented and classified {KM} and {DM} articles using the following eight categories—publication year, citation, country/territory, document type, institute name, language, source title and subject area—for different distribution statuses in order to explore the differences and how {KM} and {DM} technologies have developed in this period and to analyze {KM} and {DM} technology tendencies under the above results. Also, the paper performs the K–S test to check whether the distribution of author article production follows Lotka’s law. The research findings can be extended to investigate author productivity by analyzing variables such as chronological and academic age, number and frequency of previous publications, access to research grants, job status, etc. In such a way, the characteristics of high, medium and low publishing activity of authors can be identified. Besides, these findings will also help to judge scientific research trends and understand the scale of development of research in {KM} and {DM} by comparing the growth in articles and authors. Based on the above information, governments and enterprises may infer collective tendencies and demands for scientific researchers in {KM} and {DM} to formulate appropriate training strategies and policies in the future. This analysis provides a roadmap for future research, abstracts technology trend information and facilitates knowledge accumulation, so that future research can be concentrated in core categories. This implies that the phenomenon “success breeds success” is more common in higher quality publications.
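The Lotka's-law check mentioned above can be illustrated by comparing the observed distribution of author productivity with the 1/n^2 distribution predicted by Lotka's law and taking the maximum cumulative gap as a K-S-style statistic; the counts below are invented and the paper's exact testing procedure is not reproduced.

import numpy as np

# Hypothetical counts: authors_with[n] = number of authors who published n articles.
authors_with = {1: 600, 2: 150, 3: 70, 4: 35, 5: 25}
n = np.array(sorted(authors_with))
observed = np.array([authors_with[k] for k in n], dtype=float)

# Lotka's law: the number of authors with n publications is proportional to 1/n^2.
expected = 1.0 / n ** 2
expected *= observed.sum() / expected.sum()

# K-S-style statistic: maximum gap between the two cumulative distributions.
F_obs = np.cumsum(observed) / observed.sum()
F_exp = np.cumsum(expected) / expected.sum()
D = float(np.max(np.abs(F_obs - F_exp)))
print(round(D, 3))   # a small D suggests productivity roughly follows Lotka's law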

Keywords: Knowledge management
[171] Samet Çokpınar and Taflan İmre Gündem. Positive and negative association rule mining on {XML} data streams in database as a service concept. Expert Systems with Applications, 39(8):7503 - 7511, 2012. [ bib | DOI | http ]
In recent years, data mining has become one of the most popular techniques for data owners to determine their strategies. Association rule mining is a data mining approach that is used widely in traditional databases, usually to find positive association rules. However, there are some other challenging rule mining topics like data stream mining and negative association rule mining. Besides, organizations want to concentrate on their own business and outsource the rest of their work. This approach is named the “database as a service” concept and provides many benefits to the data owner but, at the same time, raises some security problems. In this paper, a rule mining system has been proposed that provides an efficient and secure solution to positive and negative association rule computation on {XML} data streams in the database as a service concept. The system is implemented and several experiments have been done with different synthetic data sets to show the performance and efficiency of the proposed system.

Keywords: Association rule mining
[172] Petra Perner. Mining sparse and big data by case-based reasoning. Procedia Computer Science, 35(0):19 - 33, 2014. Knowledge-Based and Intelligent Information & Engineering Systems 18th Annual Conference, KES-2014 Gdynia, Poland, September 2014 Proceedings. [ bib | DOI | http ]
The increasing use of digital media in daily life has resulted in a need for novel multimedia data analysis techniques. Case-based Reasoning (CBR) solves problems using the already stored knowledge, and captures new knowledge, making it immediately available for solving the next problem. Therefore, case-based reasoning can be seen as a method for problem solving, and also as a method to capture new experience and make it immediately available for problem solving. Therefore, {CBR} can mine sparse and big data. It can be seen as a learning and knowledge-discovery approach, since it can capture from new experience some general knowledge, such as case classes, prototypes and some higher-level concept. In this talk, we will explain the case-based reasoning process scheme. We will show what kinds of methods are necessary to provide all the functions for such a computer model. We will develop the bridge between {CBR} and Statistics and show how case-based reasoning can mine big and sparse data. Examples are given based on multimedia applications. Finally, we will show recent new developments and we will give an outline for further work.

Keywords: Sparse Data Mining
[173] Yi-Hui Chen, Eric Jui-Lin Lu, and Meng Fang Tsai. Finding keywords in blogs: Efficient keyword extraction in blog mining via user behaviors. Expert Systems with Applications, 41(2):663 - 670, 2014. [ bib | DOI | http ]
Readers are becoming accustomed to obtaining useful and reliable information from bloggers. To make access to the vastly increasing resource of blogs more effective, clustering is useful. Results of the literature review suggest that using linking information, keywords, or tags/categories to calculate similarity is critical for clustering. Keywords are commonly retrieved from the full text, which can be a time-consuming task if multiple articles must be processed. For tags/categories, there is also a problem of ambiguity; that is, different bloggers may define tags/categories of identical content differently. Keywords are important not only to reflect the theme of an article through blog readers’ perspectives but also to accurately match users’ intentions. In this paper, a tracing code is embedded in Blog Connect, a newly developed platform, to collect the keywords queried by readers and then select candidate keywords as co-keywords. The experiments provide positive evidence that co-keywords can act as a quick path to an article. In addition, co-keyword generation can reduce the complexity and redundancy of full-text keyword retrieval procedures and satisfy blog readers’ intentions.

Keywords: Blog mining
[174] D. Thorleuchter and D. Van den Poel. Web mining based extraction of problem solution ideas. Expert Systems with Applications, 40(10):3961 - 3969, 2013. [ bib | DOI | http ]
The internet is a valuable source of information where many ideas can be found dealing with different topics. A small number of these ideas might be able to solve an existing problem. However, it is time-consuming to identify these ideas within the large amount of textual information on the internet. This paper introduces a new web mining approach that enables an automated identification of new technological ideas extracted from internet sources that are able to solve a given problem. It adapts and combines several existing approaches from the literature: approaches that extract new technological ideas from a user-given text, approaches that investigate the different idea characteristics in different technical domains, and multi-language web mining approaches. In contrast to previous work, the proposed approach enables the identification of problem solution ideas on the internet, considering domain dependencies and language aspects. In a case study, new ideas are identified to solve existing technological problems that occurred in research and development (R&D) projects. This supports the process of research planning and technology development.

Keywords: Web mining
[175] Jianshan Sun, Gang Wang, Xusen Cheng, and Yelin Fu. Mining affective text to improve social media item recommendation. Information Processing & Management, 51(4):444 - 457, 2015. [ bib | DOI | http ]
Social media websites, such as YouTube and Flickr, are currently gaining in popularity. A large volume of information is generated by online users and how to appropriately provide personalized content is becoming more challenging. Traditional recommendation models are overly dependent on preference ratings and often suffer from the problem of “data sparsity”. Recent research has attempted to integrate sentiment analysis results of online affective texts into recommendation models; however, these studies are still limited. The one class collaborative filtering (OCCF) method is more applicable in the social media scenario, yet it is insufficient for item recommendation. In this study, we develop a novel sentiment-aware social media recommendation framework, referred to as SA_OCCF, in order to tackle the above challenges. We leverage inferred sentiment feedback information and {OCCF} models to improve recommendation performance. We conduct comprehensive experiments on a real social media web site to verify the effectiveness of the proposed framework and methods. The results show that the proposed methods are effective in improving the performance of the baseline {OCCF} methods.

Keywords: Social media
[176] Bai-En Shie, Philip S. Yu, and Vincent S. Tseng. Efficient algorithms for mining maximal high utility itemsets from data streams with different models. Expert Systems with Applications, 39(17):12947 - 12960, 2012. [ bib | DOI | http ]
Data stream mining is an emerging research topic in the data mining field. Finding frequent itemsets is one of the most important tasks in data stream mining with wide applications like online e-business and web click-stream analysis. However, two main problems exist in relevant studies: (1) The utilities (e.g., importance or profits) of items are not considered. Actual utilities of patterns cannot be reflected in frequent itemsets. (2) Existing utility mining methods produce too many patterns and this makes it difficult for the users to filter useful patterns among the huge set of patterns. In view of this, in this paper we propose a novel framework, named {GUIDE} (Generation of maximal high Utility Itemsets from Data strEams), to find maximal high utility itemsets from data streams with different models, i.e., landmark, sliding window and time fading models. The proposed structure, named MUI-Tree (Maximal high Utility Itemset Tree), maintains essential information for the mining processes and the proposed strategies further facilitate the performance of GUIDE. The main contributions of this paper are as follows: (1) To the best of our knowledge, this is the first work on mining the compact form of high utility patterns from data streams; (2) {GUIDE} is an effective one-pass framework which meets the requirements of data stream mining; (3) {GUIDE} generates novel patterns which are not only high utility but also maximal, which provide compact and insightful hidden information in the data streams. Experimental results show that our approach outperforms the state-of-the-art algorithms under various conditions in data stream environments on different models.

Keywords: High utility itemset
[177] Juan D. Velásquez. Web mining and privacy concerns: Some important legal issues to be considered before applying any data and information extraction technique in web-based environments. Expert Systems with Applications, 40(13):5228 - 5239, 2013. [ bib | DOI | http ]
Web mining is a concept that gathers all techniques, methods and algorithms used to extract information and knowledge from data originating on the web (web data). A part of this technique aims to analyze the behavior of users in order to continuously improve both the structure and content of visited web sites. Behind this quite altruistic belief – namely, to help the user feel comfortable when they visit a site through a personalization process – there underlies a series of processing methodologies whose operation is at least debatable from the point of view of the users’ privacy. Thus, an important question arises: to what extent may the desire to improve the services offered through a web site infringe upon the privacy of those who visit it? The use of powerful processing tools such as those provided by web mining may threaten users’ privacy. Current legal scholarship on privacy issues suggests a flexible approach that enables the determination, within each particular context, of those behaviors that can threaten individual privacy. However, it has been observed that {TIC} professionals, with the purpose of formulating practical rules on this matter, have a very narrow-minded concept of privacy, primarily centered on the dichotomy between personally identifiable information (PII) and anonymous data. The aim of this paper is to adopt an integrative approach based on the distinctive attributes of web mining in order to determine which techniques and uses are harmful.

Keywords: Web mining
[178] Bernard V. Liengme. Chapter 6 - data mining. In Bernard V. Liengme, editor, A Guide to Microsoft Excel 2013 for Scientists and Engineers, pages 103 - 116. Academic Press, Boston, 2016. [ bib | DOI | http ]
The main thrust of this chapter is the consolidation of large databases into usable information. The chapter begins with instructions on how to import a {TXT} file as a way of giving the reader a reasonably large database with which to experiment. The reader is told how to sort and filter a database. Consolidation techniques such as frequency tables and pivot tables are explained.

Keywords: Frequency distribution
[179] Phillippe Mack. Chapter 35 - big data, data mining, and predictive analytics and high performance computing. In Lawrence E. Jones, editor, Renewable Energy Integration, pages 439 - 454. Academic Press, Boston, 2014. [ bib | DOI | http ]
Acceleration in connected devices, computing power and storage capacity has led to an exponential growth of available data. In recent years, all industry sectors have started to face this huge flow of information and tried to turn it into increased value for their business. Renewable energy sources are increasing at a lightning pace all around the world. Uncertainties related to renewable energy market penetration threaten the business of many stakeholders. Advances in predictive analytics have proven to be an efficient solution to cope with uncertainties and transform data into business value. High performance computing and cloud based computation significantly decrease the setup costs of predictive analytics solutions. With the emergence of smart grids, newer solutions based on predictive analytics are starting to appear in the market place, however, market players still face numerous roadblocks in order to leverage the full potential of big data.

Keywords: predictive analytics
[180] Sebastián A. Ríos and Ivan F. Videla–Cavieres. Generating groups of products using graph mining techniques. Procedia Computer Science, 35(0):730 - 738, 2014. Knowledge-Based and Intelligent Information & Engineering Systems 18th Annual Conference, KES-2014 Gdynia, Poland, September 2014 Proceedings. [ bib | DOI | http ]
The retail industry has evolved. Nowadays, companies around the world need a better and deeper understanding of their customers in order to enhance store layout and generate customer groups, offers and personalized recommendations, among other objectives. To accomplish these objectives, it is very important to know which products are related to each other. Classical approaches for clustering products, such as K-means or SOFM, do not work when the data are scattered and very large. Even association rules give results that are difficult to interpret. These facts motivate us to use a novel approach that generates communities of products. One of the main advantages of these communities is that they are meaningful and easily interpretable by retail analysts. This approach allows the processing of billions of transaction records within a reasonable time, according to the needs of companies.
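A minimal sketch of the general idea, assuming a co-purchase graph built from transactions and an off-the-shelf modularity-based community detection routine from networkx; the transactions are invented and the paper's own graph construction and community algorithm may differ.

from itertools import combinations
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical market-basket transactions (sets of products bought together).
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"bread", "jam"},
]

# Build a co-purchase graph whose edge weights count how often two products co-occur.
G = nx.Graph()
for basket in transactions:
    for a, b in combinations(sorted(basket), 2):
        w = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=w)

# Communities of products: groups that tend to be bought together.
for community in greedy_modularity_communities(G, weight="weight"):
    print(sorted(community))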

Keywords: Market Basket Analysis
[181] Tim Menzies, Ekrem Kocagüneli, Leandro Minku, Fayola Peters, and Burak Turhan. Chapter 7 - data mining and {SE}. In Tim Menzies, Ekrem Kocagüneli, Leandro Minku, Fayola Peters, and Burak Turhan, editors, Sharing Data and Models in Software Engineering, pages 41 - 42. Morgan Kaufmann, Boston, 2015. [ bib | DOI | http ]
In this part of the book Data Science for Software Engineering: Sharing Data and Models, we offer some tutorial notes on commonly used software engineering applications of data mining, along with some tutorial material on data mining algorithms. Covered topics of {SE} problems include effort estimation and defect prediction. Covered aspects of data mining include discretization, column pruning (also known as feature selection), row pruning, clustering, contrast set learning, decision learning, and learning for continuous classes.

[182] David F. Nettleton. Data mining of social networks represented as graphs. Computer Science Review, 7(0):1 - 34, 2013. [ bib | DOI | http ]
In this survey we review the literature and concepts of the data mining of social networks, with special emphasis on their representation as a graph structure. The survey is divided into two principal parts: first, we conduct a survey of the literature which forms the ‘basis’ and background for the field; second, we define a set of ‘hot topics’ which are currently in vogue in congresses and the literature. The ‘basis’ or background part is divided into four major themes: graph theory, social networks, online social networks and graph mining. The graph mining theme is organized into ten subthemes. The second, ‘hot topic’ part, is divided into five major themes: communities, influence and recommendation, models, metrics and dynamics, behaviour and relationships, and information diffusion.

Keywords: Graphs
[183] Jun Liu, Jianke Pan, Yanping Wang, Dingkun Lin, Dan Shen, Hongjun Yang, Xiang Li, Minghui Luo, and Xuewei Cao. Component analysis of chinese medicine and advances in fuming-washing therapy for knee osteoarthritis via unsupervised data mining methods. Journal of Traditional Chinese Medicine, 33(5):686 - 691, 2013. [ bib | DOI | http ]
To analyze the component law of Chinese medicines in fuming-washing therapy for knee osteoarthritis (KOA), and develop new fuming-washing prescriptions for {KOA} through unsupervised data mining methods. Methods Chinese medicine recipes for fuming-washing therapy for {KOA} were collected and recorded in a database. The correlation coefficient among herbs, core combinations of herbs, and new prescriptions were analyzed using modified mutual information, complex system entropy cluster, and unsupervised hierarchical clustering, respectively. Results Based on analysis of 345 Chinese medicine recipes for fuming-washing therapy, 68 herbs occurred frequently, 33 herb pairs occurred frequently, and 12 core combinations were found. Five new fuming-washing recipes for {KOA} were developed. Conclusion Chinese medicines for fuming-washing therapy of {KOA} mainly consist of wind-dampness-dispelling and cold-dispersing herbs, blood-activating and stasis-resolving herbs, and wind-dampness-dispelling and heat-clearing herbs. The treatment of fuming-washing therapy for {KOA} also includes dispelling wind-dampness and dispersing cold, activating blood and resolving stasis, and dispelling wind-dampness and clearing heat. Zhenzhutougucao (Herba Speranskiae Tuberculatae), Honghua (Flos Carthami), Niuxi (Radix Achyranthis Bidentatae), Shenjincao (Herba Lycopodii Japonici), Weilingxian (Radix et Rhizoma Clematidis Chinensis), Chuanwu (Radix Aconiti), Haitongpi (Cortex Erythrinae Variegatae), Ruxiang (Olibanum), Danggui (Radix Angelicae Sinensis), Caowu (Radix Aconiti Kusnezoffii), Moyao (Myrrha), and Aiye (Folium Artemisiae Argyi) are the main herbs used in the fuming-washing treatment for KOA.

Keywords: Knee osteoarthritis
[184] Hsu-Hao Tsai. Global data mining: An empirical study of current trends, future forecasts and technology diffusions. Expert Systems with Applications, 39(9):8172 - 8181, 2012. [ bib | DOI | http ]
Using a bibliometric approach, this paper analyzes research trends and forecasts of data mining from 1989 to 2009 by locating the heading “data mining” in topics in the {SSCI} database. The bibliometric analytical technique was used to examine the topic in {SSCI} journals from 1989 to 2009; we found 1181 articles with data mining. This paper implemented and classified data mining articles using the following eight categories—publication year, citation, country/territory, document type, institute name, language, source title and subject area—for different distribution statuses in order to explore the differences and how data mining technologies have developed in this period and to analyze technology tendencies and forecasts of data mining under the above results. Also, the paper performs the K-S test to check whether the analysis follows Lotka’s law. Besides, the analysis also reviews the historical literature to trace the technology diffusion of data mining. The paper provides a roadmap for future research, abstracts technology trends and forecasts, and facilitates knowledge accumulation so that data mining researchers can save some time since core knowledge will be concentrated in core categories. This implies that the phenomenon “success breeds success” is more common in higher quality publications.

Keywords: Data mining
[185] Dirk Thorleuchter and Dirk Van den Poel. Weak signal identification with semantic web mining. Expert Systems with Applications, 40(12):4978 - 4985, 2013. [ bib | DOI | http ]
We investigate an automated identification of weak signals according to Ansoff to improve strategic planning and technological forecasting. The literature shows that weak signals can be found in the organization’s environment and that they appear in different contexts. We use internet information to represent the organization’s environment and we select those websites that are related to a given hypothesis. In contrast to related research, a methodology is provided that uses latent semantic indexing (LSI) for the identification of weak signals. This improves existing knowledge-based approaches because {LSI} considers aspects of meaning and is thus able to identify similar textual patterns in different contexts. A new weak signal maximization approach is introduced that replaces the commonly used prediction modeling approach in LSI. It enables the calculation of the largest number of relevant weak signals represented by singular value decomposition (SVD) dimensions. A case study identifies and analyses weak signals to predict trends in the field of on-site medical oxygen production. This supports the planning of research and development (R&D) for a medical oxygen supplier. As a result, it is shown that the proposed methodology enables organizations to identify weak signals from the internet for a given hypothesis. This helps strategic planners to react ahead of time.
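The LSI step can be sketched with scikit-learn by factorizing a TF-IDF matrix with truncated SVD and ranking environment documents by cosine similarity to a hypothesis in the latent space; the snippets and the hypothesis are invented, and the weak signal maximization step itself is not reproduced.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical website snippets representing the organization's environment.
docs = [
    "portable oxygen concentrator for home care",
    "membrane technology for on-site gas separation",
    "weather forecast for the coming weekend",
    "new adsorption material improves oxygen purity",
]
hypothesis = "on-site medical oxygen production"

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs + [hypothesis])

# Latent semantic indexing: truncated SVD of the TF-IDF matrix.
lsi = TruncatedSVD(n_components=2, random_state=0)
Z = lsi.fit_transform(X)

scores = cosine_similarity(Z[-1:], Z[:-1])[0]   # hypothesis vs. each document
for doc, s in sorted(zip(docs, scores), key=lambda p: -p[1]):
    print(round(s, 2), doc)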

Keywords: Weak Signal
[186] Tim Menzies, Ekrem Kocagüneli, Leandro Minku, Fayola Peters, and Burak Turhan. Chapter 10 - data mining (under the hood). In Tim Menzies, Ekrem Kocagüneli, Leandro Minku, Fayola Peters, and Burak Turhan, editors, Sharing Data and Models in Software Engineering, pages 51 - 75. Morgan Kaufmann, Boston, 2015. [ bib | DOI | http ]
In this part of the book Data Science for Software Engineering: Sharing Data and Models, we offer some tutorial notes on commonly used software engineering applications of data mining, along with some tutorial material on data mining algorithms. Covered topics of {SE} problems include effort estimation and defect prediction. Covered aspects of data mining include discretization, column pruning (also known as feature selection), row pruning, clustering, contrast set learning, decision learning, and learning for continuous classes.

[187] Mohamed M. Mostafa. More than words: Social networks’ text mining for consumer brand sentiments. Expert Systems with Applications, 40(10):4241 - 4251, 2013. [ bib | DOI | http ]
Blogs and social networks have recently become a valuable resource for mining sentiments in fields as diverse as customer relationship management, public opinion tracking and text filtering. In fact knowledge obtained from social networks such as Twitter and Facebook has been shown to be extremely valuable to marketing research companies, public opinion organizations and other text mining entities. However, Web texts have been classified as noisy as they represent considerable problems both at the lexical and the syntactic levels. In this research we used a random sample of 3516 tweets to evaluate consumers’ sentiment towards well-known brands such as Nokia, T-Mobile, IBM, {KLM} and DHL. We used an expert-predefined lexicon including around 6800 seed adjectives with known orientation to conduct the analysis. Our results indicate a generally positive consumer sentiment towards several famous brands. By using both a qualitative and quantitative methodology to analyze brands’ tweets, this study adds breadth and depth to the debate over attitudes towards cosmopolitan brands.
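A minimal sketch of lexicon-based scoring with a handful of seed adjectives of known orientation; the real study used roughly 6800 seed adjectives and a different aggregation, so the lexicon and tweets below are purely illustrative.

import re

# A tiny stand-in for the expert lexicon of seed adjectives with known orientation.
lexicon = {"great": 1, "reliable": 1, "fast": 1, "poor": -1, "slow": -1, "broken": -1}

def tweet_sentiment(text):
    """Sum the orientations of lexicon words found in the tweet."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(tok, 0) for tok in tokens)

tweets = [
    "Great battery life and a reliable signal on my new phone",
    "Delivery was slow and the handset arrived broken",
]
for t in tweets:
    score = tweet_sentiment(t)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    print(label, t)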

Keywords: Consumer behavior
[188] Fei Zhu, Preecha Patumcharoenpol, Cheng Zhang, Yang Yang, Jonathan Chan, Asawin Meechai, Wanwipa Vongsangnak, and Bairong Shen. Biomedical text mining and its applications in cancer research. Journal of Biomedical Informatics, 46(2):200 - 211, 2013. [ bib | DOI | http ]
Cancer is a malignant disease that has caused millions of human deaths. Its study has a long history of well over 100 years. There have been an enormous number of publications on cancer research. This integrated but unstructured biomedical text is of great value for cancer diagnostics, treatment, and prevention. The immense body and rapid growth of biomedical text on cancer has led to the appearance of a large number of text mining techniques aimed at extracting novel knowledge from scientific text. Biomedical text mining on cancer research is computationally automatic and high-throughput in nature. However, it is error-prone due to the complexity of natural language processing. In this review, we introduce the basic concepts underlying text mining and examine some frequently used algorithms, tools, and data sets, as well as assessing how much these algorithms have been utilized. We then discuss the current state-of-the-art text mining applications in cancer research and we also provide some resources for cancer text mining. With the development of systems biology, researchers tend to understand complex biomedical systems from a systems biology viewpoint. Thus, the full utilization of text mining to facilitate cancer systems biology research is fast becoming a major concern. To address this issue, we describe the general workflow of text mining in cancer systems biology and each phase of the workflow. We hope that this review can (i) provide a useful overview of the current work of this field; (ii) help researchers to choose text mining tools and datasets; and (iii) highlight how to apply text mining to assist cancer systems biology research.

Keywords: Biomedical text
[189] Shu-Hsien Liao and Shan-Yuan Chou. Data mining investigation of co-movements on the taiwan and china stock markets for future investment portfolio. Expert Systems with Applications, 40(5):1542 - 1554, 2013. [ bib | DOI | http ]
On June 29, 2010, Taiwan signed an Economic Cooperation Framework Agreement (ECFA) with China as a major step to open markets between Taiwan and China. Thus, the {ECFA} will contribute by creating a closer relationship between China and Taiwan through economic and market interactions. Co-movements of the world’s national financial market indexes are a popular research topic in the finance literature. Some studies examine the co-movements and the benefits of international financial market portfolio diversification/integration and economic performance. Thus, this study investigates the co-movement in the Taiwan and China (Hong Kong) stock markets under the {ECFA} using a data mining approach, including association rules and clustering analysis. Thirty categories of stock indexes are implemented as decision variables to observe the behavior of stock index associations during the periods of {ECFA} implementation. Patterns, rules, and clusters of data mining results are discussed for future stock market investment portfolio.

Keywords: Cross-national stock market
[190] Diego Carvalho Soares, Flávia Maria Santoro, and Fernanda Araujo Baião. Discovering collaborative knowledge-intensive processes through e-mail mining. Journal of Network and Computer Applications, 36(6):1451 - 1465, 2013. [ bib | DOI | http ]
Knowledge Management aims to promote the growth, communication and preservation of knowledge within an organization, which includes managing the appropriate resources to facilitate knowledge sharing and reuse. Business Process-Oriented Knowledge Management focuses on discovering and representing the dynamic conversion of existing knowledge among participants involved in executing business processes. In this context, Knowledge-Intensive Processes are a very important and challenging specific subclass of processes, since they strongly involve socialization and informal exchanges of knowledge among participants. This paper describes in detail a method for semi-automatic discovery of relevant information characterizing Knowledge-Intensive Processes, as well as the results of further evaluation of the method. Our approach draws on the informal exchange of existing knowledge in collaborative tools such as e-mails. The output is a conceptual map that describes the main elements of a Knowledge-Intensive Process, as well as their relationships. The results from the case study conducted to evaluate the method in an organization underlined the prospects for using collaborative environments to discover the way agents perform their activities.

Keywords: Knowledge-intensive process
[191] Lei Duan, Changjie Tang, Xiaosong Li, Guozhu Dong, Xianming Wang, Jie Zuo, Min Jiang, Zhongqi Li, and Yongqing Zhang. Mining effective multi-segment sliding window for pathogen incidence rate prediction. Data & Knowledge Engineering, 87(0):425 - 444, 2013. [ bib | DOI | http ]
Pathogen incidence rate prediction, which can be considered as time series modeling, is an important task for infectious disease incidence rate prediction and for public health. This paper investigates the application of a genetic computation technique, namely GEP, for pathogen incidence rate prediction. To overcome the shortcomings of traditional sliding windows in GEP-based time series modeling, the paper introduces the problem of mining effective sliding window, for discovering optimal sliding windows for building accurate prediction models. To utilize the periodical characteristic of pathogen incidence rates, a multi-segment sliding window consisting of several segments from different periodical intervals is proposed and used. Since the number of such candidate windows is still very large, a heuristic method is designed for enumerating the candidate effective multi-segment sliding windows. Moreover, methods to find the optimal sliding window and then produce a mathematical model based on that window are proposed. A performance study on real-world datasets shows that the techniques are effective and efficient for pathogen incidence rate prediction.
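The multi-segment sliding window idea (combining a recent segment with a segment taken one period earlier) can be sketched as follows; the segment lengths, the period of 52 weeks and the series are illustrative assumptions, not the windows discovered by the paper's GEP search.

import numpy as np

def multi_segment_window(series, t, recent=3, period=52, seasonal=2):
    """Build a multi-segment window for predicting series[t]:
    a recent segment plus a segment taken one period (e.g., one year) earlier."""
    recent_seg = series[t - recent:t]
    seasonal_seg = series[t - period - seasonal + 1:t - period + 1]
    return np.concatenate([recent_seg, seasonal_seg])

weekly_rate = np.arange(120, dtype=float)      # stand-in for weekly incidence rates
print(multi_segment_window(weekly_rate, t=110))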

Keywords: Data mining
[192] Chun-Wei Lin, Guo-Cheng Lan, and Tzung-Pei Hong. An incremental mining algorithm for high utility itemsets. Expert Systems with Applications, 39(8):7173 - 7180, 2012. [ bib | DOI | http ]
Association-rule mining, which is based on frequency values of items, is the most common topic in data mining. In real-world applications, customers may, however, buy many copies of products and each product may have different factors, such as profits and prices. Only mining frequent itemsets in binary databases is thus not suitable for some applications. Utility mining is thus presented to consider additional measures, such as profits or costs according to user preference. In the past, a two-phase mining algorithm was designed for fast discovery of high utility itemsets from databases. When data come intermittently, the approach needs to process all the transactions in a batch way. In this paper, an incremental mining algorithm for efficiently mining high utility itemsets is proposed to handle the above situation. It is based on the concept of the fast-update (FUP) approach, which was originally designed for association mining. The proposed approach first partitions itemsets into four parts according to whether they are high transaction-weighted utilization itemsets in the original database and in the newly inserted transactions. Each part is then handled by its own procedure. Experimental results also show that the proposed algorithm executes faster than the two-phase batch mining algorithm in the intermittent data environment.

Keywords: Utility mining
[193] Lee Sael, Inah Jeon, and U Kang. Scalable tensor mining. Big Data Research, 2(2):82 - 86, 2015. Visions on Big Data. [ bib | DOI | http ]
Tensors, or multi-dimensional arrays, are receiving significant attention due to the various types of data that can be modeled by them; examples include call graphs (sender, receiver, time), knowledge bases (subject, verb, object), and 3-dimensional web graphs augmented with anchor texts, to name a few. Scalable tensor mining aims to extract important patterns and anomalies from a large amount of tensor data. In this paper, we provide an overview of scalable tensor mining. We first present the main algorithms for tensor mining, and their scalable versions. Next, we describe success stories of using tensors for interesting data mining problems including higher order web analysis, knowledge base mining, network traffic analysis, citation analysis, and sensor data analysis. Finally, we discuss interesting future research directions for scalable tensor mining.

Keywords: Tensor
[194] David Lo, Didi Surian, Philips Kokoh Prasetyo, Kuan Zhang, and Ee-Peng Lim. Mining direct antagonistic communities in signed social networks. Information Processing & Management, 49(4):773 - 791, 2013. [ bib | DOI | http ]
Social networks provide a wealth of data to study relationship dynamics among people. Most social networks such as Epinions and Facebook allow users to declare trusts or friendships with other users. Some of them also allow users to declare distrusts or negative relationships. When both positive and negative links co-exist in a network, some interesting community structures can be studied. In this work, we mine Direct Antagonistic Communities (DACs) within such signed networks. Each {DAC} consists of two sub-communities with positive relationships among members of each sub-community, and negative relationships among members of the other sub-community. Identifying direct antagonistic communities is an important step to understand the nature of the formation, dissolution, and evolution of such communities. Knowledge about antagonistic communities allows us to better understand and explain behaviors of users in the communities. Identifying {DACs} from a large signed network is however challenging as various combinations of user sets, which is very large in number, need to be checked. We propose an efficient data mining solution that leverages the properties of DACs, and combines the identification of strongest connected components and bi-clique mining. We have experimented our approach on synthetic, myGamma, and Epinions datasets to showcase the efficiency and utility of our proposed approach. We show that we can mine {DACs} in less than 15 min from a signed network of myGamma, which is a mobile social networking site, consisting of 600,000 members and 8 million links. An investigation on the behavior of users participating in {DACs} shows that antagonism significantly affects the way people behave and interact with one another.

Keywords: Direct antagonistic community
[195] David F. Motta Cabrera and Hamidreza Zareipour. Data association mining for identifying lighting energy waste patterns in educational institutes. Energy and Buildings, 62(0):210 - 216, 2013. [ bib | DOI | http ]
A significant portion of the energy consumption in post-secondary educational institutes is for lighting classrooms. The occupancy patterns in post-secondary educational institutes are not stable and predictable, and thus, alternative solutions may be required to match energy consumption and occupancy in order to increase energy efficiency. In this paper, we report an experimental research on quantifying and understanding lighting energy waste patterns in a post-secondary educational institute. Data has been collected over a full academic year in three typical classrooms. Data association mining, a powerful data mining tool, is applied to the data in order to extract association rules and explore lighting waste patterns. The simulations results show that if the waste patterns are avoided, significant savings, as high as 70% of the current energy use, are achievable.
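The association rule computation behind such an analysis can be sketched in a few lines: count itemset supports over boolean snapshots and derive support and confidence for a candidate waste rule. The snapshot encoding and the rule below are invented for illustration.

from itertools import combinations
from collections import Counter

# Hypothetical hourly snapshots of a classroom, encoded as sets of items.
snapshots = [
    {"lights_on", "unoccupied", "evening"},
    {"lights_on", "unoccupied", "evening"},
    {"lights_on", "occupied", "daytime"},
    {"lights_off", "unoccupied", "evening"},
    {"lights_on", "unoccupied", "daytime"},
]

counts = Counter()
for snap in snapshots:
    for size in (1, 2, 3):
        for itemset in combinations(sorted(snap), size):
            counts[frozenset(itemset)] += 1

def rule(antecedent, consequent):
    a = frozenset(antecedent)
    ac = a | frozenset(consequent)
    support = counts[ac] / len(snapshots)
    confidence = counts[ac] / counts[a]
    return support, confidence

# A candidate waste pattern: lights left on in unoccupied rooms in the evening.
print(rule({"evening", "unoccupied"}, {"lights_on"}))   # (support, confidence)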

Keywords: Building energy use
[196] Jun won Lee and Christophe Giraud-Carrier. Results on mining {NHANES} data: A case study in evidence-based medicine. Computers in Biology and Medicine, 43(5):493 - 503, 2013. [ bib | DOI | http ]
The National Health and Nutrition Examination Survey (NHANES), administered annually by the National Center for Health Statistics, is designed to assess the general health and nutritional status of adults and children in the United States. Given to several thousands of individuals, the extent of this survey is very broad, covering demographic, laboratory and examination information, as well as responses to a fairly comprehensive health questionnaire. In this paper, we adapt and extend association rule mining and clustering algorithms to extract useful knowledge regarding diabetes and high blood pressure from the 1999–2008 survey results, thus demonstrating how data mining techniques may be used to support evidence-based medicine.

Keywords: Medical data mining
[197] Janghyeok Yoon. Detecting weak signals for long-term business opportunities using text mining of web news. Expert Systems with Applications, 39(16):12543 - 12550, 2012. [ bib | DOI | http ]
In an uncertain business environment, competitive intelligence requires peripheral vision to scan and identify weak signals that can affect the future business environment. Weak signals are defined as imprecise and early indicators of impending important events or trends, which are considered key to formulating new potential business items. However, existing methods for discovering weak signals rely on the knowledge and expertise of experts, whose services are not widely available and tend to be costly. They may even provide different analysis results. Therefore, this paper presents a quantitative method that identifies weak signal topics by exploiting keyword-based text mining. The proposed method is illustrated using Web news articles related to solar cells. As a supportive tool for the expert-based approach, this method can be incorporated into long-term business planning processes to assist experts in identifying potential business items.

Keywords: Weak signal
[198] Filip Caron, Jan Vanthienen, and Bart Baesens. A comprehensive investigation of the applicability of process mining techniques for enterprise risk management. Computers in Industry, 64(4):464 - 475, 2013. [ bib | DOI | http ]
Process mining techniques and tools perfectly complement the existing set of enterprise risk management approaches. Enterprise risk management aims at minimizing the negative effects of uncertainty on the objectives, while at the same time promoting the potential positive effects. Process mining research has proposed a broad range of techniques and tools that could be used to effectively support the activities related to the different phases of risk management. This paper contributes to the process mining and risk management research by providing a full exploration of the applicability of process mining in the context of the eight components of the {COSO} Enterprise Risk Management Framework. The identified applications will be illustrated based on the risks involved in insurance claim handling processes.

Keywords: Enterprise risk management
[199] Wu He, Shenghua Zha, and Ling Li. Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management, 33(3):464 - 472, 2013. [ bib | DOI | http ]
Social media have been adopted by many businesses. More and more companies are using social media tools such as Facebook and Twitter to provide various services and interact with customers. As a result, a large amount of user-generated content is freely available on social media sites. To increase competitive advantage and effectively assess the competitive environment of businesses, companies need to monitor and analyze not only the customer-generated content on their own social media sites, but also the textual information on their competitors’ social media sites. In an effort to help companies understand how to perform a social media competitive analysis and transform social media data into knowledge for decision makers and e-marketers, this paper describes an in-depth case study which applies text mining to analyze unstructured text content on Facebook and Twitter sites of the three largest pizza chains: Pizza Hut, Domino's Pizza and Papa John's Pizza. The results reveal the value of social media competitive analysis and the power of text mining as an effective technique to extract business value from the vast amount of available social media data. Recommendations are also provided to help companies develop their social media competitive analysis strategy.

Keywords: Social media
[200] Laura Ferrari and Marco Mamei. Classification and prediction of whereabouts patterns from the reality mining dataset. Pervasive and Mobile Computing, 9(4):516 - 527, 2013. [ bib | DOI | http ]
Classification and prediction of users’ whereabouts patterns is important for many emerging ubiquitous computing applications. Latent Dirichlet Allocation (LDA) is a powerful mechanism to extract recurrent behaviors and high-level patterns (called topics) from mobility data in an unsupervised manner. One drawback of {LDA} is that it is difficult to give meaningful and usable labels to the extracted topics. We present a methodology to automatically classify the topics with meaningful labels so as to support their use in applications. We also present a topic prediction mechanism to infer a user’s future whereabouts on the basis of the extracted topics. Both mechanisms are tested and evaluated using the Reality Mining dataset, which consists of a large set of continuous data on human behavior.
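
As a minimal, hypothetical sketch of the general approach (not the authors' code), each user-day below is treated as a "document" of location tokens, topics are extracted with gensim's LDA, and each topic is given a naive automatic label from its most probable location; the location vocabulary is invented for the example and gensim is assumed to be available.

from gensim import corpora, models

# Each "document" is one user-day expressed as a sequence of visited-place tokens.
user_days = [
    ["home", "office", "office", "gym", "home"],
    ["home", "office", "cafe", "office", "home"],
    ["home", "mall", "cinema", "home"],
]

dictionary = corpora.Dictionary(user_days)
corpus = [dictionary.doc2bow(day) for day in user_days]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      passes=20, random_state=0)

# A naive automatic label: the most probable location in each topic,
# a stand-in for the richer labeling strategy described in the paper.
for k in range(lda.num_topics):
    top_word, _ = lda.show_topic(k, topn=1)[0]
    print("topic %d -> label '%s'" % (k, top_word))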

Keywords: Whereabouts data
[201] Jing Huang, Bo Yang, Di Jin, and Yi Yang. Decentralized mining social network communities with agents. Mathematical and Computer Modelling, 57(11–12):2998 - 3008, 2013. Information System Security and Performance Modeling and Simulation for Future Mobile Networks. [ bib | DOI | http ]
Network community mining algorithms aim at efficiently and effectively discovering all such communities from a given network. Many related methods have been proposed and applied to different areas including social network analysis, gene network analysis and web clustering engines. Most of the existing methods for mining communities are centralized. In this paper, we present a multi-agent based decentralized algorithm, in which a group of autonomous agents work together to mine a network through a proposed self-aggregation and self-organization mechanism. Thanks to its decentralized feature, our method is potentially suitable for dealing with distributed networks, whose global structures are hard to obtain due to their geographical distributions, decentralized controls or huge sizes. The effectiveness of our method has been tested against different benchmark networks.

Keywords: Social network
[202] Shyi-Ming Chen and Po-Jui Sue. Constructing concept maps for adaptive learning systems based on data mining techniques. Expert Systems with Applications, 40(7):2746 - 2755, 2013. [ bib | DOI | http ]
In this paper, we propose a new method for automatically constructing concept maps for adaptive learning systems based on data mining techniques. First, we calculate the counter values between any two questions, where the counter values indicate the answer-consistency between any two questions. Then, we consider four kinds of association rules between two questions to mine some information. Finally, we calculate the relevance degree between two concepts derived from the association rules to construct concept maps for adaptive learning systems. The proposed method overcomes the drawbacks of the methods of Chen and Bai (2010) and Lee et al. (2009). It provides us with a useful way to construct concept maps for adaptive learning systems based on data mining techniques.

Keywords: Adaptive learning systems
[203] Prakash Shelokar, Arnaud Quirin, and Óscar Cordón. A multiobjective evolutionary programming framework for graph-based data mining. Information Sciences, 237(0):118 - 136, 2013. Prediction, Control and Diagnosis using Advanced Neural Computations. [ bib | DOI | http ]
Subgraph mining is the process of identifying concepts describing interesting and repetitive subgraphs within graph-based data. The exponential number of possible subgraphs makes the problem very challenging. Existing methods apply a single-objective subgraph search with the view that interesting subgraphs are those capable of not merely compressing the data, but also enhancing the interpretation of the data considerably. Usually the methods operate by posing simple constraints (or user-defined thresholds) such as returning all subgraphs whose frequency is above a specified threshold. Such search approach may often return either a large number of solutions in the case of a weakly defined objective or very few in the case of a very strictly defined objective. In this paper, we propose a framework based on multiobjective evolutionary programming to mine subgraphs by jointly maximizing two objectives, support and size of the extracted subgraphs. The proposed methodology is able to discover a nondominated set of interesting subgraphs subject to tradeoff between the two objectives, which otherwise would not be achieved by the single-objective search. Besides, it can use different specific multiobjective evolutionary programming methods. Experimental results obtained by three of the latter methods on synthetically generated as well as real-life graph-based datasets validate the utility of the proposed methodology when benchmarked against classical single-objective methods and their previous, nonevolutionary multiobjective extensions.

Keywords: Graph-based data mining
[204] Kazushi Ikeda, Gen Hattori, Chihiro Ono, Hideki Asoh, and Teruo Higashino. Twitter user profiling based on text and community mining for market analysis. Knowledge-Based Systems, 51(0):35 - 47, 2013. [ bib | DOI | http ]
This paper proposes demographic estimation algorithms for profiling Twitter users, based on their tweets and community relationships. Many people post their opinions via social media services such as Twitter. This huge volume of opinions, expressed in real time, has great appeal as a novel marketing application. When automatically extracting these opinions, it is desirable to be able to discriminate between them based on user demographics, because the ratio of positive and negative opinions differs depending on demographics such as age, gender, and residence area, all of which are essential for market analysis. In this paper, we propose a hybrid text-based and community-based method for the demographic estimation of Twitter users, where these demographics are estimated by tracking the tweet history and clustering of followers/followees. Our experimental results from 100,000 Twitter users show that the proposed hybrid method improves the accuracy of the text-based method. The proposed method is applicable to various user demographics and is suitable even for users who only tweet infrequently.

Keywords: Web mining
[205] Gabriel Oberreuter and Juan D. Velásquez. Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style. Expert Systems with Applications, 40(9):3756 - 3763, 2013. [ bib | DOI | http ]
Plagiarism detection is of special interest to educational institutions, and with the proliferation of digital documents on the Web the use of computational systems for such a task has become important. While traditional methods for automatic detection of plagiarism compute similarity measures on a document-to-document basis, this is not always possible since the potential source documents are not always available. We apply text mining, exploring the use of words as a linguistic feature for analyzing a document by modeling the writing style present in it. The main goal is to discover deviations in the style, looking for segments of the document that could have been written by another person. This can be considered as a classification problem using self-based information where paragraphs with significant deviations in style are treated as outliers. This so-called intrinsic plagiarism detection approach does not need comparison against possible sources at all, and our model relies only on the use of words, so it is not language specific. We demonstrate that this feature shows promise in this area, achieving reasonable results compared to benchmark models.
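
A minimal sketch of the intrinsic, source-free idea is given below, assuming a simple word-frequency profile per paragraph and a z-score outlier rule; both are illustrative stand-ins rather than the paper's exact model.

import re
from collections import Counter

def word_profile(text):
    """Relative frequency of each word in a text."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values()) or 1
    return {w: c / total for w, c in counts.items()}

def style_deviation(paragraph, document):
    """L1 distance between a paragraph's word profile and the whole document's."""
    doc_p, par_p = word_profile(document), word_profile(paragraph)
    vocab = set(doc_p) | set(par_p)
    return sum(abs(doc_p.get(w, 0.0) - par_p.get(w, 0.0)) for w in vocab)

def flag_outliers(paragraphs, z=1.0):
    """Indices of paragraphs whose deviation is unusually high for this document (arbitrary threshold)."""
    document = "\n".join(paragraphs)
    scores = [style_deviation(p, document) for p in paragraphs]
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5 or 1.0
    return [i for i, s in enumerate(scores) if (s - mean) / std > z]

paragraphs = ["the cat sat on the mat " * 5,
              "the cat sat on the mat once more " * 5,
              "the cat sat on the old mat " * 5,
              "quantum entanglement defies classical locality " * 5]
print(flag_outliers(paragraphs))  # the last paragraph stands out stylistically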

Keywords: Text mining
[206] Duen-Ren Liu, Chih-Kun Ke, Jia-Yuan Lee, and Chun-Feng Lee. Knowledge maps for composite e-services: A mining-based system platform coupling with recommendations. Expert Systems with Applications, 34(1):700 - 716, 2008. [ bib | DOI | http ]
Providing various e-services on the Internet by enterprises is an important trend in e-business. Composite e-services, which consist of various e-services provided by different e-service providers, are complex processes that require the cooperation among cross-organizational e-service providers. The flexibility and success of e-business depend on effective knowledge support to access related information resources of composite e-services. Thus, providing effective knowledge support for accessing composite e-services is a challenging task. This work proposes a knowledge map platform to provide an effective knowledge support for utilizing composite e-services. A data mining approach is applied to extract knowledge patterns from the usage records of composite e-services. Based on the mining result, topic maps are employed to construct the knowledge map. Meanwhile, the proposed knowledge map is integrated with recommendation capability to generate recommendations for composite e-services via data mining and collaborative filtering techniques. A prototype system is implemented to demonstrate the proposed platform. The proposed knowledge map enhanced with recommendation capability can provide users customized decision support to effectively utilize composite e-services.

Keywords: Composite e-service
[207] Bernard Kamsu-Foguem, Fabien Rigal, and Félix Mauget. Mining association rules for the quality improvement of the production process. Expert Systems with Applications, 40(4):1034 - 1045, 2013. [ bib | DOI | http ]
Academics and practitioners have a common interest in the continuing development of methods and computer applications that support or perform knowledge-intensive engineering tasks. Operations management dysfunctions and lost production time are problems of enormous magnitude that impact the performance and quality of industrial systems as well as their cost of production. Association rule mining is a data mining technique used to extract useful and valuable information from huge databases. This work develops a better conceptual base for improving the application of association rule mining methods to extract knowledge on operations and information management. The emphasis of the paper is on the improvement of the operations processes. The application example details an industrial experiment in which association rule mining is used to analyze the manufacturing process of a fully integrated provider of drilling products. The study reports some new interesting results with data mining and knowledge discovery techniques applied to a drill production process. Experimental results on real-life data sets show that the proposed approach is useful in finding effective knowledge associated with the causes of dysfunctions.

Keywords: Data mining
[208] Dnyanesh G. Rajpathak. An ontology based text mining system for knowledge discovery from the diagnosis data in the automotive domain. Computers in Industry, 64(5):565 - 580, 2013. [ bib | DOI | http ]
In the automotive domain, an overwhelming volume of textual data is recorded in the form of repair verbatim collected during the fault diagnosis (FD) process. Here, the aim of the knowledge discovery using text mining (KDT) task is to discover the best-practice repair knowledge from millions of repair verbatim, enabling accurate FD. However, the complexity of the {KDT} problem is largely due to the fact that a significant amount of relevant knowledge is buried in noisy and unstructured verbatim. In this paper, we propose a novel ontology-based text mining system, which uses the diagnosis ontology for annotating key terms recorded in the repair verbatim. The annotated terms are extracted in different tuples, which are used to identify the field anomalies. The extracted tuples are further used by the frequently co-occurring clustering algorithm to cluster the repair verbatim data such that the best-practice repair actions used to fix commonly observed symptoms associated with the faulty parts can be discovered. The performance of our system has been validated using real-world data, and it has been successfully implemented in a web-based distributed architecture in a real-life industrial setting.

Keywords: Text mining
[209] D. Coomans, C. Smyth, I. Lee, T. Hancock, and J. Yang. 2.26 - unsupervised data mining: Introduction. In Steven D. Brown, Romá Tauler, and Beata Walczak, editors, Comprehensive Chemometrics, pages 559 - 576. Elsevier, Oxford, 2009. [ bib | DOI | http ]
This chapter focuses on cluster analysis in the context of unsupervised data mining. Various facets of cluster analysis, including proximities, are discussed in detail. Techniques of determining the natural number of clusters are described. Finally, techniques of assessing cluster accuracy and reproducibility are detailed. Techniques mentioned in this chapter are expanded upon in the following chapters.

Keywords: cluster analysis
[210] Anne Baumgrass and Mark Strembeck. Bridging the gap between role mining and role engineering via migration guides. Information Security Technical Report, 17(4):148 - 172, 2013. Special Issue: {ARES} 2012 7th International Conference on Availability, Reliability and Security. [ bib | DOI | http ]
In the context of role-based access control (RBAC), mining approaches, such as role mining or organizational mining, can be applied to derive permissions and roles from a system's configuration or from log files. In this way, mining techniques document the current state of a system and produce current-state {RBAC} models. However, such current-state {RBAC} models most often follow from structures that have evolved over time and are not the result of a systematic rights management procedure. In contrast, role engineering is applied to define a tailored {RBAC} model for a particular organization or information system. Thus, role engineering techniques produce a target-state {RBAC} model that is customized for the business processes supported via the respective information system. The migration from a current-state {RBAC} model to a tailored target-state {RBAC} model is, however, a complex task. In this paper, we present a systematic approach to migrate current-state {RBAC} models to target-state {RBAC} models. In particular, we use model comparison techniques to identify differences between two {RBAC} models. Based on these differences, we derive migration rules that define which elements and element relations must be changed, added, or removed. A migration guide then includes all migration rules that need to be applied to a particular current-state {RBAC} model to produce the corresponding target-state {RBAC} model. We conducted two comparative studies to identify which visualization technique is most suitable to make migration guides available to human users. Based on the results of these comparative studies, we implemented tool support for the derivation and visualization of migration guides. Our software tool is based on the Eclipse Modeling Framework (EMF). Moreover, this paper describes the experimental evaluation of our tool.

Keywords: {RBAC}
[211] Abd. Samad Hasan Basari, Burairah Hussin, I. Gede Pramudya Ananta, and Junta Zeniarja. Opinion mining of movie review using hybrid method of support vector machine and particle swarm optimization. Procedia Engineering, 53(0):453 - 462, 2013. Malaysian Technical Universities Conference on Engineering & Technology 2012, {MUCET} 2012. [ bib | DOI | http ]
Nowadays, online social media is an online discourse where people contribute to creating content, sharing it, bookmarking it, and networking at an impressive rate. Among the fastest and easiest-to-use social media services today is Twitter. The messages on Twitter include reviews and opinions on certain topics such as movies, books, products, politics, and so on. Based on this condition, this research attempts to use the messages of Twitter to review a movie by using opinion mining or sentiment analysis. Opinion mining refers to the application of natural language processing, computational linguistics, and text mining to identify or classify whether the movie is good or not based on message opinion. Support Vector Machine (SVM) is a supervised learning method that analyzes data and recognizes the patterns that are used for classification. This research concerns binary classification into two classes, positive and negative. The positive class indicates good message opinion; the negative class indicates bad message opinion of certain movies. Classification performance is judged by the accuracy level of the {SVM}, validated using 10-fold cross-validation and a confusion matrix. Particle Swarm Optimization (PSO) is hybridized with the {SVM} to improve the selection of the best parameters and to solve the dual optimization problem. The results show an improvement in the accuracy level from 71.87% to 77%.
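
A hedged sketch of the general recipe follows: TF-IDF features, an RBF-kernel SVM scored with 10-fold cross-validation, and a very small particle swarm searching log10(C) and log10(gamma). The toy tweets, the search ranges and every PSO hyperparameter are assumptions for illustration, not the authors' settings, and scikit-learn is assumed to be available.

import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy labelled "tweets" (1 = positive, 0 = negative), invented for illustration.
positives = ["great film %d, loved the acting and the story" % i for i in range(10)]
negatives = ["boring film %d, awful plot and weak acting" % i for i in range(10)]
texts, labels = positives + negatives, [1] * 10 + [0] * 10
X = TfidfVectorizer().fit_transform(texts)

def fitness(log_c, log_gamma):
    """Mean 10-fold cross-validation accuracy for one (C, gamma) candidate."""
    clf = SVC(C=10 ** log_c, gamma=10 ** log_gamma, kernel="rbf")
    return cross_val_score(clf, X, labels, cv=10).mean()

def pso(n_particles=5, iters=5, w=0.7, c1=1.5, c2=1.5):
    """A very small particle swarm over log10(C) in [-1, 3] and log10(gamma) in [-4, 0]."""
    random.seed(0)
    pos = [[random.uniform(-1, 3), random.uniform(-4, 0)] for _ in range(n_particles)]
    vel = [[0.0, 0.0] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_f = [fitness(*p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(2):
                vel[i][d] = (w * vel[i][d]
                             + c1 * random.random() * (pbest[i][d] - pos[i][d])
                             + c2 * random.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(*pos[i])
            if f > pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f > gbest_f:
                    gbest, gbest_f = pos[i][:], f
    return gbest, gbest_f

best_params, best_acc = pso()
print("best log10(C), log10(gamma):", best_params, "cv accuracy:", best_acc)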

Keywords: Opinion
[212] Safaà Hachana, Frédéric Cuppens, Nora Cuppens-Boulahia, and Joaquin Garcia-Alfaro. Semantic analysis of role mining results and shadowed roles detection. Information Security Technical Report, 17(4):131 - 147, 2013. Special Issue: {ARES} 2012 7th International Conference on Availability, Reliability and Security. [ bib | DOI | http ]
The use of role engineering has grown in importance with the expansion of highly abstracted access control frameworks in organizations. In particular, the use of role mining techniques for the discovery of roles from previously deployed authorizations has facilitated the configuration of such frameworks. However, the literature lacks a clear basis for appraising and leveraging the learning outcomes of the role mining process. In this paper, we provide such a formal basis. We compare sets of roles by projecting roles from one set into the other set. This approach is useful to measure how comparable the two configurations of roles are, and to interpret each role. We formally define the problem of comparing sets of roles, and prove that the problem is NP-complete. Then, we propose an algorithm to map the inherent relationship between the sets based on Boolean expressions. We demonstrate the correctness and completeness of our solution, and investigate some further issues that may benefit from our approach, such as detection of unhandled perturbations or source misconfiguration. In particular, we emphasize that the presence of shadowed roles in the role configuration increases the time complexity of comparing sets of roles. We provide a definition of the shadowed roles problem and propose a solution that detects different cases of role shadowing.

Keywords: Access control
[213] Wu He. Examining students’ online interaction in a live video streaming environment using data mining and text mining. Computers in Human Behavior, 29(1):90 - 102, 2013. Including Special Section Youth, Internet, and Wellbeing. [ bib | DOI | http ]
This study analyses the online questions and chat messages automatically recorded by a live video streaming (LVS) system using data mining and text mining techniques. We applied data mining and text mining techniques to analyze two different datasets and then conducted an in-depth correlation analysis for the two educational courses with the most online questions and chat messages, respectively. The study found discrepancies as well as similarities in the students’ patterns and themes of participation between online questions (student–instructor interaction) and online chat messages (student–student interaction or peer interaction). The results also identify disciplinary differences in students’ online participation. A correlation is found between the number of online questions students asked and students’ final grades. The data suggest that a combination of data mining and text mining techniques applied to a large amount of online learning data can yield considerable insights and reveal valuable patterns in students’ learning behaviors. Limitations of data and text mining are also revealed and discussed in the paper.

Keywords: Educational data mining
[214] Ying-Ho Liu and Chun-Sheng Wang. Constrained frequent pattern mining on univariate uncertain data. Journal of Systems and Software, 86(3):759 - 778, 2013. [ bib | DOI | http ]
In this paper, we propose a new algorithm called CUP-Miner (Constrained Univariate Uncertain Data Pattern Miner) for mining frequent patterns from univariate uncertain data under user-specified constraints. The discovered frequent patterns are called constrained frequent {U2} patterns (where “U2” represents “univariate uncertain”). In univariate uncertain data, each attribute in a transaction is associated with a quantitative interval and a probability density function. The CUP-Miner algorithm is implemented in two phases: In the first phase, a U2P-tree (Univariate Uncertain Pattern tree) is constructed by compressing the target database transactions into a compact tree structure. Then, in the second phase, the constrained frequent {U2} pattern is enumerated by traversing the U2P-tree with different strategies that correspond to different types of constraints. The algorithm speeds up the mining process by exploiting five constraint properties: succinctness, anti-monotonicity, monotonicity, convertible anti-monotonicity, and convertible monotonicity. Our experimental results demonstrate that CUP-Miner outperforms the modified {CAP} algorithm, the modified {FIC} algorithm, the modified U2P-Miner algorithm, and the modified Apriori algorithm.

Keywords: Frequent pattern mining
[215] Simon Fong. 18 - opportunities and challenges of integrating bio-inspired optimization and data mining algorithms. In Xin-She Yang, Zhihua Cui, Renbin Xiao, Amir Hossein Gandomi, and Mehmet Karamanoglu, editors, Swarm Intelligence and Bio-Inspired Computation, pages 385 - 402. Elsevier, Oxford, 2013. [ bib | DOI | http ]
Data mining has evolved from methods of simple statistical analysis to complex pattern recognition in the past decades. During this progression, data mining algorithms have been modified or extended in order to overcome specific problems. This chapter discusses the prospects of improving data mining algorithms by integrating bio-inspired optimization, which has lately captivated much of researchers’ attention. In particular, high dimensionality and the unavailability of the whole data set (as in stream mining) in the training data are known to be two major challenges. Through two small examples, K-means clustering and time-series classification, we demonstrate that these two challenges can be overcome by integrating data mining and bio-inspired algorithms.

Keywords: Bio-inspired
[216] Noora Nenonen. Analysing factors related to slipping, stumbling, and falling accidents at work: Application of data mining methods to finnish occupational accidents and diseases statistics database. Applied Ergonomics, 44(2):215 - 224, 2013. [ bib | DOI | http ]
The utilisation of data mining methods has become common in many fields. In occupational accident analysis, however, these methods are still rarely exploited. This study applies methods of data mining (decision tree and association rules) to the Finnish national occupational accidents and diseases statistics database to analyse factors related to slipping, stumbling, and falling (SSF) accidents at work from 2006 to 2007. {SSF} accidents at work constitute a large proportion (22%) of all accidents at work in Finland. In addition, they are more likely to result in longer periods of incapacity for work than other workplace accidents. The most important factor influencing whether or not an accident at work is related to {SSF} is the specific physical activity of movement. In addition, the risk of {SSF} accidents at work seems to depend on the occupation and the age of the worker. The results were in line with previous research. Hence the application of data mining methods was considered successful. The results did not reveal anything unexpected though. Nevertheless, because of the capability to illustrate a large dataset and relationships between variables easily, data mining methods were seen as a useful supplementary method in analysing occupational accident data.
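
For readers unfamiliar with the techniques named above, a toy scikit-learn decision tree on fabricated accident records is shown below; the variables loosely mirror those mentioned in the abstract, but the data, categories and resulting splits are invented.

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Fabricated accident records; 1 = slipping/stumbling/falling (SSF), 0 = other.
records = [
    {"activity": "movement", "occupation": "construction", "age": "50+"},
    {"activity": "movement", "occupation": "services", "age": "50+"},
    {"activity": "handling", "occupation": "construction", "age": "30-49"},
    {"activity": "handling", "occupation": "services", "age": "under30"},
]
is_ssf = [1, 1, 0, 0]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, is_ssf)
print(export_text(tree, feature_names=list(vec.get_feature_names_out())))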

Keywords: Occupational accident
[217] Wu He. Improving user experience with case-based reasoning systems using text mining and web 2.0. Expert Systems with Applications, 40(2):500 - 507, 2013. [ bib | DOI | http ]
Many {CBR} systems have been developed in the past. However, currently many {CBR} systems are facing a sustainability issue such as outdated cases and stagnant case growth. Some {CBR} systems have fallen into disuse due to the lack of new cases, case update, user participation and user engagement. To encourage the use of {CBR} systems and give users better experience, {CBR} system developers need to come up with new ways to add new features and values to the {CBR} systems. The author proposes a framework to use text mining and Web 2.0 technologies to improve and enhance {CBR} systems for providing better user experience. Two case studies were conducted to evaluate the usefulness of text mining techniques and Web 2.0 technologies for enhancing a large scale {CBR} system. The results suggest that text mining and Web 2.0 are promising ways to bring additional values to {CBR} and they should be incorporated into the {CBR} design and development process for the benefit of {CBR} users.

Keywords: Case-based reasoning
[218] Ashis Kumar Chanda, Swapnil Saha, Manziba Akanda Nishi, Md. Samiullah, and Chowdhury Farhan Ahmed. An efficient approach to mine flexible periodic patterns in time series databases. Engineering Applications of Artificial Intelligence, 44(0):46 - 63, 2015. [ bib | DOI | http ]
Periodic pattern mining in time series databases is one of the most interesting data mining problems and appears frequently in many real-life applications. Some of the existing approaches find fixed-length periodic patterns by using a suffix tree structure and are thus unable to mine flexible patterns. One existing approach generates periodic patterns by skipping intermediate events, i.e., flexible patterns, using an apriori-based sequential pattern mining approach. Since apriori-based approaches suffer from the issues of huge candidate generation and a large percentage of false pattern pruning, we propose an efficient algorithm, {FPPM} (Flexible Periodic Pattern Mining), using a suffix trie data structure. The proposed algorithm can capture more effective variable-length flexible periodic patterns by neglecting unimportant or undesired events and considering only the important events in an efficient way. To the best of our knowledge, ours is the first approach that simultaneously handles various starting positions throughout the sequences, flexibility among events in the mined patterns, and interactive tuning of period values on the go. Complexity analysis of the proposed approach and comparison with existing approaches, along with analytical comparison on various issues, have been performed. In addition, extensive experimental analyses are conducted to evaluate the performance of the proposed {FPPM} algorithm using real-life datasets. The proposed approach outperforms existing algorithms in terms of processing time, scalability, and quality of mined patterns.

Keywords: Data mining
[219] Igor Santos, Felix Brezo, Xabier Ugarte-Pedrero, and Pablo G. Bringas. Opcode sequences as representation of executables for data-mining-based unknown malware detection. Information Sciences, 231(0):64 - 82, 2013. Data Mining for Information Security. [ bib | DOI | http ]
Malware can be defined as any type of malicious code that has the potential to harm a computer or network. The volume of malware is growing faster every year and poses a serious global security threat. Consequently, malware detection has become a critical topic in computer security. Currently, signature-based detection is the most widespread method used in commercial antivirus. In spite of the broad use of this method, it can detect malware only after the malicious executable has already caused damage and provided the malware is adequately documented. Therefore, the signature-based method consistently fails to detect new malware. In this paper, we propose a new method to detect unknown malware families. This model is based on the frequency of the appearance of opcode sequences. Furthermore, we describe a technique to mine the relevance of each opcode and assess the frequency of each opcode sequence. In addition, we provide empirical validation that this new method is capable of detecting unknown malware.
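
The representation idea can be sketched as follows, with invented opcode traces, a sequence length of two, and an off-the-shelf classifier standing in for the paper's relevance weighting; none of this is the authors' implementation.

from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier

def opcode_ngrams(trace, n=2):
    """Relative frequency of each length-n opcode sequence in one executable."""
    grams = Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))
    total = sum(grams.values()) or 1
    return {" ".join(g): c / total for g, c in grams.items()}

# Toy traces: two "executables" with known labels (0 = benign, 1 = malicious).
traces = [
    ["push", "mov", "call", "ret"],
    ["mov", "xor", "jmp", "xor", "jmp"],
]
labels = [0, 1]

features = [opcode_ngrams(t) for t in traces]
X = DictVectorizer(sparse=False).fit_transform(features)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)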

Keywords: Malware detection
[220] Jochen De Weerdt, Annelies Schupp, An Vanderloock, and Bart Baesens. Process mining for the multi-faceted analysis of business processes—a case study in a financial services organization. Computers in Industry, 64(1):57 - 67, 2013. [ bib | DOI | http ]
Most organizations have some kind of process-oriented information system that keeps track of business events. Process Mining starts from event logs extracted from these systems in order to discover, analyze, diagnose and improve processes, organizational, social and data structures. Notwithstanding the large number of contributions to the process mining literature over the last decade, the number of studies actually demonstrating the applicability and value of these techniques in practice has been limited. As a consequence, there is a need for real-life case studies suggesting methodologies to conduct process mining analysis and to show the benefits of its application in real-life environments. In this paper we present a methodological framework for a multi-faceted analysis of real-life event logs based on Process Mining. As such, we demonstrate the usefulness and flexibility of process mining techniques to expose organizational inefficiencies in a real-life case study that is centered on the back office process of a large Belgian insurance company. Our analysis shows that process mining techniques constitute an ideal means to tackle organizational challenges by suggesting process improvements and creating a company-wide process awareness.

Keywords: Process Mining
[221] Stavros Valsamidis, Theodosios Theodosiou, Ioannis Kazanidis, and Michael Nikolaidis. A framework for opinion mining in blogs for agriculture. Procedia Technology, 8(0):264 - 274, 2013. 6th International Conference on Information and Communication Technologies in Agriculture, Food and Environment (HAICTA 2013). [ bib | DOI | http ]
In recent years there has been much talk about blogging and the way in which blogs influence media and change the way people communicate and share knowledge. Blogs are also at the center of commercial attention, and many academics carry out research on them. Furthermore, blogs represent an important new arena for knowledge discovery in the agriculture sector, since farmers use them for professional reasons. Opinion mining is a kind of text mining. Its goal is to assess the attitude of the author with respect to a given subject. The attitude may be a positive or negative opinion. This paper outlines the challenges and opportunities of blogs for agriculture in terms of analyzing the information that is stored in them. Opinion mining is applied to an experimental blog with the aid of the RapidMiner software. This framework may thus help establish baselines for opinion mining tasks in agriculture.

Keywords: blogs
[222] Chandrika Kamath and Ya Ju Fan. Chapter 2 - data mining in materials science and engineering. In Krishna Rajan, editor, Informatics for Materials Science and Engineering, pages 17 - 36. Butterworth-Heinemann, Oxford, 2013. [ bib | DOI | http ]
Data mining is the process of uncovering patterns, associations, anomalies, and statistically significant structures and events in data. It borrows and builds on ideas from many disciplines, ranging from statistics to machine learning, mathematical optimization, and signal and image processing. Data mining techniques are becoming an integral part of scientific endeavors in many application domains, including astronomy, bioinformatics, chemistry, materials science, climate, fusion, and combustion. In this chapter, we provide a brief introduction to the data mining process and some of the algorithms used in extracting information from scientific data sets.

Keywords: Data mining
[223] Filip Caron, Jan Vanthienen, and Bart Baesens. Comprehensive rule-based compliance checking and risk management with process mining. Decision Support Systems, 54(3):1357 - 1369, 2013. [ bib | DOI | http ]
Process mining researchers have primarily focused on developing and improving process discovery techniques, while attention to the applicability of process mining has been below par. As a result, there is only a partial fit with the traditional requirements for compliance checking and risk management. This paper proposes a comprehensive rule-based process mining approach for a timely investigation of a complete set of enriched process event data. Additionally, the contribution elaborates a two-dimensional business rule taxonomy that serves as a source of business rules for the comprehensive rule-based compliance checking approach. Finally, the study provides a formal grounding for and an evaluation of the comprehensive rule-based compliance checking approach.

Keywords: Business rules
[224] Anoop Verma, Xiupeng Wei, and Andrew Kusiak. Predicting the total suspended solids in wastewater: A data-mining approach. Engineering Applications of Artificial Intelligence, 26(4):1366 - 1372, 2013. [ bib | DOI | http ]
Total suspended solids (TSS) are a major pollutant that affects waterways all over the world. Predicting the values of {TSS} is of interest to quality control of wastewater processing. Due to infrequent measurements, time series data for {TSS} are constructed using influent flow rate and influent carbonaceous bio-chemical oxygen demand (CBOD). We investigated different scenarios of daily average influent {CBOD} and influent flow rate measured at 15 min intervals. Then, we used five data-mining algorithms, i.e., multi-layered perceptron, k-nearest neighbor, multi-variate adaptive regression spline, support vector machine, and random forest, to construct day-ahead, time-series prediction models for TSS. Historical {TSS} values were used as input parameters to predict current and future values of TSS. A sliding-window approach was used to improve the results of the predictions.
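
A minimal sketch of the sliding-window, day-ahead setup described above is given here; the TSS values are made up, the window width is arbitrary, and a random forest stands in for any one of the five algorithms compared in the paper.

from sklearn.ensemble import RandomForestRegressor

# Toy daily TSS values in mg/L, invented for illustration.
tss = [210, 198, 240, 260, 230, 215, 250, 270, 245, 233, 258, 266]

def make_windows(series, width=3):
    """Use the previous `width` observations to predict the next day."""
    X = [series[i:i + width] for i in range(len(series) - width)]
    y = [series[i + width] for i in range(len(series) - width)]
    return X, y

X, y = make_windows(tss)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("day-ahead TSS estimate:", model.predict([tss[-3:]])[0])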

Keywords: Total suspended solids
[225] Peter Wittek and Sándor Darányi. Accelerating text mining workloads in a mapreduce-based distributed {GPU} environment. Journal of Parallel and Distributed Computing, 73(2):198 - 206, 2013. [ bib | DOI | http ]
Scientific computations have been using GPU-enabled computers successfully, often relying on distributed nodes to overcome the limitations of device memory. Only a handful of text mining applications benefit from such infrastructure. Since the initial steps of text mining are typically data intensive, and the ease of deployment of algorithms is an important factor in developing advanced applications, we introduce a flexible, distributed, MapReduce-based text mining workflow that performs I/O-bound operations on {CPUs} with industry-standard tools and then runs compute-bound operations on {GPUs} which are optimized to ensure coalesced memory access and effective use of shared memory. We have performed extensive tests of our algorithms on a cluster of eight nodes with two {NVidia} Tesla {M2050s} attached to each, and we achieve considerable speedups for random projection and self-organizing maps.

Keywords: {GPU} computing
[226] Yu-Shiang Hung, Kuei-Ling B. Chen, Chi-Ta Yang, and Guang-Feng Deng. Web usage mining for analysing elder self-care behavior patterns. Expert Systems with Applications, 40(2):775 - 783, 2013. [ bib | DOI | http ]
The rapid growth of the elderly population has increased the need to support elders in maintaining independent and healthy lifestyles in their homes rather than through more expensive and isolated care facilities. Self-care can improve the competence of elderly participants in managing their own health conditions without leaving home. The main purpose of this study is to understand the self-care behavior of elderly participants in a developed self-care service system and to analyze the daily self-care activities and health status of elders who live at home alone. To understand elder self-care patterns, log data from actual cases of elder self-care service were collected and analysed by Web usage mining. This study analysed 3391 sessions of 157 elders for the month of March 2012. First, self-care use cycle, time, function numbers, and the depth and extent (range) of services were statistically analysed. Association rules were then used for data mining to find relationships between these functions of self-care behavior. Second, data from interest-based representation schemes were used to construct elder sessions. The ART2-enhanced K-means algorithm was then used to mine cluster patterns. Finally, sequential profiles for elder self-care behavior patterns were captured by applying sequence-based representation schemes in association with Markov models and ART2-enhanced K-means clustering algorithms to mine cluster patterns of sequential behavior for the elders. The analysis results can be used for research in medicine, public health, nursing and psychology and for policy-making in the health care domain.
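
One ingredient mentioned above, a first-order Markov model over a user's sequence of self-care functions, can be sketched as follows; the session and function names are made-up examples rather than data from the study.

from collections import defaultdict

session = ["login", "blood_pressure", "diet_log", "blood_pressure", "logout"]

# Count observed transitions between consecutive functions in the session.
transitions = defaultdict(lambda: defaultdict(int))
for cur, nxt in zip(session, session[1:]):
    transitions[cur][nxt] += 1

# Normalise counts into transition probabilities P(next function | current function).
markov = {}
for cur, nxts in transitions.items():
    total = sum(nxts.values())
    markov[cur] = {nxt: c / total for nxt, c in nxts.items()}

print(markov["blood_pressure"])  # {'diet_log': 0.5, 'logout': 0.5}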

Keywords: Elder self-care behavior pattern
[227] Stephen Won, Inkeun Cho, Kishan Sudusinghe, Jie Xu, Yu Zhang, Mihaela van der Schaar, and Shuvra S. Bhattacharyya. A design methodology for distributed adaptive stream mining systems. Procedia Computer Science, 18(0):2482 - 2491, 2013. 2013 International Conference on Computational Science. [ bib | DOI | http ]
Data-driven, adaptive computations are key to enabling the deployment of accurate and efficient stream mining systems, which invoke suitably configured queries in real-time on streams of input data. Due to the physical separation among data sources and computational resources, it is often necessary to deploy such stream mining systems in a distributed fashion, where local learners have access to disjoint subsets of the data that is to be mined, and forward their intermediate results to an ensemble learner that combines the results from the local learners. In this paper, we develop a design methodology for integrated design, simulation, and implementation of dynamic data-driven adaptive stream mining systems. By systematically integrating considerations associated with local embedded processing, classifier configuration, data-driven adaptation and networked communication, our approach allows for effective assessment, prototyping, and implementation of alternative distributed design methods for data-driven, adaptive stream mining systems. We demonstrate our results on a dynamic data-driven application involving patient health care monitoring.

Keywords: Adaptive stream mining
[228] Yongli Li, Chong Wu, Xudong Wang, and Shitang Wu. A tree-network model for mining short message services seed users and its empirical analysis. Knowledge-Based Systems, 40(0):50 - 57, 2013. [ bib | DOI | http ]
Identifying short message service (SMS) seed users helps to discover the origins of information and its transmission paths. A tree-network model was proposed to depict the characteristics of {SMS} seed users, who exhibit three features: “ahead of time”, “mass texting” and “numerous retransmissions”. To obtain the width and depth of the established network model, a density-based clustering algorithm was adopted and a recursion algorithm was designed to solve these problems. An objective, comprehensive and scale-free evaluation function was further presented to rank the potential seed users by using the width and the depth obtained above. Furthermore, an empirical analysis of the model was made based on part of Shenzhen’s cell phone {SMS} data from February 2012. The model is effective and applicable as a powerful tool to solve the {SMS} seed users’ mining problem.

Keywords: Network analysis
[229] Fermín L. Cruz, José A. Troyano, Fernando Enríquez, F. Javier Ortega, and Carlos G. Vallejo. ‘long autonomy or long delay?’ the importance of domain in opinion mining. Expert Systems with Applications, 40(8):3174 - 3184, 2013. [ bib | DOI | http ]
Nowadays, people do not only navigate the web, but they also contribute contents to the Internet. Among other things, they write their thoughts and opinions in review sites, forums, social networks, blogs and other websites. These opinions constitute a valuable resource for businesses, governments and consumers. In the last years, some researchers have proposed opinion extraction systems, mostly domain-independent ones, to automatically extract structured representations of opinions contained in those texts. In this work, we tackle this task in a domain-oriented approach, defining a set of domain-specific resources which capture valuable knowledge about how people express opinions on a given domain. These resources are automatically induced from a set of annotated documents. Some experiments were carried out on three different domains (user-generated reviews of headphones, hotels and cars), comparing our approach to other state-of-the-art, domain-independent techniques. The results confirm the importance of the domain in order to build accurate opinion extraction systems. Some experiments on the influence of the dataset size and an example of aggregation and visualization of the extracted opinions are also shown.

Keywords: Sentiment analysis
[230] Yu-Chiun Chiou, Lawrence W. Lan, and Wen-Pin Chen. A two-stage mining framework to explore key risk conditions on one-vehicle crash severity. Accident Analysis & Prevention, 50(0):405 - 415, 2013. [ bib | DOI | http ]
This paper proposes a two-stage mining framework to explore the key risk conditions that may have contributed to the one-vehicle crash severity in Taiwan's freeways. In the first stage, a genetic mining rule (GMR) model is developed, using a novel stepwise rule-mining algorithm, to identify the potential risk conditions that best elucidate the one-vehicle crash severity. In the second stage, a mixed logit model is estimated, using the antecedent part of the mined-rules as explanatory variables, to test the significance of the risk conditions. A total of 5563 one-vehicle crash cases (226 fatalities, 1593 injuries and 3744 property losses) occurred in Taiwan's freeways over 2003–2007 are analyzed. The {GMR} model has mined 29 rules for use. By incorporating these 29 mined-rules into a mixed logit model, we further identify one key safe condition and four key risk conditions leading to serious crashes (i.e., fatalities and injuries). Each key risk condition is discussed and compared with its adjacent rules. Based on the findings, some countermeasures to rectify the freeway's serious one-vehicle crashes are proposed.

Keywords: Crash severity
[231] Amina Madani, Omar Boussaid, and Djamel Eddine Zegour. Semi-structured documents mining: A review and comparison. Procedia Computer Science, 22(0):330 - 339, 2013. 17th International Conference in Knowledge Based and Intelligent Information and Engineering Systems - {KES2013}. [ bib | DOI | http ]
The number of semi-structured documents that are produced is steadily increasing. Thus, discovering new knowledge from them will be essential. In this survey paper, we review popular semi-structured document mining approaches (structure alone, and both structure and content). We provide a brief description of each technique as well as efficient algorithms for implementing it, and compare the approaches using different criteria.

Keywords: Semi-structured documents
[232] Kwang-Il Ahn. Effective product assignment based on association rule mining in retail. Expert Systems with Applications, 39(16):12551 - 12556, 2012. [ bib | DOI | http ]
Much academic research has been conducted on the process of association rule mining. More effort is now required for practical application of association rules in various commercial fields. A potential application of association rule mining is the problem of product assignment in retail. The product assignment problem involves how to most effectively assign items to sites in retail stores to grow sales. Effective product assignment facilitates cross-selling and convenient shopping for customers to promote maximum sales for retailers. However, little practical research has been done to address the issue. The current study approaches the product assignment problem using association rule mining for retail environments. There are some barriers to overcome in applying association rule mining to the product assignment problem for retail. This study applies some generalization to overcome drawbacks caused by the short lifecycles of current products. As a measure of cross-selling, lift is used to compare the effectiveness of various assignments for products. The proposed algorithm consists of three processes: mining associations among items, nearest-neighbor assignment, and updating assignments. The algorithm was tested on synthetic databases. The results show very effective product assignment in terms of the potential for cross-selling to drive maximum sales for retailers.
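
As a small, self-contained illustration of the lift measure used above for ranking cross-selling candidates, the following computes lift for item pairs in invented baskets; it is not the proposed three-step assignment algorithm itself.

from itertools import combinations
from collections import Counter

baskets = [
    {"pasta", "tomato sauce", "parmesan"},
    {"pasta", "tomato sauce"},
    {"bread", "butter"},
    {"pasta", "parmesan"},
]
n = len(baskets)
item_count = Counter(item for basket in baskets for item in basket)
pair_count = Counter(frozenset(p) for basket in baskets for p in combinations(sorted(basket), 2))

# lift(A, B) = support(A and B) / (support(A) * support(B));
# pairs with lift > 1 co-occur more often than chance and are candidates for nearby placement.
for pair, count in pair_count.items():
    a, b = tuple(pair)
    lift = (count / n) / ((item_count[a] / n) * (item_count[b] / n))
    if lift > 1.0:
        print(sorted(pair), "lift = %.2f" % lift)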

Keywords: Data mining
[233] Edison Marrese-Taylor, Juan D. Velásquez, Felipe Bravo-Marquez, and Yutaka Matsuo. Identifying customer preferences about tourism products using an aspect-based opinion mining approach. Procedia Computer Science, 22(0):182 - 191, 2013. 17th International Conference in Knowledge Based and Intelligent Information and Engineering Systems - {KES2013}. [ bib | DOI | http ]
In this study we extend Bing Liu's aspect-based opinion mining technique to apply it to the tourism domain. Using this extension, we also offer a new alternative approach to discovering consumer preferences about tourism products, particularly hotels and restaurants, using opinions available on the Web as reviews. An experiment is also conducted, using hotel and restaurant reviews obtained from TripAdvisor, to evaluate our proposals. Results showed that tourism product reviews available on web sites contain valuable information about customer preferences that can be extracted using an aspect-based opinion mining approach. The proposed approach proved to be very effective in determining the sentiment orientation of opinions, achieving a precision and recall of 90%. However, on average, the algorithms were only capable of extracting 35% of the explicit aspect expressions.

Keywords: opinion mining
[234] Cristian I. Pinzón, Juan F. De Paz, Álvaro Herrero, Emilio Corchado, Javier Bajo, and Juan M. Corchado. idmas-sql: Intrusion detection based on {MAS} to detect and block {SQL} injection through data mining. Information Sciences, 231(0):15 - 31, 2013. Data Mining for Information Security. [ bib | DOI | http ]
This study presents a multiagent architecture aimed at detecting {SQL} injection attacks, which are one of the most prevalent threats for modern databases. The proposed architecture is based on a hierarchical and distributed strategy where the functionalities are structured on layers. SQL-injection attacks, one of the most dangerous attacks to online databases, are the focus of this research. The agents in each one of the layers are specialized in specific tasks, such as data gathering, data classification, and visualization. This study presents two key agents under a hybrid architecture: a classifier agent that incorporates a Case-Based Reasoning engine employing advanced algorithms in the reasoning cycle stages, and a visualizer agent that integrates several techniques to facilitate the visual analysis of suspicious queries. The former incorporates a new classification model based on a mixture of a neural network and a Support Vector Machine in order to classify {SQL} queries in a reliable way. The latter combines clustering and neural projection techniques to support the visual analysis and identification of target attacks. The proposed approach was tested in a real-traffic case study and its experimental results, which validate the performance of the proposed approach, are presented in this paper.

Keywords: Intrusion Detection
[235] Yi-Ning Tu and Jia-Lang Seng. Indices of novelty for emerging topic detection. Information Processing & Management, 48(2):303 - 325, 2012. [ bib | DOI | http ]
Emerging topic detection is a vital research area for researchers and scholars interested in searching for and tracking new research trends and topics. The current methods of text mining and data mining used for this purpose focus only on how frequently subjects are mentioned, and ignore the novelty of the subject, which is also critical but beyond the scope of a frequency study. This work tackles this inadequacy by proposing a new set of indices for emerging topic detection: the novelty index (NI) and the published volume index (PVI). This new set of indices is based on time, volume, and frequency, and provides a more precise set of prediction indices. They are then utilized to determine the detection point (DP) of new emerging topics. Following the detection point, their intersection decides the worth of a new topic. The algorithms presented in this paper can be used to decide the novelty and life span of an emerging topic in a specific field. The entire comprehensive collection of the {ACM} Digital Library is examined in the experiments. The application of the {NI} and {PVI} gives a promising indication of emerging topics in conferences and journals.
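
The precise NI and PVI formulas are given in the full paper, not in this abstract; purely as an illustration of time- and volume-based indices, a hypothetical stand-in computation might look like this.

# Purely illustrative indices in the spirit of the abstract (time, volume, frequency);
# these are NOT the paper's NI/PVI definitions, only a stand-in for the idea.
def novelty_index(current_year, first_year_seen):
    """Newer terms score higher; a term first seen this year scores 1.0."""
    return 1.0 / (1 + current_year - first_year_seen)

def published_volume_index(docs_with_term, docs_total):
    """Share of the period's publications that mention the term."""
    return docs_with_term / docs_total if docs_total else 0.0

print(novelty_index(2012, 2011), published_volume_index(12, 480))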

Keywords: Topic detection and tracking
[236] Chung-Hong Lee. Unsupervised and supervised learning to evaluate event relatedness based on content mining from social-media streams. Expert Systems with Applications, 39(18):13338 - 13356, 2012. [ bib | DOI | http ]
Due to the explosive growth of social-media applications, enhancing event-awareness by social mining has become extremely important. The contents of microblogs preserve valuable information associated with past disastrous events and stories. To learn the experiences from past events for tackling emerging real-world events, in this work we utilize the social-media messages to characterize real-world events through mining their contents and extracting essential features for relatedness analysis. On one hand, we established an online clustering approach on Twitter microblogs for detecting emerging events, and meanwhile we performed event relatedness evaluation using an unsupervised clustering approach. On the other hand, we developed a supervised learning model to create extensible measure metrics for offline evaluation of event relatedness. By means of supervised learning, our developed measure metrics are able to compute relatedness of various historical events, allowing the event impacts on specified domains to be quantitatively measured for event comparison. By combining the strengths of both methods, the experimental results showed that the combined framework in our system is sensible for discovering more unknown knowledge about event impacts and enhancing event awareness.

Keywords: Stream mining
[237] Marco Quartulli and Igor G. Olaizola. A review of {EO} image information mining. {ISPRS} Journal of Photogrammetry and Remote Sensing, 75(0):11 - 28, 2013. [ bib | DOI | http ]
We analyze the state of the art of content-based retrieval in Earth observation image archives focusing on complete systems showing promise for operational implementation. The different paradigms at the basis of the main system families are introduced. The approaches taken are considered, focusing in particular on the phases after primitive feature extraction. The solutions envisaged for the issues related to feature simplification and synthesis, indexing, semantic labeling are reviewed. The methodologies for query specification and execution are evaluated. Conclusions are drawn on the state of published research in Earth observation (EO) mining.

Keywords: Remote sensing
[238] Enric Junqué de Fortuny, Tom De Smedt, David Martens, and Walter Daelemans. Media coverage in times of political crisis: A text mining approach. Expert Systems with Applications, 39(14):11616 - 11622, 2012. [ bib | DOI | http ]
At the end of 2011, Belgium formed a government after a world-record-breaking period of 541 days of negotiations. We gathered and analysed 68,000 related online news articles published in 2011 in Flemish newspapers. These articles were analysed by a custom-built expert system. The results of our text mining analyses show interesting differences in media coverage and votes for several political parties and politicians. With opinion mining, we are able to automatically detect the sentiment of each article, thereby allowing us to visualise how the tone of reporting evolved throughout the year at the party, politician and newspaper level. Our suggested framework introduces a generic text mining approach to analyse media coverage of political issues, including a set of methodological guidelines, evaluation metrics, as well as open source opinion mining tools. Since all analyses are based on automated text mining algorithms, an objective overview of the manner of reporting is provided. The analysis shows peaks of positive and negative sentiments during key moments in the negotiation process.

Keywords: Politics
[239] Hongzhou Sha, Tingwen Liu, Peng Qin, Yong Sun, and Qingyun Liu. Eplogcleaner: Improving data quality of enterprise proxy logs for efficient web usage mining. Procedia Computer Science, 17(0):812 - 818, 2013. First International Conference on Information Technology and Quantitative Management. [ bib | DOI | http ]
Data cleaning is an important step performed in the preprocessing stage of web usage mining, and is widely used in many data mining systems. Despite many efforts on data cleaning for web server logs, it is still an open question for enterprise proxy logs. With unlimited access to websites, enterprise proxy logs trace web requests from multiple clients to multiple web servers, which makes them quite different from web server logs in both location and content. Therefore, many irrelevant items such as software updating requests cannot be filtered out by traditional data cleaning methods. In this paper, we propose the first method, named {EPLogCleaner}, that can filter out plenty of irrelevant items based on the common prefix of their URLs. We evaluate {EPLogCleaner} with a real network traffic trace captured from one enterprise proxy. Experimental results show that {EPLogCleaner} can improve the data quality of enterprise proxy logs by filtering out more than 30% of {URL} requests compared with traditional data cleaning methods.
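
The prefix-based filtering idea can be sketched in a few lines; the prefix list and log-line format below are invented for illustration and are not taken from the paper.

# Hypothetical prefixes of machine-generated, irrelevant traffic (e.g. software updates).
IRRELEVANT_PREFIXES = (
    "http://update.example-antivirus.com/",
    "http://download.windowsupdate.example/",
)

def clean_proxy_log(lines):
    """Keep only requests whose URL does not start with a known noise prefix."""
    kept = []
    for line in lines:
        url = line.split()[-1]  # assume the URL is the last whitespace-separated field
        if not url.startswith(IRRELEVANT_PREFIXES):
            kept.append(line)
    return kept

log = [
    "10.0.0.5 GET http://update.example-antivirus.com/defs.cab",
    "10.0.0.7 GET http://news.example.org/article/42",
]
print(clean_proxy_log(log))  # only the news request survives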

Keywords: Web usage mining
[240] Staša Vujičić Stanković, Nemanja Kojić, Goran Rakočević, Duško Vitas, and Veljko Milutinović. Chapter 4 - a classification of data mining algorithms for wireless sensor networks, and classification extension to concept modeling in system of wireless sensor networks based on natural language processing. In Ali Hurson, editor, Connected Computing Environment, volume 90 of Advances in Computers, pages 223 - 283. Elsevier, 2013. [ bib | DOI | http ]
In this article, we propose one original classification and one extension thereof, which takes into consideration the relevant issues in Natural Language Processing. The newly introduced classification of Data Mining algorithms is on the level of a single Wireless Sensor Network and its extension to Concept Modeling on the level of a System of Wireless Sensor Networks. Most of the scientists in this field put emphasis on issues related to applications of Wireless Sensor Networks in different areas, while we here put emphasis on categorization of the selected approaches from the open literature, to help application designers/developers get a better understanding of their options in different areas. Our main goal is to provide a good starting point for a more effective analysis leading to possible new solutions, possible improvements of existing solutions, and possible combination of two or more of the existing solutions into new ones, using the hybridization principle. Another contribution of this article is a synergistic interdisciplinary review of problems in two areas: Data Mining and Natural Language Processing. This enables interoperability improvements on the interface between Wireless Sensor Networks that often share data in native natural languages.

Keywords: Data mining
[241] Sheng-Tun Li and Fu-Ching Tsai. A fuzzy conceptualization model for text mining with application in opinion polarity classification. Knowledge-Based Systems, 39(0):23 - 33, 2013. [ bib | DOI | http ]
Automatic text classification in text mining is a critical technique to manage huge collections of documents. However, most existing document classification algorithms are easily affected by ambiguous terms. The ability to disambiguate for a classifier is thus as important as the ability to classify accurately. In this paper, we propose a novel classification framework based on fuzzy formal concept analysis to conceptualize documents into a more abstract form of concepts, and use these as the training examples to alleviate the arbitrary outcomes caused by ambiguous terms. The proposed model is evaluated on a benchmark testbed and two opinion polarity datasets. The experimental results indicate superior performance in all datasets. Applying concept analysis to opinion polarity classification is a leading endeavor in the disambiguation of Web 2.0 contents, and the approach presented in this paper offers significant improvements on current methods. The results of the proposed model reveal its ability to decrease the sensitivity to noise, as well as its adaptability in cross domain applications.

Keywords: Text classification
[242] Ahmad A. Kardan and Mahnaz Ebrahimi. A novel approach to hybrid recommendation systems based on association rules mining for content recommendation in asynchronous discussion groups. Information Sciences, 219(0):93 - 110, 2013. [ bib | DOI | http ]
Recommender systems have been developed in a variety of domains, including asynchronous discussion groups, which are one of the most interesting ones. Due to the information overload and its varieties in discussion groups, it is difficult to draw out the relevant information. Therefore, recommender systems play an important role in filtering and customizing the desired information. Nowadays, collaborative and content-based filtering are the most adopted techniques being utilized in recommender systems. The collaborative filtering technique recommends items based on like-minded users’ opinions and users’ preferences. Alternatively, the aim of the content-based filtering technique is the identification of items which are similar to those a user has preferred in the past. To overcome the drawbacks of the aforementioned techniques, a hybrid recommender system combines two or more recommendation techniques to obtain more accuracy. The most important achievement of this study is to present a novel approach in hybrid recommendation systems, which identifies the user similarity neighborhood from implicit information being collected in a discussion group. In the proposed system, initially the association rules mining technique is applied to discover the similar users, and then the related posts are recommended to them. To select the appropriate contents in the transacted posts, it is necessary to focus on the concepts rather than the key words. Therefore, to locate semantically related concepts, a Word Sense Disambiguation strategy based on the WordNet lexical database is exploited. The experiments carried out on the discussion group datasets proved a noticeable improvement in the accuracy of useful posts recommended to the users in comparison to content-based and the collaborative filtering techniques as well.

Keywords: Word Sense Disambiguation (WSD)
[243] Changbo Wang, Yuhua Liu, Zhao Xiao, Aoying Zhou, and Kang Zhang. Analyzing internet topics by visualizing microblog retweeting. Journal of Visual Languages & Computing, 28(0):122 - 133, 2015. [ bib | DOI | http ]
Microblog is a large-scale information sharing platform where retweeting plays an important role in information diffusion. Analyzing retweeting evolutions can help in reasoning about the trend of public opinions. Information visualization techniques are used to demonstrate the retweeting behavior in order to understand how Internet topics diffuse on Microblogs. First, a graph clustering method is used to analyze the retweeting relationships among people of different occupations. Then a new algorithm based on electric field is proposed to visualize the layout of the relationship links. A prediction method based on three diffusion models is presented to predict the number of retweets over time. Finally, three real world case studies show the validity of our methods.

Keywords: Microblog retweeting
[244] Ilhami Colak, Seref Sagiroglu, and Mehmet Yesilbudak. Data mining and wind power prediction: A literature review. Renewable Energy, 46(0):241 - 247, 2012. [ bib | DOI | http ]
Wind power generated by wind turbines has a non-schedulable nature due to the stochastic nature of meteorological conditions. Hence, wind power predictions are required, from a few seconds to one week ahead, in turbine control, load tracking, pre-load sharing, power system management and energy trading. In order to overcome problems in the predictions, many different wind power prediction models have been proposed in the literature. Data mining and its applications have received more attention in recent years. This paper presents a review study based on very short-term, short-term, medium-term and long-term wind power predictions. The studies available in the literature have been evaluated and criticized in consideration of their prediction accuracies and deficiencies. It is shown that adaptive neuro-fuzzy inference systems, neural networks and multilayer perceptrons give better results in wind power predictions.

Keywords: Data mining
[245] Célia Ghedini Ralha and Carlos Vinícius Sarmento Silva. A multi-agent data mining system for cartel detection in brazilian government procurement. Expert Systems with Applications, 39(14):11642 - 11656, 2012. [ bib | DOI | http ]
The main focus of this research project is the problem of extracting useful information from the Brazilian federal procurement process databases used by government auditors in the process of corruption detection and prevention to identify cartel formation among applicants. Extracting useful information to enhance cartel detection is a complex problem from many perspectives due to the large volume of data used to correlate information and the dynamic and diversified strategies companies use to hide their fraudulent operations. To attack the problem of data volume, we have used two data mining model functions, clustering and association rules, and a multi-agent approach to address the dynamic strategies of companies that are involved in cartel formation. To integrate both solutions, we have developed AGMI, an agent-mining tool that was validated using real data from the Brazilian Office of the Comptroller General, an institution of government auditing, where several measures are currently used to prevent and fight corruption. Our approach resulted in explicit knowledge discovery because {AGMI} presented many association rules that provided a 90% correct identification of cartel formation, according to expert assessment. According to auditing specialists, the extracted knowledge could help in the detection, prevention and monitoring of cartels that act in public procurement processes.

Keywords: Multi-agent data mining system
[246] Dirk Thorleuchter and Dirk Van den Poel. Predicting e-commerce company success by mining the text of its publicly-accessible website. Expert Systems with Applications, 39(17):13026 - 13034, 2012. [ bib | DOI | http ]
We analyze the impact of textual information from e-commerce companies’ websites on their commercial success. The textual information is extracted from the web content of e-commerce companies, divided into the Top 100 worldwide most successful companies and the Top 101 to 500 worldwide most successful companies. It is shown that latent semantic concepts extracted from the analysis of textual information can be adopted as success factors for a Top 100 e-commerce company classification. This contributes to the existing literature concerning e-commerce success factors. As evaluation, a regression model based on these concepts is built that is successful in predicting the commercial success of the Top 100 companies. These findings are valuable for e-commerce website creation.

Keywords: Success factor
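
The classification framing below is a loose sketch of the idea in [246] above, assuming a binary Top-100 label and a handful of toy website texts; the paper itself builds a regression model on latent semantic concepts, so this illustrates only the shape of the pipeline (TF-IDF, latent concepts via SVD, prediction), not the paper's setup.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    site_texts = [
        "free shipping secure checkout daily deals gift cards loyalty points",
        "company history mission statement press releases investor relations"]
    is_top100 = [1, 0]   # illustrative labels only

    model = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        TruncatedSVD(n_components=2),      # latent semantic concepts
        LogisticRegression())
    model.fit(site_texts, is_top100)
    print(model.predict(["secure checkout and free shipping offers"]))
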
[247] Vineet R. Khare and Rahul Chougule. Decision support for improved service effectiveness using domain aware text mining. Knowledge-Based Systems, 33(0):29 - 40, 2012. [ bib | DOI | http ]
This paper presents a decision support system ‘Domain Aware Text & Association Mining (DATAM)’ which has been developed to improve after-sales service and repairs for the automotive domain. A novel approach that compares textual and non-textual data for anomaly detection is proposed. It combines association and ontology based text mining. Association mining has been employed to identify the repairs performed in the field for a given symptom, whereas, text mining is used to infer repairs from the textual instructions mentioned in service documents for the same symptom. These in turn are compared and contrasted to identify the anomalous cases. The developed approach has been applied to automotive field data. Using the top 20 most frequent symptoms, observed in a mid-sized sedan built and sold in North America, it is demonstrated that {DATAM} can identify all the anomalous symptom – repair code combinations (with a false positive rate of 0.04). This knowledge, in the form of anomalies, can subsequently be used to improve the service/trouble-shooting procedure and identify technician training needs.

Keywords: Decision support systems
[248] B.L. Batista, L.R.S. da Silva, B.A. Rocha, J.L. Rodrigues, A.A. Berretta-Silva, T.O. Bonates, V.S.D. Gomes, R.M. Barbosa, and F. Barbosa. Multi-element determination in brazilian honey samples by inductively coupled plasma mass spectrometry and estimation of geographic origin with data mining techniques. Food Research International, 49(1):209 - 215, 2012. [ bib | DOI | http ]
Multi-element analysis of honey samples was carried out with the aim of developing a reliable method of tracing the origin of honey. Forty-two chemical elements were determined (Al, Cu, Pb, Zn, Mn, Cd, Tl, Co, Ni, Rb, Ba, Be, Bi, U, V, Fe, Pt, Pd, Te, Hf, Mo, Sn, Sb, P, La, Mg, I, Sm, Tb, Dy, Sd, Th, Pr, Nd, Tm, Yb, Lu, Gd, Ho, Er, Ce, Cr) by inductively coupled plasma mass spectrometry (ICP-MS). Then, three machine learning tools for classification and two for attribute selection were applied in order to prove that it is possible to use data mining tools to find the region where honey originated. Our results clearly demonstrate the potential of Support Vector Machine (SVM), Multilayer Perceptron (MLP) and Random Forest (RF) chemometric tools for honey origin identification. Moreover, the selection tools allowed a reduction from 42 trace element concentrations to only 5.

Keywords: Data mining
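
Purely structural illustration of the classification-plus-attribute-selection idea in [248] above: the concentrations below are synthetic random numbers rather than ICP-MS measurements, and random-forest feature importances stand in for the paper's attribute selection tools.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    elements = ["Al", "Cu", "Pb", "Zn", "Mn"]       # subset for illustration
    X = np.random.default_rng(0).gamma(2.0, 1.0, size=(60, len(elements)))
    y = np.repeat(["region_A", "region_B", "region_C"], 20)   # synthetic labels

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    ranked = sorted(zip(elements, clf.feature_importances_),
                    key=lambda t: -t[1])
    print(ranked[:3])   # candidate discriminative elements
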
[249] Zhengxing Huang, Xudong Lu, and Huilong Duan. On mining clinical pathway patterns from medical behaviors. Artificial Intelligence in Medicine, 56(1):35 - 50, 2012. [ bib | DOI | http ]
Clinical pathway analysis, as a pivotal issue in ensuring specialized, standardized, normalized and sophisticated therapy procedures, is receiving increasing attention in the field of medical informatics. Clinical pathway pattern mining is one of the most important components of clinical pathway analysis and aims to discover which medical behaviors are essential/critical for clinical pathways, and also where temporal orders of these medical behaviors are quantified with numerical bounds. Even though existing clinical pathway pattern mining techniques can tell us which medical behaviors are frequently performed and in which order, they seldom precisely provide quantified temporal order information of critical medical behaviors in clinical pathways. Methods This study adopts process mining to analyze clinical pathways. The key contribution of the paper is to develop a new process mining approach to find a set of clinical pathway patterns given a specific clinical workflow log and minimum support threshold. The proposed approach not only discovers which critical medical behaviors are performed and in which order, but also provides comprehensive knowledge about quantified temporal orders of medical behaviors in clinical pathways. Results The proposed approach is evaluated via real-world data-sets, which are extracted from Zhejiang Huzhou Central hospital of China with regard to six specific diseases, i.e., bronchial lung cancer, gastric cancer, cerebral hemorrhage, breast cancer, infarction, and colon cancer, in two years (2007.08–2009.09). As compared to the general sequence pattern mining algorithm, the proposed approach consumes less processing time, generates quite a smaller number of clinical pathway patterns, and has a linear scalability in terms of execution time against the increasing size of data sets. Conclusion The experimental results indicate the applicability of the proposed approach, based on which it is possible to discover clinical pathway patterns that can cover most frequent medical behaviors that are most regularly encountered in clinical practice. Therefore, it holds significant promise in research efforts related to the analysis of clinical pathways.

Keywords: Clinical pathway analysis
[250] Joaquim F. Silva and Celeste Jacinto. Finding occupational accident patterns in the extractive industry using a systematic data mining approach. Reliability Engineering & System Safety, 108(0):108 - 122, 2012. [ bib | DOI | http ]
This paper deals with occupational accident patterns in the Portuguese Extractive Industry. It constitutes a significant advance in relation to a previous study made in 2008, both in terms of methodology and extended knowledge of the patterns’ details. This work uses more recent data (2005–2007) and this time the identification of the “typical accident” shifts from a bivariate to a multivariate pattern, characterising the accident mechanisms more accurately. Instead of crossing only two variables (Deviation x Contact), the new methodology developed here uses data mining techniques to associate nine variables, through their categories, and to quantify the statistical cohesion of each pattern. The results confirmed the “typical accident” of the 2008 study, but went much further: they reveal three statistically significant patterns (the top-3 categories in frequency); moreover, each pattern now includes more variables (4–5 categories) and indicates their statistical cohesion. This approach allowed a more accurate vision of the reality, which is fundamental for risk management. The methodology is best suited for large groups, such as national Authorities, Insurers or Corporate Groups, to assist them in planning target-oriented safety strategies. Not least importantly, researchers can apply the same algorithm to other study areas, as it is not restricted to accidents, nor to safety.

Keywords: Extractive Industry
[251] Li Chen, Luole Qi, and Feng Wang. Comparison of feature-level learning methods for mining online consumer reviews. Expert Systems with Applications, 39(10):9588 - 9601, 2012. [ bib | DOI | http ]
The tasks of feature-level opinion mining usually include the extraction of product entities from consumer reviews, the identification of opinion words that are associated with the entities, and the determination of these opinions’ polarities (e.g., positive, negative, or neutral). In recent years, two major approaches have been proposed to determine opinions at the feature level: model based methods such as the one based on lexicalized Hidden Markov Models (L-HMMs), and statistical methods like the association rule mining based technique. However, little work has compared these algorithms regarding their practical abilities in identifying various types of review elements, such as features, opinions, intensifiers, entity phrases and infrequent entities. On the other hand, little attention has been paid to applying more discriminative learning models to accomplish these opinion mining tasks. In this paper, we not only experimentally compared these methods based on a real-world review dataset, but also in particular adopted the Conditional Random Fields (CRFs) model and evaluated its performance in comparison with related algorithms. Moreover, for the CRFs-based mining algorithm, we tested the role of a self-tagging process in two automatic training conditions, and further identified the ideal combination of learning functions to optimize its learning performance. The comparative experiment eventually revealed the CRFs-based method’s outperforming accuracy in terms of mining multiple review elements, relative to other methods.

Keywords: Consumer reviews
[252] Shu-Hsien Liao, Pei-Hui Chu, and Pei-Yuan Hsiao. Data mining techniques and applications – a decade review from 2000 to 2011. Expert Systems with Applications, 39(12):11303 - 11311, 2012. [ bib | DOI | http ]
In order to determine how data mining techniques (DMT) and their applications have developed during the past decade, this paper reviews data mining techniques and their applications and development, through a survey of the literature and the classification of articles, from 2000 to 2011. Keyword indices and article abstracts were used to identify 216 articles concerning {DMT} applications from 159 academic journals (retrieved from five online databases). This paper surveys and classifies DMT, with respect to the following three areas: knowledge types, analysis types, and architecture types, together with their applications in different research and practical domains. A discussion deals with the direction of any future developments in {DMT} methodologies and applications: (1) {DMT} is finding increasing applications in expertise orientation and the development of applications for {DMT} is a problem-oriented domain. (2) It is suggested that different social science methodologies, such as psychology, cognitive science and human behavior, might implement DMT, as an alternative to the methodologies already on offer. (3) The ability to continually change and acquire new understanding is a driving force for the application of {DMT} and this will allow many new future applications.

Keywords: Data mining
[253] Tingyu Liu, Yalong Cheng, and Zhonghua Ni. Mining event logs to support workflow resource allocation. Knowledge-Based Systems, 35(0):320 - 331, 2012. [ bib | DOI | http ]
Currently, workflow technology is widely used to facilitate the business process in enterprise information systems (EIS), and it has the potential to reduce design time, enhance product quality and decrease product cost. However, significant limitations still exist: as an important task in the context of workflow, many present resource allocation (also known as “staff assignment”) operations are still performed manually, which are time-consuming. This paper presents a data mining approach to address the resource allocation problem (RAP) and improve the productivity of workflow resource management. Specifically, an Apriori-like algorithm is used to find the frequent patterns from the event log, and association rules are generated according to predefined resource allocation constraints. Subsequently, a correlation measure named lift is utilized to annotate the negatively correlated resource allocation rules for resource reservation. Finally, the rules are ranked using the confidence measures as resource allocation rules. Comparative experiments are performed using C4.5, SVM, ID3, Naïve Bayes and the presented approach, and the results show that the presented approach is effective in both accuracy and candidate resource recommendations.

Keywords: Workflow
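
A minimal sketch of deriving task-to-performer allocation rules with support, confidence and lift from an event log, in the spirit of [253] above; the log and thresholds are invented, and no Apriori-style candidate generation is shown.

    from collections import Counter

    event_log = [("approve_order", "alice"), ("approve_order", "alice"),
                 ("approve_order", "bob"),   ("ship_order", "carol"),
                 ("ship_order", "carol"),    ("ship_order", "dave")]

    n = len(event_log)
    task_counts = Counter(t for t, _ in event_log)
    res_counts = Counter(r for _, r in event_log)
    pair_counts = Counter(event_log)

    for (task, res), c in pair_counts.items():
        support = c / n
        confidence = c / task_counts[task]
        lift = confidence / (res_counts[res] / n)    # lift < 1 flags negative correlation
        if support >= 0.2 and confidence >= 0.5:
            print(f"{task} -> {res}  supp={support:.2f} conf={confidence:.2f} lift={lift:.2f}")
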
[254] Chih-Fong Tsai, Yu-Hsin Lu, and David C. Yen. Determinants of intangible assets value: The data mining approach. Knowledge-Based Systems, 31(0):67 - 77, 2012. [ bib | DOI | http ]
It is very important for investors and creditors to understand the critical factors affecting a firm’s value before making decisions about investments and loans. Since the knowledge-based economy has evolved, the method for creating firm value has transferred from traditional physical assets to intangible knowledge. Therefore, valuation of intangible assets has become a widespread topic of interest in the future of the economy. This study takes advantage of feature selection, an important data-preprocessing step in data mining, to identify important and representative factors affecting intangible assets. Particularly, five feature selection methods are considered, which include principal component analysis (PCA), stepwise regression (STEPWISE), decision trees (DT), association rules (AR), and genetic algorithms (GA). In addition, multi-layer perceptron (MLP) neural networks are used as the prediction model in order to understand which features selected from these five methods can allow the prediction model to perform best. Based on the chosen dataset containing 61 variables, the experimental result shows that combining the results from multiple feature selection methods performs the best. GA ∩ STEPWISE, DT ∪ PCA, and the {DT} single feature selection method generate approximately 75% prediction accuracy, which select 26, 22, and 7 variables respectively.

Keywords: Feature selection
[255] Chung-Hong Lee and Shih-Hao Wang. An information fusion approach to integrate image annotation and text mining methods for geographic knowledge discovery. Expert Systems with Applications, 39(10):8954 - 8967, 2012. [ bib | DOI | http ]
Due to the steady increase in the number of heterogeneous types of location information on the internet, it is hard to organize a complete overview of the geospatial information for the tasks of knowledge acquisition related to specific geographic locations. The text- and photo-types of geographical dataset contain numerous location data, such as location-based tourism information, therefore defining high dimensional spaces of attributes that are highly correlated. In this work, we utilized text- and photo-types of location information with a novel approach of information fusion that exploits effective image annotation and location based text-mining approaches to enhance identification of geographic location and spatial cognition. In this paper, we describe our feature extraction methods for annotating images, and utilize a text mining approach to analyze images and texts simultaneously, in order to carry out geospatial text mining and image classification tasks. Subsequently, photo-images and textual documents are projected to a unified feature space, in order to generate a co-constructed semantic space for information fusion. Also, we employed text mining approaches to classify documents into various categories based upon their geospatial features, with the aim of discovering relationships between documents and geographical zones. The experimental results show that the proposed method can effectively enhance the tasks of location based knowledge discovery.

Keywords: Geographic knowledge discovery
[256] Alireza Tamaddoni-Nezhad, Ghazal Afroozi Milani, Alan Raybould, Stephen Muggleton, and David A. Bohan. Chapter four - construction and validation of food webs using logic-based machine learning and text mining. In Guy Woodward and David A. Bohan, editors, Ecological Networks in an Agricultural World, volume 49 of Advances in Ecological Research, pages 225 - 289. Academic Press, 2013. [ bib | DOI | http ]
Network ecology holds great promise as an approach to modelling and predicting the effects of agricultural management on ecosystem service provision, as it bridges the gap between community and ecosystem ecology. Unfortunately, trophic interactions between most species in agricultural farmland are not well characterised empirically, and only partial food webs are available for a few systems. Large agricultural datasets of the nodes (i.e., species) in the webs are now available, and if these can be enriched with information on the links between them then the current shortage of network data can potentially be overcome. We demonstrate that a logic-based machine learning method can be used to automatically assign interactions between nodes, thereby generating plausible and testable food webs from ecological census data. Many of the learned trophic links were corroborated by the literature: in particular, links ascribed with high probability by machine learning corresponded with those having multiple references in the literature. In some cases, previously unobserved but high probability links were suggested and subsequently confirmed by other research groups. We evaluate these food webs using a new cross-validation method and present new results on automatic corroboration of a large, complex food web. The simulated frequencies of trophic links were also correlated with the total number of literature ‘hits’ for these links from the automatic corroboration. Finally, we also show that a network constructed by learning trophic links between functional groups is at least as accurate as the species-based trophic network.

Keywords: Network ecology
[257] David Lo, G. Ramalingam, Venkatesh-Prasad Ranganath, and Kapil Vaswani. Mining quantified temporal rules: Formalism, algorithms, and evaluation. Science of Computer Programming, 77(6):743 - 759, 2012. (1) Coordination 2009 (2) {WCRE} 2009. [ bib | DOI | http ]
Libraries usually impose constraints on how clients should use them. Often these constraints are not well-documented. In this paper, we address the problem of recovering such constraints automatically, a problem referred to as specification mining. Given some client programs that use a given library, we identify constraints on the library usage that are (almost) satisfied by the given set of clients. The class of rules we target for mining combines simple binary temporal operators with state predicates (composed of equality constraints) and quantification. This is a simple yet expressive subclass of temporal properties (LTL formulae) that allows us to capture many common {API} usage rules. We focus on recovering rules from execution traces and apply classical data mining concepts to be robust against bugs (API usage rule violations) in clients. We present new algorithms for mining rules from execution traces. We show how a propositional rule mining algorithm can be generalized to treat quantification and state predicates in a unified way. Our approach enables the miner to be complete (i.e. , mine all rules within the targeted class that are satisfied by the given traces) while avoiding an exponential blowup. We have implemented these algorithms and used them to mine {API} usage rules for several Windows APIs. Our experiments show the efficiency and effectiveness of our approach.

Keywords: Specification mining
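
A rough sketch of mining simple "a is eventually followed by b" rules from execution traces, echoing the rule class described in [257] above but without state predicates or quantification; the traces are invented.

    from itertools import product

    traces = [["open", "read", "close"],
              ["open", "write", "close"],
              ["open", "close"]]

    events = sorted({e for t in traces for e in t})

    def holds(trace, a, b):
        """Every occurrence of a is followed later in the trace by some b."""
        for i, e in enumerate(trace):
            if e == a and b not in trace[i + 1:]:
                return False
        return True

    min_fraction = 1.0   # tolerate no violating traces; lower to be robust to bugs
    for a, b in product(events, events):
        if a == b:
            continue
        sat = sum(holds(t, a, b) for t in traces)
        if sat / len(traces) >= min_fraction and any(a in t for t in traces):
            print(f"{a} -> eventually {b}  ({sat}/{len(traces)} traces)")
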
[258] Alfonso Capozzoli, Fiorella Lauro, and Imran Khan. Fault detection analysis using data mining techniques for a cluster of smart office buildings. Expert Systems with Applications, 42(9):4324 - 4338, 2015. [ bib | DOI | http ]
There is an increasing need for automated fault detection tools in buildings. The total energy request in buildings can be significantly reduced by detecting abnormal consumption effectively. Numerous models are used to tackle this problem, but either they are very complex and mostly applicable at the component level, or they cannot be adopted for different buildings and equipment. In this study a simplified approach to automatically detect anomalies in building energy consumption, based on actual recorded data of active electrical power for lighting and total active electrical power of a cluster of eight buildings, is presented. The proposed methodology uses statistical pattern recognition techniques and artificial neural ensembling networks coupled with outlier detection methods for fault detection. The results show the usefulness of this data analysis approach in automatic fault detection by reducing the number of false anomalies. The method allows the identification of patterns of faults occurring in a cluster of buildings; in this way the energy consumption can be further optimized, also through the building management staff, by informing occupants of their energy usage and educating them to be proactive in their energy consumption. Finally, in the context of smart buildings, the common detected outliers in the cluster of buildings demonstrate that the management of a smart district can be operated with the whole buildings cluster approach.

Keywords: Smart building
[259] Anne Sunikka and Johanna Bragge. Applying text-mining to personalization and customization research literature – who, what and where? Expert Systems with Applications, 39(11):10049 - 10058, 2012. [ bib | DOI | http ]
Personalization and customization have numerous definitions that are sometimes used interchangeably in the literature. This study combines a text-mining approach for profiling personalization and customization research with a traditional literature review in order to distinguish the main characteristics of these two research streams. Research profiling with search words personalization and customization is conducted using the Web of Science literature database. The elements typical to the personalization and customization research are identified. Personalization research has a strong focus on technology and the internet; in addition to which it emphasizes customers’ needs and preferences as well as information collection for user modeling and recommender systems. Customization is an older research stream, and the main body of the research has focused on tangible products but has lately initiated research in service fields. Based on the insights gained from research profiling and literature review, this study suggests a new classification of concepts linked to personalization.

Keywords: Personalization
[260] Hamed Majidi Zolbanin, Dursun Delen, and Amir Hassan Zadeh. Predicting overall survivability in comorbidity of cancers: A data mining approach. Decision Support Systems, 74(0):150 - 161, 2015. [ bib | DOI | http ]
Cancer and other chronic diseases have constituted (and will do so at an increasing pace) a significant portion of healthcare costs in the United States in recent years. Although prior research has shown that diagnostic and treatment recommendations might be altered based on the severity of comorbidities, chronic diseases are still being investigated in isolation from one another in most cases. To illustrate the significance of concurrent chronic diseases in the course of treatment, this study uses SEER's cancer data to create two comorbid data sets: one for breast and female genital cancers and another for prostate and urinal cancers. Several popular machine learning techniques are then applied to the resultant data sets to build predictive models. Comparison of the results shows that having more information about comorbid conditions of patients can improve models' predictive power, which in turn, can help practitioners make better diagnostic and treatment decisions. Therefore, proper identification, recording, and use of patients' comorbidity status can potentially lower treatment costs and ease the healthcare related economic challenges.

Keywords: Medical decision making
[261] Bethany Haalboom. The intersection of corporate social responsibility guidelines and indigenous rights: Examining neoliberal governance of a proposed mining project in suriname. Geoforum, 43(5):969 - 979, 2012. [ bib | DOI | http ]
With neoliberal reforms and the growth of multinational mining investment in developing countries, corporate social responsibility (CSR) has become notable (and debatable) for its potential to fill a social and environmental governance gap. As yet, there has been limited analytical attention paid to the political struggles and power dynamics that get reflected through specific {CSR} guidelines and their implementation in local contexts; this is particularly apparent with respect to the human rights dimension of CSR, and more specifically, indigenous rights. This study documents the debates, issues of accountability, and different interpretations of {CSR} between {NGOs} representing indigenous rights and a mining corporation. These debates focus on environmental impact assessments; indigenous rights to land; and the indigenous right to Free, Prior, and Informed Consent. These exchanges illustrate the socio-political, as well as economic, positioning of these actors, and the different agendas associated with their positions that determine issues of accountability and shape alternate interpretations of {CSR} guidelines. The outcomes of these debates also reflect the different degrees of power that these actors hold in such contexts, irrespective of the strength or validity of their arguments about CSR. This dialogue is thereby a lens into the more complex and contentious entanglements that emerge with {CSR} as a mode of governance, as it plays out ‘on the ground.’ These findings also reinforce questions regarding what we can expect of {CSR} as a mode of governance for addressing human rights issues with resource extraction projects, particularly within the constraints of overriding political and social structures.

Keywords: Corporate social responsibility
[262] Wenyin Tang, Flora S. Tsai, and Lihui Chen. Blended metrics for novel sentence mining. Expert Systems with Applications, 37(7):5172 - 5177, 2010. [ bib | DOI | http ]
With the abundance of raw text documents available on the internet, many articles contain redundant information. Novel sentence mining can discover novel, yet relevant, sentences given a specific topic defined by a user. In real-time novelty mining, an important issue is how to select a suitable novelty metric that quantitatively measures the novelty of a particular sentence. To utilize the merits of different metrics, a blended metric is proposed by combining both the cosine similarity and new word count metrics. The blended metric has been tested on {TREC} 2003 and {TREC} 2004 Novelty Track data. The experimental results show that the blended metric can perform generally better on topics with different ratios of novelty, which is useful for real-time novelty mining in topics with varying degrees of novelty.

Keywords: Novel sentence mining
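
A sketch of a blended novelty score combining cosine dissimilarity with a new-word ratio, in the spirit of [262] above; the weight alpha and its equal-weight default are free choices, not the settings used in the paper.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def novelty(sentence, history, alpha=0.5):
        if not history:
            return 1.0
        vec = CountVectorizer().fit(history + [sentence])
        s = vec.transform([sentence])
        h = vec.transform(history)
        cos_score = 1.0 - cosine_similarity(s, h).max()    # dissimilarity to closest seen sentence
        analyze = vec.build_analyzer()
        seen = {w for doc in history for w in analyze(doc)}
        words = analyze(sentence)
        new_ratio = sum(w not in seen for w in words) / max(len(words), 1)
        return alpha * cos_score + (1 - alpha) * new_ratio

    history = ["the parliament approved the new budget"]
    print(novelty("the budget was approved by parliament", history))   # low novelty
    print(novelty("heavy rain is expected across the coast", history)) # high novelty
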
[263] Changyong Lee, Bomi Song, and Yongtae Park. Design of convergent product concepts based on functionality: An association rule mining and decision tree approach. Expert Systems with Applications, 39(10):9534 - 9542, 2012. [ bib | DOI | http ]
Recent trends in paradigms of digital convergence have accentuated the notions of convergent products that are formed by adding new functions to an existing base product. However, a lacuna still remains in the literature as to systematic design of convergent product concepts (CPCs) based on functionality. This study proposes a systematic approach to design of {CPCs} based on online community information using data mining techniques. At the heart of the suggested approach is the combined use of association rule mining (ARM) and decision tree (DT) for discovering the significant relationships among items and detecting the meaningful conditions of items. Specifically, the proposed approach is composed of four steps: data collection and transformation, definition of target functions, identification of critical product features, and specification of design details. Three maps – function co-preferences map, feature relations map, and concepts specification map – are developed to aid decision making in design of CPCs, structuring and visualizing design implications. A case of the portable multimedia player (PMP) is presented to illustrate the proposed approach. We believe that our approach can reduce uncertainty and risk involved in the concept design stage.

Keywords: New product development (NPD)
[264] Yan Liang, Ying Liu, Chun Kit Kwong, and Wing Bun Lee. Learning the “whys”: Discovering design rationale using text mining — an algorithm perspective. Computer-Aided Design, 44(10):916 - 930, 2012. Fundamentals of Next Generation CAD/E Systems. [ bib | DOI | http ]
Collecting design rationale (DR) and making it available in a well-organized manner will better support product design, innovation and decision-making. Many {DR} systems have been developed to capture {DR} since the 1970s. However, the {DR} capture process is heavily human involved. In addition, with the increasing amount of {DR} available in archived design documents, it has become an acute problem to research a new computational approach that is able to capture {DR} from free textual contents effectively. In our previous study, we have proposed an {ISAL} (issue, solution and artifact layer) model for {DR} representation. In this paper, we focus on algorithm design to discover {DR} from design documents according to the {ISAL} modeling. For the issue layer of the {ISAL} model, we define a semantic sentence graph to model sentence relationships through language patterns. Based on this graph, we improve the manifold-ranking algorithm to extract issue-bearing sentences. To discover solution–reason bearing sentences for the solution layer, we propose building up two sentence graphs based on candidate solution-bearing sentences and reason-bearing sentences respectively, and propagating information between them. For artifact information extraction, we propose two term relations, i.e. positional term relation and mutual term relation. Using these relations, we extend our document profile model to score the candidate terms. The performance and scalability of the algorithms proposed are tested using patents as research data joined with an example of prior art search to illustrate its application prospects.

Keywords: Design rationale
[265] Qiwei He, Bernard P. Veldkamp, and Theo de Vries. Screening for posttraumatic stress disorder using verbal features in self narratives: A text mining approach. Psychiatry Research, 198(3):441 - 447, 2012. [ bib | DOI | http ]
Much evidence has shown that people's physical and mental health can be predicted by the words they use. However, such verbal information is seldom used in the screening and diagnosis process probably because the procedure to handle these words is rather difficult with traditional quantitative methods. The first challenge would be to extract robust information from diversified expression patterns, the second to transform unstructured text into a structuralized dataset. The present study developed a new textual assessment method to screen the posttraumatic stress disorder (PTSD) patients using lexical features in the self narratives with text mining techniques. Using 300 self narratives collected online, we extracted highly discriminative keywords with the Chi-square algorithm and constructed a textual assessment model to classify individuals with the presence or absence of PTSD. This resulted in a high agreement between computer and psychiatrists' diagnoses for {PTSD} and revealed some expressive characteristics in the writings of {PTSD} patients. Although the results of text analysis are not completely analogous to the results of structured interviews in {PTSD} diagnosis, the application of text mining is a promising addition to assessing {PTSD} in clinical and research settings.

Keywords: Posttraumatic stress disorder
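
Illustrative only (the narratives below are invented, not clinical data): chi-square keyword selection followed by a simple classifier, echoing the textual assessment pipeline of [265] above.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    narratives = ["i keep having nightmares and flashbacks of the crash",
                  "i avoid driving and feel constantly on edge",
                  "we enjoyed a quiet weekend hiking in the hills",
                  "the new recipe turned out better than expected"]
    labels = [1, 1, 0, 0]   # 1 = screen positive (synthetic)

    model = make_pipeline(CountVectorizer(),
                          SelectKBest(chi2, k=5),   # highly discriminative keywords
                          MultinomialNB())
    model.fit(narratives, labels)
    print(model.predict(["flashbacks wake me up at night"]))
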
[266] Weiyi Shang, Bram Adams, and Ahmed E. Hassan. Using pig as a data preparation language for large-scale mining software repositories studies: An experience report. Journal of Systems and Software, 85(10):2195 - 2204, 2012. Automated Software Evolution. [ bib | DOI | http ]
The Mining Software Repositories (MSR) field analyzes software repository data to uncover knowledge and assist development of ever growing, complex systems. However, existing approaches and platforms for {MSR} analysis face many challenges when performing large-scale {MSR} studies. Such approaches and platforms rarely scale easily out of the box. Instead, they often require custom scaling tricks and designs that are costly to maintain and that are not reusable for other types of analysis. We believe that the web community has faced many of these software engineering scaling challenges before, as web analyses have to cope with the enormous growth of web data. In this paper, we report on our experience in using a web-scale platform (i.e., Pig) as a data preparation language to aid large-scale {MSR} studies. Through three case studies, we carefully validate the use of this web platform to prepare (i.e., Extract, Transform, and Load, ETL) data for further analysis. Despite several limitations, we still encourage {MSR} researchers to leverage Pig in their large-scale studies because of Pig's scalability and flexibility. Our experience report will help other researchers who want to scale their analyses.

Keywords: Software engineering
[267] Cornelia Schoor and Maria Bannert. Exploring regulatory processes during a computer-supported collaborative learning task using process mining. Computers in Human Behavior, 28(4):1321 - 1331, 2012. [ bib | DOI | http ]
The purpose of this study was to explore sequences of social regulatory processes during a computer-supported collaborative learning task and their relationship to group performance. Analogous to self-regulation during individual learning, we conceptualized social regulation both as individual and as collaborative activities of analyzing, planning, monitoring and evaluating cognitive and motivational aspects during collaborative learning. We analyzed the data of 42 participants working together in dyads. They had 90 min to develop a common handout on a statistical topic while communicating only via chat and common editor. The log files of chat and editor were coded regarding activities of social regulation. Results show that participants in dyads with higher group performance (N = 20) did not differ from participants with lower group performance (N = 22) in the frequencies of regulatory activities. In an exploratory way, we used process mining to identify process patterns for high versus low group performance dyads. The resulting models show clear parallels between high and low achieving dyads in a double loop of working on the task, monitoring, and coordinating. Moreover, there are no major differences in the process of high versus low achieving dyads. Both results are discussed with regard to theoretical and empirical issues. Furthermore, the method of process mining is discussed.

Keywords: Computer-supported collaborative learning
[268] Dusan Stevanovic, Aijun An, and Natalija Vlajic. Feature evaluation for web crawler detection with data mining techniques. Expert Systems with Applications, 39(10):8707 - 8717, 2012. [ bib | DOI | http ]
Distributed Denial of Service (DDoS) is one of the most damaging attacks on the Internet security today. Recently, malicious web crawlers have been used to execute automated {DDoS} attacks on web sites across the WWW. In this study we examine the effect of applying seven well-established data mining classification algorithms on static web server access logs in order to: (1) classify user sessions as belonging to either automated web crawlers or human visitors and (2) identify which of the automated web crawlers sessions exhibit ‘malicious’ behavior and are potentially participants in a {DDoS} attack. The classification performance is evaluated in terms of classification accuracy, recall, precision and {F1} score. Seven out of nine vector (i.e. web-session) features employed in our work are borrowed from earlier studies on classification of user sessions as belonging to web crawlers. However, we also introduce two novel web-session features: the consecutive sequential request ratio and standard deviation of page request depth. The effectiveness of the new features is evaluated in terms of the information gain and gain ratio metrics. The experimental results demonstrate the potential of the new features to improve the accuracy of data mining classifiers in identifying malicious and well-behaved web crawler sessions.

Keywords: Web crawler detection
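
A sketch of computing the two session features proposed in [268] above for toy request sequences; both definitions here (depth as the number of URL path segments, and "sequential" as staying at the same depth or going one level deeper) are plausible readings for illustration, not the paper's exact formulas.

    import statistics

    def session_features(paths):
        depths = [len([p for p in path.strip("/").split("/") if p]) for path in paths]
        # consecutive sequential request ratio: fraction of adjacent request
        # pairs that stay at the same depth or go one level deeper
        sequential = sum(1 for a, b in zip(depths, depths[1:]) if b in (a, a + 1))
        ratio = sequential / max(len(paths) - 1, 1)
        depth_std = statistics.pstdev(depths) if depths else 0.0
        return ratio, depth_std

    human = ["/", "/news", "/news/politics", "/news/politics/article-42"]
    crawler = ["/a", "/b/c/d", "/", "/x/y", "/robots.txt"]
    print(session_features(human))     # high ratio
    print(session_features(crawler))   # low ratio, erratic depth
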
[269] Jia Rong, Huy Quan Vu, Rob Law, and Gang Li. A behavioral analysis of web sharers and browsers in hong kong using targeted association rule mining. Tourism Management, 33(4):731 - 740, 2012. [ bib | DOI | http ]
With the widespread use of Internet technology, electronic word-of-mouth [eWOM] communication through online reviews of products and services has a strong influence on consumer behavior and preferences. Although prior research efforts have attempted to investigate the behavior of users regarding the sharing of personal experiences and browsing the experiences of others online, it remains a challenge for business managers to incorporate eWOM effects into their business planning and decision-making processes effectively. Applying a newly proposed association rule mining technique, this study investigates eWOM in the context of the tourism industry using an outbound domestic tourism data set that was recently collected in Hong Kong. The complete profiles and the relations of online experience sharers and travel website browsers are explored. The empirical results are useful in helping tourism managers to define new target customers and to plan more effective marketing strategies.

Keywords: Sharers
[270] Pawel Sobkowicz, Michael Kaschesky, and Guillaume Bouchard. Opinion mining in social media: Modeling, simulating, and forecasting political opinions in the web. Government Information Quarterly, 29(4):470 - 479, 2012. Social Media in Government - Selections from the 12th Annual International Conference on Digital Government Research (dg.o2011). [ bib | DOI | http ]
Affordable and ubiquitous online communications (social media) provide the means for flows of ideas and opinions and play an increasing role for the transformation and cohesion of society – yet little is understood about how online opinions emerge, diffuse, and gain momentum. To address this problem, an opinion formation framework based on content analysis of social media and sociophysical system modeling is proposed. Based on prior research and own projects, three building blocks of online opinion tracking and simulation are described: (1) automated topic, emotion and opinion detection in real-time, (2) information flow modeling and agent-based simulation, and (3) modeling of opinion networks, including special social and psychological circumstances, such as the influence of emotions, media and leaders, changing social networks etc. Finally, three application scenarios are presented to illustrate the framework and motivate further research.

Keywords: Management
[271] Pauray S.M. Tsai. Mining top-k frequent closed itemsets over data streams using the sliding window model. Expert Systems with Applications, 37(10):6968 - 6973, 2010. [ bib | DOI | http ]
Association rule mining is an important research topic in the data mining community. There are two difficulties in mining association rules. First, the user must specify a minimum support for mining. Typically it may require tuning the value of the minimum support many times before a set of useful association rules can be obtained. However, it is not easy for the user to find an appropriate minimum support. Secondly, there are usually a lot of frequent itemsets generated in the mining result. This results in the generation of a large number of association rules, giving rise to difficulties in application. In this paper, we consider mining top-k frequent closed itemsets from data streams using a sliding window technique. A single pass algorithm, called FCI_max, is developed for the generation of top-k frequent closed itemsets of length no more than max_l. Our method can efficiently resolve the two difficulties mentioned above in association rule mining, which promotes the usability of the mining result in practice.

Keywords: Data mining
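
A brute-force sketch of the problem addressed in [271] above (not the single-pass FCI_max algorithm): within the current sliding window, enumerate itemsets up to length max_l, keep the closed ones, and report the k with highest support. The stream and parameters are invented.

    from itertools import combinations
    from collections import Counter

    def topk_closed(window, k=3, max_l=2):
        support = Counter()
        for trans in window:
            items = sorted(set(trans))
            for size in range(1, min(max_l, len(items)) + 1):
                for iset in combinations(items, size):
                    support[frozenset(iset)] += 1
        closed = []
        for iset, sup in support.items():
            # closed itemset: no proper superset has the same support
            if not any(sup == s and iset < other
                       for other, s in support.items()):
                closed.append((iset, sup))
        return sorted(closed, key=lambda t: -t[1])[:k]

    stream = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}]
    window = stream[-4:]          # sliding window over the 4 newest transactions
    for iset, sup in topk_closed(window):
        print(set(iset), sup)
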
[272] Baha Şen, Emine Uçar, and Dursun Delen. Predicting and analyzing secondary education placement-test scores: A data mining approach. Expert Systems with Applications, 39(10):9468 - 9476, 2012. [ bib | DOI | http ]
Understanding the factors that lead to success (or failure) of students at placement tests is an interesting and challenging problem. Since the centralized placement tests and future academic achievements are considered to be related concepts, analysis of the success factors behind placement tests may help understand and potentially improve academic achievement. In this study, using a large and feature-rich dataset from the Secondary Education Transition System in Turkey, we developed models to predict secondary education placement test results, and using sensitivity analysis on those prediction models we identified the most important predictors. The results showed that the {C5} decision tree algorithm is the best predictor with 95% accuracy on the hold-out sample, followed by support vector machines (with an accuracy of 91%) and artificial neural networks (with an accuracy of 89%). Logistic regression models came out to be the least accurate of the four, with an overall accuracy of 82%. The sensitivity analysis revealed that previous test experience, whether a student has a scholarship, the student’s number of siblings, and previous years’ grade point average are among the most important predictors of the placement test scores.

Keywords: Data mining
[273] Tsui-Ping Chang and Shih-Ying Chen. An efficient algorithm of frequent {XML} query pattern mining for ebxml applications in e-commerce. Expert Systems with Applications, 39(2):2183 - 2193, 2012. [ bib | DOI | http ]
Providing efficient query to {XML} data for ebXML applications in e-commerce is crucial, as {XML} has become the most important technique to exchange data over the Internet. ebXML is a set of specifications for companies to exchange their data in e-commerce. Following the ebXML specifications, companies have a standard method to exchange business messages, communicate data, and business rules in e-commerce. Due to its tree-structure paradigm, {XML} is superior for its capability of storing and querying complex data for ebXML applications. Therefore, discovering frequent {XML} query patterns has become an interesting topic for {XML} data management in ebXML applications. In this paper, we present an efficient mining algorithm, namely ebXMiner, to discover the frequent {XML} query patterns for ebXML applications. Unlike the existing algorithms, we propose a new idea by collecting the equivalent {XML} queries and then enumerating the candidates from infrequent {XML} queries in our ebXMiner. Furthermore, our simulation results show that ebXMiner outperforms other algorithms in its execution time.

Keywords: {XML} query pattern mining
[274] A.A. Javadi, A. Ahangar-Asr, A. Johari, A. Faramarzi, and D. Toll. Modelling stress–strain and volume change behaviour of unsaturated soils using an evolutionary based data mining technique, an incremental approach. Engineering Applications of Artificial Intelligence, 25(5):926 - 933, 2012. [ bib | DOI | http ]
Modelling of unsaturated soils has been the subject of many research works in the past few decades. A number of constitutive models have been developed to describe the complex behaviour of unsaturated soils. However, many have proven to be unable to predict all aspects of the behaviour of unsaturated soils in a unified manner. In this paper an alternative new approach is presented, based on the Evolutionary Polynomial Regression (EPR) technique. {EPR} is a data mining technique that generates a transparent and structured representation of the behaviour of a system directly from input test data. The capabilities of the proposed EPR-based framework in modelling of behaviour of unsaturated soils are illustrated using results from a comprehensive set of triaxial tests on samples of compacted unsaturated soils from literature. The main parameters used for modelling of the behaviour of unsaturated soils during shearing are initial water content, initial dry density, mean net stress, axial strain, suction, volumetric strain, and deviator stress. The model developed is used to predict different aspects of the behaviour of unsaturated soils for conditions not used in the model building process. The results show that the proposed approach provides a useful framework for modelling of unsaturated soils. The merits and advantages of the proposed approach are highlighted.

Keywords: Unsaturated soil
[275] Álvaro Rebuge and Diogo R. Ferreira. Business process analysis in healthcare environments: A methodology based on process mining. Information Systems, 37(2):99 - 116, 2012. Management and Engineering of Process-Aware Information Systems. [ bib | DOI | http ]
Performing business process analysis in healthcare organizations is particularly difficult due to the highly dynamic, complex, ad hoc, and multi-disciplinary nature of healthcare processes. Process mining is a promising approach to obtain a better understanding about those processes by analyzing event data recorded in healthcare information systems. However, not all process mining techniques perform well in capturing the complex and ad hoc nature of clinical workflows. In this work we introduce a methodology for the application of process mining techniques that leads to the identification of regular behavior, process variants, and exceptional medical cases. The approach is demonstrated in a case study conducted at a hospital emergency service. For this purpose, we implemented the methodology in a tool that integrates the main stages of process analysis. The tool is specific to the case study, but the same methodology can be used in other healthcare environments.

Keywords: Business process analysis
[276] Ayhan Demiriz, Gürdal Ertek, Tankut Atan, and Ufuk Kula. Re-mining item associations: Methodology and a case study in apparel retailing. Decision Support Systems, 52(1):284 - 293, 2011. [ bib | DOI | http ]
Association mining is the conventional data mining technique for analyzing market basket data and it reveals the positive and negative associations between items. While being an integral part of transaction data, pricing and time information have not been integrated into market basket analysis in earlier studies. This paper proposes a new approach to mine price, time and domain related attributes through re-mining of association mining results. The underlying factors behind positive and negative relationships can be characterized and described through this second data mining stage. The applicability of the methodology is demonstrated through the analysis of data coming from a large apparel retail chain, and its algorithmic complexity is analyzed in comparison to the existing techniques.

Keywords: Data mining
[277] Jong Hwan Suh. Forecasting the daily outbreak of topic-level political risk from social media using hidden markov model-based techniques. Technological Forecasting and Social Change, 94(0):115 - 132, 2015. [ bib | DOI | http ]
Nowadays, as an arena of politics, social media ignites political protests, so analyzing topics discussed negatively in social media has increased in importance for detecting a nation's political risk. In this context, this paper designs and examines an automatic approach to forecast the daily outbreak of political risk from social media at a topic level. It evaluates the forecasting performance of topic features investigated in previous works that analyze social media data for politics, of hidden Markov model (HMM)-based techniques widely used for anomaly detection with time-series data, and of detection models into which the topic features and the detection techniques are combined. When applied to South Korea's Web forum, Daum Agora, statistical comparisons with the constraints of a false positive rate of < 0.1 and timeliness of < 0 show that, for accuracy, the social network-based feature and, for sensitivity, the energy-based feature give the best results, but there is no single best detection technique for accuracy and sensitivity. Besides, they demonstrate that the detection model using the Markov switching model with jumps (MSJ) with the social network-based feature is the best combination for accuracy; there is no single best detection model for sensitivity. This paper helps move toward preventing national political risk, and ultimately toward predictive governance that benefits the people.

Keywords: Political risk
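
A sketch assuming the hmmlearn package: fit a two-state Gaussian HMM to a synthetic daily negative-topic intensity series and flag days assigned to the higher-mean state. This is a generic HMM detector for illustration, not the Markov switching model with jumps evaluated in [277] above.

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    rng = np.random.default_rng(1)
    calm = rng.normal(0.2, 0.05, 60)          # ordinary days
    burst = rng.normal(0.8, 0.10, 10)         # synthetic outbreak period
    series = np.concatenate([calm, burst]).reshape(-1, 1)

    hmm = GaussianHMM(n_components=2, n_iter=200, random_state=1).fit(series)
    states = hmm.predict(series)
    outbreak_state = int(np.argmax(hmm.means_.ravel()))
    outbreak_days = np.where(states == outbreak_state)[0]
    print("flagged days:", outbreak_days)
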
[278] Subhashini Venugopalan and Varun Rai. Topic based classification and pattern identification in patents. Technological Forecasting and Social Change, 94(0):236 - 250, 2015. [ bib | DOI | http ]
Patent classification systems and citation networks are used extensively in innovation studies. However, non-unique mapping of classification codes onto specific products/markets and the difficulties in accurately capturing knowledge flows based just on citation linkages present limitations to these conventional patent analysis approaches. We present a natural language processing based hierarchical technique that enables the automatic identification and classification of patent datasets into technology areas and sub-areas. The key novelty of our technique is to use topic modeling to map patents to probability distributions over real world categories/topics. Accuracy and usefulness of our technique are tested on a dataset of 10,201 patents in solar photovoltaics filed in the United States Patent and Trademark Office (USPTO) between 2002 and 2013. We show that linguistic features from topic models can be used to effectively identify the main technology area that a patent's invention applies to. Our computational experiments support the view that the topic distribution of a patent offers a reduced-form representation of the knowledge content in a patent. Accordingly, we suggest that this hidden thematic structure in patents can be useful in studies of the policy–innovation–geography nexus. To that end, we also demonstrate an application of our technique for identifying patterns in technological convergence.

Keywords: Patents
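
To illustrate the reduced-form representation used in [278] (each patent mapped to a distribution over topics), the sketch below fits an LDA model with scikit-learn and assigns each document to its dominant topic. The three toy abstracts, the topic count, and the dominant-topic rule are assumptions; the paper's hierarchical classification pipeline is not reproduced.

    # Hedged sketch: documents -> LDA topic distributions -> dominant-topic labels.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    patents = [
        "thin film solar cell with transparent conductive oxide layer",
        "photovoltaic module mounting bracket for rooftop installation",
        "organic semiconductor material for flexible solar panels",
    ]

    counts = CountVectorizer(stop_words="english").fit_transform(patents)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

    theta = lda.transform(counts)          # per-document topic distributions
    dominant_topic = theta.argmax(axis=1)  # crude "technology area" assignment
    print(dominant_topic)
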
[279] Zhi-Hong Deng and Xiao-Ran Xu. Fast mining erasable itemsets using NC_sets. Expert Systems with Applications, 39(4):4453 - 4463, 2012. [ bib | DOI | http ]
Mining erasable itemsets, first introduced in 2009, is one of the newly emerging data mining tasks. In this paper, we present a new data representation called NC_set, which keeps track of the complete information used for mining erasable itemsets. Based on NC_set, we propose a new algorithm called MERIT for mining erasable itemsets efficiently. The efficiency of MERIT is achieved with three techniques as follows. First, the NC_set is a compact structure, which prunes irrelevant data automatically. Second, the computation of the gain of an itemset is transformed into the combination of NC_sets, which can be completed in linear time complexity by an ingenious strategy. Third, MERIT can directly find erasable itemsets without generating candidate itemsets in some cases. To evaluate MERIT, we have conducted extensive experiments on a number of synthetic product databases. Our performance study shows that MERIT is efficient and is on average about two orders of magnitude faster than META, the first algorithm for mining erasable itemsets.

Keywords: Data mining
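
For readers unfamiliar with the erasable-itemset task addressed by MERIT in [279], the brute-force baseline below makes the gain definition concrete: an itemset is erasable if the profit of the products whose manufacture requires any of its items is at most a user-defined fraction of total profit. The toy product database and threshold are assumptions, and this is the plain definition, not the NC_set-based algorithm.

    # Hedged sketch: brute-force erasable itemset enumeration (definition only, not MERIT).
    from itertools import combinations

    # Each product is (set of component items, profit).
    products = [({"a", "b"}, 100), ({"a", "c"}, 400), ({"b", "d"}, 30), ({"c", "d"}, 20)]
    threshold = 0.3                                   # max tolerated fraction of lost profit
    total_profit = sum(p for _, p in products)
    items = sorted(set().union(*(s for s, _ in products)))

    def gain(itemset):
        """Profit lost if the items in `itemset` can no longer be sourced."""
        return sum(p for components, p in products if components & itemset)

    erasable = [frozenset(c)
                for r in range(1, len(items) + 1)
                for c in combinations(items, r)
                if gain(frozenset(c)) <= threshold * total_profit]
    print(erasable)
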
[280] Magdalini Eirinaki, Shamita Pisal, and Japinder Singh. Feature-based opinion mining and ranking. Journal of Computer and System Sciences, 78(4):1175 - 1184, 2012. [ bib | DOI | http ]
The proliferation of blogs and social networks presents a new set of challenges and opportunities in the way information is searched and retrieved. Even though facts still play a very important role when information is sought on a topic, opinions have become increasingly important as well. Opinions expressed in blogs and social networks are playing an important role influencing everything from the products people buy to the presidential candidate they support. Thus, there is a need for a new type of search engine which will not only retrieve facts, but will also enable the retrieval of opinions. Such a search engine can be used in a number of diverse applications like product reviews to aggregating opinions on a political candidate or issue. Enterprises can also use such an engine to determine how users perceive their products and how they stand with respect to competition. This paper presents an algorithm which not only analyzes the overall sentiment of a document/review, but also identifies the semantic orientation of specific components of the review that lead to a particular sentiment. The algorithm is integrated in an opinion search engine which presents results to a query along with their overall tone and a summary of sentiments of the most important features.

Keywords: Opinion mining
[281] Tony Cheng-Kui Huang. Mining the change of customer behavior in fuzzy time-interval sequential patterns. Applied Soft Computing, 12(3):1068 - 1086, 2012. [ bib | DOI | http ]
Comprehending changes of customer behavior is an essential problem that must be faced for survival in a fast-changing business environment. Particularly in the management of electronic commerce (EC), many companies have developed on-line shopping stores to serve customers and immediately collect buying logs in databases. This trend has led to the development of data-mining applications. Fuzzy time-interval sequential pattern mining is one type of serviceable data-mining technique that discovers customer behavioral patterns over time. To take a shopping example, (Bread, Short, Milk, Long, Jam) means that Bread is bought before Milk within a Short period, and Jam is bought after Milk within a Long period, where Short and Long are predetermined linguistic terms given by managers. The information shown in this example reveals more general and concise knowledge for managers, allowing them to make quick-response decisions, especially in business. However, no studies, to our knowledge, have yet addressed the issue of changes in fuzzy time-interval sequential patterns. The fuzzy time-interval sequential pattern (Bread, Short, Milk, Long, Jam) may have held last year; however, it is no longer the trend this year and has been substituted by (Bread, Short, Yogurt, Short, Jam). Without updating this knowledge, managers might map out inappropriate marketing plans for products or services and dated inventory strategies with respect to time-intervals. To deal with this problem, we propose a novel change mining model, MineFuzzChange, to detect the change in fuzzy time-interval sequential patterns. Using a brick-and-mortar transactional dataset collected from a retail chain in Taiwan and a B2C EC dataset, experiments are carried out to evaluate the proposed model. We empirically demonstrate how the model helps managers to understand the changing behaviors of their customers and to formulate timely marketing and inventory strategies.

Keywords: Data mining
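
To make the linguistic time-interval idea in [281] concrete, the fragment below fuzzifies the gap between two purchases into Short/Medium/Long memberships with triangular functions. The breakpoints (in days) are invented for illustration; as the abstract notes, in practice they would be set by managers.

    # Hedged sketch: triangular fuzzy membership of a purchase time interval (in days).
    def triangular(x, a, b, c):
        """Membership of x in a triangle peaking at b with feet at a and c."""
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    def fuzzify_interval(days):
        # Breakpoints are illustrative assumptions, not values from the paper.
        return {
            "Short":  triangular(days, -1, 0, 10),
            "Medium": triangular(days, 3, 10, 20),
            "Long":   triangular(days, 14, 30, 90),
        }

    print(fuzzify_interval(5))    # mostly Short, partly Medium
    print(fuzzify_interval(25))   # mostly Long
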
[282] Karel Dejaeger, Frank Goethals, Antonio Giangreco, Lapo Mola, and Bart Baesens. Gaining insight into student satisfaction using comprehensible data mining techniques. European Journal of Operational Research, 218(2):548 - 562, 2012. [ bib | DOI | http ]
As a consequence of the heightened competition on the education market, the management of educational institutions often attempts to collect information on what drives student satisfaction by e.g. organizing large scale surveys amongst the student population. Until now, this source of potentially very valuable information remains largely untapped. In this study, we address this issue by investigating the applicability of different data mining techniques to identify the main drivers of student satisfaction in two business education institutions. In the end, the resulting models are to be used by the management to support the strategic decision making process. Hence, the aspect of model comprehensibility is considered to be at least equally important as model performance. It is found that data mining techniques are able to select a surprisingly small number of constructs that require attention in order to manage student satisfaction.

Keywords: Data mining
[283] Dóra Szakonyi, Sofie Van Landeghem, Katja Baerenfaller, Lieven Baeyens, Jonas Blomme, Rubén Casanova-Sáez, Stefanie De Bodt, David Esteve-Bruna, Fabio Fiorani, Nathalie Gonzalez, Jesper Grønlund, Richard G.H. Immink, Sara Jover-Gil, Asuka Kuwabara, Tamara Muñoz-Nortes, Aalt D.J. van Dijk, David Wilson-Sánchez, Vicky Buchanan-Wollaston, Gerco C. Angenent, Yves Van de Peer, Dirk Inzé, José Luis Micol, Wilhelm Gruissem, Sean Walsh, and Pierre Hilson. The KnownLeaf literature curation system captures knowledge about Arabidopsis leaf growth and development and facilitates integrated data mining. Current Plant Biology, 2(0):1 - 11, 2015. [ bib | DOI | http ]
The information that connects genotypes and phenotypes is essentially embedded in research articles written in natural language. To facilitate access to this knowledge, we constructed a framework for the curation of the scientific literature studying the molecular mechanisms that control leaf growth and development in Arabidopsis thaliana (Arabidopsis). Standard structured statements, called relations, were designed to capture diverse data types, including phenotypes and gene expression linked to genotype description, growth conditions, genetic and molecular interactions, and details about molecular entities. Relations were then annotated from the literature, defining the relevant terms according to standard biomedical ontologies. This curation process was supported by a dedicated graphical user interface, called Leaf Knowtator. A total of 283 primary research articles were curated by a community of annotators, yielding 9947 relations monitored for consistency and over 12,500 references to Arabidopsis genes. This information was converted into a relational database (KnownLeaf) and merged with other public Arabidopsis resources relative to transcriptional networks, protein–protein interaction, gene co-expression, and additional molecular annotations. Within KnownLeaf, leaf phenotype data can be searched together with molecular data originating either from this curation initiative or from external public resources. Finally, we built a network (LeafNet) with a portion of the KnownLeaf database content to graphically represent the leaf phenotype relations in a molecular context, offering an intuitive starting point for knowledge mining. Literature curation efforts such as ours provide high quality structured information accessible to computational analysis, and thereby to a wide range of applications. DATA: The presented work was performed in the framework of the AGRON-OMICS project (Arabidopsis GROwth Network integrating OMICS technologies) supported by the European Commission 6th Framework Programme (Grant number LSHG-CT-2006-037704). This is a data integration and data sharing portal collecting all the major results from the consortium. All data presented in our paper are available at https://agronomics.ethz.ch/.

Keywords: Arabidopsis
[284] Puteri N.E. Nohuddin, Frans Coenen, Rob Christley, Christian Setzkorn, Yogesh Patel, and Shane Williams. Finding “interesting” trends in social networks using frequent pattern mining and self organizing maps. Knowledge-Based Systems, 29(0):104 - 113, 2012. Artificial Intelligence 2010. [ bib | DOI | http ]
This paper introduces a technique that uses frequent pattern mining and SOM techniques to identify, group and analyse trends in sequences of time stamped social networks so as to identify “interesting” trends. In this study, trends are defined in terms of a series of occurrence counts associated with frequent patterns that may be identified within social networks. Typically a large number of frequent patterns, and by extension a large number of trends, are discovered. Thus, to assist with the analysis of the discovered trends, the use of SOM techniques is advocated so that similar trends can be grouped together. To identify “interesting” trends, a sequence of SOMs is generated which can be interpreted by considering how trends move from one SOM to the next. The further a trend moves from one SOM to the next, the more “interesting” the trend is deemed to be. The study focuses on two types of network, Star networks and Complex star networks, exemplified by two real applications: the Cattle Tracing System in operation in Great Britain and a car insurance quotation application.

Keywords: Trends
[285] Feng Zhao, Zheng Sun, and Hai Jin. Topic-centric and semantic-aware retrieval system for internet of things. Information Fusion, 23(0):33 - 42, 2015. [ bib | DOI | http ]
The Internet of things (IoT) has been considered as one of the promising paradigms that can allow people and objects to seamlessly interact. So far, numerous applications and services have been proposed, such as retrieval services. Retrieval, however, faces a big challenge in IoT because the data belongs to different domains and user interaction with the surrounding environment is constrained. This paper proposes Acrost, a retrieval system based on topic discovery and semantic awareness in an IoT environment. The initial contents with interesting information are obtained through the combination of two topic-centric collectors. The metadata is extracted by aggregating regular expression-based and conditional random fields-based approaches. Moreover, semantic-aware retrieval is achieved by parsing the query and ranking the relevance of contents. In addition, we present a case study on academic conference retrieval to validate the proposed approaches. Experimental results show that the proposed system can significantly improve the response time and efficiency of topic self-adaptive retrieval.

Keywords: Internet of things
[286] Ali Serhan Koyuncugil and Nermin Ozgulbas. Financial early warning system model and data mining application for risk detection. Expert Systems with Applications, 39(6):6238 - 6253, 2012. [ bib | DOI | http ]
One of the biggest problems of SMEs is their tendency toward financial distress because of an insufficient finance background. In this study, an early warning system (EWS) model based on data mining for financial risk detection is presented. The CHAID algorithm has been used for development of the EWS. Thanks to its automated nature, the developed EWS can serve as a tailor-made financial advisor in the decision-making process of firms that lack an adequate financial background. In addition, an application of the model was implemented covering 7853 SMEs, based on Turkish Central Bank (TCB) 2007 data. By using the EWS model, 31 risk profiles, 15 risk indicators, 2 early warning signals, and 4 financial road maps have been determined for financial risk mitigation.

Keywords: CHAID
[287] Chee Kian Leong, Yew Haur Lee, and Wai Keong Mak. Mining sentiments in SMS texts for teaching evaluation. Expert Systems with Applications, 39(3):2584 - 2589, 2012. [ bib | DOI | http ]
This paper explores the potential application of sentiment mining for analyzing short message service (SMS) texts in teaching evaluation. Data preparation involves the reading, parsing and categorization of the SMS texts. Three models were developed: the base model, the “corrected” model which adjusts for spelling errors and the “sentiment” model which extends the “corrected” model by performing sentiment mining. An “interestingness” criterion selects the “sentiment” model from which the sentiments of the students towards the lecture are discerned. Two types of incomplete SMS texts are also identified and the implications of their removal for the analysis ascertained.

Keywords: Sentiment mining
[288] Shihang Huang, Ying Liu, and Depeng Dang. Burst topic discovery and trend tracing based on Storm. Physica A: Statistical Mechanics and its Applications, 416(0):331 - 339, 2014. [ bib | DOI | http ]
With the rapid development of the Internet and the spread of mobile Internet, microblogs have become a major source and route of transmission for public opinion, including burst topics that are caused by emergencies. To facilitate real-time mining of a large range of burst topics, in this paper we propose a method to discover burst topics in real time and trace their trends based on the variation trends of word frequencies. First, for the variation trend of the words in microblogs, we adopt a non-homogeneous Poisson process model to fit the data. To represent the heat and trend of the words, we introduce a heat degree factor and a trend degree factor and realise the real-time discovery and trend tracing of burst topics based on these two factors. Second, to improve computing performance, the method is built on the Storm stream computing framework for real-time computing. Finally, the experimental results indicate that by adjusting the observation window size and trend degree threshold, topics with different cycles and different burst strengths can be discovered.

Keywords: Non-homogeneous Poisson process
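
A stripped-down, offline illustration of the burst idea in [288]: estimate a baseline Poisson rate for a word from earlier observation windows and flag the current window as bursting when the observed count is improbably high under that rate. The Storm-based streaming pipeline and the paper's heat/trend degree factors are not reproduced; the counts and significance level are assumptions.

    # Hedged sketch: flag a bursting word via a one-sided Poisson tail test against a baseline rate.
    from scipy.stats import poisson

    history = [3, 4, 2, 5, 3, 4]   # word counts in past equal-sized windows (toy data)
    current = 15                   # count in the current window
    alpha = 0.01                   # significance level (assumption)

    baseline_rate = sum(history) / len(history)
    p_value = poisson.sf(current - 1, baseline_rate)   # P(X >= current) under the baseline

    if p_value < alpha:
        print(f"burst: count={current}, baseline={baseline_rate:.1f}, p={p_value:.2e}")
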
[289] Tony Cheng-Kui Huang, Chuang-Chun Liu, and Dong-Cheng Chang. An empirical investigation of factors influencing the adoption of data mining tools. International Journal of Information Management, 32(3):257 - 270, 2012. [ bib | DOI | http ]
Previous studies explored the adoption of various information technologies. However, there is little empirical research on factors influencing the adoption of data mining tools (DMTs), particularly at an individual level. This study investigates how users perceive and adopt {DMTs} to broaden practical knowledge for the business intelligence community. First, this study develops a theoretical model based on the Technology Acceptance Model 3, and then examines its perceived usefulness, perceived ease of use, and its ability to explain users’ intentions to use DMTs. The model's determinants include 4 categories: the task-oriented dimension (job relevance, output quality, result demonstrability, response time, and format), control beliefs (computer self-efficacy and perceptions of external control), emotion (computer anxiety), and intrinsic motivation (computer playfulness). This study also surveys the moderating effect of experience and output quality on the determinants of {DMT} adoption and use. An empirical study involving 206 {DMT} users was conducted to evaluate the model using structural equation modeling. Results demonstrate that the proposed model explains 58% of the variance. The findings of this study have interesting implications with respect to {DMT} adoption, both for researchers and practitioners.

Keywords: Technology acceptance model
[290] Yi Zhang, Flora S. Tsai, and Agus Trisnajaya Kwee. Multilingual sentence categorization and novelty mining. Information Processing & Management, 47(5):667 - 675, 2011. Managing and Mining Multilingual Documents. [ bib | DOI | http ]
A challenge for sentence categorization and novelty mining is to detect not only when text is relevant to the user’s information need, but also when it contains something new which the user has not seen before. It involves two tasks that need to be solved. The first is identifying relevant sentences (categorization) and the second is identifying new information from those relevant sentences (novelty mining). Many previous studies of relevant sentence retrieval and novelty mining have been conducted on the English language, but few papers have addressed the problem of multilingual sentence categorization and novelty mining. This is an important issue in global business environments, where mining knowledge from text in a single language is not sufficient. In this paper, we perform the first task by categorizing Malay and Chinese sentences, then comparing their performances with that of English. Thereafter, we conduct novelty mining to identify the sentences with new information. Experimental results on {TREC} 2004 Novelty Track data show similar categorization performance on Malay and English sentences, which greatly outperform Chinese. In the second task, it is observed that we can achieve similar novelty mining results for all three languages, which indicates that our algorithm is suitable for novelty mining of multilingual sentences. In addition, after benchmarking our results with novelty mining without categorization, it is learnt that categorization is necessary for the successful performance of novelty mining.

Keywords: Multilingual categorization
[291] Philippe Fournier-Viger, Usef Faghihi, Roger Nkambou, and Engelbert Mephu Nguifo. CMRules: Mining sequential rules common to several sequences. Knowledge-Based Systems, 25(1):63 - 76, 2012. Special Issue on New Trends in Data Mining. [ bib | DOI | http ]
Sequential rule mining is an important data mining task used in a wide range of applications. However, current algorithms for discovering sequential rules common to several sequences use very restrictive definitions of sequential rules, which make them unable to recognize that similar rules can describe a same phenomenon. This can have many undesirable effects such as (1) similar rules that are rated differently, (2) rules that are not found because they are considered uninteresting when taken individually, (3) and rules that are too specific, which makes them less likely to be used for making predictions. In this paper, we address these problems by proposing a more general form of sequential rules such that items in the antecedent and in the consequent of each rule are unordered. We propose an algorithm named {CMRules} for mining this form of rules. The algorithm proceeds by first finding association rules to prune the search space for items that occur jointly in many sequences. Then it eliminates association rules that do not meet the minimum confidence and support thresholds according to the sequential ordering. We evaluate the performance of {CMRules} in three different ways. First, we provide an analysis of its time complexity. Second, we compare its performance (in terms of execution time, memory usage and scalability) with an adaptation of an algorithm from the literature that we name CMDeo. For this comparison, we use three real-life public datasets, which have different characteristics and represent three kinds of data. In many cases, results show that {CMRules} is faster and has a better scalability for low support thresholds than CMDeo. Lastly, we report a successful application of the algorithm in a tutoring agent.

Keywords: Sequential rule mining
[292] Aleksander Tonkovich, Zhanbiao Li, Sante DiCecco, William Altenhof, Richard Banting, and Henry Hu. Experimental observations of tyre deformation characteristics on heavy mining vehicles under static and quasi-static loading. Journal of Terramechanics, 49(3–4):215 - 231, 2012. [ bib | DOI | http ]
Due to large sidewall and bead thicknesses, multi-piece rims are necessary for use with large off-the-road (OTR) tyres. This paper presents the testing protocol and observed load/deflection and vertical/sidewall deflection characteristics of three Goodyear {OTR} tyre assemblies, namely, (1) a radial 29.5R29 (2) a bias-ply 29.5-29, and (3) a bias-ply 26.5-26. Localized tyre deformations and rim displacements were measured using optical displacement transducers and post-processing high-speed camera images using digital image analysis software. A validation analysis illustrated a maximum difference of 4.05% of vertical wheel displacements between the aforementioned methods. Quasi-static tests show the maximum values of vertical rim displacement and lateral tyre deflection are in the range of 72.2–78.9 mm and 23.3–27.1 mm, respectively, for a severe excitation condition. Differences ranging from 0.2% to 21.5% for maximum vertical and lateral tyre deflections were found between static load tests and engineering data provided by the tyre manufacturer. Linear relationships were observed for both vertical wheel displacement and lateral tyre deflection versus load for all tests. This study demonstrates a thorough methodology to study deflection characteristics of heavy duty {OTR} tyres and the collected data could be very useful in the development of numerical models of wheel and tyre assemblies for mining vehicles.

Keywords: Heavy mining vehicle
[293] Zeynab Abbasi Khalifelu and Farhad Soleimanian Gharehchopogh. Comparison and evaluation of data mining techniques with algorithmic models in software cost estimation. Procedia Technology, 1(0):65 - 71, 2012. First World Conference on Innovation and Computer Sciences (INSODE 2011). [ bib | DOI | http ]
Software Cost Estimation (SCE) has been one of the important topics in software development over recent decades. Realistic estimation requires modeling the cost and effort factors of software production using algorithmic or Artificial Intelligence (AI) techniques. Boehm developed the Constructive Cost Model (COCOMO), one of the algorithmic SCE models. These models come in three increasingly detailed forms: basic, intermediate and detailed. Basic COCOMO is suitable for quick, early, rough-order estimates of the effort required to produce software, but its accuracy is limited because it lacks factors that account for differences between cost drivers. Intermediate COCOMO takes these project attributes into account, and detailed COCOMO additionally accounts for the individual project phases. The COCOMO family of algorithmic techniques has been used since 1981. In recent years, intelligent techniques have emerged for estimating the effort required to produce software. In this paper, different data mining techniques for estimating software costs are presented, and the results of each technique are evaluated and compared; NASA project data are used to train and test each of the techniques. The comparison between the COCOMO model and the data mining techniques shows that the data mining techniques improve estimation accuracy in many cases.

Keywords: Software Cost Estimation
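
Since [293] benchmarks data mining estimators against COCOMO, it may help to recall the basic COCOMO form the paper refers to: effort in person-months is E = a * (KLOC)^b, with a = 2.4 and b = 1.05 for organic-mode projects. The snippet evaluates that textbook formula for an arbitrary example size; it is not the paper's experimental setup.

    # Basic COCOMO (organic mode): effort in person-months as a function of size in KLOC.
    def basic_cocomo_effort(kloc, a=2.4, b=1.05):
        return a * kloc ** b

    print(f"{basic_cocomo_effort(32):.1f} person-months for a 32 KLOC organic project")
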
[295] Ram Gopal, James R. Marsden, and Jan Vanthienen. Information mining — reflections on recent advancements and the road ahead in data, text, and media mining. Decision Support Systems, 51(4):727 - 731, 2011. Recent Advances in Data, Text, and Media Mining & Information Issues in Supply Chain and in Service System Design. [ bib | DOI | http ]
In this introduction, we briefly summarize the state of data and text mining today. Taking a very broad view, we use the term information mining to refer to the organization and analysis of structured or unstructured data that can be quantitative, textual, and/or pictorial in nature. The key question, in our view, is, “How can we transform data (in the very broad sense of this term) into ‘actionable knowledge’, knowledge that we can use in pursuit of a specified objective(s).” After detailing a set of key components of information mining, we introduce each of the papers in this volume and detail the focus of their contributions.

Keywords: Data mining
[296] J.S. Hallinan. Chapter 2 - data mining for microbiologists. In Colin Harwood and Anil Wipat, editors, Systems Biology of Bacteria, volume 39 of Methods in Microbiology, pages 27 - 79. Academic Press, 2012. [ bib | DOI | http ]
The enormous amounts of molecular microbiological data currently produced by high-throughput analytical techniques pose both huge opportunities and huge challenges for microbiologists. With over 1000 databases online, it is clearly not feasible for researchers to manually search each one for information about the genes and processes in which they are interested. Much of the data stored in these databases never makes it into the peer-reviewed literature, and so becomes essentially unavailable in its entirety. A powerful approach to maximising the usefulness of large datasets, whether generated in-house or obtained from public repositories, is data integration and mining. Data integration is the process of bringing together large amounts of disparate data into a single, computationally accessible data source, while data mining is the process of finding hidden patterns and relationships in such large datasets. A wide range of algorithms is used for data mining, including established statistical methods, and approaches from the field of machine learning. The various algorithms available have different strengths and weaknesses, and are applicable to different types of data. In this review we first discuss the data mining life cycle and then describe some of the most widely used algorithms, illustrating their applications with examples from the microbiological literature. Where possible, we have identified freely available software for implementing these algorithms.

Keywords: Data mining
[297] Deanna Kemp, John R. Owen, and Shashi van de Graaff. Corporate social responsibility, mining and “audit culture”. Journal of Cleaner Production, 24(0):1 - 10, 2012. [ bib | DOI | http ]
This article engages internal organizational aspects of ‘accountability’ for corporate social responsibility (CSR) in mining by challenging the current ‘audit culture’. Audits offer a tool through which to shape and regulate corporate social performance (CSP). Where audits have limited value is in their ability to stimulate internal engagement around social and organizational norms and principles, as the process relies on auditors to generate performance data against pre-selected indicators. Data is then utilized to produce a measure of risk or effectiveness through which to demonstrate compliance. Focusing on the internal organizational aspects of accountability and the processes, mechanisms and methodologies used to establish critical reflection, three alternatives within the current audit regime are presented. These forms of ‘new accounting’ stand in contrast to conventional auditing, as their focus is on building cross-functional connections and collaborative internal relationships that are based on dialogue and mutual exchange about the problems and possibilities of {CSR} implementation.

Keywords: Audit
[298] Flora S. Tsai and Agus T. Kwee. Experiments in term weighting for novelty mining. Expert Systems with Applications, 38(11):14094 - 14101, 2011. [ bib | DOI | http ]
Obtaining new information in a short time is becoming crucial in today’s economy. A lot of information, both offline and online, is easily acquired, exacerbating the problem of information overload. Novelty mining detects documents/sentences that contain novel or new information and presents those results directly to users (Tang, Tsai, & Chen, 2010). Many methods and algorithms for novelty mining have previously been studied, but none have compared and discussed the impact of term weighting on the evaluation measures. This paper performs experiments to recommend the best term weighting function for both document- and sentence-level novelty mining.

Keywords: Novelty mining
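
To ground the term-weighting question studied in [298], the sketch below scores an incoming sentence for novelty as one minus its maximum cosine similarity to previously seen sentences under TF-IDF weighting; swapping the vectorizer changes the term weighting function being compared. The sentences and the novelty threshold are illustrative assumptions, not the paper's data or its recommended setting.

    # Hedged sketch: cosine-similarity novelty scoring under a chosen term weighting (TF-IDF here).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    history = ["the company released quarterly earnings",
               "earnings beat analyst expectations this quarter"]
    incoming = "the firm announced a merger with its largest competitor"

    vectors = TfidfVectorizer().fit_transform(history + [incoming])
    similarity_to_history = cosine_similarity(vectors[-1], vectors[:-1])

    novelty = 1.0 - similarity_to_history.max()
    print("novel" if novelty > 0.8 else "not novel", round(float(novelty), 3))
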
[299] Wouter Verbeke, Karel Dejaeger, David Martens, Joon Hur, and Bart Baesens. New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research, 218(1):211 - 229, 2012. [ bib | DOI | http ]
Customer churn prediction models aim to indicate the customers with the highest propensity to attrite, allowing to improve the efficiency of customer retention campaigns and to reduce the costs associated with churn. Although cost reduction is their prime objective, churn prediction models are typically evaluated using statistically based performance measures, resulting in suboptimal model selection. Therefore, in the first part of this paper, a novel, profit centric performance measure is developed, by calculating the maximum profit that can be generated by including the optimal fraction of customers with the highest predicted probabilities to attrite in a retention campaign. The novel measure selects the optimal model and fraction of customers to include, yielding a significant increase in profits compared to statistical measures. In the second part an extensive benchmarking experiment is conducted, evaluating various classification techniques applied on eleven real-life data sets from telecom operators worldwide by using both the profit centric and statistically based performance measures. The experimental results show that a small number of variables suffices to predict churn with high accuracy, and that oversampling generally does not improve the performance significantly. Finally, a large group of classifiers is found to yield comparable performance.

Keywords: Data mining
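
The profit-centric idea in [299] can be approximated in a few lines: rank customers by predicted churn probability and, for each possible targeted fraction, compute the expected campaign profit, keeping the maximum. The per-customer value, contact cost, acceptance rate, and scores below are invented solely to show the mechanics, not the paper's parameterization or its exact measure.

    # Hedged sketch: choose the targeted fraction that maximizes expected retention-campaign profit.
    import numpy as np

    scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])   # predicted churn probabilities
    is_churner = np.array([1, 1, 0, 1, 0, 0, 0, 0])               # ground truth (toy data)
    clv, contact_cost, accept_rate = 200.0, 10.0, 0.5             # assumed campaign economics

    churn_sorted = is_churner[np.argsort(-scores)]                # customers ordered by score

    profits = []
    for n in range(1, len(scores) + 1):
        saved = accept_rate * churn_sorted[:n].sum()              # churners retained by the offer
        profits.append(saved * clv - n * contact_cost)

    best_n = int(np.argmax(profits)) + 1
    print(f"target the top {best_n} customers, expected profit {max(profits):.0f}")
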
[300] Chun-Wei Lin and Tzung-Pei Hong. Temporal data mining with up-to-date pattern trees. Expert Systems with Applications, 38(12):15143 - 15150, 2011. [ bib | DOI | http ]
Mining interesting and useful frequent patterns from large databases has attracted much attention in recent years. Among the mining approaches, finding temporal patterns and regularities is very important due to its practicality. In the past, Hong et al. proposed up-to-date patterns, which are frequent within their up-to-date lifetime. Formally, an up-to-date pattern is a pair consisting of an itemset and its valid corresponding lifetime in which the user-defined minimum support threshold must be satisfied. They also proposed an Apriori-like approach to find the up-to-date patterns. This paper thus proposes the up-to-date pattern tree (UDP tree) to keep the up-to-date 1-patterns in a tree structure for reducing database scans. It is similar to the FP-tree structure but more complex due to the requirements of up-to-date patterns. The UDP-growth mining approach is also designed to find the up-to-date patterns from the UDP tree. The experimental results show that the proposed approach has a better performance than the level-wise mining algorithm.

Keywords: Data mining
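
The up-to-date pattern notion in [300] pairs an itemset with the most recent lifetime in which it is still frequent. Ignoring the UDP-tree machinery, the brute-force check below finds, for a single itemset, the earliest starting transaction from which the itemset meets the minimum support over the remaining (up-to-date) part of the database; the toy transactions and threshold are assumptions.

    # Hedged sketch: longest lifetime ending at the newest transaction in which the itemset
    # is still frequent (brute force, not the UDP-tree/UDP-growth algorithm).
    def up_to_date_lifetime(transactions, itemset, min_support_ratio):
        itemset = set(itemset)
        for start in range(len(transactions)):                     # try the longest lifetime first
            window = transactions[start:]
            support = sum(1 for t in window if itemset <= set(t))
            if support / len(window) >= min_support_ratio:
                return start, support
        return None

    transactions = [["a"], ["b", "c"], ["a", "b"], ["b", "c"], ["b", "c", "d"], ["b", "c"]]
    print(up_to_date_lifetime(transactions, {"b", "c"}, 0.75))     # (1, 4): frequent from transaction 1 onward
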
[301] Ying Liu, Hui Zhang, Chunping Li, and Roger Jianxin Jiao. Workflow simulation for operational decision support using event graph through process mining. Decision Support Systems, 52(3):685 - 697, 2012. [ bib | DOI | http ]
It is increasingly common to see computer-based simulation being used as a vehicle to model and analyze business processes in relation to process management and improvement. While there are a number of business process management (BPM) and business process simulation (BPS) methodologies, approaches and tools available, it is more desirable to have a systemic {BPS} approach for operational decision support, from constructing process models based on historical data to simulating processes for typical and common problems. In this paper, we have proposed a generic approach of {BPS} for operational decision support which includes business processes modeling and workflow simulation with the models generated. Processes are modeled with event graphs through process mining from workflow logs that have integrated comprehensive information about the control-flow, data and resource aspects of a business process. A case study of a credit card application is presented to illustrate the steps involved in constructing an event graph. The evaluation detail is also given in terms of precision, generalization and robustness. Based on the event graph model constructed, we simulate the process under different scenarios and analyze the simulation logs for three generic problems in the case study: 1) suitable resource allocation plan for different case arrival rates; 2) teamwork performance under different case arrival rates; and 3) evaluation and prediction for personal performances. Our experimental results show that the proposed approach is able to model business processes using event graphs and simulate the processes for common operational decision support which collectively play an important role in process management and improvement.

Keywords: Business process management
[302] Luca Casaburi, Francesco Colace, Massimo De Santo, and Luca Greco. “magic mirror in my hand, what is the sentiment in the lens?”: An action unit based approach for mining sentiments from multimedia contents. Journal of Visual Languages & Computing, 27(0):19 - 28, 2015. Distributed Multimedia Systems {DMS2014} Part {II}. [ bib | DOI | http ]
In psychology and philosophy, emotion is a subjective, conscious experience characterized primarily by psychophysiological expressions, biological reactions, and mental states. Emotion could be also considered as a “positive or negative experience” that is associated with a particular pattern of physiological activity. So, the extraction and recognition of emotions from multimedia contents is becoming one of the most challenging research topics in human–computer interaction. Facial expressions, posture, gestures, speech, and emotive changes of physical parameters (e.g. body temperature, blush and changes in the tone of the voice) can reflect changes in the user's emotional state, and all these kinds of parameters can be detected and interpreted by a computer, leading to the so-called “affective computing”. In this paper an approach for the extraction of emotions from images and videos will be introduced. In particular, it involves the adoption of action units' extraction from facial expression according to the Ekman theory. The proposed approach has been tested on standard and real datasets with interesting and promising results.

Keywords: Affective computing
[303] W. Boulila, I.R. Farah, K. Saheb Ettabaa, B. Solaiman, and H. Ben Ghézala. A data mining based approach to predict spatiotemporal changes in satellite images. International Journal of Applied Earth Observation and Geoinformation, 13(3):386 - 395, 2011. [ bib | DOI | http ]
The interpretation of remotely sensed images in a spatiotemporal context is becoming a valuable research topic. However, the constant growth of data volume in remote sensing imaging makes reaching conclusions based on collected data a challenging task. Recently, data mining appears to be a promising research field leading to several interesting discoveries in various areas such as marketing, surveillance, fraud detection and scientific discovery. By integrating data mining and image interpretation techniques, accurate and relevant information (i.e. functional relation between observed parcels and a set of informational contents) can be automatically elicited. This study presents a new approach to predict spatiotemporal changes in satellite image databases. The proposed method exploits fuzzy sets and data mining concepts to build predictions and decisions for several remote sensing fields. It takes into account imperfections related to the spatiotemporal mining process in order to provide more accurate and reliable information about land cover changes in satellite images. The proposed approach is validated using {SPOT} images representing the Saint-Denis region, capital of Reunion Island. Results show good performances of the proposed framework in predicting change for the urban zone.

Keywords: Remote sensing
[304] Juan D. Velásquez, Luis E. Dujovne, and Gaston L’Huillier. Extracting significant website key objects: A semantic web mining approach. Engineering Applications of Artificial Intelligence, 24(8):1532 - 1541, 2011. Semantic-based Information and Engineering Systems. [ bib | DOI | http ]
Web mining has been traditionally used in different application domains in order to enhance the content that Web users are accessing. Likewise, Website administrators are interested in finding new approaches to improve their Website content according to their users' preferences. Furthermore, the Semantic Web has been considered as an alternative to represent Web content in a way which can be used by intelligent techniques to provide the organization, meaning, and definition of Web content. In this work, we define the Website Key Object Extraction problem, whose solution is based on a Semantic Web mining approach to extract from a given Website core ontology, new relations between objects according to their Web user interests. This methodology was applied to a real Website, whose results showed that the automatic extraction of Key Objects is highly competitive against traditional surveys applied to Web users.

Keywords: Website Key Objects
[305] Tias Guns, Siegfried Nijssen, and Luc De Raedt. Itemset mining: A constraint programming perspective. Artificial Intelligence, 175(12–13):1951 - 1983, 2011. [ bib | DOI | http ]
The field of data mining has become accustomed to specifying constraints on patterns of interest. A large number of systems and techniques has been developed for solving such constraint-based mining problems, especially for mining itemsets. The approach taken in the field of data mining contrasts with the constraint programming principles developed within the artificial intelligence community. While most data mining research focuses on algorithmic issues and aims at developing highly optimized and scalable implementations that are tailored towards specific tasks, constraint programming employs a more declarative approach. The emphasis lies on developing high-level modeling languages and general solvers that specify what the problem is, rather than outlining how a solution should be computed, yet are powerful enough to be used across a wide variety of applications and application domains. This paper contributes a declarative constraint programming approach to data mining. More specifically, we show that it is possible to employ off-the-shelf constraint programming techniques for modeling and solving a wide variety of constraint-based itemset mining tasks, such as frequent, closed, discriminative, and cost-based itemset mining. In particular, we develop a basic constraint programming model for specifying frequent itemsets and show that this model can easily be extended to realize the other settings. This contrasts with typical procedural data mining systems where the underlying procedures need to be modified in order to accommodate new types of constraint, or novel combinations thereof. Even though the performance of state-of-the-art data mining systems outperforms that of the constraint programming approach on some standard tasks, we also show that there exist problems where the constraint programming approach leads to significant performance improvements over state-of-the-art methods in data mining and as well as to new insights into the underlying data mining problems. Many such insights can be obtained by relating the underlying search algorithms of data mining and constraint programming systems to one another. We discuss a number of interesting new research questions and challenges raised by the declarative constraint programming approach to data mining.

Keywords: Data mining
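
A toy, hedged rendering of the declarative idea in [305], assuming the python-constraint package: Boolean item-selection variables and a single frequency constraint, with the solver enumerating all frequent itemsets. The paper's richer reified coverage model and its comparison of CP solvers are not reproduced; this is only meant to show the "specify what, not how" flavor.

    # Hedged sketch: frequent itemset mining phrased declaratively with python-constraint.
    from constraint import Problem  # assumed dependency (python-constraint)

    transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
    items = sorted(set().union(*transactions))
    min_support = 2

    problem = Problem()
    for item in items:
        problem.addVariable(item, [0, 1])             # 1 means the item is in the candidate itemset

    def frequent(*selection):
        chosen = {it for it, v in zip(items, selection) if v}
        if not chosen:                                # exclude the empty itemset
            return False
        return sum(1 for t in transactions if chosen <= t) >= min_support

    problem.addConstraint(frequent, items)
    for solution in problem.getSolutions():
        print({it for it, v in solution.items() if v})
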
[306] Huawen Liu and Shichao Zhang. Noisy data elimination using mutual k-nearest neighbor for classification mining. Journal of Systems and Software, 85(5):1067 - 1074, 2012. [ bib | DOI | http ]
k nearest neighbor (kNN) is an effective and powerful lazy learning algorithm that is also easy to implement. However, its performance heavily relies on the quality of the training data. In many complex real-world applications, noise from various possible sources is often prevalent in large-scale databases. How to eliminate anomalies and improve the quality of data is still a challenge. To alleviate this problem, in this paper we propose a new anomaly removal and learning algorithm under the framework of kNN. The primary characteristic of our method is that the evidence for removing anomalies and predicting class labels of unseen instances is mutual nearest neighbors, rather than k nearest neighbors. The advantage is that pseudo nearest neighbors can be identified and will not be taken into account during the prediction process. Consequently, the final learning result is more creditable. An extensive comparative experimental analysis carried out on UCI datasets provided empirical evidence of the effectiveness of the proposed method for enhancing the performance of the k-NN rule.

Keywords: Data mining
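
The core filter in [306] (keep a training instance only when it and its neighbours select each other) can be sketched with scikit-learn's NearestNeighbors as below; the synthetic points, the value of k, and the simple "keep if it has at least one mutual neighbour" rule are simplifications of the paper's procedure, kept only to show the mechanics.

    # Hedged sketch: drop training points that have no mutual k-nearest neighbour.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])  # last point is noise
    k = 2

    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)    # +1 because each point is its own 0th neighbour
    _, idx = nn.kneighbors(X)
    knn = [set(row[1:]) for row in idx]                # k nearest neighbours of each point

    keep = [i for i in range(len(X)) if any(i in knn[j] for j in knn[i])]
    print("kept indices:", keep)                       # the isolated point is filtered out
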
[307] David L. Olson, Dursun Delen, and Yanyan Meng. Comparative analysis of data mining methods for bankruptcy prediction. Decision Support Systems, 52(2):464 - 473, 2012. [ bib | DOI | http ]
A great deal of research has been devoted to prediction of bankruptcy, to include application of data mining. Neural networks, support vector machines, and other algorithms often fit data well, but because of lack of comprehensibility, they are considered black box technologies. Conversely, decision trees are more comprehensible by human users. However, sometimes far too many rules result in another form of incomprehensibility. The number of rules obtained from decision tree algorithms can be controlled to some degree through setting different minimum support levels. This study applies a variety of data mining tools to bankruptcy data, with the purpose of comparing accuracy and number of rules. For this data, decision trees were found to be relatively more accurate compared to neural networks and support vector machines, but there were more rule nodes than desired. Adjustment of minimum support yielded more tractable rule sets.

Keywords: Bankruptcy prediction
[308] Jieh-Shan Yeh and Po-Chiang Hsu. HHUIF and MSICF: Novel algorithms for privacy preserving utility mining. Expert Systems with Applications, 37(7):4779 - 4786, 2010. [ bib | DOI | http ]
Privacy preserving data mining (PPDM) is a popular topic in the research community. How to strike a balance between privacy protection and knowledge discovery in the sharing process is an important issue. This study focuses on privacy preserving utility mining (PPUM) and presents two novel algorithms, {HHUIF} and MSICF, to achieve the goal of hiding sensitive itemsets so that the adversaries cannot mine them from the modified database. The work also minimizes the impact on the sanitized database of hiding sensitive itemsets. The experimental results show that {HHUIF} achieves lower miss costs than {MSICF} on two synthetic datasets. On the other hand, {MSICF} generally has a lower difference ratio than {HHUIF} between original and sanitized databases.

Keywords: Privacy preserving
[309] Flora S. Tsai and Agus T. Kwee. Database optimization for novelty mining of business blogs. Expert Systems with Applications, 38(9):11040 - 11047, 2011. [ bib | DOI | http ]
The widespread growth of business blogs has created opportunities for companies as channels of marketing, communication, customer feedback, and mass opinion measurement. However, many blogs often contain similar information and the sheer volume of available information really challenges the ability of organizations to act quickly in today’s business environment. Thus, novelty mining can help to single out novel information out of a massive set of text documents. This paper explores the feasibility and performance of novelty mining and database optimization of business blogs, which have not been studied before. The results show that our novelty mining system can detect novelty in our dataset of business blogs with very high accuracy, and that database optimization can significantly improve the performance.

Keywords: Novelty mining
[310] Yuan Guo, Jie Hu, and Yinghong Peng. Research on CBR system based on data mining. Applied Soft Computing, 11(8):5006 - 5014, 2011. [ bib | DOI | http ]
Case based reasoning (CBR) is a popular problem solving methodology which solves a new problem by remembering previous similar situations and reusing knowledge from the solutions to these situations. To address the traditional CBR system's excessive dependence upon experts or engineers, this paper introduces data mining technology into the CBR system: GHSOM (Growing Hierarchical Self Organizing Map), an excellent data mining tool based on ANN (artificial neural network) technology, is integrated with it. After principal features are selected from numerous initial features to represent a case, cases are organized and managed in the case base through GHSOM, and when case retrieval is conducted, a new case is guided into the corresponding sub-case base, which greatly raises the system's accuracy and efficiency. Finally, experiments are conducted to validate the effectiveness of the proposed methods by comparing them with other recent research.

Keywords: CBR
[311] Heidi A. McKee. Policy matters now and in the future: Net neutrality, corporate data mining, and government surveillance. Computers and Composition, 28(4):276 - 291, 2011. Composition 20/20: How the Future of the Web Could Sharpen the Teaching of Writing. [ bib | DOI | http ]
In this article, I will detail three key policy issues that have a profound effect on the future of the World Wide Web and Internet-based communications: net neutrality, corporate data mining, and government surveillance. Focusing on policy issues in the U.S., I will describe not only current practices and cases, but future possibilities for writers and teachers of writing. I will draw from work in composition, interdisciplinary studies on privacy, information sharing, and surveillance on the Internet, analyses of applicable policies and laws, and the advocacy efforts by organizations. Issues I will examine include the importance of and threats to net neutrality; how data mining and (so-called) privacy agreements currently work, specifically at social networking sites often used in writing classrooms; and how government and institutional surveillance is far more prevalent than many realize. I will close with recommendations for what writing instructors (and students) can do to try to craft a different future, one where writers and the visual, verbal, aural writing they read and produce online will not be collected, scrutinized, and controlled (or, realistically, at least not as much).

Keywords: Data mining
[312] Hsin-Min Lu. Detecting short-term cyclical topic dynamics in the user-generated content and news. Decision Support Systems, 70(0):1 - 14, 2015. [ bib | DOI | http ]
With the maturation of the Internet and the mobile technology, Internet users are now able to produce and consume text data in different contexts. Linking the context to the text data can provide valuable information regarding users' activities and preferences, which are useful for decision support tasks such as market segmentation and product recommendation. To this end, previous studies have proposed to incorporate into topic models contextual information such as authors' identities and timestamps. Despite recent efforts to incorporate contextual information, few studies have focused on the short-term cyclical topic dynamics that connect the changes in topic occurrences to the time of day, the day of the week, and the day of the month. Short-term cyclical topic dynamics can both characterize the typical contexts to which a user is exposed at different occasions and identify user habits in specific contexts. Both abilities are essential for decision support tasks that are context dependent. To address this challenge, we present the Probit-Dirichlet hybrid allocation (PDHA) topic model, which incorporates a document's temporal features to capture a topic's short-term cyclical dynamics. A document's temporal features enter the topic model through the regression covariates of a multinomial-Probit-like structure that influences the prior topic distribution of individual tokens. By incorporating temporal features for monthly, weekly, and daily cyclical dynamics, {PDHA} is able to capture interesting short-term cyclical patterns that characterize topic dynamics. We developed an augmented Gibbs sampling algorithm for the non-Dirichlet-conjugate setting in PDHA. We then demonstrated the utility of {PDHA} using text collections from user generated content, newswires, and newspapers. Our experiments show that {PDHA} achieves higher hold-out likelihood values compared to baseline models, including latent Dirichlet allocation (LDA) and Dirichlet-multinomial regression (DMR). The temporal features for short-term cyclical dynamics and the novel model structure of {PDHA} both contribute to this performance advantage. The results suggest that {PDHA} is an attractive approach for decision support tasks involving text mining.

Keywords: Topic models
[313] Aleksandar Kovačević, Zora Konjović, Branko Milosavljević, and Goran Nenadic. Mining methodologies from NLP publications: A case study in automatic terminology recognition. Computer Speech & Language, 26(2):105 - 126, 2012. [ bib | DOI | http ]
The task of reviewing scientific publications and keeping up with the literature in a particular domain is extremely time-consuming. Extraction and exploration of methodological information, in particular, requires systematic understanding of the literature, but in many cases is performed within a limited context of publications that can be manually reviewed by an individual or group. Automated methodology identification could provide an opportunity for systematic retrieval of relevant documents and for exploring developments within a given discipline. In this paper we present a system for the identification of methodology mentions in scientific publications in the area of natural language processing, and in particular in automatic terminology recognition. The system comprises two major layers: the first layer is an automatic identification of methodological sentences; the second layer highlights methodological phrases (segments). Each mention is categorised in four semantic categories: Task, Method, Resource/Feature and Implementation. Extraction and classification of the segments is formalised as a sequence tagging problem and four separate phrase-based Conditional Random Fields are used to accomplish the task. The system has been evaluated on a manually annotated corpus comprising 45 full text articles. The results for the segment level annotation show an F-measure of 53% for identification of Task and Method mentions (with 70% precision), whereas the F-measures for Resource/Feature and Implementation identification were 61% (with 67% precision) and 75% (with 86% precision) respectively. At the document-level, an F-measure of 72% (with 81% precision) for Task mentions, 60% (with 81% precision) for Method mentions, 74% (with 78% precision) for the Resource/Feature and 79% (with 81% precision) for the Implementation categories have been achieved. We provide a detailed analysis of errors and explore the impact that the particular groups of features have on the extraction of methodological segments.

Keywords: Information extraction
[314] Oğuz Mustapaşa, Dilek Karahoca, Adem Karahoca, Ahmet Yücel, and Huseyin Uzunboylu. Implementation of semantic web mining on e-learning. Procedia - Social and Behavioral Sciences, 2(2):5820 - 5823, 2010. Innovation and Creativity in Education. [ bib | DOI | http ]
Semantic Web is a product of Web 2.0 (second generation of web) that is supported with automated semantic agents for processing user data to help the user on ease of use and personalization of services. Web Mining is an application of data mining which focuses on discovering patterns from Web logs and data. The semantic structure can be built with the pattern or relation results discovered via web mining. By combining those two applications of both disciplines, it's possible to achieve Semantic Web Mining which is a recent hot topic in educational research. This paper gives an overview of current applications of Semantic Web Mining on e-learning which already became a base component of education.

Keywords: Semantic web
[315] Seonho Kim and Juntae Yoon. Link-topic model for biomedical abbreviation disambiguation. Journal of Biomedical Informatics, 53(0):367 - 380, 2015. [ bib | DOI | http ]
The ambiguity of biomedical abbreviations is one of the challenges in biomedical text mining systems. In particular, the handling of term variants and abbreviations without nearby definitions is a critical issue. In this study, we adopt the concepts of topic of document and word link to disambiguate biomedical abbreviations. Methods We newly suggest the link topic model inspired by the latent Dirichlet allocation model, in which each document is perceived as a random mixture of topics, where each topic is characterized by a distribution over words. Thus, the most probable expansions with respect to abbreviations of a given abstract are determined by word-topic, document-topic, and word-link distributions estimated from a document collection through the link topic model. The model allows two distinct modes of word generation to incorporate semantic dependencies among words, particularly long form words of abbreviations and their sentential co-occurring words; a word can be generated either dependently on the long form of the abbreviation or independently. The semantic dependency between two words is defined as a link and a new random parameter for the link is assigned to each word as well as a topic parameter. Because the link status indicates whether the word constitutes a link with a given specific long form, it has the effect of determining whether a word forms a unigram or a skipping/consecutive bigram with respect to the long form. Furthermore, we place a constraint on the model so that a word has the same topic as a specific long form if it is generated in reference to the long form. Consequently, documents are generated from the two hidden parameters, i.e. topic and link, and the most probable expansion of a specific abbreviation is estimated from the parameters. Results Our model relaxes the bag-of-words assumption of the standard topic model in which the word order is neglected, and it captures a richer structure of text than does the standard topic model by considering unigrams and semantically associated bigrams simultaneously. The addition of semantic links improves the disambiguation accuracy without removing irrelevant contextual words and reduces the parameter space of massive skipping or consecutive bigrams. The link topic model achieves 98.42% disambiguation accuracy on 73,505 {MEDLINE} abstracts with respect to 21 three letter abbreviations and their 139 distinct long forms.

Keywords: Topic model
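The link-topic model of [315] is not reproduced here, but the disambiguation-by-topic idea can be illustrated with a much simpler baseline: fit a plain LDA on abstracts, average the topic mixtures of labelled training texts per candidate long form, and pick the expansion whose profile is closest to a new abstract. The sketch below (Python, scikit-learn) uses made-up toy texts and is an assumption-laden simplification, not the authors' model.

```python
# Simplified baseline, *not* the link-topic model: fit plain LDA on abstracts,
# average the topic mixtures of labelled training texts per long form, and pick
# the expansion whose profile is closest to a new abstract. Toy data throughout.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train_texts = [
    "computed tomography scan of the chest revealed a lesion",
    "ct imaging showed pulmonary nodules on the scan",
    "copper transport protein expression in liver cells",
    "cellular copper transport and metal homeostasis pathways",
]
train_longforms = ["computed tomography", "computed tomography",
                   "copper transport", "copper transport"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(train_texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
theta = lda.transform(X)                               # document-topic mixtures

# Average topic distribution per candidate long form.
profiles = {lf: theta[[i for i, l in enumerate(train_longforms) if l == lf]].mean(axis=0)
            for lf in set(train_longforms)}

def expand(abstract):
    """Return the long form whose topic profile is closest to the abstract's."""
    d = lda.transform(vec.transform([abstract]))[0]
    return min(profiles, key=lambda lf: np.linalg.norm(profiles[lf] - d))

print(expand("the scan showed a lesion on chest imaging"))
```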
[316] E.W.T. Ngai, Yong Hu, Y.H. Wong, Yijun Chen, and Xin Sun. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3):559 - 569, 2011. On quantitative methods for detection of financial fraud. [ bib | DOI | http ]
This paper presents a review of — and classification scheme for — the literature on the application of data mining techniques for the detection of financial fraud. Although financial fraud detection (FFD) is an emerging topic of great importance, a comprehensive literature review of the subject has yet to be carried out. This paper thus represents the first systematic, identifiable and comprehensive academic literature review of the data mining techniques that have been applied to FFD. 49 journal articles on the subject published between 1997 and 2008 were analyzed and classified into four categories of financial fraud (bank fraud, insurance fraud, securities and commodities fraud, and other related financial fraud) and six classes of data mining techniques (classification, regression, clustering, prediction, outlier detection, and visualization). The findings of this review clearly show that data mining techniques have been applied most extensively to the detection of insurance fraud, although corporate fraud and credit card fraud have also attracted a great deal of attention in recent years. In contrast, we find a distinct lack of research on mortgage fraud, money laundering, and securities and commodities fraud. The main data mining techniques used for {FFD} are logistic models, neural networks, the Bayesian belief network, and decision trees, all of which provide primary solutions to the problems inherent in the detection and classification of fraudulent data. This paper also addresses the gaps between {FFD} and the needs of the industry to encourage additional research on neglected topics, and concludes with several suggestions for further {FFD} research.

Keywords: Financial fraud
[317] João Ventura and Joaquim Silva. Mining concepts from texts. Procedia Computer Science, 9(0):27 - 36, 2012. Proceedings of the International Conference on Computational Science, {ICCS} 2012. [ bib | DOI | http ]
The extraction of multi-word relevant expressions has been an increasingly hot topic in the last few years. Relevant expressions are applicable in diverse areas such as Information Retrieval, document clustering, or classification and indexing of documents. However, the extraction of relevant single words, which represent much of the knowledge in texts, has remained a relatively dormant field. In this paper we present a statistical language-independent approach to extract concepts formed by relevant single and multi-word units. By achieving promising precision/recall values, it can be an alternative both to language dependent approaches and to extractors that deal exclusively with multi-words.

Keywords: Text Mining
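As a rough illustration of the statistical, language-independent extraction idea in [317], the sketch below scores bigram candidates by pointwise mutual information and single words by frequency on a toy word list. The paper's own relevance measures differ; this only conveys the flavour of cohesion-based scoring.

```python
# Minimal, language-independent sketch of scoring candidate expressions: PMI as
# a cohesion measure for bigrams, plain frequency for single words. These are
# not the paper's measures; the corpus is a toy example.
import math
from collections import Counter

corpus = ("the stock market fell while the stock market index "
          "recovered after market regulators intervened").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(w1, w2):
    """Pointwise mutual information of a bigram under unigram independence."""
    p12 = bigrams[(w1, w2)] / n_bi
    p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log(p12 / (p1 * p2))

# Rank multi-word candidates by cohesion, single words by frequency.
top_bigrams = sorted(bigrams, key=lambda b: pmi(*b), reverse=True)[:3]
top_words = [w for w, _ in unigrams.most_common(3)]
print(top_bigrams, top_words)
```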
[318] Chun-Wei Lin, Tzung-Pei Hong, and Wen-Hsiang Lu. An effective tree structure for mining high utility itemsets. Expert Systems with Applications, 38(6):7419 - 7424, 2011. [ bib | DOI | http ]
In the past, many algorithms were proposed to mine association rules, most of which were based on item frequency values. Considering that a customer may buy many copies of an item and that each item may have different profits, mining frequent patterns from a traditional database is not suitable for some real-world applications. Utility mining was thus proposed to consider costs, profits and other measures according to user preference. In this paper, the high utility pattern tree (HUP tree) is designed and the HUP-growth mining algorithm is proposed to derive high utility patterns effectively and efficiently. The proposed approach integrates the previous two-phase procedure for utility mining and the FP-tree concept to utilize the downward-closure property and generate a compressed tree structure. Experimental results also show that the proposed approach has a better performance than Liu et al.’s two-phase algorithm in execution time. Finally, the numbers of tree nodes generated from three different item ordering methods are also compared, with results showing that frequency ordering produces fewer tree nodes than the other two.

Keywords: Utility mining
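The HUP-tree and HUP-growth algorithm in [318] are not reproduced here, but the quantity-times-profit notion of itemset utility they are built to compute efficiently can be shown with a brute-force sketch on invented transactions and unit profits:

```python
# Brute-force illustration of the *high utility itemset* notion (utility =
# quantity x unit profit summed over transactions containing the itemset).
# The paper's HUP-tree / HUP-growth is far more efficient; data here are toy.
from itertools import combinations

profit = {"A": 3, "B": 10, "C": 1}                     # unit profits
transactions = [                                       # item -> purchased quantity
    {"A": 2, "B": 1},
    {"B": 2, "C": 4},
    {"A": 1, "B": 1, "C": 5},
]
min_utility = 20

def utility(itemset, tx):
    return sum(tx[i] * profit[i] for i in itemset)

items = sorted(profit)
high_utility = {}
for r in range(1, len(items) + 1):
    for itemset in combinations(items, r):
        u = sum(utility(itemset, tx) for tx in transactions
                if all(i in tx for i in itemset))
        if u >= min_utility:
            high_utility[itemset] = u

print(high_utility)
```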
[319] Shahab Dean Mohaghegh. Reservoir simulation and modeling based on artificial intelligence and data mining (ai&dm). Journal of Natural Gas Science and Engineering, 3(6):697 - 705, 2011. Artificial Intelligence and Data Mining. [ bib | DOI | http ]
In this paper a new class of reservoir models that are developed based on the pattern recognition technologies collectively known as Artificial Intelligence and Data Mining (AI&DM) is introduced. The workflows developed based on this new class of reservoir simulation and modeling tools break new ground in modeling fluid flow through porous media by providing a completely new and different angle on reservoir simulation and modeling. The philosophy behind this modeling approach and its major commonalities and differences with numerical and analytical models are explored and two different categories of such models are explained. Details of this technology are presented using examples of most recent applications to several prolific reservoirs in the Middle East and in the Gulf of Mexico. AI-based Reservoir Models can be developed for green or brown fields. Since these models are developed based on spatio-temporal databases that are specifically developed for this purpose, they require the existence of a basic numerical reservoir simulator for green fields, while they can be developed entirely from historical data for brown fields. The run-time of AI-based Reservoir Models that provide complete field responses is measured in seconds rather than hours and days (even for a multi-million grid block reservoir). Therefore, providing means for fast track reservoir analysis and AI-assisted history matching are intrinsic characteristics of these models. AI-based Reservoir Models can, in some cases, completely substitute numerical reservoir simulation models, work side by side but completely independent, or be integrated with them in order to increase their productivity. Advantages associated with AI-based Reservoir Models are short development time, low development cost, fast track analysis and practical capability to quantify the uncertainties associated with the static model. AI-based Reservoir Models include a novel design tool for comprehensive analysis of the full field and design of field development strategies to meet operational targets. They have an open data requirement architecture that can accommodate a wide variety of data, from pressure tests to seismic.

Keywords: Reservoir Simulation and Modeling
[320] Rabeah Al-Zaidy, Benjamin C.M. Fung, Amr M. Youssef, and Francis Fortin. Mining criminal networks from unstructured text documents. Digital Investigation, 8(3–4):147 - 160, 2012. [ bib | DOI | http ]
Digital data collected for forensics analysis often contain valuable information about the suspects’ social networks. However, most collected records are in the form of unstructured textual data, such as e-mails, chat messages, and text documents. An investigator often has to manually extract the useful information from the text and then enter the important pieces into a structured database for further investigation by using various criminal network analysis tools. Obviously, this information extraction process is tedious and error-prone. Moreover, the quality of the analysis varies by the experience and expertise of the investigator. In this paper, we propose a systematic method to discover criminal networks from a collection of text documents obtained from a suspect’s machine, extract useful information for investigation, and then visualize the suspect’s criminal network. Furthermore, we present a hypothesis generation approach to identify potential indirect relationships among the members in the identified networks. We evaluated the effectiveness and performance of the method on a real-life cybercrime case and some other datasets. The proposed method, together with the implemented software tool, has received positive feedback from the digital forensics team of a law enforcement unit in Canada.

Keywords: Forensic analysis
[321] Joris D’hondt, Paul-Armand Verhaegen, Joris Vertommen, Dirk Cattrysse, and Joost R. Duflou. Topic identification based on document coherence and spectral analysis. Information Sciences, 181(18):3783 - 3797, 2011. [ bib | DOI | http ]
In a world with vast information overload, well-optimized retrieval of relevant information has become increasingly important. Dividing large, multiple topic spanning documents into sets of coherent subdocuments facilitates the information retrieval process. This paper presents a novel technique to automatically subdivide a textual document into consistent components based on a coherence quantification function. This function is based on stem or term chains linking document entities, such as sentences or paragraphs, based on the reoccurrences of stems or terms. Applying this function on a document results in a coherence graph of the document linking its entities. Spectral graph partitioning techniques are used to divide this coherence graph into a number of subdocuments. A novel technique is introduced to obtain the most suitable number of subdocuments. These subdocuments are an aggregation of (not necessarily adjacent) entities. Performance tests are conducted in test environments based on standardized datasets to prove the algorithm’s capabilities. The relevance of these techniques for information retrieval and text mining is discussed.

Keywords: Topic identification
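A heavily simplified sketch of the pipeline in [321]: link sentences by shared terms into a coherence graph and partition the graph spectrally into subdocuments. The coherence function, chain construction and selection of the number of subdocuments in the paper are richer; the sentences below are toy data, and scikit-learn's SpectralClustering merely stands in for the paper's partitioning step.

```python
# Coherence-graph sketch: edge weight = number of shared terms between
# sentences; spectral partitioning then groups sentences into subdocuments.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import SpectralClustering

sentences = [
    "the reactor core temperature rose sharply",
    "operators cooled the reactor core with water",
    "sensors logged the reactor pressure readings",
    "the football match ended in a draw",
    "the home team dominated the football match",
    "fans left the football stadium after the match",
]

# Binary term-sentence matrix; shared-term counts give the affinity matrix.
X = CountVectorizer(stop_words="english", binary=True).fit_transform(sentences)
W = (X @ X.T).toarray().astype(float)
W += 0.01                      # small constant keeps the affinity graph connected
np.fill_diagonal(W, 0.0)

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(W)
for sentence, label in zip(sentences, labels):
    print(label, sentence)
```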
[322] Zhengxing Huang, Xudong Lu, and Huilong Duan. Mining association rules to support resource allocation in business process management. Expert Systems with Applications, 38(8):9483 - 9490, 2011. [ bib | DOI | http ]
Resource allocation is of great importance for business process management. In business process execution, a set of rules that specify resource allocation is always implied. Although many approaches have been offered to support resource allocation, they are not sufficient to derive interesting resource allocation rules which ensure that each activity is performed by a suitable resource. Hence, this paper introduces an association rule mining based approach to mine interesting resource allocation rules from an event log. The idea is to consider the ordered correlations between items in the event log, and then to present two efficient algorithms to mine truly “interesting” rules. The event log of the radiology CT-scan examination process provided by the Chinese Huzhou hospital is used to verify the proposed approach. The evaluation results show that the proposed approach not only extracts the rules more efficiently and much faster, but also discovers more important resource allocation rules.

Keywords: Association rules
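The two mining algorithms in [322] are not detailed in the abstract, but the basic bookkeeping behind activity-to-resource association rules (support and confidence over an event log) can be sketched as follows, on an invented toy log:

```python
# Minimal illustration of deriving "activity -> resource" rules with support
# and confidence from an event log. The paper mines richer, ordered
# correlations; log entries and thresholds below are made up.
from collections import Counter

event_log = [  # (case id, activity, resource)
    (1, "CT-scan", "radiologist_A"),
    (1, "report",  "radiologist_A"),
    (2, "CT-scan", "radiologist_A"),
    (3, "CT-scan", "radiologist_B"),
    (4, "report",  "radiologist_B"),
]

pair_count = Counter((act, res) for _, act, res in event_log)
act_count = Counter(act for _, act, _ in event_log)
n = len(event_log)

min_support, min_confidence = 0.2, 0.6
for (act, res), c in pair_count.items():
    support, confidence = c / n, c / act_count[act]
    if support >= min_support and confidence >= min_confidence:
        print(f"{act} -> {res}  (support={support:.2f}, confidence={confidence:.2f})")
```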
[323] Gülser Köksal, İnci Batmaz, and Murat Caner Testik. A review of data mining applications for quality improvement in manufacturing industry. Expert Systems with Applications, 38(10):13448 - 13467, 2011. [ bib | DOI | http ]
Many quality improvement (QI) programs including six sigma, design for six sigma, and kaizen require collection and analysis of data to solve quality problems. Due to advances in data collection systems and analysis tools, data mining (DM) has widely been applied for {QI} in manufacturing. Although a few review papers have recently been published to discuss {DM} applications in manufacturing, these only cover a small portion of the applications for specific {QI} problems (quality tasks). In this study, an extensive review covering the literature from 1997 to 2007 and several analyses on selected quality tasks are provided on {DM} applications in the manufacturing industry. The quality tasks considered are: product/process quality description, predicting quality, classification of quality, and parameter optimisation. The review provides a comprehensive analysis of the literature from various points of view: data handling practices, {DM} applications for each quality task and for each manufacturing industry, patterns in the use of {DM} methods, application results, and software used in the applications are analysed. Several summary tables and figures are also provided along with the discussion of the analyses and results. Finally, conclusions and future research directions are presented.

Keywords: Knowledge discovery in databases
[324] Shahria Movafaghi and Jack Bullock. Sentiment web mining architecture - shahriar movafaghi. Procedia - Social and Behavioral Sciences, 26(0):191 - 197, 2011. The 2nd Collaborative Innovation Networks Conference - {COINs2010}. [ bib | DOI | http ]
In this paper we discuss the architecture of a system under development whose purpose is to capture the sentiment of web users regarding any topic such as retail products, financial instruments (FI), or social issues like immigration. The first step is knowledge acquisition. A Sentiment Web Mining (SWM) system requires acquisition of knowledge from several sources on the web. Such knowledge may be found on blogs, social networks, email, or online news. A {SWM} system has customization and personalization capabilities. For our purposes, customization occurs when the {SWM} user can change his/her preferences to select specific sites to be used for data mining and evaluation. Personalization occurs when the system decides which sites should be used for data mining based on the user profile. The user profile dynamically changes depending on the type of user request from the system and the specific sites the user visits to verify the result of the {SWM} system. The second step is knowledge storage, which involves the creation of a database. Appropriate web sites will be indexed and tagged. Taxonomy is the hardest part of this step. In this paper we will demonstrate a unique way of tagging the knowledge obtained from the web. The third step is the knowledge analysis/data mining. A {SWM} system will use a series of off-the-shelf knowledge analysis/data mining tools including a {SWM} knowledge analysis/data mining engine which is based on web services technology. The types of questions addressed can be: 1) the volume of sentiment for a particular topic; 2) the intensity of sentiment (good or bad) for a particular topic; 3) the interrelationship between the writers of material written on the web, especially if the writer is anonymous; 4) who is/are the leader(s) of the sentiment? If the information is maliciously posted on the web the user may want to pursue it through legal means. The last step is dissemination of knowledge to the user(s). A {SWM} system uses third party visualization tools as well as web based user interfaces and reports that are written internally. The presentation component of a {SWM} system is decoupled from other components, namely, the process component, business rule component, and data access component for ease of maintainability.

Keywords: Web Mining
[325] Kyle H. Ambert and Aaron M. Cohen. Chapter six - text-mining and neuroscience. In Elissa J. Chesler and Melissa A. Haendel, editors, Bioinformatics of Behavior: Part 1, volume 103 of International Review of Neurobiology, pages 109 - 132. Academic Press, 2012. [ bib | DOI | http ]
The wealth and diversity of neuroscience research are inherent characteristics of the discipline that can give rise to some complications. As the field continues to expand, we generate a great deal of data about all aspects, and from multiple perspectives, of the brain, its chemistry, biology, and how these affect behavior. The vast majority of research scientists cannot afford to spend their time combing the literature to find every article related to their research, nor do they wish to spend time adjusting their neuroanatomical vocabulary to communicate with other subdomains in the neurosciences. As such, there has been a recent increase in the amount of informatics research devoted to developing digital resources for neuroscience research. Neuroinformatics is concerned with the development of computational tools to further our understanding of the brain and to make sense of the vast amount of information that neuroscientists generate (French & Pavlidis, 2007). Many of these tools are related to the use of textual data. Here, we review some of the recent developments for better using the vast amount of textual information generated in neuroscience research and publication and suggest several use cases that will demonstrate how bench neuroscientists can take advantage of the resources that are available.

Keywords: Neuroinformatics
[326] Yao-Te Wang and Anthony J.T. Lee. Mining web navigation patterns with a path traversal graph. Expert Systems with Applications, 38(6):7112 - 7122, 2011. [ bib | DOI | http ]
Understanding the navigational behaviour of website visitors is a significant factor of success in the emerging business models of electronic commerce and even mobile commerce. However, Web traversal patterns obtained by traditional Web usage mining approaches are ineffective for the content management of websites. They do not provide the big picture of the intentions of the visitors. The Web navigation patterns, termed throughout-surfing patterns (TSPs) as defined in this paper, are a superset of Web traversal patterns that effectively display the trends toward the next visited Web pages in a browsing session. {TSPs} are more expressive for understanding the purposes of website visitors. In this paper, we first introduce the concept of throughout-surfing patterns and then present an efficient method for mining the patterns. We propose a compact graph structure, termed a path traversal graph, to record information about the navigation paths of website visitors. The graph contains the frequent surfing paths that are required for mining TSPs. In addition, we devised a graph traverse algorithm based on the proposed graph structure to discover the TSPs. The experimental results show the proposed mining method is highly efficient to discover TSPs.

Keywords: Web log mining
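As a toy illustration of the path-traversal-graph idea in [326], the sketch below counts page-to-page transitions from browsing sessions, keeps edges above a support threshold, and enumerates the surfing paths they form. The paper's TSP mining algorithm is more involved; the sessions and threshold here are invented.

```python
# Build a transition-count graph from sessions, keep frequent edges, and walk
# them depth-first to list maximal frequent surfing paths. Toy data only.
from collections import Counter, defaultdict

sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "products", "cart"],
    ["home", "about"],
    ["home", "products", "checkout"],
]
min_count = 2

edges = Counter((a, b) for s in sessions for a, b in zip(s, s[1:]))
graph = defaultdict(list)
for (a, b), c in edges.items():
    if c >= min_count:
        graph[a].append(b)

def frequent_paths(node, path):
    """Depth-first enumeration of maximal paths over frequent edges."""
    nexts = [n for n in graph[node] if n not in path]
    if not nexts:
        yield path
    for n in nexts:
        yield from frequent_paths(n, path + [n])

for p in frequent_paths("home", ["home"]):
    print(" -> ".join(p))
```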
[327] Enrique García, Cristóbal Romero, Sebastián Ventura, and Carlos de Castro. A collaborative educational association rule mining tool. The Internet and Higher Education, 14(2):77 - 88, 2011. Web mining and higher education: Introduction to the special issue. [ bib | DOI | http ]
This paper describes a collaborative educational data mining tool based on association rule mining for the ongoing improvement of e-learning courses, allowing teachers with similar course profiles to share and score the discovered information. The mining tool is designed to be used by instructors who are not experts in data mining, so its internal operation has to be transparent to the user, and the instructor can focus on analysing the results and making decisions about how to improve the e-learning course. In this paper, a data mining tool is described in a tutorial way and some examples of rules discovered in an adaptive web-based course are shown and explained.

Keywords: Educational data mining tool
[328] Takayuki Taniya, Susumu Tanaka, Yumi Yamaguchi-Kabata, Hideki Hanaoka, Chisato Yamasaki, Harutoshi Maekawa, Roberto A. Barrero, Boris Lenhard, Milton W. Datta, Mary Shimoyama, Roger Bumgarner, Ranajit Chakraborty, Ian Hopkinson, Libin Jia, Winston Hide, Charles Auffray, Shinsei Minoshima, Tadashi Imanishi, and Takashi Gojobori. A prioritization analysis of disease association by data-mining of functional annotation of human genes. Genomics, 99(1):1 - 9, 2012. [ bib | DOI | http ]
Complex diseases result from contributions of multiple genes that act in concert through pathways. Here we present a method to prioritize novel candidates of disease-susceptibility genes depending on the biological similarities to the known disease-related genes. The extent of disease-susceptibility of a gene is prioritized by analyzing seven features of human genes captured in H-InvDB. Taking rheumatoid arthritis (RA) and prostate cancer (PC) as two examples, we evaluated the efficiency of our method. Highly scored genes obtained included {TNFSF12} and {OSM} as candidate disease genes for {RA} and PC, respectively. Subsequent characterization of these genes based upon an extensive literature survey reinforced the validity of these highly scored genes as possible disease-susceptibility genes. Our approach, Prioritization {ANalysis} of Disease Association (PANDA), is an efficient and cost-effective method to narrow down a large set of genes into smaller subsets that are most likely to be involved in the disease pathogenesis.

Keywords: Disease
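PANDA's scoring over seven H-InvDB feature classes is not spelled out in the abstract, but the underlying similarity-based prioritization idea can be sketched: represent genes by annotation feature vectors and rank candidates by similarity to the centroid of known disease genes. The gene names and features below are entirely made up, and cosine similarity is an assumption, not the paper's measure.

```python
# Generic similarity ranking of candidate genes against known disease genes.
# Hypothetical genes and binary annotation features; not PANDA's actual scoring.
import numpy as np

genes = {                                   # gene -> binary annotation vector
    "KNOWN1": np.array([1., 1., 1., 0.]),
    "KNOWN2": np.array([1., 0., 1., 0.]),
    "CAND_A": np.array([1., 1., 1., 1.]),
    "CAND_B": np.array([0., 0., 0., 1.]),
}
known = ["KNOWN1", "KNOWN2"]
centroid = np.mean([genes[g] for g in known], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

scores = {g: cosine(v, centroid) for g, v in genes.items() if g not in known}
for gene, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(gene, round(score, 3))
```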
[329] Stefan Lessmann, Marco Caserta, and Idel Montalvo Arango. Tuning metaheuristics: A data mining based approach for particle swarm optimization. Expert Systems with Applications, 38(10):12826 - 12838, 2011. [ bib | DOI | http ]
The paper is concerned with practices for tuning the parameters of metaheuristics. Settings such as the cooling factor in simulated annealing may greatly affect a metaheuristic’s efficiency as well as effectiveness in solving a given decision problem. However, procedures for organizing parameter calibration are scarce and commonly limited to particular metaheuristics. We argue that the parameter selection task can appropriately be addressed by means of a data mining based approach. In particular, a hybrid system is devised, which employs regression models to learn suitable parameter values from past moves of a metaheuristic in an online fashion. In order to identify a suitable regression method and, more generally, to demonstrate the feasibility of the proposed approach, a case study of particle swarm optimization is conducted. Empirical results suggest that characteristics of the decision problem as well as search history data indeed embody information that allows suitable parameter values to be determined, and that this type of information can successfully be extracted by means of nonlinear regression models.

Keywords: Metaheuristics
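A toy version of the data-mining-based tuning idea in [329]: a small particle swarm optimizer minimises the sphere function while a regression model, fitted online on (inertia weight, observed improvement) pairs, proposes the next inertia weight from a candidate grid. The objective, parameter range and regressor choice are assumptions; the paper's hybrid system is considerably more elaborate.

```python
# Toy PSO whose inertia weight is chosen online by a nonlinear regression model
# trained on the search history. Illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
dim, n_particles = 5, 20
pos = rng.uniform(-5, 5, (n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), (pos ** 2).sum(axis=1)
gbest = pbest[pbest_val.argmin()].copy()

history = []                 # (inertia weight, improvement of best objective)
w = 0.7                      # current inertia weight
candidates = np.linspace(0.3, 0.9, 13)

for it in range(60):
    best_before = pbest_val.min()
    r1, r2 = rng.random((2, n_particles, dim))
    vel = w * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos += vel
    val = (pos ** 2).sum(axis=1)
    improved = val < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], val[improved]
    gbest = pbest[pbest_val.argmin()].copy()
    history.append((w, best_before - pbest_val.min()))

    if it >= 10:             # after a warm-up, let the regressor pick w
        X = np.array([[h[0]] for h in history])
        y = np.array([h[1] for h in history])
        model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
        w = candidates[model.predict(candidates.reshape(-1, 1)).argmax()]
    else:
        w = rng.uniform(0.3, 0.9)

print("best objective:", pbest_val.min())
```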
[330] Eyke Hüllermeier. Fuzzy sets in machine learning and data mining. Applied Soft Computing, 11(2):1493 - 1505, 2011. The Impact of Soft Computing for the Progress of Artificial Intelligence. [ bib | DOI | http ]
Machine learning, data mining, and several related research areas are concerned with methods for the automated induction of models and the extraction of interesting patterns from empirical data. Automated knowledge acquisition of that kind has been an essential aspect of artificial intelligence for a long time and has more recently also attracted considerable attention in the fuzzy sets community. This paper briefly reviews some typical applications and highlights potential contributions that fuzzy set theory can make to machine learning, data mining, and related fields. In this connection, some advantages of fuzzy methods for representing and mining vague patterns in data are especially emphasized.

Keywords: Fuzzy sets
[331] Michael Raedel, Andrea Hartmann, Steffen Bohm, and Michael H. Walter. Three-year outcomes of root canal treatment: Mining an insurance database. Journal of Dentistry, 43(4):412 - 417, 2015. [ bib | DOI | http ]
There is doubt whether success rates of root canal treatments reported from clinical trials are achievable outside of standardized study populations. The aim of this study was to analyse the outcome of a large number of root canal treatments conducted in general practice. Methods The data was collected from the digital database of a major German national health insurance company. All teeth with complete treatment data were included. Only patients who had been insurance members for the whole 3-year period from 2010 to 2012 were eligible. Kaplan–Meier survival analyses were conducted based on completed root canal treatments. Target events were re-interventions as (1) retreatment of the root canal treatment, (2) apical root resection (apicoectomy) and (3) extraction. The influences of vitality status and root numbers on survival were tested with the log-rank test. Results A total of 556,067 root canal treatments were included. The cumulative overall survival rate for all target events combined was 84.3% for 3 years. The survival rate for nonvital teeth (82.6%) was significantly lower than for vital teeth (85.6%; p < 0.001). The survival rate for single rooted teeth (83.4%) was significantly lower than for multi-rooted teeth (85.5%; p < 0.001). The most frequent target event was extraction followed by apical root resection and retreatment. Conclusions Based on these 3-year outcomes, root canal treatment is considered a reliable treatment in practice routine under the conditions of the German national health insurance system. Clinical significance Root canal treatment can be considered as a reliable treatment option suitable to salvage most of the affected teeth. This statement applies to treatments that in the vast majority of cases were delivered by general practitioners under the terms and conditions of a nationwide health insurance system.

Keywords: Health services research
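The kind of analysis described in [331] can be sketched with the lifelines package: Kaplan-Meier survival of treated teeth and a log-rank comparison between vital and non-vital teeth. The follow-up records below are made up (durations in months, event = any re-intervention); this is an illustration of the analysis style, not the paper's data or code.

```python
# Kaplan-Meier survival and log-rank test on toy follow-up records.
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

df = pd.DataFrame({
    "months": [36, 12, 36, 24, 36, 6, 36, 30, 36, 18],
    "event":  [0,  1,  0,  1,  0,  1, 0,  1,  0,  1],   # 1 = re-intervention
    "vital":  [1,  0,  1,  0,  1,  0, 1,  1,  0,  0],
})

kmf = KaplanMeierFitter()
for group, label in [(1, "vital"), (0, "non-vital")]:
    sub = df[df["vital"] == group]
    kmf.fit(sub["months"], event_observed=sub["event"], label=label)
    print(label, "3-year survival ~", round(float(kmf.survival_function_.iloc[-1, 0]), 2))

res = logrank_test(df[df.vital == 1]["months"], df[df.vital == 0]["months"],
                   event_observed_A=df[df.vital == 1]["event"],
                   event_observed_B=df[df.vital == 0]["event"])
print("log-rank p-value:", round(res.p_value, 3))
```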
[332] Álvaro Machado Dias, Carlos Gustavo Mansur, Martin Myczkowski, and Marco Marcolin. Whole field tendencies in transcranial magnetic stimulation: A systematic review with data and text mining. Asian Journal of Psychiatry, 4(2):107 - 112, 2011. [ bib | DOI | http ]
Background Transcranial magnetic stimulation (TMS) has played an important role in the fields of psychiatry, neurology and neuroscience, since its emergence in the mid-1980s; and several high quality reviews have been produced since then. Most high quality reviews serve as powerful tools in the evaluation of predefined tendencies, but they cannot actually uncover new trends within the literature. However, special statistical procedures to ‘mine’ the literature have been developed which aid in achieving such a goal. Objectives This paper aims to uncover patterns within the literature on {TMS} as a whole, as well as specific trends in the recent literature on {TMS} for the treatment of depression. Methods Data mining and text mining. Results Currently there are 7299 publications, which can be clustered in four essential themes. Considering the frequency of the core psychiatric concepts within the indexed literature, the main results are: depression is present in 13.5% of the publications; Parkinson's disease in 2.94%; schizophrenia in 2.76%; bipolar disorder in 0.158%; and anxiety disorder in 0.142% of all the publications indexed in PubMed. Several other perspectives are discussed in the article.

Keywords: {TMS}
[333] Wei Chen and Parvathi Chundi. Extracting hot spots of topics from time-stamped documents. Data & Knowledge Engineering, 70(7):642 - 660, 2011. [ bib | DOI | http ]
Identifying time periods with a burst of activities related to a topic has been an important problem in analyzing time-stamped documents. In this paper, we propose an approach to extract a hot spot of a given topic in a time-stamped document set. Topics can be basic, containing a simple list of keywords, or complex. Logical relationships such as and, or, and not are used to build complex topics from basic topics. A concept of presence measure of a topic based on fuzzy set theory is introduced to compute the amount of information related to the topic in the document set. Each interval in the time period of the document set is associated with a numeric value which we call the discrepancy score. A high discrepancy score indicates that the documents in the time interval are more focused on the topic than those outside of the time interval. A hot spot of a given topic is defined as a time interval with the highest discrepancy score. We first describe a naive implementation for extracting hot spots. We then construct an algorithm called {EHE} (Efficient Hot Spot Extraction) using several efficient strategies to improve performance. We also introduce the notion of a topic {DAG} to facilitate an efficient computation of presence measures of complex topics. The proposed approach is illustrated by several experiments on a subset of the TDT-Pilot Corpus and {DBLP} conference data set. The experiments show that the proposed {EHE} algorithm significantly outperforms the naive one, and the extracted hot spots of given topics are meaningful.

Keywords: Scan statistic
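A naive sketch of hot-spot extraction in the spirit of [333]: a per-period presence measure for a topic (here simply the fraction of documents mentioning a keyword) and a discrepancy score for each interval (mean presence inside minus mean presence outside), maximised by exhaustive search. The paper's fuzzy presence measure, topic DAGs and the efficient EHE algorithm go well beyond this; the documents below are toy data.

```python
# Exhaustive search for the interval with the highest discrepancy score.
docs_per_month = {           # month -> list of documents (toy data)
    1: ["flood warning issued", "local election results"],
    2: ["flood damages roads", "flood relief effort", "sports roundup"],
    3: ["flood waters recede", "market update"],
    4: ["market update", "weather calm"],
}
topic = "flood"

months = sorted(docs_per_month)
presence = [sum(topic in d for d in docs_per_month[m]) / len(docs_per_month[m])
            for m in months]

best, best_score = None, float("-inf")
for i in range(len(months)):
    for j in range(i, len(months)):
        inside = presence[i:j + 1]
        outside = presence[:i] + presence[j + 1:]
        score = (sum(inside) / len(inside)
                 - (sum(outside) / len(outside) if outside else 0.0))
        if score > best_score:
            best, best_score = (months[i], months[j]), score

print("hot spot:", best, "discrepancy:", round(best_score, 2))
```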
[334] Catarina Dudas, Marcus Frantzén, and Amos H.C. Ng. A synergy of multi-objective optimization and data mining for the analysis of a flexible flow shop. Robotics and Computer-Integrated Manufacturing, 27(4):687 - 695, 2011. Conference papers of Flexible Automation and Intelligent Manufacturing Intelligent manufacturing and services. [ bib | DOI | http ]
A method for analyzing production systems by applying multi-objective optimization and data mining techniques on discrete-event simulation models, the so-called Simulation-based Innovization (SBI), is presented in this paper. The aim of the {SBI} analysis is to reveal insight into the parameters that affect the performance measures as well as to gain deeper understanding of the problem, through post-optimality analysis of the solutions acquired from multi-objective optimization. This paper provides empirical results from an industrial case study, carried out on an automotive machining line, in order to explain the {SBI} procedure. The {SBI} method has been found to be particularly suitable in this case study as the three objectives under study, namely total tardiness, makespan and average work-in-process, are in conflict with each other. Depending on the system load of the line, different decision variables have been found to be influential. How the {SBI} method is used to find important patterns in the explored solution set and how it can be valuable in supporting decision making to improve the scheduling under different system loadings in the machining line are addressed.

Keywords: Data mining
[335] Yi Peng, Yong Zhang, Yu Tang, and Shiming Li. An incident information management framework based on data integration, data mining, and multi-criteria decision making. Decision Support Systems, 51(2):316 - 327, 2011. Multiple Criteria Decision Making and Decision Support Systems. [ bib | DOI | http ]
An effective incident information management system needs to deal with several challenges. It must support heterogeneous distributed incident data, allow decision makers (DMs) to detect anomalies and extract useful knowledge, assist {DMs} in evaluating the risks and selecting an appropriate alternative during an incident, and provide differentiated services to satisfy the requirements of different incident management phases. To address these challenges, this paper proposes an incident information management framework that consists of three major components. The first component is a high-level data integration module in which heterogeneous data sources are integrated and presented in a uniform format. The second component is a data mining module that uses data mining methods to identify useful patterns and presents a process to provide differentiated services for pre-incident and post-incident information management. The third component is a multi-criteria decision-making (MCDM) module that utilizes {MCDM} methods to assess the current situation, find the satisfactory solutions, and take appropriate responses in a timely manner. To validate the proposed framework, this paper conducts a case study on agrometeorological disasters that occurred in China between 1997 and 2001. The case study demonstrates that the combination of data mining and {MCDM} methods can provide objective and comprehensive assessments of incident risks.

Keywords: Incident information management
[336] Alan L. Porter and Nils C. Newman. Mining external r&d. Technovation, 31(4):171 - 176, 2011. Managing Technology. [ bib | DOI | http ]
Open Innovation presses the case for timely and thorough intelligence concerning research and development activities conducted outside one’s organization. To take advantage of this wealth of R&D, one needs to establish a systematic “tech mining” process. We propose a 5-stage framework that extends literature review into research profiling and pattern recognition to answer posed technology management questions. Ultimately one can even discover new knowledge by screening research databases. Once one determines the value in mining external R&D, tough issues remain to be overcome. Technology management has developed a culture that relies more on intuition than on evidence. Changing that culture and implementing effective technical intelligence capabilities is worth the effort. P&G's reported gains in innovation call attention to the huge payoff potential.

Keywords: Tech mining
[337] Daniel E. O'Leary. Blog mining-review and extensions: “from each according to his opinion”. Decision Support Systems, 51(4):821 - 830, 2011. Recent Advances in Data, Text, and Media Mining & Information Issues in Supply Chain and in Service System Design. [ bib | DOI | http ]
Blogs provide a type of website that contains information and personal opinions of the individual authors. The purpose of this paper is to review some of the literature aimed at gathering opinion, sentiment and information from blogs. This paper also extends the previous literature in a number of directions: broadening the use of knowledge from tags on blogs, identifying the need for domain-specific terms to capture a richer understanding of the mood of a blog, and finding a relationship between information in message boards and blogs. The relationship between blog chatter and sales, and between blogs and public image, are also examined.

Keywords: Blogs
[338] Pierre Cardol. Mitochondrial nadh:ubiquinone oxidoreductase (complex i) in eukaryotes: A highly conserved subunit composition highlighted by mining of protein databases. Biochimica et Biophysica Acta (BBA) - Bioenergetics, 1807(11):1390 - 1397, 2011. [ bib | DOI | http ]
Complex I (NADH:ubiquinone oxidoreductase) is the largest enzyme of the mitochondrial respiratory chain. Compared to its bacterial counterpart, which encompasses 14–17 subunits, mitochondrial complex I has almost tripled its subunit composition during the evolution of eukaryotes by recruitment of so-called accessory subunits, some of them being specific to distinct evolutionary lineages. The increasing availability of numerous broadly sampled eukaryotic genomes now enables the reconstruction of the evolutionary history of this large protein complex. Here, a combination of profile-based sequence comparisons and analyses of basic structural properties at the protein level made it possible to pinpoint homology relationships between complex I subunits from fungi, mammals or green plants previously identified as subunits. In addition, homologs of at least 40 mammalian complex I subunits are present in representatives of all major eukaryote assemblages, half of them having not been investigated so far (Excavates, Chromalveolates, Amoebozoa). This analysis revealed that complex I was subject to a phenomenal increase in size that predated the diversification of extant eukaryotes, followed by very few lineage-specific additions/losses of subunits. The implications of this subunit conservation for studies of complex I are discussed.

Keywords: Mitochondrial NADH:ubiquinone oxidoreductase
[339] Clement Jonquet, Paea LePendu, Sean Falconer, Adrien Coulet, Natalya F. Noy, Mark A. Musen, and Nigam H. Shah. {NCBO} resource index: Ontology-based search and mining of biomedical resources. Web Semantics: Science, Services and Agents on the World Wide Web, 9(3):316 - 324, 2011. Semantic Web Dynamics Semantic Web Challenge, 2010. [ bib | DOI | http ]
The volume of publicly available data in biomedicine is constantly increasing. However, these data are stored in different formats and on different platforms. Integrating these data will enable us to facilitate the pace of medical discoveries by providing scientists with a unified view of this diverse information. Under the auspices of the National Center for Biomedical Ontology (NCBO), we have developed the Resource Index – a growing, large-scale ontology-based index of more than twenty heterogeneous biomedical resources. The resources come from a variety of repositories maintained by organizations from around the world. We use a set of over 200 publicly available ontologies contributed by researchers in various domains to annotate the elements in these resources. We use the semantics that the ontologies encode, such as different properties of classes, the class hierarchies, and the mappings between ontologies, in order to improve the search experience for the Resource Index user. Our user interface enables scientists to search the multiple resources quickly and efficiently using domain terms, without even being aware that there is semantics “under the hood.”

Keywords: Ontology-based indexing
[340] Ling Chen, Mingqi Lv, Qian Ye, Gencai Chen, and John Woodward. A personal route prediction system based on trajectory data mining. Information Sciences, 181(7):1264 - 1284, 2011. [ bib | DOI | http ]
This paper presents a system where the personal route of a user is predicted using a probabilistic model built from the historical trajectory data. Route patterns are extracted from personal trajectory data using a novel mining algorithm, Continuous Route Pattern Mining (CRPM), which can tolerate different kinds of disturbance in trajectory data. Furthermore, a client–server architecture is employed which has the dual purpose of guaranteeing the privacy of personal data and greatly reducing the computational load on mobile devices. An evaluation using a corpus of trajectory data from 17 people demonstrates that {CRPM} can extract longer route patterns than current methods. Moreover, the average correct rate of one step prediction of our system is greater than 71%, and the average Levenshtein distance of continuous route prediction of our system is about 30% shorter than that of the Markov model based method.

Keywords: Data mining
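CRPM itself is not reproduced here, but the kind of baseline that [340] compares against, a first-order Markov model over visited locations, is easy to sketch on invented trips:

```python
# Next-location prediction with a first-order Markov model estimated from
# historical trips. Toy data; CRPM's continuous route patterns are richer.
from collections import Counter, defaultdict

trips = [                                      # sequences of visited cells/roads
    ["home", "main_st", "highway", "office"],
    ["home", "main_st", "park"],
    ["home", "main_st", "highway", "office"],
    ["office", "highway", "main_st", "home"],
]

transitions = defaultdict(Counter)
for trip in trips:
    for a, b in zip(trip, trip[1:]):
        transitions[a][b] += 1

def predict_next(location):
    """Most likely next location given the current one, with its probability."""
    counts = transitions[location]
    if not counts:
        return None, 0.0
    nxt, c = counts.most_common(1)[0]
    return nxt, c / sum(counts.values())

print(predict_next("main_st"))   # -> ('highway', 0.5) on this toy data
```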
[341] Angel Cruz-Roa, Juan C. Caicedo, and Fabio A. González. Visual pattern mining in histology image collections using bag of features. Artificial Intelligence in Medicine, 52(2):91 - 106, 2011. Artificial Intelligence in Medicine {AIME} 2009. [ bib | DOI | http ]
The paper addresses the problem of finding visual patterns in histology image collections. In particular, it proposes a method for correlating basic visual patterns with high-level concepts combining an appropriate image collection representation with state-of-the-art machine learning techniques. Methodology The proposed method starts by representing the visual content of the collection using a bag-of-features strategy. Then, two main visual mining tasks are performed: finding associations between visual-patterns and high-level concepts, and performing automatic image annotation. Associations are found using minimum-redundancy-maximum-relevance feature selection and co-clustering analysis. Annotation is done by applying a support-vector-machine classifier. Additionally, the proposed method includes an interpretation mechanism that associates concept annotations with corresponding image regions. The method was evaluated in two data sets: one comprising histology images from the different four fundamental tissues, and the other composed of histopathology images used for cancer diagnosis. Different visual-word representations and codebook sizes were tested. The performance in both concept association and image annotation tasks was qualitatively and quantitatively evaluated. Results The results show that the method is able to find highly discriminative visual features and to associate them to high-level concepts. In the annotation task the method showed a competitive performance: an increase of 21% in f-measure with respect to the baseline in the histopathology data set, and an increase of 47% in the histology data set. Conclusions The experimental evidence suggests that the bag-of-features representation is a good alternative to represent visual content in histology images. The proposed method exploits this representation to perform visual pattern mining from a wider perspective where the focus is the image collection as a whole, rather than individual images.

Keywords: Collection-based image analysis
[342] Jonas Poelmans, Marc M. Van Hulle, Stijn Viaene, Paul Elzinga, and Guido Dedene. Text mining with emergent self organizing maps and multi-dimensional scaling: A comparative study on domestic violence. Applied Soft Computing, 11(4):3870 - 3876, 2011. [ bib | DOI | http ]
In this paper we compare the usability of {ESOM} and {MDS} as text exploration instruments in police investigations. We combine them with traditional classification instruments such as the {SVM} and Naïve Bayes. We perform a case of real-life data mining using a dataset consisting of police reports describing a wide range of violent incidents that occurred during the year 2007 in the Amsterdam-Amstelland police region (The Netherlands). We compare the possibilities offered by the {ESOM} and {MDS} for iteratively enriching our feature set, discovering confusing situations, faulty case labelings and significantly improving the classification accuracy. The results of our research are currently operational in the Amsterdam-Amstelland police region for upgrading the employed domestic violence definition, for improving the training of police officers and for developing a highly accurate and comprehensible case triage model.

Keywords: Emergent self organizing map (ESOM)
[343] Tak chung Fu. A review on time series data mining. Engineering Applications of Artificial Intelligence, 24(1):164 - 181, 2011. [ bib | DOI | http ]
Time series is an important class of temporal data objects and can easily be obtained from scientific and financial applications. A time series is a collection of observations made chronologically. The nature of time series data includes large data size, high dimensionality and the need for continuous updating. Moreover, time series data, which is characterized by its numerical and continuous nature, is always considered as a whole instead of as individual numerical fields. The increasing use of time series data has initiated a great deal of research and development attempts in the field of data mining. The abundance of research on time series data mining in the last decade could hamper the entry of interested researchers, due to its complexity. In this paper, a comprehensive review of the existing time series data mining research is given. The work is generally categorized into representation and indexing, similarity measure, segmentation, visualization and mining. Moreover, state-of-the-art research issues are also highlighted. The primary objective of this paper is to serve as a glossary for interested researchers to have an overall picture of the current time series data mining development and identify their potential research directions for further investigation.

Keywords: Time series data mining
[344] David Laurence. Establishing a sustainable mining operation: an overview. Journal of Cleaner Production, 19(2–3):278 - 284, 2011. [ bib | DOI | http ]
In a review of the literature on sustainability in mining, it was found that there is limited guidance for mine operators to put sustainability frameworks and theory into action on the ground. This paper argues that operators can improve the sustainability of their mine sites by ensuring that leading practices are implemented in five areas. In addition to the widely-accepted dimensions of Environment, Economic and Community, Safety and Resource Efficiency must be addressed. The need for highlighting these additional elements is demonstrated in an analysis of over one thousand unplanned or prematurely closed mines over the past 30 years.

Keywords: Sustainable mining
[345] Duc-Thuan Vo and Cheol-Young Ock. Learning to classify short text from scientific documents using topic models with various types of knowledge. Expert Systems with Applications, 42(3):1684 - 1698, 2015. [ bib | DOI | http ]
Classification of short text is challenging due to data sparseness, which is a typical characteristic of short text. In this paper, we propose methods for enhancing features using topic models, which make short text seem less sparse and more topic-oriented for classification. We exploited topic model analysis based on Latent Dirichlet Allocation on enriched datasets, and then we presented new methods for enhancing features by combining external texts from topic models, making documents more effective for classification. In experiments, we used the titles of scientific articles as short text documents and enriched these documents using topic models trained on various types of universal datasets, showing that our approach performs efficiently.

Keywords: Data sparseness
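One way to picture the feature-enhancement idea in [345]: infer a dominant topic for each short title with LDA trained on a background corpus, append that topic's top words, and classify the enriched text. The enrichment scheme, classifier and all data below are assumptions for illustration, not the paper's actual setup.

```python
# Toy sketch: LDA-based enrichment of short titles before classification.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

background = [
    "neural networks deep learning image recognition",
    "convolutional networks learning visual features",
    "protein folding molecular structure prediction",
    "gene expression molecular biology sequencing",
]
titles = ["deep image models", "gene sequencing study",
          "visual recognition nets", "protein structure analysis"]
labels = ["cs", "bio", "cs", "bio"]

cv = CountVectorizer()
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(cv.fit_transform(background))
vocab = np.array(cv.get_feature_names_out())

def enrich(text, n_words=3):
    """Append the top words of the text's dominant background topic."""
    topic = lda.transform(cv.transform([text]))[0].argmax()
    top = vocab[lda.components_[topic].argsort()[::-1][:n_words]]
    return text + " " + " ".join(top)

enriched = [enrich(t) for t in titles]
tfidf = TfidfVectorizer()
clf = LogisticRegression().fit(tfidf.fit_transform(enriched), labels)

new_title = enrich("molecular folding experiments")
print(new_title, "->", clf.predict(tfidf.transform([new_title]))[0])
```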
[346] Muamer N. Mohammad, Norrozila Sulaiman, and Osama Abdulkarim Muhsin. A novel intrusion detection system by using intelligent data mining in weka environment. Procedia Computer Science, 3(0):1237 - 1242, 2011. World Conference on Information Technology. [ bib | DOI | http ]
Nowadays, the use of intelligent data mining approaches to predict intrusions in local area networks has been increasing rapidly. In this paper, an improved approach for Intrusion Detection Systems (IDS) based on combining data mining and expert systems is presented and implemented in WEKA. The taxonomy consists of a classification of the detection principle as well as certain {WEKA} aspects of the intrusion detection system such as open-source data mining. Combining these methods may give better performance of {IDS} systems and make the detection more effective. The evaluation of the new design produced better results in terms of detection efficiency and false alarm rate compared with existing approaches. This presents useful information for intrusion detection.

Keywords: Data mining
[347] George Tzanis, Ioannis Kavakiotis, and Ioannis Vlahavas. Polya-iep: A data mining method for the effective prediction of polyadenylation sites. Expert Systems with Applications, 38(10):12398 - 12408, 2011. [ bib | DOI | http ]
This paper presents a study on polyadenylation site prediction, which is a very important problem in bioinformatics and medicine, promising to give a lot of answers especially in cancer research. We describe a method, called PolyA-iEP, that we developed for predicting polyadenylation sites and we present a systematic study of the problem of recognizing mRNA 3′ ends which contain a polyadenylation site using the proposed method. PolyA-iEP is a modular system consisting of two main components that both contribute substantially to the descriptive and predictive potential of the system. In specific, PolyA-iEP exploits the advantages of emerging patterns, namely high understandability and discriminating power and the strength of a distance-based scoring method that we propose. The extracted emerging patterns may span across many elements around the polyadenylation site and can provide novel and interesting biological insights. The outputs of these two components are finally combined by a classifier in a highly effective framework, which in our setup reaches 93.7% of sensitivity and 88.2% of specificity. PolyA-iEP can be parameterized and used for both descriptive and predictive analysis. We have experimented with Arabidopsis thaliana sequences for evaluating our method and we have drawn important conclusions.

Keywords: Data mining
[348] Yunfei Yin, Guanghong Gong, and Liang Han. Experimental study on fighters behaviors mining. Expert Systems with Applications, 38(5):5737 - 5747, 2011. [ bib | DOI | http ]
Effective prediction of fighters’ behaviors is crucial for air combat as well as for many other game fields. In this paper, we present three patterns to predict the behaviors of fighters: the ActionStreams pattern, the Owner_Actions pattern and the Time_Owner_Actions pattern, where: (1) the ActionStreams pattern is a coarse-grained description of a fighter’s behaviors using the action identifier alone, without distinguishing the time or the executor/owner; (2) the Owner_Actions pattern is a finer-grained description of a fighter’s behaviors using the action identifier and the executor, without distinguishing the time; and (3) the Time_Owner_Actions pattern encapsulates the action identifier, the time, and the executor. Based on such fighters’ behavior patterns, we explore the data structures used for storage and the properties used for mining; further, by designing and implementing the relevant mining/processing algorithms and systems, we have discovered some experience patterns of the fighters’ behaviors and have conducted valid predictions of the fighters’ behaviors. We also present experimental results conducted on a simulation platform for air-to-air combat. The results show that our method is effective.

Keywords: Experimental study
[349] Guangbing Yang, Dunwei Wen, Kinshuk, Nian-Shing Chen, and Erkki Sutinen. A novel contextual topic model for multi-document summarization. Expert Systems with Applications, 42(3):1340 - 1352, 2015. [ bib | DOI | http ]
Information overload becomes a serious problem in the digital age. It negatively impacts understanding of useful information. How to alleviate this problem is the main concern of research on natural language processing, especially multi-document summarization. With the aim of seeking a new method to help justify the importance of similar sentences in multi-document summarizations, this study proposes a novel approach based on recent hierarchical Bayesian topic models. The proposed model incorporates the concepts of n-grams into hierarchically latent topics to capture the word dependencies that appear in the local context of a word. The quantitative and qualitative evaluation results show that this model has outperformed both hLDA and {LDA} in document modeling. In addition, the experimental results in practice demonstrate that our summarization system implementing this model can significantly improve the performance and make it comparable to the state-of-the-art summarization systems.

Keywords: Multi-document summarization
[350] Ruibin Geng, Indranil Bose, and Xi Chen. Prediction of financial distress: An empirical study of listed chinese companies using data mining. European Journal of Operational Research, 241(1):236 - 247, 2015. [ bib | DOI | http ]
The deterioration in profitability of listed companies not only threatens the interests of the enterprise and internal staff, but also makes investors face significant financial loss. It is important to establish an effective early warning system for prediction of financial crisis for better corporate governance. This paper studies the phenomenon of financial distress for 107 Chinese companies that received the label ‘special treatment’ from 2001 to 2008 by the Shanghai Stock Exchange and the Shenzhen Stock Exchange. We use data mining techniques to build financial distress warning models based on 31 financial indicators and three different time windows by comparing these 107 firms to a control group of firms. We observe that the performance of neural networks is more accurate than other classifiers, such as decision trees and support vector machines, as well as an ensemble of multiple classifiers combined using majority voting. An important contribution of the paper is to discover that financial indicators, such as net profit margin of total assets, return on total assets, earnings per share, and cash flow per share, play an important role in prediction of deterioration in profitability. This paper provides a suitable method for prediction of financial distress for listed companies in China.

Keywords: Chinese companies
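A rough, synthetic-data sketch of the classifier comparison described in [350]: cross-validated accuracy of a neural network, a decision tree and an SVM on financial-indicator style features. The indicators, sample sizes and class balance below are stand-ins, not the paper's data.

```python
# Compare a neural network, a decision tree and an SVM with cross-validation
# on synthetic stand-ins for financial distress indicators.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Synthetic "financial indicators" for distressed (1) vs healthy (0) firms.
X, y = make_classification(n_samples=400, n_features=31, n_informative=8,
                           weights=[0.7, 0.3], random_state=0)

models = {
    "neural network": make_pipeline(StandardScaler(),
                                    MLPClassifier(hidden_layer_sizes=(16,),
                                                  max_iter=2000, random_state=0)),
    "decision tree":  DecisionTreeClassifier(random_state=0),
    "svm":            make_pipeline(StandardScaler(), SVC()),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:14s} accuracy = {scores.mean():.3f}")
```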
[351] Hsin-Chang Yang, Han-Wei Hsiao, and Chung-Hong Lee. Multilingual document mining and navigation using self-organizing maps. Information Processing & Management, 47(5):647 - 666, 2011. Managing and Mining Multilingual Documents. [ bib | DOI | http ]
One major approach to information finding on the {WWW} is to navigate through Web directories and browse them until the goal pages are found. However, such directories are generally constructed manually and may have the disadvantages of narrow coverage and inconsistency. Besides, most existing directories provide only monolingual hierarchies that organize Web pages in terms that a user may not be familiar with. In this work, we propose an approach that automatically arranges multilingual Web pages into a multilingual Web directory to break the language barriers in Web navigation. In this approach, a self-organizing map is constructed to train each set of monolingual Web pages and obtain two feature maps, which reveal the relationships among Web pages and thematic keywords, respectively, for such a language. We then apply a hierarchy generation process on these maps to obtain the monolingual hierarchy for these Web pages. A hierarchy alignment method is then applied on these monolingual hierarchies to discover the associations between nodes in different hierarchies. Finally, a multilingual Web directory is constructed according to such associations. We applied the proposed approach on a set of Web pages and obtained interesting results that demonstrate the feasibility of our method for multilingual Web navigation.

Keywords: Multilingual Web page navigation
[352] Yanwei Pang, Qiang Hao, Yuan Yuan, Tanji Hu, Rui Cai, and Lei Zhang. Summarizing tourist destinations by mining user-generated travelogues and photos. Computer Vision and Image Understanding, 115(3):352 - 363, 2011. Special issue on Feature-Oriented Image and Video Computing for Extracting Contexts and Semantics. [ bib | DOI | http ]
Automatically summarizing tourist destinations with both textual and visual descriptions is highly desired for online services such as travel planning, to facilitate users to understand the local characteristics of tourist destinations. Travelers are contributing a great deal of user-generated travelogues and photos on the Web, which contain abundant travel-related information and cover various aspects (e.g., landmarks, styles, activities) of most locations in the world. To leverage the collective knowledge of travelers for destination summarization, in this paper we propose a framework which discovers location-representative tags from travelogues and then select relevant and representative photos to visualize these tags. The learnt tags and selected photos are finally organized appropriately to provide an informative summary which describes a given destination both textually and visually. Experimental results based on a large collection of travelogues and photos show promising results on destination summarization.

Keywords: Travelogue mining
[353] Dirk Thorleuchter, Dirk Van den Poel, and Anita Prinzie. Mining ideas from textual information. Expert Systems with Applications, 37(10):7182 - 7188, 2010. [ bib | DOI | http ]
This approach introduces idea mining as a process of extracting new and useful ideas from unstructured text. We use an idea definition from the philosophy of technology and we focus on ideas that can be used to solve technological problems. The rationale for the idea mining approach is taken from psychology and cognitive science and follows how people create ideas. To realize the processing, we use methods from text mining and text classification (tokenization, term filtering methods, Euclidean distance measure, etc.) and combine them with a new heuristic measure for mining ideas. As a result, the idea mining approach automatically extracts new and useful ideas from a user-given text. We present these problem solution ideas in a comprehensible way to support users in problem solving. This approach is evaluated with patent data and is realized as a web-based application, named ‘Technological Idea Miner’, that can be used for further testing and evaluation.

Keywords: Idea mining
[354] W.M.P. van der Aalst, M.H. Schonenberg, and M. Song. Time prediction based on process mining. Information Systems, 36(2):450 - 475, 2011. Special Issue: Semantic Integration of Data, Multimedia, and Services. [ bib | DOI | http ]
Process mining allows for the automated discovery of process models from event logs. These models provide insights and enable various types of model-based analysis. This paper demonstrates that the discovered process models can be extended with information to predict the completion time of running instances. There are many scenarios where it is useful to have reliable time predictions. For example, when a customer phones her insurance company for information about her insurance claim, she can be given an estimate for the remaining processing time. In order to do this, we provide a configurable approach to construct a process model, augment this model with time information learned from earlier instances, and use this to predict e.g., the completion time. To provide meaningful time predictions we use a configurable set of abstractions that allow for a good balance between “overfitting” and “underfitting”. The approach has been implemented in ProM and through several experiments using real-life event logs we demonstrate its applicability.

Keywords: Process mining
[355] Hyunggon Park, Deepak S. Turaga, Olivier Verscheure, and Mihaela van der Schaar. Foresighted tree configuration games in resource constrained distributed stream mining sensors. Ad Hoc Networks, 9(4):497 - 513, 2011. Multimedia Ad Hoc and Sensor Networks. [ bib | DOI | http ]
We consider the problem of optimizing stream mining applications that are constructed as tree topologies of classifiers and deployed on a set of resource constrained and distributed processing nodes (or sensors). The optimization involves selecting appropriate false-alarm detection tradeoffs (operating points) for each classifier to minimize an end-to-end misclassification penalty, while satisfying resource constraints. We design distributed solutions, by defining tree configuration games, where individual classifiers configure themselves to maximize an appropriate local utility. We define the local utility functions and determine the information that needs to be exchanged across classifiers in order to design the distributed solutions. We analytically show that there is a unique pure strategy Nash equilibrium in operating points, which guarantees convergence of the proposed approach. We develop both a myopic strategy, where the utility is purely local to the current classifier, and a foresighted strategy, where the utility includes the impact of a classifier’s actions on successive classifiers. We analytically show that actions determined based on foresighted strategies improve the end-to-end performance of the classifier tree, by deriving an associated probability bound. We also investigate the impact of resource constraints on the classifier action selections for each strategy, and the corresponding application performance. We propose a learning-based approach, which enables each classifier to effectively adapt to dynamic changes of resource constraints. We evaluate the performance of our solutions on an application for sports scene classification. We show that foresighted strategies result in better performance than myopic strategies in both resource unconstrained and resource constrained scenarios, and asymptotically approach the centralized optimal solution. We also show that the proposed distributed solutions outperform the centralized solution based on Sequential Quadratic Programming on average in resource unconstrained scenarios.

Keywords: Resource constrained stream mining sensors
[356] Xin Li and Zhi-Hong Deng. Mining frequent patterns from network flows for monitoring network. Expert Systems with Applications, 37(12):8850 - 8860, 2010. [ bib | DOI | http ]
Because of the varying and dynamic characteristics of network traffic, such as fast transfer, huge volume, short-lived, inestimable and infinite, it is a serious challenge for network administrators to monitor network traffic in real time and judge whether the whole network works well. Currently, most of the existing techniques in this area are based on signature training, learning or matching, which may be too complicated to satisfy timing requirements. Other statistical methods including sampling, hashing or counting are all approximate methods and compute an incomplete set of results. Since the main objective of network monitoring is to discover and understand the active events that happen frequently and may influence or even ruin the whole network, in this paper we aim to use the technique of frequent pattern mining to find out these events. We first design a sliding window model to ensure the mining results are novel and complete; then, considering the distribution and fluidity of network flows, we develop a powerful class of algorithms that contains a vertical re-mining algorithm, a multi-pattern re-mining algorithm, a fast multi-pattern capturing algorithm and a fast multi-pattern capturing supplement algorithm to deal with a series of problems when applying frequent pattern mining algorithms to network traffic analysis. Finally, we develop a monitoring system to evaluate our algorithms on real traces collected from the campus network of Peking University. The results show that the given algorithms are effective and our system can identify a lot of potentially very valuable information in time, which greatly helps network administrators to understand regular applications and detect network anomalies. So the research in this paper not only provides a new application area for frequent pattern mining, but also provides a new technique for network monitoring.

Keywords: Network monitoring
[357] Pauray S.M. Tsai. Mining frequent itemsets in data streams using the weighted sliding window model. Expert Systems with Applications, 36(9):11617 - 11625, 2009. [ bib | DOI | http ]
In recent years, data stream mining has become an important research topic. With the emergence of new applications, the data we process are no longer static, but continuous, dynamic data streams. Examples include network traffic analysis, Web click stream mining, network intrusion detection, and on-line transaction analysis. In this paper, we propose a new framework for data stream mining, called the weighted sliding window model. The proposed model allows the user to specify the number of windows for mining, the size of a window, and the weight for each window. Thus users can assign a higher weight to a more significant data section, which will make the mining result closer to the user’s requirements. Based on the weighted sliding window model, we propose a single-pass algorithm, called WSW, to efficiently discover all the frequent itemsets from data streams. By analyzing data characteristics, an improved algorithm, called WSW-Imp, is developed to further reduce the time of deciding whether a candidate itemset is frequent or not. Empirical results show that WSW-Imp outperforms {WSW} under the weighted sliding window model.
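
As a rough illustration of the weighted sliding window idea, the Python sketch below computes a weighted support for itemsets across user-weighted windows and keeps those above a threshold. The single-pass bookkeeping of WSW and the optimizations of WSW-Imp are not reproduced, and the function name and weighting scheme are assumptions made purely for illustration.

# Minimal sketch of the weighted sliding window idea: each window of
# transactions gets a user-specified weight, and an itemset is kept as
# "frequent" when its weighted support across the windows reaches a threshold.
from itertools import combinations

def weighted_frequent_itemsets(windows, weights, min_wsup, max_size=2):
    support = {}
    total_weight = sum(w * len(win) for w, win in zip(weights, windows))
    for w, win in zip(weights, windows):
        for trans in win:
            for k in range(1, max_size + 1):
                for itemset in combinations(sorted(trans), k):
                    support[itemset] = support.get(itemset, 0.0) + w
    return {s: v / total_weight for s, v in support.items()
            if v / total_weight >= min_wsup}

windows = [[{"a", "b"}, {"a", "c"}],      # older window
           [{"a", "b"}, {"b", "c"}]]      # newer window
weights = [0.5, 1.0]                      # newer data weighted higher
print(weighted_frequent_itemsets(windows, weights, min_wsup=0.3))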

Keywords: Data mining
[358] Wu Xie, Huimin Zhang, Bizhong Wei, and Guanghai Fang. Data mining of graduation project selection database. Procedia Engineering, 15(0):4008 - 4011, 2011. {CEIS} 2011. [ bib | DOI | http ]
In order to improve the quality of graduation projects, a database system of hundreds of graduation project selection results was built in C# with Visual Studio. The information system was mined with the {ID3} algorithm, and a decision tree was obtained for studying these graduation project choices. The mining and software testing results demonstrate that the quality of graduation project selection is associated mostly with the difficulty, project direction and major direction, which guides students to choose suitable graduation projects in time, greatly improving the efficiency and quality of graduation project selection.

Keywords: Data mining
[359] Ying Wang, Jie Li, and Xinbo Gao. Latent feature mining of spatial and marginal characteristics for mammographic mass classification. Neurocomputing, 144(0):107 - 118, 2014. [ bib | DOI | http ]
Mass classification is one of the key procedures in a mammography computer-aided diagnosis (CAD) system, which is widely applied to help improve clinical diagnosis performance. In the literature, classical mass classification systems always employ a large number and variety of features for discriminating masses. This produces higher computational complexity, and the incompatibility among various features may also negatively affect classification accuracy. Furthermore, latent characteristics of masses, which are useful to reveal hidden distribution patterns of masses, are seldom considered in existing schemes. To address these issues, this paper proposes a new mammographic mass classification scheme. Mammograms are first detected and segmented to obtain regions of interest containing masses (ROIms). Then Latent Dirichlet Allocation (LDA) is introduced to find the hidden topic distribution of ROIms. A special spatial pyramid structure is proposed and incorporated with {LDA} for capturing latent spatial characteristics of ROIms. For mining latent statistical marginal characteristics of masses, local patches on the segmented boundary are extracted to construct a special document for LDA. Finally, all the latent topics are combined, analyzed and classified by employing the {SVM} classifier. The experimental results on a dataset from {DDSM} demonstrate the effectiveness and efficiency of the proposed classification scheme.
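
A generic sketch of the final step described above (topic proportions learned by LDA fed to an SVM) can be written with scikit-learn as follows. The random count matrix merely stands in for the ROI patch and boundary "documents" of the paper, and the spatial pyramid construction is not reproduced.

# Generic sketch: learn latent topics with LDA over (visual-)word count
# vectors and feed the topic proportions to an SVM. Real ROI patch features
# would replace the random counts used here purely for illustration.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_counts = rng.integers(0, 5, size=(60, 200))   # 60 ROIs, 200 "visual words"
y = rng.integers(0, 2, size=60)                 # benign / malignant labels

clf = make_pipeline(
    LatentDirichletAllocation(n_components=10, random_state=0),
    SVC(kernel="rbf"),
)
clf.fit(X_counts, y)
print(clf.predict(X_counts[:5]))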

Keywords: Mammographic mass classification
[360] A.K. Chandanan and M.K. Shukla. Removal of duplicate rules for association rule mining from multilevel dataset. Procedia Computer Science, 45(0):143 - 149, 2015. International Conference on Advanced Computing Technologies and Applications (ICACTA). [ bib | DOI | http ]
Association rule mining is one of the most researched areas of data mining and is useful in marketing and retailing strategies. Association mining retrieves sets of attributes shared by a large number of objects in a given database. There are many potential application areas for the association rule approach, including design, layout, customer segregation and so on. Redundancy in association rules affects the quality of the information presented, and the goal of redundancy elimination is to improve the quality and usefulness of the rules. Our work aims to remove hierarchical duplicates in multi-level rule sets, thus reducing the size of the rule set to improve its quality and usefulness without any loss.

Keywords: Association Rule
[361] Anat Cohen and Rafi Nachmias. What can instructors and policy makers learn about web-supported learning through web-usage mining. The Internet and Higher Education, 14(2):67 - 76, 2011. Web mining and higher education: Introduction to the special issue. [ bib | DOI | http ]
This paper focuses on a Web-log based tool for evaluating pedagogical processes occurring in Web-supported academic instruction and students' attitudes. The tool consists of computational measures which demonstrate what instructors and policy makers can learn about Web-supported instruction through Web-usage mining. The tool can provide different measures and reports for instructors at the micro level, and for policy makers at the macro level. The instructors' reports provide feedback relating to the pedagogical processes in their course Websites in comparison to other similar courses on campus. The policy makers' reports provide data about the extent of use of course Websites across the campus, the benefits of such use, and the return on investment. This paper describes the tool and its computational measures as well as its implementation, first on a sample course and next on 3453 course Websites at Tel-Aviv University.

Keywords: Web-supported learning
[362] Kaiquan Xu, Stephen Shaoyi Liao, Jiexun Li, and Yuxia Song. Mining comparative opinions from customer reviews for competitive intelligence. Decision Support Systems, 50(4):743 - 754, 2011. Enterprise Risk and Security Management: Data, Text and Web Mining. [ bib | DOI | http ]
Competitive Intelligence is one of the key factors for enterprise risk management and decision support. However, the functions of Competitive Intelligence are often greatly restricted by the lack of sufficient information sources about the competitors. With the emergence of Web 2.0, the large numbers of customer-generated product reviews often contain information about competitors and have become a new source of mining Competitive Intelligence. In this study, we proposed a novel graphical model to extract and visualize comparative relations between products from customer reviews, with the interdependencies among relations taken into consideration, to help enterprises discover potential risks and further design new products and marketing strategies. Our experiments on a corpus of Amazon customer reviews show that our proposed method can extract comparative relations more accurately than the benchmark methods. Furthermore, this study opens a door to analyzing the rich consumer-generated data for enterprise risk management.

Keywords: Opinion mining
[363] Ximing Li, Jihong Ouyang, and Xiaotang Zhou. Supervised topic models for multi-label classification. Neurocomputing, 149, Part B(0):811 - 819, 2015. [ bib | DOI | http ]
Recently, some publications indicated that generative modeling approaches, i.e., topic models, achieve appreciable performance on multi-label classification, especially for skewed data sets. In this paper, we develop two supervised topic models for multi-label classification problems. The two models, i.e., Frequency-LDA (FLDA) and Dependency-Frequency-LDA (DFLDA), extend Latent Dirichlet Allocation (LDA) via two observations, i.e., the frequencies of the labels and the dependencies among different labels. We train the models with the Gibbs sampling algorithm. The experimental results on well-known collections demonstrate that our two models outperform the state-of-the-art approaches.

Keywords: Supervised topic model
[364] Jung-Lung Hsu, Huey-Wen Chou, and Hsiu-Hua Chang. Eduminer: Using text mining for automatic formative assessment. Expert Systems with Applications, 38(4):3431 - 3439, 2011. [ bib | DOI | http ]
Formative assessment and summative assessment are two widely accepted approaches of assessment. While summative assessment is a typically formal assessment used at the end of a lesson or course, formative assessment is an ongoing process of monitoring learners’ progress of knowledge construction. Although empirical evidence has acknowledged that formative assessment is indeed superior to summative assessment, current e-learning assessment systems, however, seldom provide plausible solutions for conducting formative assessment. The major bottleneck of putting formative assessment into practice lies in its labor-intensive and time-consuming nature, which makes it hardly a feasible way of achievement evaluation, especially when there are usually a large number of learners in an e-learning environment. In this regard, this study developed EduMiner to relieve the burdens imposed on instructors and learners by capitalizing on a series of text mining techniques. An empirical study was conducted to examine the effectiveness and to explore the outcomes of the features that EduMiner supported. In this study, 56 participants enrolled in a “Human Resource Management” course were randomly divided into either experimental or control groups. Results of this study indicated that the algorithms introduced in this study serve as a feasible approach for conducting formative assessment in an e-learning environment. In addition, learners in the experimental groups were highly motivated to phrase the contents with a higher-order level of cognition. Therefore, timely feedback of visualized representations is beneficial for facilitating online learners to express more in-depth ideas in discourse.

Keywords: E-learning
[365] Massimo Brescia, Giuseppe Longo, and Fabio Pasian. Mining knowledge in astrophysical massive data sets. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 623(2):845 - 849, 2010. 1st International Conference on Frontiers in Diagnostics Technologies. [ bib | DOI | http ]
Modern scientific data mainly consist of huge data sets gathered by a very large number of techniques and stored in much diversified and often incompatible data repositories. More generally, in the e-science environment, it is considered a critical and urgent requirement to integrate services across distributed, heterogeneous, dynamic “virtual organizations” formed by different resources within a single enterprise. In the last decade, Astronomy has become an immensely data-rich field due to the evolution of detectors (plates to digital to mosaics), telescopes and space instruments. The Virtual Observatory approach consists of the federation under common standards of all astronomical archives available worldwide, as well as data analysis, data mining and data exploration applications. The main drive behind such an effort is that once the infrastructure is complete, it will allow a new type of multi-wavelength, multi-epoch science, which can only be barely imagined. Data mining, or knowledge discovery in databases, while being the main methodology to extract the scientific information contained in such Massive Data Sets (MDS), poses crucial challenges since it has to orchestrate transparent access to different computing environments, scalability of algorithms, reusability of resources, etc. In the present paper we summarize the current status of {MDS} in the Virtual Observatory and what is being done and planned to bring in advanced data mining methodologies, in the case of the {DAME} (DAta Mining and Exploration) project.

Keywords: Astrophysics
[366] Colleen McCue. Chapter 3 - data mining and predictive analytics. In Colleen McCue, editor, Data Mining and Predictive Analysis (Second Edition), pages 31 - 48. Butterworth-Heinemann, Boston, second edition, 2015. [ bib | DOI | http ]
Data mining and predictive analytics support the discovery and characterization of trends, patterns, and relationships in data through the use of exploratory graphics in combination with advanced statistical modeling, machine learning, and artificial intelligence. Results can be understood in terms of their contribution to confirmation and discovery. Confirmation involves the validation, extension, and operationalization of what we know, and discovery includes the identification of new trends, patterns, and relationships.

Keywords: Discovery
[367] Tutut Herawan and Mustafa Mat Deris. A soft set approach for association rules mining. Knowledge-Based Systems, 24(1):186 - 195, 2011. [ bib | DOI | http ]
In this paper, we present an alternative approach for mining regular association rules and maximal association rules from transactional datasets using soft set theory. The approach starts by transforming a transactional dataset into a Boolean-valued information system. Since the “standard” soft set deals with such information systems, a transactional dataset can be represented as a soft set. Using the concept of parameter co-occurrence in a transaction, we define the notions of regular and maximal association rules between two sets of parameters, together with their support, confidence, maximal support and maximal confidence, using soft set theory. The results show that the soft regular and soft maximal association rules provide identical rules compared to the regular and maximal association rules.
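
The co-occurrence-based definitions can be pictured on a small Boolean-valued information system. The Python sketch below computes the usual support and confidence of a rule between two parameter sets; it is only a plain restatement of those notions, not the soft-set formalism itself, and the function names and toy table are illustrative.

# Parameter co-occurrence on a Boolean-valued information system: a
# transactional dataset becomes a 0/1 table (rows = transactions,
# columns = parameters/items), and support/confidence of a rule X => Y
# are read off from co-occurring columns.
def support(table, params):
    rows = [r for r in table if all(r[p] for p in params)]
    return len(rows) / len(table)

def confidence(table, lhs, rhs):
    sup_lhs = support(table, lhs)
    return support(table, lhs | rhs) / sup_lhs if sup_lhs else 0.0

# Transactions over parameters a, b, c represented as a soft-set-like table.
table = [
    {"a": 1, "b": 1, "c": 0},
    {"a": 1, "b": 1, "c": 1},
    {"a": 0, "b": 1, "c": 1},
    {"a": 1, "b": 0, "c": 0},
]
print(support(table, {"a", "b"}))          # 0.5
print(confidence(table, {"a"}, {"b"}))     # 0.666...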

Keywords: Association rules mining
[368] Hui Cao, Gangquan Si, Yanbin Zhang, and Lixin Jia. Enhancing effectiveness of density-based outlier mining scheme with density-similarity-neighbor-based outlier factor. Expert Systems with Applications, 37(12):8090 - 8101, 2010. [ bib | DOI | http ]
This paper proposes a density-similarity-neighbor-based outlier mining algorithm for data preprocessing in data mining. First, the concept of the k-density of an object is presented and the similar density series (SDS) of the object is established based on the changes of the k-density of the object and the k-densities of its neighbors. Second, the average series cost (ASC) of the object is obtained based on the weighted sum of the distances between adjacent objects in the {SDS} of the object. Finally, the density-similarity-neighbor-based outlier factor (DSNOF) of the object is calculated by using both the {ASC} of the object and the {ASC} of the k-distance neighbors of the object, and the degree to which the object is an outlier is indicated by the DSNOF. Experiments are performed on synthetic and real datasets to evaluate the effectiveness and the performance of the proposed algorithm. The experimental results verify that the proposed algorithm achieves higher-quality outlier mining without increasing the algorithm's complexity.

Keywords: Outlier mining
[369] Tarek F. Gharib, Hamed Nassar, Mohamed Taha, and Ajith Abraham. An efficient algorithm for incremental mining of temporal association rules. Data & Knowledge Engineering, 69(8):800 - 815, 2010. [ bib | DOI | http ]
This paper presents the concept of temporal association rules in order to solve the problem of handling time series by including time expressions in association rules. In practice, temporal databases are continually appended or updated, so the discovered rules need to be updated as well. Re-running the temporal mining algorithm every time is inefficient since it neglects the previously discovered rules and repeats work already done. Furthermore, existing incremental mining techniques cannot deal with temporal association rules. In this paper, an incremental algorithm to maintain the temporal association rules in a transaction database is proposed. The algorithm benefits from the results of earlier mining to derive the final mining output. The experimental results on both synthetic and real datasets illustrate a significant improvement over the conventional approach of mining the entire updated database.

Keywords: Temporal Association Rules (TAR)
[370] Julie Moeyersoms, Enric Junqué de Fortuny, Karel Dejaeger, Bart Baesens, and David Martens. Comprehensible software fault and effort prediction: A data mining approach. Journal of Systems and Software, 100(0):80 - 90, 2015. [ bib | DOI | http ]
Software fault and effort prediction are important tasks to minimize costs of a software project. In software effort prediction the aim is to forecast the effort needed to complete a software project, whereas software fault prediction tries to identify fault-prone modules. In this research both tasks are considered, thereby using different data mining techniques. The predictive models not only need to be accurate but also comprehensible, demanding that the user can understand the motivation behind the model's prediction. Unfortunately, to obtain predictive performance, comprehensibility is often sacrificed and vice versa. To overcome this problem, we extract trees from well performing Random Forests (RFs) and Support Vector Machines for regression (SVRs) making use of a rule extraction algorithm ALPA. This method builds trees (using C4.5 and REPTree) that mimic the black-box model (RF, SVR) as closely as possible. The proposed methodology is applied to publicly available datasets, complemented with new datasets that we have put together based on the Android repository. Surprisingly, the trees extracted from the black-box models by {ALPA} are not only comprehensible and explain how the black-box model makes (most of) its predictions, but are also more accurate than the trees obtained by working directly on the data.
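
A simplified sketch of the mimic idea behind this kind of rule extraction is shown below: fit a black-box Random Forest, relabel the data with its predictions, and grow a shallow decision tree on those predictions. ALPA additionally generates artificial instances near the training data before building the tree, which this sketch omits; the synthetic data and model settings are placeholders.

# Simplified mimic-tree sketch: the decision tree is trained on the
# black-box model's predictions rather than the original targets, so it
# approximates (and explains) the black-box behaviour.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor, export_text

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

black_box = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
y_bb = black_box.predict(X)                      # black-box "oracle" labels

mimic = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y_bb)
print(export_text(mimic, feature_names=[f"x{i}" for i in range(5)]))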

Keywords: Rule extraction
[371] Amelia Zafra, Eva L. Gibaja, and Sebastián Ventura. Multiple instance learning with multiple objective genetic programming for web mining. Applied Soft Computing, 11(1):93 - 102, 2011. [ bib | DOI | http ]
This paper introduces a multi-objective grammar based genetic programming algorithm, MOG3P-MI, to solve a Web Mining problem from the perspective of multiple instance learning. This algorithm is evaluated and compared to other algorithms that were previously used to solve this problem. Computational experiments show that the MOG3P-MI algorithm obtains the best results, adds comprehensibility and clarity to the knowledge discovery process and overcomes the main drawbacks of previous techniques obtaining solutions which maintain a balance between conflicting measurements like sensitivity and specificity.

Keywords: Multi-instance learning
[372] Ying-Hong Liang, Jin xiang Li, Liang Ye, Ke Chen, and Cui zhen Guo. The chinese unknown term translation mining with supervised candidate term extraction strategy. Procedia Engineering, 15(0):1388 - 1392, 2011. {CEIS} 2011. [ bib | DOI | http ]
Most researchers extract candidate terms using unsupervised methods. In this paper, a supervised candidate term extraction method is proposed. It combines English part-of-speech tagging and a headword expansion chunking strategy. Firstly, it retrieves bilingual snippets from the web by term expansion; then, it crawls Chinese-English pages and screens out Chinese words from these pages; lastly, a headword expansion chunking strategy is used to identify English phrases, and the English noun phrases and verb phrases are selected. These selected English phrases serve as the final candidate terms for term translation mining. Experimental results show that the supervised candidate term extraction method improves the top-10 inclusion rate by 1.6% over the baseline system, which verifies that the supervised candidate term extraction method is effective.

Keywords: {OOV} term
[373] G. Ram Kumar, K. Sakthivel, R.M. Sundaram, C.N. Neeraja, S.M. Balachandran, N. Shobha Rani, B.C. Viraktamath, and M.S. Madhav. Allele mining in crops: Prospects and potentials. Biotechnology Advances, 28(4):451 - 461, 2010. [ bib | DOI | http ]
An enormous amount of sequence information is available in public databases as a result of the sequencing of diverse crop genomes. It is important to use this genomic information for the identification and isolation of novel and superior alleles of agronomically important genes from crop gene pools to suitably deploy for the development of improved cultivars. Allele mining is a promising approach to dissect naturally occurring allelic variation at candidate genes controlling key agronomic traits, which has potential applications in crop improvement programs. It helps in tracing the evolution of alleles, identification of new haplotypes and development of allele-specific markers for use in marker-assisted selection. Realizing the immense potential of allele mining, concerted allele mining efforts are underway in many international crop research institutes. This review examines the concepts, approaches and applications of allele mining along with the associated challenges, while emphasizing the need for more refined ‘mining’ strategies for accelerating the process of allele discovery and its utilization in molecular breeding.

Keywords: Allele mining
[374] Lingfeng Niu, Jianmin Wu, and Yong Shi. Second-order mining for active collaborative filtering. Procedia Computer Science, 4(0):1726 - 1734, 2011. Proceedings of the International Conference on Computational Science, {ICCS} 2011. [ bib | DOI | http ]
Active learning for collaborative filtering tasks has drawn much attention from the research community. It can capture the user's interest with a greatly reduced labeling burden for the online user. High-quality recommendations can thus be made with a good user experience. In this paper we address the efficiency challenge of current active learning methods for online and interactive applications by using second-order mining techniques. Based on the global latent semantic model learnt from the feedback of historical users on items, we propose an intuitive and efficient query strategy for item selection for a new active user. The time complexity of each query is greatly reduced, to a constant O(1). Experimental results on publicly available data sets show the efficiency and effectiveness of our method.

Keywords: Second-order Mining
[375] Vincent S. Tseng and Chao-Hui Lee. Effective temporal data classification by integrating sequential pattern mining and probabilistic induction. Expert Systems with Applications, 36(5):9524 - 9532, 2009. [ bib | DOI | http ]
Data classification is an important topic in the field of data mining due to its wide applications. A number of related methods have been proposed based on well-known learning models such as decision trees or neural networks. Although data classification has been widely discussed, relatively few studies have explored the topic of temporal data classification. Most existing research has focused on improving the accuracy of classification by using statistical models, neural networks, or distance-based methods. However, they cannot interpret the results of classification to users. In many research settings, such as microarray gene expression analysis, users prefer interpretable classification information over a classifier that merely has high accuracy. In this paper, we propose a novel pattern-based data mining method, namely classify-by-sequence (CBS), for classifying large temporal datasets. The main methodology behind {CBS} is integrating sequential pattern mining with probabilistic induction. {CBS} has the merit of simplicity in implementation and its pattern-based architecture can supply clear classification information to users. Through experimental evaluation, {CBS} was shown to deliver classification results with high accuracy on two real time series datasets. In addition, we designed a simulator to evaluate the performance of {CBS} on datasets with different characteristics. The experimental results show that {CBS} can discover the hidden patterns and classify data effectively by utilizing the mined sequential patterns.

Keywords: Temporal data
[376] Tamer Uçar and Adem Karahoca. Predicting existence of mycobacterium tuberculosis on patients using data mining approaches. Procedia Computer Science, 3(0):1404 - 1411, 2011. World Conference on Information Technology. [ bib | DOI | http ]
A correct diagnosis of tuberculosis (TB) can only be established by applying a medical test to the patient’s phlegm. The result of this test is obtained after a time period of about 45 days. The purpose of this study is to develop a data mining (DM) solution which makes the diagnosis of tuberculosis as accurate as possible and helps decide whether it is reasonable to start tuberculosis treatment on suspected patients without waiting for the exact medical test results. In this research, we propose the use of a Sugeno-type “adaptive-network-based fuzzy inference system” (ANFIS) to predict the existence of mycobacterium tuberculosis. 667 patient records obtained from a clinic are used in the entire process of this research. Each of the patient records consists of 30 separate input parameters. The {ANFIS} model is generated by using 500 of those records. We also implemented a multilayer perceptron and a {PART} model using the same data set. The {ANFIS} model classifies the instances with an {RMSE} of 18%, whereas the multilayer perceptron does the same classification with an {RMSE} of 19% and the {PART} algorithm with an {RMSE} of 20%. {ANFIS} is an accurate and reliable method when compared with the multilayer perceptron and {PART} algorithms for the classification of tuberculosis patients. This study contributes to forecasting patients' status before the medical test results are available.

Keywords: Tuberculosis
[377] Jimmy Secretan, Michael Georgiopoulos, Anna Koufakou, and Kel Cardona. Aphid: An architecture for private, high-performance integrated data mining. Future Generation Computer Systems, 26(7):891 - 904, 2010. [ bib | DOI | http ]
While the emerging field of privacy preserving data mining (PPDM) will enable many new data mining applications, it suffers from several practical difficulties. {PPDM} algorithms are challenging to develop and computationally intensive to execute. Developers need convenient abstractions to simplify the engineering of {PPDM} applications. The individual parties involved in the data mining process need a way to bring high-performance, parallel computers to bear on the computationally intensive parts of the {PPDM} tasks. This paper discusses {APHID} (Architecture for Private and High-performance Integrated Data mining), a practical architecture and software framework for developing and executing large scale {PPDM} applications. At one tier, the system supports simplified use of cluster and grid resources, and at another tier, the system abstracts communication for easy {PPDM} algorithm development. This paper offers a detailed analysis of the challenges in developing {PPDM} algorithms with existing frameworks, and motivates the design of a new infrastructure based on these challenges.

Keywords: Data mining
[378] Michel J. Anzanello, Flavio S. Fogliatto, and Karina Rossini. Data mining-based method for identifying discriminant attributes in sensory profiling. Food Quality and Preference, 22(1):139 - 148, 2011. [ bib | DOI | http ]
Selection of attributes from a group of candidates to be assessed through sensory analysis is an important issue when planning sensory panels. In attribute selection it is desirable to reduce the list of attributes to be presented to panelists to avoid fatigue, minimize costs and save time. In some applications the goal is to keep attributes that are relevant and non-redundant in the sensory characterization of products. In this paper, however, we are interested in keeping attributes that best discriminate between products. For that we present a data mining-based method for attribute selection in descriptive sensory panels, such as those used in Quantitative Descriptive Analysis. The proposed method is implemented using Principal Component Analysis and the k-Nearest Neighbor classification technique, in conjunction with Pareto Optimal analysis. The objectives are (i) to identify the set of attributes that best discriminate the samples analyzed in the panel, and (ii) to indicate the group of panelists that provide consistent evaluations. The method is illustrated through a case study where beef cubes in stew, used as a combat ration by the American Army, are characterized in sensory panels using the Spectrum protocol.
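
One way to picture the scoring step is the scikit-learn sketch below, which rates a candidate attribute subset by how well a PCA projection followed by k-Nearest Neighbors separates the products. The Pareto-optimal analysis and panelist-consistency assessment of the paper are not reproduced, and the data and names are placeholders.

# Illustrative scoring of a candidate attribute subset: project the selected
# sensory attributes with PCA and measure how well k-NN discriminates the
# products under 5-fold cross-validation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
scores_matrix = rng.normal(size=(90, 12))        # 90 panel evaluations, 12 attributes
product = rng.integers(0, 3, size=90)            # 3 products being profiled

def subset_score(columns):
    model = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))
    return cross_val_score(model, scores_matrix[:, columns], product, cv=5).mean()

print(subset_score([0, 1, 2, 3]))   # discrimination power of attributes 0-3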

Keywords: Attribute selection
[379] Xuezhong Zhou, Yonghong Peng, and Baoyan Liu. Text mining for traditional chinese medical knowledge discovery: A survey. Journal of Biomedical Informatics, 43(4):650 - 660, 2010. [ bib | DOI | http ]
Extracting meaningful information and knowledge from free text is the subject of considerable research interest in the machine learning and data mining fields. Text data mining (or text mining) has become one of the most active research sub-fields in data mining. Significant developments in the area of biomedical text mining during the past years have demonstrated its great promise for supporting scientists in developing novel hypotheses and new knowledge from the biomedical literature. Traditional Chinese medicine (TCM) provides a distinct methodology with which to view human life. It is one of the most complete and distinguished traditional medicines with a history of several thousand years of studying and practicing the diagnosis and treatment of human disease. It has been shown that the {TCM} knowledge obtained from clinical practice has become a significant complementary source of information for modern biomedical sciences. {TCM} literature obtained from the historical period and from modern clinical studies has recently been transformed into digital data in the form of relational databases or text documents, which provide an effective platform for information sharing and retrieval. This motivates and facilitates research and development into knowledge discovery approaches and to modernize TCM. In order to contribute to this still growing field, this paper presents (1) a comparative introduction to {TCM} and modern biomedicine, (2) a survey of the related information sources of TCM, (3) a review and discussion of the state of the art and the development of text mining techniques with applications to TCM, (4) a discussion of the research issues around {TCM} text mining and its future directions.

Keywords: Text mining
[380] C. Parra, C. Otalora, A. Forero, and M. Devy. 2 - robots for non-conventional de-mining processes: From remote control to autonomy. In Y. Baudoin and Maki K. Habib, editors, Using Robots in Hazardous Environments, pages 32 - 62. Woodhead Publishing, 2011. [ bib | DOI | http ]
This chapter aims to show the results of the perceptual strategy of a global project being developed in Bogota, Colombia by Pontificia Universidad Javeriana and in Toulouse, France by LAAS/CNRS. This cooperation project, funded by the Colombian-French ECOS-NORD program, includes an investigation into mine sensing technologies, path planning and robotic platforms, among others. The chapter gives the general context and the main challenges in coping with humanitarian demining, presents a method based on the analysis of fused multisensor data to improve landmine detection, along with its implementation on an embedded system, and presents two robotic demining platforms called {URSULA} and AMARANTA. These robots have been designed and built as mine hunting platforms to be used in developing countries, so these platforms could be cheap and easy-to-use solutions for humanitarian demining. The robots incorporate the perceptual capabilities described earlier in this chapter.

Keywords: robotics
[381] Yong Joon Lee, Jun Wook Lee, Duck Jin Chai, Bu Hyun Hwang, and Keun Ho Ryu. Mining temporal interval relational rules from temporal data. Journal of Systems and Software, 82(1):155 - 167, 2009. Special Issue: Software Performance - Modeling and Analysis. [ bib | DOI | http ]
Temporal data mining is still an important research topic since there are application areas that need knowledge from temporal data, such as sequential patterns, similar time sequences, cyclic and temporal association rules, and so on. Although there are many studies on temporal data mining, they do not deal with discovering knowledge from temporal interval data such as patient histories, purchaser histories, and web logs. We propose a new temporal data mining technique that can extract temporal interval relation rules from temporal interval data by using Allen’s theory: a preprocessing algorithm designed for the generalization of temporal interval data and a temporal relation algorithm for mining temporal relation rules from the generalized temporal interval data. This technique can provide more useful knowledge in comparison with conventional data mining techniques.
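
Since the rules are built on Allen's interval relations, a small Python helper that classifies the basic relation between two intervals may help fix ideas. It is only a building block one might use in such a preprocessing step, not the authors' algorithms.

# Return the basic Allen relation between two closed intervals (s, e).
def allen_relation(a, b):
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:  return "before"
    if e2 < s1:  return "after"
    if e1 == s2: return "meets"
    if e2 == s1: return "met-by"
    if (s1, e1) == (s2, e2): return "equals"
    if s1 == s2: return "starts" if e1 < e2 else "started-by"
    if e1 == e2: return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2: return "during"
    if s1 < s2 and e2 < e1: return "contains"
    return "overlaps" if s1 < s2 else "overlapped-by"

print(allen_relation((1, 4), (4, 9)))   # meets
print(allen_relation((2, 6), (1, 8)))   # during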

Keywords: Data mining
[382] Bruce Reiner. New strategies for medical data mining, part 1: Dynamic and performance-based reimbursement. Journal of the American College of Radiology, 7(12):975 - 979, 2010. [ bib | DOI | http ]
The current professional reimbursement model within medicine was created more than 20 years ago in response to physician dissatisfaction and health care inflationary pressures. Despite many resulting improvements, several deficiencies currently exist within the current reimbursement model, related to transparency, accountability, and quality. As the tenets of evidence-based medicine and pay for performance become ingrained within health care delivery, it would be beneficial to modify the existing reimbursement model to reflect these principles. The opportunity to accomplish this goal is advanced through the continued evolution of information systems technologies and data mining. The author discusses the existing deficiencies in medical reimbursement and makes a number of recommendations for improvement. The ultimate goal is to incorporate objective and standardized data into a transparent and readily accessible database, which can be used to enhance performance, education, and informed decision making.

Keywords: Reimbursement
[383] Amrutha Benny and Mintu Philip. Keyword based tweet extraction and detection of related topics. Procedia Computer Science, 46(0):364 - 371, 2015. Proceedings of the International Conference on Information and Communication Technologies, {ICICT} 2014, 3-5 December 2014 at Bolgatty Palace & Island Resort, Kochi, India. [ bib | DOI | http ]
Twitter is a micro-blogging site that supports the transfer of information as short tweets. The large quantity of information makes it necessary to find methods and tools to summarize it. Our research proposes a method which collects tweets using a specific keyword and then summarizes them to find topics related to that keyword. The topic detection is done by using clusters of frequent patterns. Existing pattern-oriented topic detection techniques suffer from the problem of wrongly correlated patterns. In this paper, we propose two algorithms, {TDA} (Topic Detection using AGF) and {TCTR} (Topic Clustering and Tweet Retrieval), which help to overcome this problem. From various experimental results, it is observed that the proposed method can maintain good performance irrespective of the size of the data set.

Keywords: Twitter
[384] Chun-Ling Chen, Frank S.C. Tseng, and Tyne Liang. An integration of wordnet and fuzzy association rule mining for multi-label document clustering. Data & Knowledge Engineering, 69(11):1208 - 1226, 2010. Special issue on contribution of ontologies in designing advanced information systems. [ bib | DOI | http ]
With the rapid growth of text documents, document clustering has become one of the main techniques for organizing large amounts of documents into a small number of meaningful clusters. However, there still exist several challenges for document clustering, such as high dimensionality, scalability, accuracy, meaningful cluster labels, overlapping clusters, and extracting semantics from texts. In order to improve the quality of document clustering results, we propose an effective Fuzzy-based Multi-label Document Clustering (FMDC) approach that integrates fuzzy association rule mining with an existing ontology, WordNet, to alleviate these problems. In our approach, the key terms are extracted from the document set, and the initial representation of all documents is further enriched by using hypernyms from WordNet in order to exploit the semantic relations between terms. Then, a fuzzy association rule mining algorithm for texts is employed to discover a set of highly related fuzzy frequent itemsets, which contain key terms to be regarded as the labels of the candidate clusters. Finally, each document is dispatched into more than one target cluster by referring to these candidate clusters, and then the highly similar target clusters are merged. We conducted experiments to evaluate the performance based on the Classic, Re0, R8, and WebKB datasets. The experimental results show that our approach outperforms influential document clustering methods with higher accuracy. Therefore, our approach not only provides more general and meaningful labels for documents, but also effectively generates overlapping clusters.
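
The hypernym-enrichment step can be sketched with NLTK's WordNet interface as below (the wordnet corpus must be downloaded once via nltk.download('wordnet')). The fuzzy association rule mining and cluster merging of FMDC are not shown, and the helper name and depth parameter are illustrative.

# Expand document terms with WordNet hypernyms up to a given depth.
from nltk.corpus import wordnet as wn

def enrich_with_hypernyms(terms, depth=1):
    enriched = set(terms)
    for term in terms:
        for synset in wn.synsets(term):
            frontier = [synset]
            for _ in range(depth):
                frontier = [h for s in frontier for h in s.hypernyms()]
                enriched.update(l.replace("_", " ") for s in frontier
                                for l in s.lemma_names())
    return enriched

print(enrich_with_hypernyms(["car", "bicycle"]))
# e.g. adds hypernym terms such as "motor vehicle" and "wheeled vehicle"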

Keywords: Fuzzy association rule mining
[385] R.J. Kuo, C.M. Pai, R.H. Lin, and H.C. Chu. The integration of association rule mining and artificial immune network for supplier selection and order quantity allocation. Applied Mathematics and Computation, 250(0):958 - 972, 2015. [ bib | DOI | http ]
This study first uses one of the association rule mining techniques, the TD-FP-growth algorithm, to select the important suppliers from the existing suppliers and determine the importance of each supplier. A hybrid of an artificial immune network (Opt-aiNet) and particle swarm optimization (PSO), called aiNet-PSO, is then proposed to allocate the order quantity for the key suppliers at minimum cost. In order to verify the proposed method, a case company’s daily purchasing ledger is used, with emphasis on consumer electronics manufacturers. The computational results indicate that the TD-FP-growth algorithm can select the key suppliers using the historical data. The proposed hybrid method also provides a cheaper solution than a genetic algorithm, particle swarm optimization, or an artificial immune system.

Keywords: Supplier selection
[386] Jong Hwan Suh, Chung Hoon Park, and Si Hyun Jeon. Applying text and data mining techniques to forecasting the trend of petitions filed to e-people. Expert Systems with Applications, 37(10):7255 - 7268, 2010. [ bib | DOI | http ]
As the Internet has become the virtual place where citizens unite and their opinions are promptly turned into action, two-way communication between the government sector and citizens has become increasingly important among the activities of e-Government. Hence, the Anti-corruption and Civil Rights Commission (ACRC) in the Republic of Korea has constructed the online petition portal system named e-People. In addition, the nation’s Open Innovation through e-People has gained increasing attention, because e-People can serve as the virtual space where citizens participate in improving the national law and policy by simply filing petitions to e-People as the voice of the nation. However, there are currently problems and challenging issues to be solved before e-People can function as the virtual space for the nation’s Open Innovation based on petitions collected from citizens. First, there is no objective and systematic method for analyzing a large number of petitions filed to e-People without a great deal of manual work by petition inspectors. Second, e-People is required to forecast the trend of petitions filed to e-People more accurately and quickly than petition inspectors in order to make better decisions on national law and policy strategy. Therefore, in this paper, we propose a framework for applying text and data mining techniques not only to analyze a large number of petitions filed to e-People but also to predict the trend of petitions. In detail, we apply text mining techniques to the unstructured data of petitions to elicit keywords and identify groups of petitions with the elicited keywords. Moreover, we apply data mining techniques to the structured data of the identified petition groups in order to forecast the trend of petitions. Our approach based on text and data mining techniques decreases the time-consuming manual work of reading and classifying a large number of petitions, and contributes to increasing accuracy in evaluating the trend of petitions. Eventually, it helps petition inspectors to pay more attention to detecting and tracking important groups of petitions that may grow into nationwide problems. Further, the petitions ordered by their petition groups’ trend values can be used as the baseline for making better decisions on national law and policy strategy.

Keywords: Text mining
[387] Álvaro Villalba, Juan Luis Pérez, David Carrera, Carlos Pedrinaci, and Luca Panziera. servioticy and iserve: A scalable platform for mining the iot. Procedia Computer Science, 52(0):1022 - 1027, 2015. The 6th International Conference on Ambient Systems, Networks and Technologies (ANT-2015), the 5th International Conference on Sustainable Energy Information Technology (SEIT-2015). [ bib | DOI | http ]
In recent years, Internet of Things (IoT) and Big Data platforms have clearly been converging in terms of technologies, problems and approaches. IoT ecosystems generate a vast amount of data that needs to be stored and processed, becoming a Big Data problem. In this paper we present a platform that is specifically designed for mining the information associated with the IoT, including both sensor data and meta-data. The platform is composed of two major components: servIoTicy for storing and processing data, and iServe for the publication and discovery of sensor meta-data. The former provides capabilities to ingest, transform in real time and query data generated by sensors; the latter provides capabilities to publish, discover and use sensors based on the semantic information associated with them. Both components are clearly designed for scalability, as any IoT cloud deployment requires. Both servIoTicy and iServe are available as open source projects.

Keywords: Internet of Things
[388] Hui Li, Jie Sun, and Jian Wu. Predicting business failure using classification and regression tree: An empirical comparison with popular classical statistical methods and top classification mining methods. Expert Systems with Applications, 37(8):5895 - 5904, 2010. [ bib | DOI | http ]
Predicting business failure is a very critical task for government officials, stock holders, managers, employees, investors and researchers, especially in today's competitive economic environment. Several of the top 10 data mining methods have become very popular alternatives in business failure prediction (BFP), e.g., support vector machine and k nearest neighbor. In comparison with other classification mining methods, the advantages of classification and regression tree (CART) methods include simplicity of results, easy implementation, nonlinear estimation, being non-parametric, accuracy and stability. However, few studies in the area of {BFP} have examined the applicability of CART, another method among the top 10 algorithms in data mining. The aim of this research is to explore the performance of {BFP} using the commonly discussed data mining technique of CART. To demonstrate the effectiveness of {BFP} using CART, business failure prediction tasks were performed on a data set collected from companies listed on the Shanghai Stock Exchange and Shenzhen Stock Exchange. The hold-out method, repeated thirty times, was employed for assessment, and the two commonly used methods among the top 10 data mining algorithms, i.e., support vector machine and k nearest neighbor, and the two baseline benchmark methods from statistics, i.e., multiple discriminant analysis (MDA) and logistic regression, were employed as comparative methods. For the comparative methods, the stepwise method of {MDA} was employed to select the optimal feature subset. Empirical results indicated that the optimal algorithm of {CART} outperforms all the comparative methods in terms of predictive performance and significance tests in short-term {BFP} of Chinese listed companies.
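
The evaluation protocol described above can be mimicked in a few lines of scikit-learn, comparing CART with SVM, k-NN and logistic regression (standing in for the statistical baselines) under a thirty-times repeated hold-out. The data below are synthetic placeholders, not the Chinese listed-company data set.

# Thirty repeated 70/30 hold-out splits; mean accuracy per classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=12, random_state=0)
models = {
    "CART": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "kNN": KNeighborsClassifier(),
    "Logit": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    accs = []
    for seed in range(30):                      # thirty hold-out repetitions
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=seed)
        accs.append(model.fit(Xtr, ytr).score(Xte, yte))
    print(name, round(float(np.mean(accs)), 3))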

Keywords: Business failure prediction (BFP)
[389] Gottfried Vossen. The process mining manifesto—an interview with wil van der aalst. Information Systems, 37(3):288 - 290, 2012. [ bib | DOI | http ]
The {IEEE} Task Force on Process Mining has recently published its Process Mining Manifesto (PMM) in an effort to promote the topic of process mining. As this topic touches a number of areas in computer science, the editors of Information Systems have decided to conduct an interview with the person in charge of the task force, Prof. Wil van der Aalst of Eindhoven University of Technology (TU/e) in the Netherlands.

[390] Nissim Matatov, Lior Rokach, and Oded Maimon. Privacy-preserving data mining: A feature set partitioning approach. Information Sciences, 180(14):2696 - 2720, 2010. Including Special Section on Hybrid Intelligent Algorithms and Applications. [ bib | DOI | http ]
In privacy-preserving data mining (PPDM), a widely used method for achieving data mining goals while preserving privacy is based on k-anonymity. This method, which protects subject-specific sensitive data by anonymizing it before it is released for data mining, demands that every tuple in the released table should be indistinguishable from no fewer than k subjects. The most common approach for achieving compliance with k-anonymity is to replace certain values with less specific but semantically consistent values. In this paper we propose a different approach for achieving k-anonymity by partitioning the original dataset into several projections such that each one of them adheres to k-anonymity. Moreover, any attempt to rejoin the projections, results in a table that still complies with k-anonymity. A classifier is trained on each projection and subsequently, an unlabelled instance is classified by combining the classifications of all classifiers. Guided by classification accuracy and k-anonymity constraints, the proposed data mining privacy by decomposition (DMPD) algorithm uses a genetic algorithm to search for optimal feature set partitioning. Ten separate datasets were evaluated with {DMPD} in order to compare its classification performance with other k-anonymity-based methods. The results suggest that {DMPD} performs better than existing k-anonymity-based algorithms and there is no necessity for applying domain dependent knowledge. Using multiobjective optimization methods, we also examine the tradeoff between the two conflicting objectives in PPDM: privacy and predictive performance.
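
The core constraint that every candidate projection must satisfy can be checked in a line of pandas, as the sketch below shows. The genetic search over feature-set partitions and the classifier ensemble of DMPD are not reproduced, and the toy table and k value are illustrative.

# A projection satisfies k-anonymity when every combination of
# quasi-identifier values occurs at least k times.
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    return df.groupby(list(quasi_identifiers)).size().min() >= k

data = pd.DataFrame({
    "age":     [34, 34, 51, 51, 51, 34],
    "zipcode": ["537**", "537**", "541**", "541**", "541**", "537**"],
    "disease": ["flu", "cold", "flu", "asthma", "flu", "flu"],
})
projection = data[["age", "zipcode"]]            # one candidate projection
print(is_k_anonymous(projection, ["age", "zipcode"], k=3))   # True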

Keywords: Data mining
[391] Mona Alkhattabi, Daniel Neagu, and Andrea Cullen. Assessing information quality of e-learning systems: a web mining approach. Computers in Human Behavior, 27(2):862 - 873, 2011. Web 2.0 in Travel and Tourism: Empowering and Changing the Role of Travelers. [ bib | DOI | http ]
E-learning systems provide a promising solution as an information exchanging channel. Improved technologies could mean faster and easier access to information but do not necessarily ensure the quality of this information; for this reason it is essential to develop valid and reliable methods of quality measurement and carry out careful information quality evaluations. This paper proposes an assessment model for information quality in e-learning systems based on the quality framework we proposed previously: the proposed framework consists of 14 quality dimensions grouped in three quality factors: intrinsic, contextual representation and accessibility. We use the relative importance as a parameter in a linear equation for the measurement scheme. Formerly, we implemented a goal-question-metrics approach to develop a set of quality metrics for the identified quality attributes within the proposed framework. In this paper, the proposed metrics were computed to produce a numerical rating indicating the overall information quality published in a particular e-learning system. The data collection and evaluation processes were automated using a web data extraction technique and results on a case study are discussed. This assessment model could be useful to e-learning systems designers, providers and users as it provides a comprehensive indication of the quality of information in such systems.

Keywords: E-learning
[392] Amine Trabelsi and Osmar R. Zaïane. Extraction and clustering of arguing expressions in contentious text. Data & Knowledge Engineering, nil(0):-, 2015. [ bib | DOI | http ]
This work proposes an unsupervised method intended to enhance the quality of opinion mining in contentious text. It presents a Joint Topic Viewpoint (JTV) probabilistic model to analyze the underlying divergent arguing expressions that may be present in a collection of contentious documents. The conceived {JTV} has the potential of automatically carrying out the tasks of extracting associated terms denoting an arguing expression, according to the hidden topics it discusses and the embedded viewpoint it voices. Furthermore, JTV's structure enables the unsupervised grouping of obtained arguing expressions according to their viewpoints, using a proposed constrained clustering algorithm which is an adapted version of constrained k-means clustering (COP-KMEANS). Experiments are conducted on three types of contentious documents (polls, online debates and editorials), through six different contentious data sets. Quantitative evaluations of the topic modeling output, as well as the constrained clustering results, show the effectiveness of the proposed method in fitting the data and generating distinctive patterns of arguing expressions. Moreover, it empirically demonstrates a better clustering of arguing expressions over state-of-the-art and baseline methods. The qualitative analysis highlights the coherence of clustered arguing expressions of the same viewpoint and the divergence of opposing ones.

Keywords: Contention analysis
[393] Roger S. Debreceny and Glen L. Gray. Data mining journal entries for fraud detection: An exploratory study. International Journal of Accounting Information Systems, 11(3):157 - 181, 2010. 2009 Research Symposium on Information Integrity & Information Systems Assurance. [ bib | DOI | http ]
Fraud detection has become a critical component of financial audits and audit standards have heightened emphasis on journal entries as part of fraud detection. This paper canvasses perspectives on applying data mining techniques to journal entries. In the past, the impediment to researching journal entry data mining has been getting access to journal entry data sets, which may explain why the published research in this area is a null set. For this project, we had access to journal entry data sets for 29 different organizations. Our initial exploratory tests of the data sets had interesting preliminary findings. (1) For all 29 entities, the distribution of first digits of journal dollar amounts differed from that expected by Benford's Law. (2) Regarding last digits, unlike first digits, which are expected to have a logarithmic distribution, the last digits would be expected to have a uniform distribution. Our test found that the distribution was not uniform for many of the entities. In fact, eight entities had one number whose frequency was three times more than expected. (3) We compared the number of accounts related to the top five most frequently occurring last-three-digit combinations. Four entities had a very high occurrence of the most frequent three-digit combinations involving only a small set of accounts, one entity had a low occurrence of the most frequent three-digit combination involving a large set of accounts, and 24 had low occurrences of the most frequent three-digit combinations involving a small set of accounts. In general, the first four entities would probably pose the highest risk of fraud because this could indicate that the fraudster is covering up or falsifying a particular class of transactions. In the future, we will apply more data mining techniques to discover other patterns and relationships in the data sets. We also want to seed the dataset with fraud indicators (e.g., pairs of accounts that would not be expected in a journal entry) and compare the sensitivity of the different data mining techniques in finding these seeded indicators.
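
The first-digit screen in finding (1) is easy to reproduce: compare observed first-digit frequencies of journal amounts with Benford's expected proportions log10(1 + 1/d) using a chi-square statistic. The Python sketch below uses made-up amounts purely for illustration.

# Benford first-digit test: large chi-square values flag amount
# distributions that deviate from the expected logarithmic pattern.
import math
from collections import Counter

def first_digit(amount):
    # First non-zero digit of the amount, or None for zero amounts.
    for ch in f"{abs(amount):.2f}":
        if ch.isdigit() and ch != "0":
            return int(ch)
    return None

def benford_chi_square(amounts):
    observed = Counter(d for d in (first_digit(a) for a in amounts) if d)
    n = sum(observed.values())
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        chi2 += (observed.get(d, 0) - expected) ** 2 / expected
    return chi2

amounts = [123.45, 130.00, 1999.99, 250.10, 87.20, 41.00, 102.35, 118.00, 96.50, 311.07]
print(round(benford_chi_square(amounts), 2))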

Keywords: Fraud
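
As a hedged illustration of the first-digit screening described in [393], the sketch below compares the observed first-digit frequencies of journal-entry amounts with the Benford's Law expectation P(d) = log10(1 + 1/d) using a chi-square test; the example amounts are assumptions, not the paper's data.

    import numpy as np
    from scipy.stats import chisquare

    def first_digits(amounts):
        # Leading non-zero digit of each absolute dollar amount.
        a = np.abs(np.asarray(amounts, dtype=float))
        a = a[a > 0]
        return (a / 10 ** np.floor(np.log10(a))).astype(int)

    def benford_first_digit_test(amounts):
        # Compare observed first-digit counts against Benford's Law expectations.
        d = first_digits(amounts)
        observed = np.array([(d == k).sum() for k in range(1, 10)])
        expected = np.log10(1 + 1 / np.arange(1, 10)) * observed.sum()
        return chisquare(f_obs=observed, f_exp=expected)

    # e.g. stat, p = benford_first_digit_test([1203.50, 87.10, 4500.00, 19.99, 310.25])
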
[394] Chun-Wei Lin, Tzung-Pei Hong, and Wen-Hsiang Lu. Linguistic data mining with fuzzy fp-trees. Expert Systems with Applications, 37(6):4560 - 4567, 2010. [ bib | DOI | http ]
Due to the increasing prevalence of very large databases, mining useful information and knowledge from transactions is evolving into an important research area. In the past, many algorithms were proposed for mining association rules, most of which were based on items with binary values. Transactions with quantitative values are, however, commonly seen in real-world applications. In this paper, the frequent fuzzy pattern tree (fuzzy FP-tree) is proposed for extracting frequent fuzzy itemsets from transactions with quantitative values. When extending the FP-tree to handle fuzzy data, the processing becomes much more complex than in the original, since the fuzzy intersection in each transaction has to be handled. A fuzzy FP-tree construction algorithm is thus designed, and the mining process based on the tree is presented. Experimental results with three different numbers of fuzzy regions also demonstrate the performance of the proposed approach.

Keywords: Fuzzy data mining
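
Only as an illustrative first step of the fuzzy mining pipeline in [394] (the fuzzy FP-tree construction itself is not reproduced), the sketch below converts a quantitative purchase count into triangular Low/Middle/High memberships; the region boundaries are assumptions.

    def triangular(x, a, b, c):
        # Triangular membership function peaking at b, zero outside (a, c).
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    def fuzzify_quantity(q):
        # Map a purchased quantity to fuzzy regions (assumed boundaries).
        return {
            "Low":    triangular(q, 0, 1, 6),
            "Middle": triangular(q, 1, 6, 11),
            "High":   triangular(q, 6, 11, 16),
        }

    # e.g. fuzzify_quantity(4) -> {'Low': 0.4, 'Middle': 0.6, 'High': 0.0}
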
[395] V. Subramaniyaswamy, V. Vijayakumar, R. Logesh, and V. Indragandhi. Intelligent travel recommendation system by mining attributes from community contributed photos. Procedia Computer Science, 50(0):447 - 455, 2015. Big Data, Cloud and Computing Challenges. [ bib | DOI | http ]
This paper proposes a system that helps users find tourist locations they might like to visit, based on user-contributed photos of those places available on photo-sharing websites. It describes the methods used to mine demographic information and provide travel recommendations to users. It also describes an AdaBoost algorithm for classifying the data and a Bayesian learning model for predicting a desired location for a user based on his/her preferences.

Keywords: Personalized Travel Recommendation
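
A hedged sketch of the two learning components named in [395], using scikit-learn's AdaBoostClassifier and GaussianNB on synthetic feature vectors; constructing the actual attributes from community-contributed photos is the paper's contribution and is not reproduced here.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))              # assumed photo-derived attribute vectors
    y = (X[:, 0] + X[:, 1] > 0).astype(int)    # assumed "liked location" labels

    ada = AdaBoostClassifier(n_estimators=50).fit(X, y)   # classification step
    nb = GaussianNB().fit(X, y)                           # Bayesian preference model

    print(ada.predict(X[:3]), nb.predict_proba(X[:3])[:, 1])
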
[396] Lidong Bing, Shan Jiang, Wai Lam, Yan Zhang, and Shoaib Jameel. Adaptive concept resolution for document representation and its applications in text mining. Knowledge-Based Systems, 74(0):1 - 13, 2015. [ bib | DOI | http ]
It is well known that synonymous and polysemous terms often introduce noise when we calculate the similarity between documents. Existing ontology-based document representation methods are static, so the selected semantic concepts for representing a document have a fixed resolution. Therefore, they are not adaptable to the characteristics of the document collection and the text mining problem at hand. We propose an Adaptive Concept Resolution (ACR) model to overcome this problem. {ACR} can learn a concept border from an ontology, taking into consideration the characteristics of the particular document collection. This border then provides a tailor-made semantic concept representation for a document coming from the same domain. Another advantage of {ACR} is that it is applicable to both the classification task, where the groups are given in the training document set, and the clustering task, where no group information is available. The experimental results show that {ACR} outperforms an existing static method in almost all cases. We also present a method to integrate Wikipedia entities into an expert-edited ontology, namely WordNet, to generate an enhanced ontology named WordNet-Plus, and its performance is also examined under the {ACR} model. Owing to its higher coverage, WordNet-Plus outperforms WordNet in classification on data sets containing more recent documents.

Keywords: Adaptive Concept Resolution
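
The concept border learned by {ACR} in [396] is not reproducible from the abstract, but the sketch below shows the kind of ontology lookup it builds on: mapping a document term to a WordNet ancestor at a chosen depth via NLTK. The fixed depth parameter is an assumption standing in for the learned border.

    from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

    def ancestor_concept(term, depth=2):
        # Walk `depth` steps up the hypernym chain of the term's first noun sense.
        synsets = wn.synsets(term, pos=wn.NOUN)
        if not synsets:
            return None
        node = synsets[0]
        for _ in range(depth):
            hypernyms = node.hypernyms()
            if not hypernyms:
                break
            node = hypernyms[0]
        return node.name()

    # e.g. ancestor_concept("laptop") might return a more general computing concept
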
[397] Ilkyu Kim. Is korean -(n)un a topic marker? on the nature of -(n)un and its relation to information structure. Lingua, 154(0):87 - 109, 2015. [ bib | DOI | http ]
Basic categories of information structure (e.g. topic, focus, contrast) are known to be expressed crosslinguistically by various linguistic devices such as special intonation contours, syntactic mechanisms, and morphological markers. However, the nature of the relation between the categories and their linguistic “markers” has rarely been discussed. To be more specific, despite the rich literature on information structure, whether the categories are directly or indirectly related to their markers has not been of much interest to linguists until quite recently. The main purpose of this paper is to unveil the nature of the relation between Korean -(n)un and the information-structural notions related to it, namely topic and contrast. Based on a corpus study, it is claimed that -(n)un is not a topic/contrast marker per se; rather, its function is to impose salience on a discourse referent. Topicality and contrast, widely assumed to be directly marked by -(n)un, are claimed to be derived only from the interaction of the proposed meaning of -(n)un with various syntactic, semantic, and pragmatic factors. Consequently, this paper provides strong support for recent attempts to show that the information-structural categories are merely pragmatic effects rather than stable and discrete universal primitives.

Keywords: Information structure
[398] Petr Hájek, Martin Holeňa, and Jan Rauch. The {GUHA} method and its meaning for data mining. Journal of Computer and System Sciences, 76(1):34 - 48, 2010. Special Issue on Intelligent Data Analysis. [ bib | DOI | http ]
The paper presents the history and present state of the {GUHA} method, its theoretical foundations and its relation and meaning for data mining.

Keywords: {GUHA} method
[399] Mieke Jans, Nadine Lybaert, and Koen Vanhoof. Internal fraud risk reduction: Results of a data mining case study. International Journal of Accounting Information Systems, 11(1):17 - 41, 2010. [ bib | DOI | http ]
Corporate fraud represents a huge cost to the current economy. The academic literature has demonstrated how data mining techniques can be of value in the fight against fraud. This research has focused on fraud detection, mostly in a context of external fraud. In this paper, we discuss the use of a data mining approach to reduce the risk of internal fraud. Reducing fraud risk involves both detection and prevention. Accordingly, a descriptive data mining strategy is applied, as opposed to the predictive data mining techniques widely used in the literature. The results of applying a multivariate latent class clustering algorithm to a case company's procurement data suggest that using this technique in a descriptive data mining approach is useful in assessing the current risk of internal fraud. The same results could not be obtained by applying a univariate analysis.

Keywords: Internal fraud
[400] Nan Li and Desheng Dash Wu. Using text mining and sentiment analysis for online forums hotspot detection and forecast. Decision Support Systems, 48(2):354 - 368, 2010. [ bib | DOI | http ]
Text sentiment analysis, also referred to as emotional polarity computation, has become a flourishing frontier in the text mining community. This paper studies online forum hotspot detection and forecasting using sentiment analysis and text mining approaches. First, we create an algorithm to automatically analyze the emotional polarity of a text and to obtain a value for each piece of text. Second, this algorithm is combined with K-means clustering and a support vector machine (SVM) to develop an unsupervised text mining approach. We use the proposed text mining approach to group the forums into various clusters, with the center of each representing a hotspot forum within the current time span. The data sets used in our empirical studies are acquired and formatted from Sina sports forums, which span 31 different topic forums and 220,053 posts. Experimental results demonstrate that {SVM} forecasting achieves highly consistent results with K-means clustering. The top 10 hotspot forums listed by {SVM} forecasting match 80% of the K-means clustering results. Both {SVM} and K-means achieve the same results for the top 4 hotspot forums of the year.

Keywords: Text mining
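
A minimal sketch, under assumed data, of the unsupervised pipeline outlined in [400]: vectorize forum posts (the paper's sentiment scoring is replaced here by plain TF-IDF), cluster them with K-means, and fit an {SVM} on the cluster labels so that new posts can be assigned to a hotspot cluster.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    posts = ["great match tonight", "injury update for the striker",
             "transfer rumours again", "the referee decision was wrong"]  # assumed posts

    X = TfidfVectorizer().fit_transform(posts)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    svm = SVC(kernel="linear").fit(X, labels)   # forecaster trained on cluster labels

    print(labels, svm.predict(X))
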
[401] Y. Baudoin, M.K. Habib, and I. Doroftei. 1 - introduction: Mobile robotics systems for humanitarian de-mining and risky interventions. In Y. Baudoin and Maki K. Habib, editors, Using Robots in Hazardous Environments, pages 3 - 31. Woodhead Publishing, 2011. [ bib | DOI | http ]
Dirty, dangerous and dull tasks, all of which are found in landmine detection, can be greatly aided by tele-operation. It is very desirable to remove the operator from the vicinity of the landmine and from the repetitive, boring operations that lead to loss of attention and potential injury. Tele-operated platforms naturally support multiple sensors and data fusion, which are necessary for reliable detection. Tele-operation of handheld sensors or multisensor-heads can enhance the detection process by allowing more precise scanning, which is useful for optimization of the signal processing algorithms. This chapter summarizes the technologies and experiences presented during seven {IARP} workshops {HUDEM} and three {IARP} workshops RISE, based on general considerations and illustrated by some contributions of our own laboratory, located at the Royal Military Academy of Brussels, focusing on the detection of unexploded devices and the implementation of mobile robotics systems on minefields.

Keywords: mobile robotics
[402] Arnaud Quirin, Oscar Cordón, Benjamín Vargas-Quesada, and Félix de Moya-Anegón. Graph-based data mining: A new tool for the analysis and comparison of scientific domains represented as scientograms. Journal of Informetrics, 4(3):291 - 312, 2010. [ bib | DOI | http ]
The creation of representations depicting the current state of science (scientograms) has been an established practice for many years. However, if we are concerned with the automatic comparison, analysis, and understanding of a set of scientograms, showing for instance the evolution of a scientific domain or a face-to-face comparison of several countries, the task becomes enormously complex as the amount of data to analyze grows. In this paper, we aim to show that graph-based data mining tools are useful for scientogram analysis. Subdue, the first algorithm proposed in the graph mining area, has been chosen for this purpose. This algorithm has been customized to deal with three different scientogram analysis tasks regarding the evolution of a scientific domain over time, the extraction of common research-category substructures across the world, and the comparison of scientific domains between different countries. The outcomes obtained in the developed experiments clearly demonstrate the potential of graph mining tools in scientogram analysis.

Keywords: Domain analysis
[403] Yuchul Jung, Jihee Ryu, Kyung min Kim, and Sung-Hyon Myaeng. Automatic construction of a large-scale situation ontology by mining how-to instructions from the web. Web Semantics: Science, Services and Agents on the World Wide Web, 8(2–3):110 - 124, 2010. Bridging the Gap—Data Mining and Social Network Analysis for Integrating Semantic Web and Web 2.0 The Future of Knowledge Dissemination: The Elsevier Grand Challenge for the Life Sciences. [ bib | DOI | http ]
With the growing interest in semantic web services and context-aware computing, the importance of ontologies, which enable us to perform context-aware reasoning, has been widely accepted. While domain-specific and general-purpose ontologies have been developed, few attempts have been made at a situation ontology that can be employed directly to support activity-oriented context-aware services. In this paper, we propose an approach to automatically constructing a large-scale situation ontology by mining large-scale web resources, eHow and wikiHow, which contain an enormous amount of how-to instructions (e.g., “How to install a car amplifier”). The construction process is guided by a situation model derived from the procedural knowledge available in the web resources. Two major steps are involved: (1) action mining, which extracts pairs of a verb and its ingredient (i.e., objects, location, and time) from individual instructional steps (e.g., <disconnect, ground cable>) and forms goal-oriented situation cases from the results, and (2) normalization and integration of the situation cases to form the situation ontology. For validation, we measure the accuracy of the action mining method and show how our situation ontology compares in terms of coverage with existing large-scale ontology-like resources constructed manually. Furthermore, we show how it can be utilized for two applications: service recommendation and service composition.

Keywords: Automatic ontology construction
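
A hedged sketch of the action-mining step described in [403], extracting <verb, direct object> pairs from an instructional step with spaCy; the model name and example sentence are assumptions, and the paper's full ingredient extraction (location, time) is not reproduced.

    import spacy

    nlp = spacy.load("en_core_web_sm")   # assumed small English model

    def action_pairs(step_text):
        # Collect (verb lemma, direct-object text) pairs from one how-to step.
        doc = nlp(step_text)
        return [(tok.head.lemma_, tok.text)
                for tok in doc if tok.dep_ == "dobj" and tok.head.pos_ == "VERB"]

    # e.g. action_pairs("Disconnect the ground cable before installing the amplifier.")
    #      would typically yield [('disconnect', 'cable'), ('install', 'amplifier')]
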
[404] Shyi-Ming Chen and Shih-Ming Bai. Using data mining techniques to automatically construct concept maps for adaptive learning systems. Expert Systems with Applications, 37(6):4496 - 4503, 2010. [ bib | DOI | http ]
Constructing concept maps to provide learning guidance to learners is clearly an important research topic in adaptive learning systems. Because the existing method for constructing concept maps only considers association rules derived from incorrectly answered questions, it misses information about questions that learners answer correctly. Moreover, the existing method has the drawback that it builds unnecessary relationships, or loses some relationships, between concepts in the constructed concept maps. In this paper, we present a new method to automatically construct concept maps based on data mining techniques for adaptive learning systems. The proposed method overcomes the drawbacks of the existing method and provides a useful way to automatically construct concept maps in adaptive learning systems.

Keywords: Adaptive learning systems
[405] Ugo Erra, Sabrina Senatore, Fernando Minnella, and Giuseppe Caggianese. Approximate tf–idf based on topic extraction from massive message stream using the {GPU}. Information Sciences, 292(0):143 - 161, 2015. [ bib | DOI | http ]
The Web is a constantly expanding global information space that includes disparate types of data and resources. Recent trends demonstrate the urgent need to manage large amounts of streaming data, especially in specific domains of application such as critical infrastructure systems, sensor networks, log file analysis, search engines and, more recently, social networks. All of these applications involve large-scale data-intensive tasks, often subject to time constraints and space complexity. Algorithms, data management and data retrieval techniques must be able to process data streams, i.e., process data as it becomes available and provide an accurate response based solely on the data stream that has already been provided. Data retrieval techniques often require a traditional data storage and processing approach, i.e., all data must be available in the storage space in order to be processed. For instance, a widely used relevance measure is Term Frequency–Inverse Document Frequency (TF–IDF), which can evaluate how important a word is in a collection of documents but requires a priori knowledge of the whole dataset. To address this problem, we propose an approximate version of the TF–IDF measure suitable for continuous data streams (such as the exchange of messages, tweets and sensor-based log files). The algorithm for the calculation of this measure makes two assumptions: a fast response is required, and memory is both limited and infinitely smaller than the size of the data stream. In addition, to provide the great computational power required to process massive data streams, we also present a parallel implementation of the approximate TF–IDF calculation using Graphical Processing Units (GPUs). This implementation of the algorithm was tested on generated and real data streams and was able to capture the most frequent terms. Our results demonstrate that the approximate version of the TF–IDF measure performs at a level that is comparable to the precise TF–IDF measure.

Keywords: Twitter
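
The GPU implementation in [405] is not reproduced here; as a rough CPU-side sketch under stated assumptions, the code below keeps a bounded document-frequency table (evicting rare terms when the table is full, a crude stand-in for the paper's approximation) and scores each incoming message with the current IDF estimate.

    import math
    from collections import Counter

    class StreamingTfidf:
        # Approximate TF-IDF over a message stream with a bounded vocabulary.
        def __init__(self, max_terms=10000):
            self.df = Counter()
            self.n_docs = 0
            self.max_terms = max_terms

        def update_and_score(self, message):
            tokens = message.lower().split()
            self.n_docs += 1
            for term in set(tokens):
                self.df[term] += 1
            if len(self.df) > self.max_terms:
                # Crude eviction of the rarest terms to cap memory use.
                for term, _ in self.df.most_common()[self.max_terms:]:
                    del self.df[term]
            tf = Counter(tokens)
            return {t: (c / len(tokens)) *
                       math.log((1 + self.n_docs) / (1 + self.df[t]))
                    for t, c in tf.items()}

    # tfidf = StreamingTfidf(max_terms=1000)
    # scores = tfidf.update_and_score("new tweet about stream processing")
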
[416] Kristin Rieping, Gwenn Englebienne, and Ben Kröse. Behavior analysis of elderly using topic models. Pervasive and Mobile Computing, 15(0):181 - 199, 2014. Special Issue on Information Management in Mobile Applications Special Issue on Data Mining in Pervasive Environments. [ bib | DOI | http ]
This paper describes two new topic models for the analysis of human behavior in homes that are equipped with sensor networks. The models are based on Latent Dirichlet Allocation (LDA) topic models and can detect patterns in sensor data in an unsupervised manner. LDA–Gaussian, the first variation of the model, is a combination of a Gaussian Mixture Model and the {LDA} model: the multinomial distribution that is normally used in the {LDA} model is replaced by a set of Gaussian distributions. LDA–Poisson, the second variation of the model, uses a set of Poisson distributions to model the observations. The Poisson distribution is better suited to handling counts of stochastic events but less well suited to modeling time; for time we use the von Mises distribution, resulting in ‘LDA–Poisson–von-Mises’. The parameters of the models are determined with an EM algorithm. The models are evaluated on more than 450 days of real-world sensor data, gathered in the homes of five elderly people, and are compared with a baseline approach where standard k-means clustering is used to quantize the data. We show that the new models find more meaningful topics than the baseline and that a semantic description of these topics can be given. We also evaluated the models quantitatively, using perplexity as a measure of model fit. Both LDA–Gaussian and LDA–Poisson result in much better models than the baseline, and our experiments show that, of the proposed models, the LDA–Poisson–von-Mises model performs best.

Keywords: Activity discovery
[417] Tung-Cheng Hsieh and Tzone-I Wang. A mining-based approach on discovering courses pattern for constructing suitable learning path. Expert Systems with Applications, 37(6):4156 - 4167, 2010. [ bib | DOI | http ]
In recent years, the browser has become one of the most popular tools for searching for information on the Internet. Although a person can conveniently find and download specific learning materials to gain fragmented knowledge, most of the materials are imperfect and have no particular order in their content. Therefore, most self-directed learners spend much of their time surveying and choosing the right learning materials collected from the Internet. This paper develops a web-based learning support system that harnesses two approaches: a learning path constructing approach and a learning object recommending approach. With collected documents and a learning subject from a learner, the system first discovers some candidate courses by using a data mining approach based on the Apriori algorithm. Next, the learning path constructing approach, based on Formal Concept Analysis (FCA), builds a concept lattice, using keywords extracted from selected documents, to form a relationship hierarchy of all the concepts represented by the keywords. It then uses {FCA} to further compute mutual relationships among documents to decide a suitable learning path. For a chosen learning path, the support system uses both preference-based and correlation-based algorithms to recommend the most suitable learning objects or documents for each unit of the courses, in order to facilitate more efficient learning for the learner. This e-learning support system can be embedded in any information retrieval system so that users can learn more efficiently on the Internet.

Keywords: Self-directed learner
[418] Tae goun Kim. Efficient management of marine resources in conflict: An empirical study of marine sand mining, korea. Journal of Environmental Management, 91(1):78 - 86, 2009. [ bib | DOI | http ]
This article develops a dynamic model of the efficient use of exhaustible marine sand resources in the context of marine mining externalities. The classical Hotelling extraction model is applied to sand mining in Ongjin, Korea, and extended to include the estimated marginal external costs that mining imposes on marine fisheries. The socially efficient sand extraction plan is compared with the extraction paths suggested by scientific research. If marginal environmental costs are correctly estimated, the efficient extraction plan developed here, which accounts for the resource rent, may increase social welfare and reduce conflicts among marine sand resource users. The empirical results are interpreted with an emphasis on guidelines for coastal resource management policy.

Keywords: Hotelling extraction rule
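
For orientation, the classical Hotelling rule underlying [418] states that, along the efficient extraction path, the scarcity rent of the exhaustible resource grows at the discount rate r; a standard textbook statement (not the paper's exact formulation, whose extension adds the estimated marginal external cost on fisheries to the marginal cost term) is:

    \frac{\dot{\lambda}(t)}{\lambda(t)} = r, \qquad \lambda(t) = p(t) - MC(t),

where p(t) is the resource price and MC(t) the marginal extraction cost, so that the rent follows \lambda(t) = \lambda(0)\, e^{rt}.
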
[419] Julia Seiter, Oliver Amft, Mirco Rossi, and Gerhard Tröster. Discovery of activity composites using topic models: An analysis of unsupervised methods. Pervasive and Mobile Computing, 15(0):215 - 227, 2014. Special Issue on Information Management in Mobile Applications Special Issue on Data Mining in Pervasive Environments. [ bib | DOI | http ]
In this work we investigate unsupervised activity discovery approaches using three topic model (TM) approaches, based on Latent Dirichlet Allocation (LDA), the n-gram TM (NTM), and the correlated TM (CTM). While {LDA} structures activity primitives, {NTM} adds primitive sequence information, and {CTM} exploits co-occurring topics. We use an activity composite/primitive abstraction and analyze three public datasets with different properties that affect the discovery, including primitive rate, activity composite specificity, primitive sequence similarity, and composite-instance ratio. We compare the activity composite discovery performance among the {TM} approaches and against a baseline using k-means clustering. We provide guidelines for method and optimal {TM} parameter selection, depending on data properties and activity primitive noise. Results indicate that {TMs} can outperform k-means clustering by up to 17% when composite specificity is low. LDA-based {TMs} showed higher robustness against noise compared to other {TMs} and k-means.

Keywords: Activity routines
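
A minimal sketch of the baseline {LDA} setup that the comparisons in [419] start from, treating each day's activity primitives as a document and running gensim's LdaModel; the primitive vocabulary and corpus here are assumptions, and the paper's NTM/CTM variants are not reproduced.

    from gensim import corpora, models

    # Assumed activity-primitive "documents": one list of primitives per day.
    days = [["kitchen", "kettle", "fridge", "kitchen"],
            ["bathroom", "bed", "bathroom"],
            ["kitchen", "fridge", "kettle"]]

    dictionary = corpora.Dictionary(days)
    corpus = [dictionary.doc2bow(day) for day in days]
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                          random_state=0, passes=10)

    for topic_id, terms in lda.show_topics(num_topics=2, num_words=3, formatted=False):
        print(topic_id, terms)   # each topic groups co-occurring primitives
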
[458] Deng-Yiv Chiu and Ya-Chen Pan. Topic knowledge map and knowledge structure constructions with genetic algorithm, information retrieval, and multi-dimension scaling method. Knowledge-Based Systems, 67(0):412 - 428, 2014. [ bib | DOI | http ]
This work presents a novel automated approach to construct topic knowledge maps with knowledge structures, followed by its application to an internationally renowned journal. Knowledge structures are diagrams showing the important components of the knowledge under study. Knowledge maps identify the locations of objects and illustrate the relationships among them. In our study, the important components derived from knowledge structures are used as objects to be placed in a topic knowledge map. The purpose of our knowledge structures is to find the major topics serving as subjects of article collections, as well as the related methods employed in the published papers. The purpose of topic knowledge maps is to transform high-dimensional objects (topic, paper, and cited frequency) into a 2-dimensional space to help understand complicated relatedness among high-dimensional objects, such as the degree of relatedness between an article and a topic. First, we adopt the chi-square test of independence to examine the independence of topics and apply a genetic algorithm to choose the topic selection with the best fitness value to construct knowledge structures. Additionally, high-dimensional relationships among objects are transformed into a 2-dimensional space using the multi-dimensional scaling method. The optimal transformation coordinate matrix is also determined by using a genetic algorithm to preserve the original relations among objects and construct appropriate topic knowledge maps.

Keywords: Knowledge structure
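
A hedged sketch of the dimensionality-reduction step used in [458]: projecting a precomputed dissimilarity matrix among topics/papers onto two dimensions with multi-dimensional scaling from scikit-learn. The dissimilarity matrix is an assumption, and the paper's genetic-algorithm refinement of the coordinates is not reproduced.

    import numpy as np
    from sklearn.manifold import MDS

    # Assumed pairwise dissimilarities among four objects (topics/papers).
    D = np.array([[0.0, 0.4, 0.9, 0.8],
                  [0.4, 0.0, 0.7, 0.9],
                  [0.9, 0.7, 0.0, 0.3],
                  [0.8, 0.9, 0.3, 0.0]])

    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)       # 2-D coordinates for the knowledge map
    print(coords)
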
[478] Shu-Hsien Liao, Wen-Jung Chang, and Chai-Chen Lee. Mining marketing maps for business alliances. Expert Systems with Applications, 35(3):1338 - 1350, 2008. [ bib | DOI | http ]
A business can strengthen its competitive advantage and increase its market share by forming a strategic alliance. With the help of alliances, businesses can bring to bear significant resources beyond the capabilities of the individual co-operating firms. Thus, how to effectively evaluate and select alliance partners is an important task for businesses, because successful partner selection can reduce the possible risk and help avoid the failure of a business alliance. This paper proposes the Apriori algorithm as an association-rule methodology for data mining, implemented for mining marketing map knowledge from customers. Knowledge extracted from marketing maps is presented as knowledge patterns and rules in order to propose suggestions for business alliances and possible co-operation solutions. Finally, this study suggests that integrating different research factors, variables, theories, and methods for investigating this research topic of business alliances could improve research results and scope.

Keywords: Business alliance
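
A minimal sketch of the Apriori association-rule step that [478] (and the learning-path work in [417]) builds on, using the mlxtend implementation on an assumed toy transaction set; mining actual marketing-map knowledge is, of course, the papers' contribution.

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    # Assumed customer transactions (items purchased / attributes observed).
    transactions = [["bank", "insurance"], ["bank", "telecom"],
                    ["bank", "insurance", "telecom"], ["insurance", "telecom"]]

    te = TransactionEncoder()
    df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

    frequent = apriori(df, min_support=0.5, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
    print(rules[["antecedents", "consequents", "support", "confidence"]])
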

This file was generated by bibtex2html 1.96 and compiled by Subasish Das.



Find me on Twitter or LinkedIn.