Lookahead selective sampling for incomplete data

L. Abdallah; I. Shimshoni

Authors

L. Abdallah ¹, I. Shimshoni ²

Affiliations

¹ Management Information Systems, The Max Stern Yezreel Valley College, Emek Yezraeel, 1930600, Israel; Department of Mathematics and Computer Science, College of Sakhnin for Teacher Education, Sakhnin, 20173, Israel
² Department of Information Systems, University of Haifa, Haifa, 199, Israel

Abstract

Missing values are common in real-world data, and several methods exist for dealing with them. In this paper we present lookahead selective sampling (LSS) algorithms for datasets with missing values. We developed two versions of selective sampling. The first integrates a distance function that measures the similarity between pairs of incomplete points into the LSS framework. The second uses ensemble clustering to represent the data as a cluster matrix without missing values and then runs the LSS algorithm on the resulting ensemble clustering instance space (LSS-EC). To construct the cluster matrix, we use the k-means and mean shift clustering algorithms, each specially modified to handle incomplete datasets. We tested our algorithms on six standard numerical datasets from different fields. On these datasets we simulated missing values and compared the performance of the LSS and LSS-EC algorithms for incomplete data with that of two other basic methods. Our experiments show that the suggested selective sampling algorithms outperform the other methods.
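
The abstract names two building blocks without defining them: a distance between incomplete points and a cluster matrix built from several clusterings. The sketch below is only an illustration of what such components could look like, not the authors' definitions. It assumes that a missing coordinate behaves like a random draw with the mean and variance of the observed values in its column, and that the cluster matrix stores one cluster label per clustering run for each point. The names `column_stats`, `expected_sq_distance`, and `cluster_matrix_distance` are hypothetical.

```python
from statistics import fmean, pvariance
from typing import List, Optional, Sequence, Tuple

# A point is a sequence of floats; None marks a missing value (assumption).
Point = Sequence[Optional[float]]


def column_stats(data: Sequence[Point]) -> List[Tuple[float, float]]:
    """Mean and population variance of the observed values in each column.

    Assumes every column has at least one observed value.
    """
    stats = []
    for column in zip(*data):
        observed = [v for v in column if v is not None]
        stats.append((fmean(observed), pvariance(observed)))
    return stats


def expected_sq_distance(x: Point, y: Point,
                         stats: Sequence[Tuple[float, float]]) -> float:
    """Expected squared Euclidean distance between two incomplete points.

    Assumption: a missing coordinate is an independent random variable
    with the column's mean and variance, so that
      E[(a - Y)^2]   = (a - mean)^2 + var   (one value missing)
      E[(Y1 - Y2)^2] = 2 * var              (both values missing)
    """
    total = 0.0
    for xi, yi, (mean, var) in zip(x, y, stats):
        if xi is not None and yi is not None:
            total += (xi - yi) ** 2
        elif xi is None and yi is None:
            total += 2.0 * var
        else:
            known = xi if xi is not None else yi
            total += (known - mean) ** 2 + var
    return total


def cluster_matrix_distance(labels_a: Sequence[int],
                            labels_b: Sequence[int]) -> float:
    """Fraction of clustering runs that separate the two points.

    Each argument is one row of a hypothetical cluster matrix: the
    cluster label assigned to a point by each clustering run.
    """
    disagreements = sum(a != b for a, b in zip(labels_a, labels_b))
    return disagreements / len(labels_a)


data = [[1.0, 2.0], [3.0, None], [None, 6.0], [5.0, 4.0]]
stats = column_stats(data)
print(expected_sq_distance(data[0], data[3], stats))  # both points complete
print(expected_sq_distance(data[1], data[2], stats))  # each misses one value
print(cluster_matrix_distance([0, 1, 2, 2], [0, 1, 0, 2]))
```

Either function could then serve as the distance inside a nearest-neighbour-based sampler. The second one never sees the raw attributes, which is why a cluster matrix representation sidesteps the missing values once the clusterings themselves have been computed.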

Keywords

selective sampling, missing values, ensemble clustering

Additional information

Prepared with funds from the Polish Ministry of Science and Higher Education (MNiSW) under agreement 812/P-DUN/2016 for science dissemination activities.
