Preferencje help
Widoczny [Schowaj] Abstrakt
Liczba wyników
2014 | 24 | 1 | 133-149
Tytuł artykułu

An algorithm for reducing the dimension and size of a sample for data exploration procedures

Treść / Zawartość
Warianty tytułu
Języki publikacji
The paper deals with the issue of reducing the dimension and size of a data set (random sample) for exploratory data analysis procedures. The concept of the algorithm investigated here is based on linear transformation to a space of a smaller dimension, while retaining as much as possible the same distances between particular elements. Elements of the transformation matrix are computed using the metaheuristics of parallel fast simulated annealing. Moreover, elimination of or a decrease in importance is performed on those data set elements which have undergone a significant change in location in relation to the others. The presented method can have universal application in a wide range of data exploration problems, offering flexible customization, possibility of use in a dynamic data environment, and comparable or better performance with regards to the principal component analysis. Its positive features were verified in detail for the domain's fundamental tasks of clustering, classification and detection of atypical elements (outliers).
Opis fizyczny
  • Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447 Warsaw, Poland
  • Department of Automatic Control and Information Technology, Cracow University of Technology, ul. Warszawska 24, 31-155 Cracow, Poland
  • Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447 Warsaw, Poland
  • Department of Automatic Control and Information Technology, Cracow University of Technology, ul. Warszawska 24, 31-155 Cracow, Poland
  • Aarts, E., Korst, J. and van Laarhoven, P. (1997). Simulated annealing, in E. Aarts and J. Lenstra (Eds.), Local Search in Combinatorial Optimization, Wiley, Chichester, pp. 91-120.
  • Alba, E. (2005). Parallel Metaheuristics: A New Class of Algorithms, Wiley, New York, NY.
  • Aswani Kumar, C. and Srinivas, S. (2006). Latent semantic indexing using eigenvalue analysis for efficient information retrieval, International Journal of Applied Mathematics and Computer Science 16(4): 551-558.
  • Aswani Kumar, C. (2009). Analysis of unsupervised dimensionality techniques, Computer Science and Information Systems 6(2): 217-227.
  • Azencot, R. (1992). Simulated Annealing: Parallelization Techniques, Wiley, New York, NY.
  • Bartenhagen, C., Klein, H.-U., Ruckert, C., Jiang, X. and Dugas, M. (2010). Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data, BMC Bioinformatics 11, paper no. 567.
  • Bartkuté, V. and Sakalauskas, L. (2009). Statistical inferences for termination of Markov type random search algorithms, Journal of Optimization Theory and Applications 141(3): 475-493.
  • Ben-Ameur, W. (2004). Computing the initial temperature of simulated annealing, Computational Optimization and Applications 29(3): 367-383.
  • Borg, I. and Groenen, P. (2005). Modern Multidimensional Scaling. Theory and Applications, Springer-Verlag, Berlin.
  • Camastra, F. (2003). Data dimensionality estimation methods: A survey, Pattern Recognition 36(12): 2945-2954.
  • Charytanowicz, M., Niewczas, J., Kulczycki, P., Kowalski, P., Łukasik, S. and Żak, S. (2010). Complete gradient clustering algorithm for features analysis of x-ray images, in E. Piątka and J. Kawa (Eds.), Information Technologies in Biomedicine, Vol. 2, Springer-Verlag, Berlin, pp. 15-24.
  • Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties, Decision Support Systems 47(4): 547-553.
  • Cox, T. and Cox, M. (2000). Multidimensional Scaling, Chapman and Hall, Boca Raton, FL.
  • Cunningham, P. (2007). Dimension reduction, Technical report, UCD School of Computer Science and Informatics, Dublin.
  • Czarnowski, I. and Jędrzejowicz, P. (2011). Application of agent-based simulated annealing and tabu search procedures to solving the data reduction problem, International Journal of Applied Mathematics and Computer Science 21(1): 57-68, DOI: 10.2478/v10006-011-0004-3.
  • David, H. and Nagaraja, H. (2003). Order Statistics, Wiley, New York, NY.
  • Deng, Z., Chung, F.-L. and Wang, S. (2008). FRSDE: Fast reduced set density estimator using minimal enclosing ball approximation, Pattern Recognition 41(4): 1363-1372.
  • François, D., Wertz, V. and Verleysen, M. (2007). The concentration of fractional distances, IEEE Transactions on Knowledge and Data Engineering 19(7): 873-886.
  • Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distribution and the Bayesian restoration in images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 721-741.
  • Gendreau, M. and Potvin, J.-Y. (2010). Handbook of Metaheuristics, Springer, New York, NY.
  • Han, J. and Kamber, M. (2006). Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA.
  • Ingber, L. (1996). Adaptive simulated annealing (ASA): Lessons learned, Control and Cybernetics 25(1): 33-54.
  • Inza, I., Larranaga, P., Etxeberria, R. and Sierra, B. (2000). Feature subset selection by Bayesian network-based optimization, Artificial Intelligence 123(1-2): 157-184.
  • Ishibuchi, H., Nakashima, T. and Murata, T. (2001). Three-objective genetics-based machine learning for linguistic rule extraction, Information Sciences 136(1-4): 109-133.
  • Kerdprasop, K., Kerdprasop, N. and Sattayatham, P. (2005). Weighted k-means for density-biased clustering, in A. Tjoa and J. Trujillo (Eds.), Data Warehousing and Knowledge Discovery, Lecture Notes in Computer Science, Vol. 3589, Springer-Verlag, Berlin pp. 488-497.
  • Kulczycki, P. (2005). Kernel Estimators in System Analysis, WNT, Warsaw, (in Polish).
  • Kulczycki, P. (2008). Kernel estimators in industrial applications, in B. Prasad (Ed.), Soft Computing Applications in Industry, Springer-Verlag, Berlin, pp. 69-91.
  • Kulczycki, P. and Charytanowicz, M. (2010). A complete gradient clustering algorithm formed with kernel estimators, International Journal of Applied Mathematics and Computer Science 20(1): 123-134, DOI: 10.2478/v10006-010-0009-3.
  • Kulczycki, P. and Kowalski, P. (2011). Bayes classification of imprecise information of interval type, Control and Cybernetics 40(1): 101-123.
  • Kulczycki, P. and Łukasik, S. (2014). Reduction of dimension and size of data set by parallel fast simulated annealing, in L.T. Koczy, C.R. Pozna, R. Claudiu and J. Kacprzyk (Eds.), Issues and Challenges of Intelligent Systems and Computational Intelligence, Springer-Verlag, Berlin, pp. 273-292.
  • Kuo, Y. (2010). Using simulated annealing to minimize fuel consumption for the time-dependent vehicle routing problem, Computers & Industrial Engineering 59(1): 157-165.
  • Łukasik, S. and Kulczycki, P. (2011). An algorithm for sample and data dimensionality reduction using fast simulated annealing, in J. Tang, I. King, L. Chen and J. Wang (Eds.), Advanced Data Mining and Applications, Lecture Notes in Computer Science, Vol. 7120, Springer-Verlag, Berlin, pp. 152-161.
  • Łukasik, S. and Kulczycki, P. (2013). Using topology preservation measures for multidimensional intelligent data analysis in the reduced feature space, in L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. Zadeh and J. Zurada (Eds.), Artificial Intelligence and Soft Computing, Lecture Notes in Computer Science, Vol. 7895, Springer-Verlag, Berlin, pp. 184-193.
  • Maaten, van der, L. (2009). Feature Extraction from Visual Data, Ph.D. thesis, Tilburg University, Tilburg.
  • Mangasarian, O. and Wolberg, W. (1990). Cancer diagnosis via linear programming, SIAM News 23(5): 1-18.
  • Mitra, P., Murthy, C. and Pal, S. (2002). Density-based multiscale data condensation, IEEE Transactions on Pattern Analysis and Machine Intelligence 24(6): 734-747.
  • Nam, D., Lee, J.-S. and Park, C. (2004). n-dimensional Cauchy neighbor generation for the fast simulated annealing, IEICE Transactions on Information and Systems E87D(11): 2499-2502.
  • Oliveira, J. and Pedrycz, W. (Eds.) (2007). Advances in Fuzzy Clustering and Its Applications, Wiley, Chichester.
  • Pal, S. and Mitra, P. (2004). Pattern Recognition Algorithms for Data Mining, Chapman and Hall, Boca Raton, FL.
  • Parvin, H., Alizadeh, H. and Minati, B. (1971). Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association 66(336): 846-850.
  • Parvin, H., Alizadeh, H. and Minati, B. (2010). A modification on k-nearest neighbor classifier, Global Journal of Computer Science and Technology 10(14): 37-41.
  • Sait, S. and Youssef, H. (2000). Iterative Computer Algorithms with Applications in Engineering: Solving Combinatorial Optimization Problems, IEEE Computer Society Press, Los Alamitos, CA.
  • Sammon, J. (1969). A nonlinear mapping for data structure analysis, IEEE Transactions on Computers 18(5): 401-409.
  • Saxena, A., Pal, N. and Vora, M. (2010). Evolutionary methods for unsupervised feature selection using Sammon's stress function, Fuzzy Information and Engineering 2(3): 229-247.
  • Strickert, M., Teichmann, S., Sreenivasulu, N. and Seiffert, U. (2005). DIPPP online self-improving linear map for distance-preserving data analysis, 5th Workshop on SelfOrganizing Maps, WSOM'05, Paris, France, pp. 661-668.
  • Sumi, S.M., Zaman, M.F. and Hirose, H. (2012). A rainfall forecasting method using machine learning models and its application to the Fukuoka city case, International Journal of Applied Mathematics and Computer Science 22(4): 841-854, DOI: 10.2478/v10006-012-0062-1.
  • Szu, H. and Hartley, R. (1987). Fast simulated annealing, Physics Letters A 122(3-4): 157-162.
  • Tian, T., Wilcox, R. and James, G. (2010). Data reduction in classification: A simulated annealing based projection method, Statistical Analysis and Data Mining 3(5): 319-331.
  • UC Irvine Machine Learning Repository (2013).
  • Vanstrum, M. and Starks, S. (1981). An algorithm for optimal linear maps, Southeastcon Conference, Huntsville, AL, USA, pp. 106-110.
  • Wand, M. and Jones, M. (1995). Kernel Smoothing, Chapman and Hall, London.
  • Wilson, D. and Martinez, T. (2000). Reduction techniques for instance-based learning algorithms, Machine Learning 38(3): 257-286.
  • Xu, R. and Wunsch, D. (2009). Clustering, Wiley, Hoboken, NJ.
  • Zhigljavsky, A. and Žilinskas, A. (2008). Stochastic Global Optimization, Springer-Verlag, Berlin.
Typ dokumentu
Identyfikator YADDA
JavaScript jest wyłączony w Twojej przeglądarce internetowej. Włącz go, a następnie odśwież stronę, aby móc w pełni z niej korzystać.