An algorithm for reducing the dimension and size of a sample for data exploration procedures

Kulczycki, Piotr; Łukasik, Szymon

doi:10.2478/amcs-2014-0011

Artykuł - szczegóły

Czasopismo

International Journal of Applied Mathematics and Computer Science

2014 | 24 | 1 | 133-149

Tytuł artykułu

An algorithm for reducing the dimension and size of a sample for data exploration procedures

Autorzy

Piotr Kulczycki , Szymon Łukasik

Treść / Zawartość

Pełne teksty:

http://matwbn.icm.edu.pl/ksiazki/amc/amc24/amc24111.pdf [zdalny]

Warianty tytułu

Języki publikacji

EN

Abstrakty

EN

The paper deals with the issue of reducing the dimension and size of a data set (random sample) for exploratory data analysis procedures. The concept of the algorithm investigated here is based on linear transformation to a space of a smaller dimension, while retaining as much as possible the same distances between particular elements. Elements of the transformation matrix are computed using the metaheuristics of parallel fast simulated annealing. Moreover, elimination of or a decrease in importance is performed on those data set elements which have undergone a significant change in location in relation to the others. The presented method can have universal application in a wide range of data exploration problems, offering flexible customization, possibility of use in a dynamic data environment, and comparable or better performance with regards to the principal component analysis. Its positive features were verified in detail for the domain's fundamental tasks of clustering, classification and detection of atypical elements (outliers).

Słowa kluczowe

EN

dimension reduction sample size reduction linear transformation simulated annealing data mining

Wydawca

University of Zielona Gora Press

Czasopismo

International Journal of Applied Mathematics and Computer Science

Rocznik

2014

Tom

24

Numer

1

Strony

133-149

Opis fizyczny

Daty

wydano

2014

otrzymano

2013-04-06

poprawiono

2013-11-10

Twórcy

autor

Piotr Kulczycki

Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447 Warsaw, Poland
Department of Automatic Control and Information Technology, Cracow University of Technology, ul. Warszawska 24, 31-155 Cracow, Poland

autor

Szymon Łukasik

Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447 Warsaw, Poland
Department of Automatic Control and Information Technology, Cracow University of Technology, ul. Warszawska 24, 31-155 Cracow, Poland

Bibliografia

Aarts, E., Korst, J. and van Laarhoven, P. (1997). Simulated annealing, in E. Aarts and J. Lenstra (Eds.), Local Search in Combinatorial Optimization, Wiley, Chichester, pp. 91-120.
Alba, E. (2005). Parallel Metaheuristics: A New Class of Algorithms, Wiley, New York, NY.
Aswani Kumar, C. and Srinivas, S. (2006). Latent semantic indexing using eigenvalue analysis for efficient information retrieval, International Journal of Applied Mathematics and Computer Science 16(4): 551-558.
Aswani Kumar, C. (2009). Analysis of unsupervised dimensionality techniques, Computer Science and Information Systems 6(2): 217-227.
Azencot, R. (1992). Simulated Annealing: Parallelization Techniques, Wiley, New York, NY.
Bartenhagen, C., Klein, H.-U., Ruckert, C., Jiang, X. and Dugas, M. (2010). Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data, BMC Bioinformatics 11, paper no. 567.
Bartkuté, V. and Sakalauskas, L. (2009). Statistical inferences for termination of Markov type random search algorithms, Journal of Optimization Theory and Applications 141(3): 475-493.
Ben-Ameur, W. (2004). Computing the initial temperature of simulated annealing, Computational Optimization and Applications 29(3): 367-383.
Borg, I. and Groenen, P. (2005). Modern Multidimensional Scaling. Theory and Applications, Springer-Verlag, Berlin.
Camastra, F. (2003). Data dimensionality estimation methods: A survey, Pattern Recognition 36(12): 2945-2954.
Charytanowicz, M., Niewczas, J., Kulczycki, P., Kowalski, P., Łukasik, S. and Żak, S. (2010). Complete gradient clustering algorithm for features analysis of x-ray images, in E. Piątka and J. Kawa (Eds.), Information Technologies in Biomedicine, Vol. 2, Springer-Verlag, Berlin, pp. 15-24.
Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties, Decision Support Systems 47(4): 547-553.
Cox, T. and Cox, M. (2000). Multidimensional Scaling, Chapman and Hall, Boca Raton, FL.
Cunningham, P. (2007). Dimension reduction, Technical report, UCD School of Computer Science and Informatics, Dublin.
Czarnowski, I. and Jędrzejowicz, P. (2011). Application of agent-based simulated annealing and tabu search procedures to solving the data reduction problem, International Journal of Applied Mathematics and Computer Science 21(1): 57-68, DOI: 10.2478/v10006-011-0004-3.
David, H. and Nagaraja, H. (2003). Order Statistics, Wiley, New York, NY.
Deng, Z., Chung, F.-L. and Wang, S. (2008). FRSDE: Fast reduced set density estimator using minimal enclosing ball approximation, Pattern Recognition 41(4): 1363-1372.
François, D., Wertz, V. and Verleysen, M. (2007). The concentration of fractional distances, IEEE Transactions on Knowledge and Data Engineering 19(7): 873-886.
Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distribution and the Bayesian restoration in images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 721-741.
Gendreau, M. and Potvin, J.-Y. (2010). Handbook of Metaheuristics, Springer, New York, NY.
Han, J. and Kamber, M. (2006). Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA.
Ingber, L. (1996). Adaptive simulated annealing (ASA): Lessons learned, Control and Cybernetics 25(1): 33-54.
Inza, I., Larranaga, P., Etxeberria, R. and Sierra, B. (2000). Feature subset selection by Bayesian network-based optimization, Artificial Intelligence 123(1-2): 157-184.
Ishibuchi, H., Nakashima, T. and Murata, T. (2001). Three-objective genetics-based machine learning for linguistic rule extraction, Information Sciences 136(1-4): 109-133.
Kerdprasop, K., Kerdprasop, N. and Sattayatham, P. (2005). Weighted k-means for density-biased clustering, in A. Tjoa and J. Trujillo (Eds.), Data Warehousing and Knowledge Discovery, Lecture Notes in Computer Science, Vol. 3589, Springer-Verlag, Berlin pp. 488-497.
Kulczycki, P. (2005). Kernel Estimators in System Analysis, WNT, Warsaw, (in Polish).
Kulczycki, P. (2008). Kernel estimators in industrial applications, in B. Prasad (Ed.), Soft Computing Applications in Industry, Springer-Verlag, Berlin, pp. 69-91.
Kulczycki, P. and Charytanowicz, M. (2010). A complete gradient clustering algorithm formed with kernel estimators, International Journal of Applied Mathematics and Computer Science 20(1): 123-134, DOI: 10.2478/v10006-010-0009-3.
Kulczycki, P. and Kowalski, P. (2011). Bayes classification of imprecise information of interval type, Control and Cybernetics 40(1): 101-123.
Kulczycki, P. and Łukasik, S. (2014). Reduction of dimension and size of data set by parallel fast simulated annealing, in L.T. Koczy, C.R. Pozna, R. Claudiu and J. Kacprzyk (Eds.), Issues and Challenges of Intelligent Systems and Computational Intelligence, Springer-Verlag, Berlin, pp. 273-292.
Kuo, Y. (2010). Using simulated annealing to minimize fuel consumption for the time-dependent vehicle routing problem, Computers & Industrial Engineering 59(1): 157-165.
Łukasik, S. and Kulczycki, P. (2011). An algorithm for sample and data dimensionality reduction using fast simulated annealing, in J. Tang, I. King, L. Chen and J. Wang (Eds.), Advanced Data Mining and Applications, Lecture Notes in Computer Science, Vol. 7120, Springer-Verlag, Berlin, pp. 152-161.
Łukasik, S. and Kulczycki, P. (2013). Using topology preservation measures for multidimensional intelligent data analysis in the reduced feature space, in L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. Zadeh and J. Zurada (Eds.), Artificial Intelligence and Soft Computing, Lecture Notes in Computer Science, Vol. 7895, Springer-Verlag, Berlin, pp. 184-193.
Maaten, van der, L. (2009). Feature Extraction from Visual Data, Ph.D. thesis, Tilburg University, Tilburg.
Mangasarian, O. and Wolberg, W. (1990). Cancer diagnosis via linear programming, SIAM News 23(5): 1-18.
Mitra, P., Murthy, C. and Pal, S. (2002). Density-based multiscale data condensation, IEEE Transactions on Pattern Analysis and Machine Intelligence 24(6): 734-747.
Nam, D., Lee, J.-S. and Park, C. (2004). n-dimensional Cauchy neighbor generation for the fast simulated annealing, IEICE Transactions on Information and Systems E87D(11): 2499-2502.
Oliveira, J. and Pedrycz, W. (Eds.) (2007). Advances in Fuzzy Clustering and Its Applications, Wiley, Chichester.
Pal, S. and Mitra, P. (2004). Pattern Recognition Algorithms for Data Mining, Chapman and Hall, Boca Raton, FL.
Parvin, H., Alizadeh, H. and Minati, B. (1971). Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association 66(336): 846-850.
Parvin, H., Alizadeh, H. and Minati, B. (2010). A modification on k-nearest neighbor classifier, Global Journal of Computer Science and Technology 10(14): 37-41.
Sait, S. and Youssef, H. (2000). Iterative Computer Algorithms with Applications in Engineering: Solving Combinatorial Optimization Problems, IEEE Computer Society Press, Los Alamitos, CA.
Sammon, J. (1969). A nonlinear mapping for data structure analysis, IEEE Transactions on Computers 18(5): 401-409.
Saxena, A., Pal, N. and Vora, M. (2010). Evolutionary methods for unsupervised feature selection using Sammon's stress function, Fuzzy Information and Engineering 2(3): 229-247.
Strickert, M., Teichmann, S., Sreenivasulu, N. and Seiffert, U. (2005). DIPPP online self-improving linear map for distance-preserving data analysis, 5th Workshop on SelfOrganizing Maps, WSOM'05, Paris, France, pp. 661-668.
Sumi, S.M., Zaman, M.F. and Hirose, H. (2012). A rainfall forecasting method using machine learning models and its application to the Fukuoka city case, International Journal of Applied Mathematics and Computer Science 22(4): 841-854, DOI: 10.2478/v10006-012-0062-1.
Szu, H. and Hartley, R. (1987). Fast simulated annealing, Physics Letters A 122(3-4): 157-162.
Tian, T., Wilcox, R. and James, G. (2010). Data reduction in classification: A simulated annealing based projection method, Statistical Analysis and Data Mining 3(5): 319-331.
UC Irvine Machine Learning Repository http://archive.ics.uci.edu/ml/. (2013).
Vanstrum, M. and Starks, S. (1981). An algorithm for optimal linear maps, Southeastcon Conference, Huntsville, AL, USA, pp. 106-110.
Wand, M. and Jones, M. (1995). Kernel Smoothing, Chapman and Hall, London.
Wilson, D. and Martinez, T. (2000). Reduction techniques for instance-based learning algorithms, Machine Learning 38(3): 257-286.
Xu, R. and Wunsch, D. (2009). Clustering, Wiley, Hoboken, NJ.
Zhigljavsky, A. and Žilinskas, A. (2008). Stochastic Global Optimization, Springer-Verlag, Berlin.

Typ dokumentu

Bibliografia

Identyfikatory

DOI

10.2478/amcs-2014-0011

Identyfikator YADDA

bwmeta1.element.bwnjournal-article-amcv24i1p133bwm

Artykuł - szczegóły

Czasopismo

International Journal of Applied Mathematics and Computer Science

Tytuł artykułu

An algorithm for reducing the dimension and size of a sample for data exploration procedures

Autorzy

Treść / Zawartość

Warianty tytułu

Języki publikacji

Abstrakty

Słowa kluczowe

Wydawca

Czasopismo

Rocznik

Tom

Numer

Strony

Opis fizyczny

Daty

Twórcy

Bibliografia

Typ dokumentu

Bibliografia

Identyfikatory

Identyfikator YADDA