Comparative Analysis of Decision Tree Algorithms for Data Warehouse Fragmentation



Nidia Rodríguez Mazahua
Lisbeth Rodríguez Mazahua
Giner Alor Hernández

One of the main problems faced by Data Warehouse designers is fragmentation. Several studies have proposed data mining-based horizontal fragmentation methods. However, no existing horizontal fragmentation technique uses a decision tree. This paper analyzes different decision tree algorithms in order to select the best one for implementing such a fragmentation method. The analysis was performed with Weka version 3.9.4, considering four evaluation metrics (Precision, ROC Area, Recall and F-measure) on different data sets selected using the Star Schema Benchmark. The results showed that J48 and Random Forest were the two best algorithms in most cases; nevertheless, J48 was selected because it builds the model more efficiently.
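To make the four evaluation metrics concrete, the following is a minimal sketch in plain Python of how Precision, Recall, F-measure and ROC Area can be computed for a single class. It is not the paper's procedure: the study used Weka, whose reported figures may differ (for example, through averaging across classes). The function names `precision_recall_f1` and `roc_area` and the toy labels are illustrative assumptions.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def roc_area(
    y_true: Sequence[int], scores: Sequence[float], positive: int = 1
) -> float:
    """ROC area as the fraction of (positive, negative) pairs in which the
    positive instance receives the higher score; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not taken from the paper).
labels = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
confidence = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]

print(precision_recall_f1(labels, predicted))
print(roc_area(labels, confidence))
```

Comparing two classifiers on these metrics alone does not capture model-building time, which is the criterion the abstract cites for preferring J48 over Random Forest.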



Author biographies

Nidia Rodríguez Mazahua, Tecnológico Nacional de México

Master's degree in Administrative Engineering

Lisbeth Rodríguez Mazahua, Tecnológico Nacional de México

PhD in Computer Science

Asdrúbal López Chau, Centro Universitario UAEM

PhD in Computer Science

Giner Alor Hernández, Tecnológico Nacional de México

PhD in Science, specializing in Electrical Engineering


