ARTÍCULOS ORIGINALES
ISSN 2389-8186
E-ISSN 2389-8194
Vol.7, No. 2-1
Julio-diciembre de 2020
doi: https://doi.org/10.16967/23898186.667
pp. 31-43
rpe.ceipa.edu.co
* The authors are very grateful to Tecnológico Nacional de México for supporting this work. This research paper was also sponsored by CONACYT.
** Mtra. in Administrative Engineering. Tecnológico Nacional de México, Veracruz, México. E-mail: dci.nrodriguez@ito-depi.edu.mx.
ORCID: 0000-0002-8227-9476. Google Scholar: https://scholar.google.es/citations?hl=es&user=27GDu_gAAAAJ.
*** PhD in Computer Science. Tecnológico Nacional de México, Veracruz, México. E-mail: lrodriguezm@ito-depi.edu.mx.
ORCID: 0000-0002-9861-3993. Google Scholar: https://scholar.google.es/citations?user=2hZw4HAAAAAJ&hl=es.
**** PhD in Computer Science. Centro Universitario UAEM Zumpango, Zumpango, México. E-mail: alchau@uaemex.mx.
ORCID: 0000-0001-5254-0939. Google Scholar: https://scholar.google.es/citations?user=FqdMAaQAAAAJ&hl=es&oi=sra.
***** PhD in Electrical Engineering. Tecnológico Nacional de México, Veracruz, México.
E-mail: galorh@orizaba.tecnm.mx. ORCID: 0000-0003-3296-0981. Google Scholar: https://scholar.google.es/citations?hl=es&user=8Zgf4KwAAAAJ.
Comparative Analysis of Decision
Tree Algorithms for Data Warehouse
Fragmentation*
NIDIA RODRÍGUEZ MAZAHUA**
LISBETH RODRÍGUEZ MAZAHUA***
ASDRÚBAL LÓPEZ CHAU****
GINER ALOR HERNÁNDEZ*****
CÓMO CITAR ESTE ARTÍCULO
How to cite this article:
Rodríguez, N. et al. (2020).
Comparative Analysis of Decision
Tree Algorithms for Data Warehouse
Fragmentation. Revista Perspectiva
Empresarial, 7(2-1), 31-43.
Recibido: 20 de agosto de 2020
Aceptado: 07 de diciembre de 2020
ABSTRACT
One of the main problems faced by Data Warehouse designers is fragmentation.
Several studies have proposed data mining-based horizontal fragmentation methods.
However, no horizontal fragmentation technique uses a decision tree. This paper
presents an analysis of different decision tree algorithms to select the best one for
implementing the fragmentation method. The analysis was performed with Weka
version 3.9.4, considering four evaluation metrics (Precision, ROC Area, Recall and
F-measure) on different data sets built with the Star Schema Benchmark. The results
showed that in most cases the two best algorithms were J48 and Random Forest;
nevertheless, J48 was selected because it is more efficient in building the model.
KEY WORDS
Data analysis, computer systems, databases, artificial intelligence, decision making.
Análisis comparativo de algoritmos de árboles de decisión
para la fragmentación de almacenes de datos
RESUMEN
Uno de los principales problemas a los que se enfrentan los diseñadores
de almacenes de datos es la fragmentación. Varios estudios han propuesto métodos de
fragmentación horizontal basados en la minería de datos. No obstante, no existe una
técnica de fragmentación horizontal que utilice un árbol de decisión. Este trabajo presenta
el análisis de diferentes algoritmos de árboles de decisión con el fin de seleccionar el mejor
para implementar el método de fragmentación. Dicho análisis se realizó bajo la versión
3.9.4 de Weka, considerando cuatro métricas de evaluación (Precision, ROC Area, Recall
y F-measure) para diferentes conjuntos de datos seleccionados utilizando el Star Schema
Benchmark. Los resultados mostraron que los dos mejores algoritmos fueron J48 y Random
Forest en la mayoría de los casos; sin embargo, se seleccionó J48 por ser más eficiente en
la construcción del modelo.
PALABRAS CLAVE
análisis de datos, sistemas informáticos, bases de datos, inteligencia artificial, toma de decisiones.
33
ARTÍCULOS
NIDIA RODRÍGUEZ MAZAHUA, LISBETH RODRÍGUEZ MAZAHUA, ASDRÚBAL LÓPEZ CHAU, GINER ALOR HERNÁNDEZ
Revista Perspectiva Empresarial, Vol. 7, No. 2-1, julio-diciembre de 2020, 31-43
ISSN 2389-8186, E-ISSN 2389-8194
Análise comparativa de algoritmos de árvores de decisão
para a fragmentação de armazéns de dados
RESUMO
Um dos principais problemas enfrentados pelos projetistas de armazéns de dados
é a fragmentação. Vários estudos têm proposto métodos de fragmentação horizontal
baseados na mineração de dados. Não obstante, não existe uma técnica de
fragmentação horizontal que utilize uma árvore de decisão.
Este trabalho apresenta a análise de diferentes algoritmos de árvores de decisão
com o m de selecionar o melhor para implementar o método de fragmentação. Dita
análise se realizou sob a versão 3.9.4 de Weka, considerando quatro métricas de
avaliação (Precision, ROC Area, Recall e F-measure) para diferentes conjuntos de
dados selecionados utilizando o Star Schema Benchmark. Os resultados mostraram
que os dois melhores algoritmos foram J48 e Random Forest na maioria dos casos;
entretanto, selecionou-se J48 por ser mais eficiente na construção do modelo.
PALAVRAS CHAVE
análise de dados, sistemas informáticos, bases de dados, inteligência artificial, tomada de decisões.
Introduction
A Data Warehouse —DW— is a subject-oriented, integrated, time-variant and
non-volatile collection of data in support of management's decision-making
process. Data warehousing provides architectures and tools for business
executives to systematically organize, understand and use their data to make
strategic decisions. Data warehousing systems are valuable tools in today's
fast-changing and competitive world. In recent years, many companies have
spent millions of dollars building company-wide data warehouses. Many people
feel that, with increasing competition across industries, data warehousing is
the newest indispensable marketing strategy and a way to retain customers by
learning more about their needs (Han, Kamber and Pei, 2012).
On the other hand, fragmentation is a distributed database design technique
that consists of dividing each database relation into smaller fragments and
treating each fragment as a separate database object. There are three
alternatives: horizontal, vertical and hybrid fragmentation (Ozsu and
Valduriez, 2020).
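As a toy illustration (not part of the original paper), the three alternatives can be sketched over a small in-memory relation; the table, column names and values below are invented:

```python
# Illustrative sketch: fragmenting a small relation represented as rows.
rows = [
    {"id": 1, "region": "NORTH", "revenue": 100},
    {"id": 2, "region": "SOUTH", "revenue": 250},
    {"id": 3, "region": "NORTH", "revenue": 175},
]

# Horizontal fragmentation: split the relation into subsets of ROWS
# according to a selection predicate (here, the value of "region").
horizontal = {}
for row in rows:
    horizontal.setdefault(row["region"], []).append(row)

# Vertical fragmentation: split the relation into subsets of COLUMNS,
# repeating the key ("id") in each fragment so tuples can be rejoined.
vertical = {
    "frag_region":  [{"id": r["id"], "region": r["region"]} for r in rows],
    "frag_revenue": [{"id": r["id"], "revenue": r["revenue"]} for r in rows],
}

# Hybrid fragmentation applies one technique to the fragments produced by
# the other, e.g. horizontally splitting each vertical fragment.
```

Each horizontal fragment holds complete tuples for one predicate value, which is the alternative the rest of the paper focuses on.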
One of the main problems faced by DW designers is fragmentation. Several
studies have proposed data mining-based horizontal fragmentation methods,
which focus on optimizing query response time. However, to the best of our
knowledge, no existing horizontal fragmentation technique uses a decision
tree to carry out fragmentation.

Decision tree classifiers are attractive because their construction does not
require any domain knowledge or parameter setting, they can handle
multidimensional data, the learning and classification steps of decision tree
induction are simple and fast, and they have good accuracy (Han, Kamber and
Pei, 2012). Given the emphasis these classifiers place on obtaining pure
partitions (subsets of tuples) in a data set using measures such as
Information Gain, Gain Ratio and the Gini Index, the aim of this work is to
use decision trees in DW fragmentation. This paper presents an analysis of
different decision tree algorithms to select the best one for implementing
the fragmentation method. The analysis was performed with Weka version 3.9.4,
considering four evaluation metrics (Precision, ROC Area, Recall and
F-measure) on different data sets built with the Star Schema Benchmark —SSB—.
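The split-quality measures just named can be made concrete with a short, self-contained sketch; the toy labels below are invented, with each tuple's class standing for the fragment it belongs to:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: probability of misclassifying a randomly drawn element."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy reduction achieved by splitting `parent` into `children`."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# Toy node: class label = fragment a tuple would be assigned to.
parent = ["F1"] * 4 + ["F2"] * 4
split = [["F1"] * 4, ["F2"] * 4]       # a perfectly pure candidate split
ig = information_gain(parent, split)   # 1.0 bit: maximal for this node
g = gini(parent)                       # 0.5 for a balanced two-class node
```

A split that produces pure partitions maximizes the gain, which is exactly the behavior the fragmentation method wants to exploit.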
This paper is made up of the following parts: (i) the introduction; (ii) a
review of related works on DW horizontal fragmentation; (iii) the methodology
used in this work for the analysis of decision tree algorithms, with a
description of each algorithm; (iv) the preliminary results; and (v) the
conclusions and future work.
Related Works
Cloud SDW (Spatial DW) and spatial OLAP (On-
line Analytical Processing) as a Service concepts
were presented in Costa et al. (2016). Later those
concepts were used to describe two different
hierarchy-based data partitioning techniques
for the SDW hosted in the cloud: Spatial-based
partitioning and Conventional-based partitioning.

The approach of Ettaouk and Ouzzif (2017) consisted of an incremental
horizontal fragmentation technique for the DW through a web service. The goal
was to automate the implementation of incremental fragmentation in order to
optimize a new query load.
In Barkhordari and Niamanesh (2018), a method called Chabok was proposed,
which uses two-phase Map-Reduce to solve DW problems with big data. Chabok
fragments the fact table horizontally. If there are homogeneous nodes, the
same number of records is allocated to each Fact-Mapper node. As part of
their ongoing work on workload-driven partitioning, Boissier and Kurzynski
(2018) implemented an approach called aggressive data skipping and extended
it to handle both analytical and transactional access patterns. The authors
evaluated their approach with the workload and data of a production system of
a Global 2000 company.
Likewise, Barr, Boukhalfa and Bouibede (2018)
used linear programming to solve the NP-hard
problem of determining a horizontal fragmentation
scheme in relational DW. Also, Nam, Kim and
Han (2018) proposed a graph-based database
partitioning method called GPT that improves the
performance of queries with less data redundancy.
In Letrache, El Beggar and Ramdani (2018), a dynamic fragmentation strategy
for OLAP cubes was proposed, using association rule mining.
On the other hand, Kechar and Nait-Bahloul (2019) presented a horizontal data
partitioning approach tailored to a large DW interrogated by a high number of
queries. The idea was to fragment horizontally only the large fact table,
based on partitioning predicates elected from the set of selection predicates
used by analytic queries. Meanwhile, in Ramdane et al. (2019), the authors
asserted that horizontal partitioning techniques have been used for many
purposes in big data processing, such as load balancing, skipping unnecessary
data loads, and guiding the physical design of a DW. Therefore, they proposed
a new data placement strategy in the Apache Hadoop environment called Smart
Data Warehouse Placement —SDWP—, which allows performing the star join
operation in only one Spark stage. The problem of partitioning and load
balancing in a cluster of homogeneous nodes was investigated; experiments
using the TPC-DS benchmark showed that the proposed method enhances OLAP
query performance in terms of execution time.
Likewise, in Ramdane et al. (2019), the authors mixed a data-driven and a
workload-driven model to create a new scheme for distributed big data
warehouses over Hadoop, called "SkipSJoin." First, SkipSJoin builds
horizontal fragments (buckets) of the fact and dimension tables of the DW
using a hash-partitioning method, and distributes these buckets evenly over
the nodes of the cluster. Then, it allows skipping the scanning of some
unnecessary data blocks by hash-partitioning some DW tables. Experiments
using the TPC-DS benchmark showed that the proposal outperforms some
approaches in terms of query execution time.
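The bucket construction behind such schemes can be sketched as plain hash partitioning; the function and data below are illustrative, not taken from the cited work:

```python
import hashlib

def bucket_of(key, n_buckets):
    # A stable hash (unlike Python's per-process randomized hash()) so that
    # fact and dimension rows sharing a key always land in the same bucket.
    digest = hashlib.sha256(str(key).encode()).hexdigest()
    return int(digest, 16) % n_buckets

def hash_partition(rows, key, n_buckets):
    """Assign every row to exactly one bucket by hashing its join key."""
    buckets = {b: [] for b in range(n_buckets)}
    for row in rows:
        buckets[bucket_of(row[key], n_buckets)].append(row)
    return buckets

# Toy fact table keyed on a customer id.
fact = [{"order_id": i, "cust_id": i % 7} for i in range(100)]
buckets = hash_partition(fact, "cust_id", n_buckets=4)
# Tables co-partitioned on the same key can be star-joined bucket by
# bucket, skipping every other bucket entirely.
```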
Finally, Hilprecht, Carsten and Uwe (2019) noted that commercial data
analytics products such as Microsoft Azure SQL Data Warehouse or Amazon
Redshift provide ready-to-use scale-out database solutions for OLAP-style
workloads in the cloud. Whereas the provisioning of a database cluster is in
general fully automated by cloud providers, customers still have to make
important design decisions that were traditionally made by the database
administrator, such as selecting the partitioning schemes. Therefore, the
authors proposed a learned partitioning advisor for analytical OLAP-style
workloads based on Deep Reinforcement Learning —DRL—. The leading idea was
that a DRL agent learns its decisions based on experience by monitoring the
rewards for different workloads and partitioning schemes. The evaluation
showed that the advisor not only finds partitioning schemes that outperform
existing approaches for automated partitioning design but can also easily
adjust to different deployments.
Table 1 provides an analysis of the horizontal fragmentation methods
discussed above, indicating the classification of each work and the
validation used.
Table 1. Comparative table of works on horizontal fragmentation

Work | Classification | Validation
Costa et al. (2016) | Hierarchy-based | Spatial Data Warehouse Benchmark (Spadawan)
Ettaouk and Ouzzif (2017) | Cost-based | Benchmark APB-1
Barkhordari and Niamanesh (2018) | Map-Reduce-based | Benchmark TPC-DS
Boissier and Kurzynski (2018) | Cost-based | Benchmarks TPC-C, TPC-CH (CH-benCHmark); data and workload of a SAP ERP system of a Global 2000 company
Barr, Boukhalfa and Bouibede (2018) | Metaheuristic-based | Benchmark APB-1
Nam, Kim and Han (2018) | Cost-based, Graph-based | Benchmark TPC-DS, The Internet Movie DataBase (IMDB) and BioWarehouse
Letrache, El Beggar and Ramdani (2018) | Data mining-based | Benchmark TPC-DS
Kechar and Nait-Bahloul (2019) | Cost-based, Predicates-based | SSB
Ramdane et al. (2019) | Hash-partitioning-based | TPC-DS benchmark using the Scala language on a cluster of homogeneous nodes, a Hadoop-YARN platform, a Spark engine, and Hive
Ramdane et al. (2019) | Hash-partitioning-based | TPC-DS benchmark
Hilprecht, Carsten and Uwe (2019) | Deep Reinforcement Learning-based | Different database schemata and workloads of varying complexity

Source: authors' own elaboration.
Methodology
In this section, the process followed for the analysis of the decision tree
algorithms is established; after that, each of the algorithms available in
the Weka version used is described.
Collection and Preparation of Data
In order to carry out the study of decision tree algorithms to select the
best one to fragment the DW, we used SSB and PostgreSQL. We constructed eight
data sets following the algorithm proposed by Rodríguez et al. (2014). The
resulting data set for 24 queries and two fragments is visualized in Figure 1.

Figure 1. Data set with 24 queries and 2 fragments. Source: authors' own elaboration.
Application of Decision Tree Algorithms
The seven decision tree algorithms offered by Weka version 3.9.4 were applied
to the eight data sets. A description of each algorithm is presented below.
Hoeffding Tree: an incremental, anytime decision tree induction algorithm
capable of learning from massive data streams, assuming that the distribution
generating the examples does not change over time. Hoeffding trees exploit
the fact that a small sample is often sufficient to choose an optimal
splitting attribute. This idea is supported mathematically by the Hoeffding
bound, which quantifies the number of observations needed to estimate some
statistics within a prescribed precision (Hulten, Spencer and Domingos, 2001).
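The bound that drives the split decision can be sketched under its usual formulation, epsilon = sqrt(R^2 ln(1/delta) / 2n); the numbers below are illustrative:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the true mean of a random variable with
    range `value_range` lies within this epsilon of the mean observed over
    n independent examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# For information gain in bits over two classes the range is 1, so after a
# few thousand stream examples the bound is already small (~0.04 here).
eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=5000)
# If the observed gain of the best attribute beats the runner-up by more
# than eps, the Hoeffding tree can commit to the split with confidence.
```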
LMT (Logistic Model Trees): an algorithm that combines a decision tree
structure with logistic regression functions at the leaves. The algorithm can
deal with binary and multi-class target variables, numeric and nominal
attributes, and missing values (Landwehr, Hall and Frank, 2005).
J48: Weka's implementation of the C4.5 decision tree, one of the most broadly
used real-world approaches. C4.5 can express the learned tree as a set of
if-then rules to improve human readability. The decision tree is simple to
understand and interpret; besides, it can handle nominal and categorical data
and performs well on large data sets in a short time. In C4.5 training, the
decision tree is built in a top-down recursive way (Saeh et al., 2016).
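The top-down recursive induction C4.5 performs can be shown in miniature. This is a bare-bones information-gain learner for nominal attributes only (real C4.5 adds gain ratio, numeric splits and pruning), run on an invented toy workload:

```python
import math
from collections import Counter

def entropy(rows, target):
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_attribute(rows, attrs, target):
    def gain(a):
        n = len(rows)
        parts = {}
        for r in rows:
            parts.setdefault(r[a], []).append(r)
        return entropy(rows, target) - sum(
            len(p) / n * entropy(p, target) for p in parts.values())
    return max(attrs, key=gain)

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attrs:   # pure node, or no attrs left
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, attrs, target)  # greedy top-down choice
    children = {}
    for r in rows:
        children.setdefault(r[a], []).append(r)
    rest = [x for x in attrs if x != a]
    return (a, {v: build_tree(p, rest, target) for v, p in children.items()})

# Toy workload: predict the fragment a query targets from its predicates.
data = [
    {"region": "NORTH", "year": "1997", "fragment": "F1"},
    {"region": "NORTH", "year": "1998", "fragment": "F1"},
    {"region": "SOUTH", "year": "1997", "fragment": "F2"},
    {"region": "SOUTH", "year": "1998", "fragment": "F2"},
]
tree = build_tree(data, ["region", "year"], "fragment")
# tree == ("region", {"NORTH": "F1", "SOUTH": "F2"})
```

The learned tree reads directly as if-then rules: if region is NORTH then F1, else F2.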
Decision Stump: a one-level decision tree that classifies instances based on
a single feature. The root node tests one feature, each branch represents a
value that the feature can take, and instances are classified by starting at
the root node and sorting them according to their feature values (Kotsiantis,
Tsekouras and Pintelas, 2005; Shi et al., 2018).
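A minimal stump on one numeric feature can be sketched as follows; the threshold here is fixed by hand, whereas a real learner would choose the feature and threshold that minimize impurity:

```python
from collections import Counter

def fit_stump(xs, ys, threshold):
    """One-level tree: majority class on each side of a single threshold."""
    majority = lambda labels: Counter(labels).most_common(1)[0][0]
    left = [y for x, y in zip(xs, ys) if x <= threshold]    # left branch
    right = [y for x, y in zip(xs, ys) if x > threshold]    # right branch
    left_class, right_class = majority(left), majority(right)
    return lambda x: left_class if x <= threshold else right_class

xs = [1.0, 2.0, 8.0, 9.0]
ys = ["A", "A", "B", "B"]
stump = fit_stump(xs, ys, threshold=5.0)
# stump(1.5) -> "A"; stump(7.0) -> "B"
```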
Random Forest: this algorithm uses bootstrap methods to create an ensemble of
trees, one for each bootstrap sample. Additionally, the set of variables
eligible to be used in splitting is randomly varied in order to decorrelate
the trees. Once the forest of trees is created, they vote to determine the
predicted value of the input data (Dean, 2014).
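The two core ideas, bootstrap sampling and majority voting, can be sketched with a deliberately trivial base learner (a fixed-threshold stump); the real algorithm grows full trees and also randomizes the features tried at each split:

```python
import random
from collections import Counter

def bootstrap(rows, rng):
    # Sample len(rows) rows WITH replacement: each tree sees its own sample.
    return [rng.choice(rows) for _ in rows]

def train_stump(sample, threshold=5.0):
    # Trivial base learner: majority class on each side of a fixed cut.
    majority = lambda labels: Counter(labels).most_common(1)[0][0]
    default = majority([y for _, y in sample])
    left = [y for x, y in sample if x <= threshold]
    right = [y for x, y in sample if x > threshold]
    left_class = majority(left) if left else default
    right_class = majority(right) if right else default
    return lambda x: left_class if x <= threshold else right_class

def forest_predict(trees, x):
    votes = Counter(tree(x) for tree in trees)   # one vote per tree
    return votes.most_common(1)[0][0]

rng = random.Random(42)
rows = [(0.5, "A"), (1.5, "A"), (8.0, "B"), (9.5, "B")]
trees = [train_stump(bootstrap(rows, rng)) for _ in range(25)]
```

Even when some bootstrap samples are degenerate (all one class), the majority vote across the ensemble remains stable, which is the point of the method.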
Random Tree: It constructs a tree that considers
a given number of random features at each node
(Witten, Frank and Hall, 2011).
REPTree: it builds a decision or regression tree using information
gain/variance reduction and prunes it using reduced-error pruning. Optimized
for speed, it sorts values for numeric attributes only once. It deals with
missing values by splitting instances into pieces, as C4.5 does. The minimum
proportion of training-set variance for a split and the number of folds for
pruning can be set (Witten, Frank and Hall, 2011).
Results
After having analyzed the different decision tree algorithms, the following
results were found for the ROC Area, Precision, Recall and F-measure metrics.
Figures 2 to 5 show that, considering the Recall, Precision, ROC Area and
F-Measure metrics, respectively, for the 24-query data sets, the J48
algorithm was the best in most cases; in the remaining cases it was overcome
by Random Forest.
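For reference, the four metrics being compared can be computed from scratch as below. This is an illustrative sketch with invented predictions: Weka reports these quantities per class and weighted by class frequency, and its ROC Area uses classifier scores, approximated here by the pairwise-ranking definition of AUC:

```python
def confusion(y_true, y_pred, positive):
    """True positives, false positives and false negatives for one class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def precision_recall_f1(y_true, y_pred, positive):
    tp, fp, fn = confusion(y_true, y_pred, positive)
    precision = tp / (tp + fp)          # of predicted positives, correct ones
    recall = tp / (tp + fn)             # of actual positives, found ones
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

def roc_area(pos_scores, neg_scores):
    """AUC: probability a random positive outranks a random negative."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Invented toy predictions: class = fragment assigned to a query.
y_true = ["F1", "F1", "F1", "F2", "F2", "F2"]
y_pred = ["F1", "F1", "F2", "F2", "F2", "F1"]
p, r, f = precision_recall_f1(y_true, y_pred, positive="F1")
# p = r = f = 2/3 on this toy prediction
auc = roc_area([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])  # 8/9 on these toy scores
```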
Figure 2. Results of Recall metric for 24-query data sets. Source: authors' own elaboration.
Figure 3. Results of Precision metric for 24-query data sets. Source: authors' own elaboration.
Figure 4. Results of ROC Area metric for 24-query data sets. Source: authors' own elaboration.