02-07-2014, 03:41 PM
Predicting School Failure and Dropout by Using Data Mining Techniques
INTRODUCTION
RECENT years have shown a growing interest and concern
in many countries about the problem of school failure and
the determination of its main contributing factors [1]. A great
deal of research [2] has been done on identifying the factors
that affect the low performance of students (school failure and
dropout) at different educational levels (primary, secondary
and higher) using the large amount of information that current
computers can store in databases. All these data are a “gold
mine” of valuable information about students. Identifying and
finding useful information hidden in large databases is a difficult
task [3]. A very promising way to achieve this goal is the
use of knowledge discovery in databases techniques, or data
mining, in education, called educational data mining (EDM) [4].
This new area of research focuses on the development of
methods to better understand students and the settings in which
they learn [5]. In fact, there are good examples of how to
apply EDM techniques to create models that predict dropping
out and student failure specifically [6]. These works have
shown promising results with respect to those sociological,
economic, or educational characteristics that may be more
relevant in the prediction of low academic performance [7].
It is also important to note that most of the research on
the application of EDM to the problems of student
failure and dropout has focused primarily on the specific
case of higher education [8] and, more specifically, on online o
DATA PRE-PROCESSING
Before applying DM algorithms it is necessary to carry
out some pre-processing tasks such as cleaning, integration,
discretization and variable transformation [13]. It must be
pointed out that data pre-processing was a very important task
in this work, because the quality and reliability of the available
information directly affect the results obtained. In fact,
some specific pre-processing tasks were applied to prepare all
the previously described data so that the classification task
could be carried out correctly. Firstly, all available data were
integrated into a single dataset. During this process those
students without 100% complete information were eliminated.
All students who did not answer our specific survey or the
CENEVAL survey were excluded. Some modifications were
also made to the values of some attributes. For example,
words that contained the letter “Ñ” were replaced by “N”.
A new attribute holding the age of each student in years was
created from the day, month, and year of birth of each student.
Furthermore, the continuous variables were transformed into
discrete variables, which provide a much more comprehensible
view of the data. For example, the numerical values of the
scores obtained by students in each subject were changed to
categorical values in the following way:
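
A minimal sketch of the pre-processing steps described in this section, written in plain Python. The field names, the sample records, the reference date, and in particular the score cut points and category labels are hypothetical placeholders chosen for illustration; they are not the values used in the original study.

```python
from datetime import date

# Hypothetical records: field names and values are illustrative only.
records = [
    {"name": "MUÑOZ", "birth": (1995, 3, 14), "math_score": 8.7},
    {"name": "PEÑA", "birth": (1996, 11, 2), "math_score": None},  # incomplete
    {"name": "LOPEZ", "birth": (1995, 10, 30), "math_score": 5.2},
]


def is_complete(record):
    """True only if every field has a value (100% complete information)."""
    return all(value is not None for value in record.values())


def replace_enye(text):
    """Replace the letter "Ñ" with "N", as described in the text."""
    return text.replace("Ñ", "N")


def age_in_years(birth, today):
    """Derive age in whole years from a (year, month, day) birth date."""
    year, month, day = birth
    had_birthday = (today.month, today.day) >= (month, day)
    return today.year - year - (0 if had_birthday else 1)


def discretize(score):
    """Map a numeric score to a category (placeholder cut points)."""
    if score < 6.0:
        return "LOW"
    if score < 8.0:
        return "MEDIUM"
    return "HIGH"


def preprocess(records, today):
    """Drop incomplete records, then clean and transform the rest."""
    cleaned = []
    for record in records:
        if not is_complete(record):
            continue
        cleaned.append({
            "name": replace_enye(record["name"]),
            "age": age_in_years(record["birth"], today),
            "math_score": discretize(record["math_score"]),
        })
    return cleaned


# The reference date is an assumption made for this example.
clean = preprocess(records, date(2012, 9, 1))
for row in clean:
    print(row)
```

Keeping each step in its own small function makes it easy to swap in the real attribute names and category boundaries once they are known.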
INTERPRETATION OF RESULTS
In this section, some examples of different rules discovered
by some of the algorithms are shown in order to compare
their interpretability and usefulness for early identification of
students at risk of failing and for making decisions about
how to help these students. These rules show us the relevant
factors and relationships that lead a student to pass or fail.
CONCLUSION
As we have seen, predicting student failure at school can be
a difficult task not only because it is a multifactor problem (in
which there are a lot of personal, family, social, and economic
factors that can be influential) but also because the available
data are normally imbalanced. To resolve these problems, we
have shown the use of different DM algorithms and approaches
for predicting student failure. We have carried out several
experiments using real data from high school students in
Mexico. We have applied different classification approaches
for predicting the academic status or final student performance
at the end of the course. Furthermore, we have shown that some
approaches such as selecting the best attributes, cost-sensitive
classification, and data balancing can also be very useful for
improving accuracy.
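
To illustrate the data balancing idea mentioned above, the following sketch uses random oversampling, which duplicates minority-class rows until every class is as large as the biggest one. This is a generic technique picked for illustration, not necessarily the balancing method used in this work; the function name `random_oversample`, the `status` field, and the sample rows are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, label_key, seed=0):
    """Duplicate rows of smaller classes until all classes have equal size."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    # Group the rows by their class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Add randomly chosen copies of existing rows to fill the gap.
        for _ in range(target - len(group)):
            balanced.append(dict(rng.choice(group)))
    return balanced


# Hypothetical imbalanced data: nine students who pass, one who fails.
rows = [{"id": i, "status": "Pass"} for i in range(9)]
rows.append({"id": 9, "status": "Fail"})

balanced = random_oversample(rows, "status")
counts = Counter(row["status"] for row in balanced)
print(counts)
```

Oversampling of this kind should be applied to the training data only, so that the evaluation data keep the original class distribution.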
It is important to notice that gathering information and
pre-processing data were two very important tasks in this
work. In fact, the quality and the reliability of the
information used directly affect the results obtained. However, these
are arduous tasks that take a lot of time. Specifically,
we had to capture the data from a paper-and-pencil
survey and we had to integrate data from three different sources
to form the final dataset.
In general, regarding the DM approaches used and the
classification results obtained, the main conclusions are as
follows: