Harry Zhang and Jiang Su
Faculty of Computer Science, University of New Brunswick
P.O. Box 4400, Fredericton, NB, Canada E3B 5A3
hzhang@csd.uwo.ca,
WWW home page: http://www.cs.unb.ca/profs/hzhang/
Abstract. It is well known that naive Bayes performs surprisingly well in classification, but its probability estimation is poor. In many applications, however, a ranking based on class probabilities is desired. For example, a ranking of customers in terms of the likelihood that they will buy one's products is useful in direct marketing. What is the general performance of naive Bayes in ranking? In this paper, we study this question through both empirical experiments and theoretical analysis. Our experiments show that naive Bayes outperforms C4.4, a state-of-the-art decision-tree algorithm for ranking. We study two example problems that have been used in analyzing the performance of naive Bayes in classification [3]. Surprisingly, naive Bayes performs perfectly on them in ranking, even though it does not in classification. Finally, we present and prove a sufficient condition for the optimality of naive Bayes in ranking.
1 Introduction
Naive Bayes is one of the most effective and efficient classification algorithms. In classification learning problems, a learner attempts to construct a classifier from a given set of training examples with class labels. Assume that A_1, A_2, ..., A_n are n attributes. An example E is represented by a vector (a_1, a_2, ..., a_n), where a_i is the value of A_i. Let C represent the class variable, which takes value + (the positive class) or - (the negative class). We use c to represent the value that C takes. A naive Bayesian classifier, or simply naive Bayes, is defined as:
$$C_{nb}(E) = \arg\max_{c} \, p(c) \prod_{i=1}^{n} p(a_i \mid c). \qquad (1)$$
Because the values of p(a_i|c) can be estimated from the training examples, naive Bayes is easy to construct. It is also, however, surprisingly effective [10].
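To make Equation 1 concrete, here is a minimal Python sketch (our illustration, not the authors' implementation), assuming discrete attributes and plain maximum-likelihood estimates obtained by counting; the function names are ours.

from collections import Counter, defaultdict

def train_nb(examples, labels):
    """Estimate p(c) and p(a_i|c) by counting over the training set."""
    prior = Counter(labels)                  # class counts
    cond = defaultdict(Counter)              # (class, i) -> value counts for A_i
    for e, c in zip(examples, labels):
        for i, a in enumerate(e):
            cond[(c, i)][a] += 1
    return prior, cond, len(labels)

def classify_nb(e, prior, cond, total):
    """Equation 1: arg max_c p(c) * prod_i p(a_i|c)."""
    def score(c):
        s = prior[c] / total
        for i, a in enumerate(e):
            s *= cond[(c, i)][a] / prior[c]
        return s
    return max(prior, key=score)

# The conjunction A1 AND A2, trained on all four Boolean examples:
prior, cond, total = train_nb([(1, 1), (1, 0), (0, 1), (0, 0)], ['+', '-', '-', '-'])
print(classify_nb((1, 1), prior, cond, total))   # '+'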
Naive Bayes is based on the conditional independence assumption that all attributes are independent given the value of the class variable. Obviously, this assumption is rarely true in reality. Indeed, naive Bayes has been found to work poorly for regression problems [7] and to produce poor probability estimates [1].
Typically, the performance of a classifier is measured by its predictive accuracy (or error rate). Some classifiers, such as naive Bayes and decision trees, also produce estimates of the class probability p(c|E).
2 Related Work
The ranking addressed in this paper is based on the class probabilities of examples. If a learning algorithm produces accurate class probability estimates, it certainly produces an accurate ranking, but the converse is not true. For example, assume that E_+ and E_- are a positive and a negative example respectively,
and that the actual class probabilities are p(+|E_+) = 0.9 and p(+|E_-) = 0.4. An algorithm that gives the class probability estimates \hat{p}(+|E_+) = 0.5 and \hat{p}(+|E_-) = 0.45 orders E_+ and E_- correctly in the ranking, although the probability estimates are poor. In ranking, then, an algorithm tolerates errors in the probability estimates to some extent, much as in classification: recall that a classification algorithm classifies an example correctly as long as the class with the maximum posterior probability estimate is identical to the actual class.
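In code, the toy example reads as follows (a sketch; the numbers are taken from the text, and nothing is learned here):

# Estimated p(+|E) for the two examples; the true values are 0.9 and 0.4.
estimates = {"E+": 0.50, "E-": 0.45}
ranking = sorted(estimates, key=estimates.get, reverse=True)
print(ranking)   # ['E+', 'E-']: the order is correct despite the poor estimates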
Naive Bayes is easy to construct and has surprisingly good performance in classification, even though the conditional independence assumption is rarely true in real-world applications. On the other hand, naive Bayes is found to produce poor probability estimates [3], and some work has been published on improving them. Zadrozny and Elkan [19] propose a histogram method to calibrate probability estimates. A more effective and straightforward way to improve naive Bayes is to extend its structure to represent dependencies among attributes [8]. Most of the extensions, however, aim at improving predictive accuracy, not at better probability estimation or ranking. Lachiche and Flach [6] present a method that uses AUC to find an optimal threshold for naive Bayes, and thus improves its classification accuracy. An interesting question is: what is the performance of naive Bayes in terms of ranking (AUC)?
Decision tree learning algorithms are among the simplest and most effective learning algorithms, widely used in many applications. Traditional decision tree learning algorithms, such as C4.5, are error-based but also produce probability estimates. In decision trees, the class probability p(c|E) of an example E is the fraction of the examples of class c in the leaf into which E falls. How to build decision trees with accurate probability estimates is an interesting question. Unfortunately, traditional decision tree algorithms, such as C4.5, have been observed to produce poor estimates of probabilities [14, 16]. According to Provost and Domingos [17], however, the decision tree representation is not (inherently) doomed to produce poor probability estimates; part of the problem is that modern decision tree algorithms are biased against building trees with accurate probability estimates. They propose two techniques to improve the AUC of C4.5: smoothing probability estimates by the Laplace correction and turning off pruning. The resulting algorithm is called C4.4 [17]. They compared C4.4 to C4.5 empirically and found that C4.4 is a significant improvement over C4.5 with regard to AUC.
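As a sketch of the first technique (our illustration), the Laplace correction replaces the raw leaf frequency k/N, for k examples of class c among the N examples in a leaf, with (k + 1)/(N + C), where C is the number of classes:

def leaf_probability(k, N, C=2, laplace=True):
    """Class probability at a decision tree leaf."""
    if laplace:
        return (k + 1) / (N + C)
    return k / N if N else 1 / C

# A pure leaf holding 3 examples of one class no longer claims certainty:
print(leaf_probability(3, 3))   # 0.8 with the Laplace correction, instead of 1.0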
Ling and Yan propose a method to calibrate the probability estimates generated by C4.5 [12]. Their method does not determine the class probability of an example E by the leaf into which it falls alone; instead, each leaf in the tree contributes to the probability estimate. Ferri, Flach and Hernandez-Orallo [5] present a novel algorithm for learning decision trees that is based on AUC rather than entropy. The resulting decision trees have better AUC without sacrificing accuracy.
4 Theoretical Analysis on the Performance of Naive Bayes in Ranking
Although naive Bayes performs well in classification, its learnability is quite limited. In the binary domain, it can learn only linearly separable functions [4]. Moreover, it cannot learn even all the linearly separable functions. For example, Domingos and Pazzani [3] discover that several specific linear functions, such as conjunctive concepts and m-of-n concepts, are not learnable by naive Bayes. In other words, naive Bayes is not optimal in learning those concepts. We find, however, that naive Bayes is optimal in ranking on both conjunctive concepts and m-of-n concepts. Here the optimality in ranking is defined as follows.
Definition 1. A classifier is called locally optimal on example E in ranking,
1. if E is a positive example, there is no negative example ranked after E; or
2. if E is a negative example, there is no positive example ranked before E.
Definition 2. A classifier is called globally optimal in ranking, if it is locally optimal on all the examples in the example space of a given problem.
When a classifier is globally optimal, the AUC of the ranking produced by it is always 1.
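To make this remark concrete, a small Python helper (ours, not from the paper) computes AUC as the fraction of correctly ordered positive-negative pairs, with ties counted as half:

def auc(pos_scores, neg_scores):
    """AUC over all positive-negative pairs."""
    pairs = [(p, q) for p in pos_scores for q in neg_scores]
    return sum((p > q) + 0.5 * (p == q) for p, q in pairs) / len(pairs)

print(auc([0.9, 0.8], [0.7, 0.1]))   # 1.0: every positive outranks every negative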
4.1 Conjunctive concepts
A conjunctive concept is a conjunction of n literals L_i, where a literal is a Boolean attribute or its negation. It has been shown that naive Bayes, as a classifier, is optimal in learning conjunctive concepts if examples are uniformly distributed and the training set includes all the 2^n possible examples [3]. Let + and - denote the class of C = 1 (true) and the class of C = 0 (false), respectively. In the training set, only the one example with L_1 = L_2 = ... = L_n = 1 is in class +. Thus,

$$p(+) = \frac{1}{2^n}, \quad p(-) = \frac{2^n - 1}{2^n}, \quad p(L_i \mid +) = 1, \quad p(\bar{L}_i \mid +) = 0,$$

$$p(\bar{L}_i \mid -) = \frac{2^{n-1}}{2^n - 1}, \quad p(L_i \mid -) = \frac{2^{n-1} - 1}{2^n - 1}.$$

Assume that E is an arbitrary example and m is the number of its conjunction literals that are true. Then the class probability estimates given by naive Bayes are
$$p_{nb}(+ \mid E) = p(+) \, p^m(L_i \mid +) \, p^{n-m}(\bar{L}_i \mid +) = \begin{cases} \frac{1}{2^n} & \text{if } m = n \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

and

$$p_{nb}(- \mid E) = p(-) \, p^m(L_i \mid -) \, p^{n-m}(\bar{L}_i \mid -) = \frac{2^n - 1}{2^n} \left( \frac{2^{n-1} - 1}{2^n - 1} \right)^m \left( \frac{2^{n-1}}{2^n - 1} \right)^{n-m}. \qquad (3)$$
It is easy to show that naive Bayes gives the correct classification for all examples. Let us consider the ranking produced by naive Bayes. For a positive example E_+, we have m = n, so the probability p_nb(+|E_+) is 1/2^n. For any negative example E_-, m < n, and p_nb(+|E_-) = 0 < 1/2^n = p_nb(+|E_+). That means naive Bayes never ranks a positive example before a negative example in the class probability based ranking. Naive Bayes is therefore optimal for conjunctive concepts under the uniform distribution.
If the assumption that examples are uniformly distributed is removed, naive Bayes still gives the correct classification for all the examples in class -, given a sufficient training set. For a positive example (m = n), however, the result depends on the class distribution: if p(+) < 1/2^n, naive Bayes may fail to assign the correct class to a positive example. That means naive Bayes is not optimal in classification if the example distribution is not uniform. However, no matter what the value of p(+) is, p_nb(+|E_-) = 0 and p_nb(+|E_+) = p(+) > 0. Therefore, naive Bayes is still optimal for conjunctive concepts in ranking, as shown in the theorem below.
Theorem 1. Naive Bayes is globally optimal in ranking on conjunctive concepts.
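Theorem 1 is easy to check by brute force; the sketch below (ours, with illustrative names, not part of the proof) trains naive Bayes on all 2^n examples of a conjunction and verifies that every positive example outranks every negative one:

from itertools import product

n = 4
examples = list(product([0, 1], repeat=n))
label = {e: int(all(e)) for e in examples}       # the conjunction L1 AND ... AND Ln
pos = [e for e in examples if label[e]]

def nb_score(e):
    """Unnormalized naive Bayes score p(+) * prod_i p(a_i|+), counted from
    the full example space (the uniform distribution)."""
    s = len(pos) / len(examples)
    for i, a in enumerate(e):
        s *= sum(x[i] == a for x in pos) / len(pos)
    return s

assert min(nb_score(e) for e in examples if label[e]) > \
       max(nb_score(e) for e in examples if not label[e])   # AUC = 1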
4.2 m-of-n concepts
An m-of-n concept is a Boolean function that is true if m or more of its n Boolean attributes are true. Clearly, it is a linearly separable function. Domingos and Pazzani [3] show that, when the input Boolean attributes are uniformly distributed, naive Bayes cannot learn the concept 8-of-25 as a classifier. For an m-of-n concept under the uniform distribution, we have

$$p(+) = \frac{\sum_{i=m}^{n} \binom{n}{i}}{2^n}, \qquad p(-) = \frac{\sum_{i=0}^{m-1} \binom{n}{i}}{2^n},$$

$$p(A_i = 1 \mid +) = \frac{\sum_{j=m-1}^{n-1} \binom{n-1}{j}}{\sum_{j=m}^{n} \binom{n}{j}}, \qquad p(A_i = 1 \mid -) = \frac{\sum_{j=0}^{m-2} \binom{n-1}{j}}{\sum_{j=0}^{m-1} \binom{n}{j}}.$$
Let q denote p(A_i = 1|+). Obviously, q > 0.5. The class probability estimate produced by naive Bayes, denoted by p_nb(+|E), is

$$p_{nb}(+ \mid E) = p(+) \, q^{i} (1 - q)^{n-i},$$

where i is the number of attributes of E with value 1.
Now let us consider the ranking performance of naive Bayes on m-of-n concepts. Assume that E_+ is a positive example with k_1 attributes of value 1, and that E_- is a negative example with k_2 attributes of value 1. Obviously, k_1 >= m > k_2. Then we have

$$p_{nb}(+ \mid E_+) - p_{nb}(+ \mid E_-) = p(+) \, q^{k_2} (1 - q)^{n - k_1} \left( q^{k_1 - k_2} - (1 - q)^{k_1 - k_2} \right). \qquad (4)$$

Since q > 0.5 and k_1 > k_2, Equation 4 is always positive. Thus, for m-of-n concepts, the class probability of a positive example is always greater than that of a negative example under naive Bayes. Therefore, the ranking generated by naive Bayes is optimal, as shown in the following theorem.
Theorem 2. Naive Bayes is globally optimal in ranking on m-of-n concepts.
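The same brute-force check works for m-of-n concepts (again our illustrative sketch, not part of the proof):

from itertools import product

m, n = 3, 5
examples = list(product([0, 1], repeat=n))
label = {e: int(sum(e) >= m) for e in examples}  # the m-of-n concept
pos = [e for e in examples if label[e]]

def nb_score(e):
    """Unnormalized naive Bayes score p(+) * prod_i p(a_i|+)."""
    s = len(pos) / len(examples)
    for i, a in enumerate(e):
        s *= sum(x[i] == a for x in pos) / len(pos)
    return s

assert min(nb_score(e) for e in examples if label[e]) > \
       max(nb_score(e) for e in examples if not label[e])   # AUC = 1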
4.3 General Optimality of Naive Bayes
The two example problems in the preceding sections are quite surprising, since it is known that, as a classifier, naive Bayes cannot learn all m-of-n concepts under the uniform distribution and cannot learn conjunctive concepts under some non-uniform distributions. The rankings generated by naive Bayes, however, are optimal in both problems. This gives us evidence that naive Bayes performs well in ranking, on some problems even better than in classification.
In the following discussion, we assume that the prior probabilities p(E) of all examples E are equal. Since p(+|E) = p(+)p(E|+)/p(E), the ranking is then determined by p(E|+).
Now let us consider the general case. Assume that E_+ is a positive example and E_- is a negative example, so p(E_+|+) > p(E_-|+). Let p_nb(E_i|+), i = +, -, denote the probability estimates generated by naive Bayes, and let x and y denote the errors of the probability estimates on E_+ and E_- given by naive Bayes. That is,

$$x = p(E_+ \mid +) - p_{nb}(E_+ \mid +),$$
$$y = p(E_- \mid +) - p_{nb}(E_- \mid +).$$

Naive Bayes generates the correct order for E_+ and E_- if p_nb(E_+|+) > p_nb(E_-|+), that is, if

$$y - x + (p(E_+ \mid +) - p(E_- \mid +)) > 0. \qquad (5)$$
Assuming that x and y are uniformly distributed, we plot a figure in which x and y correspond to the horizontal and vertical axes respectively, as shown in Figure 1. The shaded area corresponds to the cases in which Equation 5 is true. Since p(E_+|+) > p(E_-|+), naive Bayes is optimal in more than half of the possible area. It is easy to calculate the size of the shaded area, denoted by A:

$$A = -\frac{1}{2} \left( (p(E_+ \mid +) - p(E_- \mid +)) - 2 \right)^2 + 4. \qquad (6)$$

It is interesting to notice that the greater the difference between p(E_+|+) and p(E_-|+), the greater the chance that naive Bayes is optimal. For example, when p(E_+|+) - p(E_-|+) = 0.5, the probability that naive Bayes is optimal is A/4 = 0.71875.
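This value can be checked numerically; the sketch below (ours, with illustrative names) compares the closed form of Equation 6 with a Monte Carlo estimate of how often uniformly drawn errors (x, y) satisfy Equation 5:

import random

def optimal_fraction(d, trials=200_000):
    hits = 0
    for _ in range(trials):
        x = random.uniform(-1, 1)        # estimation error on E+
        y = random.uniform(-1, 1)        # estimation error on E-
        hits += (y - x + d) > 0          # Equation 5
    return hits / trials

d = 0.5                                  # p(E+|+) - p(E-|+)
print((-0.5 * (d - 2) ** 2 + 4) / 4)     # Equation 6 over the total area 4: 0.71875
print(optimal_fraction(d))               # approximately 0.719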
Now let us assume that all the dependences among attributes are complete. An attribute A_i is said to depend on A_j completely if A_i = A_j. If A_i = A_j and all other attributes are independent, the true probability p(E|+) for an example E = (a_1, a_2, ..., a_n) is

$$p(E \mid +) = p(a_i \mid +) \prod_{k \neq i, j} p(a_k \mid +).$$

The probability p_nb(E|+) given by naive Bayes is

$$p_{nb}(E \mid +) = p(a_i \mid +)^2 \prod_{k \neq i, j} p(a_k \mid +).$$
Fig. 1. The optimality of naive Bayes in the general case: the estimation errors x and y range over [-1, 1], d = p(E_-|+) - p(E_+|+), and the shaded area above the line y = x + d corresponds to the region in which naive Bayes is optimal.
Given two examples E_+ = (a_1^+, a_2^+, ..., a_n^+) and E_- = (a_1^-, a_2^-, ..., a_n^-) belonging to the positive and negative class respectively, we have

$$p(E_+ \mid +) = p(a_i^+ \mid +) \prod_{k \neq i, j} p(a_k^+ \mid +) > p(E_- \mid +) = p(a_i^- \mid +) \prod_{k \neq i, j} p(a_k^- \mid +).$$
It is easy to show that if p(a_i^+|+) >= 0.5, then p_nb(E_+|+) > p_nb(E_-|+). Since E_+ is a positive example, it is reasonable to assume that p(a_i^+|+) >= 0.5. We formalize this property of an attribute value in the following definition.

Definition 3. A value a_i of attribute A_i is called indicative to class c, if p(A_i = a_i|c) >= p(A_i = \bar{a}_i|c), where \bar{a}_i is the value of A_i other than a_i.
For example, for m-of-n concepts, p(A_i = 1|+) > p(A_i = 0|+) for any attribute, so A_i = 1 is indicative to class +. If all the attribute values of an example are indicative, naive Bayes always gives the optimal ranking for it, as the theorem below shows.
Theorem 3. Naive Bayes is optimal on example E = (a_1, a_2, ..., a_n) in ranking, if each attribute value of E is indicative to class +.
Proof. By induction on i, the number of pairs of attributes with complete dependence.

When i = 1, the claim is true by the preceding discussion. Assume that the claim is true when i = k; that is, if there are k complete dependences among attributes and p(E_+|+) > p(E_-|+), then p_nb(E_+|+) > p_nb(E_-|+), where E_+ = (a_1^+, a_2^+, ..., a_n^+) and E_- = (a_1^-, a_2^-, ..., a_n^-) belong to the positive and negative class respectively. Consider i = k + 1, and assume that the new complete dependence is between A_{n-1} and A_n. Then p(E_+|+) > p(E_-|+). Since A_{n-1} = A_n,

$$p(E_+ \mid +) = p(E_+ - \{A_{n-1}\} \mid +) = p(a_1^+, \ldots, a_{n-2}^+, a_n^+ \mid +),$$
$$p(E_- \mid +) = p(E_- - \{A_{n-1}\} \mid +) = p(a_1^-, \ldots, a_{n-2}^-, a_n^- \mid +).$$
Since there are only k dependences among A_1, ..., A_{n-2}, A_n, by the induction hypothesis,

$$p_{nb}(a_1^+, \ldots, a_{n-2}^+, a_n^+ \mid +) > p_{nb}(a_1^-, \ldots, a_{n-2}^-, a_n^- \mid +).$$

Thus, we have

$$\prod_{i=1, i \neq n-1}^{n} p(a_i^+ \mid +) > \prod_{i=1, i \neq n-1}^{n} p(a_i^- \mid +).$$

Since all the attribute values of E are indicative, p(a_{n-1}^+|+) > p(a_{n-1}^-|+). Then we have

$$\prod_{i=1}^{n} p(a_i^+ \mid +) > \prod_{i=1}^{n} p(a_i^- \mid +).$$

Therefore, p_nb(E_+|+) > p_nb(E_-|+).
Theorem 3 presents a sufficient condition for the local optimality of naive Bayes. Notice that even when all the attribute values of an example are indicative, naive Bayes may still give a wrong classification.
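A toy numeric sketch (ours, with arbitrary illustrative values) shows the effect: duplicating an attribute makes naive Bayes square its factor, yet when the value of the duplicated attribute is indicative, the order of the two examples is preserved.

def true_score(p_ai, rest):
    """p(E|+) = p(a_i|+) times the product of the independent factors."""
    return p_ai * rest

def nb_score(p_ai, rest):
    """p_nb(E|+) counts the duplicated attribute twice."""
    return p_ai ** 2 * rest

# E+ carries an indicative value (0.8 >= 0.5); E- carries the other value (0.2):
assert true_score(0.8, 0.5) > true_score(0.2, 0.5)
assert nb_score(0.8, 0.5) > nb_score(0.2, 0.5)   # the ranking is unchanged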
5 Conclusion
In this paper, we argue that naive Bayes performs well in ranking, just as it does in classification. We compare naive Bayes empirically with the state-of-the-art decision tree learning algorithm C4.4 in terms of ranking, measured by AUC, and our experiments show that naive Bayes has some advantage over C4.4. We investigate two example problems theoretically, conjunctive concepts and m-of-n concepts, which were used to analyze the classification performance of naive Bayes in [3]. Surprisingly, naive Bayes works perfectly on both problems with respect to ranking, although it does not perform perfectly in terms of classification. For more general cases, we propose a sufficient condition for the local optimality of naive Bayes in ranking.

Generally, the performance of naive Bayes in ranking is similar to that in classification, in the sense that both tolerate errors in class probability estimation to some extent. It would be interesting to know which one tolerates error to a higher extent. Our conjecture is that, for naive Bayes, it might be ranking.
References
1. Bennett, P. N.: Assessing the calibration of naive Bayes' posterior estimates. Technical Report No. CMU-CS-00-155 (2000)
2. Bradley, A. P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30 (1997) 1145-1159
3. Domingos, P., Pazzani, M.: Beyond independence: conditions for the optimality of the simple Bayesian classifier. Machine Learning 29 (1997) 103-130
13. Merz, C., Murphy, P., Aha, D.: UCI repository of machine learning databases. Dept. of ICS, University of California, Irvine (1997). http://www.ics.uci.edu/~mlearn/MLRepository.html
14. Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., Brunk, C.: Reducing misclassification costs. Proceedings of the 11th International Conference on Machine Learning. Morgan Kaufmann (1994) 217-225
15. Provost, F., Fawcett, T.: Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. AAAI Press (1997) 43-48
16. Provost, F., Fawcett, T., Kohavi, R.: The case against accuracy estimation for comparing induction algorithms. Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann (1998) 445-453
17. Provost, F. J., Domingos, P.: Tree induction for probability-based ranking. Machine Learning 52(3) (2003) 199-215
18. Swets, J.: Measuring the accuracy of diagnostic systems. Science 240 (1988) 1285-1293
19. Zadrozny, B., Elkan, C.: Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. Proceedings of the Eighteenth International Conference on Machine Learning. Morgan Kaufmann (2001) 609-616
20. Witten, I. H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann (2000)