An empirical study of the naive Bayes classifier
I. Rish
 
T.J. Watson Research Center
30 Saw Mill River Road, Hawthorne, NY 10532. Phone +1 (914) 784-7431
[email protected]
Abstract

The naive Bayes classifier greatly simplifies learning by assuming that features are independent given class. Although independence is generally a poor assumption, in practice naive Bayes often competes well with more sophisticated classifiers. Our broad goal is to understand the data characteristics which affect the performance of naive Bayes. Our approach uses Monte Carlo simulations that allow a systematic study of classification accuracy for several classes of randomly generated problems. We analyze the impact of the distribution entropy on the classification error, showing that low-entropy feature distributions yield good performance of naive Bayes. We also demonstrate that naive Bayes works well for certain nearly-functional feature dependencies, thus reaching its best performance in two opposite cases: completely independent features (as expected) and functionally dependent features (which is surprising). Another surprising result is that the accuracy of naive Bayes is not directly correlated with the degree of feature dependencies measured as the class-conditional mutual information between the features. Instead, a better predictor of naive Bayes accuracy is the amount of information about the class that is lost because of the independence assumption.
1 Introduction

Bayesian classifiers assign the most likely class to a given example described by its feature vector. Learning such classifiers can be greatly simplified by assuming that features are independent given the class, that is, $P(\mathbf{X} \mid C) = \prod_{i=1}^{n} P(X_i \mid C)$, where $\mathbf{X} = (X_1, \ldots, X_n)$ is a feature vector and $C$ is a class. Despite this unrealistic assumption, the resulting classifier known as naive Bayes is remarkably successful in practice, often competing with much more sophisticated techniques [6; 8; 4; 2]. Naive Bayes has proven effective in many practical applications, including text classification, medical diagnosis, and systems performance management [2; 9; 5].
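To make the independence assumption concrete, here is a minimal sketch (not taken from the paper; the prior and the per-feature conditional tables are invented for illustration) of how a naive Bayes prediction is computed for a discrete example:

```python
# Minimal sketch (not from the paper): naive Bayes prediction for discrete
# features, assuming the per-feature conditionals P(X_i | C) and the prior
# P(C) are already known (the tables below are hypothetical).

prior = {0: 0.5, 1: 0.5}                      # P(C)
cond = {                                      # cond[i][c][x] = P(X_i = x | C = c)
    0: {0: {'a': 0.9, 'b': 0.1}, 1: {'a': 0.2, 'b': 0.8}},
    1: {0: {'u': 0.7, 'v': 0.3}, 1: {'u': 0.4, 'v': 0.6}},
}

def naive_bayes_predict(x):
    """Return argmax_c P(C=c) * prod_i P(X_i = x_i | C = c)."""
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        for i, xi in enumerate(x):
            score *= cond[i][c][xi]
        scores[c] = score
    return max(scores, key=scores.get)

print(naive_bayes_predict(('a', 'v')))        # -> 0
```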
©  " 100)' %"$¤
& (((& ¢  ¨ , features. Particularly, we show that naive Bayes works best in
where is a feature vector and is a class. two cases: completely independent features (as expected) and
Despite this unrealistic assumption, the resulting classifier functionally dependent features (which is less obvious), while
known as naive Bayes is remarkably successful in practice, reaching its worst performance between these extremes.
often competing with much more sophisticated techniques [6; We also show that, surprisingly, the accuracy of naive
8; 4; 2]. Naive Bayes has proven effective in many practical Bayes is not directly correlated with the degree of feature de-
applications, including text classification, medical diagnosis, pendencies measured as the class-conditional mutual infor-
and systems performance management [2; 9; 5]. mation between the features,
© ¦7 6 5
 " ¨ 98" 4 %"¢
( and
 "
are
2
T.J. Watson Research Center, 30 Saw Mill River Road, features and
¨ is the class). Instead, our experiments re-
Hawthorne, NY 10532. Phone +1 (914) 784-7431 veal that a better predictor of naive Bayes accuracy can be
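The following sketch illustrates this bias-only methodology under stated assumptions (two ternary features, a binary class, problem instances drawn as random joint distributions; the names random_joint and exact_errors are ours, not the paper's): because the full joint $P(\mathbf{x}, c)$ is known, both the Bayes error and the naive Bayes error can be computed exactly, without sampling any training data.

```python
import itertools, random

# Illustrative sketch (not the paper's code): with the full joint P(x, c) in
# hand, the bias of naive Bayes can be measured exactly -- no training data is
# sampled, so there is no variance component.  Problem instances are random
# joint distributions over two ternary features and a binary class.

def random_joint(states, n_classes, rng):
    w = [rng.random() for _ in range(len(states) * n_classes)]
    z = sum(w)
    return {(x, c): wi / z
            for wi, (x, c) in zip(w, itertools.product(states, range(n_classes)))}

def exact_errors(joint, states, n_classes):
    # Bayes-optimal error: for each state keep only the most probable class.
    bayes_err = sum(sum(joint[x, c] for c in range(n_classes))
                    - max(joint[x, c] for c in range(n_classes)) for x in states)
    # Naive Bayes error: classify with P(c) * prod_i P(x_i | c).
    p_c = {c: sum(joint[x, c] for x in states) for c in range(n_classes)}
    def marg(i, v, c):
        return sum(joint[x, c] for x in states if x[i] == v) / p_c[c]
    def nb_class(x):
        return max(range(n_classes),
                   key=lambda c: p_c[c] * marg(0, x[0], c) * marg(1, x[1], c))
    nb_err = sum(sum(joint[x, c] for c in range(n_classes)) - joint[x, nb_class(x)]
                 for x in states)
    return bayes_err, nb_err

rng = random.Random(0)
states = list(itertools.product(range(3), range(3)))
errs = [exact_errors(random_joint(states, 2, rng), states, 2) for _ in range(200)]
print(sum(nb - b for b, nb in errs) / len(errs))   # average excess error (bias)
```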
We analyze the impact of the distribution entropy on the classification error, showing that certain almost-deterministic, or low-entropy, dependencies yield good performance of naive Bayes. (Note that almost-deterministic dependencies are a common characteristic in many practical problem domains, such as, for example, computer system management and error-correcting codes.) We show that the error of naive Bayes vanishes as the entropy $H(P(\mathbf{x} \mid c))$ approaches zero. Another class of almost-deterministic dependencies generalizes functional dependencies between the features. Particularly, we show that naive Bayes works best in two cases: completely independent features (as expected) and functionally dependent features (which is less obvious), while reaching its worst performance between these extremes.

We also show that, surprisingly, the accuracy of naive Bayes is not directly correlated with the degree of feature dependencies measured as the class-conditional mutual information between the features, $I(X_i; X_j \mid C)$ ($X_i$ and $X_j$ are features and $C$ is the class). Instead, our experiments reveal that a better predictor of naive Bayes accuracy can be the loss of information that features contain about the class when assuming the naive Bayes model, namely $I(\mathbf{X}; C) - I_{NB}(\mathbf{X}; C)$, where $I_{NB}(\mathbf{X}; C)$ is the mutual information between features and class under the naive Bayes assumption.
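As a hedged illustration of this quantity (the paper treats it formally later; here we take one concrete reading in which $I_{NB}$ is the mutual information computed after replacing $P(\mathbf{x} \mid c)$ by the factorized approximation $\prod_i P(x_i \mid c)$, and the example joint distribution below is made up):

```python
import itertools
from math import log2

# One concrete reading of the "information loss" I(X;C) - I_NB(X;C): the
# second term is the mutual information computed after replacing P(x|c) by
# the factorized approximation prod_i P(x_i|c).  The joint below is made up.

states = list(itertools.product(range(2), range(2)))
joint = {((0, 0), 0): 0.35, ((1, 1), 0): 0.15,
         ((0, 1), 1): 0.30, ((1, 0), 1): 0.20}
joint = {(x, c): joint.get((x, c), 0.0) for x in states for c in (0, 1)}

def mutual_information(p_xc):
    p_x = {x: p_xc[x, 0] + p_xc[x, 1] for x in states}
    p_c = {c: sum(p_xc[x, c] for x in states) for c in (0, 1)}
    return sum(p * log2(p / (p_x[x] * p_c[c]))
               for (x, c), p in p_xc.items() if p > 0)

def nb_joint(p_xc):
    # Joint distribution implied by the naive Bayes factorization.
    p_c = {c: sum(p_xc[x, c] for x in states) for c in (0, 1)}
    def marg(i, v, c):
        return sum(p_xc[x, c] for x in states if x[i] == v) / p_c[c]
    return {(x, c): p_c[c] * marg(0, x[0], c) * marg(1, x[1], c)
            for x in states for c in (0, 1)}

info_loss = mutual_information(joint) - mutual_information(nb_joint(joint))
print(round(info_loss, 4))
```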
This paper is structured as follows. In the next section we provide necessary background and definitions. Section 3 discusses naive Bayes performance for nearly-deterministic dependencies, while Section 4 demonstrates that the "information loss" criterion can be a better error predictor than the strength of feature dependencies. A summary and conclusions are given in Section 5.
2 Definitions and Background

Let $\mathbf{X} = (X_1, \ldots, X_n)$ be a vector of observed random variables, called features, where each feature $X_i$ takes values from its domain $D_i$. The set of all feature vectors (examples, or states) is denoted $\Omega = D_1 \times \cdots \times D_n$. Let $C$ be an unobserved random variable denoting the class of an example, where $C$ can take one of $m$ values $c \in \{0, \ldots, m-1\}$. Capital letters, such as $X_i$, will denote variables, while lower-case letters, such as $x_i$, will denote their values; boldface letters will denote vectors.

A function $g(\mathbf{x})$, where $g: \Omega \rightarrow \{0, \ldots, m-1\}$, denotes a concept to be learned. Deterministic $g(\mathbf{x})$ corresponds to a concept without noise, which always assigns the same class to a given example (e.g., disjunctive and conjunctive concepts are deterministic). In general, however, a concept can be noisy, yielding a random function $g(\mathbf{x})$.

A classifier is defined by a (deterministic) function $h: \Omega \rightarrow \{0, \ldots, m-1\}$ (a hypothesis) that assigns a class to any given example. A common approach is to associate each class $i$ with a discriminant function $f_i(\mathbf{x})$, $i = 0, \ldots, m-1$, and let the classifier select the class with maximum discriminant function on a given example: $h(\mathbf{x}) = \arg\max_{i \in \{0, \ldots, m-1\}} f_i(\mathbf{x})$.

The Bayes classifier $h^*(\mathbf{x})$ (that we also call the Bayes-optimal classifier) uses as discriminant functions the class posterior probabilities given a feature vector, i.e. $f_i^*(\mathbf{x}) = P(C = i \mid \mathbf{X} = \mathbf{x})$. Applying Bayes rule gives $P(C = i \mid \mathbf{X} = \mathbf{x}) = P(\mathbf{X} = \mathbf{x} \mid C = i) P(C = i) / P(\mathbf{X} = \mathbf{x})$, where $P(\mathbf{X} = \mathbf{x})$ is identical for all classes and therefore can be ignored. This yields the Bayes discriminant functions

$f_i^*(\mathbf{x}) = P(\mathbf{X} = \mathbf{x} \mid C = i)\, P(C = i)$,   (1)

where $P(\mathbf{X} = \mathbf{x} \mid C = i)$ is called the class-conditional probability distribution (CPD). Thus, the Bayes classifier

$h^*(\mathbf{x}) = \arg\max_i P(\mathbf{X} = \mathbf{x} \mid C = i)\, P(C = i)$   (2)

finds the maximum a posteriori probability (MAP) hypothesis given example $\mathbf{x}$. However, direct estimation of $P(\mathbf{X} = \mathbf{x} \mid C = i)$ from a given set of training examples is hard when the feature space is high-dimensional. Therefore, approximations are commonly used, such as the simplifying assumption that features are independent given the class. This yields the naive Bayes classifier $h_{NB}(\mathbf{x})$ defined by the discriminant functions

$f_i^{NB}(\mathbf{x}) = \prod_{j=1}^{n} P(X_j = x_j \mid C = i)\, P(C = i)$.   (3)

The probability of a classification error, or risk, of a classifier $h$ is defined as

$r(h) = P(h(\mathbf{x}) \neq g(\mathbf{x})) = E_{\mathbf{x}}[P(h(\mathbf{x}) \neq g(\mathbf{x}) \mid \mathbf{x})]$,

where $E_{\mathbf{x}}$ is the expectation over $\mathbf{x}$; $r(h^*)$ denotes the Bayes error (Bayes risk).

We say that a classifier $h$ is optimal on a given problem if its risk coincides with the Bayes risk. Assuming there is no noise (i.e. zero Bayes risk), a concept is called separable by a set of functions $f_i(\mathbf{x})$, $i = 0, \ldots, m-1$, if every example $\mathbf{x}$ is classified correctly when using each $f_i(\mathbf{x})$ as a discriminant function.
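A small sketch mapping equations (1)-(3) and the risk $r(h)$ to code may be helpful; the toy distribution below (two ternary features, two classes) is our own illustrative choice, not an example from the paper:

```python
from itertools import product

# Sketch of equations (1)-(3) for a toy problem with two ternary features and
# two classes; the class-conditional tables P(x|c) and the prior P(c) below
# are arbitrary illustrative numbers, not taken from the paper.

states = list(product(range(3), range(3)))
p_c = {0: 0.3, 1: 0.7}
p_x_given_c = {0: {x: 1 / 9 for x in states},                  # uniform CPD for class 0
               1: {x: (0.4 if x == (2, 2) else 0.6 / 8) for x in states}}

def f_star(i, x):            # eq (1): Bayes discriminant P(x|C=i) P(C=i)
    return p_x_given_c[i][x] * p_c[i]

def h_star(x):               # eq (2): MAP class
    return max(p_c, key=lambda i: f_star(i, x))

def f_nb(i, x):              # eq (3): naive Bayes discriminant
    marg = lambda k, v: sum(p for xx, p in p_x_given_c[i].items() if xx[k] == v)
    return p_c[i] * marg(0, x[0]) * marg(1, x[1])

def h_nb(x):
    return max(p_c, key=lambda i: f_nb(i, x))

# Risk r(h) = sum over (x, c) of P(x, c) where h(x) != c; the gap between
# r(h_nb) and r(h*) is the bias paid for the independence assumption.
def risk(h):
    return sum(p_c[c] * p_x_given_c[c][x]
               for c in p_c for x in states if h(x) != c)

print(risk(h_star), risk(h_nb))   # exact risks of the Bayes-optimal and naive Bayes classifiers
```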
As a measure of dependence between two features $X_i$ and $X_j$ we use the class-conditional mutual information [1], which can be defined as

$I(X_i; X_j \mid C) = H(X_i \mid C) + H(X_j \mid C) - H(X_i, X_j \mid C)$,

where $H(\mathbf{Z} \mid C)$ is the class-conditional entropy of $\mathbf{Z}$, defined as

$H(\mathbf{Z} \mid C) = -\sum_{c} P(C = c) \sum_{\mathbf{z}} P(\mathbf{z} \mid C = c) \log P(\mathbf{z} \mid C = c)$.

Mutual information is zero when $X_i$ and $X_j$ are mutually independent given class $C$, and increases with increasing level of dependence, reaching its maximum when one feature is a deterministic function of the other.
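A short sketch of this computation (the joint table below is a made-up example in which the two features are conditionally independent, so the quantity evaluates to zero):

```python
from itertools import product
from math import log2

# Class-conditional mutual information I(X1;X2|C) computed from a full joint
# table P(x1, x2, c); the example distribution is made up (uniform), so the
# two features are independent given the class.

joint = {(x1, x2, c): 1 / 8 for x1, x2, c in product(range(2), range(2), range(2))}

def cond_entropy(var_of):
    """H(Z|C) = -sum_c P(c) sum_z P(z|c) log P(z|c), with Z = var_of(x1, x2)."""
    p_c, p_zc = {}, {}
    for (x1, x2, c), p in joint.items():
        p_c[c] = p_c.get(c, 0.0) + p
        z = var_of(x1, x2)
        p_zc[z, c] = p_zc.get((z, c), 0.0) + p
    return -sum(p * log2(p / p_c[c]) for (z, c), p in p_zc.items() if p > 0)

i_cond = (cond_entropy(lambda x1, x2: x1) + cond_entropy(lambda x1, x2: x2)
          - cond_entropy(lambda x1, x2: (x1, x2)))
print(round(i_cond, 6))   # 0.0 for this conditionally independent example
```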
3 When does naive Bayes work well? Effects of some nearly-deterministic dependencies

In this section, we discuss known limitations of naive Bayes and then some conditions of its optimality and near-optimality, that include low-entropy feature distributions and nearly-functional feature dependencies.

3.1 Concepts without noise

We focus first on concepts with $P(C = i \mid \mathbf{x})$ equal to 0 or 1 for any $\mathbf{x}$ (i.e. no noise), which therefore have zero Bayes risk. The features are assumed to have finite domains (the $i$-th feature has $k_i$ values), and are often called nominal. (A nominal feature can be transformed into a numeric one by imposing an order on its domain.) Our attention will be restricted to binary classification problems where the class is either 0 or 1.

Some limitations of naive Bayes are well-known: in the case of binary features ($k_i = 2$ for all $i$), it can only learn linear discriminant functions [3], and thus it is always suboptimal for non-linearly separable concepts (the classical example is the XOR function; another one is $m$-of-$n$ concepts [7; 2]). When $k_i > 2$ for some features, naive Bayes is able to learn (some) polynomial discriminant functions [3]; thus, polynomial separability is a necessary, although not sufficient, condition of naive Bayes optimality for concepts with finite-domain features.
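The XOR case is easy to check directly; the following toy computation (ours, not the paper's experiment) evaluates the exact naive Bayes error for a noiseless XOR concept over two uniformly distributed binary features:

```python
from itertools import product

# Quick illustration of the XOR limitation mentioned above (our own toy check,
# not the paper's experiment): a noiseless XOR concept over two binary
# features, uniform over the four states, so the Bayes risk is zero.

states = list(product((0, 1), (0, 1)))
concept = {x: x[0] ^ x[1] for x in states}            # class = XOR of the features
p_x = {x: 0.25 for x in states}
p_c = {c: sum(p for x, p in p_x.items() if concept[x] == c) for c in (0, 1)}

def marg(i, v, c):   # P(X_i = v | C = c)
    return sum(p for x, p in p_x.items() if concept[x] == c and x[i] == v) / p_c[c]

def h_nb(x):
    return max((0, 1), key=lambda c: p_c[c] * marg(0, x[0], c) * marg(1, x[1], c))

nb_error = sum(p for x, p in p_x.items() if h_nb(x) != concept[x])
print(nb_error)   # 0.5 -- every state gets the same (tied) NB score, so half the mass is misclassified
```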
Despite its limitations, naive Bayes was shown to be optimal for some important classes of concepts that have a high degree of feature dependencies, such as disjunctive and conjunctive concepts [2]. These results can be generalized to concepts with any nominal features (see [10] for details):

Theorem 1 [10] The naive Bayes classifier is optimal for any two-class concept with nominal features that assigns class 0 to exactly one example, and class 1 to all other examples, with probability 1. (Clearly, this also holds in the case of a single example of class 1.)

The performance of naive Bayes degrades with an increasing number of class-0 examples (i.e., with increasing prior $P(C = 0)$, also denoted $P(0)$), as demonstrated in Figure 1a. This figure plots the average naive Bayes error computed over 1000 problem instances generated randomly for each value of $P(C = 0)$. The problem generator, called ZeroBayesRisk, assumes $n$ features (here we only consider two features), each having $k$ values, and varies the number of class-0 examples from 1 to $k^n/2$ (so that $P(0)$ varies from $1/k^n$ to 0.5; the results for $P(0) > 0.5$ are symmetric). Note that in all experiments perfect knowledge of the data distribution (i.e., an infinite amount of data) is assumed in order to avoid the effect of finite sample size. As expected, larger $P(0)$ (equivalently, a larger number of class-0 examples) yields a wider range of problems with various dependencies among the features, which results in increased errors of naive Bayes; a closer look at the data shows no other cases of optimality besides $P(C = 0) = 1/k^n$.
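A simplified sketch in the spirit of the ZeroBayesRisk generator described above (the details of how instances are drawn are our simplifying assumptions, not the paper's exact procedure): class 0 is assigned to a random subset of states of a given size, $P(\mathbf{x})$ is uniform, and the exact naive Bayes error is averaged over random instances. With a single class-0 state the error is zero, matching Theorem 1.

```python
import random
from itertools import product

# Simplified sketch in the spirit of ZeroBayesRisk (instance sampling here is
# a simplifying assumption): two features with k values each, uniform P(x)
# over the k^2 states, and a noiseless concept assigning class 0 to a random
# subset of states of a given size.

def nb_error(states, class0, p_x, k):
    concept = {x: 0 if x in class0 else 1 for x in states}
    p_c = {c: sum(p_x[x] for x in states if concept[x] == c) for c in (0, 1)}
    # class-conditional marginals P(X_i = v | C = c)
    marg = {(i, v, c): sum(p_x[x] for x in states if concept[x] == c and x[i] == v) / p_c[c]
            for i in (0, 1) for v in range(k) for c in (0, 1)}
    def h_nb(x):
        return max((0, 1), key=lambda c: p_c[c] * marg[0, x[0], c] * marg[1, x[1], c])
    return sum(p_x[x] for x in states if h_nb(x) != concept[x])

def average_error(k, n_class0, trials, rng):
    states = list(product(range(k), range(k)))
    p_x = {x: 1 / len(states) for x in states}
    return sum(nb_error(states, set(rng.sample(states, n_class0)), p_x, k)
               for _ in range(trials)) / trials

rng = random.Random(1)
for n0 in (1, 5, 20, 50):   # P(0) = n0 / k^2 grows with the number of class-0 states
    print(n0, round(average_error(10, n0, 200, rng), 4))
# n0 = 1 reproduces the optimality case of Theorem 1: the average error is exactly 0.
```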
Surprisingly, the strength of inter-feature dependencies, measured as the class-conditional mutual information $I(X_1; X_2 \mid C)$, is not a good predictor of naive Bayes performance: while the average naive Bayes error increases monotonically with $P(0)$, the mutual information is non-monotone, reaching its maximum around $P(0) = 0.1$. This observation is consistent with previous empirical results on UCI benchmarks [2] that also revealed low correlation between the degree of feature dependence and the relative performance of naive Bayes with respect to other classifiers, such as C4.5, CN2, and PEBLS.

It turns out that the entropy of the class-conditional marginal distribution, $P(X_1 \mid C = 0)$, is a better predictor of naive Bayes performance. Intuitively, low entropy of $P(X_1 \mid C = 0)$ means that most of the class-0 examples are "concentrated around" one state (in the limit, this yields the optimality condition stated by Theorem 1). Indeed, plotting the average entropy $H(P(X_1 \mid C = 0))$ in Figure 1a demonstrates that both the average error and the average entropy increase monotonically in $P(0)$. Further discussion of low-entropy distributions is given next in the more general context of noisy (non-zero Bayes risk) classification problems.

Figure 1: (a) Results for the generator ZeroBayesRisk (n=2, m=2, k=10, 1000 instances): average naive Bayes error (NBerr), class-conditional mutual information between the features, $I(X_1; X_2 \mid C)$, and entropy of the marginal distribution, $H(P(X_1 \mid C = 0))$, plotted against $P(0)$; the error bars correspond to the standard deviation of each measurement across the 1000 problem instances. (b) Results for the generator EXTREME: average Bayes and naive Bayes errors and average $I(X_1; X_2 \mid C)$ versus delta. (c) Results for the generator FUNC1: average difference between the naive Bayes error and the Bayes error ($\delta_i = \delta$, a constant for all $i$), and scaled $I(X_1; X_2 \mid C)$ (divided by 300), versus delta.

3.2 Noisy concepts

Low-entropy feature distributions

Generally, concepts can be noisy, i.e. can have a non-deterministic $P(C = i \mid \mathbf{x})$ and thus a non-zero Bayes risk. A natural extension of the conditions of Theorem 1 to noisy concepts yields low-entropy, or "extreme", probability distributions, having almost all the probability mass concentrated in one state. Indeed, as shown in [10], the independence assumption becomes more accurate with decreasing entropy, which yields an asymptotically optimal performance of naive Bayes. Namely,
Theorem 2 [10] Given that one of the following conditions holds:

1. a joint probability distribution $P(X_1, \ldots, X_n)$ is such that $P(X_1 = x_1^0, \ldots, X_n = x_n^0) \geq 1 - \delta$ for some state $(x_1^0, \ldots, x_n^0)$, or