GAUSSIAN NAIVE BAYES CLASSIFIER IMPLEMENTATION IN PYTHON

Building Gaussian Naive Bayes Classifier in Python

In this post, we are going to implement the Naive Bayes classifier in Python using my favorite machine learning library, scikit-learn. We will then use the trained Naive Bayes model (supervised classification) to predict census income.

We discussed Bayes' theorem in the naive Bayes classifier post, so we hope you already know the basics of the theorem. If not, let's quickly look at them.

Bayes' theorem is based on conditional probability. Conditional probability helps us calculate the probability that something will happen, given that something else has already happened. Still not clear? Let's understand it with a few examples.

Conditional Probability Examples

Below are a few examples that help clarify the definition of conditional probability.

  • Purchasing a MacBook when you have already purchased an iPhone.
  • Having a refreshing drink when you are in the movie theater.
  • Buying peanuts when you bought a chilled soft drink.

[Figure: Conditional probability example]

The naive Bayes classifier works on top of Bayes' theorem. It assumes that all the features are independent of each other. Even if the features depend on each other or on the existence of other features, the naive Bayes classifier treats each of these properties as contributing independently to the probability that the user buys the MacBook.

To learn the key concepts related to Naive Bayes, you can read our article Introduction to Naive Bayes; it will help you understand the core concepts.

In the introduction to Naive Bayes post, we discussed three popular Naive Bayes algorithms:

  • Gaussian Naive Bayes
  • Multinomial Naive Bayes
  • Bernoulli Naive Bayes

Continuing from the Naive Bayes algorithm article, we are now going to implement Gaussian Naive Bayes on the "Census Income" dataset.

Gaussian Naive Bayes

The Gaussian Naive Bayes algorithm is a special type of NB algorithm. It's specifically used when the features have continuous values. It also assumes that all the features follow a Gaussian distribution, i.e., a normal distribution.

Census Income Dataset

The Census Income dataset is used to predict whether a person's income is >$50K/yr (greater than $50K per year) or <=$50K/yr. The data was collected by Barry Becker from the 1994 Census dataset.

The dataset was contributed to the UCI repository, and it's openly available at this link. It consists of 15 columns, a mix of discrete and continuous data.

Variable Name: Variable Range

1. age: [17 – 90]
2. workclass: ['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov', 'Local-gov', 'Self-emp-inc', 'Without-pay', 'Never-worked']
3. fnlwgt: [77516 – 257302]
4. education: ['Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college', 'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school', '5th-6th', '10th', '1st-4th', 'Preschool', '12th']
5. education_num: [1 – 16]
6. marital_status: ['Never-married', 'Married-civ-spouse', 'Divorced', 'Married-spouse-absent', 'Separated', 'Married-AF-spouse', 'Widowed']
7. occupation: ['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners', 'Prof-specialty', 'Other-service', 'Sales', 'Craft-repair', 'Transport-moving', 'Farming-fishing', 'Machine-op-inspct', 'Tech-support', 'Protective-serv', 'Armed-Forces', 'Priv-house-serv']
8. relationship: ['Not-in-family', 'Husband', 'Wife', 'Own-child', 'Unmarried', 'Other-relative']
9. race: ['White', 'Black', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other']
10. sex: ['Male', 'Female']
11. capital_gain: [0 – 99999]
12. capital_loss: [0 – 4356]
13. hours_per_week: [1 – 99]
14. native_country: ['United-States', 'Cuba', 'Jamaica', 'India', 'Mexico', 'South', 'Puerto-Rico', 'Honduras', 'England', 'Canada', 'Germany', 'Iran', 'Philippines', 'Italy', 'Poland', 'Columbia', 'Cambodia', 'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal', 'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala', 'China', 'Japan', 'Yugoslavia', 'Peru', 'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago', 'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary', 'Holand-Netherlands']
15. income: ['<=50K', '>50K']

The final target variable consists of two values: '<=50K' and '>50K'.

Implementation of Gaussian NB on Census Income dataset

Importing Python Machine Learning Libraries

We need to import the pandas, numpy and sklearn libraries. From sklearn, we need to import preprocessing modules like Imputer, which helps impute missing values.

If you haven't set up the Python machine learning libraries yet, you can complete that setup first so you can run the code in this article.

The train_test_split module is for splitting the dataset into training and testing sets. The accuracy_score module will be used for calculating the accuracy of our Gaussian Naive Bayes algorithm.
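A minimal sketch of these imports (assuming a recent scikit-learn, where train_test_split lives in sklearn.model_selection; older releases kept it in sklearn.cross_validation and offered Imputer in sklearn.preprocessing):

```python
import numpy as np
import pandas as pd

from sklearn import preprocessing                     # LabelEncoder lives here
from sklearn.model_selection import train_test_split  # data slicing
from sklearn.metrics import accuracy_score            # model evaluation
from sklearn.naive_bayes import GaussianNB            # the classifier itself
```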

Data Import

For importing the census data, we are using the pandas read_csv() method, a very simple and fast way to import data.

We are passing four parameters. The 'adult.data' parameter is the file name. The header parameter tells pandas whether the first row of the data consists of headers or not; our dataset has no header, so we are passing None.

The delimiter parameter tells pandas which delimiter separates the data values. Here, we are using the regular expression ' *, *' as the delimiter, which discards any spaces before and after the comma-separated values. This is very helpful when the spacing around data values is inconsistent. A regex delimiter also forces the Python parser, which is why we pass engine='python' as the fourth parameter.
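A sketch of this step, assuming the file 'adult.data' sits in the working directory:

```python
# Regex delimiter ' *, *' strips spaces around commas in the raw file.
adult_df = pd.read_csv('adult.data',
                       header=None,        # the file has no header row
                       delimiter=' *, *',  # tolerate inconsistent spacing
                       engine='python')    # required for a regex delimiter
```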

Let’s add headers to our dataframe. The below code snippet can be used to perform this task.
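A sketch, using the column names from the dataset table above:

```python
adult_df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                    'marital_status', 'occupation', 'relationship', 'race',
                    'sex', 'capital_gain', 'capital_loss', 'hours_per_week',
                    'native_country', 'income']
```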

Handling Missing Data

Let's test whether there are any null values in our dataset. We can do this using the isnull() method.
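For example:

```python
# Count missing (NaN) values per column; all zeros means no nulls.
adult_df.isnull().sum()
```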

Script Output

The above output shows that there is no “null” value in our dataset.

Let's also test whether any categorical attribute contains a "?", since at times a "?" or " " appears in place of a missing value. Using the below code snippet, we are going to test whether our adult_df data frame contains categorical variables with the value "?".
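A sketch of that check:

```python
# Count how many '?' placeholders each categorical column contains.
for col in ['workclass', 'education', 'marital_status', 'occupation',
            'relationship', 'race', 'sex', 'native_country', 'income']:
    print(col, ':', (adult_df[col] == '?').sum())
```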

Script Output

The output of the above code snippet shows that there are 1836 missing values in the workclass attribute, 1843 missing values in the occupation attribute, and 583 missing values in the native_country attribute.

Data preprocessing

For preprocessing, we are going to make a duplicate copy of our original dataframe, duplicating adult_df into the adult_df_rev dataframe.
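A sketch of the duplication step:

```python
# Deep copy, so changes to adult_df_rev leave adult_df untouched.
adult_df_rev = adult_df.copy(deep=True)
```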

We want to perform some imputation for the missing values. Before doing that, we need some summary statistics of our dataframe. For this, we can use the describe() method, which generates various summary statistics, excluding NaN values.

We are passing the include parameter with the value "all" to specify that we want summary statistics for all the attributes.
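In code:

```python
# Summary statistics for all columns, categorical ones included.
adult_df_rev.describe(include='all')
```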

You can check the basic statistics about the dataset after running the above command. It's worth spending some time here to understand each statistic provided.

Data Imputation Step

Now, it's time to impute the missing values. Some of our categorical attributes have missing values marked as "?". We are going to replace each "?" with the top (most frequent) value from the describe() output above. For example, we are going to replace the "?" values of the workclass attribute with the "Private" value.
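A sketch of the imputation loop; the most frequent value of each column matches the "top" row of the describe() output:

```python
for col in ['workclass', 'occupation', 'native_country']:
    top_value = adult_df_rev[col].describe()['top']   # most frequent value
    adult_df_rev[col] = adult_df_rev[col].replace('?', top_value)
```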

We have performed the data imputation step. 🙂
You can check the changes in the dataframe by printing adult_df_rev.

For naive Bayes, we need to convert all the data values into one format. We are going to encode all the labels with values between 0 and n_classes-1.

One-Hot Encoder

For implementing this, we are going to use the LabelEncoder from the scikit-learn library. For encoding, we could also use the One-Hot encoder, which encodes the data into a binary format.
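A sketch of the label encoding step, following the post's "_cat" column naming convention:

```python
le = preprocessing.LabelEncoder()

# Integer-encode each categorical column into a new *_cat column.
adult_df_rev['workclass_cat'] = le.fit_transform(adult_df_rev['workclass'])
adult_df_rev['education_cat'] = le.fit_transform(adult_df_rev['education'])
adult_df_rev['marital_cat'] = le.fit_transform(adult_df_rev['marital_status'])
adult_df_rev['occupation_cat'] = le.fit_transform(adult_df_rev['occupation'])
adult_df_rev['relationship_cat'] = le.fit_transform(adult_df_rev['relationship'])
adult_df_rev['race_cat'] = le.fit_transform(adult_df_rev['race'])
adult_df_rev['sex_cat'] = le.fit_transform(adult_df_rev['sex'])
adult_df_rev['native_country_cat'] = le.fit_transform(adult_df_rev['native_country'])

# Drop the original text columns, keeping only the encoded versions.
adult_df_rev = adult_df_rev.drop(['workclass', 'education', 'marital_status',
                                  'occupation', 'relationship', 'race', 'sex',
                                  'native_country'], axis=1)
```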

Using the above code snippet, we have created multiple encoded columns like "marital_cat", "race_cat", etc. You can see the top rows of the dataframe using adult_df_rev.head(). By printing the result, you will see that the columns are not in the proper order and need to be reindexed. For reindexing the columns, you can use the code snippet provided below:
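A sketch, assuming the "_cat" column names created above:

```python
column_names = ['age', 'workclass_cat', 'fnlwgt', 'education_cat',
                'education_num', 'marital_cat', 'occupation_cat',
                'relationship_cat', 'race_cat', 'sex_cat', 'capital_gain',
                'capital_loss', 'hours_per_week', 'native_country_cat',
                'income']
adult_df_rev = adult_df_rev.reindex(column_names, axis=1)
```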

The output of the above code snippet will show you that all the columns are reindexed properly. I have passed the list of column names as a parameter and axis=1 for reindexing the columns.

Standardization of Data

All the data values in our dataframe are now numeric, but we still need to bring them onto a single scale by standardizing them. We can use the below formula for standardization.

x_i = \frac{x_i - \mathrm{mean}(x)}{\sigma(x)}

Using the below code snippet, we can convert our data values into standardized values. You can print and check the output of the dataframe afterwards.
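A sketch of that loop (leaving the income target column as-is):

```python
num_features = ['age', 'workclass_cat', 'fnlwgt', 'education_cat',
                'education_num', 'marital_cat', 'occupation_cat',
                'relationship_cat', 'race_cat', 'sex_cat', 'capital_gain',
                'capital_loss', 'hours_per_week', 'native_country_cat']

# Standardize each predictor: subtract the mean, divide by the std dev.
for col in num_features:
    mean, std = adult_df_rev[col].mean(), adult_df_rev[col].std()
    adult_df_rev[col] = (adult_df_rev[col] - mean) / std
```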

Data Slicing

Let's split the data into training and test sets. We can easily perform this step using sklearn's train_test_split() method.
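A sketch of the slicing step; the test_size and random_state values here are just reasonable choices:

```python
# The first 14 columns are the predictors, the last column is the target.
features = adult_df_rev.values[:, :14]
target = adult_df_rev.values[:, 14]

# Hold out a third of the data for testing; random_state makes the
# split reproducible.
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.33, random_state=10)
```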

Using the above code snippet, we have divided the data into a feature set and a target set. The feature set consists of 14 columns (the predictor variables) and the target set consists of one column with the class values.

The features_train & target_train consist of the training data, and the features_test & target_test consist of the testing data.

Gaussian Naive Bayes Implementation

Having completed the data preprocessing, it's time to implement the machine learning algorithm on it. We are going to use sklearn's GaussianNB module.
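A sketch of training and prediction:

```python
clf = GaussianNB()                        # build the classifier
clf.fit(features_train, target_train)     # train it on the training set
target_pred = clf.predict(features_test)  # predict on the test features
```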

We have built a GaussianNB classifier and trained it on the training data using the fit() method. After building the classifier, our model is ready to make predictions; we can use the predict() method with the test set features as its parameter.

Accuracy of our Gaussian Naive Bayes model

It's time to test the quality of our model. We have made some predictions, so let's compare the model's predictions with the actual target values for the test set. This comparison gives us the accuracy of our model.
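In code:

```python
# Fraction of test samples classified correctly.
accuracy_score(target_test, target_pred, normalize=True)
```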

Script Output

Awesome! Our model gives an accuracy of around 80%, which is not bad for such a simple implementation. You can create random test datasets and test the model to get to know how well the trained Gaussian Naive Bayes model performs.

Finally, we can save the trained scikit-learn model with Python pickle; you can check out our post on how to save the trained scikit-learn model with Python pickle.

Source: http://dataaspirant.com/2017/02/20/gaussian-naive-bayes-classifier-implementation-python/
