import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
#Import the Datasist Library
import datasist as ds


#Read in data set
train_data = pd.read_csv('training.csv')
test_data = pd.read_csv('test.csv')

Quick summary of a data set using the describe function in the structdata module

ds.structdata.describe(train_data)

First five data points


Last five data points


Shape of  data set: (95662, 16)


Size of  data set: 1530592


Data Types
Note: All Non-numerical features are identified as objects in pandas


Column(s) {'TransactionStartTime'} should be in Datetime format. Use the [to_date] function in datasist.feature_engineering to coonvert to Pandas Datetime format


Numerical Features in Data set
['CountryCode', 'Amount', 'Value', 'PricingStrategy', 'FraudResult']


Statistical Description of Columns


Categorical Features in Data set

['TransactionId',
 'BatchId',
 'AccountId',
 'SubscriptionId',
 'CustomerId',
 'CurrencyCode',
 'ProviderId',
 'ProductId',
 'ProductCategory',
 'ChannelId',
 'TransactionStartTime']


Unique class Count of Categorical features


Missing Values in Data

Remove features that contains only one unique field as these features are redundant

#Drop redundant features
ds.feature_engineering.drop_redundant(data=train_data)
ds.feature_engineering.drop_redundant(data=test_data)

Dropped ['CurrencyCode', 'CountryCode']
Dropped ['CurrencyCode', 'CountryCode']

Check for missing values in dataset with the display function

EXPLORATION OF CATEGORICAL FEATURES

cat_feats = ds.structdata.get_cat_feats(train_data)

cat_feats

['TransactionId',
 'BatchId',
 'AccountId',
 'SubscriptionId',
 'CustomerId',
 'ProviderId',
 'ProductId',
 'ProductCategory',
 'ChannelId',
 'TransactionStartTime']

ds.structdata.get_unique_counts(train_data)

From the unique display output, we notice that the TransactionId and BatchId contains too many classes and thus we can drop them

train_data.drop(['TransactionId', 'BatchId'], axis=1, inplace=True)
test_data.drop(['TransactionId', 'BatchId'], axis=1, inplace=True)

VISUALIZATION FOR CATEGORICAL FEATURES

ds.visualizations.countplot(train_data)

Unique Values in AccountId is too large to plot


Unique Values in SubscriptionId is too large to plot


Unique Values in CustomerId is too large to plot


Unique Values in TransactionStartTime is too large to plot

ds.visualizations.class_count(train_data)

Unique classes in AccountId too large
Unique classes in SubscriptionId too large
Unique classes in CustomerId too large
Class Count for ProviderId

Unique classes in ProductId too large
Class Count for ProductCategory

Class Count for ChannelId

Unique classes in TransactionStartTime too large

train_data.columns

Index(['AccountId', 'SubscriptionId', 'CustomerId', 'ProviderId', 'ProductId',
       'ProductCategory', 'ChannelId', 'Amount', 'Value',
       'TransactionStartTime', 'PricingStrategy', 'FraudResult'],
      dtype='object')

ds.visualizations.catbox(data=train_data, target='FraudResult', fig_size=(7,7))

TransactionId feature has too many categories and will not be ploted
BatchId feature has too many categories and will not be ploted
AccountId feature has too many categories and will not be ploted
SubscriptionId feature has too many categories and will not be ploted
CustomerId feature has too many categories and will not be ploted
ProductId feature has too many categories and will not be ploted
TransactionStartTime feature has too many categories and will not be ploted

VISUALIZATION OF NUMERICAL FEATURES

ds.visualizations.histogram(train_data, fig_size=(5,5), bins=5)

ds.visualizations.boxplot(data=train_data, target='FraudResult', fig_size=(5,5))

TIME FEATURES

date_feats = ds.structdata.get_date_cols(train_data)

date_feats

{'TransactionStartTime'}

VISUALIZATION OF TIME FEATURES

train_data[date_feats].head()

num_feats = ds.structdata.get_num_feats(train_data)

ds.timeseries.num_timeplot(data=train_data,num_cols=num_feats, time_col='TransactionStartTime')

The extract_dates function in the timeseries module can extract date information from a datetime column automatically.

train_data = ds.timeseries.extract_dates(data=train_data, date_cols=['TransactionStartTime'])
test_data = ds.timeseries.extract_dates(data=test_data, date_cols=['TransactionStartTime'])

train_data.head(2).T

MERGE DATA AND ENCODE CATEGORICAL FEATURES¶

#perform merge 
all_data, ntrain, ntest = ds.structdata.join_train_and_test(train_data, test_data)

#Label Encode Large Categorical features
large_cats = ['AccountId', 'SubscriptionId', 'CustomerId', 'ProductId']

from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
for cat in large_cats:
    all_data[cat] = lb.fit_transform(all_data[cat])

# One hot encode small categorical features
all_data = pd.get_dummies(all_data, drop_first=True)

#Get traina nd test set back
train = all_data[:ntrain]
test = all_data[ntrain:]

#Get target and drop it
target = train['FraudResult']
train.drop('FraudResult', axis=1, inplace=True)
test.drop('FraudResult', axis=1, inplace=True)

print("Shape of training datasets is {}".format(train.shape))
print("Shape of training target is {}".format(test.shape))
print("Shape of target is {}".format(target.shape))

MODELING¶

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.3, random_state=2)

rf_model = RandomForestClassifier(n_estimators=100,random_state=232)
lg_model = LogisticRegression(max_iter=100, random_state=2, solver='lbfgs')

Next, we will train a two classifiers and get detailed report with the datasist classification report function. Can be used to pick best model for a task.

lg_model.fit(X_train, y_train, )
pred = lg_model.predict(X_test)

#Get report from true and predicted values
ds.model.get_classification_report(y_test, pred)

Accuracy is  100.0
F1 score is  41.0
Precision is  58.0
Recall is  32.0
****************************************************************************************************
confusion Matrix
                 Score positive    Score negative
Actual positive     28629                13
Actual negative        39                18

rf_model.fit(X_train, y_train)
pred = rf_model.predict(X_test)

#Get report from true and predicted values
ds.model.get_classification_report(y_test, pred)

Accuracy is  100.0
F1 score is  90.0
Precision is  90.0
Recall is  91.0
****************************************************************************************************
confusion Matrix
                 Score positive    Score negative
Actual positive     28636                 6
Actual negative         5                52

From the metrics reported, we can see that the RandomForest model does better tahn Logistic Regression. Next we can use a cross validation function to detect overfitting.

ds.model.train_classifier(train_data=train, target=target, model=rf_model, cross_validate=True, cv=3)

Accuracy is 99.8787
F1_score is 72.1679
Precision is 76.2845
Recall is 79.7035

Finally, we can plot the feature importance to see the most important feature

feats = train.columns
ds.model.plot_feature_importance(estimator=rf_model, col_names=feats)

	TransactionId	BatchId	AccountId	SubscriptionId	CustomerId	CurrencyCode	CountryCode	ProviderId	ProductId	ProductCategory	ChannelId	Amount	Value	TransactionStartTime	PricingStrategy
0	TransactionId_76871	BatchId_36123	AccountId_3957	SubscriptionId_887	CustomerId_4406	UGX	256	ProviderId_6	ProductId_10	airtime	ChannelId_3	1000.0	1000	2018-11-15T02:18:49Z	2
1	TransactionId_73770	BatchId_15642	AccountId_4841	SubscriptionId_3829	CustomerId_4406	UGX	256	ProviderId_4	ProductId_6	financial_services	ChannelId_2	-20.0	20	2018-11-15T02:19:08Z	2
2	TransactionId_26203	BatchId_53941	AccountId_4229	SubscriptionId_222	CustomerId_4683	UGX	256	ProviderId_6	ProductId_1	airtime	ChannelId_3	500.0	500	2018-11-15T02:44:21Z	2
3	TransactionId_380	BatchId_102363	AccountId_648	SubscriptionId_2185	CustomerId_988	UGX	256	ProviderId_1	ProductId_21	utility_bill	ChannelId_3	20000.0	21800	2018-11-15T03:32:55Z	2
4	TransactionId_28195	BatchId_38780	AccountId_4841	SubscriptionId_3829	CustomerId_988	UGX	256	ProviderId_4	ProductId_6	financial_services	ChannelId_2	-644.0	644	2018-11-15T03:34:21Z	2

	TransactionId	BatchId	AccountId	SubscriptionId	CustomerId	CurrencyCode	CountryCode	ProviderId	ProductId	ProductCategory	ChannelId	Amount	Value	TransactionStartTime	PricingStrategy
95657	TransactionId_89881	BatchId_96668	AccountId_4841	SubscriptionId_3829	CustomerId_3078	UGX	256	ProviderId_4	ProductId_6	financial_services	ChannelId_2	-1000.0	1000	2019-02-13T09:54:09Z	2
95658	TransactionId_91597	BatchId_3503	AccountId_3439	SubscriptionId_2643	CustomerId_3874	UGX	256	ProviderId_6	ProductId_10	airtime	ChannelId_3	1000.0	1000	2019-02-13T09:54:25Z	2
95659	TransactionId_82501	BatchId_118602	AccountId_4841	SubscriptionId_3829	CustomerId_3874	UGX	256	ProviderId_4	ProductId_6	financial_services	ChannelId_2	-20.0	20	2019-02-13T09:54:35Z	2
95660	TransactionId_136354	BatchId_70924	AccountId_1346	SubscriptionId_652	CustomerId_1709	UGX	256	ProviderId_6	ProductId_19	tv	ChannelId_3	3000.0	3000	2019-02-13T10:01:10Z	2
95661	TransactionId_35670	BatchId_29317	AccountId_4841	SubscriptionId_3829	CustomerId_1709	UGX	256	ProviderId_4	ProductId_6	financial_services	ChannelId_2	-60.0	60	2019-02-13T10:01:28Z	2

	CountryCode	Amount	Value	PricingStrategy	FraudResult
count	95662.0	9.566200e+04	9.566200e+04	95662.000000	95662.000000
mean	256.0	6.717846e+03	9.900584e+03	2.255974	0.002018
std	0.0	1.233068e+05	1.231221e+05	0.732924	0.044872
min	256.0	-1.000000e+06	2.000000e+00	0.000000	0.000000
25%	256.0	-5.000000e+01	2.750000e+02	2.000000	0.000000
50%	256.0	1.000000e+03	1.000000e+03	2.000000	0.000000
75%	256.0	2.800000e+03	5.000000e+03	2.000000	0.000000
max	256.0	9.880000e+06	9.880000e+06	4.000000	1.000000

	ProviderId
ProviderId_4	38189
ProviderId_6	34186
ProviderId_5	14542
ProviderId_1	5643
ProviderId_3	3084
ProviderId_2	18

	ProductCategory
financial_services	45405
airtime	45027
utility_bill	1920
data_bundles	1613
tv	1279
ticket	216
movies	175
transport	25
other	2

Example use case of datasist in the workflow of a data scientist using a dataset from the Zindi competitive data science platform

MERGE DATA AND ENCODE CATEGORICAL FEATURES¶

MODELING¶

Download Notebook Here

	Data Type
TransactionId	object
BatchId	object
AccountId	object
SubscriptionId	object
CustomerId	object
CurrencyCode	object
CountryCode	int64
ProviderId	object
ProductId	object
ProductCategory	object
ChannelId	object
Amount	float64
Value	int64
TransactionStartTime	object
PricingStrategy	int64
FraudResult	int64

	Feature	Unique Count
0	TransactionId	95662
1	BatchId	94809
2	AccountId	3633
3	SubscriptionId	3627
4	CustomerId	3742
5	CurrencyCode	1
6	ProviderId	6
7	ProductId	23
8	ProductCategory	9
9	ChannelId	4
10	TransactionStartTime	94556

	ChannelId
ChannelId_3	56935
ChannelId_2	37141
ChannelId_5	1048
ChannelId_1	538