Data Mining Operations - Object Recognition Problem

Published on 6 June 2022 at 23:24

"The most important thing that comes out of a mine is the miner" - Anonymous 

 

PS: if you want to steal my code, skip to the bottom. This is the start of a study piece on using data mining for a specific purpose that I want to work through. I have written the start of a data mining script, and I have been wanting to work through how to do a full data mining project as a generalised problem under negative circumstances, as I think this is an overlooked skill.

 

Foreword

 

I think data analytics has a weird problem that most people don't talk about, which I call the object recognition paradox below. What I am thinking here is best illustrated by something I heard about Mossad, the Israeli secret service.

Mossad has a 10th person rule: during any meeting of ten people or more, if everyone agrees on a subject, it is the duty of the 10th person to disagree, and to disagree robustly. The purpose is to discourage groupthink and to make sure anything is entertained, no matter how unthinkable.

The only way I can think of to disagree with everyone, including your boss, and remain both unfired and friends with people involves analysing the problem rationally and presenting data.

Alongside this, I have had more than one experience where I am given data by an organisation, usually for a "technical question" in an interview. The problem is that it is always sprung on you when you know very little about the data. Even things like what a given field means are opaque: are "C_Identity" and "Customer ID" the same thing if I find them in two different databases?
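One way to get a first answer to the "C_Identity" versus "Customer ID" question without asking anyone is to check how many values the two fields share. The following is a minimal sketch in plain Python, not a definitive method: the function name key_overlap and the sample values are made up for illustration, and a high overlap only suggests the two fields hold the same identifier, it does not prove it.

```python
def clean(values):
    """Turn raw values into a set of stripped strings, dropping missing ones."""
    return {str(v).strip() for v in values if v is not None and str(v).strip() != ""}


def key_overlap(left_values, right_values):
    """Report how many distinct values two fields share, and what share of each that is."""
    left = clean(left_values)
    right = clean(right_values)
    shared = left & right
    return {
        "shared": len(shared),
        "left_coverage": len(shared) / len(left) if left else 0.0,
        "right_coverage": len(shared) / len(right) if right else 0.0,
    }


# Made-up sample values standing in for the two databases (assumption, not real data).
c_identity = [1001, 1002, 1003, 1004]
customer_id = ["1002", "1003 ", "1004", "1005", "1006"]

result = key_overlap(c_identity, customer_id)
print(result)
```

If most of the values in one field turn up in the other, that is a reasonable hint they are the same key stored under two names. If almost none do, they are probably different things, or the same thing in a different format.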

So I have been thinking about how to do data analytics under the assumption that the business has the wrong assumptions, bad data and just about every other negative condition, and looking at what statistical tools and methods exist to approach such a project. For this I turned to data mining and ran straight into the object recognition problem, which dovetailed with my thinking.

The Object Recognition Problem is that any interaction with data using a query language relies on already knowing what you are querying.

So, to this end, I have written some code which generates a scatter plot for every combination of fields within a CSV and a histogram for every field, then dumps them out to disk. I have used it a few times and I find it very useful when I have no idea where to start with a data set. My problem is that if you have 60 fields, that is 60^2 = 3,600 plots to render, and by the time a fair number have been rendered and you have picked out your preferred plots to present, a fair amount of time has passed. Nonetheless, that is time spent by my CPU and not me and, as already said, I use it when I don't know where to start.
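To put a number on that cost: plotting every field against every other field in both orders gives n^2 plots, but a scatter plot of A against B carries the same information as B against A, mirrored. A rough sketch of the arithmetic follows, assuming you are happy to plot each unordered pair once. The function name plot_counts and the field names are hypothetical.

```python
from itertools import combinations


def plot_counts(n_fields):
    """Return (all ordered pairs, including a field against itself; each unordered pair once)."""
    all_ordered = n_fields ** 2
    unordered = n_fields * (n_fields - 1) // 2
    return all_ordered, unordered


print(plot_counts(60))  # 60 fields, as in the example above

# itertools.combinations yields each unordered pair exactly once.
fields = ["Region", "Returned", "Price", "Quantity"]  # hypothetical field names
pairs = list(combinations(fields, 2))
print(len(pairs))
for x_nd, y_nd in pairs:
    print(x_nd, "vs", y_nd)
```

For 60 fields that is 1,770 scatter plots rather than 3,600, so roughly half the CPU time for the same set of relationships.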

 

Object Recognition Problem

 

So I am doing that thing again where I say I am going to do a series on a thing. Odds are I will start another series before I finish this. It is a bad habit of mine...

Still, I want to work through the problem of how you would do data analysis as the 10th man, and I am hoping to take the code below and expand it to move through analysing and mining out all the patterns.

I thought this was going to be easy. When I stopped and looked through the literature on data mining, I found an interesting revelation about the Object Recognition Problem: in many places the recommended methods were themselves compromised by it. The advice was that the best way to work out how to approach a data mining project was to interview the business. What I found very interesting was that the same authors often lamented that such analysis was only needed because the business did not know the thing in question.

I call this the Business Recognition Paradox: most analytics projects are frustrated by the need to interview and communicate with people who, by the nature of the work, do not know the answers to the questions being asked.

I want to work through the problem of using data mining in such circumstances: how can we take the 10th person problem and use data mining to rapidly provide insight? You can see below a start on the 10th man problem. If you run the code, eventually you will have a scatter plot to review for every combination of fields within the data, and a histogram for every field.
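While the plots render, a cheaper first pass when starting from nothing is to profile every field: how many rows it has, how many are missing and how many distinct values it holds. This is only a sketch of one possible first step, written in plain Python against a made-up CSV. The name profile_columns and the sample field names are placeholders for illustration and are not part of the script at the bottom.

```python
import csv
import io


def profile_columns(rows):
    """Summarise each field: row count, missing values and number of distinct values."""
    profile = {}
    for row in rows:
        for name, value in row.items():
            if name is None:
                continue  # extra cells beyond the header row; ignored in this sketch
            stats = profile.setdefault(name, {"rows": 0, "missing": 0, "distinct": set()})
            stats["rows"] += 1
            if value is None or value.strip() == "":
                stats["missing"] += 1
            else:
                stats["distinct"].add(value.strip())
    return {
        name: {"rows": s["rows"], "missing": s["missing"], "distinct": len(s["distinct"])}
        for name, s in profile.items()
    }


# Made-up data standing in for a CSV nobody has explained to you.
sample_csv = io.StringIO(
    "C_Identity,Region,Returned\n"
    "1001,North,3\n"
    "1002,,5\n"
    "1003,North,\n"
    "1004,South,5\n"
)
summary = profile_columns(csv.DictReader(sample_csv))
for name, stats in summary.items():
    print(name, stats)
```

A field with as many distinct values as rows looks like an identifier, a field with a handful of distinct values looks like a category, and a field that is mostly missing may not be worth plotting at all.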

I want to think through this approach, as I think the subject of how to do data analytics starting from nothing is an overlooked project scenario which is much more common than often admitted. I had started reading up on the Object Recognition Problem and had written a lot before I realised I had too much for one post and decided to split it up, so this really short post has turned into a notice of intent...

 

This is a short post, but I spent some time writing you a data mining script... Enjoy it as a starting point; I hope to return to it and expand.

 

 

Code

 

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 'ANSI' is a Windows-only codec alias; elsewhere name the code page, e.g. 'cp1252'.
data = pd.read_csv(r"C:\Data\TEST\TEST.csv", encoding='ANSI')
out_prefix = r'C:\Data\TEST\data_test_'  # folder and file-name prefix for every plot
fixed_x = "Difference In No Returned"    # the one field plotted against everything first

print(data.head())
print(data.tail())


def save_scatter(x_nd, y_nd):
    # Scatter plot of one pair of fields; on failure, report the error and carry on.
    try:
        sns.scatterplot(data=data, x=x_nd, y=y_nd)
        plt.xlabel(x_nd)
        plt.ylabel(y_nd)
        plt.savefig(out_prefix + str(x_nd) + '_' + str(y_nd) + '.png')
    except Exception as err:
        print(err)
        print(type(err).__name__)
    finally:
        plt.close()  # always close, so a failed plot never bleeds into the next one


def save_histogram(x_nd):
    # Histogram of a single field.
    try:
        sns.histplot(data=data, x=x_nd)
        plt.savefig(out_prefix + str(x_nd) + '_histogram.png')
    except Exception as err:
        print(err)
        print(type(err).__name__)
    finally:
        plt.close()


# Pass 1: the field of interest against every field, plus its own histogram.
for y_nd in data.columns:
    save_scatter(fixed_x, y_nd)
save_histogram(fixed_x)

# Pass 2: every combination of fields, a histogram per field and a count of missing values.
count = 0
for x_nd in reversed(data.columns):
    for y_nd in data.columns:
        save_scatter(x_nd, y_nd)
    save_histogram(x_nd)
    # Number of rows where this field is missing (NaN).
    count = int(data[x_nd].isna().sum())
    print(x_nd)
    print(count)

print("length")
print(len(data))
print(count)

 
