“That’s what winter is: an exercise in remembering how to still yourself then how to come pliantly back to life again.”
― Ali Smith, WINTER
This is a blog about AI, and despite the fact that a host of the tools used to create AI are open source, many businesses haven't invested in dedicated departments, nor considered under what framework they might create these tools. From the news one would believe we are living through an immediate data revolution, but if we were to poll most businesses we would likely find that many still use the same tools and methodologies they have used for years.
I believe this poses a risk because, as I attempt to show in this article, Data Science, when analysed from a Business Analyst's perspective, differs from previous IT methods in enough key ways that businesses need to reconsider those methods before adopting it.
Wikipedia defines data science as follows (link here: Data science - Wikipedia):
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data. In recent years, the field of data science has become very important and apparent in relation to the computer science field.
Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data.[3] It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is different from computer science and information science. Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational, and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.[4][5]
Why the Framework Fell
Businesses have long been dominated by the empirical; theory was often seen as secondary, since who can argue with theory when something either makes money or doesn't? Data science, being an evidence-based method, therefore complements business well. Even so, many businesses will be data shy, and their methods will slow adoption where they bring assumptions that it can be made to work like IT. To analyse that, let's consider the related fields, starting with Business Analytics.
Gartner defines Business Analytics as follows (Definition of Business Analytics - Gartner Information Technology Glossary):
Business analytics is comprised of solutions used to build analysis models and simulations to create scenarios, understand realities and predict future states. Business analytics includes data mining, predictive analytics, applied analytics and statistics, and is delivered as an application suitable for a business user. These analytics solutions often come with prebuilt industry content that is targeted at an industry business process (for example, claims, underwriting or a specific regulatory requirement).
Contrast this with Data Science, where the product is not pre-built but is developed, trained, and tailored specifically to the company's requirements. And because it is often discussed as Artificial Intelligence, there is added pressure not merely to be used by the business user but to replace the business user in the role. This unfortunately implies that, outside of supplied pretrained models, the tendency to rely on internal Business Analysts may favour off-the-shelf non-AI products, even where a company has the data to create its own AI or could outsource to a Data Science team.
What is more, to illustrate business processes, let's discuss functions. Alonzo Church invented the lambda calculus in 1936, and the lambda calculus is the foundation of the LISP language and of functional programming. A hallmark of functional-style programming is the restriction on reassignment: any value should be either a constant or a calculated value, so changing a value implies an error, either in the constant or in the calculation.
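To make that constraint concrete, here is a minimal Python sketch (the Order type and the tax rate are my own illustrative inventions): every value is either a constant or calculated from the inputs, and nothing is reassigned.

```python
from typing import NamedTuple

class Order(NamedTuple):          # immutable input: fields cannot be reassigned
    quantity: int
    unit_price: float

def invoice_total(order: Order, tax_rate: float = 0.2) -> float:
    """Pure function: the same inputs always give the same output."""
    net = order.quantity * order.unit_price   # a calculated value, never changed later
    return net * (1 + tax_rate)

print(round(invoice_total(Order(quantity=3, unit_price=9.99)), 2))  # 35.96
```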
This methodology is not dissimilar to the way most larger businesses think of their work. Larger departments are still seen as functions: their parts often include humans and many other tools, but the process is still to take defined, well-understood inputs, apply some transformation that creates value, and return the finished product to customers.
While businesses are not mathematicians, if both think in terms of functions it is easy to see why businesses might see change as risk (something most business analytics is in fact a testimony to).
We can contrast this with Machine Learning, where the very variables representing the weights and biases of a neural network are repeatedly changed in response to different parameters (the inputs representing a given task).
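Here is a minimal sketch of that difference in Python (the data points and learning rate are invented for illustration): a single weight is deliberately reassigned over and over until it fits.

```python
# A single "weight" is reassigned thousands of times: the opposite of the
# functional rule that values never change.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (input, target) pairs, y is roughly 2x

w = 0.0                        # start from an arbitrary value
for _ in range(1000):
    for x, y in data:
        error = w * x - y      # how wrong the current weight is
        w -= 0.01 * error * x  # nudge the weight to reduce the error

print(round(w, 1))             # 2.0: the weight has "learned" the slope
```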
How might a business react if an AI doesn't make the same decisions from second to second, because of new learning, but is in theory "continuously better"? The news often discusses AI going bad, and there is a concept of AI ethics; what concerns me is that while there are concepts for data science projects, there is little literature on AI lifecycle management within an active operational environment.
There is therefore a risk of businesses failing to appreciate what a fully functional data science team might bring them, due to a reliance on buying in IT products wholesale; likewise, an organisation that doesn't adapt its culture might be squeamish about AI, owing to it being messy and changeable. I hope this article convinces you that there is much to look forward to if you get over these hurdles.
Data Science is therefore different from the larger IT function, and I have argued in my article about the analytical sandbox where it sits in the business. Still, many would be tempted to think it similar, or even identical, to business intelligence.
Data Science as distinct from Business Intelligence
Business Intelligence focuses on answering questions such as what happened last quarter, sales targets, and KPIs: essentially either responding to ad hoc requests by querying historic data or creating standardised dashboards. Business Intelligence developed this way to answer spreadsheet risk.
Spreadsheets are an easy way for IT to supply data to people, even people with low data literacy. The problem with this approach is that, once outside IT's control, users can change data or splice datasets together in new and odd ways. This understandably made IT afraid of giving data out and worried about security issues. It also creates many versions of the truth, where a user may send or share estimates based on dated data. Business Intelligence therefore developed from the need for businesses to enforce a "one version of the truth" policy: by controlling standards, data could be supplied pre-cleaned and checked.
Business Intelligence therefore focused on standards and on supplying one-version-of-the-truth dashboards. These approaches, though, are entirely explanatory: it was enough to simply count things and show them in graphs. Data Science, on the other hand, developed to use data to predict and forecast future events and is exploratory in nature. It relies heavily on maths (rarely just counting) and takes an iterative approach to explore, optimise, or forecast future business events based on current data.
While Data Science grew from the clean data sources that Business Intelligence made possible, Business Intelligence's focus on explanation means that data science for forecasting cannot be developed within its structured approach.
A number of Data Science methodologies go further: they see the role of IT and Business Intelligence as gatekeeping against Data Science personnel, and require the Data Scientist to seek approval both from the executives who requested the task and from IT and Business Intelligence, to make sure the work is operationalised effectively. There is therefore an assumed split between the two.
So while Business Intelligence can maintain and clean data, I suggest data science is the end game, representing the ability of a business to continuously analyse, forecast, and predict its environment. In conclusion, Data Science extends Business Intelligence operations, but its activities differ enough to be treated differently.
The Data Science Lifecycle
The data science project lifecycle has developed from several sources. The most obvious is the scientific method, which has been used for centuries and relies on forming a hypothesis and running a test. The improvement data science adds is that, for many businesses, the data already exists and statistical hypothesis testing can be carried out rapidly by machines.
This means questions raised by executives can be answered rapidly using database queries and statistical tests. These hypothesis tests could be things like: do we see a statistically significant relationship between the number of staff working and profits? There is a variety of hypothesis testing methods, and knowing the right one to use matters, but a good hypothesis test can help in the formulation of executive decisions; knowing whether the data supports or rejects a certain hypothesis is a very important input. A sketch of such a test follows.
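Here is the staff-versus-profits question as a minimal Python sketch using SciPy; the figures are invented for illustration.

```python
from scipy import stats

# Hypothetical daily records: staff on shift vs. that day's profit (in thousands)
staff   = [4, 5, 6, 6, 7, 8, 9, 10, 11, 12]
profits = [3.1, 3.4, 4.0, 3.8, 4.5, 4.9, 5.6, 5.8, 6.5, 7.0]

# Pearson correlation tests for a linear relationship between the two
r, p_value = stats.pearsonr(staff, profits)

print(f"correlation r = {r:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: staffing and profit appear related.")
else:
    print("The data does not support a relationship at the 5% level.")
```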
Doing this automatically and repeatedly is called (or is at least part of) data mining, which has CRISP-DM as a project method. CRISP-DM was published in 1999 to standardise data mining processes across industries, and a combined approach with Agile is recommended.
The Data Science Process Alliance, who maintain the CRISP-DM methodology, define how Data Science differs from software development here: Data Science vs Software Engineering (datascience-pm.com). I recommend reading the article, but in summary they recommend businesses manage data science projects on an ad hoc basis, in response to continuous improvement processes like Kanban and Kaizen.
Tom Davenport's DELTA framework offers an approach to data analytics projects that takes into account an organisation's context: its skills, datasets, and leadership engagement. An explanation of the DELTA model is provided here: What is the Delta Model? A powerful Strategic Framework | toolshero. The DELTA model looks at how to integrate data science into strategic-level thinking, with DELTA standing for Data, Enterprise Organisation, Leadership, Targets, Analysts. The intent is to create strategic-level analysis focused entirely on the customer.
Doug Hubbard created Applied Information Economics (AIE), which provides an approach for measuring intangibles and gives guidance on developing decision models, calibrating expert estimates, and computing the expected value of information. It is explained here: Applied Information Economics (AIE) - CIO Wiki (cio-wiki.org). AIE lets you wargame business decisions and analyse the value of information: you estimate, in monetary terms, the gains you would get from a given decision, the risk of being wrong, and the improvement additional information would make to that risk of error. Your willingness to buy new information, and to develop new analysis, can then be estimated from the improvement to the related decisions your company makes. A toy sketch of this follows.
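Here is a minimal sketch of that value-of-information logic in Python (my own invented figures, not Hubbard's worked example):

```python
# Decision: launch a product. All figures invented for illustration.
p_success = 0.6            # current estimate that the launch succeeds
gain      = 500_000        # profit if it succeeds
loss      = 200_000        # loss if it fails

# Expected value of deciding now, with current information
ev_now = p_success * gain - (1 - p_success) * loss

# With perfect information we would launch only in the success cases,
# so we would never incur the loss.
ev_perfect = p_success * gain

# The gap is the most any market research could be worth
# (the expected value of perfect information, EVPI).
evpi = ev_perfect - ev_now
print(f"EV now: {ev_now:,.0f}; ceiling on research spend: {evpi:,.0f}")
```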
The Dell EMC Data Analytics Lifecycle, which I recommend as a method for structuring data analytics projects, splits projects into Discovery, Data Prep, Model Planning, Model Building, Communicate Results, and Operationalise. Key to its being a data science project method is that the project can, and should, cycle back and forth along these phases. This is likely a problem for organisations that are motivated by the need to see results, where backwards steps are often not seen as progress.
You can see that by applying these methods it is possible to assess the value of having access to more information and analytics across the business; hence the paradox that the true value of Data Science will only be known to your business once it implements some form of Data Science.
Beyond the frameworks, though, there are benefits of Data Science that nearly any business will be tempted by: Machine Learning and AI.
Machines that learn
Broadly, all machine learning can be put into four categories: classification, regression, unsupervised learning, and reinforcement learning. Technically, regression and classification are both "supervised". Supervised versus unsupervised is often a confusing distinction: supervised learning doesn't mean the model is watched by a person, it means the correct value is known and used as the training label.
A very brief look at the kinds of things that can be done with Data Science follows. Long term, I hope to work through these in later articles.
Regression
Regression is the fancy name mathematicians give to machine learning that calculates numbers (the name itself comes from Francis Galton's "regression to the mean"). A regression model is fitted by choosing a starting point at random, repeatedly doing the calculation, measuring the error, and moving the values a step in the right direction (the gradient) so that the error reduces each time. The iterative idea behind this is credited to Sir Isaac Newton, around 1687.
Regression models therefore give us the capability to take a set of numbers we do know and forecast numbers we do not. It seems to have gone largely unremarked that mathematicians have managed what many oracles and prophets failed at: they have created a science of prognostication.
Surprisingly, this plays into how the atomic bomb was developed: John von Neumann supposedly would repeatedly take the formulas being worked on by the Manhattan Project, randomly move values a bit up and down, and see whether these resulted in better outputs against the actual test outcomes. Machine learning just automates that test-and-repeat process. A minimal sketch of that loop follows.
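Here is that perturb-and-keep loop in Python; the data points and step size are invented, and real machine learning usually replaces the random nudges with calculated gradients.

```python
import random

# Invented observations following roughly y = 3x + 1, with a little noise
data = [(0, 1.1), (1, 3.9), (2, 7.2), (3, 9.8)]

def total_error(slope, intercept):
    return sum((slope * x + intercept - y) ** 2 for x, y in data)

slope, intercept = 0.0, 0.0
for _ in range(20_000):
    # Nudge each value a bit up or down at random...
    trial = (slope + random.uniform(-0.1, 0.1),
             intercept + random.uniform(-0.1, 0.1))
    # ...and keep the nudge only if the output improves.
    if total_error(*trial) < total_error(slope, intercept):
        slope, intercept = trial

print(f"y = {slope:.1f}x + {intercept:.1f}")  # close to the underlying y = 3x + 1
```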
Model Types: Linear Regression, Neural Network with Rectified Linear Output, Regression adapted decision trees
Classification
Classification is a process that calculates the chance that a given example belongs to a given class. A class in this case should be thought of as a yes-or-no answer: is this picture a banana, is this project going to be cancelled?
Classification can be approached as filtering, using decision trees: these are built using information entropy, as a step-by-step filtering process that finds the splits that lead most directly to good answers.
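For reference, here is the entropy calculation behind those splits as a minimal Python sketch (the class counts are invented):

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a class distribution, in bits."""
    total = sum(counts)
    return max(0.0, -sum((c / total) * log2(c / total) for c in counts if c))

# A 50/50 split is maximally uncertain; a pure node has zero entropy,
# so the tree prefers splits that move nodes towards purity.
print(entropy([5, 5]))   # 1.0
print(entropy([10, 0]))  # 0.0
```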
Conversely, logistic regression and sigmoid-output neural networks can learn to output a risk percentage that a given object belongs to a given class. This is useful because, given a population of, say, customers, and knowing which of them bought an item, we can feed in this data and get a classifier that gives the percentage chance that a future customer will buy that item.
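Here is a minimal sketch of exactly that with scikit-learn (the two customer features and the purchase labels are invented for illustration):

```python
from sklearn.linear_model import LogisticRegression

# Invented training data: [age, past visits] per customer,
# with 1/0 labels for whether they bought the item.
X = [[25, 1], [32, 4], [41, 2], [48, 9], [23, 0], [55, 7], [37, 3], [61, 10]]
y = [0, 1, 0, 1, 0, 1, 0, 1]

model = LogisticRegression().fit(X, y)

# predict_proba returns [P(no purchase), P(purchase)] for each new customer
new_customer = [[45, 6]]
print(f"Chance of buying: {model.predict_proba(new_customer)[0][1]:.0%}")
```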
Model Types: Logistic Regression, Artificial Neural Networks using sigmoid output, decision trees, random forest, boosted trees
Unsupervised
Unsupervised learning is so called because it doesn't receive any training labels; it is used to find groupings within the data. The first time you learn about unsupervised learning, it will almost always be via a map grouping a set of locations together based on proximity to one another.
Such tools have been used to find "crime hotspots", and anywhere a map has groups of differently coloured pins probably uses unsupervised learning. What is less often appreciated is that the same techniques can be used to group any objects: while a map has two dimensions (x and y), any dimensions can be substituted. If you wanted to group customers in three dimensions, where x was money in the bank, y was money spent on grocery shopping, and z was 0 or 1 according to whether they'd visited your store, you could split your customers into a series of groups (I've never yet wanted to do this, but it can be done).
This multi-dimensional setting is sometimes called an N-dimensional phase space, where N is the number of dimensions in use. It is easier to see why it is likened to space by describing a simplified version of how an unsupervised model learns. Typically, the model starts with a number of random points; for each, it measures the distance to the data points that lie closer to it than to any of the other random points, and squares the result. This value gives you the error. Then, by moving the point up or down one of the N dimensions, seeing if the error reduces, and keeping the moves that reduce the error, you find the centre you were searching for. A sketch of this follows.
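Here is a minimal sketch of k-means clustering, the standard version of that idea, where rather than trial nudges each centre jumps straight to the mean of the points assigned to it (the six 2-D points are invented):

```python
import random

# Invented 2-D data: two loose groups of points
points = [(1, 1), (1.5, 2), (2, 1.2), (8, 8), (8.5, 9), (9, 8.2)]

def nearest(centres, p):
    """Index of the centre closest to point p (squared distance as the error)."""
    return min(range(len(centres)),
               key=lambda i: (p[0] - centres[i][0]) ** 2
                           + (p[1] - centres[i][1]) ** 2)

centres = random.sample(points, 2)      # start from random points
for _ in range(10):
    # Assign each point to its nearest centre...
    clusters = [[p for p in points if nearest(centres, p) == i]
                for i in range(len(centres))]
    # ...then move each centre to the mean of its cluster.
    centres = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               for c in clusters if c]

print(centres)  # two centres, roughly (1.5, 1.4) and (8.5, 8.4)
```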
A problem with unsupervised learning is that using lots of dimensions is unfeasible: distance-based methods suffer from the "curse of dimensionality", where in high-dimensional spaces the distances between points become less and less informative. This gives the unfortunate downside that you cannot give the model every variable you hold on a customer and just leave it to split them into groups. Less really is more, and the project becomes about limiting the data down to key variables.
Unsupervised Learning Models: k-means Clustering, Hidden Markov Model, DBSCAN Clustering, Principal component analysis (PCA), t-SNE
Reinforcement Learning
Reinforcement learning is the one most often left out of introductory training. It is about learning a heuristic for sequential, game-theoretic decision making. Today reinforcement learning is mostly used for game AI; it's the type that beats grandmasters at chess and the like. Game-playing systems use N-move lookahead to scan the future moves available, while neural networks for reinforcement learning attempt to assess the benefits and risks of making a given move. Likewise, the Halo enemy AI was very successful for its day using behaviour trees to decide its actions.
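Here is a minimal sketch of N-move lookahead on a toy Nim-like game (players remove one to three counters, and whoever takes the last counter wins; the game and figures are invented for illustration):

```python
def best_move(counters, depth, maximising=True):
    """Minimax with an N-move lookahead. Scores are from the root player's
    perspective: +1 a forced win, -1 a forced loss, 0 beyond the horizon."""
    if counters == 0:
        # The previous player took the last counter and won.
        return (-1 if maximising else 1), None
    if depth == 0:
        return 0, None  # lookahead horizon reached: call it neutral
    results = []
    for m in (1, 2, 3):
        if m <= counters:
            score, _ = best_move(counters - m, depth - 1, not maximising)
            results.append((score, m))
    return max(results) if maximising else min(results)

score, move = best_move(10, depth=6)
print(f"With 10 counters, take {move} (lookahead verdict {score:+d})")  # take 2, +1
```

A neural network in a full reinforcement learning system would replace the "neutral" horizon score with a learned estimate of how good the position is.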
Reinforcement Learning Models: Neural Networks, N look ahead algorithms, Decision trees
Oddities that aren't
Most forms of machine learning fit broadly into one of those categories. Computer vision, despite sounding intimidating, uses a neural network with convolutional layers to transform picture data into a matrix the model can treat like any other input. In doing so, the data is transformed so that lines and colour contrasts become numeric features; pure black can be awkward, since mathematically it produces a load of inputs that are all zeroes.
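Here is a minimal sketch of a single convolution step with NumPy: a classic edge-detecting kernel slid over a tiny invented greyscale image (a real network learns its kernels rather than being handed them):

```python
import numpy as np

# Tiny invented greyscale "image": dark left half, bright right half
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# A classic vertical edge detector
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

# Slide the kernel over the image, multiplying and summing at each position
out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(out)  # all positive: every window straddles the dark-to-light edge
```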
So while that is a woefully simplistic explanation, you can see that the actual output is still regression, classification, or reinforcement learning.
Operations
Gartner concluded in their 2021 market guide on AIOps that "There is no future of IT operations that does not include AIOps." Link here: Gartner Market Guide for AIOps Platforms 2021 - ScienceLogic.
Gartner's conclusion was that the large number of systems IT deals with benefits from AI monitoring. The assumption that this won't apply outside IT appears flawed once one realises that the array of sensors, online forms, and methods for giving structure to unstructured data is making it easy to digitise non-digital assets.
In effect this approach already exists within banking, which uses AI and various inputs in what is called complex event processing (link here: Complex event processing - Wikipedia). Data feeds are sent to an AI to predict important adverse market moves. If this works in IT AIOps and in banking, there is no reason AI cannot be used across more repetitive operational tasks.
The reality is that AI cannot take everyone's jobs. What it can do is "prognosticate" future values far better than human capability; it can probably do much more in future, as we have discussed, but most of this lies outside the technical capabilities of businesses until those skills have been developed and embedded in operations.
This is probably the biggest reason to develop and invest in Data Science skills: regardless of how efficiently built these AIs are, the business environment may at some point change around them. I am confident, though, that a team with good business intelligence skills, and operational manuals that set out acceptable Key Performance Indicators for an AI's performance, can recognise when that has happened.
It is therefore likely that a business filled with AIs will still require humans to monitor and manage their performance and make sure they do not drift. This almost certainly applies regardless of how good the data scientists were, as all tools need maintenance.
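As a minimal sketch of that kind of KPI guard (the accuracy log and threshold are invented; real monitoring would compare live predictions against outcomes as they arrive):

```python
# Weekly accuracy of a deployed model, measured against observed outcomes
weekly_accuracy = [0.91, 0.90, 0.92, 0.89, 0.84, 0.79]  # invented figures

KPI_FLOOR = 0.85   # agreed minimum acceptable accuracy
TRIGGER = 2        # how many breaches are tolerated before escalating

breaches = [week for week, acc in enumerate(weekly_accuracy, start=1)
            if acc < KPI_FLOOR]
if len(breaches) >= TRIGGER:
    print(f"Model drift suspected: weeks {breaches} fell below {KPI_FLOOR:.0%}.")
    print("Escalate for investigation and retraining.")
```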
I like to imagine the future high-tech operations department as something in between an IT department and an HR department: most of the tools are code and require coding skills, but they also need well-defined objectives and measures to work out when a system needs to be looked at.
Criticisms
It might be obvious, but it needs stating: data science grows from clean master data management and good business intelligence practice. Companies that have yet to develop clean, organised data should probably invest in that before trying to do more with data science.
There is also a risk that if we deploy machine learning to predict and decide everything, we will end up with a confusing, hard-to-understand workplace; some restraint is required to focus on the data science projects with the biggest impact.
Conclusion
I believe in a concept I read about at one point and haven't seen since: the idea of differential minds. Where machine learning betters humanity thanks to its logical nature, we should apply it; where human cognition appears more empathetic, and the ability to take someone out for a coffee and debrief them appears useful, we should use the human.
Starting to apply data science will be instrumental. I believe companies can buy in most of the skills related to data science, so whether that means temporary outsourced consultants or permanent departments is not the issue that needs addressing. The issue that does need looking into is that any AI brought in will still need to sit within a company-wide AI strategy, and what will help you there is the data science project frameworks already discussed.
What I hope I have dissuaded you from is thinking that businesses can import AI using old ways of thinking. The project methodologies are clearly different, they occupy a niche in the business distinct from the normal IT asset lifecycle, and normal Business Analytics may overlook their use.
To illustrate why I think you cannot do otherwise, note that over the course of this article I have several times mentioned historical cases of the intellectual founders whose work led to Machine Learning. This is to highlight that AI is not new in the way often suggested in the media. The first neural networks were built and tested in the 1950s; Frank Rosenblatt's Perceptron hardware was funded by the US Navy with recognition tasks in mind. In the 1980s many widespread implementations of neural networks were developed.
This is not new technology; what is new is the amount of clean, organised data available, alongside increases in CPU and GPU power that make it easier to run tests. So if the technology has existed in the past, it is worth pondering what delayed or stopped its adoption.
Within the AI literature there is the concept of AI winters. During the Cold War there was a strong belief in intelligence circles that automated translation software would eliminate the need for Russian linguists. There was so much hype that a solution was expected imminently, and when it failed to appear those in charge became frustrated and pulled the funding, after which funding dried up for nearly all AI research. Today the method is proven, with Google Translate highly robust, and a whole new science of Natural Language Processing is available.
Repeatedly over the history of AI, a model has been proposed, a counter-paper has rubbished the idea as too expensive or impractical, and the funding has been pulled; years later the method revolutionises science. The modern neural network was heavily criticised in its time, and the backpropagation maths behind it was rediscovered in the 1980s, having been originally set out in the 1970s and largely forgotten.
There is both an opportunity and a risk with AI, and usually the risk is described as AI ethics. I propose a different paradigm: the risk is the risk of another AI winter. Businesses may seek to adopt this new technology, but the needs are analysed by a Business Analyst looking for functional requirements in the lambda-calculus mould; its creation is run as a normal project, rushed to completion by a project manager driven by timeframes; and when finally deployed, the AI, whose processes are poorly understood, is managed by people untrained in the statistical methods that drive this new business object.
The outcome appears obvious: the business fails to develop the AI, the failure is attributed to the AI rather than to the methods used, and another AI winter ensues. This article therefore urges businesses interested in machine learning, AI, and data science to recognise it as a distinct discipline and not a subset of IT.
Unless, perhaps, they want another AI winter. I contrast that with the people who believe AI will take all the jobs: as anyone reading this can see, there are plenty of roles in an efficiently run business that uses AI. The problem is they aren't job titles businesses are used to, and if businesses try to implement AI the way they implement an IT department, it might not work. Data science is clearly more exploratory; it is driven by objectives, not timetables.
So let us avoid AI winters and have technological summers.