“I'm not out trying to prove anything. I'm sort of finished with that, so I get to play in other sandboxes and try and figure out what I like and I'm interested in.” — George Clooney
Sandboxes, as Clooney says, exist to play in; plenty of kids use them to practice the simplest of construction techniques. Here is the thing: AI learns too, and data science is about learning. Do you really want them playing with production data?
It seems obvious when we say it out loud, and if you're unsure, just go ask IT. It would be like asking kids to learn sandcastle-building by playing with concrete.
Your IT department is protective of its production data; whole security departments are dedicated to protecting it. This blog post is an attempt to set out why every modern corporation needs to consider the humble sandbox.
Seems Obvious
Businesses want the products of the data revolution: dashboards, embedded analytics, AI, data science projects. But many CEOs are understandably uneasy with the idea that they should allow free access to a bunch of coders who, with one line of code (DROP TABLE in SQL), could ruin their day.
So that raises a problem, one that many corporations will probably miss, but it is one that dear Mr Clooney has already presented us with the solution to. It is recommended that you create a data analytics sandbox.
A place to play with data that, like a kid's sandbox, is safe, secure, and watched over by someone from the IT department.
Businesses that don't do this will have no place to test the products built by data analysts, data scientists, and business analysts: a group of personnel at least one white paper I read called the new analytical elite (who still play in sandboxes).
But let's first make the case for why the other approaches are flawed and won't drive the data revolution.
There Are Two Transforms in ETLT
The standard operational methodology in IT is Extract, Transform, Load (ETL); per IBM, it is the prescribed methodology for ingesting data into a data warehouse (reference: What is ETL (Extract, Transform, Load)? | IBM). The method for data science, though, is ETLT. Within data science projects there are two transforms: the first when the data enters the analyst's sandbox, and a second when the analyst transforms it to their own specifications.
I refer to it here because in this acronym you can see a key difference. To create an analyst's sandbox, IT and the business are going to need to
work together to extract the data (and we mean all or most of it) and put it into a nice safe place where a certain class of people outside IT will have free access to it. Somewhere a CTO is screaming, and concerns like security and data cleanliness will be raised. I am going to try to convince you that if you want to do more with AI, then that has to happen in the sandbox.
The change from ETL to ETLT is enshrined in the Dell EMC and CRISP-DM methods (data science project methodologies). The whole idea of a data science project implies experimentation, so clearly you need to create an environment for that.
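To make the two transforms concrete, here is a minimal, self-contained sketch in Python using in-memory SQLite databases; the table names, columns, and values are purely illustrative.

```python
import sqlite3

# Hypothetical production database; schema and values are illustrative only.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE orders (id INTEGER, amount TEXT, region TEXT)")
prod.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "10.50", "north"), (2, "7.25", "south")])

# Extract: pull the raw rows out of production.
rows = prod.execute("SELECT id, amount, region FROM orders").fetchall()

# Transform #1 (on ingestion into the sandbox): basic cleaning,
# e.g. casting text amounts to numbers and standardising labels.
clean = [(i, float(a), r.upper()) for i, a, r in rows]

# Load: the sandbox is a separate database that production never sees.
sandbox = sqlite3.connect(":memory:")
sandbox.execute("CREATE TABLE orders (id INTEGER, amount REAL, region TEXT)")
sandbox.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean)

# Transform #2 (by the analyst, to their own specification): here a
# per-region total they are free to change at any time.
totals = dict(sandbox.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"))
```

The point is the separation: the first transform happens once on ingestion, while the second lives entirely in the sandbox and can be rewritten as often as the analyst likes without touching production.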
The additional transform is important for another reason: raw database data is often kept in a highly normalised state as a full relational database, which aids queries and transactions. This is less useful for data science, and while a tabular (spreadsheet-like) view is often the end point, the continuous changes to these views can become a strain on IT.
An initial simple solution is to require the data scientists to get their own data and run their own queries. The extra stress on production systems from additional users going in and out of the database, though, is enough to justify creating a copy of the data in a separate location.
The conclusion here is that an analytics sandbox is built as a separate area away from production. There is no interference with the business data, no matter how many queries run.
The Data Lake and the Analyst Sandcastles on Its Shore
There are social, practical, and business reasons why this is true. The social reason is that IT plays the role of security; its natural task is to limit what data people can access. The practical reason is that the key issues around AI are testing and explainability.
Testing requires access to domain knowledge, and the easiest way to get that is to have analysts sit in the business, not IT. Explainability means the KPIs, graphs, metrics, and dashboards that give you assurance the models are working as intended. Clearly that means explainable to the business, and in all honesty, probably explainable to executives.
I have no data to back this up; it is just an assertion that appears common sense to me. But it raises the question: if IT has all the data and doesn't want to allow free access, where should we build a sandbox?
And that brings us to why this section carries the slightly whimsical title "The Data Lake and the Analyst Sandcastles on Its Shore": the data science and analyst space is likely to be an afterthought for IT. At best it will be listed under regional databases on an ITIL training handout.
Though despite this, it has been part of the literature for some time. Data warehousing pioneer Bill Inmon and industry expert Claudia Imhoff have been evangelizing the idea since the late 1990s, although the co-authors then referred to it as "Exploration Warehousing". Bill Inmon is famously the "father of data warehousing", having coined the phrase and been the first to offer classes on the subject.
Arguably, warehoused data cannot be of much use unless you can take it out and do things with it. So I hope that if you don't have a data analytics sandbox, you are now starting to be tempted to build one.
Why a Good View Doesn't Necessarily Mean Good Architecture
It should be obvious by now that, like all scientific processes, data science loves a good test; in fact it is testing all the way through. IT wants stability, and there is a tendency to produce what are called views.
Views in SQL and similar systems trim the data down to the bare minimum a user requires, and this comes with advantages: a view can be refreshed overnight or otherwise updated on a schedule, limiting the work for IT.
The view is quite ubiquitous; most dashboards and data marts boil down to the distribution of views across the business. The problem is that the data science team wants ALL THE DATA, as much of it as possible, and may want to add and remove fields as they go.
Imagine how IT will feel if you ask them to change their views and all their dashboarding routines to serve only a small subset of the business.
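To illustrate the limitation, here is a small sketch of a view in SQLite (the table and column names are invented for the example): the view exposes only what one dashboard needs, so every extra field the data science team wants becomes another change request to IT.

```python
import sqlite3

# Illustrative schema: a projects table with more columns than any one
# dashboard needs. All names here are made up.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE projects "
           "(id INTEGER, name TEXT, cost REAL, owner TEXT, notes TEXT)")
db.execute("INSERT INTO projects VALUES (1, 'Alpha', 1200.0, 'ops', 'legacy')")

# A view trims the data down to the minimum one dashboard requires...
db.execute("CREATE VIEW project_costs AS SELECT name, cost FROM projects")

view_cols = [c[0] for c in db.execute("SELECT * FROM project_costs").description]
table_cols = [c[0] for c in db.execute("SELECT * FROM projects").description]
# ...so any new field the data science team wants is absent from the view
# until IT changes it.
```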
Therefore we get to another conclusion: the sandbox will need all, or at least a fair chunk, of the data copied across to a separate location. Copying all the data is a large job every night, but hopefully it is smaller than the continuous change requests IT would otherwise be subjected to, and IT can scale the refresh back to, say, weekly if needed.
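As a sketch of that copy job, Python's sqlite3 module can bulk-copy one database into another with Connection.backup; in practice the source would be the production warehouse and the target a file on the sandbox server, but in-memory databases (with made-up data) stand in for both here.

```python
import sqlite3

# Hypothetical nightly job: copy the whole production database into a
# separate sandbox so analyst queries never load the production system.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
prod.execute("INSERT INTO sales VALUES (1, 99.0)")
prod.commit()

sandbox = sqlite3.connect(":memory:")
prod.backup(sandbox)  # bulk copy; in practice this would run on a schedule

# Analysts can now hammer the sandbox without touching production.
count = sandbox.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```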
This is where we get to saying a good view doesn't mean good architecture. IT doesn't know what the business wants until the business tells them, and ironically in some cases the business doesn't even know what data it has until it has played with it. Moving from views as the solution to an architecture of roles and responsibilities has the benefit of getting the business involved in the data.
So What Is It?
The data analytics sandbox is probably best defined by what it isn't. It isn't a view. It could be as simple as a hard disk location with enough data dumped out. But it must be free for the business to engage in data science operations with a minimum of bureaucracy, and that requires forethought.
In some cases the data analytics sandbox is a whole data warehouse exported to a new location; in others it is just a little data decided upon by the team. So complexity is not what defines it. It is defined as having enough data that the team is satisfied they will not have to ask for more and thereby delay the project.
It is not an OLAP cube. Knowing the average cost of projects or the median time carries little importance within data science, which tends to focus on the row level of individual objects. You can build regressions or classifications about, say, a project, but an average, a value describing the population, holds little value in predicting individual projects. Therefore a hypercube of aggregate functions (means, modes, medians) of the kind OLAP or dashboards use is less important than atomic data. That said, the sandbox can serve as the basis for producing dashboards, as naturally the data team should support reporting.
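A tiny, made-up illustration of why aggregates fall short: the population mean assigns every project the same predicted cost, so it cannot separate the outlier that the row-level data makes obvious.

```python
# Made-up project costs: two typical projects and one expensive outlier.
costs = {"proj_a": 100, "proj_b": 100, "proj_c": 700}

# The OLAP-style aggregate: one mean for the whole population.
mean_cost = sum(costs.values()) / len(costs)  # the same value for every project

# Predicting each project with the population mean misses the outlier
# entirely; only the row-level (atomic) data can separate it out.
errors = {p: abs(c - mean_cost) for p, c in costs.items()}
```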
Likewise, it is not a production system: concerns about latency and speed aren't a factor. It simply needs to hold the data and be available, so it can be as bare-bones as required.
It isn't necessarily a test area either. Many businesses keep test areas for integrating components into the business. These locations are probably very close to the data analytics sandbox, but they are often abandoned and forgotten between changes. An analytics sandbox stays in use for as long as the data team using it continues operations. Given that AI models should periodically be re-analysed to make sure they still function and the environment hasn't changed, the comparison falls short.
It isn't a data mart: a series of set views controlled by IT does not allow the data science team to experiment freely, and makes them closer to a function of IT than of the business. The whole point is that data should serve executive insight; insight naturally supports business decisions, and placing data knowledge workers closer to executive knowledge workers complements both. Executives request insight about the business, and the sandbox supplies it.
For the same reason it is not a series of views, it cannot be said to be a collection of dashboards. You could use it as the area where the business builds and tests ideas for dashboards that IT then implements; but it would be wrong to call it a series of static dashboards, as the service supplied to the business is agile, ad hoc querying of the business's data.
It could be a SQL database or a huge pile of CSVs; the point is that a small number of people can dig into the data and answer complex questions for the business without bothering IT, because IT has pre-screened the security issues in advance.
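A sandbox really can be that humble. As a sketch, here is a CSV dump of one illustrative table using only the Python standard library; an in-memory buffer stands in for a file on the shared sandbox drive.

```python
import csv
import io
import sqlite3

# Illustrative table; names and values are invented for the example.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)",
               [(1, "north"), (2, "south")])

# Write a header row plus every atomic row out as CSV.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "region"])
writer.writerows(db.execute("SELECT id, region FROM customers"))

dump = buf.getvalue().splitlines()
```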
There is a saying: "to go fast, go well". That is the conclusion I take from approaches like Agile. It is only by setting up the sandbox and "going well" that we can "go fast". To explain how the analytics sandbox supports this, I now turn to the twelve principles of Agile.
It is Agile
To underpin why this might be important, let's compare against the Agile principles below and consider whether you could really do data science with agility without a separate environment. Keeping interactions focused on small teams and on the business department progressing the data science project is a benefit.
With an analytics sandbox and easy access to data for ad hoc queries and testing, a small team can focus on agility; a tightly controlled, locked-down framework cannot deliver that.
See the Agile principles below:
1st Principle: “Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.” Keeping the sandbox a business product, not an IT product, supports this.
This also covers the 2nd Principle: “Welcome changing requirements, even late in development.”
And it enables the 3rd Principle: “Deliver working software frequently.”
4th Principle: “Business people and developers must work together daily throughout the project.” This seems a bit easier if the business owns the sandbox.
5th Principle: “Build projects around motivated individuals. Give them the ENVIRONMENT and support they need, and trust them to get the job done.” Give them the environment.
6th Principle: “The most efficient and effective method of conveying information is face-to-face conversation.” Okay, you don't need a sandbox for this; Agile sees documents as placeholders for conversation.
7th Principle: “Working software is the primary measure of progress.” A sandbox allows a split between production and test environments that underscores this. Any work by IT should deliver business value, and endless changes to views does not.
8th Principle: “Agile processes promote sustainable development. The sponsors, developers, and users should be able to maintain a constant pace indefinitely.” They do this best in an environment they own.
9th Principle: “Continuous attention to technical excellence and good design enhances agility.” Okay, not this one.
10th Principle: “Simplicity, the art of maximising the amount of work not done, is essential.” A sandbox allows the development team to take shortcuts; the results are assessed by IT and the business at the end, when work moves to production.
11th Principle: “The best architectures, requirements, and designs emerge from self-organising teams.” A sandbox is managed at the level of the team.
12th Principle: “At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behaviour accordingly.” A team cannot change its behaviour if it is beholden to central IT functions.
The Dreaded Shadow IT
It is worth executives and CEOs making a conscious decision to create analytics sandboxes. The main aim of giving a team easy access to data is that it reduces the time knowledge workers need to produce insights. Rather than a business department contacting IT and asking for changes (and it will be plural), they can just do the task and raise it with IT for implementation in production afterwards.
Finally, it can be concluded that these environments will only be built as a conscious action on the part of the business. The security implications of giving access to large swathes of data require thought and architecture, with limitations so that no user can promote work into production without checks and balances.
The point here is that any hardware, software, or data IT doesn't control becomes shadow IT and a security risk to the business. Despite sounding somewhat disparaging of IT, the point of this article is that we need to create controlled, managed, safe spaces in our IT architecture, because the alternative may be worse.
The literature on the analytics sandbox is clear: the business owns and controls the space; it installs different applications there, plays around with things, and makes mistakes; but it is monitored and security-checked by IT.
Conclusion
The final conclusion, then, is that as a precondition to building great data science, businesses will need more sandboxes. There is a chance that other businesses, through APIs and other tools, will share their new AIs. It isn't impossible that a business-to-business marketplace for AI and insight may develop, but it doesn't seem a given, and if a given AI is a differentiator, why would a business allow its competition to have its tools?
Even if that market develops, it raises the question of who will be the buyers and sellers, and only the sellers will need to develop data science capabilities. The others might get away with highly secured databases locked away in IT. Are you in the seller or the buyer camp?
So there are plenty of reasons for most businesses to ponder an AI strategy, and to consider whether, and for which departments, a data analytics sandbox would be appropriate.
But without the humble data analytics sandbox there is a stark choice for executives: will they find a way to let data be explored, iterated over, and played with, in the hope their staff can generate new AI and insight? Or do we carry on with limited access to views and lots of change requests?
I conclude that businesses and executives need to consider their AI strategies. If they are going to adopt data science fully, then building a data analytics sandbox appears a no-brainer; the alternative is to wait for someone to work out how to build AI on warehoused data you won't give them access to. That seems unlikely to happen. To modernise, businesses will need to take risks and change their architecture.
There is a cherished saying in a team I am in: "go have a play". It means take the data or the application, read some forums, don't worry about breaking it, and figure it out. It has turned into our standard method, because often the business isn't quite sure what IT can do, and IT doesn't fully understand what the business staff want, so the solution appears to be "go have a play".
In short, this is a plea from a data analyst: please can the data come out to play?