“Doctor Who: You want weapons? We're in a library. Books are the best weapon in the world. This room's the greatest arsenal we could have. Arm yourself!” (from Tooth and Claw, Season 2) ― Russell T. Davies
Bad libraries build collections, good libraries build services, great libraries build communities. It is in this vein of thought that I want to create a list of AI libraries and resources that are useful and give them a broad review. If you have a library that I really ought to include, please do let me know. If this becomes too large for the various reviews I may split it into multiple parts.
How to split the libraries
I think there is a broad typology of libraries useful for data science. The problem is they often overlap. The below is how I would split them up; it may help you find different libraries similar to the ones you're using, or discover new ones.
Machine Learning
-Machine Learning Libraries: usually big, containing lots of different algorithms of the statistical-learner type. Statistical learners are things like clustering algorithms: they use maths or statistics to learn relationships within the data. I suggest they differ from artificial intelligence because they make no effort to parallel the way a brain works. This often (but not always) makes them less accurate, though it usually makes them much easier to explain to other people; even to people fully trained in data science, neural networks remain hard to explain.
Examples: TensorFlow, scikit-learn, XGBoost, NLTK
Best in class: scikit-learn; frankly it is too much to cover here and needs its own review.
Learning Resources: Kaggle.
Artificial Intelligence
-Artificial Intelligence: often split out into their own specialist libraries, though implementations will often also be included within what I called the machine learning libraries above.
Examples: Keras, PyTorch
Examples that, while not specifically ANN-focused, do include them: TensorFlow includes Keras.
Best in class: I quite like PyTorch, but I feel I need to look further and spend some time with the different options.
Learning Resources: Kaggle, PyTorch main website, and Keras main website.
Visualisation
-Data Visualisation Libraries: I also lump in the more general "machine learning explanation" libraries that exist, which to my mind are graph and visualisation libraries specific to machine learning purposes. These are important because no customer, manager or lay person who discusses machine learning will be convinced by talk of AI, but graphs and visualisations have a wonderful "aha" moment when used to explain complex things.
It is really hard to be serious about data and not love graphs. When I first became interested I largely believed the idea that AI was too hard to explain simply, yet over time I have seen plenty of simple graphs and visualisations that wonderfully and simply explain some intuitive thing in the data, which is often worth more than the automation achieved by AI.
Examples: Matplotlib, Seaborn
Best in class: Matplotlib and Seaborn; they are basically joined at the hip.
Learning resources: Kaggle
Linear Algebra
-Linear Algebra: libraries used to implement matrix calculations, often using compiled C/C++ code or GPU acceleration, designed to give Python users faster speeds with less fiddling when developing linear algebra code. Linear algebra is used to implement the layers within a neural network and has all sorts of supplementary uses in machine learning.
Examples: Theano, NumPy, tensors from PyTorch
Best in class: NumPy, but I am tempted by PyTorch tensors' GPU acceleration.
Learning resources: You'll pick NumPy because it's all over the place on Stack Overflow and other resources.
Utility
-Utility: there is, I feel, a group of libraries often useful to data science which deal with storing and managing data. A large variety of libraries simplify data cleaning, storing, and management for different users. It is often forgotten by IT that the data stored in SQL databases is normalised to third normal form. This makes the data take up the minimum space in the database, but it does not make the data easy to use or to join to other data. That is before we consider that data contains mistakes and other problems, and so for all sorts of reasons may need cleaning and manipulation after first extracting it from the database, before any analysis can take place.
Therefore utility libraries are very useful, as they reduce the time that data scientists and analysts spend formatting, cleaning, and looking at data before they move on to building models.
Examples: Pandas, NLTK
Best in class: I haven't decided yet; I might cheat and say best in class is whatever works for you. What has worked for me were the Python primitive data types of list and dictionary. Though that would be catty, and utility libraries do (strangely) have utility.
Learning Resources: Kaggle
Specialist sub-classes
Some libraries don't fit into a nice broad class but rather contain lots of little tools and bits that make them best utilised for a specific task within data analytics. Below is a quick breakdown of these tasks and the libraries that can be used.
Natural Language Processing: is about analysing, tagging, sentiment analysis and generally grouping text. Most organisations have piles of unstructured, unmined text just waiting for an analyst to put it to good use.
Examples: NLTK, spaCy
Learning resources: Kaggle
Computer Vision: utilising an artificial intelligence library, these add extra tools for interpreting pictures or videos and classifying them.
Examples: Keras
Learning Resources: Kaggle
Typology Conclusions:
While I think all the below are great libraries, and I have my preferences, there are a few clear takeaways. Most of the utility libraries can be bypassed, though your projects may subsequently take longer; that is OK, as it takes some upfront effort to learn every utility out there. If you want to learn machine learning, go with scikit-learn and get Seaborn so you can follow the examples on the scikit-learn website.
When it comes to AI I am still making up my mind: Keras is probably easier to learn, while PyTorch looks closer to how I understand AI. NumPy is what I actually use when I can't be bothered. There is also fastai, which is on my radar and which I need to look into.
The Libraries
Below I try to give a brief, plain-language explanation of each library under discussion here. If you are aware of any others, please let me know and I will try to get through them.
TensorFlow
Language: Python
Type: Neural networks, joined at the hip with Keras and supplying other tools.
The website for TensorFlow can be found here: TensorFlow. The actual neural network processing is done by the Keras library. The proposed magical selling point is the ability to call the created model through JavaScript in the user's browser. That sounds powerful, but I am not convinced that big models will run quickly and easily in the browser, and it is something I am hoping TensorFlow will convince me of.
The fact that it has things built on top of other things means that the imports look like the four below. This struck me as concerning, as I have in the past built neural networks in Python purely with the NumPy library (and once without any help), for which I admit I had to build an actual neural network class. That is something I recommend any programmer does at some point, because regardless of the ease of using an API and someone else's library, I feel it is best to learn a piece of machine learning by building one yourself so the maths becomes familiar.
CODE:
import numpy as np
import tensorflow as tf
from tensorflow import keras          # Keras ships inside TensorFlow
from tensorflow.keras import layers   # layer classes used to build models
So if TensorFlow doesn't let you get hands-on, the pressing question seems to be: what other goodies does it give you? You can go and pay money on Coursera: DeepLearning.AI TensorFlow Developer Professional Certificate | Coursera. Honestly, I stopped with TensorFlow there; there are multiple different AI libraries out there, too many to spend long learning any one of them.
Then it turns out Keras is the main library used for AI within TensorFlow, so you know, just google Keras instead of TensorFlow.
Keras
Language: Python
Type: Neural Network Library
Annoyingly, Keras says it was built on top of the TensorFlow library, but TensorFlow says it was built on top of Keras. I wonder which came first, the chicken or the egg. See above for the effective stack.
Keras does a better job of explaining how to use it, and by extension about TensorFlow, and its website is here: Keras: the Python deep learning API. The style is in the format of normal Python developer documentation with a healthy dose of code examples rather than links to other people's videos. Keras also did a better job of selling itself: Why choose Keras?. Lesson learned: if you want to learn about TensorFlow, ignore TensorFlow and start with Keras.
"Keras is an API designed for human beings, not machines. Keras follows best practices for reducing cognitive load: it offers consistent & simple APIs, it minimizes the number of user actions required for common use cases, and it provides clear & actionable error messages. It also has extensive documentation and developer guides.".
And I support this view. I feel there is nothing here that I couldn't implement with NumPy alone, and probably implement from scratch if I really spent some time writing the code (though I'd probably make a mistake or two). But Keras has all of this ready out of the box. This is the best of the batteries-included-as-standard model espoused across the Python ecosystem, and especially for computer vision it is indispensable for saving you a lot of time.
Layers and activation functions are handled by creating objects called "layers", and this is one of the great things about Keras. As you can see below, the amount of code needed to train a model is very compact.
Code:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Split the data into train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
# Reshape the images into flat vectors and scale pixel values to [0, 1]
x_train = x_train.reshape(60000, 784).astype("float32") / 255
x_test = x_test.reshape(10000, 784).astype("float32") / 255

# A small dense model, added here so the snippet runs end to end
model = keras.Sequential([layers.Dense(64, activation="relu"), layers.Dense(10)])
model.compile(loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer=keras.optimizers.RMSprop(), metrics=["accuracy"])
history = model.fit(x_train, y_train, batch_size=64, epochs=2, validation_split=0.2)

test_scores = model.evaluate(x_test, y_test, verbose=2)
print("Test loss:", test_scores[0])
print("Test accuracy:", test_scores[1])
Theano
Language: Python
Type: Linear Algebra
Theano is a Python library that lets you define mathematical expressions used in machine learning, optimise those expressions and evaluate them very efficiently, decisively using GPUs in critical areas. It can rival typical full C implementations in most cases.
code:
import theano
from theano import tensor

# Define two double-precision scalar symbols and their sum
a = tensor.dscalar()
b = tensor.dscalar()
c = a + b

# Compile a callable function that evaluates c from a and b
f = theano.function([a, b], c)
d = f(3.5, 5.5)
print(d)
That doesn't appear very impressive, though it does get better and implements all the things you would expect of NumPy. Where this becomes interesting is that Theano can produce a visual graph of the calculations and then proposes to calculate and compile faster methods of managing that graph, utilising GPU acceleration. The graph is a great visual representation of the calculations being undertaken.
theano.printing.pydotprint(f, outfile="scalar_addition.png", var_with_name_simple=True)
I quite like the graphing method. I don't think it matters as much with a neural network, but I can see the attraction for the type of complex regression modelling that is particularly used in the social sciences. I have been thinking of using it to replace my use of NumPy, though I could not see any extended-precision floats.
Where I became a bit sceptical was where it referenced NumPy for its handling of variables. I guess whether it is better than NumPy will be something to look into later, but I feel sceptical when the replacement relies on the thing it is meant to replace.
Scikit-Learn
Language: Python
Type: Machine Learning
To me, if Keras is the specialist neural network stall, then scikit-learn is the diner with the large menu.
Let's list what you can do: biclustering, regression, calibration, classification, clustering, dimensionality reduction, model selection, preprocessing, covariance estimation, cross decomposition, decision trees, decomposition, ensemble models, feature selection, Gaussian mixture models, Gaussian processes for machine learning, generalised linear models, inspection, kernel approximation, manifold learning, miscellaneous, missing value imputation, model selection tools, multioutput methods, nearest neighbours, neural networks, pipelines and composite estimators, preprocessing, semi-supervised classification, support vector machines, text classifiers.
Phew... suffice to say scikit-learn is a great place to start with machine learning, and you are unlikely to need to move outside it and use other tools for some time. Scikit-learn uses Matplotlib and Seaborn extensively, and most of the tutorials (of which the website gives a few for each class) use Matplotlib.
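To give a flavour of how compact a typical workflow is, here is a minimal sketch using the bundled iris dataset and a random forest; the dataset and model are just illustrative choices on my part, not a recommendation.
CODE:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit an ensemble model and score it on the held-out data
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))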
A full blog post going through the whole library methodically is something that I am planning as there is so much to learn from their website.
PyTorch
Language: Python
Type: Neural Network code
The website tutorial looks very good: Tensors — PyTorch Tutorials 1.7.1 documentation. If PyTorch is for you, stop reading and get on the website!
PyTorch is a neural network library that also includes tensors that work like NumPy or Theano matrix calculations (so you could build a network from scratch, or not). Importantly, PyTorch tensors are set up to use GPU acceleration right out of the gate; points for batteries included as standard!
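A minimal sketch of what that looks like; the GPU branch only runs if CUDA is actually available on your machine.
CODE:
import torch

# Tensors behave much like NumPy arrays
a = torch.rand(3, 3)
b = torch.rand(3, 3)
print(a @ b)

# ...but can be moved onto the GPU when one is available
if torch.cuda.is_available():
    a = a.to("cuda")
    b = b.to("cuda")
    print((a @ b).device)  # prints cuda:0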
PyTorch includes the ability to create datasets of .png files for use with computer vision. There is a wonderful guide on the website; I was all set to complain that it used too many lines of code and should use only one or two like Keras. I then realised the data was loaded in two lines, and PyTorch had the capability, using Matplotlib, to show a random collage of the data you just loaded to clarify what you loaded. No making the mistake of loading the holiday snapshots instead of the MNIST handwritten character dataset!
PyTorch comes with the ability to transform the loaded pictures for your computer vision work, meaning there are a number of functions to augment your data and bulk it out prior to use.
Downsides: I didn't spot anything for NLP, such as long short-term memory (LSTM) layers.
The actual code for creating a neural network is below. I was completely thrown at first, because you subclass a neural network class that inherits from a whole module defining the rest of the neural network and its functions, and you rewrite the pieces of the network you don't like to taste. I shouldn't like this; it feels wrong, like an abuse of object-oriented programming and an ignoring of functional programming technique, but I can't lie, I really like it...
CODE:
import torch
from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        # Flatten each image, then pass it through the stack of linear + ReLU layers
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits
The backward propagation looked funky to me. It felt like code I might have written myself, and yet despite this there was the creation of a named optimiser, a loss function, and several other pieces in a function that looks too verbose to me. I think the neural networks I wrote myself were simultaneously fewer lines of code and yet also less reliant on another library for supporting functions.
PyTorch, please give a standardised implementation of backward propagation and maybe let the user rewrite it using inheritance and object-oriented programming. Though I don't want to leave PyTorch on a bum note: it charmed me more than Keras, and it does come with the ability to push your models towards sharing and production online. You can even give it extra points for the supplementary libraries that help.
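To show concretely what that optimiser/loss/backward step looks like, here is a minimal sketch of a single training step; the tiny model and random data are stand-ins of my own, purely so the example runs on its own.
CODE:
import torch
from torch import nn

# A tiny stand-in model and a fake batch of data, just to make the sketch runnable
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
X = torch.randn(64, 28, 28)       # a fake batch of 64 "images"
y = torch.randint(0, 10, (64,))   # fake labels

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

pred = model(X)        # forward pass
loss = loss_fn(pred, y)
optimizer.zero_grad()  # clear gradients from the previous step
loss.backward()        # backward propagation fills in the gradients
optimizer.step()       # update the weights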
NumPy
Language: Python
Type: Linear Algebra Library
NumPy is the oldest and most common library for linear algebra. You should use it... if only for the reason that when learning with it there is plenty of material and information on the internet, and when you make a mistake people will help you. It is backed by compiled C code, so it is optimised for speed. NumPy is so widespread that several of the neural network libraries either in part or in full rely on or reference NumPy.
Download NumPy; it is so common I almost don't need to tell you where to learn it, as it will just come up on Stack Overflow.
And if you disagree, a simple Google search shows many end-to-end guides on building a whole ANN class using just NumPy, like this one: Implementing Recurrent Neural Network using Numpy | by Rishit Dholakia | Towards Data Science. In fact, often there will just be the code.
Such an implementation really only comes with the downside that if you start building your own classes you will no longer know how to talk to those who use the more standardised libraries. That being true, it is still beneficial to build a few neural networks using NumPy yourself, for the simple reason that if you can, then you really do understand how they work.
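As a flavour of how little is needed, here is a minimal sketch of a single forward pass through a two-layer network; the layer sizes and the sigmoid activation are arbitrary choices for illustration.
CODE:
import numpy as np

def sigmoid(z):
    # Standard logistic activation
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))      # a toy input vector
W1 = rng.normal(size=(8, 4))   # weights for a hidden layer of 8 units
W2 = rng.normal(size=(3, 8))   # weights for an output layer of 3 units

hidden = sigmoid(W1 @ x)       # forward pass through the hidden layer
output = sigmoid(W2 @ hidden)  # forward pass through the output layer
print(output)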
Pandas
Language: Python
Type: utility for data cleaning.
I am not the biggest fan of Pandas. The idea of it seems very much to be taking data, usually out of a CSV, and giving a variety of tools to filter it and do data engineering. I have also had bad experiences with Pandas, ironically because it is incredibly useful for filtering and analysing data and makes that so simple that there is often a problem when you start making a more complex query and suddenly get an error about it.
The standard methodology for data analytics is Extract, Transform, Load, and I feel that Pandas is a great way to do extract, transform and load all in one place. That being said, I cannot lavish praise on Pandas, because I have often simply used Python lists; why create a library for what a primitive data type can do is my thought.
This being said, if you go on Kaggle you can see the value of Pandas: Kaggle: Your Machine Learning and Data Science Community. They show how loading data from a CSV into a single variable and then filtering and changing that variable, using a limited syntax and a small number of functions while undertaking data cleaning, makes life much easier. Though I am spoiled by mostly using APIs and rarely having to use the massive piles of CSVs that Pandas seems built for. Likewise, I can see many instances where a more direct connection to SQL would serve just as well.
My conclusion: queries are for SQL when you first extract the data, and the results can be stored in lists. As the old advert might put it, for everything else there's Pandas; if in fact anything regarding ETL is left that is not already covered by SQL and the humble list...
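For completeness, here is the kind of load-and-filter step those Kaggle notebooks lean on; the file name and column names are made up purely for illustration.
CODE:
import pandas as pd

# Hypothetical CSV with "age" and "city" columns
df = pd.read_csv("customers.csv")

# Filter rows, derive a new column, and summarise in a few lines
adults = df[df["age"] >= 18]
adults = adults.assign(is_local=adults["city"] == "London")
print(adults.groupby("is_local")["age"].mean())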
Seaborn
Language: Python
Type: Graphs and visualisations
I like graphs and visualisations, yet for a long time I used Python without really using the full abilities of a visualisation package and relied on Excel and the like. Iron Maiden wrote a single called Wasted Years; you really need to learn Matplotlib and Seaborn. You'll notice that I just spoke about them as one thing. Well, they almost are, as Seaborn more or less just adds detail on top of Matplotlib.
You can see a sample below, or you can look at the online gallery, which strikes one like an abstract art gallery with the various shapes as different ways to show data: Example gallery — seaborn 0.11.2 documentation (pydata.org).
CODE:
import matplotlib.pyplot as plt
import seaborn as sns

# df is assumed to be a DataFrame; X_Label and Y_Label hold the names of its columns
sns.scatterplot(data=df, x=X_Label, y=Y_Label)
plt.xlabel('label string')
plt.ylabel('label string')
# x_nd is assumed to be defined earlier; it just tags the output file name
plt.savefig('C:\\Data\\graphs\\galactix_simulations_' + str(x_nd) + '_' + str(x_nd) + '.png')
plt.close()
The code is simpler than you might think; it could be shortened to just the call to sns (in this case a scatterplot) and a place to save it. You can see how Seaborn and Matplotlib work together: once Seaborn creates the graph it can be manipulated using pyplot (imported as plt). You need to close an old graph, as if you don't it will keep drawing the next graph on top of the previous one. Trust me, this comes in useful, but it is annoying the first few times you forget and mess up your graph.
Honestly, I am no expert. I have in the past done a review of ggplot2 on my YouTube channel, and I feel Seaborn doesn't compare that well to the grammar of graphics pioneered by ggplot2. That being said, I use Seaborn for Python and nothing else, so I am clearly nit-picking, and I would rather have the machine learning, and Python syntax, for everything else.
XGBoost
Language: Python, R, JVM, Ruby, Swift, Julia, C and C++
Type: Machine learning of a very specific sort: boosted decision trees.
Website here Get Started with XGBoost — xgboost 1.5.1 documentation.
I am unsure that this really grips me. I feel that I can build boosted decision trees with scikit-learn, and I don't think I need another library specifically for that one algorithm. That being said, boosted decision trees are one of the methods that can sometimes benchmark alongside a neural network, they come with slightly improved explainability (though I think some of the newer methods close this gap), and because they are made up of lots of decision trees they are well suited to a distributed environment where multiple CPUs can be used.
Boosted decision trees are interesting. As a very high-level description (if you need more, go on Google), they are a group of decision trees produced by entropy calculations, and as a group the different decision trees correct for each other's mistakes. The easiest comparison is the random forest, in which each decision tree is produced from a different random split of the data and the final decision is voted on, with the majority winning.
Decision trees are often best used for classification (vote on which class a thing is), they can be used for regression (a linear regression can be calculated at the end to give a number), and they can even be used for AI; the decision tree leads to a decision (surprisingly).
You can see this is where XGBoost is targeted: with a distributed version it is well suited to Hadoop or YARN distributed environments. So while I don't see a need for XGBoost myself, I think it is more that I have generally only had to work with SQL databases and not large distributed Hadoop stores (though I do have training on them). At the end of the day I don't have big enough data to need a Hadoop ecosystem; I am not Google. If you are Google, maybe you will love XGBoost.
CODE:
import xgboost as xgb
# read in data
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')
# specify parameters via map
param = {'max_depth':2, 'eta':1, 'objective':'binary:logistic' }
num_round = 2
bst = xgb.train(param, dtrain, num_round)
# make prediction
preds = bst.predict(dtest)
NLTK
Language: Python
Type: machine learning and utility with an overriding emphasis on NLP (natural language processing).
Website here: NLTK :: Sample usage for wsd. Though I found this website a bit hard to follow, so...
Other websites might explain it better: Natural Language Processing With Python's NLTK Package – Real Python.
NLTK stands for Natural Language Toolkit. NLP, or natural language processing, is a common task for analysing unstructured text. While again I am pretty sure you can do this in scikit-learn (there may be a theme emerging here), natural language processing is very different, and a specific library for it is welcome and refreshing.
All businesses have stacks of unlabelled, unprocessed (and arguably unloved) free unstructured text that might hold lingering benefit for the business if just given to a data analyst. Just think on it: if Google can index the whole internet, why can your business not do the same for its data? Surely some benefit can be derived.
NLTK allows the tokenising of strings, i.e. the splitting up of a piece of text into its constituent words, and also the filtering of stop words. Stop words are words with less importance to the analytics task, and are as such stopped and filtered out. There are generally agreed lists of words considered superfluous by linguists.
You can also stem words, and by that I mean use a stemmer to simplify a word down to what we think is the closest to its root meaning that we can get. This may sound pseudoscience-y until you start to consider language itself. Words like consult, consultant, consulting and consultancy might all be thought of as sharing the stem consult; doing so simplifies a range of tagging and analytic tasks, because now instead of looking for everything about consulting you just look for the stem consult.
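To make tokenising, stop word removal and stemming concrete, here is a minimal sketch; the sentence is made up, and the resource downloads only need to run once.
CODE:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download the tokenizer models and stop word lists (a one-off step)
nltk.download("punkt")
nltk.download("stopwords")

text = "The consultants were consulting on the consultancy work."
tokens = word_tokenize(text.lower())  # split the sentence into words
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in filtered])  # stems such as 'consult' and 'work'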
I consider much more complicated NLP to be a whole subset of skills distinct from other analytics and machine learning tasks. If you are only going to need bag of words (counting the words or stems), then don't use NLTK; writing it yourself and subsequently getting into regular expressions will make you a better programmer. Those sorts of tasks aren't really fully NLP; they are more about gathering statistics on a corpus (a set of texts). That being said, if you need to do lemmatising or chunking, and want to create trees showing the possible meanings of the words you are analysing, then NLTK is a great resource.
I also learned what a lexical dispersion plot is while researching this, and now want to use them more often.
spaCy
Language: Python
Type: natural language processing aimed at creating pipelines for feeding Keras or PyTorch ANN language models.
Website: spaCy · Industrial-strength Natural Language Processing in Python.
Learning Resources: Kaggle
I am sure that NLTK can create inputs for neural networks and other machine learning models, but spaCy bills itself as being specifically for this, creating pipelines to do it methodically. A finished pipeline will take text, clean it, and present the data to the piece of machine learning. This can obviously be useful: say you think esteemed business mogul Elon Musk's next tweet will send a stock price soaring, a web scraper could download the text, clean it, and pass it to the AI to make the decision (just an example).
I think I prefer spaCy over NLTK. I like the idea of building pipelines, and while the two do the same thing and use the same processes, I just find myself leaning towards spaCy; it might be because their website is a bit more stylish. In defence of NLTK, though, I haven't seen in spaCy the visual lexical graphing and charting tools that NLTK brings to the table.
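Here is a minimal sketch of what a spaCy pipeline gives you out of the box; it assumes the small English model has already been installed with python -m spacy download en_core_web_sm, and the sentence is made up.
CODE:
import spacy

# Load the small English pipeline (tokeniser, tagger, parser, named entity recogniser)
nlp = spacy.load("en_core_web_sm")
doc = nlp("The shares of the company rose sharply after the announcement in London.")

# Each token comes back with its lemma and part-of-speech tag
for token in doc:
    print(token.text, token.lemma_, token.pos_)

# Named entities picked out by the pipeline
print([(ent.text, ent.label_) for ent in doc.ents])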
For future updates: Gensim, FANN, ffnet, OpenCV, SimpleCV, PyCLIPS, Experta, AirSim, Carla, Bullet, FastAI, MXNet, JAX.
C++ Libraries, OpenNN, Caffe, mlpack Library, Microsoft Cognitive Toolkit (CNTK), DyNet, Shogun, FANN, SHARK Library, Armadillo, captum, Stanford CoreNLP, TextBlob, Gensim, Pattern, Polyglot, PyNLPl, Vocabulary