Data science basics

Frame the concepts: Big Data, Machine Learning, NLP, NoSQL, Graph Databases, Data Visualization

Big Data

A collection of datasets so large that it is difficult to process them with traditional techniques such as an RDBMS
The rule of Vs:
Volume - how much data is there?
Variety - how diverse are the types of data?
Velocity - at what speed are the new data generated?
(Veracity - how accurate is the data?)

Big Data - Tools and Techniques

Apache Hadoop - The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
Python in Hadoop® ecosystem:
Use the Hadoop® Streaming API to run Map-Reduce jobs with plain Python scripts (see the mapper sketch below):

				$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
				-input myInputDirs \
				-output myOutputDir \
				-mapper myPythonScript.py \
				-file myPythonScript.py
			
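A minimal word-count mapper that could be passed as myPythonScript.py above - a sketch only; the real mapper logic depends on your job:

				#!/usr/bin/env python
				# mapper.py - reads lines from stdin and emits "<word>\t1" pairs;
				# Hadoop Streaming groups the pairs by key before the reduce step
				import sys

				for line in sys.stdin:
					for word in line.strip().split():
						print("%s\t%d" % (word, 1))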

Machine Learning

Machine learning is a field of computer science that uses statistical techniques to give computer systems the ability to "learn" (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed

Machine Learning

Coined in 1959 by Arthur Samuel
Algorithms that can learn from data and make predictions on new data
Supervised and Unsupervised Learning
Combines the advantages of computational statistics, data mining, big data, neural networks, logic programming and rule-based computing

Machine Learning with Python

numpy
pandas
TensorFlow
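A minimal sketch of how these libraries fit together - numpy generates synthetic data and a tiny TensorFlow model (the 2.x Keras API is assumed) is fitted to it; the data and model are illustrative only:

			import numpy as np
			import tensorflow as tf  # assumes TensorFlow 2.x

			# synthetic data: y = 2x + 1 plus a little noise
			x = np.linspace(0, 1, 100).reshape(-1, 1)
			y = 2 * x + 1 + 0.05 * np.random.randn(100, 1)

			# a single-neuron linear model trained with gradient descent
			model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
			model.compile(optimizer="sgd", loss="mse")
			model.fit(x, y, epochs=50, verbose=0)

			print(model.predict(np.array([[0.5]])))  # should be close to 2.0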

Natural Language Processing (NLP)

Conceptualised in the 1950s
ELIZA by Joseph Weizenbaum
MARGIE by Schank et al., 1975
Evolves rapidly with evolution of Machine Learning and Deep Learning
Google Translate
Intelligent Chat Bots

NLP with Python

NLTK
Stanford CoreNLP
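A minimal NLTK sketch - tokenizing a sentence (the sample text is made up; the 'punkt' tokenizer models must be downloaded once):

			import nltk
			nltk.download("punkt")  # one-time download of the tokenizer models

			from nltk.tokenize import sent_tokenize, word_tokenize

			text = "NLP evolves rapidly. Python makes it accessible."
			print(sent_tokenize(text))  # ['NLP evolves rapidly.', 'Python makes it accessible.']
			print(word_tokenize(text))  # ['NLP', 'evolves', 'rapidly', '.', 'Python', ...]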

NoSQL Data Stores

Try to solve the problems of storing and manipulating data that cannot be structured efficiently into tables
Variety of types, most popular:
Key-value pair
Document-oriented
Graph Database

Python and NoSQL stores

pickleDB for key-value databases
PyMongo for MongoDB document-oriented databases
Neo4j Python Driver for Neo4j graph databases
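A minimal PyMongo sketch (assumes a MongoDB server running on localhost; the database and collection names are made up):

			from pymongo import MongoClient

			client = MongoClient("mongodb://localhost:27017/")  # assumes a local MongoDB server
			db = client["demo_db"]                              # illustrative database name

			# insert a document into the (illustrative) fruits collection and read it back
			db.fruits.insert_one({"name": "apple", "price": 1.5})
			print(db.fruits.find_one({"name": "apple"}))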

Data Visualization

Matplotlib for classic visualizations
Bokeh for interactive visualizations capable of handling very large and/or streaming datasets.
Lightning python client for reproducible and real-time visualization with Python and d3.js
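A minimal Matplotlib sketch - a classic line plot of made-up data:

			import matplotlib.pyplot as plt

			x = [1, 2, 3, 4, 5]
			y = [1, 4, 9, 16, 25]

			plt.plot(x, y, marker="o")   # classic line plot
			plt.xlabel("x")
			plt.ylabel("x squared")
			plt.title("A minimal Matplotlib example")
			plt.show()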

Anaconda notes

The next sections demonstrate the usage of various Python data science packages. You can install them separately, as shown in the slides, or skip the installations and use the Anaconda Python distribution, which comes bundled with data science and machine learning related packages

JupyterLab - the next-generation web-based user interface for Project Jupyter

JupyterLab - the next-generation web-based user interface for Project Jupyter

Overview

JupyterLab is the next-generation user interface for Project Jupyter
It is a brand new implementation of the classic Jupyter Notebook, built on modern front-end technologies, which makes it possible to add new features and extensions
The notebook document format used in JupyterLab is the same as in the classic Jupyter Notebook

Overview

JupyterLab offers an IDE-like experience to users
It puts together most of the instruments a data scientist needs: code/text editors, terminals, image viewers, Python consoles.
And they work in a synchronised way.
JupyterLab allows third parties to write extensions (npm packages) for it through the JupyterLab public APIs

Install JupyterLab

Conda
conda install -c conda-forge jupyterlab
Pip
pip install jupyterlab
If installing per user (--user), you must add the user-level bin directory to your PATH environment variable in order to launch jupyter lab
Pipenv
pipenv install jupyterlab
In order to launch jupyter lab, you must activate the project’s virtualenv.
Reference: JupyterLab>Installation

Basic operations

Start the server

Make sure you are in the folder containing your project files
Activate the virtual environment and write:

				### start the server with Jupyter Lab front-end:
				jupyter lab

				### or
				# start the server with Jupyter Notebook front-end:
				jupyter notebook

				### in case of problems: generate new config file:
				jupyter lab --generate-config

				# now start the server:
				jupyter lab
			
The Jupyter interface will open automatically in your browser

Stop the server

Press CTRL+C in the terminal running JupyterLab

				Shutdown this notebook server (y/[n])? y
			
Or, to avoid typing 'y', press CTRL+C twice
Or just close the Terminal

Managing Workspaces

JupyterLab sessions always reside in a workspace
Workspaces contain the state of JupyterLab: the files that are currently open, the layout of the application areas and tabs, etc
The default workspace does not have a name and resides at the primary /lab URL
http(s)://<server:port>/<lab-location>/lab
All other workspaces have a name that is part of the URL:
http(s)://<server:port>/<lab-location>/lab/workspaces/foo
More on JupyterLab URLs and Workspaces: JupyterLab URLs

Working with Notebooks

Jupyter Notebook Basics
reference: Notebooks>Notebooks

Upload your Jupyter Notebook to GitHub

It is no different from pushing any other file to your GitHub account:
Add your changes to the stage
Commit those changes
Push the branch to your GitHub repo

				# add your changes to the stage:
				$ git add -A
				# commit those changes:
				$ git commit -m'__Describe your changes here__'
				# push the branch (assuming "master" below) to your GitHub repo:
				$ git push origin master
			

Share/Render your JupyterNotebooks (GitHub Render)

You can use github.com to host your .ipynb files for free
GitHub can render .ipynb files. For more details check github docs
When you click on a file with the .ipynb extension, GitHub will try to render the notebook in your browser
You can copy that URL and share it with anybody
Like that: JupyterNotebook_basics

Share/Render your JupyterNotebooks (nbviewer)

The preferred way

nbviewer.jupyter.org is a Web Service to render .ipynb files
You can use it to render .ipynb files hosted on GitHub, Google Drive, Dropbox, ...
Like that: JupyterNotebook_introduction

numpy

numpy

numpy overview

numpy is a python package that adds support for:
large, multi-dimensional arrays and matrices
a large collection of high-level mathematical functions to operate on these arrays
Written mainly in C
numpy arrays and operations are much faster than Python's list equivalents.
pandas is built on top of numpy

data types

numpy supports much finer-grained data types than Python does
Data types @docs.scipy.org
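For illustration, the dtype can be set explicitly when an array is created (the default integer dtype is platform dependent):

			import numpy as np

			a = np.array([1, 2, 3])                  # dtype inferred (int64 on most platforms)
			b = np.array([1, 2, 3], dtype=np.int8)   # 8-bit integers - much smaller in memory
			c = np.array([1, 2, 3], dtype=np.float32)

			print(a.dtype, b.dtype, c.dtype)  # e.g. int64 int8 float32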

N-Dimensional Arrays

The main object in numpy is the homogeneous multidimensional array
It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers, one per dimension.
In numpy, dimensions are called axes
The number of axes defines the rank of the array (see the short example below)
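A short illustration of axes and rank on a 2-dimensional array:

			import numpy as np

			# a 2 x 3 array: two axes, i.e. rank 2
			m = np.array([[1, 2, 3],
						  [4, 5, 6]])

			print(m.ndim)   # 2 - number of axes (the rank)
			print(m.shape)  # (2, 3) - length along each axis
			print(m.dtype)  # int64 - all elements share the same type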

Examples

numpy array examples in Jupyter Notebook:
See as html
Or play online at google colab:NDArrays.ipynb

References

Functions and Methods Overview
numpy in PyPi

pandas - Overview

pandas - Overview

Overview

pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language
pandas is great for data analysis and modeling
pandas, combined with the IPython toolkit and other libraries, creates an environment for doing data analysis in Python that excels in performance, productivity, and the ability to collaborate.

Pandas Data Structures

The two primary data structures in pandas are Series and DataFrame.

pandas - Series Object

pandas - Series Object

Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data, based on the NumPy ndarray.
Unlike a plain numpy array, a Pandas Series object wraps both a sequence of values and a sequence of indices

			import pandas as pd

			ds = pd.Series([1,2,3,4])
			print(ds)
		

			0    1
			1    2
			2    3
			3    4
			dtype: int64
		

Create Series with Explicit Indexing

The explicit index definition gives the Series object additional capabilities compared to numpy arrays
e.g. the index need not be an integer, but can consist of values of any desired type

			ds = pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])
			print(ds)
		

			a    1
			b    2
			c    3
			d    4
			dtype: int64
		

Create Series from dictionary

By default, a Series index is drawn from the keys of the dict - sorted in older pandas versions (as in the output below); since pandas 0.23 insertion order is preserved.

			ds = pd.Series({
				"d":4,
				"a":1,
				"c":3,
				"b":2,
				"e":5
			})
			print(ds)
		

			a    1
			b    2
			c    3
			d    4
			e    5
			dtype: int64
		

Series Indexing

You can use a single index value, or a list of indexes, or slicing

				##get index object:
				print(ds.index)
				#Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

				## numerical or keyword indexes
				print(ds["a"])
				print(ds[0])

				## indexes as list:
				print(ds[['a', 'c', 'e']])
				#a    1
				#c    3
				#e    5
				#dtype: int64
			

Series slicing

the Series also supports array-style operations such as slicing

			## slicing
			print(ds["a":"d"])
			#a    1
			#b    2
			#c    3
			#d    4
			#dtype: int64
		

Altering index in place

A Series’s index can be altered in place by assignment

			ds = pd.Series([1,2,3,4,5],index=["a","b","c","d","e"])

			ds.index = ["A","B","C","D","E"]
			print(ds)
			#A    1
			#B    2
			#C    3
			#D    4
			#E    5
			#dtype: int64
		

NumPy operations on Series


			ds = pd.Series([1,2,3,4,5],index=["a","b","c","d","e"])

			## filtering by value
			ds[ds>2]
			#c    3
			#d    4
			#e    5
			#dtype: int64

			## multiplication
			ds*2
			#a     2
			#b     4
			#c     6
			#d     8
			#e    10
			#dtype: int64
		

Dictionary like operation on Series


			ds = pd.Series([1,2,3,4,5],index=["a","b","c","d","e"])

			"a" in ds
			#True

			"f" in ds
			#False
		

Missing Data

Missing data can appear when we transform or perform operations on a Series object. These values are marked as NaN (Not A Number)

			ds1 = pd.Series([1,3], index=["a","c"])
			ds2 = pd.Series([2,3], index=["b","c"])

			print(ds1+ds2)
			#a    NaN
			#b    NaN
			#c    6.0
			#dtype: float64
		

All Examples

Examples as HTML: Series.html
Examples as ipynb: Series.ipynb
Examples as ipynb on Google Colab: Series.ipynb on Google Colab

References

pandas.Series @pandas.pydata.org

pandas - DataFrame Object

pandas - DataFrame Object

Pandas DataFrame Object

A DataFrame is an analogue of a two-dimensional array or table with flexible row and column indices.
You can think of a DataFrame as a sequence of aligned (sharing same index) Series objects
i.e. each column in a DataFrame is represented by a Series Object

Create DataFrame from a single Series object.

A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series.

				# create Series Object:
				prices_ds = pd.Series([1.5, 2, 2.5, 3],
										index=["apples", "oranges", "bananas", "strawberries"])

				# create DataFrame Object from prices Series:
				prices_df = pd.DataFrame(prices_ds)
				print(prices_df)

				#                0
				#apples        1.5
				#oranges       2.0
				#bananas       2.5
				#strawberries  3.0
			
We can pass column names instead of the defaults

				# create DataFrame Object from prices Series:
				prices_df = pd.DataFrame(prices_ds,columns=["prices"])
				print(prices_df)
				#              prices
				#apples           1.5
				#oranges          2.0
				#bananas          2.5
				#strawberries     3.0
			

Create DataFrame from a dictionary of Series object.

A DataFrame can be thought of as a dictionary of Series objects, where the dictionary keys represent the column names.

			prices_ds = pd.Series([1.5, 2, 2.5, 3],
									index=["apples", "oranges", "bananas", "strawberries"])

			suppliers_ds = pd.Series(["supplier1", "supplier2", "supplier4", "supplier3"],
										 index=["apples", "oranges", "bananas", "strawberries"])

			fruits_df = pd.DataFrame({
				"prices": prices_ds,
				"suppliers": suppliers_ds
			})
			print(fruits_df)
			#              prices  suppliers
			#apples           1.5  supplier1
			#oranges          2.0  supplier2
			#bananas          2.5  supplier4
			#strawberries     3.0  supplier3
		

All Examples

Examples as HTML: DataFrameOverview.html
Examples as ipynb: DataFrameOverview.ipynb

More on pandas.DataFrame() class

class pandas.DataFrame() @pandas.pydata.org

Create DataFrame from csv files

DataFrame from csv files

note

In the next slides, the term csv is used in its broader meaning - i.e. it also covers tab/semicolon/other-delimiter separated values

the Dataset

In the next examples, the csv files used are taken from the IMDb data files available for download
For test purposes, shortened versions of the files, containing the first 500 records of each, are available:
title.basics_sample_500.tsv
name.basics_sample_500.tsv

read_csv

read_csv is the preferred method for loading csv data into a DataFrame object
It provides a rich set of parameters for fine-tuning the loading, cleaning and manipulation of csv files.
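A minimal sketch of loading one of the sample files above (assuming the file is in the current directory; the IMDb files are tab-separated):

			import pandas as pd

			# the IMDb files use tab as the separator, so sep="\t" is needed
			titles_df = pd.read_csv("title.basics_sample_500.tsv", sep="\t")

			print(titles_df.shape)
			print(titles_df.head())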

Examples

examples as HTML: imdb.html
example as ipynb: imdb.ipynb

More on pandas.read_csv

pandas.read_csv @pandas.pydata.org

Processing various file formats with pandas

Processing various file formats with pandas

CSV & text files

DataFrames Manipulations

DataFrames Manipulations

Examples as HTML: DataFramesManipulations.html
Examples as ipynb: DataFramesManipulations.ipynb

DataFrames Merge (aka SQL Join)

DataFrames Merge

Overview

The merge method in pandas is analogous to the SQL join operation
merge is the entry point for all standard database join operations between DataFrame objects
The related DataFrame.join method uses merge internally for the index-on-index (by default) and column(s)-on-index joins

Syntax


			pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
							 left_index=False, right_index=False, sort=False,
							 suffixes=('_x', '_y'), copy=True, indicator=False,
							 validate=None)
		

the datasets

developers.csv and languages.csv

			did;  dname
			1;    Ivan
			2;    Asen
			3;    Maria
			4;    Stoyan
			5;    Aleks
			6;    Svetlin
		

			did;  language
			2;    "C++"
			3;    "Python"
			3;    "R"
			6;    "Java"
		
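The devs and langs DataFrames used in the next examples could be loaded like this (a sketch; note the semicolon separator and the padding spaces after it):

			import pandas as pd

			# semicolon-separated files; skipinitialspace strips the spaces after each ';'
			devs = pd.read_csv("developers.csv", sep=";", skipinitialspace=True)
			langs = pd.read_csv("languages.csv", sep=";", skipinitialspace=True)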

Inner Join

Returns only the rows in which the left table has matching keys in the right table

			dev_langs_inner = pd.merge(devs,langs,on="did",how='inner')
			print(dev_langs_inner)
		

			did dname language
			0 2 Asen  "C++"
			1 3 Maria "Python"
			2 3 Maria "R"
			3 6 Svetlin "Java"
		

Outer Join

Returns all rows from both tables, matching records on their keys where possible and filling the gaps with NaN.

			dev_langs_outer = pd.merge(devs,langs,on="did",how='outer')
			dev_langs_outer
		

			did dname language
			0 1 Ivan  NaN
			1 2 Asen  "C++"
			2 3 Maria "Python"
			3 3 Maria "R"
			4 4 Stoyan  NaN
			5 5 Aleks NaN
			6 6 Svetlin "Java"
		

Left outer join

Returns all rows from the left table, and any rows with matching keys from the right table.

			dev_langs_left_outer = pd.merge(devs,langs,on="did",how='left')
			dev_langs_left_outer
		

			did dname language
			0 1 Ivan  NaN
			1 2 Asen  "C++"
			2 3 Maria "Python"
			3 3 Maria "R"
			4 4 Stoyan  NaN
			5 5 Aleks NaN
			6 6 Svetlin "Java"
		

Right outer join

Returns all rows from the right table, and any rows with matching keys from the left table.

			dev_langs_right_outer = pd.merge(devs,langs,on="did",how='right')
			dev_langs_right_outer
		

			did dname language
			0 2 Asen  "C++"
			1 3 Maria "Python"
			2 3 Maria "R"
			3 6 Svetlin "Java"
		

All Examples

Examples as HTML: DataFramesMerge.html
Examples as ipynb: DataFramesMerge.ipynb

resources

merge @pandas.pydata.org
Merge, join, and concatenate @pandas.pydata.org
Comparison with SQL @pandas.pydata.org

DataFrames Join

DataFrames Join

Overview

Join columns with other DataFrame either on index or on a key column
By default, join() will join the DataFrames on their indices
Efficiently join multiple DataFrame objects by index at once by passing a list (see the sketch below)
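A minimal join sketch on two made-up DataFrames sharing the same index:

			import pandas as pd

			prices = pd.DataFrame({"price": [1.5, 2.0]}, index=["apples", "oranges"])
			stock = pd.DataFrame({"qty": [100, 80]}, index=["apples", "oranges"])

			# by default join() aligns the two DataFrames on their indices
			print(prices.join(stock))
			#          price  qty
			# apples     1.5  100
			# oranges    2.0   80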

All Examples

Examples as HTML: DataFramesJoin.html
Examples as ipynb: DataFramesJoin.ipynb

references

Cookbook @pandas.pydata.org
Merge, join, and concatenate @pandas.pydata.org
Comparison with SQL JOIN @pandas.pydata.org

YouTube

These slides are based on a customised version of Hakimel's reveal.js framework