Data Analysis Training

Data Analysis Course:
Data Analysis using Python is meant to make data do the talking. Using libraries such as NumPy, pandas, and Matplotlib, participants learn to draw conclusions from data before subjecting it to machine learning. Pandas provides extensive utilities for data analysis: merging, grouping, aggregation, and much more.
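
As a rough illustration of the pandas operations named above, the sketch below merges two small tables and then groups and aggregates the result. The data, column names, and variable names are invented for this example and are not taken from the course material.

```python
import pandas as pd

# Hypothetical toy data, made up purely for illustration.
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "amount": [250.0, 100.0, 50.0, 75.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Pune", "Delhi", "Pune"],
})

# Merging: attach each customer's city to their orders.
merged = orders.merge(customers, on="customer_id", how="left")

# Grouping and aggregation: total and mean order amount per city.
summary = merged.groupby("city")["amount"].agg(["sum", "mean"])
print(summary)
```

A left join is used here so that every order is kept even if a customer record were missing; other join types may suit other datasets better.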

Data Analysis Course Curriculum

1. Introducing Data Analysis and Libraries

Data analysis and processing

An overview of the libraries in data analysis

Python libraries in data analysis

NumPy

Pandas

Matplotlib

PyMongo

The scikit-learn library

2. NumPy Arrays and Vectorized Computation

NumPy arrays

Data types

Array creation

Indexing and slicing

Fancy indexing

Numerical operations on arrays

Array functions

Data processing using arrays

Loading and saving data

Saving an array

Loading an array

Linear algebra with NumPy

NumPy random numbers

3. Data Analysis with Pandas

An overview of the Pandas package

The Pandas data structure

Series

The DataFrame

The essential basic functionality

Reindexing and altering labels

Head and tail

Binary operations

Functional statistics

Function application

Sorting

Indexing and selecting data

Computational tools

Working with missing data

Advanced uses of Pandas for data analysis

Hierarchical indexing

The Panel data

4. Data Visualization

The matplotlib API primer

Line properties

Figures and subplots

Exploring plot types

Scatter plots

Bar plots

Contour plots

Histogram plots

Legends and annotations

Plotting functions with Pandas

Additional Python data visualization tools

Bokeh

MayaVi

5. Time Series

Time series primer

Working with date and time objects

Resampling time series

Downsampling time series data

Upsampling time series data

Time zone handling

Timedeltas

Time series plotting

6. Interacting with Databases

Interacting with data in text format

Reading data from text format

Writing data to text format

Interacting with data in binary format

HDF5

Interacting with data in MongoDB

Interacting with data in Redis

The simple value

List

Set

Ordered set

7. Data Analysis Application Examples

Data munging

Cleaning data

Filtering

Merging data

Reshaping data

Data aggregation

Grouping data

8. Machine Learning Models with scikit-learn

An overview of machine learning models

The scikit-learn modules for different models

Data representation in scikit-learn

Supervised learning – classification and regression

Unsupervised learning – clustering and dimensionality reduction

Measuring prediction performance

9. Getting Started with Predictive Modelling

Introducing predictive modelling

Scope of predictive modelling

Ensemble of statistical algorithms

Statistical tools

Historical data

Mathematical function

Business context

Knowledge matrix for predictive modelling

Task matrix for predictive modelling

Applications and examples of predictive modelling

LinkedIn's "People also viewed" feature

What does it do?

How is it done?

Correct targeting of online ads

How is it done?

Santa Cruz predictive policing

How is it done?

Determining the activity of a smartphone user using accelerometer data

How is it done?

Sport and fantasy leagues

How was it done?

Python and its packages – download and installation

Anaconda

Standalone Python

Installing a Python package

Installing pip

Installing Python packages with pip

Python and its packages for predictive modelling

IDEs for Python

10. Data Cleaning

Reading the data – variations and examples

Data frames

Delimiters

Various methods of importing data in Python

Case 1 – reading a dataset using the read_csv method

The read_csv method

Use cases of the read_csv method

Passing the directory address and filename as variables

Reading a .txt dataset with a comma delimiter

Specifying the column names of a dataset from a list

Case 2 – reading a dataset using the open method of Python

Reading a dataset line by line

Changing the delimiter of a dataset

Case 3 – reading data from a URL

Case 4 – miscellaneous cases

Reading from an .xls or .xlsx file

Writing to a CSV or Excel file

Basics – summary, dimensions, and structure

Handling missing values

Checking for missing values

What constitutes missing data?

How missing values are generated and propagated

Treating missing values

Deletion

Imputation

Creating dummy variables

Visualizing a dataset by basic plotting

Scatter plots

Histograms

Boxplots

11. Data Wrangling

Subsetting a dataset

Selecting columns

Selecting rows

Selecting a combination of rows and columns

Creating new columns

Generating random numbers and their usage

Various methods for generating random numbers

Seeding a random number

Generating random numbers following probability distributions

Probability density function

Cumulative distribution function

Uniform distribution

Normal distribution

Using the Monte-Carlo simulation to find the value of pi

Geometry and mathematics behind the calculation of pi

Generating a dummy data frame

Grouping the data – aggregation, filtering, and transformation

Aggregation

Filtering

Transformation

Miscellaneous operations

Random sampling – splitting a dataset in training and testing datasets

Method 1 – using the Customer Churn Model

Method 2 – using sklearn

Method 3 – using the shuffle function

Concatenating and appending data

Merging/joining datasets

Inner Join

Left Join

Right Join

An example of the Inner Join

An example of the Left Join

An example of the Right Join

Summary of joins in terms of their length

12. Statistical Concepts for Predictive Modelling

Random sampling and the central limit theorem

Hypothesis testing

Null versus alternate hypothesis

Z-statistic and t-statistic

Confidence intervals, significance levels, and p-values

Different kinds of hypothesis test

A step-by-step guide to do a hypothesis test

An example of a hypothesis test

Chi-square tests

Correlation

13. Linear Regression with Python

Understanding the maths behind linear regression

Linear regression using simulated data

Fitting a linear regression model and checking its efficacy

Finding the optimum value of variable coefficients

Making sense of result parameters

p-values

F-statistics

Residual Standard Error

Implementing linear regression with Python

Linear regression using the statsmodels library

Multiple linear regression

Multi-collinearity

Variance Inflation Factor

Model validation

Training and testing data split of models

Linear regression with scikit-learn

Feature selection with scikit-learn

Handling other issues in linear regression

Handling categorical variables

Transforming a variable to fit non-linear relations

Handling outliers

Other considerations and assumptions for linear regression

14. Logistic Regression with Python

Linear regression versus logistic regression

Understanding the math behind logistic regression

Contingency tables

Conditional probability

Odds ratio

Moving on to logistic regression from linear regression

Estimation using the Maximum Likelihood Method

Likelihood function:

Log likelihood function:

Building the logistic regression model from scratch

Making sense of logistic regression parameters

Wald test

Likelihood Ratio Test statistic

Chi-square test

Implementing logistic regression with Python

Processing the data

Data exploration

Data visualization

Creating dummy variables for categorical variables

Feature selection

Implementing the model

Model validation and evaluation

Cross validation

Model validation

The ROC curve

Confusion matrix

15. Clustering with Python

Introduction to clustering – what, why, and how?

What is clustering?

How is clustering used?

Why do we do clustering?

Mathematics behind clustering

Distances between two observations

Euclidean distance

Manhattan distance

Minkowski distance

The distance matrix

Normalizing the distances

Linkage methods

Single linkage

Complete linkage

Average linkage

Centroid linkage

Ward's method

Hierarchical clustering

K-means clustering

Implementing clustering using Python

Importing and exploring the dataset

Normalizing the values in the dataset

Hierarchical clustering using scikit-learn

K-Means clustering using scikit-learn

Interpreting the cluster

Fine-tuning the clustering

The elbow method

Silhouette Coefficient

16. Trees and Random Forests with Python

Introducing decision trees

A decision tree

Understanding the mathematics behind decision trees

Homogeneity

Entropy

Information gain

ID3 algorithm to create a decision tree

Gini index

Reduction in Variance

Pruning a tree

Handling a continuous numerical variable

Handling a missing value of an attribute

Implementing a decision tree with scikit-learn

Visualizing the tree

Cross-validating and pruning the decision tree

Understanding and implementing regression trees

Regression tree algorithm

Implementing a regression tree using Python

Understanding and implementing random forests

The random forest algorithm

Implementing a random forest using Python

Why do random forests work?

Important parameters for random forests

17. Best Practices for Predictive Modelling

Best practices for coding

Commenting the code

Defining functions for substantial individual tasks

Example 1

Example 2

Example 3

Avoid hard-coding of variables as much as possible

Version control

Using standard libraries, methods, and formulas

Best practices for data handling

Best practices for algorithms

Best practices for statistics

Best practices for business contexts

18. A Conceptual Framework for Data Visualization

Data, information, knowledge, and insight

Data

Information

Knowledge

Data analysis and insight

The transformation of data

Transforming data into information

Data collection

Data preprocessing

Data processing

Organizing data

Getting datasets

Transforming information into knowledge

Transforming knowledge into insight

Data visualization history

Visualization before computers

Minard's Russian campaign (1812)

The Cholera epidemics in London (1831-1855)

Statistical graphics (1850-1915)

Later developments in data visualization

How does visualization help decision-making?

Where does visualization fit in?

Data visualization today

What is a good visualization?

Visualization plots

Bar graphs and pie charts

Bar graphs

Pie charts

Box plots

Scatter plots and bubble charts

Scatter plots

Bubble charts

KDE plots

19. Data Analysis and Visualization

Why does visualization require planning?

The Ebola example

A sports example

Visually representing the results

Creating interesting stories with data

Why are stories so important?

Reader-driven narratives

Gapminder

The State of the Union address

Mortality rate in the USA

A few other example narratives

Author-driven narratives

Perception and presentation methods

The Gestalt principles of perception

Some best practices for visualization

Comparison and ranking

Correlation

Distribution

Location-specific or geodata

Part-to-whole relationships

Trends over time

Visualization tools in Python

Development tools

Canopy from Enthought

Anaconda from Continuum Analytics

Interactive visualization

Event listeners

Layouts

Circular layout

Radial layout

Balloon layout

20. Numerical Computing and Interactive Plotting

NumPy, SciPy, and MKL functions

NumPy

NumPy universal functions

Shape and reshape manipulation

An example of interpolation

Vectorizing functions

SciPy

An example of linear equations

The vectorized numerical derivative

MKL functions

The performance of Python

Scalar selection

Slicing

Slice using flat

Array indexing

Numerical indexing

Logical indexing

Other data structures

Stacks

Tuples

Sets

Queues

Dictionaries

Dictionaries for matrix representation

Sparse matrices

Visualizing sparseness

Dictionaries for memoization

Tries

Visualization using matplotlib

Word clouds

Installing word clouds

Input for word clouds

Web feeds

The Twitter text

Plotting the stock price chart

Obtaining data

The visualization example in sports

21. Financial and Statistical Models

The deterministic model

Gross returns

The stochastic model

Monte Carlo simulation

What exactly is Monte Carlo simulation?

An inventory problem in Monte Carlo simulation

Monte Carlo simulation in basketball

The volatility plot

Implied volatilities

The portfolio valuation

The simulation model

Geometric Brownian simulation

The diffusion-based simulation

The threshold model

Schelling's Segregation Model

An overview of statistical and machine learning

K-nearest neighbors

Generalized linear models

Bayesian linear regression

Creating animated and interactive plots

22. Statistical and Machine Learning

Classification methods

Understanding linear regression

Linear regression

Decision tree

An example

The Bayes theorem

The Naïve Bayes classifier

The Naïve Bayes classifier using TextBlob

Installing TextBlob

Downloading corpora

The Naïve Bayes classifier using TextBlob

Viewing positive sentiments using word clouds

k-nearest neighbors

Logistic regression

Support vector machines

Principal component analysis

Installing scikit-learn

k-means clustering

23. Bioinformatics, Genetics, and Network Models

Directed graphs and multigraphs

Storing graph data

Displaying graphs

igraph

NetworkX

Graph-tool

PageRank

The clustering coefficient of graphs

Analysis of social networks

The planar graph test

The directed acyclic graph test

Maximum flow and minimum cut

A genetic programming example

Stochastic block models

24. Advanced Visualization

Computer simulation

Python's random package

SciPy's random functions

Simulation examples

Signal processing

Animation

Visualization methods using HTML5

How is Julia different from Python?

D3.js for visualization

Dashboard

Frequently Asked Questions

What are the modes of training for the "Data Analysis" course?

This "Data Analysis" course is an instructor-led training (ILT). The trainer travels to your office and delivers the training on your premises. If you need training space, we can provide a fully equipped lab with all the required facilities. Online instructor-led training is also available if required. Online training is live: the instructor's screen is visible and their voice is audible. Participants' screens are also visible, and participants can ask questions during the live session.

Will I be provided with any study material during the "Data Analysis" training?

Participants will be provided with "Data Analysis"-specific study material. Participants will have lifetime access to all the code and resources needed for this "Data Analysis" course. Our public GitHub repository and the study material will also be shared with the participants.

What is the pedagogy of zekeLabs?

All courses from zekeLabs are hands-on. The code and documents used in class are provided to the participants. A cloud lab and virtual machines are provided to every participant during the "Data Analysis" training.

What is the duration of this course?

The duration of the "Data Analysis" training varies with several factors, including the team's prior knowledge of the subject, the team's learning objectives, and the amount of customization the course needs. Contact us to know more about the "Data Analysis" course duration.

What would be the venue for the "Data Analysis" training?

The "Data Analysis" training is organised at the client's premises. We have delivered and continue to deliver "Data Analysis" training in India, USA, Singapore, Hong Kong, and Indonesia. We also have state-of-the-art training facilities available based on client requirements.

Who is the trainer for "Data Analysis" training?

Our subject matter experts (SMEs) have more than ten years of industry experience. This ensures that the learning program is a holistic, 360-degree learning experience. The course program has been designed in close collaboration with experts working in esteemed organizations such as Google, Microsoft, Amazon, and others.

Can we customize this course based on our requirements?

Yes, absolutely. For every training, we conduct a technical call between our subject matter expert (SME) and the technical lead of the team that will undergo training. The course is tailored to the current expertise of the participants, the objectives of the team undergoing the training program, and the short-term and long-term objectives of the organisation.

How can I reach out to you if I have any other queries regarding the "Data Analysis" course?

Email us at [email protected] or call us at +91 8041690175, and we will get back to you as soon as possible with answers to your queries on the "Data Analysis" course.