Notes for Everyone

Monday, 3 October 2022

6 a). Apply and explore various plotting functions on UCI data sets. Density and contour plots


Aim

To apply and explore various plotting functions like Density and contour plots on datasets.

Procedure

There are three Matplotlib functions that can be helpful for this task: plt.contour for contour plots, plt.contourf for filled contour plots, and plt.imshow for showing images.

A contour plot can be created with the plt.contour function. It takes three arguments: a grid of x values, a grid of y values, and a grid of z values.

The x and y values represent positions on the plot, and the z values will be represented by the contour levels.

Perhaps the most straightforward way to prepare such data is to use the np.meshgrid function, which builds two-dimensional grids from one-dimensional arrays.
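As a small illustration of what np.meshgrid returns, the sketch below builds grids from two short one-dimensional arrays. The array values are made up for this example and are not part of the experiment.

```python
import numpy as np

# Two short one-dimensional arrays (illustrative values only)
x = np.array([0.0, 1.0, 2.0])   # 3 positions along x
y = np.array([10.0, 20.0])      # 2 positions along y

# meshgrid repeats x along the rows and y along the columns
X, Y = np.meshgrid(x, y)

print(X.shape, Y.shape)  # both grids have shape (len(y), len(x))
print(X)
print(Y)

# Any elementwise function of X and Y gives a matching grid of z values
Z = X + Y
print(Z)
```

Each entry of Z corresponds to one (x, y) position, which is the layout plt.contour expects.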

First we draw a standard line-only contour plot. The lines can then be color-coded by specifying a colormap with the cmap argument.

Additionally, we'll add a plt.colorbar() command, which automatically creates an additional axis with labeled color information for the plot.

Program

%matplotlib inline

import matplotlib.pyplot as plt

plt.style.use('seaborn-white')  # this style is named 'seaborn-v0_8-white' in Matplotlib 3.6 and later

import numpy as np

def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)

y = np.linspace(0, 5, 40)

X, Y = np.meshgrid(x, y)

Z = f(X, Y)

plt.contour(X, Y, Z, colors='black');

Output

plt.contour(X, Y, Z, 20, cmap='RdGy');

Output

plt.contourf(X, Y, Z, 20, cmap='RdGy')

plt.colorbar();

Output
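The procedure also mentions plt.imshow, which the program above does not use. The following is a minimal sketch of how the same grid could be shown as an image, assuming the same function f and the same value ranges as in the program. The extent and origin settings are choices made for this sketch so that the axes line up with the contour plots.

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

# imshow does not take x and y grids: extent gives the axis limits,
# and origin='lower' puts the first row of Z at the bottom of the image.
fig, ax = plt.subplots()
im = ax.imshow(Z, extent=[0, 5, 0, 5], origin='lower', cmap='RdGy')
fig.colorbar(im, ax=ax)
```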

Result

Density and contour plots were successfully created using plt.contour and plt.contourf.

6 b). Apply and explore various plotting functions like Correlation and scatter plots on UCI data sets


Aim

To apply and explore various plotting functions like Correlation and scatter plots on datasets.

Procedure

Program

import pandas as pd

con = pd.read_csv('D:/diabetes.csv')

con

list(con.columns)

import seaborn as sns

sns.scatterplot(x="Pregnancies", y="Age", data=con);

Output

sns.lmplot(x="Pregnancies", y="Age", data=con);

Output

sns.lmplot(x="Pregnancies", y="Age", hue="Outcome", data=con);

Output

from scipy import stats

stats.pearsonr(con['Age'], con['Outcome'])

Output

(0.23835598302719774, 2.209975460664566e-11)
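The first value in the pair returned by stats.pearsonr is the Pearson correlation coefficient and the second is the p-value. As a rough sketch of what the coefficient measures, the example below computes it by hand on a small made-up sample (not taken from diabetes.csv) and compares the result with np.corrcoef.

```python
import numpy as np

# Small made-up sample (not taken from diabetes.csv)
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r = sum of products of deviations,
# divided by the square root of the product of the summed squared deviations
a_dev = a - a.mean()
b_dev = b - b.mean()
r_manual = (a_dev * b_dev).sum() / np.sqrt((a_dev ** 2).sum() * (b_dev ** 2).sum())

# np.corrcoef returns the full 2x2 correlation matrix
r_numpy = np.corrcoef(a, b)[0, 1]

print(r_manual, r_numpy)
```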

cormat = con.corr()

round(cormat,2)

sns.heatmap(cormat);

Output

Result

Various plotting functions like Correlation and scatter plots on datasets are successfully executed.

6 c. Apply and explore histograms and three dimensional plotting functions on UCI data sets

Aim

To apply and explore histograms and three dimensional plotting functions on UCI data sets

Procedure

✓ Download the CSV file and load it for exploration.

✓ A histogram represents data grouped into ranges.

✓ To create a histogram, first divide the whole range of values into a series of intervals (bins), then count the values that fall into each interval.

✓ Bins are consecutive, non-overlapping intervals of a variable. The matplotlib.pyplot.hist() function is used to compute and draw the histogram of x.

✓ The import of matplotlib.pyplot is the standard import statement for plotting with Matplotlib, which you would see for 2D plotting as well.

✓ The import of the Axes3D class enables 3D projections in Matplotlib versions before 3.2; newer versions register the 3D projection without it. It is not used anywhere else in the program.
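The binning described in the procedure can be sketched with np.histogram, which plt.hist uses for its counting. The readings and bin edges below are made up for this example and are not taken from diabetes.csv.

```python
import numpy as np

# Made-up readings (not taken from diabetes.csv)
values = np.array([85, 90, 99, 100, 110, 120, 125, 140, 150, 199])

# Three consecutive, non-overlapping bins: [80, 120), [120, 160), [160, 200]
edges = [80, 120, 160, 200]

# counts[i] is the number of values that fall into bin i
counts, used_edges = np.histogram(values, bins=edges)
print(counts)  # [5 4 1]
```

Every bin includes its left edge and excludes its right edge, except the last bin, which includes both.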

Program

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt # To visualize

from mpl_toolkits.mplot3d import Axes3D

data = pd.read_csv('d:\\diabetes.csv')

data

data['Glucose'].plot(kind='hist')

Output

fig = plt.figure(figsize=(4,4))

ax = fig.add_subplot(111, projection='3d')

Output

fig = plt.figure()

ax = fig.add_subplot(111, projection='3d')

x = data['Age'].values

y = data['Glucose'].values

z = data['Outcome'].values

ax.set_xlabel("Age (Year)")

ax.set_ylabel("Glucose (Reading)")

ax.set_zlabel("Outcome (0 or 1)")

ax.scatter(x, y, z, c='r', marker='o')

plt.show()

Output

Result

The histograms and three dimensional plotting functions on UCI data sets are successfully executed.

Saturday, 1 October 2022

5 c. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following: Multiple Regression

Aim

To perform multiple regression. Multiple regression is like linear regression, but with more than one independent variable, meaning that we try to predict a value based on two or more variables.

Procedure

The Pandas module allows us to read csv files and return a DataFrame object.

Then make a list of the independent values and call this variable X.

Put the dependent values in a variable called y.

From the sklearn module we will use the LinearRegression() method to create a linear regression object.

This object has a method called fit() that takes the independent and dependent values as parameters and fills the regression object with data that describes the relationship.

We now have a regression object that is ready to predict Age values based on a person's Glucose and BloodPressure.
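To see what fit() stores in the regression object, the sketch below uses a small made-up data set that follows y = 2*x1 + 3*x2 + 5 exactly, so the fitted coefficients can be checked by eye. The data is an assumption for illustration only and has nothing to do with diabetes.csv.

```python
import numpy as np
from sklearn import linear_model

# Made-up data that follows y = 2*x1 + 3*x2 + 5 exactly
X = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [5, 5]], dtype=float)
y = 2 * X[:, 0] + 3 * X[:, 1] + 5

regr = linear_model.LinearRegression()
regr.fit(X, y)

print(regr.coef_)              # one coefficient per independent variable
print(regr.intercept_)         # the constant term
print(regr.predict([[6, 4]]))  # prediction for x1 = 6, x2 = 4
```

Because the data has no noise, the fit should recover coefficients close to 2 and 3 and an intercept close to 5.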

Program

import pandas as pd

from sklearn import linear_model

df = pd.read_csv(r'd:\diabetes.csv')

print (df)

X = df[['Glucose', 'BloodPressure']]

y = df['Age']

regr = linear_model.LinearRegression()

regr.fit(X, y)

predictedage = regr.predict([[150, 13]])

print(predictedage)

Output

[28.77214401]

5 b. Linear Regression and Logistic Regression with the Diabetes Dataset Using Python Machine Learning

Aim

To implement linear regression and logistic regression on the diabetes dataset that ships with sklearn.

Procedure

Load sklearn Libraries.

Load Data

Load the diabetes dataset

Split Dataset

Creating Model Linear Regression and Logistic Regression

Make predictions using the testing set

Finding Coefficient And Mean Square Error
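The split and logistic regression steps above can be sketched as follows. This is not the program used in this experiment: it assumes the continuous target is first turned into two classes at its median, and it uses train_test_split with an arbitrary random_state, both chosen only for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)

# Assumption for this sketch: make the target binary,
# 1 if the value is above the median and 0 otherwise.
y_class = (y > np.median(y)).astype(int)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y_class, test_size=0.2, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Fraction of test rows classified correctly
accuracy = clf.score(X_test, y_test)
print(accuracy)
```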

Program

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

from sklearn import datasets, linear_model

from sklearn.metrics import mean_squared_error, r2_score

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

#To calculate accuracy measures and confusion matrix

from sklearn import metrics

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

diabetes_X = diabetes_X[:, np.newaxis, 2]

# Split the data into training/testing sets

diabetes_X_train = diabetes_X[:-20]

diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets

diabetes_y_train = diabetes_y[:-20]

diabetes_y_test = diabetes_y[-20:]

# Create linear regression object

regr = linear_model.LinearRegression()

# Train the model using the training sets

regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set

diabetes_y_pred = regr.predict(diabetes_X_test)

# Create logistic regression object (the target is a continuous disease-progression score, so this fit is only illustrative)

Logistic_model = LogisticRegression()

Logistic_model.fit(diabetes_X_train, diabetes_y_train)

# The coefficients

print('Coefficients: \n', regr.coef_)

# The mean squared error

print('Mean squared error: %.2f'
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))

# The coefficient of determination: 1 is perfect prediction

print('Coefficient of determination: %.2f'
      % r2_score(diabetes_y_test, diabetes_y_pred))

y_predict = Logistic_model.predict(diabetes_X_train)

#print("Y predict/hat ", y_predict)

y_predict

Output

Coefficients:

[938.23786125]

Mean squared error: 2548.07

Coefficient of determination: 0.47
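For reference, the two metrics printed above can be computed by hand. The sketch below uses made-up true and predicted values (not the diabetes results) to show what mean_squared_error and r2_score calculate.

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Mean squared error: average of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, r2)
```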

5 d. Compare the results of the above analysis for the two data sets.


Aim

In this program, we can compare the results of the two different data sets.

Procedure

Step 1: Prepare the datasets to be compared

The first data set is stored in a CSV file called car1.csv, and the second in a CSV file called car2.csv.

Step 2: Create the two DataFrames

Read each CSV file into its own Pandas DataFrame.

Step 3: Compare the values between the two Pandas DataFrames

In this step, you'll need to import the NumPy package. np.where compares the amount column of the first DataFrame with the amount1 column of the second, and records whether the prices match and by how much they differ.

Program

import pandas as pd

import numpy as np

data_1 = pd.read_csv(r'd:\car1.csv')

df1 = pd.DataFrame(data_1)

data_2 = pd.read_csv(r'd:\car2.csv')

df2 = pd.DataFrame(data_2)

df1['amount1'] = df2['amount1']

df1['prices_match'] = np.where(df1['amount'] == df2['amount1'], 'True', 'False')

df1['price_diff'] = np.where(df1['amount'] == df2['amount1'], 0, df1['amount'] - df2['amount1'])

print(df1)

Output

    Model     City  Year   amount  amount1  prices_match  price_diff
0  Maruti  Chennai  2022   600000   600000          True           0
1  Hyndai  Chennai  2022   700000   700000          True           0
2    Ford  Chennai  2022   800000   850000         False      -50000
3     Kia  Chennai  2022   900000   900000          True           0
4     XL6  Chennai  2022  1000000  1000000          True           0
5    Tata  Chennai  2022  1100000  1150000         False      -50000
6    Audi  Chennai  2022  1200000  1200000          True           0
7  Ertiga  Chennai  2022  1300000  1300000          True           0
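The same comparison can also be tried without the CSV files. The sketch below is a variation, not the program above: it builds two small DataFrames in memory as stand-ins for car1.csv and car2.csv (three rows copied from the output table) and stores the match as a boolean column instead of the strings 'True' and 'False'.

```python
import pandas as pd

# Stand-ins for car1.csv and car2.csv (three rows copied from the output table)
df1 = pd.DataFrame({'Model': ['Maruti', 'Ford', 'Kia'],
                    'amount': [600000, 800000, 900000]})
df2 = pd.DataFrame({'Model': ['Maruti', 'Ford', 'Kia'],
                    'amount1': [600000, 850000, 900000]})

# Bring the second price next to the first, row by row
df1['amount1'] = df2['amount1']

# Elementwise comparison gives a boolean column
df1['prices_match'] = df1['amount'] == df1['amount1']

# Equal prices give a difference of 0, so no np.where is needed here
df1['price_diff'] = df1['amount'] - df1['amount1']

print(df1)
```

A boolean column can be used directly as a filter, for example df1[~df1['prices_match']] to list only the rows whose prices differ.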

Datasets used:

Dataset 1: car1.csv

Dataset 2: car2.csv

Wednesday, 21 September 2022

CS3361 DATA SCIENCE LABORATORY L T P C 0 0 4 lab manual

5 a. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following: Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis.


CS3361 DATA SCIENCE LABORATORY L T P C 0 0 4 2

COURSE OBJECTIVES:

• To understand the Python libraries for data science.

• To understand the basic Statistical and Probability measures for data science.

• To learn descriptive analytics on the benchmark data sets.

• To apply correlation and regression analytics on standard data sets.

• To present and interpret data using visualization packages in Python.

LIST OF EXPERIMENTS:

1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.

2. Working with Numpy arrays

3. Working with Pandas data frames

4. Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set.

5. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following:

a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis.

b. Bivariate analysis: Linear and logistic regression modeling

c. Multiple Regression analysis

d. Also compare the results of the above analysis for the two data sets.

6. Apply and explore various plotting functions on UCI data sets.

a. Normal curves

b. Density and contour plots

c. Correlation and scatter plots

d. Histograms

e. Three dimensional plotting

7. Visualizing Geographic Data with Basemap

List of Equipments: (30 Students per Batch)

Tools: Python, Numpy, Scipy, Matplotlib, Pandas, statsmodels, seaborn, plotly, bokeh

Note: Example data sets like: UCI, Iris, Pima Indians Diabetes etc.

TOTAL: 60 PERIODS

COURSE OUTCOMES: At the end of this course, the students will be able to:

CO1: Make use of the Python libraries for data science.

CO2: Make use of the basic Statistical and Probability measures for data science.

CO3: Perform descriptive analytics on the benchmark data sets.

CO4: Perform correlation and regression analytics on standard data sets.

CO5: Present and interpret data using visualization packages in Python.