1. Conditional Probability Solution
Steps:
- Simulate Data:
- Generate random data using `numpy.random.choice` for age groups and purchase probability.
- Count total purchases and group them by age.
- Compute Conditional Probability:
- Calculate $ P(\text{purchase} \mid \text{age group}) $ using the formula $ P(E|F) = \frac{\text{purchases in age group}}{\text{total in age group}} $.
- Compute Overall Probability:
- Calculate $ P(\text{purchase}) = \frac{\text{total purchases}}{\text{total trials}} $.
- Analyze Independence:
- Compare $ P(E|F) $ and $ P(E) $: if they are approximately equal for every age group, purchasing is independent of age; a clear gap indicates dependence.
from numpy import random
random.seed(0)
totals = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}  # Shoppers per age decade
purchases = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}  # Purchases per age decade
totalPurchases = 0
for _ in range(100000):
    ageDecade = random.choice([20, 30, 40, 50, 60, 70])
    purchaseProbability = 0.4  # Fixed probability, the same for every age group
    totals[ageDecade] += 1
    if random.random() < purchaseProbability:
        totalPurchases += 1
        purchases[ageDecade] += 1
PEF = float(purchases[30]) / float(totals[30])  # P(purchase | 30s)
print(f"P(purchase | 30s): {PEF}")
PE = float(totalPurchases) / 100000.0  # P(purchase) over all trials
print(f"P(purchase): {PE}")
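The comparison described in the last step can be sketched for all age groups at once. This is a minimal vectorized variant of the loop above, assuming the same fixed purchase probability of 0.4; the names `rng`, `ages`, `bought`, and `p_given_age` are illustrative and not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000
age_groups = np.array([20, 30, 40, 50, 60, 70])

# Each shopper gets a random age decade and buys with the same fixed probability.
ages = rng.choice(age_groups, size=n_trials)
bought = rng.random(n_trials) < 0.4

# P(purchase): overall fraction of trials that ended in a purchase.
p_purchase = bought.mean()

# P(purchase | age group): purchase fraction within each age group.
p_given_age = {int(a): bought[ages == a].mean() for a in age_groups}

for age, p in p_given_age.items():
    print(f"P(purchase | {age}s) = {p:.3f}  vs  P(purchase) = {p_purchase:.3f}")
```

Because the purchase probability does not depend on age here, every conditional probability should land close to the overall one, which is what independence looks like in simulated data.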
2. Mean and Median
Steps:
- Generate Random Data:
- Create synthetic e-commerce transaction data using `numpy.random.normal`.
- Visualize Data:
- Plot a histogram of the data using `matplotlib`.
- Compute Mean and Median:
- Use `numpy.mean` and `numpy.median` to calculate central tendencies.
- Experiment with Outliers:
- Add outliers to the data and observe their impact on mean and median values.
import numpy as np
import matplotlib.pyplot as plt
incomes = np.random.normal(27000, 15000, 10000)
plt.hist(incomes, 50)
plt.show()
print("Mean:", np.mean(incomes))
print("Median:", np.median(incomes))
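The outlier experiment from the steps above could look like the following sketch. The single extreme value of one billion is an arbitrary, hypothetical choice, and `rng` and `incomes_with_outlier` are names introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
incomes = rng.normal(27000, 15000, 10000)

mean_before, median_before = np.mean(incomes), np.median(incomes)

# Append a single extreme value (a hypothetical billionaire) to the sample.
incomes_with_outlier = np.append(incomes, 1_000_000_000)

mean_after = np.mean(incomes_with_outlier)
median_after = np.median(incomes_with_outlier)

print(f"Mean:   {mean_before:,.0f} -> {mean_after:,.0f}")
print(f"Median: {median_before:,.0f} -> {median_after:,.0f}")
```

One added value out of roughly ten thousand shifts the mean by about 100,000, while the median barely moves. That is the reason the median is usually preferred for skewed data such as incomes.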
3. Linear Regression
Steps:
- Generate Synthetic Data:
- Simulate page speed vs purchase amount data using `numpy.random.normal`.
- Fit Linear Regression:
- Use `scipy.stats.linregress` to compute slope, intercept, R-squared, and other metrics.
- Visualize Fit:
- Plot observed data points and fitted regression line.
- Experiment with Variability:
- Increase random variation in data and observe changes in R-squared value.
from scipy import stats
pageSpeeds = np.random.normal(3.0, 1.0, 1000)
purchaseAmount = 100 - (pageSpeeds * 3) + np.random.normal(0, 0.1, 1000)
slope, intercept, r_value, p_value, std_err = stats.linregress(pageSpeeds, purchaseAmount)
print(f"R-squared: {r_value**2}")
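The variability experiment can be sketched by refitting the line at several noise levels. To keep the example dependent on NumPy only, it uses `np.polyfit` in place of `scipy.stats.linregress` and computes R-squared from the residuals; the helper `r_squared` and the noise levels tried are assumptions made for illustration.

```python
import numpy as np

def r_squared(noise_sd, n=1000, seed=0):
    """Fit a line to synthetic data and return R-squared for a given noise level."""
    rng = np.random.default_rng(seed)
    page_speeds = rng.normal(3.0, 1.0, n)
    purchase_amount = 100 - 3 * page_speeds + rng.normal(0, noise_sd, n)

    # Least-squares line: purchase_amount ~ slope * page_speeds + intercept
    slope, intercept = np.polyfit(page_speeds, purchase_amount, 1)
    predicted = slope * page_speeds + intercept

    # R-squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum((purchase_amount - predicted) ** 2)
    ss_tot = np.sum((purchase_amount - purchase_amount.mean()) ** 2)
    return 1 - ss_res / ss_tot

for sd in (0.1, 1.0, 5.0):
    print(f"noise sd = {sd}: R-squared = {r_squared(sd):.3f}")
```

As the noise standard deviation grows relative to the slope, less of the variance in purchase amount is explained by page speed, so R-squared falls.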
4. Decision Trees
Steps:
- Load Data:
- Import CSV file into a pandas DataFrame.
- Preprocess Data:
- Convert categorical variables into numerical values (e.g., "Y" → 1, "N" → 0).
- Train Decision Tree Model:
- Use `sklearn.tree.DecisionTreeClassifier` to train a decision tree for predicting hiring decisions.
- Visualize Tree:
- Export decision tree visualization using `tree.export_graphviz`.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz
data = pd.read_csv('PastHires.csv')
d = {'Y':1, 'N':0}  # Map Y/N answers to 1/0
data['Hired'] = data['Hired'].map(d)
data['Employed?'] = data['Employed?'].map(d)
features = data[['Years Experience', 'Employed?']]
y = data['Hired']
clf = DecisionTreeClassifier()
clf = clf.fit(features, y)
export_graphviz(clf, out_file='tree.dot')  # Writes the tree in Graphviz DOT format
5. Dealing with Outliers
Steps:
- Identify Outliers:
- Define outliers as values beyond two standard deviations from the median.
- Filter Outliers:
- Implement a function to remove outliers from the dataset.
- Visualize Filtered Data:
- Plot histograms before and after filtering outliers to compare distributions.
- Experiment with Thresholds:
- Adjust threshold for identifying outliers and observe effects on results.
def reject_outliers(data):
    u = np.median(data)
    s = np.std(data)
    filtered = [e for e in data if (u - 2*s < e < u + 2*s)]
    return filtered
filtered_data = reject_outliers(incomes)
plt.hist(filtered_data, 50)
plt.show()
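For the threshold experiment, one option is to make the cutoff a parameter. The sketch below is a variant of `reject_outliers` with a hypothetical `n_std` argument (not in the original function), applied to freshly generated incomes so that it runs on its own.

```python
import numpy as np

def reject_outliers(data, n_std=2.0):
    """Keep values within n_std standard deviations of the median."""
    data = np.asarray(data)
    center = np.median(data)
    spread = np.std(data)
    return data[np.abs(data - center) < n_std * spread]

rng = np.random.default_rng(0)
incomes = rng.normal(27000, 15000, 10000)

for n_std in (1.0, 2.0, 3.0):
    kept = reject_outliers(incomes, n_std)
    print(f"n_std = {n_std}: kept {kept.size} of {incomes.size} values")
```

For normally distributed data, cutoffs of one, two, and three standard deviations keep roughly 68%, 95%, and 99.7% of the values, so a tighter threshold also discards more legitimate data points.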
6. Naive Bayes Spam Classifier
Steps:
- Load Email Data:
- Read spam/ham email datasets into pandas DataFrame.
- Vectorize Text Data:
- Use `CountVectorizer` to convert text into numerical features.
- Train Naive Bayes Model:
- Use `sklearn.naive_bayes.MultinomialNB` to train a spam classifier.
- Evaluate Model:
- Test classifier accuracy on unseen data.
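The steps above can be sketched end to end on a toy corpus. The six messages and their labels below are made up for illustration and stand in for the real spam/ham dataset, which would normally be loaded into a pandas DataFrame first.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical corpus standing in for the real spam/ham email dataset.
messages = [
    "Free money now, claim your prize",
    "Win a free prize today",
    "Cheap pills, free offer",
    "Meeting moved to Monday morning",
    "Can you review my report today",
    "Lunch with the team on Friday",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Convert each message into a vector of word counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messages)

# Fit the multinomial naive Bayes classifier on the word counts.
classifier = MultinomialNB()
classifier.fit(counts, labels)

# Classify unseen messages; transform() reuses the vocabulary learned above.
examples = ["Free prize offer", "Report for the Monday meeting"]
predictions = classifier.predict(vectorizer.transform(examples))
print(list(zip(examples, predictions)))
```

On a real dataset, accuracy would be measured on a held-out split rather than on two hand-picked messages. The important detail is that new text goes through `transform`, not `fit_transform`, so that its columns line up with the training vocabulary.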
7. Covariance and Correlation
Steps:
- Generate Random Data:
- Create synthetic page speed vs purchase amount data using `numpy`.
- Compute Covariance:
- Implement covariance calculation manually or use `numpy.cov`.
- Compute Correlation Coefficient:
- Use correlation formula or built-in functions like `numpy.corrcoef`.
- Interpret Results:
- Analyze covariance and correlation values to determine relationships between variables.
def covariance(X, Y):
    return np.mean((X - np.mean(X)) * (Y - np.mean(Y)))
print("Covariance:", covariance(pageSpeeds, purchaseAmount))
print("Correlation:", np.corrcoef(pageSpeeds, purchaseAmount)[0,1])
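To check the manual calculation against the built-ins, a comparison along these lines may help. The `correlation` helper and the regenerated data are assumptions for illustration. Note that `np.cov` divides by N - 1 by default, so `bias=True` is needed to match a covariance that divides by N.

```python
import numpy as np

def covariance(x, y):
    """Population covariance: mean product of deviations (divides by N)."""
    return np.mean((x - np.mean(x)) * (y - np.mean(y)))

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / (np.std(x) * np.std(y))

rng = np.random.default_rng(0)
page_speeds = rng.normal(3.0, 1.0, 1000)
purchase_amount = 100 - 3 * page_speeds + rng.normal(0, 0.1, 1000)

# np.cov divides by N - 1 by default; bias=True switches it to N.
print("Manual covariance: ", covariance(page_speeds, purchase_amount))
print("np.cov (bias=True):", np.cov(page_speeds, purchase_amount, bias=True)[0, 1])
print("Manual correlation:", correlation(page_speeds, purchase_amount))
print("np.corrcoef:       ", np.corrcoef(page_speeds, purchase_amount)[0, 1])
```

The covariance depends on the units of both variables, which makes its magnitude hard to interpret. The correlation is normalized to the range -1 to 1, and here it should come out close to -1 because purchase amount falls almost linearly with page speed.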
8. Multiple Regression
Steps:
- Load Dataset:
- Import car price dataset into pandas DataFrame.
- Scale Features:
- Standardize numerical features (e.g., mileage, cylinder count) using `StandardScaler`.
- Fit Regression Model:
- Use `statsmodels.api` to fit an Ordinary Least Squares (OLS) regression model.
- Analyze Coefficients:
- Examine coefficients to identify key predictors of car price.
- Predict Values:
- Scale input features and use trained model for predictions.
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
df = pd.read_excel('cars.xls')
X = df[['Mileage', 'Cylinder', 'Doors']]
y = df['Price']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Zero mean, unit variance per feature
X_scaled = sm.add_constant(X_scaled)  # Add the intercept column
model = sm.OLS(y, X_scaled).fit()
print(model.summary())
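The prediction step depends on scaling a new input with the statistics learned from the training data, not with statistics of the new input itself. The sketch below illustrates that idea with NumPy only: manual standardization stands in for `StandardScaler`, and `np.linalg.lstsq` stands in for `sm.OLS`. The synthetic car data and the price relationship are invented for illustration and are not taken from `cars.xls`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the car dataset: columns are mileage, cylinders, doors.
X = np.column_stack([
    rng.uniform(5_000, 50_000, 200),
    rng.choice([4, 6, 8], 200),
    rng.choice([2, 4], 200),
])
# Made-up price relationship, used only so the fit has something to recover.
y = 20_000 - 0.2 * X[:, 0] + 3_000 * X[:, 1] + rng.normal(0, 500, 200)

# Standardize with the training mean and standard deviation.
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std

# OLS with an intercept column, solved by least squares.
design = np.column_stack([np.ones(len(X_scaled)), X_scaled])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

def predict(mileage, cylinders, doors):
    """Scale a new car with the training statistics, then apply the coefficients."""
    scaled = (np.array([mileage, cylinders, doors]) - mean) / std
    return coef[0] + scaled @ coef[1:]

print(f"Predicted price: {predict(45_000, 8, 4):,.0f}")
```

With the scikit-learn and statsmodels objects from the code above, the equivalent would be to call `scaler.transform` (not `fit_transform`) on the new row and prepend the constant term before calling `model.predict`.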
9. Examples of Distributions
Uniform Distribution
- Generate random values from a uniform distribution using `numpy.random.uniform`.
- Visualize distribution with a histogram (`matplotlib`).
Normal Distribution
- Generate random values from a normal distribution (`numpy.random.normal`) with specified mean and standard deviation.
- Visualize distribution with a histogram.
Exponential PDF
- Generate exponential distribution data using `scipy.stats.expon`.
- Plot probability density function (PDF).
mu, sigma = 5.0, 2.0
values = np.random.normal(mu, sigma, 10000)
plt.hist(values, 50)
plt.show()
from scipy.stats import poisson
mu = 500
x = np.arange(400, 600)  # Integer counts only: the Poisson PMF is zero at non-integers
plt.plot(x, poisson.pmf(x, mu))
plt.show()
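The code above covers the normal distribution and a Poisson PMF. The uniform and exponential items can be sketched in the same spirit. To stay self-contained, this version bins the samples with `np.histogram` instead of drawing them and writes the rate-1 exponential density as `np.exp(-x)`, which matches `scipy.stats.expon.pdf(x)` with its default location and scale. The range of -10 to 10 and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform distribution: every value in [-10, 10) is equally likely.
uniform_values = rng.uniform(-10.0, 10.0, 100_000)
counts, bin_edges = np.histogram(uniform_values, bins=50)
print("Uniform bin counts (min, max):", counts.min(), counts.max())

# Exponential PDF with rate 1: f(x) = exp(-x) for x >= 0.
step = 0.001
x = np.arange(0, 10, step)
pdf = np.exp(-x)
area = np.sum(pdf) * step  # Riemann-sum approximation of the integral
print("PDF at x = 0:", pdf[0])
print("Approximate area under the curve:", area)
```

Passing `uniform_values` to `plt.hist` with 50 bins and `x, pdf` to `plt.plot` would draw the two shapes: a flat histogram for the uniform samples and a curve that decays from 1 toward 0 for the exponential density.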