ML Techniques Summary

1. Conditional Probability Solution

Steps:

  1. Simulate Data:
    • Generate random data using numpy.random.choice for age groups and purchase probability.
    • Count total purchases and group them by age.
  2. Compute Conditional Probability:
    • Calculate $ P(\text{purchase} \mid \text{age group}) $ using the formula $ P(E|F) = \frac{\text{purchases in age group}}{\text{total in age group}} $.
  3. Compute Overall Probability:
    • Calculate $ P(\text{purchase}) = \frac{\text{total purchases}}{\text{total trials}} $.
  4. Analyze Independence:
    • Compare $ P(E|F) $ and $ P(E) $ to determine if the variables are independent.
from numpy import random
random.seed(0)

totals = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
purchases = {20:0, 30:0, 40:0, 50:0, 60:0, 70:0}
totalPurchases = 0

for _ in range(100000):
    ageDecade = random.choice([20, 30, 40, 50, 60, 70])
    purchaseProbability = 0.4  # Fixed probability
    totals[ageDecade] += 1
    if random.random() < purchaseProbability:
        totalPurchases += 1
        purchases[ageDecade] += 1

PEF = float(purchases[30]) / float(totals[30])
print(f"P(purchase | 30s): {PEF}")
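
Steps 3 and 4 can be sketched as follows. This is a minimal, vectorized variant that assumes the same fixed 0.4 purchase probability; the helper name `simulate` and the 0.02 tolerance are illustrative choices, not part of the original exercise.

```python
import numpy as np

def simulate(n_trials=100_000, purchase_probability=0.4, seed=0):
    """Return (P(purchase | 30s), P(purchase)) from a simulated population."""
    rng = np.random.default_rng(seed)
    ages = rng.choice([20, 30, 40, 50, 60, 70], size=n_trials)
    # Purchases are drawn without reference to age in this simulation.
    purchased = rng.random(n_trials) < purchase_probability

    in_thirties = ages == 30
    p_e_given_f = purchased[in_thirties].mean()  # P(E|F)
    p_e = purchased.mean()                       # P(E)
    return p_e_given_f, p_e

p_e_given_f, p_e = simulate()
print(f"P(purchase | 30s): {p_e_given_f:.4f}")
print(f"P(purchase):       {p_e:.4f}")
# If the two values are close, the data are consistent with independence.
print("Roughly independent:", abs(p_e_given_f - p_e) < 0.02)
```

Making the purchase probability depend on age (for example, `ageDecade / 100`) should pull the two values apart, which is the signature of dependence.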

2. Mean & Median Exercise

Steps:

  1. Generate Random Data:
    • Create synthetic e-commerce transaction data using numpy.random.normal.
  2. Visualize Data:
    • Plot a histogram of the data using matplotlib.
  3. Compute Mean and Median:
    • Use numpy.mean and numpy.median to calculate central tendencies.
  4. Experiment with Outliers:
    • Add outliers to the data and observe their impact on mean and median values.
import numpy as np
import matplotlib.pyplot as plt

incomes = np.random.normal(27000, 15000, 10000)
plt.hist(incomes, 50)
plt.show()

print("Mean:", np.mean(incomes))
print("Median:", np.median(incomes))
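
Step 4 might look like the sketch below. The single extreme value of one billion is a hypothetical outlier chosen for illustration, and the seeded generator is only there to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)
incomes = rng.normal(27000, 15000, 10000)

mean_before, median_before = np.mean(incomes), np.median(incomes)

# Append one extreme value (a hypothetical billionaire).
with_outlier = np.append(incomes, 1_000_000_000)
mean_after, median_after = np.mean(with_outlier), np.median(with_outlier)

print(f"Mean:   {mean_before:,.0f} -> {mean_after:,.0f}")
print(f"Median: {median_before:,.0f} -> {median_after:,.0f}")
```

The mean jumps by roughly the outlier divided by the sample size, while the median barely moves, which is why the median is usually the safer summary for skewed data.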

3. Linear Regression

Steps:

  1. Generate Synthetic Data:
    • Simulate page speed vs purchase amount data using numpy.random.normal.
  2. Fit Linear Regression:
    • Use scipy.stats.linregress to compute slope, intercept, R-squared, and other metrics.
  3. Visualize Fit:
    • Plot observed data points and fitted regression line.
  4. Experiment with Variability:
    • Increase random variation in data and observe changes in R-squared value.
import numpy as np
from scipy import stats

pageSpeeds = np.random.normal(3.0, 1.0, 1000)
purchaseAmount = 100 - (pageSpeeds * 3) + np.random.normal(0, 0.1, 1000)

slope, intercept, r_value, p_value, std_err = stats.linregress(pageSpeeds, purchaseAmount)
print(f"R-squared: {r_value**2}")
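
Step 4 can be explored with a sketch like this one. The noise levels 0.1, 1.0, and 5.0 and the helper name `r_squared` are illustrative assumptions; the underlying relationship is the same as in the snippet above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
page_speeds = rng.normal(3.0, 1.0, 1000)

def r_squared(noise_std):
    """Fit purchase amount against page speed and return R-squared."""
    purchase_amount = 100 - 3 * page_speeds + rng.normal(0, noise_std, 1000)
    result = stats.linregress(page_speeds, purchase_amount)
    return result.rvalue ** 2

# R-squared for increasing amounts of random variation.
results = {noise_std: r_squared(noise_std) for noise_std in (0.1, 1.0, 5.0)}
for noise_std, value in results.items():
    print(f"noise std {noise_std}: R-squared = {value:.3f}")
```

The slope stays near -3 in every case; what drops is the share of variance the line explains.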

4. Decision Trees

Steps:

  1. Load Data:
    • Import CSV file into a pandas DataFrame.
  2. Preprocess Data:
    • Convert categorical variables into numerical values (e.g., “Y” → 1, “N” → 0).
  3. Train Decision Tree Model:
    • Use sklearn.tree.DecisionTreeClassifier to train a decision tree for predicting hiring decisions.
  4. Visualize Tree:
    • Export decision tree visualization using sklearn.tree.export_graphviz.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

data = pd.read_csv('PastHires.csv')
d = {'Y':1, 'N':0}
data['Hired'] = data['Hired'].map(d)
data['Employed?'] = data['Employed?'].map(d)

features = data[['Years Experience', 'Employed?']]
y = data['Hired']
clf = DecisionTreeClassifier()
clf = clf.fit(features, y)

export_graphviz(clf, out_file='tree.dot') 
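
Because PastHires.csv is not included here, the sketch below substitutes a small made-up table with the same column names. It also uses `export_text` as a Graphviz-free way to inspect the tree, which is a swap from the `export_graphviz` call above. The candidate row is hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical rows standing in for PastHires.csv.
data = pd.DataFrame({
    'Years Experience': [10, 0, 7, 2, 20, 0, 5, 3],
    'Employed?':        ['Y', 'N', 'N', 'Y', 'N', 'N', 'Y', 'N'],
    'Hired':            ['Y', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'N'],
})

d = {'Y': 1, 'N': 0}
data['Employed?'] = data['Employed?'].map(d)
data['Hired'] = data['Hired'].map(d)

feature_names = ['Years Experience', 'Employed?']
clf = DecisionTreeClassifier(random_state=0)
clf.fit(data[feature_names], data['Hired'])

# Text rendering of the learned splits; no Graphviz install required.
print(export_text(clf, feature_names=feature_names))

# Predict for a hypothetical candidate: 10 years of experience, employed.
candidate = pd.DataFrame([[10, 1]], columns=feature_names)
prediction = clf.predict(candidate)[0]
print("Hired" if prediction == 1 else "Not hired")
```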

5. Dealing with Outliers

Steps:

  1. Identify Outliers:
    • Define outliers as values beyond two standard deviations from the median.
  2. Filter Outliers:
    • Implement a function to remove outliers from the dataset.
  3. Visualize Filtered Data:
    • Plot histograms before and after filtering outliers to compare distributions.
  4. Experiment with Thresholds:
    • Adjust threshold for identifying outliers and observe effects on results.
def reject_outliers(data):
    u = np.median(data)
    s = np.std(data)
    filtered = [e for e in data if (u - 2*s < e < u + 2*s)]
    return filtered

filtered_data = reject_outliers(incomes)
plt.hist(filtered_data, 50)
plt.show()
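
Step 4 can be tried with a parameterized variant of the filter. In this sketch the `n_std` parameter, the 20 injected values of 200,000, and the three thresholds are all illustrative assumptions rather than part of the original exercise.

```python
import numpy as np

def reject_outliers(data, n_std=2.0):
    """Keep values within n_std standard deviations of the median."""
    data = np.asarray(data)
    center = np.median(data)
    spread = np.std(data)
    return data[np.abs(data - center) < n_std * spread]

rng = np.random.default_rng(0)
incomes = rng.normal(27000, 15000, 10000)
incomes = np.append(incomes, np.full(20, 200_000.0))  # hypothetical outliers

kept = {}
for n_std in (1.0, 2.0, 3.0):
    filtered = reject_outliers(incomes, n_std)
    kept[n_std] = len(filtered)
    print(f"n_std={n_std}: kept {len(filtered)} of {len(incomes)}, "
          f"mean = {filtered.mean():,.0f}")
```

A tighter threshold removes the injected outliers but also discards a growing share of ordinary values, so the choice is a trade-off rather than a fixed rule.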

6. Naive Bayes Spam Classifier

Steps:

  1. Load Email Data:
    • Read spam/ham email datasets into pandas DataFrame.
  2. Vectorize Text Data:
    • Use CountVectorizer to convert text into numerical features.
  3. Train Naive Bayes Model:
    • Use sklearn.naive_bayes.MultinomialNB to train a spam classifier.
  4. Evaluate Model:
    • Test classifier accuracy on unseen data.
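
A minimal sketch of steps 1 to 3, assuming a tiny in-memory table in place of the real email files. The six messages, the column names `message` and `class`, and the two test phrases are all made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical messages standing in for the spam/ham email datasets.
data = pd.DataFrame({
    'message': [
        'Free money now, click here',
        'Win a free prize today',
        'Claim your free cash prize now',
        'Meeting moved to 3pm tomorrow',
        'Can you review my pull request',
        'Lunch tomorrow with the team',
    ],
    'class': ['spam', 'spam', 'spam', 'ham', 'ham', 'ham'],
})

# Turn each message into a vector of word counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'])

classifier = MultinomialNB()
classifier.fit(counts, data['class'])

# New messages go through the already fitted vectorizer (transform only).
examples = ['Free prize money', 'Team meeting tomorrow']
predictions = classifier.predict(vectorizer.transform(examples))
print(list(zip(examples, predictions)))
```

For step 4 on a real dataset, one option is to hold out part of the data with `sklearn.model_selection.train_test_split` and score the classifier on that held-out portion; six rows are too few for a meaningful accuracy figure.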

7. Covariance and Correlation

Steps:

  1. Generate Random Data:
    • Create synthetic page speed vs purchase amount data using numpy.
  2. Compute Covariance:
    • Implement covariance calculation manually or use numpy.cov.
  3. Compute Correlation Coefficient:
    • Use correlation formula or built-in functions like numpy.corrcoef.
  4. Interpret Results:
    • Analyze covariance and correlation values to determine relationships between variables.
def covariance(X, Y):
    return np.mean((X - np.mean(X)) * (Y - np.mean(Y)))

print("Covariance:", covariance(pageSpeeds, purchaseAmount))
print("Correlation:", np.corrcoef(pageSpeeds, purchaseAmount)[0,1])
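
The manual formula divides by n, whereas `numpy.cov` divides by n - 1 by default, so the two differ slightly unless `bias=True` is passed. The sketch below checks both quantities against NumPy; the `correlation` helper and the seeded data are illustrative additions.

```python
import numpy as np

def covariance(x, y):
    """Population covariance: mean product of deviations (divides by n)."""
    return np.mean((x - np.mean(x)) * (y - np.mean(y)))

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / (np.std(x) * np.std(y))

rng = np.random.default_rng(0)
page_speeds = rng.normal(3.0, 1.0, 1000)
purchase_amount = 100 - 3 * page_speeds + rng.normal(0, 0.1, 1000)

manual_cov = covariance(page_speeds, purchase_amount)
# bias=True makes np.cov divide by n, matching the manual version.
numpy_cov = np.cov(page_speeds, purchase_amount, bias=True)[0, 1]

manual_corr = correlation(page_speeds, purchase_amount)
numpy_corr = np.corrcoef(page_speeds, purchase_amount)[0, 1]

print(f"Covariance:  manual {manual_cov:.4f}, numpy {numpy_cov:.4f}")
print(f"Correlation: manual {manual_corr:.4f}, numpy {numpy_corr:.4f}")
```

Covariance depends on the units of both variables, so its magnitude is hard to read on its own; correlation is unit-free and bounded by -1 and 1.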

8. Multiple Regression

Steps:

  1. Load Dataset:
    • Import car price dataset into pandas DataFrame.
  2. Scale Features:
    • Standardize numerical features (e.g., mileage, cylinder count) using StandardScaler.
  3. Fit Regression Model:
    • Use statsmodels.api to fit an Ordinary Least Squares (OLS) regression model.
  4. Analyze Coefficients:
    • Examine coefficients to identify key predictors of car price.
  5. Predict Values:
    • Scale input features and use trained model for predictions.
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

df = pd.read_excel('cars.xls')
X = df[['Mileage', 'Cylinder', 'Doors']]
y = df['Price']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = sm.add_constant(X_scaled)

model = sm.OLS(y, X_scaled).fit()
print(model.summary())

9. Examples of Distributions

Uniform Distribution

  1. Generate random values from a uniform distribution using numpy.random.uniform.
  2. Visualize distribution with a histogram (matplotlib).

Normal Distribution

  1. Generate random values from a normal distribution (numpy.random.normal) with specified mean and standard deviation.
  2. Visualize distribution with a histogram.

Exponential PDF

  1. Generate exponential distribution data using scipy.stats.expon.
  2. Plot probability density function (PDF).
mu, sigma = 5.0, 2.0
values = np.random.normal(mu, sigma, 10000)
plt.hist(values, 50)
plt.show()

# Poisson PMF: the distribution is discrete, so evaluate it at integer counts only.
from scipy.stats import poisson
mu = 500
x = np.arange(400, 600, 1)
plt.plot(x, poisson.pmf(x, mu))
plt.show()
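
The uniform and exponential steps listed above have no snippet of their own. A minimal sketch, assuming a range of -10 to 10 for the uniform draw and scipy's default scale of 1 for the exponential; plotting is left out so the block runs without a display.

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(0)

# Uniform: every value in [-10, 10) is equally likely.
uniform_values = rng.uniform(-10.0, 10.0, 100_000)
counts, bin_edges = np.histogram(uniform_values, bins=50)
print("Smallest and largest bin counts:", counts.min(), counts.max())

# Exponential PDF with the default scale of 1: f(x) = exp(-x) for x >= 0.
x = np.arange(0, 10, 0.001)
pdf = expon.pdf(x)
print("PDF at x = 0:", pdf[0])
```

To visualize these, `plt.hist(uniform_values, 50)` and `plt.plot(x, pdf)` would follow the same pattern as the normal-distribution snippet.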