Unit 2: Data Science Methodology – An Analytic Approach to the Capstone Project
Contents
- Topic Tree
- Terms and Definitions
- Activities
- Study Notes
- Videos
- Question Bank
Topic Tree
1. Introduction
- Importance of methodology in AI/DS projects
- Framework for systematic execution
2. The Methodology Framework (10 Steps grouped into 5 modules)
Module 1: From Problem to Approach
- 2.1.1 Business Understanding (problem scoping, 5W1H, Design Thinking)
- 2.1.2 Analytic Approach (choosing type of analytics: descriptive, diagnostic, predictive, prescriptive)
Module 2: From Requirements to Collection
- 2.1.3 Data Requirements (content, format, sources)
- 2.1.4 Data Collection (methods, online/offline sources)
Module 3: From Understanding to Preparation
- 2.1.5 Data Understanding (exploratory analysis, visualization)
- 2.1.6 Data Preparation (cleaning, integration, transformation, feature engineering)
Module 4: From Modelling to Evaluation
- 2.1.7 AI Modelling (descriptive vs. predictive models, training/testing, algorithm choice)
- 2.1.8 Evaluation (metrics like accuracy, precision, recall, F1 score; diagnostic & statistical validation)
Module 5: From Deployment to Feedback
- 2.1.9 Deployment (integration into real-world use, limited rollout, productionization)
- 2.1.10 Feedback (user response, iterative refinement, automation of retraining)
3. Model Validation
- Importance of validation (prevent overfitting/underfitting)
- Techniques: Train-test split, K-Fold cross validation, LOOCV, Time-series CV
4. Model Performance Metrics
- Metrics for classification (Accuracy, Precision, Recall, F1-score)
- Metrics for regression (MSE, RMSE, R²)
5. Capstone Project Application
- Integrating all steps into a real-world project
- Iterative refinement across modules
- Use of case studies, discussions, and hands-on activities
Terms and Definitions
- Data Science Methodology:
A prescribed sequence of iterative steps that data scientists follow to approach a problem, analyze data, and find solutions systematically.
- Business Understanding (Problem Scoping/Defining):
The process of identifying the real-world problem to solve, using tools like the 5W1H Problem Canvas and the Design Thinking (DT) framework.
- Analytic Approach:
Choosing the right type of analytics for the problem:
  - Descriptive Analytics – What happened?
  - Diagnostic Analytics – Why did it happen?
  - Predictive Analytics – What is likely to happen?
  - Prescriptive Analytics – What should we do about it?
- Data Requirements:
Defining the necessary content, format, and sources of data for analysis.
- Data Collection:
Gathering raw data from structured, semi-structured, or unstructured sources, either online or offline.
- Data Understanding (Exploratory Data Analysis – EDA):
Using visualization and summary statistics to discover patterns, spot anomalies, and test hypotheses.
- Data Preparation:
Cleaning, integrating, transforming, and engineering features to make the dataset suitable for modelling.
- AI Modelling:
Building descriptive models (to summarize data) or predictive models (to forecast outcomes). This involves choosing algorithms, training/testing, and iterative refinement.
- Evaluation:
Assessing model performance using metrics such as:
  - Classification: Accuracy, Precision, Recall, F1-Score
  - Regression: MSE, RMSE, R²
- Deployment:
Implementing the validated model in real-world applications (e.g., production systems, limited rollout).
- Feedback:
Collecting user/system responses post-deployment and iteratively refining the model for improvements.
- Model Validation:
Techniques to test generalization, e.g., Train-test split, K-Fold Cross Validation, Leave-One-Out (LOOCV), Time-series CV.
- Overfitting:
A condition in which a model learns noise along with signal, performing well on training data but poorly on unseen data.
- Underfitting:
A model too simple to capture the underlying patterns, leading to poor accuracy on both training and testing data.
Activities
Study Notes
1. Introduction
- Data Science Methodology provides a structured approach for solving real-world problems using data.
- It ensures systematic planning, analysis, and execution in projects, especially Capstone projects.
- The methodology follows an iterative, modular framework.
2. Methodology Framework (10 Steps in 5 Modules)
🔹 Module 1: From Problem to Approach
- Business Understanding
- Identify and define the problem.
- Use tools: 5W1H Problem Canvas (Who, What, When, Where, Why, How).
- Apply Design Thinking for human-centered solutions.
- Analytic Approach
- Decide type of analytics:
- Descriptive – What happened?
- Diagnostic – Why did it happen?
- Predictive – What will happen?
- Prescriptive – What should we do?
🔹 Module 2: From Requirements to Collection
- Data Requirements
- Define: content, format, sources.
- Ensure relevant, sufficient, and quality data.
- Data Collection
- Gather data from primary or secondary sources.
- Sources: surveys, sensors, databases, web scraping, APIs.
🔹 Module 3: From Understanding to Preparation
- Data Understanding
- Perform Exploratory Data Analysis (EDA).
- Use visualizations and statistical summaries to identify trends, anomalies, and patterns.
- Data Preparation
- Clean, integrate, and transform data.
- Handle missing values, outliers, duplicates.
- Perform feature engineering for better model performance.
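The cleaning steps above can be sketched in plain Python. The records, field names, and values below are hypothetical, used only to illustrate deduplication, median imputation, and label encoding:

```python
from statistics import median

# Hypothetical customer records; None marks a missing age.
records = [
    {"id": 1, "age": 34, "churn": "Yes"},
    {"id": 2, "age": None, "churn": "No"},
    {"id": 3, "age": 51, "churn": "No"},
    {"id": 2, "age": None, "churn": "No"},   # duplicate of id 2
]

# 1. Remove duplicate records (same id).
seen, deduped = set(), []
for r in records:
    if r["id"] not in seen:
        seen.add(r["id"])
        deduped.append(r)

# 2. Fill missing ages with the median of the known ages.
known_ages = [r["age"] for r in deduped if r["age"] is not None]
age_median = median(known_ages)
for r in deduped:
    if r["age"] is None:
        r["age"] = age_median

# 3. Encode the Yes/No churn label as 1/0 for modelling.
for r in deduped:
    r["churn"] = 1 if r["churn"] == "Yes" else 0
```

In a real project a library such as pandas would typically handle these steps, but the logic is the same.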
🔹 Module 4: From Modelling to Evaluation
- AI Modelling
- Choose appropriate algorithms (classification, regression, clustering, etc.).
- Split into training/testing datasets.
- Train models iteratively.
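The train/test split mentioned above can be sketched without any ML library; this minimal version shuffles reproducibly and holds out a share of the data (the 70/30 ratio matches the worked example later in this unit):

```python
import random

def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle rows reproducibly, then hold out the last test_ratio share."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

data = list(range(100))            # stand-in for 100 labelled examples
train, test = train_test_split(data)
```

Every example lands in exactly one of the two sets, so the test set stays unseen during training.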
- Evaluation
- Use metrics to check performance:
- Classification: Accuracy, Precision, Recall, F1-score.
- Regression: MSE, RMSE, R².
- Ensure statistical validation and avoid overfitting/underfitting.
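All of these metrics follow directly from their formulas. A minimal sketch, using illustrative confusion-matrix counts and toy regression values (not from any real model):

```python
import math

# Classification metrics from confusion-matrix counts.
tp, fp, fn, tn = 70, 18, 30, 82
accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

# Regression metrics from predicted vs. true values.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]
n = len(y_true)
mse  = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
rmse = math.sqrt(mse)
mean_y = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot    # coefficient of determination
```

Libraries such as scikit-learn provide these as ready-made functions; computing them by hand once makes the formulas concrete.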
🔹 Module 5: From Deployment to Feedback
- Deployment
- Integrate the model into real-world applications.
- Deploy in limited rollout or full production environment.
- Feedback
- Collect responses from users/systems.
- Continuously improve via feedback loop and model retraining.
3. Model Validation Techniques
- Train-Test Split – divide data into training and testing sets.
- K-Fold Cross Validation – split into multiple folds for robust evaluation.
- Leave-One-Out CV (LOOCV) – extreme case of cross-validation.
- Time-series CV – for sequential/temporal data.
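The fold construction behind K-Fold cross validation can be sketched in a few lines; each example appears in exactly one test fold:

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n))
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
```

Setting k equal to the number of samples gives Leave-One-Out CV; in practice the data would be shuffled first unless it is ordered in time, in which case time-series CV (which never trains on the future) is used instead.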
4. Key Challenges
- Overfitting: Model fits training data too closely, poor generalization.
- Underfitting: Model too simple, fails to capture patterns.
- Bias vs Variance Trade-off: Must balance complexity and accuracy.
5. Capstone Project Application
- Capstone projects simulate real-world problem-solving using this methodology.
- Each step (Problem → Approach → Data → Modelling → Deployment → Feedback) must be documented.
- Encourages hands-on practice, case studies, and iteration until effective solutions are achieved.
✅ Summary:
The Data Science Methodology framework is the backbone of an AI/DS Capstone Project. It ensures problems are well-defined, data is systematically handled, models are validated, and solutions are deployed with feedback for continuous improvement.
Explanation with examples
1. Business Understanding (Problem Scoping)
Goal: Define the problem clearly.
- Example:
A retail chain wants to reduce customer churn (customers who stop buying).
- 5W1H:
  - Who? Customers
  - What? Churn (stop buying)
  - When? Last 6 months
  - Where? Online store
  - Why? Loss of revenue
  - How? Identify patterns & predict at-risk customers
2. Analytic Approach
Goal: Decide which type of analytics to use.
- Example: For churn prediction:
- Descriptive Analytics: Past churn rate = 20%.
- Diagnostic Analytics: Customers left due to poor service.
- Predictive Analytics: Which current customers are likely to churn?
- Prescriptive Analytics: Offer discounts to high-risk customers.
3. Data Requirements
Goal: Identify what data is needed.
- Example: For churn:
- Content: Purchase history, complaints, demographics.
- Format: Structured (tables), semi-structured (chat logs).
- Source: Company database, customer feedback forms.
4. Data Collection
Goal: Gather the required data.
- Example:
- Collect transaction logs (structured).
- Use web scraping for customer reviews (unstructured).
- Conduct surveys for satisfaction ratings.
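Collecting the structured part of this data often means parsing an exported log file. A minimal sketch with the standard-library `csv` module; the column names and values are hypothetical:

```python
import csv
import io

# Hypothetical transaction log as it might arrive from a database export.
raw = """customer_id,amount,date
101,25.50,2024-01-05
102,40.00,2024-01-06
101,12.75,2024-01-09
"""

# csv.DictReader turns each line of the structured log into a dict.
rows = list(csv.DictReader(io.StringIO(raw)))

# Aggregate total spend per customer as a first collection check.
amount_by_customer = {}
for row in rows:
    cid = row["customer_id"]
    amount_by_customer[cid] = amount_by_customer.get(cid, 0.0) + float(row["amount"])
```

In a real project the `io.StringIO` stand-in would be replaced by an open file or an API response.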
5. Data Understanding (EDA)
Goal: Explore patterns in data.
- Example:
- Visualize churn rate by age group – older customers churn less.
- Analyze complaint categories – “late delivery” is most common.
- Correlation: Customers with low satisfaction scores often churn.
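A group-wise churn rate like the one described above is a one-pass aggregation. The observations below are made up to reproduce the "older customers churn less" pattern:

```python
from collections import defaultdict

# Hypothetical (age_group, churned) observations.
observations = [
    ("18-30", True), ("18-30", True), ("18-30", False), ("18-30", False),
    ("31-50", True), ("31-50", False), ("31-50", False), ("31-50", False),
    ("51+", False), ("51+", False), ("51+", False), ("51+", False),
]

counts = defaultdict(lambda: [0, 0])   # group -> [churned, total]
for group, churned in observations:
    counts[group][0] += int(churned)
    counts[group][1] += 1

churn_rate = {g: churned / total for g, (churned, total) in counts.items()}
```

The same aggregation is a one-liner with `pandas.groupby`, and plotting the resulting rates is the visualization step of EDA.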
6. Data Preparation
Goal: Clean and transform data for modeling.
- Example:
- Handle missing values (e.g., fill missing ages with median).
- Remove duplicate transactions.
- Convert “Yes/No” churn column into binary (1/0).
- Engineer new features: “Average Purchase Value per Month.”
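The engineered feature from the last bullet is a simple derived column. The customer values here are hypothetical:

```python
# Hypothetical per-customer totals and months of activity.
customers = [
    {"id": 1, "total_spend": 1200.0, "months_active": 12, "churn": "Yes"},
    {"id": 2, "total_spend": 300.0, "months_active": 3, "churn": "No"},
]

for c in customers:
    # Engineered feature: average purchase value per month.
    c["avg_purchase_per_month"] = c["total_spend"] / max(c["months_active"], 1)
    # Convert the Yes/No churn column into binary 1/0.
    c["churn"] = 1 if c["churn"] == "Yes" else 0
```

The `max(..., 1)` guard is a small defensive choice so a brand-new customer with zero recorded months does not cause a division by zero.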
7. AI Modelling
Goal: Build predictive models.
- Example:
- Apply Logistic Regression or Random Forest to predict churn.
- Train the model on 70% of data, test on 30%.
- Model predicts which customers are likely to churn next month.
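To make the modelling step concrete, here is a minimal logistic regression trained by gradient descent on one made-up feature (satisfaction score). A real project would use a library implementation such as scikit-learn's `LogisticRegression`; this hand-rolled version only illustrates the idea:

```python
import math

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit a one-feature logistic model p(churn) = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x     # gradient of log-loss w.r.t. w
            b -= lr * (p - y)         # gradient of log-loss w.r.t. b
    return w, b

# Toy data: satisfaction score (0-1) vs. churned (1) / stayed (0).
scores  = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
churned = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(scores, churned)

def predict(x):
    """Predict churn (1) when the modelled probability reaches 0.5."""
    return 1 if 1 / (1 + math.exp(-(w * x + b))) >= 0.5 else 0
```

On this separable toy set the model learns that low satisfaction predicts churn; with real data the split into training and test sets from step 7 would be applied before fitting.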
8. Evaluation
Goal: Check accuracy and reliability.
- Example:
- Accuracy: 85% of predictions are correct.
- Precision: Of customers predicted to churn, 80% actually did.
- Recall: Model captured 70% of all churners.
- F1-Score: Harmonic mean of precision and recall = 2 × (0.80 × 0.70) / (0.80 + 0.70) ≈ 0.75.
- Use confusion matrix to analyze true/false predictions.
9. Deployment
Goal: Put the model into real-world use.
- Example:
- Integrate churn prediction into CRM software.
- Sales team gets a list of “high churn risk” customers daily.
- System triggers personalized email offers automatically.
10. Feedback
Goal: Refine the model using real-world outcomes.
- Example:
- After deployment, the churn rate drops from 20% → 12%.
- Feedback shows younger customers ignore email offers.
- Model retrained with SMS engagement feature to improve results.
✅ Summary with Capstone Example
- Capstone Project Idea: Predicting Student Dropout in an Online Course
- Business Understanding: University wants to reduce dropouts.
- Analytic Approach: Predictive (who will drop out?).
- Data Requirements: Attendance, quiz scores, engagement time.
- Data Collection: LMS logs, surveys.
- EDA: Students with low quiz scores tend to drop out.
- Preparation: Handle missing attendance logs, create engagement index.
- Modelling: Train a Decision Tree classifier.
- Evaluation: Model achieves 82% accuracy.
- Deployment: Faculty dashboard shows “at-risk students.”
- Feedback: More tutoring sessions → dropout rate reduces.
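The Decision Tree step of this capstone sketch can be illustrated with its simplest special case, a depth-1 "stump" that searches for the best quiz-score threshold. The (quiz_score, dropped_out) pairs are invented and deliberately easy to separate:

```python
# Hypothetical (quiz_score, dropped_out) training pairs.
data = [(35, 1), (40, 1), (45, 1), (55, 0), (70, 0), (80, 0), (90, 0), (30, 1)]

def best_stump(pairs):
    """Pick the score threshold that best separates dropouts (a depth-1 tree)."""
    best_t, best_correct = None, -1
    for t in sorted({score for score, _ in pairs}):
        # Rule: predict dropout (1) when score < t, otherwise 0.
        correct = sum((1 if s < t else 0) == y for s, y in pairs)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t, best_correct / len(pairs)

threshold, accuracy = best_stump(data)
```

A full decision tree repeats this threshold search recursively on each branch; on real, noisier data the accuracy would look more like the 82% quoted above than the perfect score this toy set allows.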
Videos