Data Science Methodology

Unit 2: Data Science Methodology – An Analytic Approach to Capstone Project

Contents

  1. Topic Tree
  2. Terms and Definitions
  3. Activities
  4. Study Notes
  5. Videos
  6. Question Bank

Topic Tree

1. Introduction

  • Importance of methodology in AI/DS projects
  • Framework for systematic execution

2. The Methodology Framework (10 Steps grouped into 5 modules)

Module 1: From Problem to Approach

  • 2.1.1 Business Understanding (problem scoping, 5W1H, Design Thinking)
  • 2.1.2 Analytic Approach (choosing type of analytics: descriptive, diagnostic, predictive, prescriptive)

Module 2: From Requirements to Collection

  • 2.1.3 Data Requirements (content, format, sources)
  • 2.1.4 Data Collection (methods, online/offline sources)

Module 3: From Understanding to Preparation

  • 2.1.5 Data Understanding (exploratory analysis, visualization)
  • 2.1.6 Data Preparation (cleaning, integration, transformation, feature engineering)

Module 4: From Modelling to Evaluation

  • 2.1.7 AI Modelling (descriptive vs. predictive models, training/testing, algorithm choice)
  • 2.1.8 Evaluation (metrics like accuracy, precision, recall, F1 score; diagnostic & statistical validation)

Module 5: From Deployment to Feedback

  • 2.1.9 Deployment (integration into real-world use, limited rollout, productionization)
  • 2.1.10 Feedback (user response, iterative refinement, automation of retraining)

3. Model Validation

  • Importance of validation (prevent overfitting/underfitting)
  • Techniques: Train-test split, K-Fold cross validation, LOOCV, Time-series CV

4. Model Performance Metrics

  • Metrics for classification (Accuracy, Precision, Recall, F1-score)
  • Metrics for regression (MSE, RMSE, R²)

5. Capstone Project Application

  • Integrating all steps into a real-world project
  • Iterative refinement across modules
  • Use of case studies, discussions, and hands-on activities

Terms and Definitions

  • Data Science Methodology:
    A prescribed sequence of iterative steps that data scientists follow to approach a problem, analyze data, and find solutions systematically.
  • Business Understanding (Problem Scoping/Defining):
    The process of identifying the real-world problem to solve, using tools like the 5W1H Problem Canvas and Design Thinking (DT) framework.
  • Analytic Approach:
    Choosing the right type of analytics for the problem:
    • Descriptive Analytics – What happened?
    • Diagnostic Analytics – Why did it happen?
    • Predictive Analytics – What is likely to happen?
    • Prescriptive Analytics – What should we do about it?
  • Data Requirements:
    Defining the necessary content, format, and sources of data for analysis.
  • Data Collection:
    Gathering raw data from structured, semi-structured, or unstructured sources, either online or offline.
  • Data Understanding (Exploratory Data Analysis – EDA):
    Using visualization and summary statistics to discover patterns, spot anomalies, and test hypotheses.
  • Data Preparation:
    Cleaning, integrating, transforming, and engineering features to make the dataset suitable for modelling.
  • AI Modelling:
    Building descriptive models (to summarize data) or predictive models (to forecast outcomes). This involves choosing algorithms, training/testing, and iterative refinement.
  • Evaluation:
    Assessing model performance using metrics such as:
    • Classification: Accuracy, Precision, Recall, F1-Score
    • Regression: MSE, RMSE, R²
  • Deployment:
    Implementing the validated model in real-world applications (e.g., production systems, limited rollout).
  • Feedback:
    Collecting user/system responses post-deployment and iteratively refining the model for improvements.
  • Model Validation:
    Techniques to test generalization, e.g., Train-test split, K-Fold Cross Validation, Leave-One-Out (LOOCV), Time-series CV.
  • Overfitting:
    A condition when a model learns noise along with signal, performing well on training data but poorly on unseen data.
  • Underfitting:
    A model too simple to capture underlying patterns, leading to poor accuracy on both training and testing data.

Activities

Word Search Game

Crossword Puzzle

Study Notes

1. Introduction

  • Data Science Methodology provides a structured approach for solving real-world problems using data.
  • It ensures systematic planning, analysis, and execution in projects, especially Capstone projects.
  • The methodology follows an iterative, modular framework.

2. Methodology Framework (10 Steps in 5 Modules)

🔹 Module 1: From Problem to Approach

  1. Business Understanding
    • Identify and define the problem.
    • Use tools: 5W1H Problem Canvas (Who, What, When, Where, Why, How).
    • Apply Design Thinking for human-centered solutions.
  2. Analytic Approach
    • Decide type of analytics:
      • Descriptive – What happened?
      • Diagnostic – Why did it happen?
      • Predictive – What will happen?
      • Prescriptive – What should we do?

🔹 Module 2: From Requirements to Collection

  3. Data Requirements
    • Define: content, format, sources.
    • Ensure the data is relevant, sufficient, and of good quality.
  4. Data Collection
    • Gather data from primary or secondary sources.
    • Sources: surveys, sensors, databases, web scraping, APIs.

🔹 Module 3: From Understanding to Preparation

  5. Data Understanding
    • Perform Exploratory Data Analysis (EDA).
    • Use visualizations and statistical summaries to identify trends, anomalies, and patterns.
  6. Data Preparation
    • Clean, integrate, and transform data.
    • Handle missing values, outliers, and duplicates.
    • Perform feature engineering for better model performance.

🔹 Module 4: From Modelling to Evaluation

  7. AI Modelling
    • Choose appropriate algorithms (classification, regression, clustering, etc.).
    • Split data into training and testing sets.
    • Train models iteratively.
  8. Evaluation
    • Use metrics to check performance:
      • Classification: Accuracy, Precision, Recall, F1-score.
      • Regression: MSE, RMSE, R².
    • Ensure statistical validation and avoid overfitting/underfitting.
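The regression metrics listed above can be sketched in a few lines. This is a minimal illustration with scikit-learn on tiny hypothetical arrays (the numbers are made up for demonstration):

```python
# Minimal sketch: computing MSE, RMSE, and R² for a regression model.
# y_true/y_pred are small illustrative arrays, not real project data.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual values
y_pred = np.array([2.5, 5.5, 7.0, 8.0])   # model's predictions

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors
rmse = np.sqrt(mse)                       # same units as the target
r2 = r2_score(y_true, y_pred)             # 1.0 = perfect fit

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

RMSE is often preferred for reporting because it is in the same units as the quantity being predicted.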

🔹 Module 5: From Deployment to Feedback

  9. Deployment
    • Integrate the model into real-world applications.
    • Deploy in a limited rollout or a full production environment.
  10. Feedback
    • Collect responses from users/systems.
    • Continuously improve via the feedback loop and model retraining.

3. Model Validation Techniques

  • Train-Test Split – divide data into training and testing sets.
  • K-Fold Cross Validation – split into multiple folds for robust evaluation.
  • Leave-One-Out CV (LOOCV) – extreme case of cross-validation.
  • Time-series CV – for sequential/temporal data.
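The first two techniques above can be sketched with scikit-learn. This is a minimal example on a synthetic toy dataset (the data and model choice are illustrative, not prescribed by the methodology):

```python
# Minimal sketch: train-test split and 5-fold cross validation
# on a synthetic binary-classification dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X = np.random.RandomState(0).rand(100, 3)   # 100 samples, 3 features
y = (X[:, 0] + X[:, 1] > 1).astype(int)     # toy binary target

# 1) Simple train-test split (70% train / 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 2) 5-fold cross validation: five train/validate rounds, then average
model = LogisticRegression()
scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("fold accuracies:", scores.round(2), "mean:", scores.mean().round(2))
```

The averaged fold score is a more robust estimate of generalization than a single split; LOOCV is the extreme case where each fold contains exactly one sample.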

4. Key Challenges

  • Overfitting: Model fits training data too closely, poor generalization.
  • Underfitting: Model too simple, fails to capture patterns.
  • Bias vs Variance Trade-off: Must balance complexity and accuracy.

5. Capstone Project Application

  • Capstone projects simulate real-world problem-solving using this methodology.
  • Each step (Problem → Approach → Data → Modelling → Deployment → Feedback) must be documented.
  • Encourages hands-on practice, case studies, and iteration until effective solutions are achieved.

Summary:
The Data Science Methodology framework is the backbone of an AI/DS Capstone Project. It ensures problems are well-defined, data is systematically handled, models are validated, and solutions are deployed with feedback for continuous improvement.


Explanation with examples

1. Business Understanding (Problem Scoping)

Goal: Define the problem clearly.

  • Example:
    A retail chain wants to reduce customer churn (customers who stop buying).
    • 5W1H:
      • Who? Customers
      • What? Churn (stop buying)
      • When? Last 6 months
      • Where? Online store
      • Why? Loss of revenue
      • How? Identify patterns & predict at-risk customers

2. Analytic Approach

Goal: Decide which type of analytics to use.

  • Example: For churn prediction:
    • Descriptive Analytics: Past churn rate = 20%.
    • Diagnostic Analytics: Customers left due to poor service.
    • Predictive Analytics: Which current customers are likely to churn?
    • Prescriptive Analytics: Offer discounts to high-risk customers.

3. Data Requirements

Goal: Identify what data is needed.

  • Example: For churn:
    • Content: Purchase history, complaints, demographics.
    • Format: Structured (tables), semi-structured (chat logs).
    • Source: Company database, customer feedback forms.

4. Data Collection

Goal: Gather the required data.

  • Example:
    • Collect transaction logs (structured).
    • Use web scraping for customer reviews (unstructured).
    • Conduct surveys for satisfaction ratings.

5. Data Understanding (EDA)

Goal: Explore patterns in data.

  • Example:
    • Visualize churn rate by age group – older customers churn less.
    • Analyze complaint categories – “late delivery” is most common.
    • Correlation: Customers with low satisfaction scores often churn.
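The EDA questions above map directly to a few pandas operations. This is a minimal sketch on a tiny hypothetical churn table (column names and values are invented for illustration):

```python
# Minimal EDA sketch: churn rate by age group and a churn/satisfaction
# correlation, on a tiny made-up DataFrame.
import pandas as pd

df = pd.DataFrame({
    "age_group":    ["18-30", "18-30", "31-50", "31-50", "51+", "51+"],
    "satisfaction": [2, 3, 4, 2, 5, 4],
    "churned":      [1, 1, 0, 1, 0, 0],   # 1 = customer churned
})

# Churn rate by age group: do older customers churn less?
rates = df.groupby("age_group")["churned"].mean()
print(rates)

# Correlation: do low satisfaction scores go with churn?
print(df["satisfaction"].corr(df["churned"]))
```

In this toy data the correlation comes out negative, matching the pattern described above: lower satisfaction goes with higher churn.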

6. Data Preparation

Goal: Clean and transform data for modeling.

  • Example:
    • Handle missing values (e.g., fill missing ages with median).
    • Remove duplicate transactions.
    • Convert “Yes/No” churn column into binary (1/0).
    • Engineer new features: “Average Purchase Value per Month.”
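The four preparation steps above can be sketched with pandas. The DataFrame and column names here are hypothetical, chosen to mirror the bullets:

```python
# Minimal data-preparation sketch: dedupe, impute, encode, engineer.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],                  # row for id 2 is duplicated
    "age":         [25, np.nan, np.nan, 40],
    "churn":       ["Yes", "No", "No", "Yes"],
    "total_spend": [600.0, 300.0, 300.0, 1200.0],
    "months":      [6, 3, 3, 12],
})

df = df.drop_duplicates()                          # remove duplicate transactions
df["age"] = df["age"].fillna(df["age"].median())   # fill missing ages with median
df["churn"] = (df["churn"] == "Yes").astype(int)   # Yes/No -> 1/0

# Feature engineering: average purchase value per month
df["avg_spend_per_month"] = df["total_spend"] / df["months"]
print(df)
```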

7. AI Modelling

Goal: Build predictive models.

  • Example:
    • Apply Logistic Regression or Random Forest to predict churn.
    • Train the model on 70% of data, test on 30%.
    • Model predicts which customers are likely to churn next month.
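The modelling step above can be sketched end-to-end. This example trains a Random Forest on synthetic data standing in for customer features (real features would come from the prepared dataset):

```python
# Minimal modelling sketch: 70/30 split, Random Forest, churn probabilities.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.rand(200, 4)                   # stand-ins for spend, visits, complaints, tenure
y = (X[:, 2] > 0.6).astype(int)        # toy rule: many complaints -> churn

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predicted churn probability for each test customer
probs = model.predict_proba(X_test)[:, 1]
print("test accuracy:", model.score(X_test, y_test))
```

The churn probabilities (rather than hard 0/1 labels) are what the deployment step would rank to produce a "high churn risk" list.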

8. Evaluation

Goal: Check accuracy and reliability.

  • Example:
    • Accuracy: 85% of predictions are correct.
    • Precision: Of customers predicted to churn, 80% actually did.
    • Recall: Model captured 70% of all churners.
    • F1-Score: Harmonic mean of precision and recall ≈ 0.75.
    • Use confusion matrix to analyze true/false predictions.
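These metrics can be computed with scikit-learn. The label arrays below are small hypothetical examples (so the numbers differ from the churn figures above):

```python
# Minimal evaluation sketch: classification metrics and confusion matrix.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]   # actual churn labels
y_pred = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]   # model's predictions

print("accuracy :", accuracy_score(y_true, y_pred))   # correct / total
print("precision:", precision_score(y_true, y_pred))  # predicted churners who churned
print("recall   :", recall_score(y_true, y_pred))     # actual churners that were caught
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of the two
print(confusion_matrix(y_true, y_pred))               # rows: [[TN FP], [FN TP]]
```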

9. Deployment

Goal: Put the model into real-world use.

  • Example:
    • Integrate churn prediction into CRM software.
    • Sales team gets a list of “high churn risk” customers daily.
    • System triggers personalized email offers automatically.

10. Feedback

Goal: Refine the model using real-world outcomes.

  • Example:
    • After deployment, the churn rate drops from 20% → 12%.
    • Feedback shows younger customers ignore email offers.
    • Model retrained with SMS engagement feature to improve results.

✅ Summary with Capstone Example

  • Capstone Project Idea: Predicting Student Dropout in an Online Course
    • Business Understanding: University wants to reduce dropouts.
    • Analytic Approach: Predictive (who will drop out?).
    • Data Requirements: Attendance, quiz scores, engagement time.
    • Data Collection: LMS logs, surveys.
    • EDA: Students with low quiz scores tend to drop out.
    • Preparation: Handle missing attendance logs, create engagement index.
    • Modelling: Train a Decision Tree classifier.
    • Evaluation: Model achieves 82% accuracy.
    • Deployment: Faculty dashboard shows “at-risk students.”
    • Feedback: More tutoring sessions → dropout rate reduces.
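The modelling and deployment steps of this capstone idea can be sketched together. The features and the dropout rule below are synthetic placeholders for real LMS data:

```python
# Minimal capstone sketch: Decision Tree flagging at-risk students
# from synthetic attendance/quiz/engagement features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X = rng.rand(300, 3)                    # attendance, quiz score, engagement time
y = (X[:, 1] < 0.3).astype(int)         # toy rule: low quiz score -> dropout

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))

# Dashboard-style output: indices of test students flagged as at risk
at_risk = np.where(tree.predict(X_test) == 1)[0]
print("at-risk students in test set:", at_risk[:5])
```

A shallow tree (max_depth=3) is deliberately used here: it keeps the model interpretable, which matters when faculty need to understand why a student was flagged.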

Videos

Digital Store

Visit the Saitechinfo Digital Store for complete access to the AI topic cards:

https://saitechinfo.net/product/artificial-intelligence-topic-cards/
