Unit 2: Data Science Methodology – An Analytic Approach to the Capstone Project
Contents
- Topic Tree
- Terms and Definitions
- Activities
- Study Notes
- Videos
- Question Bank
Topic Tree
1. Introduction
- Importance of methodology in AI/DS projects
- Framework for systematic execution
2. The Methodology Framework (10 Steps grouped into 5 modules)
Module 1: From Problem to Approach
- 2.1.1 Business Understanding (problem scoping, 5W1H, Design Thinking)
- 2.1.2 Analytic Approach (choosing type of analytics: descriptive, diagnostic, predictive, prescriptive)
Module 2: From Requirements to Collection
- 2.1.3 Data Requirements (content, format, sources)
- 2.1.4 Data Collection (methods, online/offline sources)
Module 3: From Understanding to Preparation
- 2.1.5 Data Understanding (exploratory analysis, visualization)
- 2.1.6 Data Preparation (cleaning, integration, transformation, feature engineering)
Module 4: From Modelling to Evaluation
- 2.1.7 AI Modelling (descriptive vs. predictive models, training/testing, algorithm choice)
- 2.1.8 Evaluation (metrics like accuracy, precision, recall, F1 score; diagnostic & statistical validation)
Module 5: From Deployment to Feedback
- 2.1.9 Deployment (integration into real-world use, limited rollout, productionization)
- 2.1.10 Feedback (user response, iterative refinement, automation of retraining)
3. Model Validation
- Importance of validation (prevent overfitting/underfitting)
- Techniques: Train-test split, K-Fold cross validation, LOOCV, Time-series CV
4. Model Performance Metrics
- Metrics for classification (Accuracy, Precision, Recall, F1-score)
- Metrics for regression (MSE, RMSE, R²)
5. Capstone Project Application
- Integrating all steps into a real-world project
- Iterative refinement across modules
- Use of case studies, discussions, and hands-on activities
Terms and Definitions
- Data Science Methodology:
A prescribed sequence of iterative steps that data scientists follow to approach a problem, analyze data, and find solutions systematically.
- Business Understanding (Problem Scoping/Defining):
The process of identifying the real-world problem to solve, using tools like the 5W1H Problem Canvas and the Design Thinking (DT) framework.
- Analytic Approach:
Choosing the right type of analytics for the problem:
  - Descriptive Analytics – What happened?
  - Diagnostic Analytics – Why did it happen?
  - Predictive Analytics – What is likely to happen?
  - Prescriptive Analytics – What should we do about it?
- Data Requirements:
Defining the necessary content, format, and sources of data for analysis.
- Data Collection:
Gathering raw data from structured, semi-structured, or unstructured sources, either online or offline.
- Data Understanding (Exploratory Data Analysis – EDA):
Using visualization and summary statistics to discover patterns, spot anomalies, and test hypotheses.
- Data Preparation:
Cleaning, integrating, transforming, and engineering features to make the dataset suitable for modelling.
- AI Modelling:
Building descriptive models (to summarize data) or predictive models (to forecast outcomes). This involves choosing algorithms, training/testing, and iterative refinement.
- Evaluation:
Assessing model performance using metrics such as:
  - Classification: Accuracy, Precision, Recall, F1-Score
  - Regression: MSE, RMSE, R²
- Deployment:
Implementing the validated model in real-world applications (e.g., production systems, limited rollout).
- Feedback:
Collecting user/system responses post-deployment and iteratively refining the model for improvements.
- Model Validation:
Techniques to test generalization, e.g., Train-test split, K-Fold Cross Validation, Leave-One-Out (LOOCV), Time-series CV.
- Overfitting:
A condition in which a model learns noise along with signal, performing well on training data but poorly on unseen data.
- Underfitting:
A model too simple to capture the underlying patterns, leading to poor accuracy on both training and testing data.
Activities
Study Notes
1. Introduction
- Data Science Methodology provides a structured approach for solving real-world problems using data.
- It ensures systematic planning, analysis, and execution in projects, especially Capstone projects.
- The methodology follows an iterative, modular framework.
2. Methodology Framework (10 Steps in 5 Modules)
🔹 Module 1: From Problem to Approach
- Business Understanding
- Identify and define the problem.
- Use tools: 5W1H Problem Canvas (Who, What, When, Where, Why, How).
- Apply Design Thinking for human-centered solutions.
- Analytic Approach
- Decide type of analytics:
- Descriptive – What happened?
- Diagnostic – Why did it happen?
- Predictive – What will happen?
- Prescriptive – What should we do?
🔹 Module 2: From Requirements to Collection
- Data Requirements
- Define: content, format, sources.
- Ensure relevant, sufficient, and quality data.
- Data Collection
- Gather data from primary or secondary sources.
- Sources: surveys, sensors, databases, web scraping, APIs.
🔹 Module 3: From Understanding to Preparation
- Data Understanding
- Perform Exploratory Data Analysis (EDA).
- Use visualizations and statistical summaries to identify trends, anomalies, and patterns.
- Data Preparation
- Clean, integrate, and transform data.
- Handle missing values, outliers, duplicates.
- Perform feature engineering for better model performance.
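The cleaning steps above can be sketched in plain Python. The records, field names, and values below are hypothetical, used only to illustrate deduplication, median imputation, and label encoding:

```python
from statistics import median

# Hypothetical customer records; None marks a missing age.
records = [
    {"id": 1, "age": 34, "churn": "Yes"},
    {"id": 2, "age": None, "churn": "No"},
    {"id": 3, "age": 51, "churn": "No"},
    {"id": 2, "age": None, "churn": "No"},   # duplicate of id 2
]

# 1. Remove duplicate records (same id).
seen, deduped = set(), []
for r in records:
    if r["id"] not in seen:
        seen.add(r["id"])
        deduped.append(r)

# 2. Fill missing ages with the median of the known ages.
known_ages = [r["age"] for r in deduped if r["age"] is not None]
age_median = median(known_ages)
for r in deduped:
    if r["age"] is None:
        r["age"] = age_median

# 3. Encode the Yes/No churn label as 1/0 for modelling.
for r in deduped:
    r["churn"] = 1 if r["churn"] == "Yes" else 0
```

In a real project a library such as pandas would typically handle these steps, but the logic is the same.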
🔹 Module 4: From Modelling to Evaluation
- AI Modelling
- Choose appropriate algorithms (classification, regression, clustering, etc.).
- Split into training/testing datasets.
- Train models iteratively.
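The train/test split mentioned above can be sketched without any ML library; this minimal version shuffles reproducibly and holds out a share of the data (the 70/30 ratio matches the worked example later in this unit):

```python
import random

def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle rows reproducibly, then hold out the last test_ratio share."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

data = list(range(100))            # stand-in for 100 labelled examples
train, test = train_test_split(data)
```

Every example lands in exactly one of the two sets, so the test set stays unseen during training.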
- Evaluation
- Use metrics to check performance:
- Classification: Accuracy, Precision, Recall, F1-score.
- Regression: MSE, RMSE, R².
- Ensure statistical validation and avoid overfitting/underfitting.
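All of these metrics follow directly from their formulas. A minimal sketch, using illustrative confusion-matrix counts and toy regression values (not from any real model):

```python
import math

# Classification metrics from confusion-matrix counts.
tp, fp, fn, tn = 70, 18, 30, 82
accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

# Regression metrics from predicted vs. true values.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]
n = len(y_true)
mse  = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
rmse = math.sqrt(mse)
mean_y = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot    # coefficient of determination
```

Libraries such as scikit-learn provide these as ready-made functions; computing them by hand once makes the formulas concrete.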
🔹 Module 5: From Deployment to Feedback
- Deployment
- Integrate the model into real-world applications.
- Deploy in limited rollout or full production environment.
- Feedback
- Collect responses from users/systems.
- Continuously improve via feedback loop and model retraining.
3. Model Validation Techniques
- Train-Test Split – divide data into training and testing sets.
- K-Fold Cross Validation – split into multiple folds for robust evaluation.
- Leave-One-Out CV (LOOCV) – extreme case of cross-validation.
- Time-series CV – for sequential/temporal data.
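The fold construction behind K-Fold cross validation can be sketched in a few lines; each example appears in exactly one test fold:

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n))
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
```

Setting k equal to the number of samples gives Leave-One-Out CV; in practice the data would be shuffled first unless it is ordered in time, in which case time-series CV (which never trains on the future) is used instead.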
4. Key Challenges
- Overfitting: Model fits training data too closely, poor generalization.
- Underfitting: Model too simple, fails to capture patterns.
- Bias vs Variance Trade-off: Must balance complexity and accuracy.
5. Capstone Project Application
- Capstone projects simulate real-world problem-solving using this methodology.
- Each step (Problem → Approach → Data → Modelling → Deployment → Feedback) must be documented.
- Encourages hands-on practice, case studies, and iteration until effective solutions are achieved.
✅ Summary:
The Data Science Methodology framework is the backbone of an AI/DS Capstone Project. It ensures problems are well-defined, data is systematically handled, models are validated, and solutions are deployed with feedback for continuous improvement.
Explanation with examples
1. Business Understanding (Problem Scoping)
Goal: Define the problem clearly.
- Example:
A retail chain wants to reduce customer churn (customers who stop buying).
- 5W1H:
  - Who? Customers
  - What? Churn (stop buying)
  - When? Last 6 months
  - Where? Online store
  - Why? Loss of revenue
  - How? Identify patterns & predict at-risk customers
2. Analytic Approach
Goal: Decide which type of analytics to use.
- Example: For churn prediction:
- Descriptive Analytics: Past churn rate = 20%.
- Diagnostic Analytics: Customers left due to poor service.
- Predictive Analytics: Which current customers are likely to churn?
- Prescriptive Analytics: Offer discounts to high-risk customers.
3. Data Requirements
Goal: Identify what data is needed.
- Example: For churn:
- Content: Purchase history, complaints, demographics.
- Format: Structured (tables), semi-structured (chat logs).
- Source: Company database, customer feedback forms.
4. Data Collection
Goal: Gather the required data.
- Example:
- Collect transaction logs (structured).
- Use web scraping for customer reviews (unstructured).
- Conduct surveys for satisfaction ratings.
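Collecting the structured part of this data often means parsing an exported log file. A minimal sketch with the standard-library `csv` module; the column names and values are hypothetical:

```python
import csv
import io

# Hypothetical transaction log as it might arrive from a database export.
raw = """customer_id,amount,date
101,25.50,2024-01-05
102,40.00,2024-01-06
101,12.75,2024-01-09
"""

# csv.DictReader turns each line of the structured log into a dict.
rows = list(csv.DictReader(io.StringIO(raw)))

# Aggregate total spend per customer as a first collection check.
amount_by_customer = {}
for row in rows:
    cid = row["customer_id"]
    amount_by_customer[cid] = amount_by_customer.get(cid, 0.0) + float(row["amount"])
```

In a real project the `io.StringIO` stand-in would be replaced by an open file or an API response.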
5. Data Understanding (EDA)
Goal: Explore patterns in data.
- Example:
- Visualize churn rate by age group – older customers churn less.
- Analyze complaint categories – “late delivery” is most common.
- Correlation: Customers with low satisfaction scores often churn.
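A group-wise churn rate like the one described above is a one-pass aggregation. The observations below are made up to reproduce the "older customers churn less" pattern:

```python
from collections import defaultdict

# Hypothetical (age_group, churned) observations.
observations = [
    ("18-30", True), ("18-30", True), ("18-30", False), ("18-30", False),
    ("31-50", True), ("31-50", False), ("31-50", False), ("31-50", False),
    ("51+", False), ("51+", False), ("51+", False), ("51+", False),
]

counts = defaultdict(lambda: [0, 0])   # group -> [churned, total]
for group, churned in observations:
    counts[group][0] += int(churned)
    counts[group][1] += 1

churn_rate = {g: churned / total for g, (churned, total) in counts.items()}
```

The same aggregation is a one-liner with `pandas.groupby`, and plotting the resulting rates is the visualization step of EDA.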
6. Data Preparation
Goal: Clean and transform data for modeling.
- Example:
- Handle missing values (e.g., fill missing ages with median).
- Remove duplicate transactions.
- Convert “Yes/No” churn column into binary (1/0).
- Engineer new features: “Average Purchase Value per Month.”
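The engineered feature from the last bullet is a simple derived column. The customer values here are hypothetical:

```python
# Hypothetical per-customer totals and months of activity.
customers = [
    {"id": 1, "total_spend": 1200.0, "months_active": 12, "churn": "Yes"},
    {"id": 2, "total_spend": 300.0, "months_active": 3, "churn": "No"},
]

for c in customers:
    # Engineered feature: average purchase value per month.
    c["avg_purchase_per_month"] = c["total_spend"] / max(c["months_active"], 1)
    # Convert the Yes/No churn column into binary 1/0.
    c["churn"] = 1 if c["churn"] == "Yes" else 0
```

The `max(..., 1)` guard is a small defensive choice so a brand-new customer with zero recorded months does not cause a division by zero.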
7. AI Modelling
Goal: Build predictive models.
- Example:
- Apply Logistic Regression or Random Forest to predict churn.
- Train the model on 70% of data, test on 30%.
- Model predicts which customers are likely to churn next month.
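To make the modelling step concrete, here is a minimal logistic regression trained by gradient descent on one made-up feature (satisfaction score). A real project would use a library implementation such as scikit-learn's `LogisticRegression`; this hand-rolled version only illustrates the idea:

```python
import math

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit a one-feature logistic model p(churn) = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x     # gradient of log-loss w.r.t. w
            b -= lr * (p - y)         # gradient of log-loss w.r.t. b
    return w, b

# Toy data: satisfaction score (0-1) vs. churned (1) / stayed (0).
scores  = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
churned = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(scores, churned)

def predict(x):
    """Predict churn (1) when the modelled probability reaches 0.5."""
    return 1 if 1 / (1 + math.exp(-(w * x + b))) >= 0.5 else 0
```

On this separable toy set the model learns that low satisfaction predicts churn; with real data the split into training and test sets from step 7 would be applied before fitting.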
8. Evaluation
Goal: Check accuracy and reliability.
- Example:
- Accuracy: 85% of predictions are correct.
- Precision: Of customers predicted to churn, 80% actually did.
- Recall: Model captured 70% of all churners.
- F1-Score: Harmonic mean of precision and recall = 2 × (0.80 × 0.70) / (0.80 + 0.70) ≈ 0.75.
- Use confusion matrix to analyze true/false predictions.
9. Deployment
Goal: Put the model into real-world use.
- Example:
- Integrate churn prediction into CRM software.
- Sales team gets a list of “high churn risk” customers daily.
- System triggers personalized email offers automatically.
10. Feedback
Goal: Refine the model using real-world outcomes.
- Example:
- After deployment, the churn rate drops from 20% → 12%.
- Feedback shows younger customers ignore email offers.
- Model retrained with SMS engagement feature to improve results.
✅ Summary with Capstone Example
- Capstone Project Idea: Predicting Student Dropout in an Online Course
- Business Understanding: University wants to reduce dropouts.
- Analytic Approach: Predictive (who will drop out?).
- Data Requirements: Attendance, quiz scores, engagement time.
- Data Collection: LMS logs, surveys.
- EDA: Students with low quiz scores tend to drop out.
- Preparation: Handle missing attendance logs, create engagement index.
- Modelling: Train a Decision Tree classifier.
- Evaluation: Model achieves 82% accuracy.
- Deployment: Faculty dashboard shows “at-risk students.”
- Feedback: More tutoring sessions → dropout rate reduces.
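The Decision Tree step of this capstone sketch can be illustrated with its simplest special case, a depth-1 "stump" that searches for the best quiz-score threshold. The (quiz_score, dropped_out) pairs are invented and deliberately easy to separate:

```python
# Hypothetical (quiz_score, dropped_out) training pairs.
data = [(35, 1), (40, 1), (45, 1), (55, 0), (70, 0), (80, 0), (90, 0), (30, 1)]

def best_stump(pairs):
    """Pick the score threshold that best separates dropouts (a depth-1 tree)."""
    best_t, best_correct = None, -1
    for t in sorted({score for score, _ in pairs}):
        # Rule: predict dropout (1) when score < t, otherwise 0.
        correct = sum((1 if s < t else 0) == y for s, y in pairs)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t, best_correct / len(pairs)

threshold, accuracy = best_stump(data)
```

A full decision tree repeats this threshold search recursively on each branch; on real, noisier data the accuracy would look more like the 82% quoted above than the perfect score this toy set allows.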
Videos