About This Project
Research Question
Can computational methods detect and quantify representational and linguistic bias in Indian educational content across subjects, grades, and content sources?
Methodology
- Data Collection — NCERT textbooks (Grades 6-10), CBSE exam papers, Pratham Books stories, AI-generated content
- Preprocessing — spaCy NLP pipeline: sentence tokenization, POS tagging, NER
- Feature Engineering — Pronoun counts, name detection, profession-gender PMI, geographic entity extraction
- Baseline Models — Rule-based classifiers + TF-IDF Logistic Regression
- Advanced Model — Fine-tuned IndicBERT for multi-label bias classification
- Explainability — LIME sentence-level explanations, SHAP feature importance
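As a rough illustration of the feature-engineering step, the sketch below computes a profession-gender PMI from sentence-level co-occurrence using only the standard library. It assumes the standard definition PMI(p, g) = log2(P(p, g) / (P(p) P(g))) with probabilities estimated as fractions of sentences. The word lists, the regex tokenizer (standing in for the spaCy pipeline), and the function names are hypothetical and are not taken from the project's code.

```python
import math
import re
from collections import Counter

# Hypothetical word lists, assumed for illustration only.
MALE_WORDS = {"he", "him", "his"}
FEMALE_WORDS = {"she", "her", "hers"}
PROFESSIONS = {"doctor", "nurse", "engineer", "teacher"}


def tokenize(sentence):
    """Lowercase the sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def profession_gender_pmi(sentences):
    """Return {(profession, gender): PMI} based on sentence co-occurrence."""
    n = len(sentences)
    prof_counts = Counter()
    gender_counts = Counter()
    joint_counts = Counter()

    for sentence in sentences:
        tokens = set(tokenize(sentence))
        genders = set()
        if tokens & MALE_WORDS:
            genders.add("male")
        if tokens & FEMALE_WORDS:
            genders.add("female")
        profs = tokens & PROFESSIONS

        # Count each profession and gender at most once per sentence.
        for g in genders:
            gender_counts[g] += 1
        for p in profs:
            prof_counts[p] += 1
            for g in genders:
                joint_counts[(p, g)] += 1

    # PMI = log2( P(p, g) / (P(p) * P(g)) ), which simplifies to
    # log2( joint * n / (count_p * count_g) ) with sentence-level estimates.
    return {
        (p, g): math.log2(joint * n / (prof_counts[p] * gender_counts[g]))
        for (p, g), joint in joint_counts.items()
    }


sample = [
    "The doctor said he would return.",
    "The nurse said she was tired.",
    "The doctor finished his rounds.",
    "She is a teacher.",
]
scores = profession_gender_pmi(sample)
print(scores[("doctor", "male")])
```

A positive score means the profession and gender co-occur more often than chance would predict. Pairs that never co-occur are omitted from the result rather than assigned negative infinity.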
Bias Taxonomy
| Dimension | Metric |
|---|---|
| Gender representation | Gender Ratio (GR) |
| Profession-gender link | Pointwise Mutual Information (PMI) |
| Urban/rural balance | Urban Ratio |
| State coverage | Regional Coverage Score (RCS) |
| AI amplification | Bias Amplification Rate (BAR) |
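The table names the metrics but does not give their formulas. The sketch below shows one plausible way the count-based metrics could be defined. Every formula here is an assumption made for illustration: GR as male mentions divided by female mentions, Urban Ratio as the urban share of urban plus rural mentions, RCS as the fraction of a reference list of states that appear at least once, and BAR as the relative change in a bias score from source content to AI-generated content. The function names are hypothetical.

```python
def gender_ratio(male_mentions, female_mentions):
    """Assumed GR: male mentions per female mention (None if undefined)."""
    if female_mentions == 0:
        return None
    return male_mentions / female_mentions


def urban_ratio(urban_mentions, rural_mentions):
    """Assumed Urban Ratio: share of urban mentions among urban + rural."""
    total = urban_mentions + rural_mentions
    if total == 0:
        return None
    return urban_mentions / total


def regional_coverage_score(mentioned_states, reference_states):
    """Assumed RCS: fraction of reference states mentioned at least once."""
    reference = set(reference_states)
    if not reference:
        return None
    return len(set(mentioned_states) & reference) / len(reference)


def bias_amplification_rate(source_score, generated_score):
    """Assumed BAR: relative change in a bias score, source to AI output."""
    if source_score == 0:
        return None
    return (generated_score - source_score) / source_score


# Toy inputs, invented purely to exercise the functions.
print(gender_ratio(30, 10))
print(urban_ratio(3, 1))
print(regional_coverage_score({"Kerala", "Punjab"},
                              ["Kerala", "Punjab", "Assam", "Goa"]))
print(bias_amplification_rate(2.0, 3.0))
```

Under these assumed definitions, a GR of 1.0 and an Urban Ratio of 0.5 would indicate balance, an RCS of 1.0 would indicate full coverage of the reference list, and a positive BAR would indicate that AI-generated content scores higher on the chosen bias measure than its source.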
This is an independent research project prepared for an undergraduate college application.