About This Project

Research Question

Can computational methods detect and quantify representational and linguistic bias in Indian educational content across subjects, grades, and content sources?

Methodology

  1. Data Collection — NCERT textbooks (Grades 6-10), CBSE exam papers, Pratham Books stories, AI-generated content
  2. Preprocessing — spaCy NLP pipeline: sentence tokenization, POS tagging, NER
  3. Feature Engineering — Pronoun counts, name detection, profession-gender PMI, geographic entity extraction
  4. Baseline Models — Rule-based classifiers + TF-IDF Logistic Regression
  5. Advanced Model — Fine-tuned IndicBERT for multi-label bias classification
  6. Explainability — LIME sentence-level explanations, SHAP feature importance
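
As a rough illustration of the feature engineering in step 3, the sketch below counts gendered pronouns and computes a sentence-level profession-gender PMI using only the Python standard library. The word lists, the regex tokenizer, and the function names are assumptions made for this example; the project itself uses a spaCy pipeline, and its actual lexicons and co-occurrence window are not specified here.

```python
import math
import re
from collections import Counter

# Assumed word lists for illustration only; not the project's actual lexicons.
MALE = {"he", "him", "his"}
FEMALE = {"she", "her", "hers"}


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def pronoun_counts(sentences):
    """Count male and female pronoun tokens across all sentences."""
    counts = Counter()
    for sentence in sentences:
        for tok in tokenize(sentence):
            if tok in MALE:
                counts["male"] += 1
            elif tok in FEMALE:
                counts["female"] += 1
    return counts


def profession_gender_pmi(sentences, profession, gender_words):
    """PMI = log2(P(prof, gender) / (P(prof) * P(gender))).

    Probabilities are estimated from sentence-level co-occurrence
    (an assumption). Returns None when any count is zero, since
    PMI is undefined in that case.
    """
    n = len(sentences)
    n_prof = n_gender = n_both = 0
    for sentence in sentences:
        toks = set(tokenize(sentence))
        has_prof = profession in toks
        has_gender = bool(toks & gender_words)
        n_prof += has_prof
        n_gender += has_gender
        n_both += has_prof and has_gender
    if n == 0 or n_prof == 0 or n_gender == 0 or n_both == 0:
        return None
    return math.log2((n_both / n) / ((n_prof / n) * (n_gender / n)))


# Toy sentences invented for this example, not drawn from the corpus.
sample = [
    "He is a doctor.",
    "She is a nurse.",
    "He is an engineer.",
    "She reads a book.",
]
counts = pronoun_counts(sample)
pmi_doctor_male = profession_gender_pmi(sample, "doctor", MALE)
```

A positive PMI suggests the profession co-occurs with that gender more often than chance would predict; on a corpus this small the value is only illustrative.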

Bias Taxonomy

Dimension               Metric
Gender representation   Gender Ratio (GR)
Profession-gender link  PMI
Urban/rural balance     Urban Ratio
State coverage          Regional Coverage Score (RCS)
AI amplification        Bias Amplification Rate (BAR)
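
The taxonomy names its metrics without giving formulas, so the sketch below shows one plausible reading of two of them. Both definitions are assumptions for illustration: Gender Ratio is taken as male mentions divided by female mentions, and Regional Coverage Score as the fraction of a reference list of states mentioned at least once. The function names and the short state list are hypothetical.

```python
def gender_ratio(male_count, female_count):
    """Assumed definition: male mentions divided by female mentions.

    Returns None when there are no female mentions, because the
    ratio is undefined in that case.
    """
    if female_count == 0:
        return None
    return male_count / female_count


def regional_coverage_score(mentioned_places, reference_states):
    """Assumed definition: share of reference states mentioned at least once.

    Matching is case-insensitive and ignores repeated mentions.
    """
    reference = {s.lower() for s in reference_states}
    if not reference:
        return None
    mentioned = {p.lower() for p in mentioned_places}
    return len(reference & mentioned) / len(reference)


# Hypothetical inputs; a real run would use the full list of Indian states
# and the entities extracted from the corpus.
reference = ["Kerala", "Bihar", "Assam", "Goa"]
gr = gender_ratio(6, 3)
rcs = regional_coverage_score(["Kerala", "Delhi", "kerala"], reference)
```

Under these assumed definitions, a GR of 1.0 would indicate balanced gender mentions and an RCS of 1.0 would mean every reference state appears at least once.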

Independent research project for an undergraduate college application.