About This Project

Research Question

Can computational methods detect and quantify representational and linguistic bias in Indian educational content across subjects, grades, and content sources?

Methodology

Data Collection: NCERT textbooks (Grades 6-10), CBSE exam papers, Pratham Books stories, and AI-generated content
Preprocessing: spaCy NLP pipeline for sentence tokenization, part-of-speech (POS) tagging, and named entity recognition (NER)
Feature Engineering: pronoun counts, name detection, profession-gender pointwise mutual information (PMI), and geographic entity extraction
Baseline Models: rule-based classifiers and a TF-IDF logistic regression
Advanced Model: IndicBERT fine-tuned for multi-label bias classification
Explainability: LIME for sentence-level explanations and SHAP for feature importance
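
The profession-gender PMI feature listed above can be illustrated with a small standard-library sketch. It is not the project's actual implementation: the word lists, the regex tokenizer (standing in for the spaCy pipeline), the choice of sentence-level co-occurrence, and the function name profession_gender_pmi are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical lexicons; the project's real word lists are not given in the source.
MALE = {"he", "him", "his"}
FEMALE = {"she", "her", "hers"}
PROFESSIONS = {"doctor", "nurse", "engineer", "teacher"}


def tokenize(sentence):
    """Lowercase and split into alphabetic tokens (a stand-in for spaCy)."""
    return re.findall(r"[a-z]+", sentence.lower())


def profession_gender_pmi(sentences):
    """Return {(profession, gender): PMI in bits} from sentence-level co-occurrence.

    PMI(p, g) = log2( P(p, g) / (P(p) * P(g)) ), where each probability is the
    fraction of sentences containing the profession, the gender cue, or both.
    Pairs that never co-occur are omitted (their PMI is undefined).
    """
    n = len(sentences)
    prof_counts, gender_counts, joint_counts = Counter(), Counter(), Counter()
    for sentence in sentences:
        tokens = set(tokenize(sentence))
        genders = set()
        if tokens & MALE:
            genders.add("male")
        if tokens & FEMALE:
            genders.add("female")
        professions = tokens & PROFESSIONS
        gender_counts.update(genders)
        prof_counts.update(professions)
        joint_counts.update((p, g) for p in professions for g in genders)

    pmi = {}
    for (p, g), count in joint_counts.items():
        p_joint = count / n
        p_prof = prof_counts[p] / n
        p_gender = gender_counts[g] / n
        pmi[(p, g)] = math.log2(p_joint / (p_prof * p_gender))
    return pmi


# Toy sentences invented for this example, not drawn from the corpus.
sentences = [
    "The doctor said he would return.",
    "The nurse said she was busy.",
    "The doctor finished his rounds.",
    "She is a teacher.",
]
scores = profession_gender_pmi(sentences)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

A positive score means the profession and the gender cue appear together more often than independence would predict. On a corpus this small the values are not meaningful; the sketch only shows the shape of the computation.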

Bias Taxonomy

Dimension	Metric
Gender representation	Gender Ratio (GR)
Profession-gender link	Pointwise Mutual Information (PMI)
Urban/rural balance	Urban Ratio
State coverage	Regional Coverage Score (RCS)
AI amplification	Bias Amplification Rate (BAR)
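
The source names these metrics but does not give their formulas, so the sketch below relies on assumed definitions: Gender Ratio as male pronoun mentions divided by female pronoun mentions, and Regional Coverage Score as the fraction of a reference list of states that a text mentions at least once. The pronoun sets, the function names, and the four-state reference list are hypothetical.

```python
import re

# Assumed pronoun lexicons; the project may also count names and titles.
MALE_PRONOUNS = {"he", "him", "his", "himself"}
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}


def gender_ratio(text):
    """Assumed definition: male pronoun count / female pronoun count.

    Returns None when there are no female pronouns, since the ratio is
    undefined in that case.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    male = sum(token in MALE_PRONOUNS for token in tokens)
    female = sum(token in FEMALE_PRONOUNS for token in tokens)
    return male / female if female else None


def regional_coverage_score(mentioned_states, reference_states):
    """Assumed definition: share of reference states mentioned at least once."""
    reference = set(reference_states)
    return len(set(mentioned_states) & reference) / len(reference)


# Toy inputs invented for this example.
gr = gender_ratio("He gave his book to her. She thanked him.")
rcs = regional_coverage_score(
    ["Kerala", "Punjab", "Kerala"],
    ["Kerala", "Punjab", "Assam", "Gujarat"],  # illustrative subset, not the full list
)
print(gr, rcs)  # 1.5 0.5
```

Under these assumptions a Gender Ratio of 1.0 would indicate parity, and a Regional Coverage Score of 1.0 would mean every reference state appears in the text.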

This is an independent research project prepared for an undergraduate college application.