Beyond Bias: Advancing Algorithmic Fairness in Predictive Policing for Equitable Communities
Artificial intelligence (AI) and machine learning are increasingly being used in the security sector, particularly in systems designed to predict potential criminal activities. These "predictive policing" systems aim to identify where crimes might occur and even which individuals might be involved in future criminal acts. This application of AI holds the promise of making the justice system more efficient, but it also introduces significant challenges, especially concerning fairness and bias.
The core issue is that the data used to train these AI systems can be skewed, meaning it may not accurately represent all groups in society. When algorithms learn from biased data, they can unintentionally perpetuate or even amplify those biases in their predictions. Historically, for example, risk assessments have been influenced by sensitive personal attributes such as race, nationality, and skin color, and factors such as socio-economic status and age can introduce further unwanted bias. The implications of such biased algorithms are far-reaching: marginalized communities can face additional disadvantage, and individuals can suffer unjust consequences such as false arrest based on inaccurate AI classifications. Ensuring algorithmic fairness is therefore crucial if AI is to be a true enabler of sustainable development and if public trust in law enforcement is to be maintained.
Recognizing these challenges, researchers have been exploring ways to make these algorithms fairer. A systematic review was conducted to analyze the fairness metrics (ways to measure fairness), sensitive features (such as race, age, and gender), and bias-reduction strategies used in existing studies of policing and recidivism algorithms. The review followed established guidelines, searching major academic databases such as Scopus, IEEE Xplore, and ScienceDirect with keywords related to "fairness," "police/recidivism," and "algorithm." From an initially large pool of studies, 15 relevant papers were selected for in-depth analysis after rigorous screening and quality assessment.
The review revealed that measuring fairness in policing algorithms is complex, with a variety of methods and metrics being employed. Some common fairness metrics, illustrated in the code sketch that follows this list, include:
Statistical Parity Difference (SPD): This measures whether different groups receive favorable outcomes at the same rate. An SPD of zero indicates perfect fairness.
Disparate Impact (DI): This looks at the ratio of favorable outcomes between a disadvantaged group and a privileged group. A value of 1 signifies perfect fairness.
Average Odds Difference: This metric averages the differences in false positive rates (wrongly classifying someone as high risk) and true positive rates (correctly identifying someone as high risk) between groups. A value closer to zero indicates a fairer algorithm.
Equality of Opportunity Difference: This specifically focuses on whether all groups have an equal chance of being correctly identified for a positive outcome when they genuinely qualify for it.
False Positive Rate (FPR) and False Negative Rate (FNR): A high FPR means many non-offenders are wrongly labeled as potential reoffenders, leading to undue detention, while a high FNR means actual offenders are wrongly deemed fit for release, potentially leading to preventable crimes. Studies have shown biases can manifest differently among groups, for instance, with African American individuals sometimes exhibiting higher false positive rates.
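All of these metrics can be derived from group-wise selection rates and confusion-matrix rates. The sketch below is a minimal illustration in plain NumPy rather than code from any reviewed study; the variable names and sign conventions (1 as the positive outcome, `group` equal to 1 for the unprivileged group) are assumptions made for the example.

```python
# Minimal sketch (plain NumPy) of the group-fairness metrics described above.
# Assumptions: y_true and y_pred are binary arrays where 1 is the positive
# outcome, and `group` is 1 for the unprivileged group, 0 for the privileged one.
import numpy as np

def group_rates(y_true, y_pred, mask):
    """Selection rate, true positive rate, and false positive rate for one subgroup."""
    yt, yp = y_true[mask], y_pred[mask]
    selection = yp.mean()
    tpr = yp[yt == 1].mean() if (yt == 1).any() else np.nan
    fpr = yp[yt == 0].mean() if (yt == 0).any() else np.nan
    return selection, tpr, fpr

def fairness_report(y_true, y_pred, group):
    s_u, tpr_u, fpr_u = group_rates(y_true, y_pred, group == 1)  # unprivileged
    s_p, tpr_p, fpr_p = group_rates(y_true, y_pred, group == 0)  # privileged
    return {
        "statistical_parity_difference": s_u - s_p,                       # 0 = parity
        "disparate_impact": s_u / s_p,                                    # 1 = parity
        "equality_of_opportunity_difference": tpr_u - tpr_p,              # TPR gap
        "average_odds_difference": 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p)),
        "fpr_difference": fpr_u - fpr_p,
        "fnr_difference": (1 - tpr_u) - (1 - tpr_p),
    }

# Toy example: eight individuals, four per group.
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])
group  = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(fairness_report(y_true, y_pred, group))
```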
In terms of the data itself, most studies on algorithmic fairness in policing and recidivism rely on datasets from the United States, such as Chicago's Strategic Subject List (SSL) data or New York Police Department Crime Complaint data. This reliance on U.S.-based datasets limits the generalizability of findings, as societal norms, legal structures, and demographic compositions vary significantly across different countries and regions.
Regarding the types of biases investigated, the review found a strong emphasis on racial bias, with 12 out of 15 studies examining it. However, there are significant gaps in addressing biases related to other protected attributes: only two studies looked at socio-economic status or income, and only three studies each examined bias related to gender and age. This highlights a narrow focus in bias investigations, underscoring the need for future research to expand its scope beyond race to include gender, socio-economic status, and age to ensure fairness across multiple dimensions.
Various strategies have been proposed to analyze and mitigate biases in these algorithms. These strategies often fall into several categories:
Data-centric strategies: These approaches target the dataset itself, on the principle that "better data leads to better predictions". This can involve incorporating external, relevant data (such as census information on education and poverty rates) to reduce bias, using data augmentation to synthetically balance underrepresented groups (a minimal oversampling sketch follows this list), or assessing and improving overall data quality, since missing or incomplete information can lead to biased predictions.
Counterfactual reasoning and causal analysis: These methods aim to improve fairness by treating disadvantaged individuals as if they belonged to a more privileged group. A crucial finding in this area is that police actions are a primary driver of model discrimination in predictive policing. Research has identified a "vicious cycle" where increased police deployment leads to higher arrest rates, which then boosts reported crime rates, further justifying more police presence. This suggests that bias often originates from policing practices, not just the algorithm itself.
Demographic segmentation: This involves dividing the dataset based on characteristics like race and training separate, specialized models for each group. While this can improve prediction accuracy, it has been noted that bias can still persist within these segmented groups, implying that many features in the data might be subtly correlated with sensitive attributes.
Fairness-accuracy trade-off: A common perception is that improving fairness in AI algorithms necessarily reduces their accuracy. However, some studies, including the one in focus, challenge this notion, demonstrating that bias mitigation can actually improve overall model performance and accuracy.
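As a concrete illustration of the data-centric family above, the sketch below shows simple random oversampling so that each group is equally represented before training. The "race" and "label" column names and the toy data are placeholders, not drawn from any reviewed study's dataset.

```python
# Minimal sketch of one data-centric mitigation: random oversampling so that
# every group is equally represented before model training.
import pandas as pd
from sklearn.utils import resample

def balance_groups(df: pd.DataFrame, group_col: str, random_state: int = 0) -> pd.DataFrame:
    """Oversample (with replacement) every group up to the size of the largest group."""
    target_size = df[group_col].value_counts().max()
    parts = [
        resample(part, replace=True, n_samples=target_size, random_state=random_state)
        for _, part in df.groupby(group_col)
    ]
    # Shuffle so the oversampled rows are not clustered at the end.
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy, deliberately skewed dataset: group "B" is heavily underrepresented.
df = pd.DataFrame({
    "race":  ["A"] * 90 + ["B"] * 10,
    "label": [0, 1] * 45 + [1] * 10,
})
print(df["race"].value_counts())                          # A: 90, B: 10
print(balance_groups(df, "race")["race"].value_counts())  # A: 90, B: 90
```

Plain oversampling is used here only to keep the example dependency-free; the same idea extends to synthetic augmentation methods such as SMOTE.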
Beyond the technical aspects of fairness, the human element is also critical. Studies have shown that people generally view decisions made by algorithms in policing as less fair and appropriate compared to those made by human officers. However, public trust can increase when successful algorithmic decision-making instances are observed. This highlights the importance of incorporating human judgment and oversight into algorithmic processes to combine human intuition with machine efficiency, though this also introduces the challenge of human biases.
Given the identified gaps, particularly the lack of focus on age-related bias, the study undertook an empirical investigation to mitigate age bias in a real-world predictive policing system. The focus was on the Chicago Police Department’s Strategic Subject List (SSL) dataset, which contains information on 398,684 individuals and is used to predict the risk of someone being involved in a shooting incident, either as a victim or an offender. The SSL assigns a risk score from 0 (extremely low risk) to 500 (extremely high risk), with scores above 250 considered high risk.
A significant age bias was found in this dataset: age accounts for about 89% of the variation in SSL scores. Critically, all individuals under 30 years old were classified as "high risk," regardless of their actual history. Notably, 127,513 individuals on the list had never been arrested or been shot, yet approximately 90,000 of them were still deemed high risk. This startling finding suggests that the SSL scores encode an "age out of crime" assumption, under which individuals are presumed to "grow out" of crime in their 30s, heavily disadvantaging younger individuals. It also aligns with the broader finding that police actions are a primary contributor to model discrimination.
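One way this kind of diagnostic could be probed, assuming a local copy of the public SSL extract and band-style age labels (both assumptions; the exact file and column encodings may differ from the published CSV), is to regress the score on age alone and compare high-risk rates on either side of the 30-year split:

```python
# Sketch of an age-bias diagnostic on the SSL extract. The file name, the
# 'AGE AT LATEST ARREST' band labels, and the column names are assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("strategic_subject_list.csv")   # hypothetical local copy

age_dummies = pd.get_dummies(df["AGE AT LATEST ARREST"])   # age is recorded in bands
score = df["SSL SCORE"]

# R^2 of a regression of the score on age alone approximates the share of
# score variance that age accounts for.
r2 = LinearRegression().fit(age_dummies, score).score(age_dummies, score)
print(f"Share of SSL score variance explained by age alone: {r2:.1%}")

# High-risk rate (score > 250) on either side of the 30-year split.
under_30 = df["AGE AT LATEST ARREST"].isin(["less than 20", "20-30"])  # assumed labels
high_risk = score > 250
print(f"High-risk rate, under 30:    {high_risk[under_30].mean():.1%}")
print(f"High-risk rate, 30 and over: {high_risk[~under_30].mean():.1%}")
```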
To address this pronounced age bias, the researchers pre-processed the data by simplifying 'AGE AT LATEST ARREST' into two categories: "under 30 years old" and "30 years or older". This binary split ensured a relatively balanced distribution for analysis. They then employed two main bias mitigation strategies, sketched in code after the two descriptions below:
Conditional Score Recalibration (CSR): This is a novel technique introduced by the study. It involves carefully re-evaluating and adjusting the risk scores for individuals initially assigned a moderately high-risk score (between 250 and 350). If these individuals meet three specific criteria—no prior arrests for violent offenses, no previous arrests for narcotic offenses, and no involvement in shooting incidents as a victim—their risk score is reassessed and they are categorized as low risk. This strategy was applied across the entire dataset (young and old) to avoid introducing new biases.
Class Balancing: After CSR, the dataset became imbalanced again, with more individuals classified as low risk. To rectify this, the researchers used an undersampling technique, randomly selecting a subset of the majority (low-risk) class to match the size of the high-risk group, creating a balanced dataset for training.
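The sketch below gives one possible pandas reading of these two steps. The 250 to 350 window and the three eligibility conditions follow the description above, but the arrest and shooting-victimization column names are assumptions about the SSL extract, not the authors' exact field names or implementation.

```python
# One possible reading of Conditional Score Recalibration followed by class
# balancing via undersampling. Column names are assumed, not the authors' own.
import pandas as pd

def conditional_score_recalibration(df: pd.DataFrame) -> pd.DataFrame:
    """Relabel moderately high-risk individuals (scores 250-350) as low risk when
    they have no violent-offense arrests, no narcotics arrests, and no record of
    being a shooting victim."""
    df = df.copy()
    df["high_risk"] = (df["SSL SCORE"] > 250).astype(int)
    eligible = (
        df["SSL SCORE"].between(250, 350)
        & (df["ARRESTS VIOLENT OFFENSES"] == 0)
        & (df["NARCOTIC ARRESTS"] == 0)
        & (df["VICTIM SHOOTING INCIDENTS"] == 0)
    )
    df.loc[eligible, "high_risk"] = 0
    return df

def undersample_majority(df: pd.DataFrame, label_col: str = "high_risk",
                         random_state: int = 0) -> pd.DataFrame:
    """Randomly downsample the majority class to the size of the minority class."""
    minority_size = df[label_col].value_counts().min()
    parts = [part.sample(n=minority_size, random_state=random_state)
             for _, part in df.groupby(label_col)]
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Usage, assuming the SSL extract has already been loaded as `ssl_df`:
# ssl_df = undersample_majority(conditional_score_recalibration(ssl_df))
```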
The study initially evaluated common machine learning models like Logistic Regression, Random Forest, and Gradient Boosting without any bias mitigation. The Random Forest model was chosen for its robustness. The results confirmed the pre-existing bias: the model showed significant age bias (e.g., Demographic Parity of 0.8517 and Equality of Opportunity of 0.7616 for age), while performing fairly with respect to race.
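A minimal, runnable version of this kind of baseline comparison is sketched below. It uses synthetic stand-in data because the paper's SSL feature preparation is not reproduced here, so the printed numbers are illustrative only and will not match the reported 0.8517 and 0.7616.

```python
# Runnable sketch of the baseline comparison: train several standard classifiers
# and measure demographic parity for a binary age attribute, on synthetic data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def demographic_parity_difference(y_pred, sensitive):
    """Gap in positive-prediction rates between the two sensitive-attribute groups."""
    return abs(y_pred[sensitive == 1].mean() - y_pred[sensitive == 0].mean())

# Synthetic data: the label is deliberately correlated with the age attribute,
# mimicking a dataset in which age drives the risk label.
rng = np.random.default_rng(0)
n = 2000
under_30 = rng.integers(0, 2, n)
X = np.column_stack([under_30, rng.normal(size=(n, 4))])
y = (0.8 * under_30 + rng.normal(size=n) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te, a_tr, a_te = train_test_split(
    X, y, under_30, test_size=0.3, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name:20s} accuracy={accuracy_score(y_te, y_hat):.4f} "
          f"demographic_parity_diff(age)={demographic_parity_difference(y_hat, a_te):.4f}")
```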
However, after applying Conditional Score Recalibration and Class Balancing, the results were remarkably improved. The model's fairness with respect to age significantly increased:
The Demographic Parity for age decreased from 0.8517 to 0.3128, indicating a much more equal chance of receiving a positive outcome across age groups.
The Equality of Opportunity for age dropped from 0.7616 to 0.1521, meaning young and old individuals now have a more equal chance of being correctly identified as high risk when they truly are.
The Average Odds Difference for age dropped from 0.3349 to 0.02024, showing minimal remaining disparity in both false positive and true positive rates across age groups.
Crucially, this improvement in fairness did not come at the cost of accuracy. In fact, the model's overall accuracy increased from 0.8314 to 0.9014, and its F1 score (the harmonic mean of precision and recall) improved from 0.83 to 0.90. This finding directly challenges the common belief that efforts to increase fairness inevitably compromise performance, aligning with other recent research.
From an ethical perspective, the initial age bias in the SSL dataset, where all individuals under 30 were categorized as high risk, is problematic. John Rawls' theory of justice as fairness advocates for social institutions that benefit society's least advantaged members. The unequal treatment of younger individuals by the SSL dataset contradicts this principle. The bias mitigation strategies employed in this study, particularly CSR and Class Balancing, offer a framework for designing and implementing algorithms that do not inherently disadvantage any demographic group based on protected attributes like age.
In conclusion, this study underscores the critical importance of addressing algorithmic bias in predictive policing. Through a systematic review, it identified significant gaps in addressing biases related to age, gender, and socio-economic status, which are often overlooked compared to racial bias. Responding to this, the empirical part of the study successfully analyzed and mitigated age-related bias in the Chicago Police Department's Strategic Subject List dataset, particularly against younger individuals. The application of Conditional Score Recalibration and Class Balancing led to a significant reduction in age-related biases, and remarkably, this improvement in fairness was accompanied by an increase in the model's accuracy and F1 score. These results demonstrate that fairness and accuracy are not mutually exclusive, contributing to the ongoing discussion on the responsible and equitable use of AI in law enforcement. Future research should continue to diversify datasets globally, expand the scope of bias investigations to include more sensitive features like socio-economic status and gender, and consider more nuanced age classifications, while always integrating human validation in algorithmic processes.
Predictive Policing Researchers:
Dr. Rianna Walcott: Assistant Professor of Communication at the University of Maryland and Associate Director of the Black Communication and Technology (BCaT) Lab. Her work explores digital research, Black feminist theory, decolonial studies, and advocacy; her PhD focused on Black communication practices on social media.
Chun-Ping Yen: Co-author of "Achieving Equity with Predictive Policing Algorithms: A Social Safety Net Perspective," which advocates integrating predictive policing into social safety nets and using public audits to reduce bias.
Alexia Gallon: Author of "Racism Repeats Itself: AI Racial Bias in Predictive Policing...". Her research investigated racial biases in algorithms like PredPol, finding that they produced fewer predictions for White Americans than for Black Americans.