This sixteenth article in our series on "Big Data & Issues & Opportunities" (see our previous article here) will delve into a particular social and ethical issue that may materialise in a big data context, namely (data-driven) discrimination. Where appropriate, illustrations from the transport sector are provided.
According to the Oxford English Dictionary, the term 'discrimination' is defined as “treating a person or particular group of people differently, especially in a worse way from the way in which you treat other people, because of their skin colour, sex, sexuality, etc.” or in more general terms: “the unjust or prejudicial treatment of different categories of people, especially on the grounds of race, age, or sex.”
The principles of non-discrimination and equality are to a great extent covered in Title III of the EU Charter of Fundamental Rights. Thus, the EU Charter recognises the following fundamental rights, freedoms and principles in relation to discrimination: (i) equality before the law; (ii) non-discrimination; (iii) cultural, religious and linguistic diversity; (iv) equality between women and men; (v) the rights of the child; (vi) the rights of the elderly; and (vii) the integration of persons with disabilities.
As mentioned above, discriminatory treatment can be based on elements such as skin colour, race, and sex, but also, for example, on income or education level, gender, residential area, and others. Using big data analytics to improve business processes or to provide personalised services may lead to discrimination against certain groups of people. At any step of the big data analytics pipeline, unintended data biases may be created through incorrect statistical treatment or poor data quality. Big data therefore poses challenges that require expert knowledge to estimate the accuracy of the conclusions drawn from it.
There is considerable interest in personalised services, individually targeted advertisements, and customised product offers. Personalising services means nothing other than excluding people from, or including them in, certain target groups on the basis of personal data such as gender, income, education, consumption preferences, etc. Big data analytics relies on the categorisation of information and on the conclusions that can be drawn from such categorisation. In that sense, the line between discrimination and personalisation is not straightforward to draw, and discrimination might therefore be an inherent part of the analytics process.
Another important aspect of data-driven discrimination concerns the access to, and knowledge of, the technology needed to use digital services or to gather valuable information from online platforms or applications. The social differences in access to technology, and in the education or skills needed to use it, are often referred to as the “Digital Divide”.
Challenges for big data and discrimination
The challenges related to data-driven social discrimination and equity discussed in the framework of this article are (i) unintended data bias; (ii) intended data bias, i.e. personalised services, offers and advertisements; and (iii) the “Digital Divide”.
Unintended data bias
Biases in datasets, or in statements or predictions based on the analysis of datasets, can originate from various errors, shortcomings or misinterpretations along the analytics pipeline. The data collection process might be biased by design because of a biased formulation of a survey, a biased selection of data sources, an insufficient length of the surveyed time period, or the neglect of relevant parameters or circumstances. Throughout the analytics process, correct statistical treatment and accuracy estimation require expert knowledge; the procedure is therefore prone to methodical and technical errors.
Some of the main causes of unintended data bias are discussed hereafter.
The sample size of a dataset directly influences the validity of statements or conclusions drawn from it. The accuracy of a statistical analysis depends on the nature of the sample and on how that accuracy is estimated. In the context of big data, the available data often consist of many subsets, demanding careful statistical treatment that normalises the estimation procedure across the individual subsets in order to avoid overfitting or wrong conclusions. This heterogeneity calls for clean and careful aggregation of data from different sources corresponding to different subsets, some of whose features are not shared by all sets.
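A small sketch with purely hypothetical numbers illustrates why heterogeneous subsets demand such care: two subsets with different sizes and base rates can each show one trend, while naive pooling shows the opposite (Simpson's paradox).

```python
# Hypothetical survey data: two subsets ("city" and "rural") comparing
# outcomes for services A and B. Within every subset B outperforms A,
# yet pooling the raw counts without normalisation reverses the trend.

subsets = {
    # group: (positives_A, total_A, positives_B, total_B)
    "city":  (80, 100, 36, 40),
    "rural": (10, 40, 30, 100),
}

pooled_a = [0, 0]  # [positives, total] for service A
pooled_b = [0, 0]  # [positives, total] for service B
for name, (pa, ta, pb, tb) in subsets.items():
    print(f"{name}: A={pa / ta:.2f}  B={pb / tb:.2f}")
    pooled_a[0] += pa; pooled_a[1] += ta
    pooled_b[0] += pb; pooled_b[1] += tb

# Pooled rates point the other way: A looks better overall.
print(f"pooled: A={pooled_a[0] / pooled_a[1]:.2f}  "
      f"B={pooled_b[0] / pooled_b[1]:.2f}")
```

The reversal is driven entirely by the unequal subset sizes, which is why per-subset normalisation has to precede any aggregate conclusion.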
Dealing with a huge amount of data generated by a large number of individuals or sensors makes analysis prone to errors resulting from poor data quality. Data derived from device measurements or automated studies must be carefully checked for the various types of errors that may arise during the collection process. Since the cleaning and checking procedures are usually automated processes themselves, even more attention is required. In some sectors, well-established quality control and assurance procedures exist; these should be standardised in order to ensure reliable conclusions and predictions.
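A minimal sketch of such an automated quality check is shown below; the field names and the valid temperature range are hypothetical. Records are flagged rather than silently dropped, so the cleaning step itself remains auditable.

```python
def quality_flags(record, valid_range=(-40.0, 85.0)):
    """Return a list of quality issues found in one sensor record."""
    flags = []
    if record.get("value") is None:
        flags.append("missing_value")
    elif not valid_range[0] <= record["value"] <= valid_range[1]:
        flags.append("out_of_range")
    if record.get("timestamp") is None:
        flags.append("missing_timestamp")
    return flags

readings = [
    {"sensor": "t1", "timestamp": 1, "value": 21.5},      # clean
    {"sensor": "t1", "timestamp": 2, "value": None},      # dropout
    {"sensor": "t2", "timestamp": None, "value": 400.0},  # two issues
]
flagged = [r for r in readings if quality_flags(r)]
print(f"{len(flagged)} of {len(readings)} records flagged")
```

Keeping the flags alongside the data, instead of discarding suspect records outright, makes it possible to examine later whether the cleaning itself disadvantaged particular sensors or groups.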
Due to its usually high dimensionality, the analysis of big data requires the simultaneous estimation of many different variables. Each estimation carries a corresponding error, and these errors accumulate when a conclusion or algorithm-based prediction is based on many variables. This effect is referred to as noise accumulation and can make it difficult to recover the original signal. Statistical techniques dealing with this issue require special expertise. Parameter selection and dimensionality reduction are also crucial to overcome noise accumulation in classification and prediction analytics.
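The growth of accumulated noise can be illustrated with a small synthetic sketch (the per-variable error of 0.1 is an arbitrary assumption): when a prediction sums d independently estimated parameters, the standard deviation of the total error grows like the square root of d, so with enough variables it can swamp a fixed signal.

```python
import math
import random

random.seed(42)

def summed_error(d, per_var_sd=0.1, trials=500):
    """Empirical sd of the total error when d noisy estimates are summed."""
    totals = [sum(random.gauss(0, per_var_sd) for _ in range(d))
              for _ in range(trials)]
    mean = sum(totals) / trials
    return math.sqrt(sum((t - mean) ** 2 for t in totals) / trials)

# The accumulated error sd scales roughly as sqrt(d) * per_var_sd.
for d in (1, 100, 2500):
    print(f"d={d:>4}: sd of accumulated error ~ {summed_error(d):.2f}")
```

This is the basic argument for parameter selection and dimensionality reduction: fewer estimated variables means less accumulated noise per prediction.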
Spurious correlation and incidental endogeneity are two other effects that may lead to wrong conclusions and predictions. Variables or instances might "spuriously" correlate if the correlation is caused by an unseen third variable or event rather than by the original variables. High dimensionality makes this effect more likely to occur. It may also be that variables are actually correlated, but without any meaning or causal link. Incidental endogeneity occurs as a result of selection biases, measurement errors and omitted variables. These phenomena arise frequently in the analysis of big data. The possibility of collecting many different parameters with available measurement techniques increases the risk of creating incidental correlations. Big data aggregated from multiple sources with potentially different data generation procedures increases the risk of selection bias and measurement errors, causing potential incidental endogeneity.
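Why high dimensionality makes spurious correlation more likely can be shown with a synthetic sketch: when enough unrelated variables are screened, some will correlate strongly with any target purely by chance.

```python
import random

random.seed(7)
n = 30  # observations per variable

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# A target and candidate variables that are all independent noise.
target = [random.gauss(0, 1) for _ in range(n)]

results = {}
for d in (10, 1000):
    noise_vars = [[random.gauss(0, 1) for _ in range(n)] for _ in range(d)]
    results[d] = max(abs(corr(v, target)) for v in noise_vars)
    print(f"screening {d:>4} unrelated variables: "
          f"max |corr| = {results[d]:.2f}")
```

Every variable here is pure noise, yet the best correlation found keeps improving as more variables are screened, which is exactly how meaningless "discoveries" arise in wide datasets.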
Learning algorithms are often highly complex. This complexity, combined with a lack of transparency or comprehensibility for a broader community, increases the probability of undetected errors. Algorithms are often black boxes within a company, with limited reproducibility. Open communication, in particular about accuracy levels, uncertainties within the algorithms, or implicit assumptions, is often insufficient.
The causes of data bias discussed above are all relevant in the transport sector, although their importance may differ across specific domains, e.g. freight versus passenger transport. In route optimisation using big data, a huge amount of sensor data from freight transport-related items might be aggregated with data from other sources (e.g. weather data), which calls for an accurate data merging and cleaning process to ensure good data quality.
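As a toy illustration of such a merging step (all timestamps and values are hypothetical), the sketch below aligns refrigerated-container sensor readings with weather observations recorded on a different clock, matching each reading to the most recent earlier weather record instead of dropping unmatched rows.

```python
import bisect

sensor = [  # (timestamp, container temperature)
    (100, 4.1), (160, 4.3), (220, 7.9),
]
weather = sorted([  # (timestamp, outside temperature), different clock
    (90, 12.0), (150, 13.5), (210, 15.0),
])

w_times = [t for t, _ in weather]
merged = []
for ts, temp in sensor:
    # Index of the latest weather record at or before this reading.
    i = bisect.bisect_right(w_times, ts) - 1
    outside = weather[i][1] if i >= 0 else None  # None = no match yet
    merged.append((ts, temp, outside))

print(merged)
```

Keeping a `None` placeholder for readings without a weather match, rather than discarding them, avoids silently biasing the merged dataset towards well-covered times and routes.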
Intended data bias
Increasing knowledge of customer or user behaviour and access to personal data create – besides new business opportunities and the possibility of growth – power, in the sense that personal data of individuals or groups, such as their gender, race, income, residential area and even patterns of their behaviour (e.g. movement profiles), can be aggregated into detailed profiles. The power derived from such profiles may be used, unintentionally or intentionally, to discriminate against people. The distinction between value-added personalisation and segmentation on the one hand, and discrimination on the other, is not well defined and therefore depends largely on the experience and perception of the affected individuals.
Some personalised services or advertisements might be discriminatory because they exclude certain groups or are only offered to those people who communicated their personal data. This also includes the selective visibility of a service due to personalised online search results: different groups are not provided with the same information or are offered the same product or service with different pricing or availability options.
Personalisation may also lead to discriminatory treatment if it is based on statistical analysis that assumes wrong segmentation criteria, i.e. criteria that do not truly represent the target groups or that address them in a prejudicial way. Given that the underlying algorithms are typically not accessible to the target groups themselves, their ability to object is limited, which may entrench existing prejudice. In other words, data-based predictions or conclusions are more likely to be perceived as objectively true because they rely on “objective” data. This might lead to even worse discrimination against a social group, since prejudicial data can serve as evidence confirming the prejudice.
By way of example, personalised job offers may limit the possibility for individuals of exploring new opportunities if algorithms based on educational backgrounds, professional experience, and other underlying factors do not make them aware of possibilities not fitting their profiles.
Lange, Coen and Berkeley confirmed in their study, which surveyed 748 participants, that users perceive personalisation based on race or household income level negatively. Information on income level, residential area, and gender was considered very private, and negative responses to its use for individualised services were recorded. The use of race as a parameter for personalisation was also seen as unfair across all researched domains, i.e. targeted advertising, filtered search results, and differential pricing.
One should bear in mind that such information and service platforms are often operated by corporations. Accordingly, the online communication environment is to a large degree dictated by commercial actors who aim to maximise profits. Discrimination might emerge from the fact that people with, for example, lower income or other traits that do not fit the business models of those corporations are of less interest to them.
Applied to the domain of passenger transport, this could mean a segregation of services based on specific characteristics of individuals, such as income or residential area and, implicitly, race or gender. The possibility of creating new mobility offers tailored to individual needs, e.g. private shuttle services combining different modes and optimising routes, might lead to a graduated system of offers dedicated to different social groups, with little permeability between them.
Discrimination based on social factors and the “Digital Divide” are interconnected: different levels of access to, and skill with, technology are influenced by individuals' social positions, which include characteristics such as age, gender, race, income, and level of education, amongst others.
The term “Digital Divide” originally referred to the diffusion of Internet access throughout the population, but has nowadays been extended to a “second-level Digital Divide”, which encompasses differing degrees of skill, time, knowledge, and usage possibilities. Social status directly influences online usage behaviour: higher education, for example, correlates with more experienced online use for information retrieval and transactional purposes. Certain user groups are thus more likely to become disconnected from the benefits of Internet usage, which might reinforce existing social inequities.
In countries with high diffusion rates of Internet access (see comparison for Europe), the ability and skill to use online services or platforms become a substantial part of social life, and individuals depend on them in various fields of their professional and private lives. In the transport sector, this is for example the case in route planning, which is increasingly managed by applications or navigation programmes providing, among others, online public transport schedules, ticket purchasing, and real-time information about the route. This, however, requires a certain level of skill, access to technology in the form of appropriate devices, and some financial contribution.
Opportunities for big data and discrimination
Big data analytics might also be utilised to decrease social inequity and to improve existing discriminatory situations or services. Discriminatory situations can be made visible through big data analysis, which is the first step towards resolving biases. In a second stage, personalised or individualised services could offer people with special needs, who do not fit the majority, the possibility to improve their inclusion in society. This could be seen as "positive discrimination".
Several ongoing projects aim to improve existing discriminatory situations in the transport sector.
Providing mobility services to rural and peripheral areas is a big challenge, one that coincides with the rapidly changing age structure in those areas, where the average age is rising sharply. The MobiDig project in the region of Northern Bavaria in Germany aims to tackle these issues by improving mobility services in rural areas in order to increase social inclusion. The project, led by five partner institutions (including the Technical University of Munich and the Fraunhofer Group for Supply Chain Services), intends to evaluate and promote new mobility concepts in order to provide efficient and sufficient transport services.
Another issue in the transport sector is gender equality. The systematic analysis of the situation based on big data makes it possible to identify discriminatory practices and the reasons for them. This is what several EU projects aim to do. They seek to make recommendations to improve the situation, such as including – as a starting point – information about the gender of workers in the transport sector in existing databases.
Illustration in the transport sector: Uber, the ride-sharing company, has made discrimination visible thanks to its online platform technologies. Several forms of discrimination have been observed in the Uber environment. The Uber rating system, used by passengers to give feedback about drivers at the end of a ride, has highlighted discrimination against drivers from racial minority groups. This is problematic because the data collected via the tool are used to evaluate drivers, and eventually to dismiss them if their ratings do not meet Uber's expected standards. Another form of discrimination concerns passengers: it has been observed that drivers are sometimes less keen to offer their services to riders willing to travel to poorer neighbourhoods. Besides highlighting such discriminatory situations, the Uber platform could also be used to deter or prevent discrimination, for example by configuring the level of passenger information available to drivers in order to decrease discrimination against passengers.
Big data analytics can be a tool to make existing discriminatory decisions visible, and this social issue may then be addressed through personalised services (as “positive discrimination”) based on big data analytics. Despite this opportunity, biases persist because of big data's characteristics (e.g. heterogeneity, data size and quality, noise). Furthermore, personalised services themselves may cause discriminatory treatment by excluding certain groups. Finally, big data creates new visibilities and makes it possible to discern between people on a whole range of behavioural and other personal aspects. This also provides fertile ground for ‘new discriminations’.
These issues are of course highly relevant to the use of big data in the transport sector, for instance in the planning of different routes on the basis of the quality of the data or technologies used. It is therefore essential to reduce the likelihood of discrimination in big data processing and analytics. In the same vein, "diversity, non-discrimination and fairness" was recently listed by the High-Level Expert Group on Artificial Intelligence (AI HLEG) in its "Ethics Guidelines for Trustworthy AI" as one of the seven key requirements for realising trustworthy AI, to be implemented and evaluated throughout the AI system's lifecycle. The Guidelines notably provide a self-assessment checklist to help ensure that the creation or reinforcement of unfair bias is avoided and that accessibility, universal design principles, and stakeholder participation are taken into account.