Is Data Mining Ethical?
The idea of data mining sends a chill down my spine. Collecting and using data that depends on people’s production and sharing of personal and sensitive information has a certain creep factor. When data mining is used in ways that are inconsiderate of the people behind the data, the creep factor increases dramatically.
The media, researchers, and non-governmental organizations continue to access and reuse sensitive data without consent from Indigenous governing bodies. This happened recently during the COVID-19 pandemic, when tribal data in the United States was released by government entities without the permission or knowledge of the tribes themselves. There is an effort to address gaps in data and the data invisibility of Indigenous peoples in America. However, this can result in unintentional harm while ignoring Indigenous sovereign rights, which need to be protected. (RDA COVID-19 Indigenous Data WG, 2020)
In this article, we will review case studies on data mining in African communities and on contact tracing for COVID-19 in South Korea and Brazil to demonstrate how ethical AI strategies work in different scenarios and cultures, and to impart a global perspective. These projects appear beneficial on the surface. However, they embody a colonial nature that is deeply embedded in our world structures. We will discuss these cases within the framework of top-down, bottom-up, and hybrid models of ethics in artificial intelligence (AI), which you can read more about here. Before we turn to the case studies, we will review what data mining means in this context.
Defining Data Mining
What is the difference between data sharing and data mining?
Data sharing implies that there is an owner of the data and openness or agreement to share information. Data mining gives the impression of taking without asking, with no acknowledgment or compensation, while the miners of the data are the sole beneficiaries. However, can data sharing and data mining be one and the same?
Data mining is closely tied to data colonialism, an enactment of neo-colonialism in the digital world that uses data as a means of power and manipulation. Manipulation runs rampant in this age of misinformation. We have seen it heavily at play in recent times, as well as throughout history, playing on emotions to steer public opinion.
Case Study 1: Data Mining in the African Context
Data sharing is a prime example of conflicting principles of AI ethics. On one hand, it is the epitome of transparency and a crucial element of scientific and economic growth. On the other hand, it raises serious concerns about privacy, intellectual property rights, organizational and structural challenges, cultural and social contexts, unjust historical pasts, and potential harms to marginalized communities. (Abebe et al., 2021)
The term data colonialism can be used to describe some of the challenges of data sharing, or data mining, which reflect historical and present-day colonial practices such as those in African and Indigenous contexts. (Couldry and Mejias, 2019) When we use terms such as ‘mining’ to discuss how data is collected from people, the question remains: who benefits from the data collection?
The use of data can paradoxically be harmful to the communities it is collected from. Establishing trust is challenging because of the historical actions of data collectors who mined data from Indigenous populations. What barriers prevent data from benefiting African and Indigenous people? We must address the entrenched legacies of power disparities and the challenges they present for modern data sharing. (Abebe et al., 2021)
One problematic example is non-governmental organizations (NGOs) that try to ‘fix’ problems for marginalized ethnic groups and end up causing more harm than good. For instance, a Europe-based NGO attempted to address the problem of access to clean potable water in Burundi, while testing new water accessibility technology and online monitoring of resources. (Abebe et al., 2021)
The NGO failed to understand the community’s perspective on the true central issues and the potential harms. Sharing the data publicly, including geographic locations, put the community at risk: collective privacy was violated and trust was lost. In the West we often think of privacy as a personal concern. However, collective identity is of great importance to a multitude of African and Indigenous communities. (Abebe et al., 2021)
Another case study in Zambia observed that up to 90% of health research funding comes from external funders, which leaves Zambian scholars with little bargaining power. In the study, power imbalances were reported in everything from funding to agenda-setting, data collection, analysis, interpretation, and reporting of results. (Vachnadze, 2021) This example further shows that trust cannot be built on a foundation of such power imbalances.
Many of these research projects begin with good intentions, yet they lack forethought about the ethical use of data during and after the project, which can cause unforeseen and irreparable harm to the wellbeing of communities. This creates an environment hostile to building relationships of respect and trust. (Abebe et al., 2021)
To conclude our reflection on this case study, we can pose the ethical question: is data sharing beneficial? First and foremost, local communities must be the primary beneficiaries of responsible data-sharing practices. It is important to specify who benefits from data sharing and to make sure that it does no harm to the people behind the data.
Case Study 2: Data Sharing for Contact Tracing during COVID-19
Contact tracing for the COVID-19 pandemic is another example of a complex ethical case of data collection.
Contact tracing can be centralized or non-centralized, which directly relates to top-down and bottom-up methods of data collection. Depending on the country and government, some have taken a more centralized top-down approach, and some have utilized a hybrid approach of government recommendations and bottom-up implementation via self-reporting.
The centralized approach was deployed in South Korea, where, by law and for the purposes of infectious disease control, the national authority is permitted to collect and use information on all COVID-19 patients and their contacts. In 2020, Germany and Israel tried and failed to adopt centralized approaches because their privacy laws lacked exceptions for public health emergencies. Getting past the legal barriers can be a lengthy and complex process, which makes it hard to set up a centralized contact tracing system during an outbreak. (Sagar, 2021)
Justin Fendos, a professor of cell biology from South Korea, wrote that in supporting the public health response to COVID-19, Korea had the political willingness to use technological tools to their full potential. The Korean government had collected massive amounts of transaction data to investigate tax fraud even before the COVID-19 outbreak. Korea’s government databases hold records of every credit card and bank transaction, and this information was repurposed during the outbreak to retroactively track individuals. In Korea, 95% of adults own a smartphone and many use cashless tools everywhere they go, including on buses and subways. (Fendos, 2020) Hence, contact tracing in Korea was extremely effective.
Public opinion about surveillance in Korea has been reported to be overwhelmingly positive. Fatalities in Korea due to COVID-19 were a third of the global average as of April 2020, when Korea was also described as one of the few countries to have successfully flattened the curve. Despite this success, there have been concerns about the level of personal detail released by health authorities, which have motivated updated surveillance guidelines for sensitive information. (Fendos, 2020)
Non-centralized approaches to contact tracing are essentially smartphone apps that track proximal coincidence with less invasive data collection methods. These approaches have thus been adopted by many countries, and they don’t face the same cultural and political obstacles as centralized approaches, avoiding legal pitfalls and legislative reform. (Sagar, 2021) For this and other reasons, contact tracing doesn’t always work the way it did in Korea.
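To make the idea of tracking proximal coincidence without a central database more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the design of any particular app mentioned in this article: the `Phone` class, the `meet` helper, and the token scheme are all hypothetical, loosely inspired by decentralized exposure-notification designs in which phones exchange random tokens and do the matching on the device.

```python
import secrets


class Phone:
    """Hypothetical device in a decentralized contact-tracing scheme."""

    def __init__(self, owner):
        self.owner = owner          # kept on the device, never broadcast
        self.sent_tokens = []       # random tokens this phone has broadcast
        self.heard_tokens = set()   # tokens received from nearby phones

    def broadcast(self):
        # A fresh random token carries no name, location, or phone number.
        token = secrets.token_hex(16)
        self.sent_tokens.append(token)
        return token

    def hear(self, token):
        self.heard_tokens.add(token)

    def check_exposure(self, published_tokens):
        # Matching happens locally; the contact list never leaves the phone.
        return len(self.heard_tokens & set(published_tokens)) > 0


def meet(a, b):
    """Simulate two phones coming within range and exchanging tokens."""
    b.hear(a.broadcast())
    a.hear(b.broadcast())


alice, bob, carol = Phone("alice"), Phone("bob"), Phone("carol")
meet(alice, bob)   # Alice and Bob were near each other
meet(bob, carol)   # Carol met Bob, but never Alice

# Assumption: after a positive test, Alice opts in to publish only the
# tokens her own phone broadcast, not the tokens she heard.
published = list(alice.sent_tokens)

bob_exposed = bob.check_exposure(published)
carol_exposed = carol.check_exposure(published)
print(bob_exposed, carol_exposed)
```

In this sketch, the only data that ever leaves a phone is a list of random tokens, which is why such designs tend to raise fewer legal and political objections than a central registry of patients and contacts. The trade-off is that the scheme depends on people voluntarily installing the app and reporting a positive test.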
One study focused on the three cities in Brazil that had the most deaths from COVID-19 through the first half of 2021. Its methodology for applying data mining as a public health management tool included identifying climate and air quality variables in relation to the number of COVID-19 cases and deaths. The researchers provided forecasting models of new COVID-19 cases and daily deaths in the three cities studied. However, they noted that the counting of cases in Brazil was affected by high underreporting due to low testing, as well as technical and political problems, including the spread of misinformation. As a result, the study stated that cases may have been up to 12 times greater than investigations indicated. (Barcellos et al., 2021)
We can see from these examples that contact tracing has worked very differently in countries that have contrasting systems of government, and the same approach wouldn’t work for all countries. A lack of trust comes into play as well, and contact tracing didn’t work in many places simply because people didn’t trust the technology or the government behind it, often reflecting judgments based on misinformation. In Brazil, the spread of misinformation was coming from the government, which doesn’t inspire trust.
In America, a July 2020 study found that 41% of respondents said they would likely not speak on the phone or text with a public health official, and 27% were unlikely to share the names of recent contacts. (McClain, 2020) Both are vital steps, so this reluctance creates a bottleneck in contact tracing adoption. While there are concerns about contact tracing and privacy, there is a contradiction, even hypocrisy, in how much data is freely shared on social media apps on a daily basis. Yet participation in a tracking system for a global pandemic, one built on fundamental principles to protect personal privacy, can be seen as a threat.
Conclusion
Data ethics issues across the planet are complex, and this article offers only a couple of examples of areas of use and tension. We must keep in mind that data represents real people, and that collecting or mining data from Indigenous communities can be to their detriment, often without the knowledge of the data scientists and companies who reap the benefits. This is not a new story, just a new setting, and we must be cognizant of the instances of colonialism that still permeate our relations across cultures and across the world.
You can stay up to date with Accel.AI workshops, research, and social impact initiatives through our website, mailing list, meetup group, Twitter, and Facebook.
www.accel.ai
References
Abebe, R., Aruleba, K., Birhane, A., Kingsley, S., Obaido, G., Remy, S. L., & Sadagopan, S. (2021). Narratives and Counternarratives on Data Sharing in Africa. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 329–341. https://doi.org/10.1145/3442188.3445897
Anane‐Sarpong, E., Wangmo, T., Ward, C. L., Sankoh, O., Tanner, M., & Elger, B. S. (2018). “You cannot collect data using your own resources and put It on open access”: Perspectives from Africa about public health data‐sharing. Developing World Bioethics, 18(4), 394–405. https://doi.org/10.1111/dewb.12159
Barcellos, D. da S., Fernandes, G. M. K., & de Souza, F. T. (2021). Data based model for predicting COVID-19 morbidity and mortality in metropolis. Scientific Reports, 11(1), 24491. https://doi.org/10.1038/s41598-021-04029-6
Bezuidenhout, L., & Chakauya, E. (2018). Hidden concerns of sharing research data by low/middle-income country scientists. Global Bioethics, 29(1), 39–54. https://doi.org/10.1080/11287462.2018.1441780
Chilisa, B. (2012). Indigenous Research Methodologies. SAGE.
Couldry, N., & Mejias, U. A. (2019). Data Colonialism: Rethinking Big Data’s Relation to the Contemporary Subject. Television & New Media, 20(4), 336–349. https://doi.org/10.1177/1527476418796632
Fendos, J. (2020). How surveillance technology powered South Korea’s COVID-19 response. Brookings.
Hooker, S. (2018). Why “data for good” lacks precision. Medium.
Maxmen, A. (2019). Can tracking people through phone-call data improve lives? Nature, 569(7758), 614–617. https://doi.org/10.1038/d41586-019-01679-5
McClain, C. (2020, November 13). Key findings about Americans’ views on COVID-19 contact tracing. Pew Research Center.
RDA COVID-19 Indigenous Data WG. “Data sharing respecting Indigenous data sovereignty.” In RDA COVID-19 Working Group (2020). Recommendations and guidelines on data sharing. Research Data Alliance. https://doi.org/10.15497/rda00052
Sagar, R. (2021). What is Hybrid AI? Analytics India Magazine.
Walsh, A., Brugha, R., & Byrne, E. (2016). “The way the country has been carved up by researchers”: ethics and power in north–south public health research. International Journal for Equity in Health, 15(1), 204. https://doi.org/10.1186/s12939-016-0488-4
Walter, M., Kukutai, T., Carroll, S. R., & Rodriguez-Lonebear, D. (2020). Indigenous Data Sovereignty and Policy (M. Walter, T. Kukutai, S. R. Carroll, & D. Rodriguez-Lonebear, Eds.). Routledge. https://doi.org/10.4324/9780429273957