Human migration: the big data perspective
Sîrbu et al. [+15] Published: 23-March-2020 | Updated: 21-May-2021 International Journal of Data Science and Analytics, 11(4), 341–360. DOI: https://doi.org/10.1007/s41060-020-00213-5
Open Access
Abstract
How can big data help to understand the migration phenomenon? In this paper, we try to answer this question through an analysis of various phases of migration, comparing traditional and novel data sources and models at each phase. We concentrate on three phases of migration, at each phase describing the state of the art and recent developments and ideas. The first phase includes the journey, and we study migration flows and stocks, providing examples where big data can have an impact. The second phase discusses the stay, i.e. migrant integration in the destination country. We explore various data sets and models that can be used to quantify and understand migrant integration, with the final aim of providing the basis for the construction of a novel multi-level integration index. The last phase is related to the effects of migration on the source countries and the return of migrants.
1 | Introduction
Human migration has been a constant of human history, from the earliest ages until now. As such, the study of migration spans various research fields, including anthropology, sociology, economics, statistics and, more recently, physics and computer science. We are at a moment where various types of data not typically used to study migration are becoming increasingly available. These include so-called social big data: digital traces that humans generate by using mobile phones, online services, online social networks (OSNs) and devices within the internet of things. At the same time, new technologies are able to extract valuable information from these large data sets. Both traditional and novel models and data are currently being employed to address different questions on migration, including monitoring migration flows and the economic and cultural effects on the migrants and on the source and destination communities. In this paper, we provide a survey of existing approaches, both traditional and data-rich, and we propose new methods and data sets that could contribute significantly to the study of human migration. We concentrate on three different phases of migration: the journey (analysing migration flows and stocks); the stay (studying migrant integration and changes in the communities involved); and the return (studying migrants returning to the origin country).
1.1 The journey
At present, information about migration flows and stocks comes from official statistics obtained either from national censuses or from population registries. Given that migration intrinsically involves several nations, data are often inconsistent across databases and offer poor time resolution. With the availability of social big data, we believe it should be possible to estimate flows and stocks in real time, by building models that map observed measures extracted from these unconventional data sources to official data, i.e. now-casting stocks and flows. We also look at migration phenomena within smaller communities, such as scientific migration, where even the prediction of migration events may be possible. An important step in understanding migration flows is suitable visualization, which we also explore.
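The idea of mapping an observed measure to official data can be illustrated with a minimal sketch. It assumes the simplest possible model, a one-variable least-squares line, which is not necessarily the kind of model used in the works surveyed here. The function names and all numbers are hypothetical and purely illustrative: `proxy_history` stands for a measure extracted from an unconventional source in past periods, and `official_history` for the official stock published for those same periods.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y ≈ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def nowcast(model, proxy_now):
    """Estimate the current official figure from the current proxy value."""
    slope, intercept = model
    return slope * proxy_now + intercept


# Synthetic, illustrative values only (not real migration data).
proxy_history = [10.0, 20.0, 30.0, 40.0]     # hypothetical big-data measure
official_history = [25.0, 45.0, 65.0, 85.0]  # hypothetical official stocks

model = fit_line(proxy_history, official_history)
estimate = nowcast(model, 50.0)  # proxy observed now, official figure not yet published
```

In practice such a model would need several predictors and careful validation against held-out official statistics; the sketch only shows the fit-on-the-past, predict-the-present structure of now-casting.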
1.2 The stay
Migration might generate cultural changes with both long- and short-term effects on the local and incoming populations. Migrant integration is generally measured through indicators related to the labour market, economic status or social ties. Again, these statistics are available only at low resolution and not for all countries. A new direction is that of observing integration and perceptions of migration through big data. For instance, OSN sentiment analysis specific to immigration topics can allow us to evaluate how immigration is perceived. Analysis of retail data can enable us to understand whether immigrants are integrated economically and also whether they change their habits during their stay. Scientific data can help us understand how migration benefits both the host countries and the migrants themselves. Through these data, we can derive novel integration indices that take into account the observed traces of human activity.
1.3 The return
Besides effects on the receiving communities, the source communities may also see effects of migration. In fact, migrants can maintain a strong attachment to their home countries and eventually return there. This can bring multiple benefits: economic growth, new skills, entrepreneurship, better healthcare, different participation in governance issues and many others. We discuss various approaches to analysing these cases based on existing data.
Both traditional and new methods to analyse migration depend heavily on the availability of data. Hence, infrastructures that can catalogue the various data sets and make them available to the community, while ensuring privacy and ethical use, are very useful. At the same time, as new methods are developed, means of facilitating their use by the research community are necessary. An example of a framework that aims to meet these requirements is the SoBigData infrastructure [78] (www.sobigdata.eu). This includes a catalogue of methods, data sets and training material, grouped in so-called exploratories. Virtual research environments allow users to use some of the data and methods directly in the SoBigData engine. The exploratory on migration studies includes many of the methods and data sets presented below.
The rest of the paper is organised as follows. The study of migration flows and stocks is discussed in Sect. 2, which compares traditional data (Sect. 2.1) with social big data (Sect. 2.2), including scientific migration (Sect. 2.2.1), and also reviews tools for the visualization of migration data (Sect. 2.3). Section 3 concentrates on migrant integration and perception of migration. We start by looking at approaches based on traditional data sources (Sect. 3.1) and move on to social big data, including retail data (Sect. 3.2.1), mobile data (Sect. 3.2.2), language and sentiment in OSNs (Sects. 3.2.3 and 3.2.4) and ego networks (Sect. 3.2.5). The return of migrants is discussed in Sect. 4, while Sect. 5 concludes the paper with a summary and a discussion of ethical issues.
5 | Discussion and Conclusions
We have discussed three lines of research where social big data can complement existing approaches to provide small-area, high-time-resolution methods for the analysis of migration. In terms of estimating flows and stocks, some research already exists that tries to use social big data to now-cast immigration. However, models still need to be refined and validated. An important issue here is that a proper gold standard does not exist: exact current immigration rates are unknown, and those in the past can be noisy, so validation of now-casting models is not straightforward. Finding the relations between policies and immigration could be a step towards finding means to validate model output. Another type of big data that has not been included here, and that can help make predictions about climate-related migration, is satellite data. To measure migrant integration, we believe that several new data types can be used to introduce novel integration indices, based on retail consumer behaviour, mobile data, OSN language, sentiment and network analysis. Research in this direction is slightly less developed, mostly due to the low availability of ready-to-use data sets. Our consortium is taking steps in this direction by using existing data sets, participating in data challenges and collecting new data. For the return of migrants, research is again limited, although potential exists in data such as retail, mobile or OSN data.
In all three dimensions, research has to carefully consider the issues with the data being used. It is important that each study includes a well-planned data collection phase where available data are analysed to identify gaps and to devise strategies to fill them by integrating other types of data, so that the problem being studied is thoroughly covered by the data used. In this process, research infrastructures such as SoBigData can be of great help. On the one hand, they can provide means to catalogue data, so that new data sets are available to the community for integration. On the other hand, they enable the community to share methods and experiences, so that the gaps identified and the solutions adopted to fill them can be reused. This applies not only to traditional data sources, but also to social big data. The complexity of digital demography implies that there is no free lunch with digital traces either [106]. One problem relates to the representativeness of the collected samples. For example, Facebook and Twitter penetration rates differ worldwide and tend to vary with the age of users [184]. Being unable to track specific categories of users can steer policies on migration in a direction that unwittingly perpetuates discrimination or neglects the needs of the invisible groups. For the above reasons, the analytical and technical challenges of extracting meaning from this kind of data, in synergy with more traditional data sources, remain an open and very important research area, with some recent efforts made in this direction [93]. Model validation using existing statistics and the relation to migration policies is important. Furthermore, careful data integration could help in overcoming some of the selection bias, resulting in novel, multi-level indices based on big data.
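To make the representativeness problem concrete, the following minimal sketch shows one standard correction, post-stratification, in which per-group estimates from a skewed sample are reweighted by each group's share in the target population. This technique is our illustrative choice and is not claimed to be the method used in the cited works. The age groups, sample counts, group means and population shares are all hypothetical values chosen for illustration.

```python
def naive_mean(group_means, sample_counts):
    """Average over the raw sample, ignoring how groups are over- or under-represented."""
    total = sum(sample_counts.values())
    return sum(group_means[g] * sample_counts[g] for g in group_means) / total


def poststratified_mean(group_means, population_shares):
    """Reweight each group's mean by its share in the target population."""
    return sum(group_means[g] * population_shares[g] for g in group_means)


# Hypothetical values: an OSN sample skewed towards younger users.
sample_counts = {"18-29": 600, "30-49": 300, "50+": 100}
group_means = {"18-29": 0.2, "30-49": 0.4, "50+": 0.6}          # some indicator per group
population_shares = {"18-29": 0.25, "30-49": 0.35, "50+": 0.40}  # e.g. from census data

biased = naive_mean(group_means, sample_counts)
corrected = poststratified_mean(group_means, population_shares)
```

With these illustrative inputs the naive estimate is pulled towards the over-represented youngest group, while the reweighted one reflects the assumed population structure. The correction only works for groups that appear in the sample at all, which is why groups invisible in the data remain a problem that reweighting cannot solve.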
A different issue relates to the ethical dimension of processing personal data, including sensitive personal data, that describe human individuals and their activities. As also stated in [187], the first rule that a researcher must follow is to acknowledge that data are people and can do harm. The context of migration is particularly sensitive to this problem, since the individuals described in the data are often especially vulnerable: refugees and their families might be persecuted in their home countries, so avoiding their re-identification is a critical matter. Moreover, mass media and social media affect our society and integration itself, since a negative tone systematically relates to lower acceptance rates of asylum practices [102], so extreme care has to be taken in publishing results. Nevertheless, migration studies can significantly improve our society and help the inclusion process of migrants; thus, encouraging data sharing is one of our main goals for achieving public good.
For all these reasons, it is essential that legal requirements and constraints are complemented by a solid understanding of ethical and legal views and values such as privacy and data protection, composing an actual ethical and legal framework. To this end, a number of infrastructural, organizational and methodological principles have been developed by the SoBigData Project, in order to establish a Responsible Research Infrastructure, allowing users to make full use of the functionalities and capabilities that big data can offer to help us solve our problems, while at the same time allowing them to respect fundamental rights and accommodate shared values, such as privacy, security, safety, fairness, equality, human dignity and autonomy [66]. In particular, we strongly rely on Value Sensitive Design and Privacy-by-Design methodologies, in order to develop privacy-enhancing technologies, privacy-aware social data mining processes and privacy risk assessment methodologies. These methods are developed mainly in the fields of mobility data (such as GPS trajectories), mobile and retail data, which are some of the (unconventional) big data used in our migration studies. Moreover, some other general tools have been implemented to assist researchers in their activities, create a new class of responsible data scientists and inform the data subjects and the society about our work and our goals, such as an online course, ethics briefs and public information documents.









