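Análisis no supervisado de observaciones atípicas en la misión espacial Gaia; optimización mediante procesamiento distribuido e integración en Apsis [Unsupervised analysis of outlier observations in the Gaia space mission; optimization through distributed processing and integration into Apsis]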

  1. Garabato Míguez, Daniel
Supervised by:
  1. Carlos Dafonte Co-director
  2. Minia Manteiga Co-director

Defence university: Universidade da Coruña

Defence date: 29 September 2020

Committee:
  1. Julián Dorado Chair
  2. Carme Jordi Nebot Secretary
  3. David Teyssier Committee member

Type: Thesis

Teseo: 633553

Abstract

This PhD Thesis has been developed within the framework of the Gaia mission of the European Space Agency (ESA) and the international Data Processing and Analysis Consortium (DPAC), which are conducting the largest and most precise stellar census ever made and will provide the scientific community with astrometric information for more than 2500 million sources. The enormous volumes of data that must be handled in this context, expected to be around a petabyte of information, are those of a Big Data environment. They pose a challenge to the scientific community, and especially to the DPAC consortium: they complicate storage and distribution and make analysis by means of common techniques and applications unfeasible. Alternative Data Mining strategies are therefore needed, in which applications are executed in a distributed fashion among the machines of a cluster so as to take advantage of as much computing power as possible, an approach nowadays known as Big Data. The research group in which this Thesis has been developed has been involved in the DPAC consortium since 2006, in collaboration with more than 400 scientists and engineers, participating in the data analysis tasks and in the development of tools for the exploitation of the mission catalog. The main contribution of this Thesis to the Gaia project has materialized in the Outlier Analysis (OA) package, which is part of the processing chain named Astrophysical Parameter Inference System (Apsis). The package is devoted to the unsupervised analysis, or clustering, by means of Artificial Intelligence (AI) techniques, of those sources whose astronomical class could not be reliably identified by the preceding classification package, the Discrete Source Classifier (DSC).
Specifically, we have addressed the following items:
  1. Optimization and adaptation of the Self-Organizing Maps (SOM) training algorithm to different widely used distributed computing platforms, such as Apache Hadoop and Apache Spark, so that it can be executed in an acceptable time when performing an unsupervised analysis of massive datasets, mainly using Gaia BP/RP spectrophotometry. In the same way, we have also adapted this technique to the SAGA framework, designated by DPAC to support Apsis.
  2. Integration of the OA module into Apsis, and therefore also into the SAGA platform, together with the other working packages. To do this, apart from the adaptation of SOM mentioned above, we have had to determine an appropriate strategy to preprocess the data, especially the BP/RP spectrophotometry, as well as some mechanisms to characterize the clusters, such as a statistical description based on information gathered by Gaia itself, different indicators of the quality of the clusters (mainly based on intra-cluster distances), or a hint on their astronomical class (obtained by means of a labeling procedure using synthetic templates).
  3. Validation of the techniques used in the OA module in order to assess its correct functioning and performance within Apsis, using small sets of real data (around ten million observations). The main goal of this process is to guarantee the quality of the unsupervised analysis performed by the OA module, which will produce results that will be officially published from Data Release 3 onwards, expected for the end of 2021. To do this, we have also defined the data structures needed for storage and dissemination to the scientific community through the platform designated by DPAC, the Gaia Archive, in which we have also collaborated during the analysis and validation of use case scenarios.
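As an illustration only, and not the OA module's actual implementation, the following Python/NumPy sketch shows one common way a SOM can be trained in a distributed fashion: the batch SOM update, whose numerator and denominator are plain sums over samples and can therefore be computed per data partition (a "map" step) and then added together (a "reduce" step). The abstract does not state which SOM variant the Thesis uses, so the batch formulation, the Gaussian neighbourhood, the radius schedule, and all function and parameter names here (`partial_sums`, `train_batch_som`, `mean_intra_cluster_distance`, `n_parts`) are assumptions made for the example. The last function gives a simple quality indicator based on intra-cluster distances, again as a hypothetical stand-in for the indicators mentioned above.

```python
import numpy as np

def partial_sums(chunk, weights, grid, sigma):
    """Map step: one partition's contribution to the batch SOM update."""
    # Squared Euclidean distance from every sample to every neuron: (n, k)
    d2 = ((chunk[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    bmu = d2.argmin(axis=1)  # best-matching unit of each sample
    # Gaussian neighbourhood on the grid between each BMU and every neuron: (n, k)
    g2 = ((grid[bmu][:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    h = np.exp(-g2 / (2.0 * sigma ** 2))
    return h.T @ chunk, h.sum(axis=0)  # numerator (k, d), denominator (k,)

def train_batch_som(data, rows, cols, epochs=20, n_parts=4, seed=0):
    """Batch SOM; partitions are processed in a loop here, but each call to
    partial_sums is independent and could run on a different worker.
    Assumes rows * cols <= len(data)."""
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    weights = data[rng.choice(len(data), size=rows * cols, replace=False)].copy()
    sigma0 = max(rows, cols) / 2.0
    parts = np.array_split(data, n_parts)
    for epoch in range(epochs):
        # Neighbourhood radius moves geometrically from sigma0 to 0.5 (assumed schedule)
        sigma = sigma0 * (0.5 / sigma0) ** (epoch / max(epochs - 1, 1))
        results = [partial_sums(p, weights, grid, sigma) for p in parts]
        # Reduce step: add the partial numerators and denominators
        num = sum(r[0] for r in results)
        den = sum(r[1] for r in results)
        alive = den > 1e-12  # leave neurons with no neighbourhood mass unchanged
        weights[alive] = num[alive] / den[alive, None]
    return weights

def mean_intra_cluster_distance(data, weights):
    """Mean distance from the samples of each neuron (cluster) to its prototype."""
    d = np.sqrt(((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2))
    bmu = d.argmin(axis=1)
    dist = d[np.arange(len(data)), bmu]
    counts = np.bincount(bmu, minlength=len(weights))
    sums = np.bincount(bmu, weights=dist, minlength=len(weights))
    # Empty clusters get NaN instead of a misleading zero
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan), counts
```

Because the update only needs the summed numerators and denominators, splitting the data into one partition or many should give the same map up to floating-point rounding. This property is what makes the batch formulation convenient on platforms such as Apache Hadoop or Apache Spark. The dense distance matrices used here are for brevity and would need to be chunked for datasets of realistic size.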
In addition, during the course of this Thesis, we have contributed to the development of Data Mining tools based on SOM, as well as to the visualization of their results, which allow for the scientific exploitation of the mission catalog. Specifically, the visualization tool developed by our research group, GUASOM, will be available from Data Release 3 onwards, with a specific version (GUASOM flavor DR-3) to analyze the products of the OA module. In the same way, we have also conducted a feasibility study on common Artificial Neural Networks, and on generative ones based on genetic techniques and proposed by our research group, in order to estimate stellar astrophysical parameters within Apsis, under the GSP-Spec working package. Finally, we have applied the AI techniques used in the Gaia mission, or other similar techniques, to other catalogs, such as the astronomical survey ALHAMBRA, in which we have performed an unsupervised analysis of its catalog, and even to other fields, such as cybersecurity, in order to authenticate users by analyzing their behavior through continuous monitoring of their activity.