FOSS4G 2022 | Cluster Analysis: a comprehensive and versatile QGIS plugin for pattern recognition…


Cluster Analysis: a comprehensive and versatile QGIS plugin for pattern recognition in geospatial data

As geospatial data continuously grows in complexity and size, applying Machine Learning and Data Mining techniques to geospatial analysis is increasingly essential for solving real-world problems. Although research in this field has produced innovative methodologies over the last two decades, they are usually applied to specific situations and not automated for general use. Therefore, both the generalization of these methods and their integration with Geographic Information Systems (GIS) are necessary to support researchers and organizations in data exploration, pattern recognition, and prediction across the various applications of geospatial data.

The lack of machine learning tools in GIS is especially clear with regard to unsupervised learning and clustering. The most used clustering plugins in QGIS [1] offer few functionalities beyond the basic application of a clustering algorithm. In this work we present Cluster Analysis, a Python plugin that we developed for the open-source software QGIS and that offers functionalities for the entire clustering process: (i) pre-processing, (ii) feature selection and clustering, and (iii) cluster evaluation. Our tool provides several improvements over the solutions currently available in QGIS, and also over those in other widespread GIS software. The expanded features provided by the plugin allow users to deal with some of the most challenging problems of geospatial data, such as high-dimensional feature spaces, poor data quality, and large data size.

In particular, the plugin is composed of three main sections:

- feature cleaning: This part provides options to reduce the dimensionality of the dataset by removing the attributes that are most likely to harm the clustering process. This is important to achieve better results and faster execution time, avoiding the problems of clustering in high-dimensional spaces. The first filter removes the features that are correlated above a user-defined threshold, since highly correlated features usually provide redundant information and can lead to overweighting of some characteristics. The other two filters identify the attributes that have a constant value for all the data points, or that are constant except for a few outliers. These types of features don’t provide any valuable information and can worsen the performance of clustering. To identify quasi-constant features, we use two parameters introduced in the function nearZeroVar() from the caret package developed for R [2]: the ratio between the frequencies of the two most frequent values, and the number of unique values relative to the number of samples.

- clustering: This section is used to perform clustering on the chosen vector layer. First, the user selects the features to use in the process, either manually or automatically. The automatic feature selection uses an entropy-based algorithm [3], offered in two versions with different computational complexities. The currently available clustering algorithms are K-Means and Agglomerative Hierarchical, and users can select the one that best suits their needs. Before performing clustering, the plugin offers the possibility to scale the dataset with standardization or normalization, and to plot two different graphs that facilitate the choice of the number of clusters.

- evaluation: This section shows all the experiments carried out in the current session, with a recap of the settings and performance of each experiment and the possibility to save them to and load them from text files. To evaluate the quality of the experiments, we calculate two indexes and also compare experiments run on the same dataset. The indexes are two internal metrics: the Silhouette coefficient and the Davies-Bouldin index. To directly compare the clusters formed by two or more experiments, we compute the score [4], which evaluates how many pairs of data points are grouped together in all of the experiments or in none of them. Every experiment completed in the current session can be stored in a text file, and the experiments saved in previous sessions can be loaded into the plugin, where they are shown in the evaluation section along with the other ones.

One of the major challenges during development has been making most of the functionalities usable on large datasets as well, in terms of both the number of samples and the number of dimensions. To achieve this, we also implemented algorithm options with low time complexity, such as entropy-based selection with sampling and K-Means. Moreover, for all the data storage and manipulation done in the system, we use the data structures and functions…
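
The feature-cleaning filters described above can be illustrated with a minimal pure-Python sketch. It is not the plugin's code. The names `is_near_zero_variance`, `pearson`, and `clean_features` are hypothetical, and so is the greedy "keep the first column of a correlated pair" strategy. The default cut-offs (a frequency ratio of 95/5 and 10% unique values) are the documented defaults of caret's nearZeroVar(); the abstract does not say which values the plugin uses. The sketch drops constant and quasi-constant columns first so that the correlation step never divides by a zero variance; this ordering is an assumption of the sketch.

```python
from collections import Counter
from math import sqrt


def is_near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a constant or quasi-constant column (assumed cut-offs: caret defaults)."""
    top_two = Counter(values).most_common(2)
    if len(top_two) == 1:
        return True  # a single distinct value: constant column
    # Ratio between the counts of the two most frequent values.
    freq_ratio = top_two[0][1] / top_two[1][1]
    # Number of unique values relative to the number of samples, in percent.
    percent_unique = 100.0 * len(set(values)) / len(values)
    return freq_ratio > freq_cut and percent_unique <= unique_cut


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def clean_features(table, corr_threshold=0.9):
    """Return the names of the columns that survive both filters.

    `table` maps a column name to its list of numeric values.
    """
    # Step 1: drop constant and quasi-constant columns.
    candidates = [name for name, col in table.items()
                  if not is_near_zero_variance(col)]
    # Step 2: greedily keep a column only if it is not correlated above
    # the threshold with any column that was already kept.
    kept = []
    for name in candidates:
        if all(abs(pearson(table[name], table[other])) <= corr_threshold
               for other in kept):
            kept.append(name)
    return kept


# Made-up illustrative data, not taken from the talk.
table = {
    "elevation_m": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "elevation_ft": [32.8, 65.6, 98.4, 131.2, 164.0, 196.8],  # redundant
    "population": [5.0, 3.0, 8.0, 1.0, 9.0, 2.0],
    "country_code": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # constant
}
kept = clean_features(table)
print(kept)  # ['elevation_m', 'population']
```

On real layers the same logic would normally be written with vectorized array operations; the loops here are only meant to make the two criteria explicit.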
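
The comparison between experiments can be sketched in the same spirit. This excerpt does not name the score from [4], so the function below is only an assumed reading of its description: the fraction of point pairs that are grouped together in all of the experiments or in none of them. With exactly two experiments this quantity coincides with the Rand index. The name `pair_agreement` is hypothetical, and the quadratic loop over all pairs is for illustration only, since it would not scale to large layers.

```python
from itertools import combinations


def pair_agreement(*labelings):
    """Fraction of point pairs treated alike by every labeling.

    Each labeling is a sequence of cluster labels, one per data point.
    A pair counts as an agreement if the two points share a cluster in
    all labelings, or share a cluster in none of them.
    """
    n = len(labelings[0])
    if n < 2 or any(len(labels) != n for labels in labelings):
        raise ValueError("labelings must have equal length of at least 2")
    agreements = 0
    total = 0
    for i, j in combinations(range(n), 2):
        together = [labels[i] == labels[j] for labels in labelings]
        if all(together) or not any(together):
            agreements += 1
        total += 1
    return agreements / total


# Label names do not matter, only which points end up grouped together.
run_a = [0, 0, 1, 1]
run_b = [1, 1, 0, 0]  # same partition as run_a with swapped label names
run_c = [0, 1, 0, 1]  # a different partition

print(pair_agreement(run_a, run_b))  # 1.0
print(pair_agreement(run_a, run_c))  # 2 of the 6 pairs agree
```

Because the score depends only on co-membership of pairs, it can compare experiments that use different algorithms or different numbers of clusters on the same dataset.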

Andrea Folini

#foss4g2022
#academictrack

FOSS4G
