Title: | Efficient Spatial Thinning of Species Occurrences |
---|---|
Description: | Provides efficient geospatial thinning algorithms to reduce the density of coordinate data while maintaining spatial relationships. Implements K-D Tree and brute-force distance-based thinning, as well as grid-based and precision-based thinning methods. For more information on the methods, see Elseberg et al. (2012) <https://hdl.handle.net/10446/86202>. |
Authors: | Jorge Mestre-Tomás [aut, cre] |
Maintainer: | Jorge Mestre-Tomás <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.0.0 |
Built: | 2025-03-27 20:08:10 UTC |
Source: | https://github.com/jmestret/geothinner |
This function assigns a set of geographic coordinates (longitude and latitude) to grid cells based on a specified cell size.
assign_coords_to_grid(coords, cell_size)
coords |
A data frame or matrix with two columns: longitude and latitude. |
cell_size |
Numeric value representing the size of each grid cell, typically in degrees. |
A character vector of grid cell identifiers, where each identifier is formatted as "x_y", representing the grid cell coordinates.
    coords <- data.frame(lon = c(-122.4194, 0), lat = c(37.7749, 0))
    cell_size <- 1
    assign_coords_to_grid(coords, cell_size)
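The "x_y" identifiers described above can be illustrated with a small base R sketch. The helper name `grid_ids_sketch` is hypothetical, and the floor-based indexing is an assumption about how cells might be numbered, not the package's confirmed implementation.

```r
# Hypothetical helper (assumption: cells are indexed by floor(coordinate / cell_size))
grid_ids_sketch <- function(coords, cell_size) {
  coords <- as.matrix(coords)
  x <- floor(coords[, 1] / cell_size)  # cell index along longitude
  y <- floor(coords[, 2] / cell_size)  # cell index along latitude
  paste(x, y, sep = "_")
}

coords <- data.frame(lon = c(-122.4194, 0), lat = c(37.7749, 0))
ids <- grid_ids_sketch(coords, cell_size = 1)
print(ids)  # "-123_37" "0_0"
```

Points that share an identifier fall in the same cell, which is what grid-based thinning relies on.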
This dataset contains a subset of global occurrences of the Loggerhead Sea Turtle (Caretta caretta), filtered for records in the Mediterranean Sea. The data were sourced from the Global Biodiversity Information Facility (GBIF).
data("caretta")
A data frame with 6785 rows and 5 columns:
Numeric. Longitude coordinates (WGS84).
Numeric. Latitude coordinates (WGS84).
Integer. The year in which the occurrence was recorded.
Character. The scientific name of the species, i.e., Caretta caretta.
Numeric. The uncertainty of the coordinates in meters.
The dataset has been filtered to include only records within the Mediterranean Sea. The occurrence data cover multiple years, which provides information on the temporal distribution of the species in this region.
Global Biodiversity Information Facility (GBIF), https://www.gbif.org/species/8894817
Computes neighbors for each point in a set of coordinates using a greedy approach. All pairwise distances are calculated to identify neighbors within a specified distance threshold.
compute_neighbors_brute( coordinates, thin_dist, distance = c("haversine", "euclidean"), R = 6371 )
coordinates |
A matrix of coordinates to thin, with two columns representing longitude and latitude. |
thin_dist |
A positive numeric value representing the thinning distance in kilometers. |
distance |
A character string specifying the distance metric to use 'c("haversine", "euclidean")'. |
R |
A numeric value representing the radius of the Earth in kilometers. The default is 6371 km. |
A list where each element corresponds to a point and contains the indices of its neighbors.
    set.seed(123)
    coords <- matrix(runif(20, min = -180, max = 180), ncol = 2)

    # Compute neighbors using brute force
    neighbors <- compute_neighbors_brute(coords, thin_dist = 10)
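To make the brute-force idea concrete, here is a minimal base R sketch that computes Haversine distances from each point to all others and keeps those below the threshold. The names `haversine_km` and `neighbors_brute_sketch` are hypothetical, and the strict `<` comparison is an assumption; the package may treat the boundary differently.

```r
# Great-circle distance in km between (lon1, lat1) and (lon2, lat2), in degrees
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(pmin(1, sqrt(a)))
}

# Hypothetical sketch: one distance pass per point, O(n^2) overall
neighbors_brute_sketch <- function(coordinates, thin_dist, R = 6371) {
  lapply(seq_len(nrow(coordinates)), function(i) {
    d <- haversine_km(coordinates[i, 1], coordinates[i, 2],
                      coordinates[, 1], coordinates[, 2], R)
    setdiff(which(d < thin_dist), i)  # drop the point itself
  })
}

# Two points about 5.6 km apart plus one far away
pts <- rbind(c(0, 0), c(0, 0.05), c(10, 10))
nb <- neighbors_brute_sketch(pts, thin_dist = 10)
print(nb)
```

The quadratic cost of this approach is the reason the kd-tree variants exist.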
Computes neighbors for each point in a set of coordinates using a kd-tree for efficient neighbor searches. This method is particularly useful for large datasets.
compute_neighbors_kdtree( coordinates, thin_dist, k = NULL, distance = c("haversine", "euclidean"), R = 6371 )
coordinates |
A matrix of coordinates to thin, with two columns representing longitude and latitude. |
thin_dist |
A positive numeric value representing the thinning distance in kilometers. |
k |
An integer specifying the maximum number of neighbors to consider for each point. |
distance |
A character string specifying the distance metric to use 'c("haversine", "euclidean")'. |
R |
A numeric value representing the radius of the Earth in kilometers. The default is 6371 km. |
This function uses a kd-tree (via the 'nabor' package) for efficient spatial searches. A kd-tree inherently works with Euclidean distances, so if '"haversine"' is selected, the function first converts the geographic coordinates to 3D Cartesian coordinates before constructing the kd-tree.
A list where each element corresponds to a point and contains the indices of its neighbors, excluding the point itself.
    set.seed(123)
    coords <- matrix(runif(20, min = -180, max = 180), ncol = 2)

    # Compute neighbors using kd-tree
    neighbors <- compute_neighbors_kdtree(coords, thin_dist = 10)
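The reason a Euclidean kd-tree can answer great-circle queries is that, on a sphere of radius R, an arc of length d corresponds to a straight-line chord of length 2 R sin(d / (2 R)). The sketch below checks that relationship numerically in base R. The helper names are hypothetical, and whether the package converts the threshold in exactly this way is an assumption.

```r
# Hypothetical helper: degrees to 3D Cartesian coordinates on a sphere of radius R
to_cartesian_sketch <- function(lon, lat, R = 6371) {
  lon <- lon * pi / 180
  lat <- lat * pi / 180
  cbind(x = R * cos(lat) * cos(lon),
        y = R * cos(lat) * sin(lon),
        z = R * sin(lat))
}

# Chord length equivalent to a great-circle distance of thin_dist km
chord_threshold <- function(thin_dist, R = 6371) {
  2 * R * sin(thin_dist / (2 * R))
}

# Two points 0.05 degrees of latitude apart (roughly 5.56 km along the surface)
xyz <- to_cartesian_sketch(c(0, 0), c(0, 0.05))
chord <- sqrt(sum((xyz[1, ] - xyz[2, ])^2))

# Comparing chords preserves the ordering of great-circle distances
is_neighbor <- chord < chord_threshold(10)
print(is_neighbor)
```

Because the chord grows monotonically with the arc for distances up to half the circumference, a radius search in 3D with the converted threshold returns the same neighbors as a Haversine comparison.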
Divides the search area into a grid of local regions and constructs kd-trees for each region to compute neighbors efficiently. Neighbor regions are also considered to ensure a complete search.
compute_neighbors_local_kdtree( coordinates, thin_dist, distance = c("haversine", "euclidean"), R = 6371, n_cores = 1 )
coordinates |
A matrix of coordinates to thin, with two columns representing longitude and latitude. |
thin_dist |
A positive numeric value representing the thinning distance in kilometers. |
distance |
A character string specifying the distance metric to use 'c("haversine", "euclidean")'. |
R |
A numeric value representing the radius of the Earth in kilometers. The default is 6371 km. |
n_cores |
An integer specifying the number of cores to use for parallel processing. The default is 1. |
A list where each element corresponds to a point and contains the indices of its neighbors, excluding the point itself.
    set.seed(123)
    coords <- matrix(runif(20, min = -180, max = 180), ncol = 2)

    # Compute neighbors using local kd-trees
    neighbors <- compute_neighbors_local_kdtree(coords, thin_dist = 10, n_cores = 1)
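The partition-and-check-adjacent-regions idea can be sketched in base R for the planar Euclidean case. This is an illustration only: the function name is hypothetical, the regions are square cells with side `thin_dist`, and a plain scan stands in for the per-region kd-trees that the function is described as building.

```r
# Hypothetical sketch: cells of side thin_dist; only the 3 x 3 block of cells
# around a point can contain neighbors closer than thin_dist.
neighbors_grid_sketch <- function(xy, thin_dist) {
  cx <- floor(xy[, 1] / thin_dist)
  cy <- floor(xy[, 2] / thin_dist)
  lapply(seq_len(nrow(xy)), function(i) {
    # Candidates: points in the same or an adjacent cell
    # (a real implementation would index cells instead of scanning all points)
    cand <- which(abs(cx - cx[i]) <= 1 & abs(cy - cy[i]) <= 1)
    d <- sqrt((xy[cand, 1] - xy[i, 1])^2 + (xy[cand, 2] - xy[i, 2])^2)
    setdiff(cand[d < thin_dist], i)
  })
}

set.seed(1)
xy <- matrix(runif(200, 0, 100), ncol = 2)
nb_grid <- neighbors_grid_sketch(xy, thin_dist = 10)
```

Two points closer than `thin_dist` differ by less than `thin_dist` in each coordinate, so their cell indices differ by at most one; that is why checking adjacent regions is enough for a complete search.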
This function applies a distance-based thinning algorithm using a kd-tree or brute-force approach. Two modified algorithms based on kd-trees (local kd-trees and estimating the maximum number of neighbors) are implemented which scale better for large datasets. The function removes points that are closer than a specified distance to each other while maximizing spatial representation.
distance_thinning( coordinates, thin_dist = 10, trials = 10, all_trials = FALSE, search_type = c("kd_tree", "local_kd_tree", "k_estimation", "brute"), target_points = NULL, distance = c("haversine", "euclidean"), R = 6371, n_cores = 1 )
coordinates |
A matrix of coordinates to thin, with two columns representing longitude and latitude. |
thin_dist |
A positive numeric value representing the thinning distance in kilometers. |
trials |
An integer specifying the number of trials to run for thinning. Default is 10. |
all_trials |
A logical indicating whether to return results of all attempts ('TRUE') or only the best attempt with the most points retained ('FALSE'). Default is 'FALSE'. |
search_type |
A character string indicating the neighbor search method 'c("local_kd_tree", "k_estimation", "kd_tree", "brute")'. The default value is 'local_kd_tree'. See details. |
target_points |
Optional integer specifying the number of points to retain. If 'NULL' (default), the function tries to maximize the number of points retained. |
distance |
Distance metric to use 'c("haversine", "euclidean")'. Default is Haversine for geographic coordinates. |
R |
Radius of the Earth in kilometers (default: 6371 km). |
n_cores |
Number of cores for parallel processing (only for '"local_kd_tree"'). Default is 1. |
- '"kd_tree"': Uses a single kd-tree for efficient nearest-neighbor searches.
- '"local_kd_tree"': Builds multiple smaller kd-trees for better scalability.
- '"k_estimation"': Approximates a maximum number of neighbors per point to reduce search complexity.
- '"brute"': Computes all pairwise distances (inefficient for large datasets).
A list. If 'all_trials' is 'FALSE', the list contains a single logical vector indicating which points are kept in the best trial. If 'all_trials' is 'TRUE', the list contains a logical vector for each trial.
    # Generate sample coordinates
    set.seed(123)
    result <- matrix(runif(20, min = -180, max = 180), ncol = 2) # 10 random points

    # Perform thinning with local kd-trees
    result_partitioned <- distance_thinning(result, thin_dist = 5000, trials = 5,
                                            search_type = "local_kd_tree", all_trials = TRUE)
    print(result_partitioned)

    # Perform thinning estimating max number of neighbors
    result_estimated <- distance_thinning(result, thin_dist = 5000, trials = 5,
                                          search_type = "k_estimation", all_trials = TRUE)
    print(result_estimated)
This function estimates the maximum value of k (the number of nearest neighbors) for kd-tree-based thinning by evaluating the densest regions of a spatial dataset. The function uses a histogram-based binning approach for efficiency and low memory usage.
estimate_k_max(coordinates, thin_dist, distance = c("haversine", "euclidean"))
coordinates |
A matrix of spatial coordinates with two columns for longitude and latitude. |
thin_dist |
A positive numeric value representing the thinning distance in kilometers. This defines the resolution of the grid used for density calculations. |
distance |
Distance metric used 'c("haversine", "euclidean")'. |
The function divides the spatial domain into grid cells based on the specified thinning distance. Grid cell sizes are determined assuming approximately 111.32 km per degree (latitude/longitude). The function identifies the densest grid cells and their immediate neighbors to compute the maximum k value.
A numeric value representing the maximum k (number of nearest neighbors) required for the densest regions in the dataset.
    # Generate sample data
    set.seed(123)
    coordinates <- matrix(runif(200, min = -180, max = 180), ncol = 2)

    # Estimate k for kd-tree thinning
    k_max <- estimate_k_max(coordinates, thin_dist = 50)
    print(k_max)
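The binning procedure described above can be approximated with a short base R sketch: convert the thinning distance to degrees using 111.32 km per degree, count points per cell, and take the largest total over any occupied cell plus its immediate neighbors. The function name is hypothetical, and details such as whether the point itself is counted are assumptions rather than the package's documented behavior.

```r
# Hypothetical sketch of a histogram-based upper bound on neighbor counts
estimate_k_sketch <- function(coordinates, thin_dist, km_per_degree = 111.32) {
  cell_deg <- thin_dist / km_per_degree          # cell size in degrees
  cx <- floor(coordinates[, 1] / cell_deg)
  cy <- floor(coordinates[, 2] / cell_deg)
  counts <- table(paste(cx, cy, sep = "_"))      # points per occupied cell
  cells <- unique(cbind(cx, cy))
  totals <- apply(cells, 1, function(cell) {
    # Keys of the 3 x 3 block centred on this cell
    nb_keys <- as.vector(outer(cell[1] + (-1:1), cell[2] + (-1:1),
                               paste, sep = "_"))
    sum(counts[intersect(nb_keys, names(counts))])
  })
  max(totals)
}

# Three points in one 1-degree cell, one in the adjacent cell, one far away
pts <- rbind(c(0.1, 0.1), c(0.2, 0.2), c(0.3, 0.3), c(1.5, 0.5), c(50, 50))
k <- estimate_k_sketch(pts, thin_dist = 111.32)
print(k)  # 4
```

Only per-cell counts are stored, which is what keeps memory usage low compared with a full pairwise distance matrix.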
This function performs thinning of spatial points by assigning them to grid cells based on a specified resolution or thinning distance. It can either create a new raster grid or use an existing 'terra::SpatRaster' object.
grid_thinning( coordinates, thin_dist = NULL, resolution = NULL, origin = NULL, raster_obj = NULL, n = 1, trials = 10, all_trials = FALSE, crs = "epsg:4326", priority = NULL )
coordinates |
A numeric matrix or data frame with two columns representing the x (longitude) and y (latitude) coordinates of the points. |
thin_dist |
A numeric value representing the thinning distance in kilometers. It will be converted to degrees if 'resolution' is not provided. |
resolution |
A numeric value representing the resolution (in degrees) of the raster grid. If provided, this takes priority over 'thin_dist'. |
origin |
A numeric vector of length 2 (e.g., 'c(0, 0)'), specifying the origin of the raster grid (optional). |
raster_obj |
An optional 'terra::SpatRaster' object to use for grid thinning. If provided, the raster object will be used instead of creating a new one. |
n |
A positive integer specifying the maximum number of points to retain per grid cell (default: 1). |
trials |
An integer specifying the number of trials to perform for thinning (default: 10). |
all_trials |
A logical value indicating whether to return results for all trials ('TRUE') or just the first trial ('FALSE', default). |
crs |
An optional CRS (Coordinate Reference System) to project the coordinates and raster (default WGS84, 'epsg:4326'). This can be an EPSG code, a PROJ.4 string, or a 'terra::crs' object. |
priority |
A numeric vector of the same length as the number of points with numerical values indicating the priority of each point. Instead of eliminating points randomly, higher values are preferred during thinning. |
A list of logical vectors indicating which points to keep for each trial.
    # Example: Grid thinning using thin_dist
    coords <- matrix(c(-122.4194, 37.7749,
                       -122.4195, 37.7740,
                       -122.4196, 37.7741), ncol = 2, byrow = TRUE)
    result <- grid_thinning(coords, thin_dist = 10, trials = 5, all_trials = TRUE)
    print(result)

    # Example: Grid thinning using a custom resolution
    result_res <- grid_thinning(coords, resolution = 0.01, n = 2, trials = 5)
    print(result_res)

    # Example: Using a custom raster object
    library(terra)
    rast_obj <- terra::rast(nrows = 100, ncols = 100,
                            xmin = -123, xmax = -121, ymin = 36, ymax = 38)
    result_raster <- grid_thinning(coords, raster_obj = rast_obj, trials = 5)
    print(result_raster)
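For readers who want to see the per-cell selection without a raster, here is a minimal base R sketch that keeps at most `n` randomly chosen points per cell. It does not use 'terra', ignores `origin`, `crs` and `priority`, and the function name and floor-based cell assignment are assumptions made for illustration.

```r
# Hypothetical sketch: random selection of up to n points per grid cell
grid_thin_sketch <- function(coordinates, resolution, n = 1) {
  coordinates <- as.matrix(coordinates)
  cell <- paste(floor(coordinates[, 1] / resolution),
                floor(coordinates[, 2] / resolution), sep = "_")
  keep <- logical(nrow(coordinates))
  for (idx in split(seq_along(cell), cell)) {
    # Shuffle the members of this cell, then keep at most n of them
    shuffled <- idx[sample.int(length(idx))]
    keep[shuffled[seq_len(min(n, length(idx)))]] <- TRUE
  }
  keep
}

coords <- matrix(c(-122.4194, 37.7749,
                   -122.4195, 37.7740,
                   -122.4196, 37.7741), ncol = 2, byrow = TRUE)
set.seed(42)
keep_one <- grid_thin_sketch(coords, resolution = 0.01)
keep_two <- grid_thin_sketch(coords, resolution = 0.01, n = 2)
print(coords[keep_one, , drop = FALSE])
```

All three sample points fall in the same 0.01-degree cell, so one point survives with `n = 1` and two with `n = 2`; which ones depends on the random seed.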
This function converts geographic coordinates, given as longitude and latitude in degrees, to Cartesian coordinates (x, y, z) assuming a spherical Earth model.
lon_lat_to_cartesian(lon, lat, R = 6371)
lon |
Numeric vector of longitudes in degrees. |
lat |
Numeric vector of latitudes in degrees. |
R |
Radius of the Earth in kilometers (default: 6371 km). |
A numeric matrix with three columns (x, y, z) representing Cartesian coordinates.
    lon <- c(-122.4194, 0)
    lat <- c(37.7749, 0)
    lon_lat_to_cartesian(lon, lat)
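For reference, the usual spherical-to-Cartesian conversion is shown below, with longitude $\lambda$ and latitude $\varphi$ expressed in radians. The axis convention (z through the poles, x through the intersection of the equator and the prime meridian) is the common one and is assumed here rather than confirmed from the package source.

$$
x = R\cos\varphi\,\cos\lambda, \qquad
y = R\cos\varphi\,\sin\lambda, \qquad
z = R\sin\varphi
$$

Every converted point lies on the sphere, so $x^2 + y^2 + z^2 = R^2$. Under this convention the point at longitude 0 and latitude 0 maps to $(R, 0, 0)$.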
This function performs the core thinning algorithm used to reduce the density of points in spatial data while maintaining spatial representation. It iteratively removes the points with the most neighbors until no points with neighbors remain. The algorithm supports multiple trials to find the optimal thinning solution.
max_thinning_algorithm(neighbor_indices, trials, all_trials = FALSE)
neighbor_indices |
A list of integer vectors where each element contains the indices of the neighboring points for each point in the dataset. |
trials |
A positive integer specifying the number of thinning trials to perform. |
all_trials |
A logical value indicating whether to return results of all attempts ('TRUE') or only the best attempt with the most points retained ('FALSE'). Default is 'FALSE'. |
A list of logical vectors indicating which points are kept in each trial if all_trials is TRUE; otherwise, a list with a single logical vector indicating the points kept in the best trial.
    # Example usage within a larger thinning function
    neighbor_indices <- list(c(2, 3), c(1, 3), c(1, 2))
    trials <- 5
    all_trials <- FALSE
    kept_points <- max_thinning_algorithm(neighbor_indices, trials, all_trials)
    print(kept_points)
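The iterative removal described above can be sketched in a few lines of base R. The function name is hypothetical, ties are broken at random (an assumption), and the neighbor lists are assumed to be symmetric; the package's own implementation may differ in these details.

```r
# Hypothetical sketch: repeatedly drop a point with the most remaining neighbors
max_thin_sketch <- function(neighbor_indices) {
  keep <- rep(TRUE, length(neighbor_indices))
  n_nb <- lengths(neighbor_indices)        # remaining neighbors per point
  while (any(n_nb[keep] > 0)) {
    # Kept points tied for the largest neighbor count
    cand <- which(keep & n_nb == max(n_nb[keep]))
    drop <- cand[sample.int(length(cand), 1)]
    keep[drop] <- FALSE
    # Every neighbor of the dropped point loses one neighbor
    nb <- neighbor_indices[[drop]]
    n_nb[nb] <- n_nb[nb] - 1
    n_nb[drop] <- 0
  }
  keep
}

# Chain 1 - 2 - 3: point 2 has the most neighbors and is removed first
chain <- list(2, c(1, 3), 2)
kept_chain <- max_thin_sketch(chain)
print(kept_chain)  # TRUE FALSE TRUE

# Several trials on a fully connected triangle; keep the trial retaining most points
triangle <- list(c(2, 3), c(1, 3), c(1, 2))
set.seed(1)
trial_results <- replicate(5, max_thin_sketch(triangle), simplify = FALSE)
best <- trial_results[[which.max(vapply(trial_results, sum, integer(1)))]]
```

Random tie-breaking is what makes repeated trials worthwhile: different removal orders can retain different numbers of points.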
This function performs thinning of spatial points by rounding their coordinates to a specified precision and removing duplicates. It can perform multiple trials of this process and return the results for all or just the best trial.
precision_thinning( coordinates, precision = 4, trials = 10, all_trials = FALSE, priority = NULL )
coordinates |
A numeric matrix or data frame with two columns representing the longitude and latitude of points. |
precision |
A positive integer specifying the number of decimal places to which coordinates should be rounded. Default is 4. |
trials |
A positive integer specifying the number of thinning trials to perform. Default is 10. |
all_trials |
A logical value indicating whether to return results for all trials ('TRUE') or just the first/best trial ('FALSE'). Default is 'FALSE'. |
priority |
A numeric vector of the same length as the number of points with numerical values indicating the priority of each point. Instead of eliminating points randomly, higher values are preferred during thinning. |
The function performs multiple trials to account for randomness in the order of point selection. By default, it returns the first trial, but setting 'all_trials = TRUE' will return the results of all trials.
If 'all_trials' is 'FALSE', returns a logical vector indicating which points were kept in the first trial. If 'all_trials' is 'TRUE', returns a list of logical vectors, one for each trial.
    # Example usage (byrow = TRUE so that each row is one lon/lat pair)
    coords <- matrix(c(-123.3656, 48.4284,
                       -123.3657, 48.4285,
                       -123.3658, 48.4286), ncol = 2, byrow = TRUE)
    result <- precision_thinning(coords, precision = 3, trials = 5, all_trials = TRUE)
    print(result)

    # Example with a single trial and lower precision
    result_single <- precision_thinning(coords, precision = 2, trials = 1, all_trials = FALSE)
    print(result_single)
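The round-then-deduplicate idea can be shown with a minimal base R sketch. The function name is hypothetical, and visiting points in a random order to decide which duplicate survives is an assumption consistent with the description of randomness between trials; the `priority` argument is not modeled.

```r
# Hypothetical sketch: round coordinates, then keep one point per rounded location
precision_thin_sketch <- function(coordinates, precision = 4) {
  rounded <- round(as.matrix(coordinates), precision)
  ord <- sample.int(nrow(rounded))              # random visiting order
  keep <- logical(nrow(rounded))
  # duplicated() on a matrix compares whole rows
  keep[ord[!duplicated(rounded[ord, , drop = FALSE])]] <- TRUE
  keep
}

coords <- matrix(c(-123.3656, 48.4284,
                   -123.3657, 48.4285,
                   -123.3658, 48.4286), ncol = 2, byrow = TRUE)
set.seed(7)
keep_coarse <- precision_thin_sketch(coords, precision = 2)  # all round to the same location
keep_fine <- precision_thin_sketch(coords, precision = 4)    # all remain distinct
print(sum(keep_coarse))  # 1
print(sum(keep_fine))    # 3
```

Lower precision merges more points into the same rounded location, so fewer points are retained.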
This function selects a specified number of points from a spatial dataset while maximizing the distance between selected points.
select_target_points( distance_matrix, target_points, thin_dist, trials, all_trials = FALSE )
distance_matrix |
A matrix of pairwise distances between points. |
target_points |
An integer specifying the number of points to retain. |
thin_dist |
A positive numeric value representing the thinning distance in kilometers. |
trials |
A positive integer specifying the number of thinning trials to perform. |
all_trials |
A logical value indicating whether to return results of all attempts ('TRUE') or only the best attempt with the most points retained ('FALSE'). Default is 'FALSE'. |
A list of logical vectors indicating which points are kept in each trial if 'all_trials' is 'TRUE'; otherwise, a list with a single logical vector indicating the points kept in the best trial.
    # Example distance matrix (3 points)
    dist_matrix <- matrix(c(0, 2, 5,
                            2, 0, 3,
                            5, 3, 0), ncol = 3)

    # Select 2 points maximizing distance
    result <- select_target_points(dist_matrix, target_points = 2,
                                   thin_dist = 4, trials = 5, all_trials = TRUE)
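One simple way to pick a fixed number of well-separated points from a distance matrix is farthest-first traversal, sketched below in base R. This is a swapped-in illustrative technique, not necessarily the algorithm used by `select_target_points()`; the function name and the `start` argument are hypothetical, and `thin_dist` and `trials` are not modeled.

```r
# Hypothetical sketch: greedily add the point farthest from those already selected
farthest_first_sketch <- function(distance_matrix, target_points, start = 1) {
  selected <- start
  while (length(selected) < target_points) {
    # Distance from every point to its nearest already-selected point
    d_min <- apply(distance_matrix[, selected, drop = FALSE], 1, min)
    d_min[selected] <- -Inf                # never re-select a point
    selected <- c(selected, which.max(d_min))
  }
  keep <- logical(nrow(distance_matrix))
  keep[selected] <- TRUE
  keep
}

dist_matrix <- matrix(c(0, 2, 5,
                        2, 0, 3,
                        5, 3, 0), ncol = 3)
keep <- farthest_first_sketch(dist_matrix, target_points = 2)
print(keep)  # TRUE FALSE TRUE
```

Starting from point 1, the farthest point is point 3 (distance 5), so points 1 and 3 are retained.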
This function performs spatial thinning of geographic points to reduce point density while maintaining spatial representation. Points are thinned based on a specified distance, grid, or precision, and multiple trials can be performed to identify the best thinned dataset.
thin_points( data, lon_col = "lon", lat_col = "lat", group_col = NULL, method = c("distance", "grid", "precision"), trials = 10, all_trials = FALSE, seed = NULL, verbose = FALSE, ... )
data |
A data frame or tibble containing the points to thin. Must contain longitude and latitude columns. |
lon_col |
Name of the column with longitude coordinates (default: "lon"). |
lat_col |
Name of the column with latitude coordinates (default: "lat"). |
group_col |
Name of the column for grouping points (e.g., species name, year). If NULL, no grouping is applied. |
method |
Thinning method to use 'c("distance", "grid", "precision")'. |
trials |
Number of thinning iterations to perform (default: 10). Must be a positive number. |
all_trials |
If TRUE, returns results of all attempts; if FALSE, returns the best attempt with the most points retained (default: FALSE). |
seed |
Optional; an integer seed for reproducibility of results. |
verbose |
If TRUE, prints progress messages (default: FALSE). |
... |
Additional parameters passed to specific thinning methods. See Details. |
The thinning methods available are:
- '"distance"': Forces a specific minimum distance between points.
- '"grid"': Applies a grid-based thinning method.
- '"precision"': Utilizes precision-based thinning.
Distance-based thinning
The specific parameters for distance-based thinning are:
- 'thin_dist': A positive numeric value representing the thinning distance in kilometers.
- 'search_type': A character string indicating the neighbor search method 'c("local_kd_tree", "k_estimation", "kd_tree", "brute")'. The default value is 'local_kd_tree'.
- 'distance': Distance metric to use 'c("haversine", "euclidean")'. Default is Haversine for geographic coordinates.
- 'R': The radius of the Earth in kilometers. Default is 6371 km.
- 'target_points': Optional integer specifying the number of points to retain. If 'NULL' (default), the function tries to maximize the number of points retained.
- 'n_cores': Number of cores for parallel processing (only for '"local_kd_tree"'). Default is 1.
Grid-based thinning
The specific parameters for grid-based thinning are:
- 'thin_dist': A positive numeric value representing the thinning distance in kilometers.
- 'resolution': A numeric value representing the resolution (in degrees) of the raster grid. If provided, this takes priority over 'thin_dist'.
- 'origin': A numeric vector of length 2 (e.g., 'c(0, 0)'), specifying the origin of the raster grid (optional).
- 'raster_obj': An optional 'terra::SpatRaster' object to use for grid thinning. If provided, the raster object will be used instead of creating a new one.
- 'n': A positive integer specifying the maximum number of points to retain per grid cell (default: 1).
- 'crs': An optional CRS (Coordinate Reference System) to project the coordinates and raster (default WGS84, 'epsg:4326'). This can be an EPSG code, a PROJ.4 string, or a 'terra::crs' object.
- 'priority': A numeric vector of the same length as the number of points with numerical values indicating the priority of each point. Instead of eliminating points randomly, higher values are preferred during thinning.
Precision-based thinning
The specific parameters for precision-based thinning are:
- 'precision': A positive integer specifying the number of decimal places to which coordinates should be rounded. Default is 4.
- 'priority': A numeric vector of the same length as the number of points with numerical values indicating the priority of each point. Instead of eliminating points randomly, higher values are preferred during thinning.
For more information on specific thinning methods and inputs, refer to their respective documentation:
'distance_thinning()'
'grid_thinning()'
'precision_thinning()'
A list with a single data.frame/matrix/tibble of thinned points from the best trial if 'all_trials = FALSE', or the combined results of all attempts if 'all_trials = TRUE'.
    # Generate sample data
    set.seed(123)
    sample_data <- data.frame(
      lon = runif(100, -180, 180),
      lat = runif(100, -90, 90)
    )

    # Perform thinning using distance method
    thinned_data <- thin_points(sample_data, lon_col = "lon", lat_col = "lat",
                                method = "distance", trials = 5, verbose = TRUE)

    # Perform thinning with grouping
    sample_data$species <- sample(c("species_A", "species_B"), 100, replace = TRUE)
    thinned_grouped_data <- thin_points(sample_data, lon_col = "lon", lat_col = "lat",
                                        group_col = "species", method = "distance", trials = 10)
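The effect of `group_col` can be pictured as running the chosen thinning rule separately within each group and combining the surviving rows. The base R sketch below illustrates that pattern under stated assumptions: `thin_by_group_sketch` and `round_rule` are hypothetical names, and the stand-in rule (keep the first record per location rounded to whole degrees) is far simpler than any of the package's methods.

```r
# Hypothetical sketch: apply a thinning rule within each group, then combine
thin_by_group_sketch <- function(data, group_col, thin_fun) {
  keep <- logical(nrow(data))
  for (rows in split(seq_len(nrow(data)), data[[group_col]])) {
    # thin_fun returns one logical per row of the group
    keep[rows] <- thin_fun(data[rows, , drop = FALSE])
  }
  data[keep, , drop = FALSE]
}

# Stand-in rule: keep the first record at each whole-degree location
round_rule <- function(d) !duplicated(round(cbind(d$lon, d$lat)))

df <- data.frame(lon = c(0.1, 0.2, 0.1, 10),
                 lat = c(0.1, 0.2, 0.1, 10),
                 species = c("A", "A", "B", "B"))
out <- thin_by_group_sketch(df, "species", round_rule)
print(out)
```

Rows 1 and 3 share a location but belong to different species, so both are kept; row 2 duplicates row 1 within species A and is dropped.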