Package 'GeoThinneR'

Title: Efficient Spatial Thinning of Species Occurrences
Description: Provides efficient geospatial thinning algorithms to reduce the density of coordinate data while maintaining spatial relationships. Implements K-D Tree and brute-force distance-based thinning, as well as grid-based and precision-based thinning methods. For more information on the methods, see Elseberg et al. (2012) <https://hdl.handle.net/10446/86202>.
Authors: Jorge Mestre-Tomás [aut, cre]
Maintainer: Jorge Mestre-Tomás
License: MIT + file LICENSE
Version: 2.0.0
Built: 2025-03-27 20:08:10 UTC
Source: https://github.com/jmestret/geothinner

Help Index


Assign Geographic Coordinates to Grid Cells

Description

This function assigns a set of geographic coordinates (longitude and latitude) to grid cells based on a specified cell size.

Usage

assign_coords_to_grid(coords, cell_size)

Arguments

coords

A data frame or matrix with two columns: longitude and latitude.

cell_size

Numeric value representing the size of each grid cell, typically in degrees.

Value

A character vector of grid cell identifiers, where each identifier is formatted as "x_y", representing the grid cell coordinates.

Examples

coords <- data.frame(lon = c(-122.4194, 0), lat = c(37.7749, 0))
cell_size <- 1
assign_coords_to_grid(coords, cell_size)
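
As a rough illustration of how coordinates could map to "x_y" identifiers, the sketch below divides each coordinate by the cell size and takes the floor. The helper name 'grid_ids_sketch' and the floor-based indexing are assumptions made for illustration; the identifiers produced by 'assign_coords_to_grid()' may be computed differently.

```r
# Hypothetical helper (not part of GeoThinneR): floor-based cell indexing.
grid_ids_sketch <- function(coords, cell_size) {
  coords <- as.matrix(coords)
  x_idx <- floor(coords[, 1] / cell_size)  # cell index along longitude
  y_idx <- floor(coords[, 2] / cell_size)  # cell index along latitude
  paste(x_idx, y_idx, sep = "_")
}

coords <- data.frame(lon = c(-122.4194, 0), lat = c(37.7749, 0))
ids <- grid_ids_sketch(coords, cell_size = 1)
print(ids)
```

Points that share an identifier fall in the same cell, so counting or de-duplicating identifiers is enough to work with the grid.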

Loggerhead Sea Turtle (Caretta caretta) Occurrences in the Mediterranean Sea

Description

This dataset contains a subset of global occurrences of the Loggerhead Sea Turtle (Caretta caretta), filtered for records in the Mediterranean Sea. The data were sourced from the Global Biodiversity Information Facility (GBIF).

Usage

data("caretta")

Format

A data frame with 6785 rows and 5 columns:

decimalLongitude

Numeric. Longitude coordinates (WGS84).

decimalLatitude

Numeric. Latitude coordinates (WGS84).

year

Integer. The year in which the occurrence was recorded.

species

Character. The scientific name of the species, i.e., Caretta caretta.

coordinateUncertaintyInMeters

Numeric. The uncertainty of the coordinates in meters.

Details

The dataset has been filtered to include only records within the Mediterranean Sea. The occurrence data cover multiple years, which provides information on the temporal distribution of the species in this region.

Source

Global Biodiversity Information Facility (GBIF), https://www.gbif.org/species/8894817


Compute Neighbors Using Brute-Force

Description

Computes neighbors for each point in a set of coordinates using a brute-force approach. All pairwise distances are calculated to identify neighbors within a specified distance threshold.

Usage

compute_neighbors_brute(
  coordinates,
  thin_dist,
  distance = c("haversine", "euclidean"),
  R = 6371
)

Arguments

coordinates

A matrix of coordinates to thin, with two columns representing longitude and latitude.

thin_dist

A positive numeric value representing the thinning distance in kilometers.

distance

A character string specifying the distance metric to use 'c("haversine", "euclidean")'.

R

A numeric value representing the radius of the Earth in kilometers. The default is 6371 km.

Value

A list where each element corresponds to a point and contains the indices of its neighbors.

Examples

set.seed(123)
coords <- matrix(runif(20, min = -180, max = 180), ncol = 2)

# Compute neighbors using brute force
neighbors <- compute_neighbors_brute(coords, thin_dist = 10)
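
To make the brute-force idea concrete, the following base R sketch computes Haversine distances from each point to every other point and keeps those below the threshold. The helper names, the strict '<' comparison, and the exclusion of the point itself are assumptions for illustration, not the package's implementation.

```r
# Haversine great-circle distance in kilometers (inputs in degrees).
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(pmin(1, sqrt(a)))
}

# Hypothetical helper: compare every point against all points.
brute_neighbors_sketch <- function(coords, thin_dist, R = 6371) {
  n <- nrow(coords)
  lapply(seq_len(n), function(i) {
    d <- haversine_km(coords[i, 1], coords[i, 2], coords[, 1], coords[, 2], R)
    setdiff(which(d < thin_dist), i)  # drop the point itself
  })
}

# Two points about 5.6 km apart and one far away
pts <- rbind(c(0, 0), c(0, 0.05), c(10, 10))
nb <- brute_neighbors_sketch(pts, thin_dist = 10)
print(nb)
```

The cost grows with the square of the number of points, which is why the tree-based searches are preferable for large datasets.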

Compute Neighbors Using kd-Tree

Description

Computes neighbors for each point in a set of coordinates using a kd-tree for efficient neighbor searches. This method is particularly useful for large datasets.

Usage

compute_neighbors_kdtree(
  coordinates,
  thin_dist,
  k = NULL,
  distance = c("haversine", "euclidean"),
  R = 6371
)

Arguments

coordinates

A matrix of coordinates to thin, with two columns representing longitude and latitude.

thin_dist

A positive numeric value representing the thinning distance in kilometers.

k

An integer specifying the maximum number of neighbors to consider for each point.

distance

A character string specifying the distance metric to use 'c("haversine", "euclidean")'.

R

A numeric value representing the radius of the Earth in kilometers. The default is 6371 km.

Details

This function uses a kd-tree (via the 'nabor' package) for efficient spatial searches. The kd-tree inherently works with Euclidean distances. If '"haversine"' is selected, the function first converts geographic coordinates to 3D Cartesian coordinates before constructing the kd-tree.

Value

A list where each element corresponds to a point and contains the indices of its neighbors, excluding the point itself.

Examples

set.seed(123)
coords <- matrix(runif(20, min = -180, max = 180), ncol = 2)

# Compute neighbors using kd-tree
neighbors <- compute_neighbors_kdtree(coords, thin_dist = 10)
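
Because the kd-tree searches with Euclidean distances on 3D Cartesian coordinates, a great-circle threshold has to be expressed as a straight-line (chord) length through the sphere. One standard conversion is chord = 2 * R * sin(d / (2 * R)). The sketch below shows it; whether the package applies exactly this conversion is an assumption, and 'great_circle_to_chord' is a hypothetical name.

```r
# Convert a great-circle distance (km) to the straight-line chord length (km)
# between the same two points on a sphere of radius R.
great_circle_to_chord <- function(d_km, R = 6371) {
  2 * R * sin(d_km / (2 * R))
}

thin_dist <- c(10, 1000, 5000)
chord <- great_circle_to_chord(thin_dist)
print(data.frame(thin_dist, chord, shortfall = thin_dist - chord))
```

The chord is always shorter than the arc. The gap is negligible for thinning distances of a few kilometers and grows for continental-scale distances.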

Compute Neighbors Using Local kd-Trees

Description

Divides the search area into a grid of local regions and constructs kd-trees for each region to compute neighbors efficiently. Neighbor regions are also considered to ensure a complete search.

Usage

compute_neighbors_local_kdtree(
  coordinates,
  thin_dist,
  distance = c("haversine", "euclidean"),
  R = 6371,
  n_cores = 1
)

Arguments

coordinates

A matrix of coordinates to thin, with two columns representing longitude and latitude.

thin_dist

A positive numeric value representing the thinning distance in kilometers.

distance

A character string specifying the distance metric to use 'c("haversine", "euclidean")'.

R

A numeric value representing the radius of the Earth in kilometers. The default is 6371 km.

n_cores

An integer specifying the number of cores to use for parallel processing. The default is 1.

Value

A list where each element corresponds to a point and contains the indices of its neighbors, excluding the point itself.

Examples

set.seed(123)
coords <- matrix(runif(20, min = -180, max = 180), ncol = 2)

# Compute neighbors using local kd-trees (default Haversine distance)
neighbors <- compute_neighbors_local_kdtree(coords, thin_dist = 10, n_cores = 1)
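
The idea of searching only a point's own region and the adjacent regions can be sketched in base R. This simplified version uses planar coordinates with Euclidean distance and plain buckets instead of kd-trees, so it is an illustration of the partitioning idea under those assumptions rather than the package's implementation; 'local_neighbors_sketch' is a hypothetical name.

```r
# Bucket points into square cells of side thin_dist, then search only the
# 3 x 3 block of cells around each point.
local_neighbors_sketch <- function(xy, thin_dist) {
  cx <- floor(xy[, 1] / thin_dist)
  cy <- floor(xy[, 2] / thin_dist)
  buckets <- split(seq_len(nrow(xy)), paste(cx, cy, sep = "_"))
  block <- expand.grid(dx = -1:1, dy = -1:1)

  lapply(seq_len(nrow(xy)), function(i) {
    # Keys of the current cell and its eight adjacent cells
    keys <- paste(cx[i] + block$dx, cy[i] + block$dy, sep = "_")
    cand <- unlist(buckets[intersect(keys, names(buckets))], use.names = FALSE)
    d <- sqrt((xy[cand, 1] - xy[i, 1])^2 + (xy[cand, 2] - xy[i, 2])^2)
    sort(setdiff(cand[d < thin_dist], i))
  })
}

set.seed(1)
xy <- cbind(runif(50, 0, 100), runif(50, 0, 100))
nb_local <- local_neighbors_sketch(xy, thin_dist = 10)

# Reference result from all pairwise distances
dm <- as.matrix(dist(xy))
nb_brute <- lapply(seq_len(nrow(xy)), function(i) {
  setdiff(unname(which(dm[i, ] < 10)), i)
})
```

Since each cell is as wide as the thinning distance, any neighbor of a point must lie in its own cell or one of the eight surrounding cells, which is why the adjacent regions are enough for a complete search.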

Perform Distance-Based Thinning

Description

This function applies a distance-based thinning algorithm using a kd-tree or brute-force approach. Two kd-tree variants that scale better on large datasets are also implemented: local kd-trees and estimation of the maximum number of neighbors. The function removes points that are closer than a specified distance to each other while maximizing spatial representation.

Usage

distance_thinning(
  coordinates,
  thin_dist = 10,
  trials = 10,
  all_trials = FALSE,
  search_type = c("local_kd_tree", "k_estimation", "kd_tree", "brute"),
  target_points = NULL,
  distance = c("haversine", "euclidean"),
  R = 6371,
  n_cores = 1
)

Arguments

coordinates

A matrix of coordinates to thin, with two columns representing longitude and latitude.

thin_dist

A positive numeric value representing the thinning distance in kilometers.

trials

An integer specifying the number of trials to run for thinning. Default is 10.

all_trials

A logical indicating whether to return results of all attempts ('TRUE') or only the best attempt with the most points retained ('FALSE'). Default is 'FALSE'.

search_type

A character string indicating the neighbor search method 'c("local_kd_tree", "k_estimation", "kd_tree", "brute")'. The default value is 'local_kd_tree'. See Details.

target_points

Optional integer specifying the number of points to retain. If 'NULL' (default), the function tries to maximize the number of points retained.

distance

Distance metric to use 'c("haversine", "euclidean")'. Default is Haversine for geographic coordinates.

R

Radius of the Earth in kilometers (default: 6371 km).

n_cores

Number of cores for parallel processing (only for '"local_kd_tree"'). Default is 1.

Details

  • '"kd_tree"': Uses a single kd-tree for efficient nearest-neighbor searches.

  • '"local_kd_tree"': Builds multiple smaller kd-trees for better scalability.

  • '"k_estimation"': Approximates a maximum number of neighbors per point to reduce search complexity.

  • '"brute"': Computes all pairwise distances (inefficient for large datasets).

Value

A list. If 'all_trials' is 'FALSE', the list contains a single logical vector indicating which points are kept in the best trial. If 'all_trials' is 'TRUE', the list contains a logical vector for each trial.

Examples

# Generate sample coordinates
set.seed(123)
coords <- matrix(runif(20, min = -180, max = 180), ncol = 2) # 10 random points

# Perform thinning with local kd-trees
result_partitioned <- distance_thinning(coords, thin_dist = 5000, trials = 5,
                                        search_type = "local_kd_tree", all_trials = TRUE)
print(result_partitioned)

# Perform thinning estimating max number of neighbors
result_estimated <- distance_thinning(coords, thin_dist = 5000, trials = 5,
                                      search_type = "k_estimation", all_trials = TRUE)
print(result_estimated)

Estimate Maximum Neighbors for kd-Tree Thinning

Description

This function estimates the maximum value of k (the number of nearest neighbors) for kd-tree-based thinning by evaluating the densest regions of a spatial dataset. The function uses a histogram-based binning approach for efficiency and low memory usage.

Usage

estimate_k_max(coordinates, thin_dist, distance = c("haversine", "euclidean"))

Arguments

coordinates

A matrix of spatial coordinates with two columns for longitude and latitude.

thin_dist

A positive numeric value representing the thinning distance in kilometers. This defines the resolution of the grid used for density calculations.

distance

Distance metric used 'c("haversine", "euclidean")'.

Details

The function divides the spatial domain into grid cells based on the specified thinning distance. Grid cell sizes are determined assuming approximately 111.32 km per degree (latitude/longitude). The function identifies the densest grid cells and their immediate neighbors to compute the maximum k value.

Value

A numeric value representing the maximum k (number of nearest neighbors) required for the densest regions in the dataset.

Examples

# Generate sample data
set.seed(123)
coordinates <- matrix(runif(200, min = -180, max = 180), ncol = 2)

# Estimate k for kd-tree thinning
k_max <- estimate_k_max(coordinates, thin_dist = 50)
print(k_max)
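
The binning approach described in Details can be sketched as follows. The sketch converts the thinning distance to degrees using 111.32 km per degree, counts points per cell, and returns the largest total found in any 3 x 3 block of cells. The helper name 'k_max_sketch' and the exact counting rule are assumptions; 'estimate_k_max()' may aggregate counts differently.

```r
# Hypothetical sketch of a histogram-style density estimate.
k_max_sketch <- function(coords, thin_dist, km_per_degree = 111.32) {
  cell_deg <- thin_dist / km_per_degree
  cx <- floor(coords[, 1] / cell_deg)
  cy <- floor(coords[, 2] / cell_deg)
  counts <- table(paste(cx, cy, sep = "_"))  # points per occupied cell

  occupied <- unique(cbind(cx, cy))
  block <- expand.grid(dx = -1:1, dy = -1:1)
  block_totals <- apply(occupied, 1, function(cell) {
    # Sum the counts of the cell and its immediate neighbors
    keys <- paste(cell[1] + block$dx, cell[2] + block$dy, sep = "_")
    sum(counts[intersect(keys, names(counts))])
  })
  max(block_totals)
}

# Three clustered points and one isolated point
pts <- rbind(c(0.01, 0.01), c(0.02, 0.02), c(0.03, 0.01), c(50, 50))
k_est <- k_max_sketch(pts, thin_dist = 50)
print(k_est)
```

Only occupied cells are stored, so memory use scales with the number of distinct cells rather than with the full extent of the grid.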

Perform Grid-Based Thinning of Spatial Points

Description

This function performs thinning of spatial points by assigning them to grid cells based on a specified resolution or thinning distance. It can either create a new raster grid or use an existing 'terra::SpatRaster' object.

Usage

grid_thinning(
  coordinates,
  thin_dist = NULL,
  resolution = NULL,
  origin = NULL,
  raster_obj = NULL,
  n = 1,
  trials = 10,
  all_trials = FALSE,
  crs = "epsg:4326",
  priority = NULL
)

Arguments

coordinates

A numeric matrix or data frame with two columns representing the x (longitude) and y (latitude) coordinates of the points.

thin_dist

A numeric value representing the thinning distance in kilometers. It will be converted to degrees if 'resolution' is not provided.

resolution

A numeric value representing the resolution (in degrees) of the raster grid. If provided, this takes priority over 'thin_dist'.

origin

A numeric vector of length 2 (e.g., 'c(0, 0)'), specifying the origin of the raster grid (optional).

raster_obj

An optional 'terra::SpatRaster' object to use for grid thinning. If provided, the raster object will be used instead of creating a new one.

n

A positive integer specifying the maximum number of points to retain per grid cell (default: 1).

trials

An integer specifying the number of trials to perform for thinning (default: 10).

all_trials

A logical value indicating whether to return results for all trials ('TRUE') or just the first trial ('FALSE', default).

crs

An optional CRS (Coordinate Reference System) to project the coordinates and raster (default WGS84, 'epsg:4326'). This can be an EPSG code, a PROJ.4 string, or a 'terra::crs' object.

priority

A numeric vector of the same length as the number of points with numerical values indicating the priority of each point. Instead of eliminating points randomly, higher values are preferred during thinning.

Value

A list of logical vectors indicating which points to keep for each trial.

Examples

# Example: Grid thinning using thin_dist
coords <- matrix(c(-122.4194, 37.7749,
                        -122.4195, 37.7740,
                        -122.4196, 37.7741), ncol = 2, byrow = TRUE)

result <- grid_thinning(coords, thin_dist = 10, trials = 5, all_trials = TRUE)
print(result)

# Example: Grid thinning using a custom resolution
result_res <- grid_thinning(coords, resolution = 0.01, n = 2, trials = 5)
print(result_res)

# Example: Using a custom raster object
library(terra)
rast_obj <- terra::rast(nrows = 100, ncols = 100, xmin = -123, xmax = -121, ymin = 36, ymax = 38)
result_raster <- grid_thinning(coords, raster_obj = rast_obj, trials = 5)
print(result_raster)
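
The effect of the 'priority' argument can be illustrated without 'terra' by a small base R sketch that assigns points to cells and keeps the highest-priority point in each cell. The helper name 'grid_keep_sketch', the floor-based cell assignment, and the choice of one point per cell are assumptions for illustration only.

```r
# Hypothetical sketch: keep the highest-priority point in each grid cell.
grid_keep_sketch <- function(coords, resolution, priority) {
  cell <- paste(floor(coords[, 1] / resolution),
                floor(coords[, 2] / resolution), sep = "_")
  keep <- logical(nrow(coords))
  for (idx in split(seq_len(nrow(coords)), cell)) {
    keep[idx[which.max(priority[idx])]] <- TRUE
  }
  keep
}

coords <- matrix(c(-122.4194, 37.7749,
                   -122.4195, 37.7740,
                   -122.4196, 37.7741), ncol = 2, byrow = TRUE)
keep <- grid_keep_sketch(coords, resolution = 0.1, priority = c(1, 3, 2))
print(keep)
```

All three points share one 0.1-degree cell here, so only the point with the highest priority value is retained.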

Convert Geographic Coordinates to Cartesian Coordinates

Description

This function converts geographic coordinates, given as longitude and latitude in degrees, to Cartesian coordinates (x, y, z) assuming a spherical Earth model.

Usage

lon_lat_to_cartesian(lon, lat, R = 6371)

Arguments

lon

Numeric vector of longitudes in degrees.

lat

Numeric vector of latitudes in degrees.

R

Radius of the Earth in kilometers (default: 6371 km).

Value

A numeric matrix with three columns (x, y, z) representing Cartesian coordinates.

Examples

lon <- c(-122.4194, 0)
lat <- c(37.7749, 0)
lon_lat_to_cartesian(lon, lat)
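
For reference, a spherical conversion of this kind is commonly written as x = R cos(lat) cos(lon), y = R cos(lat) sin(lon), z = R sin(lat), with angles in radians. The sketch below implements these formulas under that axis convention, which is an assumption about 'lon_lat_to_cartesian()'; 'to_cartesian_sketch' is a hypothetical name.

```r
# Hypothetical re-implementation for illustration (spherical Earth).
to_cartesian_sketch <- function(lon, lat, R = 6371) {
  lon_rad <- lon * pi / 180
  lat_rad <- lat * pi / 180
  cbind(x = R * cos(lat_rad) * cos(lon_rad),
        y = R * cos(lat_rad) * sin(lon_rad),
        z = R * sin(lat_rad))
}

xyz <- to_cartesian_sketch(lon = c(-122.4194, 0), lat = c(37.7749, 0))
print(xyz)

# Every converted point should lie on the sphere, at distance R from
# the origin.
radii <- sqrt(rowSums(xyz^2))
print(radii)
```

Checking that each row has length R is a quick sanity test for any implementation of this conversion.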

Thinning Algorithm for Spatial Data

Description

This function performs the core thinning algorithm used to reduce the density of points in spatial data while maintaining spatial representation. It iteratively removes the points with the most neighbors until no points with neighbors remain. The algorithm supports multiple trials to find the optimal thinning solution.

Usage

max_thinning_algorithm(neighbor_indices, trials, all_trials = FALSE)

Arguments

neighbor_indices

A list of integer vectors where each element contains the indices of the neighboring points for each point in the dataset.

trials

A positive integer specifying the number of thinning trials to perform.

all_trials

A logical value indicating whether to return results of all attempts ('TRUE') or only the best attempt with the most points retained ('FALSE'). Default is 'FALSE'.

Value

A list of logical vectors indicating which points are kept in each trial if 'all_trials' is 'TRUE'; otherwise, a list with a single logical vector indicating the points kept in the best trial.

Examples

# Example usage within a larger thinning function
neighbor_indices <- list(c(2, 3), c(1, 3), c(1, 2))
trials <- 5
all_trials <- FALSE
kept_points <- max_thinning_algorithm(neighbor_indices, trials, all_trials)
print(kept_points)
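
The removal loop described above can be sketched in base R. A single trial repeatedly drops the point that currently has the most remaining neighbors, breaking ties at random, and several trials are compared by the number of points kept. The helper name 'thin_once_sketch', the random tie-breaking, and the assumption of symmetric neighbor lists are illustrative choices, not a statement of how 'max_thinning_algorithm()' is written.

```r
# Hypothetical sketch of one thinning trial.
thin_once_sketch <- function(neighbor_indices) {
  keep <- rep(TRUE, length(neighbor_indices))
  n_nb <- lengths(neighbor_indices)  # remaining neighbors per point
  while (any(n_nb[keep] > 0)) {
    candidates <- which(keep & n_nb == max(n_nb[keep]))
    drop <- candidates[sample.int(length(candidates), 1)]
    keep[drop] <- FALSE
    n_nb[drop] <- 0
    # Each still-kept neighbor of the dropped point loses one neighbor
    nb <- neighbor_indices[[drop]]
    nb <- nb[keep[nb]]
    n_nb[nb] <- n_nb[nb] - 1
  }
  keep
}

# Chain 1 - 2 - 3: removing the middle point resolves both conflicts
chain_keep <- thin_once_sketch(list(2L, c(1L, 3L), 2L))
print(chain_keep)

# Several trials on three mutually neighboring points; keep the best one
set.seed(42)
neighbor_indices <- list(c(2, 3), c(1, 3), c(1, 2))
runs <- replicate(5, thin_once_sketch(neighbor_indices), simplify = FALSE)
best <- runs[[which.max(vapply(runs, sum, integer(1)))]]
print(best)
```

Random tie-breaking is what makes repeated trials useful: different runs can retain different numbers of points, and the run with the most retained points is selected.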

Precision Thinning of Spatial Points

Description

This function performs thinning of spatial points by rounding their coordinates to a specified precision and removing duplicates. It can perform multiple trials of this process and return the results for all or just the best trial.

Usage

precision_thinning(
  coordinates,
  precision = 4,
  trials = 10,
  all_trials = FALSE,
  priority = NULL
)

Arguments

coordinates

A numeric matrix or data frame with two columns representing the longitude and latitude of points.

precision

A positive integer specifying the number of decimal places to which coordinates should be rounded. Default is 4.

trials

A positive integer specifying the number of thinning trials to perform. Default is 10.

all_trials

A logical value indicating whether to return results for all trials ('TRUE') or just the first/best trial ('FALSE'). Default is 'FALSE'.

priority

A numeric vector of the same length as the number of points with numerical values indicating the priority of each point. Instead of eliminating points randomly, higher values are preferred during thinning.

Details

The function performs multiple trials to account for randomness in the order of point selection. By default, it returns the first trial, but setting 'all_trials = TRUE' will return the results of all trials.

Value

If 'all_trials' is 'FALSE', returns a logical vector indicating which points were kept in the first trial. If 'all_trials' is 'TRUE', returns a list of logical vectors, one for each trial.

Examples

# Example usage
coords <- matrix(c(-123.3656, 48.4284, -123.3657, 48.4285, -123.3658, 48.4286),
                 ncol = 2, byrow = TRUE)
result <- precision_thinning(coords, precision = 3, trials = 5, all_trials = TRUE)
print(result)

# Example with a single trial and lower precision
result_single <- precision_thinning(coords, precision = 2, trials = 1, all_trials = FALSE)
print(result_single)
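
The rounding-and-deduplication step can be sketched in a few lines of base R. This version always keeps the first point of each rounded coordinate pair, whereas the function described above randomizes the selection across trials; 'precision_keep_sketch' is a hypothetical name used only for illustration.

```r
# Hypothetical sketch: round coordinates and keep the first point of each
# distinct rounded pair.
precision_keep_sketch <- function(coords, precision) {
  rounded <- round(as.matrix(coords), digits = precision)
  key <- paste(rounded[, 1], rounded[, 2], sep = "_")
  !duplicated(key)
}

coords <- matrix(c(-123.3656, 48.4284,
                   -123.3657, 48.4285,
                   -123.3658, 48.4286), ncol = 2, byrow = TRUE)
keep_coarse <- precision_keep_sketch(coords, precision = 2)
keep_fine <- precision_keep_sketch(coords, precision = 4)
print(keep_coarse)
print(keep_fine)
```

With two decimal places the three points collapse to the same rounded pair and only one is kept; with four decimal places they remain distinct.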

Select Target Number of Points for Spatial Thinning

Description

This function selects a specified number of points from a spatial dataset while maximizing the distance between selected points.

Usage

select_target_points(
  distance_matrix,
  target_points,
  thin_dist,
  trials,
  all_trials = FALSE
)

Arguments

distance_matrix

A matrix of pairwise distances between points.

target_points

An integer specifying the number of points to retain.

thin_dist

A positive numeric value representing the thinning distance in kilometers.

trials

A positive integer specifying the number of thinning trials to perform.

all_trials

A logical value indicating whether to return results of all attempts ('TRUE') or only the best attempt with the most points retained ('FALSE'). Default is 'FALSE'.

Value

A list of logical vectors indicating which points are kept in each trial if 'all_trials' is 'TRUE'; otherwise, a list with a single logical vector indicating the points kept in the best trial.

Examples

# Example distance matrix (3 points)
dist_matrix <- matrix(c(0, 2, 5,
                        2, 0, 3,
                        5, 3, 0), ncol = 3)

# Select 2 points maximizing distance
result <- select_target_points(dist_matrix, target_points = 2,
                              thin_dist = 4, trials = 5, all_trials = TRUE)

Spatial Thinning of Points

Description

This function performs spatial thinning of geographic points to reduce point density while maintaining spatial representation. Points are thinned based on a specified distance, grid, or precision, and multiple trials can be performed to identify the best thinned dataset.

Usage

thin_points(
  data,
  lon_col = "lon",
  lat_col = "lat",
  group_col = NULL,
  method = c("distance", "grid", "precision"),
  trials = 10,
  all_trials = FALSE,
  seed = NULL,
  verbose = FALSE,
  ...
)

Arguments

data

A data frame or tibble containing the points to thin. Must contain longitude and latitude columns.

lon_col

Name of the column with longitude coordinates (default: "lon").

lat_col

Name of the column with latitude coordinates (default: "lat").

group_col

Name of the column for grouping points (e.g., species name, year). If NULL, no grouping is applied.

method

Thinning method to use 'c("distance", "grid", "precision")'.

trials

Number of thinning iterations to perform (default: 10). Must be a positive number.

all_trials

If TRUE, returns results of all attempts; if FALSE, returns the best attempt with the most points retained (default: FALSE).

seed

Optional; an integer seed for reproducibility of results.

verbose

If TRUE, prints progress messages (default: FALSE).

...

Additional parameters passed to specific thinning methods. See Details.

Details

The thinning methods available are:

'distance'

Enforces a specific minimum distance between points.

'grid'

Applies a grid-based thinning method.

'precision'

Utilizes precision-based thinning.

Distance-based thinning

The specific parameters for distance-based thinning are:

'thin_dist'

A positive numeric value representing the thinning distance in kilometers.

'search_type'

A character string indicating the neighbor search method 'c("local_kd_tree", "k_estimation", "kd_tree", "brute")'. The default value is 'local_kd_tree'.

'distance'

Distance metric to use 'c("haversine", "euclidean")'. Default is Haversine for geographic coordinates.

'R'

The radius of the Earth in kilometers. Default is 6371 km.

'target_points'

Optional integer specifying the number of points to retain. If 'NULL' (default), the function tries to maximize the number of points retained.

'n_cores'

Number of cores for parallel processing (only for '"local_kd_tree"'). Default is 1.

Grid-based thinning

The specific parameters for grid-based thinning are:

'thin_dist'

A positive numeric value representing the thinning distance in kilometers.

'resolution'

A numeric value representing the resolution (in degrees) of the raster grid. If provided, this takes priority over 'thin_dist'.

'origin'

A numeric vector of length 2 (e.g., 'c(0, 0)'), specifying the origin of the raster grid (optional).

'raster_obj'

An optional 'terra::SpatRaster' object to use for grid thinning. If provided, the raster object will be used instead of creating a new one.

'n'

A positive integer specifying the maximum number of points to retain per grid cell (default: 1).

'crs'

An optional CRS (Coordinate Reference System) to project the coordinates and raster (default WGS84, 'epsg:4326'). This can be an EPSG code, a PROJ.4 string, or a 'terra::crs' object.

'priority'

A numeric vector of the same length as the number of points with numerical values indicating the priority of each point. Instead of eliminating points randomly, higher values are preferred during thinning.

Precision-based thinning

The specific parameters for precision-based thinning are:

'precision'

A positive integer specifying the number of decimal places to which coordinates should be rounded. Default is 4.

'priority'

A numeric vector of the same length as the number of points with numerical values indicating the priority of each point. Instead of eliminating points randomly, higher values are preferred during thinning.

For more information on specific thinning methods and inputs, refer to their respective documentation:

  • 'distance_thinning()'

  • 'grid_thinning()'

  • 'precision_thinning()'

Value

A list with a data.frame/matrix/tibble of thinned points if 'all_trials = FALSE', or a combined result of all attempts if 'all_trials = TRUE'.

Examples

# Generate sample data
set.seed(123)
sample_data <- data.frame(
  lon = runif(100, -180, 180),
  lat = runif(100, -90, 90)
)

# Perform thinning using distance method
thinned_data <- thin_points(sample_data,
                             lon_col = "lon",
                             lat_col = "lat",
                             method = "distance",
                             trials = 5,
                             verbose = TRUE)

# Perform thinning with grouping
sample_data$species <- sample(c("species_A", "species_B"), 100, replace = TRUE)
thinned_grouped_data <- thin_points(sample_data,
                                     lon_col = "lon",
                                     lat_col = "lat",
                                     group_col = "species",
                                     method = "distance",
                                     trials = 10)