Using the GeoSpock DB Presto connector

The GeoSpock DB Presto connector enables you to run SQL queries on your ingested data, and has been optimized to run a subset of SQL functions. You can use other SQL functions in your queries but they may run more slowly.

The following guidelines will enable you to get the best performance from the GeoSpock DB Presto connector when constructing your queries. These guidelines cover:

Joining tables

To get the best performance from the GeoSpock database Presto connector from joining two tables, the table containing the larger amount of data (usually the event data) must be on the left of the JOIN and the table containing the smaller amount of data (usually the point of interest (POI) data) on the right. For example:

SELECT *
FROM geospock.default.events AS large
JOIN geospock.default.poi AS small
ON large.exampleid = small.exampleid;

There is currently a limit on how much data the right table of a spatial join can contain. If your join right table contains around 30 million rows or more you can get ExceededMemoryLimitException errors, to avoid those errors you can disable spatial joins and use cross joins. It will be slower overall, but it will not have the same memory restriction.

An example of a spatial join query, the error message and how to disable spatial joins:

SELECT *
FROM geospock.default.events AS large
JOIN geospock.default.poi AS big_poi_table
ON ST_DISTANCE(big_poi_table.longitude, big_poi_table.latitude, large.longitude, large.latitude) < 1;

This can fail with the following message:

Query exceeded per-node user memory limit of 6.08GB [Allocated: 6.08GB, Delta: 100.00MB, Top Consumers: {SpatialIndexBuilderOperator=6.08GB, ScanFilterAndProjectOperator=1.51kB}]

Then we can disable the spatial join and revert to cross joins using the following session variable:

SET SESSION spatial_join=false;

Remember to re-enable spatial joins after you run the affected query to return to the default behavior.

SET SESSION spatial_join=true ;

Rewriting queries with more than one join

The GeoSpock database Presto connector currently only optimizes joins between two tables. More than one join in a query will lead to queries taking longer to run. There are a couple of methods available to try to avoid multiple joins.

If we have an example query with two joins present:

WITH sel_poi AS (
  SELECT poi.*
    FROM geospock.default.cafes AS poi
    JOIN geospock.default.cityneighborhoods AS city
      ON ST_Within(ST_Point(poi.longitude, poi.latitude), ST_GeometryFromText(city.geometry))
     AND poi.cafe_operator = 'Starbucks'
),
sel_events AS (
  SELECT poi.latitude, poi.longitude
    FROM geospock.default.event AS event
    JOIN sel_poi AS poi
      ON ST_Distance(ST_Point(event.longitude, event.latitude), ST_Point(poi.longitude, poi.latitude)) < 0.00025
     AND event.timestamp BETWEEN TIMESTAMP '2017-05-01 09:00:00' AND TIMESTAMP '2017-05-01 09:59:59'
)
  SELECT poi.latitude, poi.longitude, COUNT(*) as counts
    FROM sel_events AS events
GROUP BY poi.latitude, poi.longitude

we could do either of the following:

  • Split this query into two, with the results of the first query saved to an intermediate Hive table:
CREATE TABLE hive.pois.cafes AS (
    SELECT poi.*
    FROM geospock.default.cafes AS poi
    JOIN geospock.default.cityneighborhoods AS city
      ON ST_Within(ST_Point(poi.longitude, poi.latitude), ST_GeometryFromText(city.geometry))
     AND poi.cafe_operator = 'Starbucks'
)

then

WITH sel_events AS (
  SELECT poi.latitude, poi.longitude
    FROM geospock.default.event AS event
    JOIN hive.pois.cafes AS poi
      ON ST_Distance(ST_Point(event.longitude, event.latitude), ST_Point(poi.longitude, poi.latitude)) < 0.00025
     AND event.timestamp BETWEEN TIMESTAMP '2017-05-01 09:00:00' AND TIMESTAMP '2017-05-01 09:59:59'
)
  SELECT poi.latitude, poi.longitude, COUNT(*) as counts
    FROM sel_events AS events
GROUP BY poi.latitude, poi.longitude
  • Get the required set of POIs from the POI table by filtering directly in the query using a literal geometry (e.g. a MULTIPOLYGON) rather than using a set of geometries from another table:
WITH sel_events AS (
  SELECT poi.latitude, poi.longitude
    FROM geospock.default.event AS event
    JOIN geospock.default.cafes AS poi
      ON ST_Distance(ST_Point(event.longitude, event.latitude), ST_Point(poi.longitude, poi.latitude)) < 0.00025
     AND event.timestamp BETWEEN TIMESTAMP '2017-05-01 09:00:00' AND TIMESTAMP '2017-05-01 09:59:59'
   WHERE ST_Within(ST_Point(poi.longitude, poi.latitude), ST_GeometryFromText("MULTIPOLYGON (((...)))"))
)
  SELECT poi.latitude, poi.longitude, COUNT(*) as counts
    FROM sel_events AS events
GROUP BY poi.latitude, poi.longitude

Defining geofences

The size of the geofence can be provided as a radius in either degrees or meters. For the best performance, you should define this value in your query, rather than deriving the geofence radius from an object. To define a geofence using:

  • degrees, your query will look like this:
SELECT *
FROM geospock.default.largetable AS large,
geospock.default.smalltable AS small
WHERE ST_DISTANCE(ST_POINT(small.longitude, small.latitude), ST_POINT(large.longitude, large.latitude)) < 0.0005;
  • meters, your query will look like this:
SELECT * 
FROM geospock.default.events event 
JOIN geospock.default.pois poi 
ON GS_Distance_Within(to_spherical_geography(ST_POINT(poi.longitude, poi.latitude)), to_spherical_geography(ST_POINT(event.longitude, event.latitude)), 10);

Note that this query uses the GeoSpock custom function GS_Distance_Within which has been optimized to run spatial join queries based on the distance (in meters) between two points using spherical geography. Alternatively, you could use the other optimized spatial join function, gs_great_circle_distance_within, instead:

SELECT *
FROM geospock.default.events event
JOIN geospock.default.pois poi
ON gs_great_circle_distance_within(poi.longitude, poi.latitude, event.longitude, event.latitude, 10);

Using a stored radius

If the POI table also includes a defined radius for each POI, you can use this value in the SQL query. However, in this case, you cannot apply an operational expression to this radius, such as multiplying the radius by a factor. If the radius is stored in degrees, for example, the SQL query would look something like this:

SELECT *
FROM geospock.default.largetable AS large,
geospock.default.smalltable AS small
WHERE ST_DISTANCE(ST_POINT(small.longitude, small.latitude), ST_POINT(large.longitude, large.latitude)) < small.radius;

Calculating the distance between two spatial locations

To get the best performance, use the following spatial function with the GeoSpock database:

ST_Distance(ST_POINT(longitude, latitude), ST_POINT(longitude, latitude))

The GeoSpock database can run other spatial functions but these queries may take longer to execute.

Naming columns for non-Geospock POI tables

Tables containing the point of interest data stored in a non-GeoSpock catalog (e.g. Hive) must have columns identifying the location of points named as longitude and latitude respectively.

Specifying catalogs and schemas

GeoSpock database does not support the USE function for specifying catalogs or schemas implicitly in queries. You must use fully-qualified table names when running queries against your ingested data into the GeoSpock database, in the format:

<catalog>.<schema>.<table> 

For example:

SELECT *
FROM geospock.default.largetable AS large
JOIN geospock.default.smalltable AS small
ON large.exampleid = small.exampleid;

Creating tables for results

When running queries involving large amounts of data, you can improve the GeoSpock database's memory usage and execution time of spatial JOIN queries by creating a table using the spatial_partitioning(Geometry) function. For example:

CREATE TABLE hive.default.t_partitioning AS SELECT 
spatial_partitioning(ST_POINT(longitude, latitude)) AS v FROM 
geospock.<schema_name>.<table_name>;

To use this table, set the SESSION property spatial_partitioning_table_name to the name of the table you have created, as follows:

SET SESSION spatial_partitioning_table_name = 'hive.default.t_partitioning';