Guidelines for using the Data Analyzer

The Spatial Big Data Platform's Data Analyzer enables you to run SQL queries on your ingested data, and has been optimized to run a subset of SQLfunctions. You can use other SQL functions on your data but these queries may run more slowly.

For the best performance from the Data Analyzer, follow these guidelines when running queries:

  • Table order: the table containing event data must be on the left of the JOIN and the table containing point of interest (POI) data on the right. For example:

    SELECT *
    FROM geospock.events.default AS large
    JOIN geospock.poi.default AS small
    ON large.exampleid = small.exampleid;
  • Defining geofences: the size of the geofence can be provided as a radius in either degrees or meters. For the best performance, you should define this value in your query, rather than deriving the geofence radius from an object as follows:

    Using degrees:

    SELECT *
    FROM geospock.largetable.default AS large,
    geospock.smalltable.default AS small
    WHERE ST_DISTANCE(ST_POINT(small.longitude, small.latitude), ST_POINT(large.longitude, large.latitude)) < 0.0005;

    Using meters:

    SELECT * 
    FROM geospock.events.default event
    JOIN geospock.pois.default poi
    ON ST_Distance(to_spherical_geography(ST_Point(poi.longitude, poi.latitude)), to_spherical_geography(ST_Point(event.longitude, event.latitude))) < 10;
  • Using a stored radius: if the POI table also includes a defined radius for each POI, you can use this value in the SQL query. However, in this case, you cannot apply an operational expression, such as multiplying the radius by a factor, to this radius. If the radius is stored in degrees, for example, the SQL query would look something like this:

    SELECT *
    FROM geospock.largetable.default AS large,
    geospock.smalltable.default AS small
    WHERE ST_DISTANCE(ST_POINT(small.longitude, small.latitude), ST_POINT(large.longitude, large.latitude)) < small.radius;
  • Presto spatial function: to get the best performance, use the following spatial function with Data Analyzer:

    ST_Distance(ST_Point(longitude, latitude), ST_Point(longitude, latitude))

    The Data Analyzer can run other spatial functions but these queries may take longer to execute.

  • POI table column names: the table containing the point of interest data must label the columns containing the location points as longitude and latitude respectively

Specifying catalogs and schemas

The Data Analyzer does not support the USE function for specifying catalogs or schemas implicitly in queries. You must use fully-qualified table names when running queries against your ingested data on the Platform, in the format:

<catalog>.<schema>.<table> 

For example:

SELECT *
FROM geospock.largetable.default AS large
JOIN geospock.smalltable.default AS small
ON large.exampleid = small.exampleid;

Creating tables for results

When running queries involving large amounts of data, you can improve the Data Analyzer's memory usage and execution time of spatial JOIN queries by creating a table using the spatial_partitioning(Geometry) function. For example:

CREATE TABLE hive.default.t_partitioning AS SELECT 
spatial_partitioning(ST_Point(longitude, latitude)) AS v FROM
geospock.<schema_name>.<table_name>;

To use this table, set the SESSION property spatial_partitioning_table_name to the name of the table you have created, as follows:

SET SESSION spatial_partitioning_table_name = 'hive.default.t_partitioning';