First Steps with GeoSpock DB Discovery

Once you have deployed GeoSpock DB Discovery, you can start experimenting with the platform. All deployments come with a set of sample datasets that you can use to test the capabilities of the system.

You can connect to GeoSpock DB using the instructions in this guide and then execute SQL queries to explore the data. To illustrate how to use these queries in practice, we provide example use cases for Python Jupyter Notebooks and Tableau.

For further details about GeoSpock DB, you are encouraged to read the wider GeoSpock DB product documentation.

Connecting to Discovery

You can connect to GeoSpock DB Discovery using Presto CLI, Python, DataGrip, Tableau, or any BI tool of your choice that uses a JDBC/ODBC connector that's supported by GeoSpock DB.

In the following sections, you can see some examples of how to connect to GeoSpock DB Discovery and explore the datasets. For further details on how to integrate with different tools, please see Integrating your third party tools.

Connection parameters

When connecting to the GeoSpock DB Discovery deployment, you will need to supply connection details. The details needed by each supported tool are described below.

In all cases, the connection details will depend on whether your deployment is set up for HTTP or HTTPS.

Under HTTP, you will generally need to supply the value of the GeoSpockDBDiscoveryHostname which can be found as a value with the same label on the "Outputs" tab of the CloudFormation Stack, e.g. disco-Prest-26Z2LD7S11IPM-2177915702.ap-southeast-1.elb.amazonaws.com.

If you have enabled HTTPS / SSL then, instead, you will need to supply the domain or subdomain specified in your own custom DNS CNAME record, e.g. discovery.companyname.com.

Please note that, although GeoSpock DB Discovery does not use any authentication, some connection methods still require a username to be provided. We therefore recommend providing 'username' and no password in these cases.

Deploying a SQL client in the cloud

As the Discovery query cluster is located in the AWS Asia Pacific Singapore Region (ap-southeast-1), it may be worth considering deploying a SageMaker instance or any other third party tool supported by GeoSpock DB in the same region as an alternative to using a local analytics environment. This setup would reduce latency, data transfer costs and increase data transfer speeds between the cluster and the client.

Avoiding long-runnning queries

It would be generally advisable to avoid running queries that perform full table scans on large datasets, such as SELECT COUNT(*) FROM geospock.default.sg_erp2_synthesis;. This query would scan through 1.3 trillion rows. If you accidentally start a query that performs a full table scan on a large dataset such as sg_erp2_synthesis, it might be worth considering terminating such a query.

Presto CLI

The Presto CLI can be downloaded and installed following these instructions.

Connection

To run the Presto CLI, you will need to enter a connection string. The form of this connection string will depend on whether your deployment is set up for HTTP or HTTPS.

HTTP connection

The parameters used are as follows:

Parameter name Value
host <GeoSpockDBDiscoveryHostname>
port 80
username [not required]
password [not required]

where <GeoSpockDBDiscoveryHostname> is the value with the same label in the CloudFormation "Outputs" Tab (e.g. disco-Prest-26Z2LD7S11IPM-2177915702.ap-southeast-1.elb.amazonaws.com).

Connection string:

presto --server http://<GeoSpockDBDiscoveryHostname>:80

HTTPS connection

parameter name value
host <GeoSpockDeploymentAlias>
port 443
username [not required]
password [not required]

where <GeoSpockDeploymentAlias> is the (sub)domain specified as an alias for the GeoSpock deployment as a CNAME record (e.g. discovery.companyname.com).

Connection string:

presto --server https://<GeoSpockDeploymentAlias>:443

Explore available tables

All available datasets can be found in the catalog geospock and schema default. The following query shows the tables available for querying:

presto> SHOW TABLES FROM geospock.default;

which output is as follows and that it may come in a different oder depending on the name and number of tables available:

               Table               
-----------------------------------
sg_air_temperature
sg_erp1_camera_location
sg_erp2_synthesis  
...

Query 20200902_063348_00239_8qm98, FINISHED, 4 nodes
Splits: 53 total, 53 done (100.00%)
0:17 [13 rows, 513B] [0 rows/s, 30B/s]

Please note that Tableau does not support the SHOW tables FROM geospock.default query used above. Tableau will work with the queries specified in the Tableau section of the guide.

Individual table inspection

The available tables can be inspected using the following methodology.

1.- Use DESCRIBE to view the table column description, for example sg_erp2_synthesis:

presto>  DESCRIBE geospock.default.sg_erp2_synthesis;

for which the output is:

      Column       |   Type    | Extra |     Comment     
-------------------+-----------+-------+-----------------
 longitude         | double    |       | LONGITUDE index
 latitude          | double    |       | LATITUDE index  
 erp_vehicle_id    | varchar   |       | Nullable        
 timestamp         | timestamp |       | TIME index      
 erp_vehicle_types | varchar   |       | Nullable        
(5 rows)

Query 20200916_082735_00490_cqshz, FINISHED, 11 nodes
Splits: 172 total, 172 done (100.00%)
0:22 [5 rows, 442B] [0 rows/s, 20B/s]

2.- To view a small sample of the same table:

presto> SELECT * FROM geospock.default.sg_erp2_synthesis LIMIT 10;

for which the output is:


     longitude      |      latitude      |          erp_vehicle_id          |        timestamp        | erp_vehicle_types
--------------------+--------------------+----------------------------------+-------------------------+-------------------
 103.73445229884074 | 1.3400738315291727 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:21.255 | Car               
 103.73426582083782 |  1.339926387628572 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:22.255 | Car               
 103.73430252560004 | 1.3397610994525762 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:23.255 | Car               
 103.73421822015905 |  1.339612635428762 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:24.255 | Car               
 103.73416292653349 | 1.3394466020267752 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:25.255 | Car               
  103.7341947795847 | 1.3393167331370714 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:26.255 | Car               
 103.73409371197609 | 1.3391274762104368 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:27.255 | Car               
 103.73417451022304 | 1.3389805961732097 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:28.255 | Car               
 103.73417334337866 | 1.3388262946437501 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:29.255 | Car               
 103.73405110217102 |  1.338642939779033 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:30.255 | Car               
(10 rows)

Query 20200916_083040_00498_cqshz, FINISHED, 10 nodes
Splits: 1,017 total, 75 done (7.37%)
0:09 [59.4K rows, 1.31GB] [6.6K rows/s, 149MB/s]

Please note that your results may differ slightly from the output example listed above.

Python

While there are several ways of installing Python, we would recommend starting with the Anaconda distribution, which includes the most popular packages. Please bear in mind that some examples may require installation of additional libraries, such as Pandas and the PrestoDB plugin for Python.

As the the Presto SQL cluster is located in the AWS Asia Pacific Singapore Region (ap-southeast-1), it may be worth considering deploying a SageMaker instance or any other third party tool supported by GeoSpock DB in the same region as an alternative to a your local analytics environment. This setup would reduce latency, data transfer costs and increase data transfer speeds between the cluster and the instance.

The code examples listed in this section can be found in our Git repository under discovery > sg_python.

Connection

The following command functions will help make connecting to and querying with GeoSpock DB easier. The connection code will depend slightly on whether your deployment is set up for HTTP or HTTPS.

HTTP Connection

The connection parameters used are as follows:

parameter name value
host <GeoSpockDBDiscoveryHostname>
port 80
user username
password ''
auth [not required]

where <GeoSpockDBDiscoveryHostname> is the value with the same label in the CloudFormation "Outputs" Tab (e.g. disco-Prest-26Z2LD7S11IPM-2177915702.ap-southeast-1.elb.amazonaws.com).

While the following set of commands do not return any visible output, they assign the connection object to the variable newconn which will be used later for queries.

import prestodb  
import datetime
import pandas as pd

def makeconn(hstb,uname,pswd):
    return prestodb.dbapi.connect(
    host=hstb,
    port=80,
    user=uname.strip(),
    catalog='geospock',
    schema='default',
    http_scheme='http'
    )


hst      = 'GeoSpockDBDiscoveryHostname' # e.g `disco-Prest-26Z2LD7S11IPM-2177915702.ap-southeast-1.elb.amazonaws.com`
username = 'username'
password = ''
newconn  = makeconn (hst,username,password)    

HTTPS Connection

The parameters used are as follows:

parameter name value
host <GeoSpockDeploymentAlias>
port 443
user username
password ''
auth prestodb.auth.BasicAuthentication(uname.strip(), pswd.strip())

where <GeoSpockDeploymentAlias> is the (sub)domain specified as an alias for the GeoSpock deployment as a CNAME record (e.g. discovery.companyname.com).

While the following set of commands do not return any visible output, it assigns the connection object to the variable newconn which can be later will be used to make queries.

import prestodb  
import datetime
import pandas as pd

def makeconn(hstb,uname,pswd):
    return prestodb.dbapi.connect(
    host=hstb,
    port=443,
    user=uname.strip(),
    catalog='geospock',
    schema='default',
    http_scheme='https',    
    auth=prestodb.auth.BasicAuthentication(uname.strip(), pswd.strip())
    )


hst      = 'GeoSpockDeploymentAlias' #e.g. discovery.companyname.com
username = 'username'
password = ''
newconn  = makeconn (hst,username,password)    

Explore available tables

The following function will enable you to run queries:


def runquery(cnn,cls,querytext):
    sttime = datetime.datetime.now()
    print("Starting query at: "+str(sttime)+".....")
    cur = cnn.cursor()
    cur.execute(querytext)
    rows = cur.fetchall()
    cur.close()
    endtime = datetime.datetime.now()
    print("Finished query at: "+str(endtime)+".")
    tdelt=(endtime-sttime).seconds
    print("Query duration: "+str(tdelt)+" seconds.")
    print("Rows returned: "+str(len(rows)))
    return pd.DataFrame(rows,columns=cls)

This query returns the available tables for querying, which may be returned in any order:

QUERY_tables = """
SHOW TABLES
FROM geospock.default
"""
cls     = ['tables']
tables = runquery(newconn,cls,QUERY_tables)
tables

This outputs:

tables
0 sg_air_temperature
1 sg_erp1_camera_location
2 sg_erp2_synthesis
... ...

Individual table description

The available tables can be inspected using the following methodology.

1.- Use DESCRIBE to view the table column description, for example sg_erp2_synthesis:

QUERY_describe = """
DESCRIBE geospock.default.sg_erp2_synthesis
"""
cls           = ['column', 'type', 'extra', 'comment']
describe_erp2 = runquery(newconn,cls,QUERY_describe)
describe_erp2

This returns:

column type extra comment
0 longitude double LONGITUDE index
1 latitude double LATITUDE index
2 erp_vehicle_id varchar Nullable
3 timestamp timestamp TIME index
4 erp_vehicle_type varchar Nullable

2.- To view a small sample of the same table:

QUERY_tables = """
SELECT *
FROM geospock.default.sg_erp2_synthesis
LIMIT 10
"""
cls         = ['longitude', 'latitiude', 'vehicle_id', 'timestamp', 'vehicle_type']
sample_erp2 = runquery(newconn,cls,QUERY_tables)
sample_erp2

This returns:

longitude latitude vehicle_id timestamp vehicle_type
0 103.864651 1.389269 100020beff0d3c8938eca28f9baa7ea 2019-01-06 21:02:47.358 car
1 103.864598 1.389288 100020beff0d3c8938eca28f9baa7ea 2019-01-06 21:02:48.358 car
... ... ... ... ... ...

Your results may differ slightly from the output listed above.

DataGrip

You can connect to your GeoSpock DB Discovery deployment using DataGrip, by downloading the Presto JDBC driver and following these instructions on how to add a JDBC driver and make a connection.

Connection

The instructions below will give you the necessary parameters to input as connection parameters when you specify the connection string, user and password.

HTTP connection

Parameters used:

parameter name value
host <GeoSpockDBDiscoveryHostname>
port 80
User username
Password none
SSL false

where <GeoSpockDBDiscoveryHostname> is the value with the same label in the CloudFormation "Outputs" Tab (e.g. disco-Prest-26Z2LD7S11IPM-2177915702.ap-southeast-1.elb.amazonaws.com).

Connection string:

jdbc:presto://<GeoSpockDBDiscoveryHostname>:80?SSL=false

In the Data Source and Drivers section, this should look as follows:

datagrip_http

HTTPS connection

Parameters used:

parameter name value
host <GeoSpockDeploymentAlias>
port 443
User username
Password none
SSL true

where <GeoSpockDeploymentAlias> is the (sub)domain specified as an alias for the GeoSpock deployment as a CNAME record (e.g. discovery.companyname.com).

Connection string:

jdbc:presto://<GeoSpockDeploymentAlias>:443?SSL=true

In the Data Source and Drivers section, this should looks as follows:

datagrip_https

Explore available tables

This query returns the tables available for querying:

SHOW TABLES
FROM geospock.default

This returns:

datagrip_tables

Individual table inspection

The available tables can be inspected as follows:

1.- Use DESCRIBE to view the table column description, for example sg_erp2_synthesis:

DESCRIBE geospock.default.sg_erp2_synthesis;

This returns:

datagrip_describe_erp2

2.- To view a small sample of the same table:

SELECT *
  FROM geospock.default.sg_erp2_synthesis
 LIMIT 10

This returns:

datagrip_sample_erp2

Tableau

To start using Tableau, its trial version can be downloaded from here. To proceed with the setup of Tableau, follow the steps described here to install the Presto connector. The code examples listed in this section are used in the Tableau workbook example that can be found in our Git repository under discovery > sg_tableau.

Connection

After installing Tableau and setting up the Presto connection, select the Presto connection in the Connect menu:

tableau_connection

Then follow these steps depending on whether your deployment is set up for HTTP or HTTPS.

HTTP Connection

The parameters used are as follows:

parameter name value
Server <GeoSpockDBDiscoveryHostname>
Port 80
Authentication Username
Username username
Password [not required]
Require SSL [unchecked]

where <GeoSpockDBDiscoveryHostname> is the value with the same label in the CloudFormation "Outputs" Tab (e.g. disco-Prest-26Z2LD7S11IPM-2177915702.ap-southeast-1.elb.amazonaws.com).

This can be seen in the following picture:

tableau_http

HTTPS Connection

The parameters used are as follows:

parameter name value
Server <GeoSpockDeploymentAlias>
Port 443
Authentication Username
Username username
Password [not required]
Require SSL [checked]

where <GeoSpockDeploymentAlias> is the (sub)domain specified as an alias for the GeoSpock deployment as a CNAME record (e.g. discovery.companyname.com).

This can be seen in the following picture:

tableau_https

Individual table inspection

From the menu on the left hand side, click Select Schema, enter default then click on the search icon and select the default result:

tableau_schema

The available tables can be inspected using the following methodology suitable for Tableau. Select "New Custom SQL," enter in the SQL commands below, click on "OK" then click on "Update Now."

1.- To view a small sample of the same table:

SELECT *
  FROM geospock.default.sg_erp2_synthesis
 LIMIT 10

This returns:

tableau_sample_erp2

Next Step

Continue your journey through GeoSpock DB Discovery Introduction to Geospatial Queries.