First Steps with GeoSpock DB Discovery

Once you have deployed GeoSpock DB Discovery, you can start experimenting with the platform. New deployments have a set of sample datasets that you can use to test the capabilities of the system.

You can connect to GeoSpock DB using the instructions in this guide and then execute SQL queries to explore the data. To illustrate how to use these queries in practice, we provide example use cases for Python Jupyter Notebooks and Tableau.

Finally, for specific GeoSpock DB details, users are encouraged to visit the GeoSpock DB main product documentation.

Connecting to GeoSpock DB Discovery

You can connect to GeoSpock DB Discovery using Presto CLI, Python, DataGrip, Tableau or any BI tool of your choice that uses a JDBC/ODBC connector that's supported by GeoSpock DB.

In the following sections you can see some examples of how to connect to GeoSpock DB Discovery and explore the datasets. For further details on how to integrate with different tools, please visit the main product documentation.

As the the Presto SQL cluster is located in the AWS Asia Pacific Singapore Region (ap-southeast-1), it may be worth considering deploying a SageMaker instance or any other third party tool supported by GeoSpock DB in the same region as an alternative to a your local analytics environment. This setup would reduce latency, data transfer costs and increase data transfer speeds between the cluster and the instance.

For HTTP deployments, the value of “GeoSpockDBDiscoveryHostname” should be replaced by the "Value," which can be found under the "Outputs" tab of the CloudFormation Stack – e.g., disco-Prest-26Z2LD7S11IPM-2177915702.ap-southeast-1.elb.amazonaws.com. If you have enabled HTTPS / SSL then this will be your own custom DNS CNAME entry – e.g., discovery.companyname.com. GeoSpock DB Discovery doesn’t use any authentication, however, some connection methods require a Username to be provided and we therefore provide Username and no password in the examples.

Warning: It would be generally advisable to avoid performing queries that perform full table scans on large datasets, such as SELECT COUNT(*) FROM geospock.default.sg_erp2_synthesis; as this would go through 1.3 trillion rows. If you accidentally start a query that performs a full table scan on a large dataset such as sg_erp2_synthesis, it might be worth considering terminating such query.

Presto CLI

The Presto CLI can be downloaded and installed following these instructions.

Connection

The following auxiliary functions will enable you to connect to GeoSpock DB to start running queries. As there are two possible authentication methods based on HTTP and HTTPS, depending if you specified a SSL certificate during the CloudFormation step, the following commands will allow you to connect to GeoSpock DB in these two cases.

1.- HTTP connection

The parameters used are as follows:

parameter name value
host GeoSpockDBDiscoveryHostname
port 80
username ''
password ''

Connection string:

presto --server http://<GeoSpockDBDiscoveryHostname>:80

2.- HTTPS connection

parameter name value
host GeoSpockDBDiscoveryHostname
port 443
username ''
password ''

Connection string:

presto --server https://<GeoSpockDBDiscoveryHostname>:443

Explore available tables

All available datasets can be found in the catalog geospock and schema default. The following query shows the tables available for querying:

presto> SHOW TABLES FROM geospock.default;

which output is as follows and that it may come in a different oder depending on the name and number of tables available:

               Table               
-----------------------------------
sg_air_temperature
sg_erp1_camera_location
sg_erp2_synthesis  
...

Query 20200902_063348_00239_8qm98, FINISHED, 4 nodes
Splits: 53 total, 53 done (100.00%)
0:17 [13 rows, 513B] [0 rows/s, 30B/s]

Please note that Tableau does not support the SHOW tables FROM geospock.default query used above. Tableau will work with the queries specified in the Tableau section of the guide.

Individual table inspection

The available tables can be inspected using the following methodology.

1.- Use DESCRIBE to view the table column description, for example sg_erp2_synthesis:

presto>  DESCRIBE geospock.default.sg_erp2_synthesis;

for which the output is:

      Column       |   Type    | Extra |     Comment     
-------------------+-----------+-------+-----------------
 longitude         | double    |       | LONGITUDE index
 latitude          | double    |       | LATITUDE index  
 erp_vehicle_id    | varchar   |       | Nullable        
 timestamp         | timestamp |       | TIME index      
 erp_vehicle_types | varchar   |       | Nullable        
(5 rows)

Query 20200916_082735_00490_cqshz, FINISHED, 11 nodes
Splits: 172 total, 172 done (100.00%)
0:22 [5 rows, 442B] [0 rows/s, 20B/s]

2.- To view a small sample of the same table:

presto> SELECT * FROM geospock.default.sg_erp2_synthesis LIMIT 10;

for which the output is:


     longitude      |      latitude      |          erp_vehicle_id          |        timestamp        | erp_vehicle_types
--------------------+--------------------+----------------------------------+-------------------------+-------------------
 103.73445229884074 | 1.3400738315291727 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:21.255 | Car               
 103.73426582083782 |  1.339926387628572 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:22.255 | Car               
 103.73430252560004 | 1.3397610994525762 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:23.255 | Car               
 103.73421822015905 |  1.339612635428762 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:24.255 | Car               
 103.73416292653349 | 1.3394466020267752 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:25.255 | Car               
  103.7341947795847 | 1.3393167331370714 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:26.255 | Car               
 103.73409371197609 | 1.3391274762104368 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:27.255 | Car               
 103.73417451022304 | 1.3389805961732097 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:28.255 | Car               
 103.73417334337866 | 1.3388262946437501 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:29.255 | Car               
 103.73405110217102 |  1.338642939779033 | 10001dfc0cc1e0539eb68dd0705e3b14 | 2019-01-02 19:29:30.255 | Car               
(10 rows)

Query 20200916_083040_00498_cqshz, FINISHED, 10 nodes
Splits: 1,017 total, 75 done (7.37%)
0:09 [59.4K rows, 1.31GB] [6.6K rows/s, 149MB/s]

Please note that your results may differ slightly from the output example listed above.

Python

While there are several ways of installing Python, we would recommend starting with the Anaconda distribution, which includes the most popular packages. Please bear in mind that some examples may require installation of additional libraries, such as Pandas and the PrestoDB plugin for Python.

The code examples listed in this section can be found in our Git repository under discovery > sg_python.

The following command functions will help make connecting to and querying with GeoSpock DB easier. There are two possible authentication methods based on HTTP and HTTPS, depending on whether a SSL certificate was specified during the CloudFormation setup. In either case, plead read on for guidance to help you connect to GeoSpock DB.

1.- HTTP Connection

The connection parameters used are as follows:

parameter name value
host GeoSpockDBDiscoveryHostname
port 80
username username
password ''
auth commented

While the following set of commands do not return any visible output, they assign the connection object to the variable newconn which will be used later for queries.

import prestodb  
import datetime
import pandas as pd

def makeconn(hstb,uname,pswd):
    return prestodb.dbapi.connect(
    host=hstb,
    port=80,
    user=uname.strip(),
    catalog='geospock',
    schema='default',
    http_scheme='http'
    )


hst      = 'GeoSpockDBDiscoveryHostname' # e.g `disco-Prest-26Z2LD7S11IPM-2177915702.ap-southeast-1.elb.amazonaws.com`
username = 'username'
password = ''
newconn  = makeconn (hst,username,password)    

2.- HTTPS Connection

The parameters used are as follows:

parameter name value
host GeoSpockDBDiscoveryHostname
port 443
username username
password ''
auth prestodb.auth.BasicAuthentication(uname.strip(), pswd.strip())

While the following set of commands do not return any visible output, it assigns the connection object to the variable newconn which can be later will be used to make queries.

import prestodb  
import datetime
import pandas as pd

def makeconn(hstb,uname,pswd):
    return prestodb.dbapi.connect(
    host=hstb,
    port=443,
    user=uname.strip(),
    catalog='geospock',
    schema='default',
    http_scheme='https',    
    auth=prestodb.auth.BasicAuthentication(uname.strip(), pswd.strip())
    )


hst      = 'GeoSpockDBDiscoveryHostname' #e.g. discovery.companyname.com
username = 'username'
password = ''
newconn  = makeconn (hst,username,password)    

Explore available tables

The following function will enable you to run queries:

import pandas

def runquery(cnn,cls,querytext):
    sttime = datetime.datetime.now()
    print("Starting query at: "+str(sttime)+".....")
    cur = cnn.cursor()
    cur.execute(querytext)
    rows = cur.fetchall()
    cur.close()
    endtime = datetime.datetime.now()
    print("Finished query at: "+str(endtime)+".")
    tdelt=(endtime-sttime).seconds
    print("Query duration: "+str(tdelt)+" seconds.")
    print("Rows returned: "+str(len(rows)))
    return pd.DataFrame(rows,columns=cls)

This query returns the available tables for querying, which may be returned in any order:

QUERY_tables = """
SHOW TABLES
FROM geospock.default
"""
cls     = ['tables']
tables = runquery(newconn,cls,QUERY_tables)
tables

This outputs:

tables
0 sg_air_temperature
1 sg_erp1_camera_location
2 sg_erp2_synthesis
... ...

Individual table description

The available tables can be inspected using the following methodology.

1.- Use DESCRIBE to view the table column description, for example sg_erp2_synthesis:

QUERY_describe = """
DESCRIBE geospock.default.sg_erp2_synthesis
"""
cls           = ['column', 'type', 'extra', 'comment']
describe_erp2 = runquery(newconn,cls,QUERY_describe)
describe_erp2

This returns:

column type extra comment
0 longitude double LONGITUDE index
1 latitude double LATITUDE index
2 erp_vehicle_id varchar Nullable
3 timestamp timestamp TIME index
4 erp_vehicle_type varchar Nullable

2.- To view a small sample of the same table:

QUERY_tables = """
SELECT *
FROM geospock.default.sg_erp2_synthesis
LIMIT 10
"""
cls         = ['longitude', 'latitiude', 'vehicle_id', 'timestamp', 'vehicle_type']
sample_erp2 = runquery(newconn,cls,QUERY_tables)
sample_erp2

This returns:

longitude latitude vehicle_id timestamp vehicle_type
0 103.864651 1.389269 100020beff0d3c8938eca28f9baa7ea 2019-01-06 21:02:47.358 car
1 103.864598 1.389288 100020beff0d3c8938eca28f9baa7ea 2019-01-06 21:02:48.358 car
... ... ... ... ... ...

Your results may differ slightly from the output listed above.

DataGrip

You can connect to your GeoSpock DB Discovery deployment using DataGrip, by downloading the Presto JDBC driver and following these instructions on how to add a JDBC driver and make a connection. These instructions below will give you the necessary parameters to input as connection parameters when you specify the connection string, user and password in the steps required.

1.- HTTP connection

Parameters used:

parameter name value
host GeoSpockDBDiscoveryHostname
port 80
username username
password none
SSL false

Connection string:

jdbc:presto://GeoSpockDBDiscoveryHostname:80?SSL=false

In the Data Source and Drivers section, this should looks as follows:

datagrip_http

2.- HTTPS connection

Parameters used:

parameter name value
host GeoSpockDBDiscoveryHostname
port 443
username username
password none
SSL true

Connection string:

jdbc:presto://GeoSpockDBDiscoveryHostname:443?SSL=true

In the Data Source and Drivers section, this should looks as follows:

datagrip_https

Explore available tables

This query returns the tables available for querying:

SHOW TABLES
FROM geospock.default

This returns:

datagrip_tables

Individual table inspection

The available tables can be inspected as follows:

1.- Use DESCRIBE to view the table column description, for example sg_erp2_synthesis:

DESCRIBE geospock.default.sg_erp2_synthesis;

This returns:

datagrip_describe_erp2

2.- To view a small sample of the same table:

SELECT *
  FROM geospock.default.sg_erp2_synthesis
 LIMIT 10

This returns:

datagrip_sample_erp2

Tableau

To start using Tableau, its trial version can be downloaded from here. To proceed with the setup of Tableau, follow the steps described here to install the Presto connector. The code examples listed in this section are used in the Tableau workbook example that can be found in our Git repository under discovery > sg_tableau.

Connection

After installing Tableau and setting up the Presto connection, select the Presto connection in the Connect menu:

tableau_connection

Then follow these step depending on which authentication method was selected in the CloudFormation setup as there are two possible authentication methods based on HTTP and HTTPS.

1.- HTTP Connection

The parameters used are as follows:

parameter name value
server GeoSpockDBDiscoveryHostname
port 80
authentication method username
username username
password ''
Require SSL unchecked

This can be seen in the following picture:

tableau_http

2.- HTTPS Connection

The parameters used are as follows:

parameter name value
server GeoSpockDBDiscoveryHostname
port 443
authentication method username
username username
password ''
Require SSL checked

This can be seen in the following picture:

tableau_https

Individual table inspection

From the menu on the left hand side, click Select Schema, enter default then click on the search icon and select the default result:

tableau_schema

The available tables can be inspected using the following methodology suitable for Tableau. Select "New Custom SQL," enter in the SQL commands below, click on "OK" then click on "Update Now."

1.- To view a small sample of the same table:

SELECT *
  FROM geospock.default.sg_erp2_synthesis
 LIMIT 10

This returns:

tableau_sample_erp2