Discover Singapore using Jupyter Notebooks

This page will run you through some queries that can be used to extract insights from the Smart Singapore dataset using Python Notebooks. The queries reflect some of the data access patterns described in the Introduction to Geospatial Queries. If you have never used Python Jupyter Notebooks, these resources might help you to get started:

Note that the pieces of code described here are extracts from a complete notebook which can be found in the examples GitHub repository.

In order to run the queries yourself, please download the whole notebook. You may also need to install some additional libraries, such as Kepler, Pandas, and the PrestoDB plugin for Python.

For these example queries we recommend using a query cluster with 10 worker nodes.

Case Study: Accident investigation in Singapore

Please note that the ERP data referenced here is synthetic, and the storyline described around it is purely fictional.

Extracting a needle from a haystack

A traffic accident occurred on 30th August, 2019, which is of interest to various authorities.

An investigator has asked for our help in examining the accidents data from that date, with a view to identifying the accident, and the behavior of vehicles associated with it.

Q1. Accidents on 30-08-2019

We first get a list of all the accidents on a particular day, from 764,839 yearly incidents. This query takes advantage of timestamp indexing:

QUERY_accidents_on_date="""
SELECT event.*
  FROM geospock.default.sg_traffic_incidents AS event
 WHERE event.timestamp BETWEEN TIMESTAMP '2019-08-30 00:00:00' AND TIMESTAMP '2019-08-31 00:00:00'
   AND event.type IN ('Accident')
"""

After taking out duplicates, we obtain the number of road accidents that occurred on the day in question.

len(accidentsondate)
46

We see that there are 46 accidents on the day of interest.

The investigator informs us that the accident occurring at 17:51 near the airport is the one of particular interest. The speed camera locations are potential sources of key information so we also query this dataset to examine alongside the accidents data.

Q2. Speed camera locations

We look for the speed cameras in the vicinity of the accident:

QUERY_speedcameralocations="""
SELECT *
  FROM geospock.default.singaporespeedlasercamera
"""

Combining the results of queries Q1 and Q2 we can visualize the cameras in yellow and the accidents in red, with the accident-of-interest in blue.

q2_cameralocations

This allows us to notice that there were no speed cameras near the accident. However, we can query the dataset to discover the vehicles which were within 2.5m of the accident location during the 30s preceding the accident. This query is performed on a table containing 1.3 trillion rows.

Q3. Query vehicles near accident

The following query will return the vehicles that were within 2.5m of the accident location during the 30s preceding the accident. This allows us to find out what vehicles were in the vicinity of the accident. This query takes advantage of the geotemporal indexing by accessing a historical localized snapshot:

QUERY_vehiclesnearaccident = """
SELECT DISTINCT(vehicles.erp_vehicle_id) as vehicleid
  FROM geospock.default.sg_erp2_synthesis AS vehicles
 WHERE vehicles.longitude BETWEEN 103.950711575 AND 103.955711575
   AND vehicles.latitude BETWEEN 1.36305117308 AND 1.36805117308
   AND (vehicles.timestamp BETWEEN TIMESTAMP '2019-08-30 17:51:00' - INTERVAL '30' SECOND AND TIMESTAMP '2019-08-30 17:51:00' )
"""    

The results of the query above tell us that there were 106 vehicles that were located within 2.5m of the accident during the 30s beforehand, with the following IDs:

VehicleID
0 b6764ecee8a9ffb02e3db517702825ce
1 fef0517dc50a919071ef28858db4c1fe
2 a466f7c585da79e5313f048f54778cb
... ...

Having identified the vehicles of interest, we would like to know where these vehicles were travelling to and from on their journeys...

Q4. Get vehicle trajectories

With the vehicles IDs we can find their trajectories using the following query that takes advantage of unique vehicle or fleet history by device ID set.

QUERY_ERPtrajectory_vehsnearaccident="""
SELECT erppoints.*
  FROM geospock.default.sg_erp2_synthesis AS erppoints
 WHERE erppoints.erp_vehicle_id IN """+str(tuple(vehiclesnearaccident.VehicleID))+"""
   AND erppoints.timestamp BETWEEN TIMESTAMP '2019-08-30 00:00:00' AND TIMESTAMP '2019-08-31 00:00:00'
"""

The vehicle's route visualizations looks as follows:

q4_getvehicletrajectories

Furthermore, the investigator has a theory that the accident may have involved a taxi driver.

In order to assist the inquiry, we query where any taxis which were in the vicinity of the accident were traveling to/from, and the route they took.

Q5. Get taxi trajectories

The investigator can now identify which taxis were in the vicinity of the accident using unique vehicle or fleet history by device ID set and/or secondary field.

QUERY_ERPtrajectory_taxisnearaccident="""
SELECT erppoints.*
  FROM geospock.default.sg_erp2_synthesis AS erppoints
 WHERE erppoints.erp_vehicle_id IN """+str(tuple(vehiclesnearaccident.VehicleID))+"""
   AND erppoints.erp_vehicle_types IN ('Taxi')
   AND erppoints.timestamp BETWEEN TIMESTAMP '2019-08-30 00:00:00' AND TIMESTAMP '2019-08-31 00:00:00'
"""

That results in:

trajectories_taxis_accident.erp_vehicle_id.unique()
array(['587200174ae46ad63551ccaff3588e11'], dtype=object)

... hence, it can be seen that only one taxi was passing through the accident location at the time of interest, and the investigator can easily see where it was going to and from:

q5_taxitrajectories

The investigator can now make enquiries about this taxi and the driver!

We see that the accident took place at a busy time of day around the area of interest, so it is possible that heavy traffic could have contributed to the accident.

Q6. Individual vehicle analysis. All origins and destinations of one taxi over a month

The investigator now wishes to delve further into the behavior of the particular taxi identified as being associated with the accident. It is thought that the driver is involved in criminal activity!

To assist the investigator, we can query all the origins and destinations of the taxi over the period of interest (one month, in this case).

QUERY_individualvehiclestartends= """
   SELECT
          date_trunc('day',vehicles.timestamp) AS dayref,
          min_by( vehicles.latitude , vehicles.timestamp )  AS startlat,
          min_by( vehicles.longitude , vehicles.timestamp ) AS startlon,
          max_by( vehicles.latitude , vehicles.timestamp ) AS endlat,
          max_by( vehicles.longitude , vehicles.timestamp ) AS endlon,
          min(vehicles.timestamp) as starttime,
          max(vehicles.timestamp) as endtime
     FROM geospock.default.sg_erp2_synthesis AS vehicles
    WHERE vehicles.erp_vehicle_id IN ('587200174ae46ad63551ccaff3588e11')
      AND vehicles.timestamp BETWEEN TIMESTAMP '2019-08-01 00:00:00' AND TIMESTAMP '2019-09-01 00:00:00'
GROUP BY 1
ORDER BY 1
"""

From the graphics visualization it can be seen that the taxi hasn't made too many trips. Perhaps this is due to the fact that this taxi is a masquerade for other more nefarious activities. The investigator now has a few leads to explore thanks to the data analysis conducted. Our job here is done!

q6_origindestinationsoveramonth

Virtual ERP Gantry

Our accident investigator wishes to understand whether heavy traffic may have contributed to the accident. We can assist by querying the counts per minute of distinct vehicle IDs within a defined radius of a gantry located near the accident site.

Q7. Query all vehicles passing through a defined spatial window over a day (grouped by minute)

The following query takes advantage of geotemporal indexing by using the data access pattern known as historical localized snapshot.

QUERY_vehiclecounts_gantry="""
  SELECT date_trunc('minute',erp.timestamp), COUNT(DISTINCT(erp.erp_vehicle_id))
    FROM geospock.default.sg_erp2_synthesis AS erp
   WHERE erp.longitude BETWEEN 103.950711575 AND 103.955711575
     AND erp.latitude BETWEEN 1.36305117308 AND 1.36805117308
     AND timestamp BETWEEN TIMESTAMP '2019-08-30 00:00:00' AND TIMESTAMP '2019-08-31 00:00:00'
GROUP BY 1
ORDER BY 1
"""

The vehicle counts of this virtual EPR gantry are shown in the figure below. From there it can be seen that the traffic around that area looks like what we would expect, with peaks at around 6:00-8:00 and then 16:00-19:00 due to rush hour traffic.

q7_vehiclespassingthroughspatialwindow

Q8. Traffic speeds. Speed bands in vicinity of accident

Moreover, to assess the effect that the accident had on traffic, the investigator wishes to understand the traffic speeds in the vicinity of the accident.

We thus query the minimum traffic speeds within a few meters of the accident location over the time period of the accident. This query uses the data access pattern spatial join of events vs. POIs.

QUERY_speedbands_near_accident = """
SELECT speedbandtable.*
  FROM geospock.default.sg_speed_bands AS speedbandtable
 WHERE (ST_Distance(ST_Point(speedbandtable.longitude1, speedbandtable.latitude1), ST_Point(103.95321157469873, 1.3655511730786898)) < 0.00025
   AND speedbandtable.timestamp BETWEEN TIMESTAMP '2019-08-30 15:00:00' AND TIMESTAMP '2019-08-31 00:00:00')
"""

q8_speedbands

We have explored a few example queries you can run using Python and Jupyter Notebooks with GeoSpock DB Discovery. This is by no means an exhaustive list of the types of analysis you can do – please continue your Discovery journey by trying out new queries, using our examples for inspiration. You never know when the investigator will be back with more questions about that taxi…