Discover Singapore using Jupyter Notebooks
This page will run you through some queries that can be used to extract insights from the Smart Singapore dataset using Python Notebooks. The queries reflect some of the data access patterns described in the Introduction to Geospatial Queries. If you have never used Python Jupyter Notebooks, these resources might help you to get started:
Note that the pieces of code described here are extracts from a complete notebook which can be found in the examples GitHub repository.
For these example queries we recommend using a query cluster with 10 worker nodes.
Case Study: Accident investigation in Singapore
Please note that the ERP data referenced here is synthetic, and the storyline described around it is purely fictional.
Extracting a needle from a haystack
A traffic accident occurred on 30th August, 2019, which is of interest to various authorities.
An investigator has asked for our help in examining the accidents data from that date, with a view to identifying the accident, and the behavior of vehicles associated with it.
Q1. Accidents on 30-08-2019
We first get a list of all the accidents on a particular day, from 764,839 yearly incidents. This query takes advantage of timestamp indexing:
QUERY_accidents_on_date=""" SELECT event.* FROM geospock.default.sg_traffic_incidents AS event WHERE event.timestamp BETWEEN TIMESTAMP '2019-08-30 00:00:00' AND TIMESTAMP '2019-08-31 00:00:00' AND event.type IN ('Accident') """
After taking out duplicates, we obtain the number of road accidents that occurred on the day in question.
We see that there are 46 accidents on the day of interest.
The investigator informs us that the accident occurring at 17:51 near the airport is the one of particular interest. The speed camera locations are potential sources of key information so we also query this dataset to examine alongside the accidents data.
Q2. Speed camera locations
We look for the speed cameras in the vicinity of the accident:
QUERY_speedcameralocations=""" SELECT * FROM geospock.default.singaporespeedlasercamera """
This allows us to notice that there were no speed cameras near the accident. However, we can query the dataset to discover the vehicles which were within 2.5m of the accident location during the 30s preceding the accident. This query is performed on a table containing 1.3 trillion rows.
Q3. Query vehicles near accident
The following query will return the vehicles that were within 2.5m of the accident location during the 30s preceding the accident. This allows us to find out what vehicles were in the vicinity of the accident. This query takes advantage of the geotemporal indexing by accessing a historical localized snapshot:
QUERY_vehiclesnearaccident = """ SELECT DISTINCT(vehicles.erp_vehicle_id) as vehicleid FROM geospock.default.sg_erp2_synthesis AS vehicles WHERE vehicles.longitude BETWEEN 103.950711575 AND 103.955711575 AND vehicles.latitude BETWEEN 1.36305117308 AND 1.36805117308 AND (vehicles.timestamp BETWEEN TIMESTAMP '2019-08-30 17:51:00' - INTERVAL '30' SECOND AND TIMESTAMP '2019-08-30 17:51:00' ) """
The results of the query above tell us that there were 106 vehicles that were located within 2.5m of the accident during the 30s beforehand, with the following IDs:
Having identified the vehicles of interest, we would like to know where these vehicles were travelling to and from on their journeys...
Q4. Get vehicle trajectories
With the vehicles IDs we can find their trajectories using the following query that takes advantage of unique vehicle or fleet history by device ID set.
QUERY_ERPtrajectory_vehsnearaccident=""" SELECT erppoints.* FROM geospock.default.sg_erp2_synthesis AS erppoints WHERE erppoints.erp_vehicle_id IN """+str(tuple(vehiclesnearaccident.VehicleID))+""" AND erppoints.timestamp BETWEEN TIMESTAMP '2019-08-30 00:00:00' AND TIMESTAMP '2019-08-31 00:00:00' """
The vehicle's route visualizations looks as follows:
Furthermore, the investigator has a theory that the accident may have involved a taxi driver.
In order to assist the inquiry, we query where any taxis which were in the vicinity of the accident were traveling to/from, and the route they took.
Q5. Get taxi trajectories
The investigator can now identify which taxis were in the vicinity of the accident using unique vehicle or fleet history by device ID set and/or secondary field.
QUERY_ERPtrajectory_taxisnearaccident=""" SELECT erppoints.* FROM geospock.default.sg_erp2_synthesis AS erppoints WHERE erppoints.erp_vehicle_id IN """+str(tuple(vehiclesnearaccident.VehicleID))+""" AND erppoints.erp_vehicle_types IN ('Taxi') AND erppoints.timestamp BETWEEN TIMESTAMP '2019-08-30 00:00:00' AND TIMESTAMP '2019-08-31 00:00:00' """
That results in:
... hence, it can be seen that only one taxi was passing through the accident location at the time of interest, and the investigator can easily see where it was going to and from:
The investigator can now make enquiries about this taxi and the driver!
We see that the accident took place at a busy time of day around the area of interest, so it is possible that heavy traffic could have contributed to the accident.
Q6. Individual vehicle analysis. All origins and destinations of one taxi over a month
The investigator now wishes to delve further into the behavior of the particular taxi identified as being associated with the accident. It is thought that the driver is involved in criminal activity!
To assist the investigator, we can query all the origins and destinations of the taxi over the period of interest (one month, in this case).
QUERY_individualvehiclestartends= """ SELECT date_trunc('day',vehicles.timestamp) AS dayref, min_by( vehicles.latitude , vehicles.timestamp ) AS startlat, min_by( vehicles.longitude , vehicles.timestamp ) AS startlon, max_by( vehicles.latitude , vehicles.timestamp ) AS endlat, max_by( vehicles.longitude , vehicles.timestamp ) AS endlon, min(vehicles.timestamp) as starttime, max(vehicles.timestamp) as endtime FROM geospock.default.sg_erp2_synthesis AS vehicles WHERE vehicles.erp_vehicle_id IN ('587200174ae46ad63551ccaff3588e11') AND vehicles.timestamp BETWEEN TIMESTAMP '2019-08-01 00:00:00' AND TIMESTAMP '2019-09-01 00:00:00' GROUP BY 1 ORDER BY 1 """
From the graphics visualization it can be seen that the taxi hasn't made too many trips. Perhaps this is due to the fact that this taxi is a masquerade for other more nefarious activities. The investigator now has a few leads to explore thanks to the data analysis conducted. Our job here is done!
Virtual ERP Gantry
Our accident investigator wishes to understand whether heavy traffic may have contributed to the accident. We can assist by querying the counts per minute of distinct vehicle IDs within a defined radius of a gantry located near the accident site.
Q7. Query all vehicles passing through a defined spatial window over a day (grouped by minute)
The following query takes advantage of geotemporal indexing by using the data access pattern known as historical localized snapshot.
QUERY_vehiclecounts_gantry=""" SELECT date_trunc('minute',erp.timestamp), COUNT(DISTINCT(erp.erp_vehicle_id)) FROM geospock.default.sg_erp2_synthesis AS erp WHERE erp.longitude BETWEEN 103.950711575 AND 103.955711575 AND erp.latitude BETWEEN 1.36305117308 AND 1.36805117308 AND timestamp BETWEEN TIMESTAMP '2019-08-30 00:00:00' AND TIMESTAMP '2019-08-31 00:00:00' GROUP BY 1 ORDER BY 1 """
The vehicle counts of this virtual EPR gantry are shown in the figure below. From there it can be seen that the traffic around that area looks like what we would expect, with peaks at around 6:00-8:00 and then 16:00-19:00 due to rush hour traffic.
Q8. Traffic speeds. Speed bands in vicinity of accident
Moreover, to assess the effect that the accident had on traffic, the investigator wishes to understand the traffic speeds in the vicinity of the accident.
We thus query the minimum traffic speeds within a few meters of the accident location over the time period of the accident. This query uses the data access pattern spatial join of events vs. POIs.
QUERY_speedbands_near_accident = """ SELECT speedbandtable.* FROM geospock.default.sg_speed_bands AS speedbandtable WHERE (ST_Distance(ST_Point(speedbandtable.longitude1, speedbandtable.latitude1), ST_Point(103.95321157469873, 1.3655511730786898)) < 0.00025 AND speedbandtable.timestamp BETWEEN TIMESTAMP '2019-08-30 15:00:00' AND TIMESTAMP '2019-08-31 00:00:00') """
We have explored a few example queries you can run using Python and Jupyter Notebooks with GeoSpock DB Discovery. This is by no means an exhaustive list of the types of analysis you can do – please continue your Discovery journey by trying out new queries, using our examples for inspiration. You never know when the investigator will be back with more questions about that taxi…