Data source description files

For each dataset you want to ingest, you will need to create a data source description. A data source description is a JSON file that describes the overall structure of the source data, and each of the fields within the data. The GeoSpock database uses field types to generate appropriate indexes and to arrange the ingested data so that it can be retrieved efficiently when SQL queries are run to analyze the data.

The source data description contains the following sections:

  • the source data file format definition, that states the format of the source data file; see Specifying the source data file format
  • the event block or location block, that provides the fields in the source data that describe the event or location; see Specifying the type of dataset
  • (optional) a properties block, that describes any other properties in the source data that you want to ingest into the GeoSpock database; see Defining other fields

Note that source fields should only be referenced once in a data source description.

Specifying the source data file format

The lines at the top of the data source description describe the format of the source files. For example:

{
  "format": "TABULAR",
  "tabularFormatSeparator": "\t",
  "location": [
    {
      "id": "longitude",
      ...

The following table indicates the appropriate values for format and tabularFormatSeparator (if needed) depending on the format of the input data files:

Source data file format Data source description configuration Field object definition
Comma separated values (csv) "format": "TABULAR", "tabularFormatSeparator": ",", See Describing CSV format data
Tab separated values (tsv) "format": "TABULAR", "tabularFormatSeparator": "\t", See Describing TSV format data
JSON Lines (jsonl) "format": "JSON_LINES", See Describing JSON lines data
Parquet (parquet) "format": "PARQUET", See Describing Parquet data

Specifying the type of dataset

The GeoSpock database supports datasets containing:

  • location-based data representing fixed locations, for example "Points of Interest"
  • event-based data representing events that occur at an instant in time, with or without a location. (Event data without a spatial element is sometimes called "Time series" data.)

For each type of dataset, you must provide the details of the fields in the source data that describe either the event or the location.

Location-based data

For location-based data, the data source description must contain a location block, which should define:

  • a latitude field
  • a longitude field
  • (recommended) a source_id field

The following is an example of a location block for a CSV or TSV input file:

...
"location": [
  {
    "id": "longitude",
    "sourceFieldIndex": 0,
    "purpose": "LONGITUDE",
    "sqlType": "DOUBLE"
  },
  {
    "id": "latitude",
    "sourceFieldIndex": 1,
    "purpose": "LATITUDE",
    "sqlType": "DOUBLE"
  },
  {
    "id": "organization_id",
    "sourceFieldIndex": 7,
    "purpose": "SOURCE_ID",
    "sqlType": "VARCHAR"
  }
],
...

A row will be considered valid only if both of the values associated with the LATITUDE and LONGITUDE fields defined in the location block are not NULL.

Event-based data

For event-based data, the data source description must contain an event block, which should define:

  • a timestamp field
  • a latitude field (if the data has a geospatial element)
  • a longitude field (if the data has a geospatial element)
  • (recommended) a source_id field

The following is an example of an event block for CSV or TSV input, where the events record both time and location:

...
"event": [
    {
      "id": "longitude",
      "sourceFieldIndex": 0,
      "purpose": "LONGITUDE",
        "sqlType": "DOUBLE"
    },
    {
       "id": "latitude",
       "sourceFieldIndex": 1,
       "purpose": "LATITUDE",
       "sqlType": "DOUBLE"
    },
    {
       "id": "date",
       "sourceFieldIndex": 6,
       "sqlType": "TIMESTAMP",
       "sourceFieldFormat": "TIMESTAMP_ISO_8601"
    },
    {
       "id": "taxi-id",
       "sourceFieldIndex": 7,
       "purpose": "SOURCE_ID",
       "sqlType": "VARCHAR"
    }
],
...

The following is an example of an event block for CSV or TSV input, recording time series data (with no associated location):

...
"event": [
  {
    "id": "date",
      "sourceFieldIndex": 6,
      "sqlType": "TIMESTAMP",
      "sourceFieldFormat": "TIMESTAMP_ISO_8601"
    }
],
...

Of the fields defined in an event block, only those fields defined with purpose SOURCE_ID can be NULL. Any other field with a NULL value will cause the row to be considered invalid and, in that case, the entire row will be skipped during ingestion.

It is strongly recommended that if you have more than one field for latitude, longitude or timestamp, you should avoid using ids that are identical apart from an underscore and a number for these fields, such as timestamp_1, as this negatively impacts the query optimizations in the GeoSpock database.

Defining other fields

If your source data file contains fields in addition to those that describe the event or location, you can describe these in the properties block:

...
"properties": [
  {
    "id": "catchment_square_meters",
    "sourceFieldIndex": 0,
    "sqlType": "BIGINT"
  },
  {
    "id": "street_number",
    "sourceFieldIndex": 3,
    "sqlType": "INTEGER"
  },
  {
    "id": "category_id",
    "sourceFieldIndex": 11,
    "sqlType": "SMALLINT"
  },
  {
    "id": "num_floors",
    "sourceFieldIndex": 5,
    "sqlType": "TINYINT"
  },
  {
    "id": "zip_code",
    "sourceFieldIndex": 6,
    "sqlType": "VARCHAR"
  },
  ...

Any field in the properties block allows NULL values.

Be aware that the properties block must not include a LATITUDE or LONGITUDE field.

Example data source description

Consider the following sample of data, representing the movements of vehicles around a city.

latitude  longitude timestamp vehicle ID  vehicle type
1.3468452048845734  103.77759138975713  1547281171805 c072816570920127844e4a3a957f976c  Car
1.3468362566023402  103.77753151768414  1547304203290 ca763bcdd1ddba480e432c9b6332333a  Private Hire
1.346850537582713 103.77749907923095  1547300495504 450df73ceb28fa5c8147091101d300c2  Motorcycle
1.3468589195201877  103.77765162984358  1547282393230 3e350f6de15609061d13b744e2347bc Car
1.3468487293400808  103.7775971056981 1547286744324 e6e80b138123926d6d4c5ec24584a13f  Car
1.346878279672841 103.77746281842732  1547281172805 c072816570920127844e4a3a957f976c  Car
1.3468864714920847  103.77747397980588  1547304204290 ca763bcdd1ddba480e432c9b6332333a  Private Hire
1.346893793872903 103.77752868814059  1547300496504 450df73ceb28fa5c8147091101d300c2  Motorcycle
1.3469003733615503  103.77738732840608  1547282394230 3e350f6de15609061d13b744e2347bc Car
1.3468916397979644  103.77742773619538  1547286745324 e6e80b138123926d6d4c5ec24584a13f  Car

This dataset represents event-based data in TSV format. Each vehicle is identified with a unique "vehicle ID". The complete data source description for this data would look like:

{
  "format": "TABULAR",
  "tabularFormatSeparator": "\t",
  "event": [
    {
      "id": "latitude",
      "purpose": "LATITUDE",
      "sqlType": "DOUBLE",
      "sourceFieldIndex": 0
    },
    {
      "id": "longitude",
      "purpose": "LONGITUDE",
      "sqlType": "DOUBLE",
      "sourceFieldIndex": 1
    },
    {
      "id": "timestamp",
      "sqlType": "TIMESTAMP",
      "sourceFieldIndex": 2,
      "sourceFieldFormat": "TIMESTAMP_EPOCH_MILLIS"
    },
    {
      "id": "vehicle_id",
      "purpose": "SOURCE_ID",
      "sqlType": "VARCHAR",
      "sourceFieldIndex": 3
    }
  ],
  "properties": [
    {
      "id": "vehicle_type",
      "sqlType": "VARCHAR",
      "sourceFieldIndex": 4
    }
  ]
}

This example demonstrates a number of different values for sqlType. Please refer to Types of data for more information on supported values for sqlType.

In this case, the time values are represented as milliseconds since the UNIX epoch (January 1st, 1970). For more information on this and other supported time formats, see Timestamp formats.

The example also illustrates the use of purpose to identify fields for Latitude, Longitude and "Source ID". For more information on identifying fields with a specific purpose, see Special fields.