Data source description files

For each dataset you want to ingest, you will need to create a data source description. A data source description is a JSON file that describes the overall structure of the source data, and each of the fields within the data. The GeoSpock database uses field types to generate appropriate indexes and to arrange the ingested data so that it can be retrieved efficiently when SQL queries are run to analyze the data.

The source data description contains the following sections:

  • the source data file format definition, that states the format of the source data file; see Specifying the source data file format
  • the event block or location block, that provides the fields in the source data that describe the event or location; see Types of dataset
  • (optional) a properties block, that describes any other properties in the source data that you want to ingest into the GeoSpock database; see Defining other fields

Note that source fields should only be referenced once in a data source description.

Specifying the source data file format

At the top of the data source description, there is a section that describes the format of the source files. For example:

{
    "format": "TABULAR","tabularFormatSeparator": "\t",
    "location": [
        {
            "id": "longitude",
    ...

The following table indicates the appropriate values for format and tabularFormatSeparator (if needed) depending on the format of the input data files:

Source data file format Data source description configuration Field object definition
Comma separated values (csv) "format": "TABULAR", "tabularFormatSeparator": ",", See Describing CSV format data
Tab separated values (tsv) "format": "TABULAR", "tabularFormatSeparator": "\t", See Describing TSV format data
JSON Lines (jsonl) "format": "JSON_LINES", See Describing JSON lines data
Parquet (parquet) "format": "PARQUET", See Describing Parquet data

Specifying the type of dataset

The GeoSpock database supports datasets containing:

  • location-based data
  • events-based data

For each type of dataset, you must provide the details of the fields in the source data that describe either the event or the location.

Location-based data

The data source description must contain a location block, which should define:

  • a latitude field
  • a longitude field
  • (recommended) a source_id field

The following is an example of a location block for a CSV or TSV input file:

...
"location": [
        {
            "id": "longitude",
            "sourceFieldIndex": 1,
            "purpose": "LONGITUDE",
            "sqlType": "DOUBLE"
        },
        {
            "id": "latitude",
            "sourceFieldIndex": 2,
            "purpose": "LATITUDE",
            "sqlType": "DOUBLE"
        },
        {
            "id": "organization_id",
            "sourceFieldIndex": 8,
            "purpose": "SOURCE_ID",
            "sqlType": "VARCHAR"
        }
    ],
...

A row will be considered valid only if both of the values associated with the LATITUDE and LONGITUDE fields defined in the location block are not NULL.

Event-based data

The data source description must contain an event block, which should define:

  • a timestamp field
  • a latitude field for geospatial data
  • a longitude field for geospatial data
  • (recommended) a source_id field

The following is an example of an event block for CSV or TSV input, where the events record both time and location:

...
"event": [
    {
        "id": "longitude",
        "sourceFieldIndex": 1,
        "purpose": "LONGITUDE",
        "sqlType": "DOUBLE"
    },
    {
        "id": "latitude",
        "sourceFieldIndex": 2,
        "purpose": "LATITUDE",
        "sqlType": "DOUBLE"
    },
    {
        "id": "date",
        "sourceFieldIndex": 7,
        "sqlType": "TIMESTAMP",
        "sourceFieldFormat": "TIMESTAMP_ISO_8601"
    },
    {
        "id": "taxi-id",
        "sourceFieldIndex": 8,
        "purpose": "SOURCE_ID",
        "sqlType": "VARCHAR"
    }
    ],
...

The following is an example of an event block for CSV or TSV input, recording time series data (with no associated location):

...
"event": [
    {
        "id": "date",
        "sourceFieldIndex": 7,
        "sqlType": "TIMESTAMP",
        "sourceFieldFormat": "TIMESTAMP_ISO_8601"
    }
    ],
...

The only field that allows NULL values is SOURCE_ID. Any other field with a NULL value will cause the row to be considered invalid and, in that case, the entire row will be skipped during ingestion.

It is strongly recommended that if you have more than one field for latitude, longitude or timestamp, you should avoid using ids that are identical apart from an underscore and a number for these fields, such as timestamp_1, as this negatively impacts the query optimizations in the GeoSpock database.

Defining other fields

If your source data file contains fields in addition to those that describe the event or location, you can describe these in the properties block:

...
"properties": [
        {
            "id": "catchment_square_meters",
            "sourceFieldIndex": 0,
            "sqlType": "BIGINT"
        },
        {
            "id": "street_number",
            "sourceFieldIndex": 3,
            "sqlType": "INTEGER"
        },
        {
            "id": "category_id",
            "sourceFieldIndex": 11,
            "sqlType": "SMALLINT"
        },
        {
            "id": "num_floors",
            "sourceFieldIndex": 5,
            "sqlType": "TINYINT"
        },
        {
            "id": "zip_code",
            "sourceFieldIndex": 6,
            "sqlType": "VARCHAR"
        },
...

Any field in the properties block allows NULL values.

Be aware that the properties block must not include a LATITUDE or LONGITUDE field.

For more information on the format required to describe fields in your source data, refer to:

For more information on supported values for the purpose property, see Special fields.

For more information on supported values for the sqlType property, see Types of data.