Creating a data source description for a dataset
A data source description is a file in JSON format that describes the:
- source data file format
- type of dataset (event-based or location-based)
- format of each field that you want to ingest
Building a data source description
Rather than creating a data source description file from scratch, you can use the GeoSpock CLI
data-source-description command to create one. This command parses the source input data and returns a
data source description for the dataset that fits the sampled data.
Be aware that:
- this command does not support compressed data, so you should decompress the sample file before running this command
- the sample file is limited to 300MB in size
$ geospock data-source-description --data-url "s3://path-to-file/nyc-new-taxi-data-Pickups-Snapped-sample.csv"
You should review this data source description to make sure that the SQL data types assigned to each field are correct and that the source data will be ingested in the way you want.
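The two constraints on the sample file (uncompressed, at most 300MB) can be checked on a local copy before it is uploaded. The following is a minimal sketch, not part of the GeoSpock CLI: the function name `check_sample_file`, the reading of "MB" as 10**6 bytes, and the short list of compression signatures are all assumptions made for illustration.

```python
import os

# Limit taken from the text above; treating "MB" as 10**6 bytes is an assumption.
MAX_SAMPLE_BYTES = 300 * 1000 * 1000

# Leading "magic" bytes of a few common compressed formats (not exhaustive).
COMPRESSED_SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"PK\x03\x04": "zip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
}


def check_sample_file(path):
    """Return a list of problems found with a local sample file (empty if none)."""
    problems = []

    # Size check against the documented limit.
    size = os.path.getsize(path)
    if size > MAX_SAMPLE_BYTES:
        problems.append(f"file is {size} bytes, over the {MAX_SAMPLE_BYTES}-byte limit")

    # Compression check: compare the first bytes against known signatures.
    with open(path, "rb") as f:
        head = f.read(8)
    for signature, name in COMPRESSED_SIGNATURES.items():
        if head.startswith(signature):
            problems.append(f"file looks {name}-compressed; decompress it first")

    return problems
```

An empty list means neither problem was detected; it does not guarantee that the command will accept the file.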
Reviewing the data source description
Before using a data source description generated by the CLI, you should check the:
- structure of the data source description; see The data source description structure
- source file format; see Specifying the source data file format
- dataset type has been defined correctly; see Types of dataset
- (optional) properties block contains any other fields not in the event or location block that you want to ingest; see Defining other fields
- each field that you want to ingest has been described correctly, including its:
- id: the name for the column in the resulting SQL table. If you have more than one field for latitude, longitude, or timestamp, it is strongly recommended that you avoid ids for these fields that are identical apart from an underscore and a number, such as timestamp_1, because this negatively impacts the query optimizations in the GeoSpock database
- source field: the field in the source input data
- (optional) purpose: see Special fields (purpose) for more information about this setting
- data type: this describes how the field is going to be stored in the GeoSpock database. Ensure this is correct before ingesting your source input data, because you cannot change it once the data has been ingested.
See Types of data for more information.
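The id naming recommendation in the list above can be checked mechanically. This sketch reflects one reading of that recommendation: it groups ids that become identical once a trailing underscore and number are removed. The function name `find_clashing_ids` is hypothetical, and the caller is assumed to pass only the ids of the latitude, longitude, and timestamp fields.

```python
import re
from collections import defaultdict

# Matches a trailing underscore followed by digits, e.g. the "_1" in "timestamp_1".
_NUMERIC_SUFFIX = re.compile(r"_\d+$")


def find_clashing_ids(ids):
    """Group ids that differ only by a trailing underscore and number.

    Returns a dict mapping each shared base name to the ids that collapse to it.
    Ids with no such sibling are omitted.
    """
    groups = defaultdict(list)
    for field_id in ids:
        base = _NUMERIC_SUFFIX.sub("", field_id)
        groups[base].append(field_id)
    return {base: members for base, members in groups.items() if len(members) > 1}
```

For example, passing `["timestamp", "timestamp_1"]` reports both ids under the base name `timestamp`, whereas `["pickup_time", "dropoff_time"]` reports nothing.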
Saving the data source description
The data source description generated by the GeoSpock CLI is not persisted, so you must save it either locally or to an S3 bucket for use when you ingest your data.
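One way to keep a local copy is to capture the command's output and write it to a file. The sketch below assumes that the CLI prints the description as JSON on standard output, which this page does not confirm; the helper names `generate_description` and `save_description` are hypothetical. Only the `--data-url` option shown earlier is used.

```python
import json
import subprocess
from pathlib import Path


def generate_description(data_url):
    """Run the CLI and return whatever it writes to standard output.

    Assumption: the description is printed on stdout as JSON.
    """
    result = subprocess.run(
        ["geospock", "data-source-description", "--data-url", data_url],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the command fails
    )
    return result.stdout


def save_description(description_text, path):
    """Check that the text parses as JSON, then write it to a local file."""
    parsed = json.loads(description_text)  # raises ValueError on invalid JSON
    Path(path).write_text(json.dumps(parsed, indent=2) + "\n", encoding="utf-8")
    return parsed
```

Parsing before writing means that an error message or empty output is not silently saved as a description. The saved file can then be copied to an S3 bucket with whatever tooling you normally use.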