Generating a data source description
Rather than creating a data source description file from scratch, you can use the GeoSpock CLI data-source-description command to create one. This command parses the source input data and returns a data source description that fits the sampled data.
Be aware that:
- this command does not support compressed data, so you should extract the data in the sample file before running this command
- the sample file must be no larger than 300MB
$ geospock data-source-description --data-url "s3://path-to-file/nyc-new-taxi-data-Pickups-Snapped-sample.csv"
You should review this data source description to make sure that the SQL data types assigned to each field are correct and that the source data will be ingested in the way you want.
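The two constraints listed above (uncompressed data, 300MB limit) can be checked on a local copy of the sample file before it is uploaded and passed to the command. The following is a minimal sketch, not part of the GeoSpock CLI: the function name check_sample_file is hypothetical, the list of compression signatures is illustrative rather than exhaustive, and it assumes that "300MB" means 300 * 1024 * 1024 bytes, which the documentation does not state.

```python
import os

# Assumption: "300MB" is interpreted as binary megabytes; the documentation
# does not say whether decimal or binary units are meant.
MAX_SAMPLE_BYTES = 300 * 1024 * 1024

# Leading "magic" bytes of some common compressed formats (not exhaustive).
COMPRESSED_MAGIC = {
    b"\x1f\x8b": "gzip",
    b"PK\x03\x04": "zip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
}


def check_sample_file(path):
    """Return a list of problems found with a local sample file (empty if none)."""
    problems = []

    # Compare the size on disk against the assumed limit.
    size = os.path.getsize(path)
    if size > MAX_SAMPLE_BYTES:
        problems.append(f"file is {size} bytes, over the {MAX_SAMPLE_BYTES}-byte limit")

    # Read the first few bytes and compare them with known signatures.
    with open(path, "rb") as handle:
        head = handle.read(8)
    for magic, name in COMPRESSED_MAGIC.items():
        if head.startswith(magic):
            problems.append(f"file looks {name}-compressed; extract it first")

    return problems
```

An empty list means neither check found a problem; it does not guarantee that the command will accept the file.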
Reviewing the data source description
Before using a data source description created by the CLI, you should check that the generated file is correct.
In particular:
- Check the source file format; see Specifying the source data file format
- Check the dataset type; see Types of dataset
- Check (if appropriate) that a properties block has been generated, containing any fields not in the event or location block that you want to ingest; see Defining other fields
You should also verify that each field you want to ingest has been described correctly, including:
- its id: the name for the column in the resulting SQL table. If you have more than one field for latitude, longitude or timestamp, it is strongly recommended that you avoid ids for these fields that are identical apart from an underscore and a number, such as timestamp_1, as this negatively impacts the query optimizations in the GeoSpock database
- source field: the field in the source input data
- (optional) purpose: see Special fields for more information about this setting
- data type: this describes how the field is going to be stored in the GeoSpock database. Ensure this is correct before ingesting your source input data because, once the data has been ingested, you cannot change it. See Types of data for more information.
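The recommendation about ids for latitude, longitude and timestamp fields can be checked mechanically. The sketch below is one possible reading of "identical apart from an underscore and a number": it treats ids as similar when they match after a trailing underscore-plus-digits suffix is removed. The function name find_similar_ids is hypothetical, and the ids in the example are made up for illustration.

```python
import re
from collections import defaultdict

# Matches a trailing "_<digits>" suffix, as in "timestamp_1".
_SUFFIX = re.compile(r"_\d+$")


def find_similar_ids(ids):
    """Group ids that are identical once a trailing '_<number>' is removed."""
    groups = defaultdict(list)
    for field_id in ids:
        base = _SUFFIX.sub("", field_id)
        groups[base].append(field_id)
    # Keep only the bases shared by more than one id.
    return {base: names for base, names in groups.items() if len(names) > 1}


# Illustrative ids for latitude, longitude and timestamp fields.
example = ["timestamp", "timestamp_1", "pickup_latitude", "dropoff_latitude"]
print(find_similar_ids(example))
# prints {'timestamp': ['timestamp', 'timestamp_1']}
```

Renaming the flagged ids to something more descriptive, in the style of pickup_latitude and dropoff_latitude above, is one way to avoid the pattern.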
Saving the data source description
The data source description generated by the GeoSpock CLI is not persisted, so you must save it either locally or to an S3 bucket for use when you ingest your data.
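One way to keep a local copy is to capture the output of the command when you run it. The sketch below assumes that the CLI writes the generated description to standard output, which this page does not confirm; the helper name save_description and the output file name are hypothetical.

```python
import subprocess
from pathlib import Path


def save_description(command, output_path):
    """Run a command and write whatever it prints on standard output to a file."""
    # check=True raises CalledProcessError if the command exits with an error,
    # so a failed run does not leave a misleading file behind.
    result = subprocess.run(command, capture_output=True, text=True, check=True)

    # Assumption: the description is the command's standard output.
    target = Path(output_path)
    target.write_text(result.stdout, encoding="utf-8")
    return target
```

For the command shown earlier, the call would pass the list ["geospock", "data-source-description", "--data-url", "s3://path-to-file/nyc-new-taxi-data-Pickups-Snapped-sample.csv"] as command and a local file name of your choice as output_path.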