Ingesting source input data

The ingestion process loads your source input data into the GeoSpock database, storing and indexing it so that it can be accessed by the data analysis tools. You ingest data into the GeoSpock database using a data source description that configures how the source input data is stored and indexed. The ingested data is accessible as a dataset.

Creating data source descriptions

For each dataset, you should create a data source description that describes each of the source fields you want to ingest, to make sure that your source input data is ingested correctly (see Creating a data source description for a dataset). The GeoSpock database uses field types to index and store ingested data so that it can be retrieved and analyzed appropriately. During ingestion, indexes are created that the SQL analysis tools later use to analyze the data.
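The actual schema of a data source description is covered in Creating a data source description for a dataset. Purely as an illustration of the idea of listing each source field with a type, a description file could be generated like this (the "fields" key and the field names and types shown here are hypothetical placeholders, not the real schema):

```python
import json

# Illustrative only: the real data source description schema is defined in
# "Creating a data source description for a dataset". The "fields" key and
# the name/type pairs below are hypothetical placeholders.
fields = [
    {"name": "pickupTime", "type": "TIMESTAMP"},
    {"name": "latitude", "type": "DOUBLE"},
    {"name": "longitude", "type": "DOUBLE"},
]

with open("sampleDescription.json", "w") as f:
    json.dump({"fields": fields}, f, indent=2)
```

The resulting file is what you would pass to the --data-source-description-file setting.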


To ingest source input data into the GeoSpock database you need: a name for the dataset, the URL of your source input data, a data source description file, and the number of ingestor instances to use.

Calculating the size of the ingestor cluster

The GeoSpock database creates an ingestor cluster every time you ingest data. As part of the GeoSpock CLI command that ingests data, you must specify the number of ingestor instances to include in this cluster. It is recommended that you have one ingestor instance for every one billion source input data rows.
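The sizing guidance above, combined with the minimum of 3 instances required by the ingest commands below, can be sketched as a small helper (the function name is an illustration, not part of the GeoSpock CLI):

```python
import math

def recommended_instance_count(row_count: int) -> int:
    """Recommended ingestor instances: one per billion source input
    data rows, and never fewer than 3 (the minimum accepted by the
    ingest commands; lower values cause the ingest to fail)."""
    return max(3, math.ceil(row_count / 1_000_000_000))

print(recommended_instance_count(5_200_000_000))  # 6 instances for 5.2 billion rows
```

Pass the result as the --instance-count setting of dataset-create or dataset-add-data.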

Once the ingest is complete, the GeoSpock database automatically destroys the ingestor cluster.

Ingesting data for a new dataset

To create a new dataset using the GeoSpock CLI, use the dataset-create command as follows:

$ geospock dataset-create --dataset-name taxiData --data-url "s3://path-to-file/nyc-new-taxi-data-Pickups-Snapped-sample.csv" --data-source-description-file c:/path/to/the/sampleDescription.json --instance-count 5 


--dataset-name (ID): The name of the dataset. This setting accepts lowercase alphanumeric characters only.

--data-url (String): The location (URL) of the source input data. This can be either a single file or a folder. If this URL is invalid or cannot be accessed, the ingest will fail; use the dataset-status command to track progress (refer to Checking the progress of your ingest).

--data-source-description-file (String): The location of the file describing the fields in this source input data. This file can be in a local directory or an AWS S3 bucket.

--instance-count (Integer): The number of ingestor instances to use for this ingest. It is recommended that the cluster size ensures the total disk and memory size is at least 2.5 times the size of the (raw, compressed) input data. The minimum value is 3; lower values will cause the ingest to fail.

This creates the specified dataset and triggers the ingestion of its data from the specified source URL, using the specified data source description file.

It is strongly recommended that the data ingested when creating a dataset covers at least 24 hours (preferably longer), as this first ingest determines the internal structure of the dataset in the database. A representative set of data will lead to more efficient data querying as more data is ingested.
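Because this first ingest fixes the internal structure of the dataset, it can be worth confirming that the initial file spans at least 24 hours before running dataset-create. A minimal sketch, assuming a CSV source file with a header row and an ISO-8601 timestamp column (both assumptions about your data, not requirements stated by the CLI):

```python
import csv
from datetime import datetime, timedelta

def covers_24_hours(csv_path: str, timestamp_column: str) -> bool:
    """Return True if the timestamps in the file span at least 24 hours.

    Assumes a CSV with a header row and ISO-8601 timestamps in
    timestamp_column; adapt to your own source data format.
    """
    with open(csv_path, newline="") as f:
        times = [
            datetime.fromisoformat(row[timestamp_column])
            for row in csv.DictReader(f)
        ]
    return max(times) - min(times) >= timedelta(hours=24)
```

Run a check like this against the file you intend to pass as --data-url before the first ingest.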

For more information about this command, use the GeoSpock CLI's help command.

Checking the progress of your ingest

To check the status of the dataset whilst your data is being ingested, use the following GeoSpock CLI command:

$ geospock dataset-status --dataset-name <dataset name> 

This gets information about the specified dataset, including its title, a summary of its contents, and the status of the most recent operation on that dataset. For example:

$ geospock dataset-status --dataset-name nycTaxiData 
{
    "id": "nycTaxiDat",
    "title": "nycTaxiDat",
    "description": "Ingested from \u201cnycTaxiDat\u201d on 1/23/2020-09:45:49",
    "createdDate": "2020-01-23T09:45:50Z",
    "operationStatus": {
        "id": "opr-ingesttest1-7",
        "label": "Data ingested",
        "type": "INGEST",
        "status": "COMPLETED",
        "lastModifiedDate": "2020-01-22T16:57:14.985Z",
        "createdDate": "2020-01-23T09:45:51Z"
    }
}
For more information about this command, use the GeoSpock CLI's help command.
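In automation, it is common to poll dataset-status until the latest operation finishes. A minimal sketch, assuming the operationStatus shape shown above; the status lookup is injected as a callable so that, in practice, it could wrap a geospock dataset-status invocation and parse its JSON output (the "FAILED" value is an assumption; the output above only shows "COMPLETED"):

```python
import time

def wait_for_ingest(fetch_status, poll_seconds=1.0, max_polls=60):
    """Poll until the most recent operation is COMPLETED.

    fetch_status is any callable returning a dict shaped like the
    dataset-status output, i.e. with operationStatus.status inside it.
    Returns True on COMPLETED, False on FAILED (an assumed status
    value), and raises if the polling window is exhausted.
    """
    for _ in range(max_polls):
        status = fetch_status()["operationStatus"]["status"]
        if status == "COMPLETED":
            return True
        if status == "FAILED":
            return False
        time.sleep(poll_seconds)
    raise TimeoutError("ingest did not finish within the polling window")
```

This keeps the CLI interaction in one place, so the same loop works for both dataset-create and dataset-add-data operations.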

Before you can use this dataset in your queries, you must give GeoSpock database users permission to access your ingested data; see Adding permissions to your ingested data.

Ingesting data for an existing dataset

If you already have a dataset, you can add new data when it becomes available by running an incremental ingest. This enables you to add new data to an existing set of data without re-ingesting all of the source input data. Say, for example, you have a set of sensors collecting data around a city and you ingest all the data you have collected so far from these sensors into a dataset. When required, you can add the latest sensor data to this dataset, enabling you to see how the trends in your data change over time.

To run an incremental ingest, use the GeoSpock CLI to run the dataset-add-data command:

$ geospock dataset-add-data --dataset-name bankData --data-url "s3://path-to-file/osm-us-banks-no-other-with-radius-metres.csv" --instance-count 5

To make sure that duplicate data rows are not created during an incremental ingest, the files or folders you ingest must contain only new data rows. Re-processing data rows that have already been ingested results in duplicate data rows and increases the cost of the incremental ingest, because every data row has to be processed. To keep your datasets accurate and to reduce your ingestion costs, consider creating a directory in your S3 bucket that contains just the new data rows. For guidance on how to organize your source input data files, see Organizing your data files.
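One way to stage only the new rows is to filter the source file by a watermark before uploading it. A minimal sketch, assuming CSV source files with a header row and an ISO-8601 timestamp column, and that you track the latest timestamp already ingested (all assumptions; the documentation itself only requires that already-ingested rows are excluded):

```python
import csv

def write_new_rows(source_csv, staged_csv, timestamp_column, last_ingested):
    """Copy only rows newer than last_ingested into a staging file,
    so an incremental ingest never re-processes existing rows.

    Returns the number of rows staged. ISO-8601 timestamp strings
    compare chronologically, so plain string comparison is enough.
    """
    with open(source_csv, newline="") as src, open(staged_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        count = 0
        for row in reader:
            if row[timestamp_column] > last_ingested:
                writer.writerow(row)
                count += 1
    return count
```

The staged file (or a folder of such files) is then what you point --data-url at when running dataset-add-data.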

You should also make sure that new data rows have the same field order and data format as the original data.
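A quick guard for the field-order requirement, assuming both the original and the new source files are CSVs with header rows (an assumption; if your source data has no headers, compare a sample row's field count and format instead):

```python
import csv

def headers_match(original_csv: str, new_csv: str) -> bool:
    """Check that a new file's header (field names and their order)
    matches the original dataset's, before running dataset-add-data."""
    def header(path):
        with open(path, newline="") as f:
            return next(csv.reader(f))
    return header(original_csv) == header(new_csv)
```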

As with creating a new dataset, the minimum instance-count value is 3, and values lower than this will cause the ingest to fail.

Use the dataset-status command to check the status of your incremental ingest.