Ingesting source input data

The ingestion process loads your source input data into the GeoSpock database, storing and indexing it so that it can be accessed by the data analysis tools. You ingest data into the GeoSpock database using a data source description that configures how the source input data is stored and indexed. The ingested data is accessible as a dataset.

Creating data source descriptions

For each dataset, you should create a data source description that describes each of the source fields you want to ingest, so that your source input data is ingested correctly (see Creating a data source description for a dataset). The GeoSpock database uses field types to index and store ingested data so that it can be retrieved and analyzed appropriately. During ingestion, indexes are created that the SQL analysis tools later use to analyze the data.

Prerequisites

To ingest source input data into the GeoSpock database you need:

Calculating the size of the ingestor cluster

The GeoSpock database creates an ingestor cluster every time you ingest data. As part of the GeoSpock CLI command that ingests data, you must specify the number of ingestor instances you want to include in this cluster. It is recommended that you have one ingestor instance for every one billion source input data rows.

Once the ingest is complete, the GeoSpock database automatically destroys the ingestor cluster.
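The sizing guideline above amounts to a ceiling division of the row count by one billion. As a sketch, a small helper function (our own illustration, not part of the GeoSpock CLI) makes the calculation explicit:

```shell
# Illustrative helper (not part of the GeoSpock CLI): compute the
# recommended ingestor instance count from the number of source input
# data rows, using a ceiling division by one billion.
recommended_instances() {
  rows=$1
  echo $(( (rows + 999999999) / 1000000000 ))
}

recommended_instances 50000000000   # 50 billion rows -> 50 instances
recommended_instances 1500000000    # 1.5 billion rows -> 2 instances
```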

Ingesting data for a new dataset

To create a new dataset using the GeoSpock CLI, use the dataset-create command as follows:

$ geospock dataset-create --dataset-id taxiData --data-url "s3://path-to-file/nyc-new-taxi-data-Pickups-Snapped-sample.csv" --data-source-description-file c:/path/to/the/sampleDescription.json --instance-count 5 

...where:

--dataset-id (ID)

The name of the dataset. This setting accepts lowercase alphanumeric characters only.

--data-url (String)

The location (URL) of the source input data. This can be either a single file or a folder.

--data-source-description-file (String)

The location of the file describing the fields in this source input data. This file can be in a local directory or an AWS S3 bucket.

--instance-count (Integer)

The number of ingestor instances to use for this ingest.

As a guide, divide the number of source input data rows you want to ingest by 1 billion to get the number of instances you should use. So if you want to ingest 50 billion rows of source input data, you should use 50 ingestor instances in your cluster.

This creates the specified dataset and triggers the ingestion of its data from a specified source (URL), using the specified data source description file.

For more information about this command, use the GeoSpock CLI's help command.

Checking the progress of your ingest

To check the status of the dataset whilst your data is being ingested, use the following GeoSpock CLI command:

$ geospock dataset-status --dataset-id <dataset ID> 

This gets information about the specified dataset, including its title, a summary of its contents, and the status of the most recent operation on that dataset. For example:

$ geospock dataset-status --dataset-id nycTaxiData 
{
    "id": "nycTaxiData",
    "title": "nycTaxiData",
    "description": "Ingested from \u201cnycTaxiData\u201d on 1/23/2020-09:45:49",
    "createdDate": "2020-01-23T09:45:50Z",
    "operationStatus": {
        "id": "opr-ingesttest1-7",
        "label": "Data ingested",
        "type": "INGEST",
        "status": "COMPLETED",
        "lastModifiedDate": "2020-01-23T16:57:14.985Z",
        "createdDate": "2020-01-23T09:45:51Z"
    }
}

For more information about this command, use the GeoSpock CLI's help command.
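Because the status output is JSON, you can poll it from a script and wait for the ingest to finish. The sketch below is illustrative rather than part of the GeoSpock tooling: only the dataset-status command itself comes from this document, while the helper functions, the naive sed-based parsing of the "status" field, and the 30-second polling interval are our own assumptions:

```shell
# Sketch of a polling loop for an in-progress ingest. Helper names and the
# sed-based parsing are illustrative; only dataset-status is a real command.

extract_status() {
  # Naively pull the "status" value out of the dataset-status JSON output.
  sed -n 's/.*"status": *"\([A-Z_]*\)".*/\1/p'
}

wait_for_ingest() {
  dataset=$1
  while true; do
    status=$(geospock dataset-status --dataset-id "$dataset" | extract_status)
    echo "Ingest status for $dataset: $status"
    [ "$status" = "COMPLETED" ] && break
    sleep 30
  done
}
```

For unattended ingests, a loop like this is simpler than re-running the status command by hand; a production script would also want to detect failed operations rather than polling forever.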

Before you can use this dataset in your queries, you must give GeoSpock database users permission to access your ingested data; see Adding permissions to your ingested data.

Ingesting data for an existing dataset

If you already have a dataset, you can add new data when it becomes available by running an incremental ingest. This enables you to add new data to an existing set of data without re-ingesting all of the source input data. Say, for example, you have a set of sensors collecting data around a city and you ingest all the data you have collected so far from these sensors into a dataset. When required, you can add the latest sensor data to this dataset, enabling you to see how the trends in your data change over time.

To run an incremental ingest, use the GeoSpock CLI to run the dataset-add-data command:

$ geospock dataset-add-data --dataset-id bankData --data-url "s3://path-to-file/osm-us-banks-no-other-with-radius-metres.csv" --instance-count 5

To make sure that duplicate data rows are not created during an incremental ingest, the files or folders you ingest must contain only new data rows. Re-processing data rows that have already been ingested results in duplicate data rows, and also increases the cost of the incremental ingest, because every data row has to be processed again. To keep your datasets accurate and to reduce your ingestion costs, consider creating a directory in your S3 bucket that contains just the new data rows. For guidance on how to organize your source input data files, see Organizing your data files.

You should also make sure that new data rows have the same field order and data format as the original data.
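The staging approach described above can be sketched as a small wrapper script. Apart from the dataset-add-data command itself, everything here is a labeled assumption: the bucket name, the ./new-rows directory, the sensorData dataset ID, the instance count, and the wrapper function are all hypothetical:

```shell
# Hypothetical wrapper: upload only the new data rows to a fresh S3 prefix,
# then run an incremental ingest against just that prefix. The bucket,
# local path, and dataset ID below are illustrative, not real.
run_incremental_ingest() {
  prefix=$1   # e.g. "s3://my-bucket/sensor-data/2020-02-incremental/"

  # Stage only the files containing new rows, keeping them separate from
  # data that has already been ingested.
  aws s3 cp ./new-rows/ "$prefix" --recursive

  # Ingest just the staged prefix into the existing dataset.
  geospock dataset-add-data \
      --dataset-id sensorData \
      --data-url "$prefix" \
      --instance-count 2
}
```

Keeping each batch of new rows in its own date-stamped prefix makes it easy to see what has already been ingested and to avoid re-processing old files.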

Use the dataset-status command to check the status of your incremental ingest.