Ingesting data

The ingestion process loads your source input data into the GeoSpock database, storing and indexing it so that it can be accessed by the data analysis tools.

Prerequisites

Before you try to ingest data into the GeoSpock database, ensure you have understood The onboarding process, that your source input data is in a supported format (Source input data formats), and that you have created a data source description.

Your source input data should be uploaded to an S3 bucket. Refer to Organizing your data files for recommendations on directory structure within the bucket.

To ingest data, you will also need dataset administration (MODIFY) permissions, which allow you to create datasets; see Dataset administration permissions.

All aspects of the ingest process are controlled through the GeoSpock CLI, so you will need this installed in order to proceed.

The ingest cluster

When you trigger an ingest, a new cluster of machines will automatically be deployed. The cluster is only used for data ingestion and will automatically be destroyed once the ingest is complete.

The ingest cluster consists of a single coordinator machine and a number of worker instances. You can control the size of the cluster (specifically, the number of worker machines) using an argument to the CLI command that initiates the ingest.

Calculating the size of the ingestor cluster

As part of the GeoSpock CLI command that ingests data, you must specify the number of worker instances you want to include in the ingest cluster.

The size of your ingest cluster will generally depend on the amount of input data to be ingested.

When following the guidelines below, observe that:

  • the minimum cluster size is the smallest number of worker machines needed to ensure the cluster has enough resources to complete the ingest;
  • the optimal cluster size is the number of worker machines that will complete the ingest in the least time. You can add machines up to this number to speed up the ingest, but adding machines beyond this number is unlikely to reduce ingest times and will unnecessarily add to the cost of the ingest.

Use the following guidelines to select a suitable cluster size, depending on the amount of input data you have to ingest (measured in GB).

Minimum cluster size

The minimum size for the ingestor cluster is the size of your input data (compressed, measured in GB) divided by 480, rounded up.

For example, for 1 TB of data (1000 GB) a minimum of 3 worker machines is recommended.

Optimal cluster size

The optimal size for the ingestor cluster is the size of your input data (compressed or uncompressed, measured in GB) divided by 28, rounded up.

For example, for 1 TB of data (1000 GB) the ingest will complete fastest using 36 worker machines.

Maximum cluster size

Note that using a cluster size of more than 300 machines is not recommended. If the guidelines above would suggest a cluster larger than 300, you may wish to consider splitting your data and ingesting in smaller batches.
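The sizing guidelines above can be sketched as a small shell helper. This is an illustration only: the function names (ceil_div, min_workers, optimal_workers) are hypothetical and not part of the GeoSpock CLI, and the input size is assumed to be a whole number of GB. The divisors 480 and 28 and the 300-machine limit are taken from the guidelines above.

```shell
# Hypothetical helpers, not part of the GeoSpock CLI.

# Ceiling division for positive integers: ceil_div <numerator> <denominator>
ceil_div() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Minimum worker count for a compressed input size in whole GB.
min_workers() {
  ceil_div "$1" 480
}

# Optimal worker count for an input size in whole GB.
optimal_workers() {
  ceil_div "$1" 28
}

size_gb=1000
echo "minimum: $(min_workers "$size_gb")"      # prints "minimum: 3"
echo "optimal: $(optimal_workers "$size_gb")"  # prints "optimal: 36"

# Warn when the guideline would suggest more than 300 machines.
if [ "$(optimal_workers "$size_gb")" -gt 300 ]; then
  echo "optimal size exceeds 300 machines; consider ingesting in smaller batches" >&2
fi
```

Any value between the minimum and the optimal figure is a valid choice for --instance-count; larger values within that range trade cost for speed.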

Ingesting data for a new dataset

To create a new dataset using the GeoSpock CLI, use the dataset-create command, as follows:

$ geospock dataset-create --dataset-name <dataset-name> --data-url <source-data-url> \
  --data-source-description-file <path-to-data-source-description> --instance-count <count>

where:

Setting | Description | Data type
--dataset-name | The name of the dataset. This setting accepts lowercase alphanumeric characters only. | String
--data-url | The location (URL) of the source input data. This can be a folder rather than a single file. If this URL is invalid or cannot be accessed, the ingest will fail. Use the dataset-status command to track progress; refer to Checking the progress of your ingest. | String
--data-source-description-file | The location of the file describing the fields in this source input data. This file can be in a local directory or an AWS S3 bucket. | String
--instance-count | The number of worker instances to use for this ingest. See Calculating the size of the ingestor cluster. | Integer

The dataset-create command creates the specified dataset and triggers the ingestion of its data from the specified source (URL), using the referenced data source description file.

For time-based datasets, it is strongly recommended that the initial data ingest covers at least 24 hours (preferably longer). This first ingest determines the internal structure of the dataset in the database. A representative set of data in the initial ingest will therefore lead to more efficient querying when additional data is later ingested into the dataset.

For more information about this command, use the GeoSpock CLI's help command.
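Because the dataset name accepts lowercase alphanumeric characters only, it can be useful to check the name before starting an ingest. The following is a minimal sketch of such a check wrapped around dataset-create. The functions is_valid_dataset_name and create_dataset are hypothetical helpers, not part of the GeoSpock CLI, and the check assumes that "lowercase alphanumeric" means the ASCII characters a-z and 0-9.

```shell
# Hypothetical pre-flight check; not part of the GeoSpock CLI.
# Succeeds only if the name is non-empty and contains nothing but a-z and 0-9.
# The body runs in a subshell so the locale change does not leak to the caller.
is_valid_dataset_name() (
  LC_ALL=C  # make the a-z range match ASCII lowercase letters only
  case "$1" in
    ''|*[!a-z0-9]*) false ;;
    *) true ;;
  esac
)

# Checks the name, then runs dataset-create with the four settings described above.
# Arguments: <dataset-name> <source-data-url> <description-file> <instance-count>
create_dataset() {
  if ! is_valid_dataset_name "$1"; then
    echo "invalid dataset name: $1" >&2
    return 1
  fi
  geospock dataset-create --dataset-name "$1" --data-url "$2" \
    --data-source-description-file "$3" --instance-count "$4"
}
```

Under these assumptions, a name such as nyctaxidata passes the check, while a name containing uppercase letters, hyphens or spaces is rejected before any cluster is requested.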

Data validation

During ingestion, the source input data is processed row-wise, based on the file format of the source input data.

For all datasets:

  • a row will be considered invalid if the value of any of its fields is invalid, as determined by the data source description;
  • an invalid row will be excluded from the dataset.

Checking the progress of your ingest

To check the status of the dataset whilst your data is being ingested, use the following GeoSpock CLI command:

$ geospock dataset-status --dataset-name <dataset-name>

This command gets information about the specified dataset, including its title, a summary of its contents and the status of the most recent operation on that dataset. For example:

$ geospock dataset-status --dataset-name nycTaxiData
{
    "id": "nycTaxiData",
    "title": "nycTaxiData",
    "description": "Ingested from \u201cnycTaxiData\u201d on 1/23/2020-09:45:49",
    "createdDate": "2020-01-23T09:45:50Z",
    ...
    "operationStatus": {
        "id": "opr-ingesttest1-7",
        "label": "Data ingested",
        "type": "INGEST",
        "status": "COMPLETED",
        "lastModifiedDate": "2020-01-22T16:57:14.985Z",
        "createdDate": "2020-01-23T09:45:51Z"
    }
}

For more information about this command, use the GeoSpock CLI's help command.
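If you want to wait for an ingest from a script, the status check can be repeated until the operation reports COMPLETED. The sketch below is illustrative only. The functions operation_status and wait_for_ingest are hypothetical, the extraction assumes the pretty-printed output shown above (one field per line, with "status" appearing only inside operationStatus), and COMPLETED is the only status value this page documents, so the loop simply gives up after a fixed number of attempts rather than testing for any other value.

```shell
# Hypothetical helper: reads dataset-status output on stdin and prints the value
# of the first "status" field it finds. Assumes one JSON field per line.
operation_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: polls dataset-status once a minute until the operation
# reports COMPLETED, or until the attempt limit is reached.
# Arguments: <dataset-name> [max-attempts, default 60]
wait_for_ingest() {
  dataset_name=$1
  attempts=${2:-60}
  while [ "$attempts" -gt 0 ]; do
    status=$(geospock dataset-status --dataset-name "$dataset_name" | operation_status)
    echo "operation status: $status"
    [ "$status" = "COMPLETED" ] && return 0
    attempts=$((attempts - 1))
    sleep 60
  done
  return 1
}
```

A dedicated JSON parser would be more robust than the sed expression if one is available in your environment.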

Granting access to your dataset

Before anyone can use this dataset in a query, you must give GeoSpock database users permission to access your ingested data. Refer to Managing dataset access for more information.

Ingesting data for an existing dataset

If you already have a dataset, you can add new data when it becomes available by running an incremental ingest. This enables you to add new data to an existing set of data without re-ingesting all of the source input data. Say, for example, you have a set of sensors collecting data around a city and you ingest all the data you have collected so far from these sensors into a dataset. When required, you can add the latest sensor data to this dataset, enabling you to see how the trends in your data change over time.

To run an incremental ingest, use the GeoSpock CLI to run the dataset-add-data command:

$ geospock dataset-add-data --dataset-name <dataset-name> --data-url <source-data-url> --instance-count <count>

To avoid creating duplicate data rows during an incremental ingest, make sure the files or folders you ingest contain only new data rows. Re-processing data rows that have already been ingested results in duplicate rows and increases the cost of the incremental ingest, because every data row has to be processed. To keep your datasets accurate and to reduce your ingestion costs, consider creating a directory in your S3 bucket that contains just the new data rows. For guidance on how to organize your source input data files, see Organizing your data files.

You should also make sure that new data rows have the same field order and data format as the original data.

You should use the same guidelines for choosing the instance-count value as for an initial ingest - see Calculating the size of the ingestor cluster.

Use the dataset-status command to check the status of your incremental ingest.
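One way to make sure an incremental ingest contains only new data is to keep a record of the files you have already ingested and compare it with the current file listing. The sketch below assumes you maintain two plain-text files yourself, each with one file path per line: a manifest of ingested files and a listing of all current source files. Both files and the new_files function are hypothetical and not part of the GeoSpock CLI.

```shell
# Hypothetical helper: prints the entries of <current-listing> that do not
# appear in <ingested-manifest>. Both are text files with one path per line.
# Arguments: <ingested-manifest> <current-listing>
new_files() {
  sorted_manifest=$(mktemp) || return 1
  sort "$1" > "$sorted_manifest"
  # comm -13 suppresses lines unique to the first input and lines common to
  # both, leaving only the lines found solely in the second input.
  sort "$2" | comm -13 "$sorted_manifest" -
  rm -f "$sorted_manifest"
}
```

The files this prints are the candidates to copy into a separate directory for the next dataset-add-data run. Once that ingest has completed, they can be appended to the manifest.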