The onboarding process
Data is ingested into the GeoSpock database, and made available for querying, as a dataset.
A dataset is a set of indexed data, from a single source, such as event data, Point of Interest (POI) data, or sensor data. The GeoSpock database supports datasets consisting of:
- location-based data
- event-based data
A database index improves the speed and efficiency of data retrieval, reducing the time it takes to provide a response to your query in the GeoSpock database. By default, the GeoSpock database's ingestion engine creates indexes for the following types of data:
- location (latitude and longitude)
- source ID (an identifier for a device or other source of the data)
The GeoSpock database creates its indexes whilst ingesting data. The indexes that it creates depend on the type of data being ingested, and are determined by the data source description file used during the ingest.
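To illustrate in general terms why an index speeds up retrieval, the following minimal Python sketch compares a full scan with a lookup through an index built once, keyed on source ID. It is a generic illustration only and is not a description of how the GeoSpock ingestion engine builds or stores its indexes; the record fields and the helper names (`scan_for_source`, `build_source_index`, `lookup_with_index`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical event records; the field names are illustrative only.
records = [
    {"source_id": "device-a", "lat": 52.20, "lon": 0.12},
    {"source_id": "device-b", "lat": 51.50, "lon": -0.13},
    {"source_id": "device-a", "lat": 52.21, "lon": 0.13},
]


def scan_for_source(rows, source_id):
    """Without an index: inspect every row on every query."""
    return [row for row in rows if row["source_id"] == source_id]


def build_source_index(rows):
    """Build the index once, mapping each source ID to its row positions."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row["source_id"]].append(position)
    return index


def lookup_with_index(rows, index, source_id):
    """With an index: go directly to the rows recorded for this source ID."""
    return [rows[position] for position in index.get(source_id, [])]


source_index = build_source_index(records)
matches = lookup_with_index(records, source_index, "device-a")
```

Both approaches return the same rows. The difference is that the indexed lookup does its work when the index is built, which in the GeoSpock database corresponds to ingest time rather than query time.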
To create a dataset with your data, you will need:
- to prepare your source input data
- to create a data source description for the input data
- to ingest the data using the GeoSpock command line interface (CLI).
The whole process for onboarding your data into the GeoSpock database is supported through the GeoSpock CLI. Using the CLI, you can:
- create a data source description for your dataset
- create a new dataset and ingest data into it
- add data to an existing dataset, sometimes referred to as an incremental ingest.
Preparing your data for ingest
Data is ingested into the GeoSpock database from an S3 bucket. It is recommended that you use a bucket in the same AWS account and region as your deployment; if one does not already exist, contact your database administrator to create one for your deployment.
Before you upload your source data to this bucket, prepare your data by making sure that:
- the directory structure and the size of the input files enable the ingestor to process the data efficiently; see Organizing your data files
- the input files are in a supported data format; see Source input data formats
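Checks of this kind can be run locally before uploading. The following is a minimal Python sketch, assuming the source files sit in a local directory; the `check_input_files` helper is hypothetical, and the allowed extensions and size limits are parameters that you would set from the guidance in Organizing your data files and Source input data formats, not values defined by GeoSpock here.

```python
from pathlib import Path


def check_input_files(root, allowed_suffixes, min_bytes, max_bytes):
    """Return (path, problem) pairs for files that fail the supplied checks.

    All limits are caller-supplied assumptions; this sketch does not know
    which formats or file sizes the ingestor actually supports.
    """
    problems = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue  # skip directories
        if path.suffix.lower() not in allowed_suffixes:
            problems.append((path, "unexpected file extension"))
            continue
        size = path.stat().st_size
        if size < min_bytes:
            problems.append((path, "file smaller than the chosen minimum"))
        elif size > max_bytes:
            problems.append((path, "file larger than the chosen maximum"))
    return problems
```

For example, `check_input_files("staging", {".csv"}, 1, 10**9)` would list every file under a hypothetical `staging` directory that is not a `.csv` file, is empty, or exceeds the chosen upper limit.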
Creating a data source description for a dataset
A data source description is a JSON file that defines how the source data is structured and configures how the data is stored and indexed when it is loaded into the database.
The file describes:
- the source data file format
- the type of dataset (event-based or location-based)
- the purpose and format of each field that you want to ingest.
You can create a data source description from scratch or generate one from a sample file; see Generating a data source description.
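As a rough illustration of writing such a file from scratch, the Python sketch below assembles a description covering the three things the file describes (file format, dataset type, and per-field purpose and format) and saves it as JSON. Every key name and value in it is a hypothetical placeholder and not the GeoSpock schema; refer to the data source description reference for the real structure.

```python
import json

# Hypothetical structure: all keys and values are placeholders chosen for
# illustration and do NOT reflect the actual GeoSpock schema.
description = {
    "file_format": "csv",
    "dataset_type": "event-based",
    "fields": [
        {"name": "latitude", "purpose": "location", "format": "decimal degrees"},
        {"name": "longitude", "purpose": "location", "format": "decimal degrees"},
        {"name": "device", "purpose": "source ID", "format": "text"},
    ],
}


def write_description(data, path):
    """Serialize the description to a JSON file at the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=2)


def read_description(path):
    """Load a description back from disk, for example to review it."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Keeping the description in a file of its own makes it easy to review and version alongside the data it describes.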
Once you have prepared your data and have a data source description, you can ingest your data using the GeoSpock CLI. See Ingesting data for guidance on how to do this.
When new data becomes available, you can add it to an existing dataset using an incremental ingest.