Organizing your data files

To get the best performance from the ingestor, split your data into multiple files that are aligned with the structure of your data. Splitting the data this way lets the ingestor process the files in parallel. For example, you could create one file for each week of the year, rather than a single file containing the data for the entire year.

Directory structure

Because the ingestor is optimized to process data files in parallel, we recommend that you split your data into multiple files. Depending on your data, you could split it into separate files based on:

  • time period, such as days, weeks, or months; for example:

    /events/2018-W28/<file>

    /events/2018/11/05/<file>

    /events/2018/08/02/23/<file>

  • locale, such as city, country, or continent; for example:

    /poi/europe/<file>

    /poi/europe/spain/<file>

    /poi/europe/spain/madrid/<file>

  • data source, such as personal devices or industrial equipment; for example:

    /events/roadrunner/2018/11/05/<file>

    /events/wileycoyote/2018/11/05/<file>

    /events/boston/cars/2018/11/05/<file>

    /events/boston/buses/2018/11/05/<file>
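
As a rough illustration of the time-period and data-source layouts listed above, the following Python sketch builds directory names from a timestamp. The helpers day_dir and week_dir are hypothetical names chosen for this example and are not part of the ingestor; the root and source names are taken from the examples above.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import Optional


def day_dir(root: str, when: datetime, source: Optional[str] = None) -> str:
    """Return a per-day directory, optionally nested under a data source."""
    parts = [root]
    if source:
        parts.append(source)
    parts.append(when.strftime("%Y/%m/%d"))
    return str(PurePosixPath("/", *parts)) + "/"


def week_dir(root: str, when: datetime) -> str:
    """Return a per-week directory named after the ISO year and week."""
    iso_year, iso_week, _ = when.isocalendar()
    return f"/{root}/{iso_year}-W{iso_week:02d}/"


print(day_dir("events", datetime(2018, 11, 5)))
# /events/2018/11/05/
print(day_dir("events", datetime(2018, 11, 5), source="roadrunner"))
# /events/roadrunner/2018/11/05/
print(week_dir("events", datetime(2018, 7, 12)))
# /events/2018-W28/
```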

Using a directory structure such as this enables you to perform:

  • recursive ingests: If you organize your data into a directory structure based on time period, locale, or data source, you can ingest the files by using the parent directory in the ingest command. For example, using /events/2018/11/05/ ingests a whole day's worth of data, while using /events/2018/11/ ingests a whole month.
  • incremental ingests: By adding the latest data to a directory for the day or week, you can ingest just that data into your GeoSpock database, where it can be analyzed by the data analysis tools.
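
Before running a recursive ingest, it can help to check which files a parent directory actually covers. The sketch below assumes the files are on a local file system; files_under is a hypothetical helper written for this example, and the ingest command itself is not shown.

```python
from pathlib import Path
from typing import List


def files_under(parent: str) -> List[Path]:
    """Return every regular file below `parent`, at any depth, in sorted order."""
    return sorted(p for p in Path(parent).rglob("*") if p.is_file())
```

With the layout described above, files_under("/events/2018/11/05/") would list one day's files, and files_under("/events/2018/11/") would list the files for the whole month.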

File size

The optimum size for a data file is 50 to 500 MiB (uncompressed), although the ingestor can process files of over 1 GiB. Aim to distribute your data across a number of files, but avoid creating many very small files (less than 10 MiB). For very small files, the fixed overhead of processing each file outweighs the work of ingesting the data it contains, which reduces the efficiency of the ingestor.
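
The size guidance above can be expressed as a small check. This is only a sketch: the thresholds come from this section, while the function name size_category and the category labels are invented for the example.

```python
MIB = 1024 * 1024


def size_category(uncompressed_bytes: int) -> str:
    """Classify a file by its uncompressed size, using the thresholds above."""
    if uncompressed_bytes < 10 * MIB:
        return "too small"        # per-file overhead dominates
    if uncompressed_bytes < 50 * MIB:
        return "below optimum"
    if uncompressed_bytes <= 500 * MIB:
        return "optimum"
    return "above optimum"        # still ingestible, but larger than recommended


print(size_category(5 * MIB))     # too small
print(size_category(200 * MIB))   # optimum
```

Note that the guidance refers to uncompressed size, so the size on disk of a compressed file such as a .csv.gz file is not the right input.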

For example, if you are generating multiple GiB of data per day, you should split the data into files of approximately equal size (measured either in bytes or in rows), such as:

/events/2018/11/05/00001.csv.gz

/events/2018/11/05/00002.csv.gz

/events/2018/11/05/00003.csv.gz

/events/2018/11/05/00004.csv.gz

/events/2018/11/05/00005.csv.gz

/events/2018/11/05/00006.csv.gz
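
One possible way to produce numbered files like these is to split a large CSV file by row count. The following is a minimal sketch, not a GeoSpock tool: split_csv is a hypothetical helper, and it assumes that the source file has a header row and that the header should be repeated in every output file. Choose rows_per_file so that each file falls within the recommended size range for your data.

```python
import csv
import gzip
from pathlib import Path
from typing import List


def split_csv(src: str, out_dir: str, rows_per_file: int) -> List[Path]:
    """Split `src` into gzipped CSV files of at most `rows_per_file` data rows."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    handle = None
    writer = None
    try:
        with open(src, newline="") as f:
            reader = csv.reader(f)
            header = next(reader, None)  # assumption: first row is a header
            if header is None:
                return written
            for i, row in enumerate(reader):
                if i % rows_per_file == 0:
                    # Start the next numbered file: 00001.csv.gz, 00002.csv.gz, ...
                    if handle is not None:
                        handle.close()
                    path = out / f"{len(written) + 1:05d}.csv.gz"
                    handle = gzip.open(path, "wt", newline="")
                    writer = csv.writer(handle)
                    writer.writerow(header)
                    written.append(path)
                writer.writerow(row)
    finally:
        if handle is not None:
            handle.close()
    return written
```

Splitting by row count keeps the files approximately equal in size only when rows are of similar length; if row lengths vary widely, splitting on bytes written may give more even files.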

Description of the source data fields

We recommend that you provide a description of the data for each data source, explaining what each field represents and including the key for any enumerated values. This helps you select the correct type for each field when you create the data source description. Choosing the correct type is important because, once you have ingested the data, you cannot change the type of a field without ingesting all the data again.

For example, it is useful to know whether a field labeled accuracy is measured in meters, kilometers, feet, or miles, or whether it is a percentage. Similarly, it is helpful to include the context of the field, such as "the accuracy of the longitude and latitude position".
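
A field description of this kind could be recorded in a simple structure such as the one sketched below. This is purely illustrative: FieldDescription and its attributes are hypothetical and do not represent the format of the GeoSpock data source description, and the field names, units, and enumerated values are made-up examples.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FieldDescription:
    name: str
    data_type: str                 # the type you intend to select for the field
    description: str               # what the field represents, with context
    unit: str = ""                 # for example "meters" or "percent"
    enum_key: Dict[str, str] = field(default_factory=dict)  # code -> meaning


def missing_units(fields: List[FieldDescription]) -> List[str]:
    """Return the names of numeric fields that have no unit recorded."""
    return [f.name for f in fields if f.data_type == "number" and not f.unit]


fields = [
    FieldDescription(
        name="accuracy",
        data_type="number",
        description="The accuracy of the longitude and latitude position",
        unit="meters",
    ),
    FieldDescription(
        name="status",
        data_type="string",
        description="Operating state of the device",
        enum_key={"0": "idle", "1": "moving"},
    ),
    FieldDescription(name="speed", data_type="number", description="Speed of the device"),
]

print(missing_units(fields))  # ['speed']
```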