Organizing your data files

To get the best performance from the ingestor, you should split your data into smaller files, aligned with the structure of your data. Having a large number of small files takes advantage of the ingestor's ability to process data in parallel. For example, you could create files containing data for each week of the year, rather than a single file containing all the data for the entire year.
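
As a minimal sketch of the weekly split, assuming the source is a single CSV file with a timestamp column in ISO 8601 format (both the file layout and the column name are assumptions, not part of the ingestor), a year of events could be partitioned into per-week directories such as /events/2018-W28/:

    import csv
    from datetime import datetime
    from pathlib import Path

    def split_by_week(source: str, out_dir: str) -> None:
        """Partition a yearly CSV into one file per ISO week, e.g. /events/2018-W28/."""
        files, writers = {}, {}
        with open(source, newline="") as src:
            reader = csv.DictReader(src)
            for row in reader:
                # 'timestamp' is an assumed column name, in ISO 8601 format.
                ts = datetime.fromisoformat(row["timestamp"])
                year, week, _ = ts.isocalendar()
                key = f"{year}-W{week:02d}"
                if key not in writers:
                    week_dir = Path(out_dir) / key
                    week_dir.mkdir(parents=True, exist_ok=True)
                    files[key] = open(week_dir / "events.csv", "w", newline="")
                    writers[key] = csv.DictWriter(files[key], fieldnames=reader.fieldnames)
                    writers[key].writeheader()
                writers[key].writerow(row)
        for f in files.values():
            f.close()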

Directory structure

As the ingestor has been optimized to process data files in parallel, it is recommended that you split your data into multiple files. Depending on its structure, you could split your data into separate files based on the following (a path-building sketch follows the list):

  • time period, such as days, weeks, or months; for example:

    /events/2018-W28/<file>

    /events/2018/11/05/<file>

    /events/2018/08/02/23/<file>

  • locale, such as city, country, or continent; for example:

    /poi/europe/<file>

    /poi/europe/spain/<file>

    /poi/europe/spain/madrid/<file>

  • data source, such as personal devices or industrial equipment; for example:

    /events/roadrunner/2018/11/05/<file>

    /events/wileycoyote/2018/11/05/<file>

    /events/boston/cars/2018/11/05/<file>

    /events/boston/buses/2018/11/05/<file>
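
As a minimal sketch of building such paths, assuming records carry a source name and a timestamp (the partition_path helper and its arguments are hypothetical, not part of the ingestor):

    from datetime import datetime
    from pathlib import Path

    def partition_path(root: str, source: str, ts: datetime, name: str) -> Path:
        """Build a path such as /events/<source>/<YYYY>/<MM>/<DD>/<file>."""
        return Path(root) / source / f"{ts:%Y}" / f"{ts:%m}" / f"{ts:%d}" / name

    # partition_path("/events", "roadrunner", datetime(2018, 11, 5), "00001.csv.gz")
    # -> /events/roadrunner/2018/11/05/00001.csv.gz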

Using a directory structure such as this enables:

  • recursive ingests: If you organize your data into a directory structure based on time period, locale, or data source, you can ingest a whole subtree by specifying the parent directory in the ingest command. For example, using /events/2018/11/05/ ingests a whole day's worth of data, while /events/2018/11/ ingests a whole month (see the sketch after this list).
  • incremental ingests: By adding the latest data to a directory for the current day or week, you can ingest just that new data into your Platform, to be analyzed by the data analysis tools.
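
The ingest command itself is product-specific, so the following is only a sketch of why the layout helps: with date-based directories, selecting a day or a month is just a matter of choosing how deep a prefix to walk. The files_under helper is hypothetical:

    from pathlib import Path

    def files_under(prefix: str) -> list:
        """Collect every data file beneath a parent directory, recursively."""
        return sorted(p for p in Path(prefix).rglob("*") if p.is_file())

    day_files = files_under("/events/2018/11/05")    # one day's worth of data
    month_files = files_under("/events/2018/11")     # a whole month, one level up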

File size

The optimum size for a data file is 50 to 500 MiB (uncompressed), although the ingestor can process files larger than 1 GiB. You should aim to distribute your data across a number of files, but avoid creating many small files (under 10 MiB): for such files, the overhead of processing each file outweighs the work of ingesting the data itself, reducing the efficiency of the ingestor.

For example, if you are generating multiple GiB of data per day, you should split the data into files of approximately equal size (measured either in bytes or in rows), such as the following (a chunking sketch follows the listing):

/events/2018/11/05/00001.csv.gz

/events/2018/11/05/00002.csv.gz

/events/2018/11/05/00003.csv.gz

/events/2018/11/05/00004.csv.gz

/events/2018/11/05/00005.csv.gz

/events/2018/11/05/00006.csv.gz
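
As a minimal sketch of such a split, using row count as a stand-in for size (the ROWS_PER_FILE value and the single-CSV input are assumptions you would tune for your own data), one day's data could be chunked into numbered gzip files like those above:

    import csv
    import gzip
    from pathlib import Path

    ROWS_PER_FILE = 1_000_000  # Tune so each chunk is 50-500 MiB uncompressed.

    def split_day(source: str, out_dir: str) -> None:
        """Split one day's CSV into numbered chunks: 00001.csv.gz, 00002.csv.gz, ..."""
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        with open(source, newline="") as src:
            reader = csv.reader(src)
            header = next(reader)
            chunk, writer, dst = 0, None, None
            for i, row in enumerate(reader):
                if i % ROWS_PER_FILE == 0:
                    if dst:
                        dst.close()
                    chunk += 1
                    dst = gzip.open(out / f"{chunk:05d}.csv.gz", "wt", newline="")
                    writer = csv.writer(dst)
                    writer.writerow(header)  # Repeat the header in every chunk.
                writer.writerow(row)
            if dst:
                dst.close()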

Data schema

While the ingestor does not require a schema to process your source input data, it is recommended that you provide one for each data source, describing what each field represents, including the key for any enumerated values. This information can be used to create a dataset description (the JSON file that describes the data), which the ingestor uses to create the indexes and user interface labels used by the data analysis tools. Knowing the context and meaning of each field makes this file much easier to write.

For example, it is useful to know whether a field labeled accuracy is measured in meters, kilometers, feet, or miles, or even whether it is a percentage. Similarly, it is also helpful to include the context of the field: "the accuracy of the longitude and latitude position".
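
The exact format of the dataset description is defined by your Platform, so the following is purely an illustration of the kind of per-field metadata worth capturing; every key shown is an assumption:

    import json

    # Every key below is an assumption; consult your Platform's documentation
    # for the real dataset description format.
    accuracy_field = {
        "name": "accuracy",
        "type": "float",
        "unit": "meters",
        "description": "The accuracy of the longitude and latitude position.",
    }
    print(json.dumps(accuracy_field, indent=2))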