Organizing your data files

To get the best performance from the ingestor, it is important to organize your source input data files appropriately.

When arranging your data, you should consider:

  • the size of individual data files
  • the directory structure

File size

The ingestor is designed to process data in parallel. For optimal ingest performance, you should aim to distribute your data across a number of files.

The optimum size for a data file is 50 to 500 MiB (uncompressed), although the ingestor can process files of over 1 GiB. Avoid creating many small files (under 10 MiB): for such files, the fixed overhead of processing each file outweighs the work of ingesting the data it contains, which reduces the efficiency of the ingestor.

For example, if you are generating multiple GiB of data per day, you should split the data into files of approximately similar size (measured either by size or by number of rows), such as:

/events/2018/11/05/00001.csv.gz
/events/2018/11/05/00002.csv.gz
/events/2018/11/05/00003.csv.gz
/events/2018/11/05/00004.csv.gz
/events/2018/11/05/00005.csv.gz
/events/2018/11/05/00006.csv.gz

When splitting your data into smaller files, also consider aligning the file boundaries with the distribution of your data. For example, if your data relates to events over the course of a year, each file might naturally contain the data for one week of that year.
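As a rough illustration of splitting data into similarly sized files, the sketch below writes rows to numbered, gzip-compressed CSV files like those listed above. The function name write_chunks and the rows_per_file parameter are hypothetical and not part of the ingestor. The sketch splits by row count, so it only produces files of similar size if the rows are of similar length, and it assumes the files carry no header row. You would need to tune rows_per_file for your own data so that each uncompressed file lands in the 50 to 500 MiB range.

```python
import csv
import gzip
from itertools import islice
from pathlib import Path
from typing import Iterable, List, Sequence


def write_chunks(
    rows: Iterable[Sequence[str]], out_dir: Path, rows_per_file: int
) -> List[Path]:
    """Write rows to 00001.csv.gz, 00002.csv.gz, ... with at most
    rows_per_file rows in each file. Hypothetical helper, not an ingestor API."""
    if rows_per_file < 1:
        raise ValueError("rows_per_file must be at least 1")

    out_dir.mkdir(parents=True, exist_ok=True)
    iterator = iter(rows)
    written: List[Path] = []

    while True:
        # Take the next batch of rows; an empty batch means the input is exhausted.
        batch = list(islice(iterator, rows_per_file))
        if not batch:
            break
        path = out_dir / f"{len(written) + 1:05d}.csv.gz"
        # Text mode with newline="" lets the csv module control line endings.
        with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(batch)
        written.append(path)

    return written
```

Because the rows are consumed lazily, the input can be a generator that reads from a much larger source without loading it all into memory.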

Directory structure

As well as splitting up your data into smaller files, it is recommended that you organize those files in a logical directory structure, reflecting the distribution of the underlying data.

Depending on your data, you could organize your files into directories based on:

  • time period, such as days, weeks, or months; for example:
/events/2018-W28/<file>
/events/2018/11/05/<file>
/events/2018/08/02/23/<file>
  • locale, such as city, country, or continent; for example:
/poi/europe/<file>
/poi/europe/spain/<file>
/poi/europe/spain/madrid/<file>
  • data source, such as personal devices or industrial equipment; for example:
/events/roadrunner/2018/11/05/<file>
/events/wileycoyote/2018/11/05/<file>
/events/boston/cars/2018/11/05/<file>
/events/boston/buses/2018/11/05/<file>
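The time-based and source-based layouts above can be generated programmatically when you write your files. The following is a minimal sketch, assuming you know the date of the data and, optionally, a data source name. The helper names day_directory and week_directory are hypothetical, and the week layout assumes that the "2018-W28" style uses ISO 8601 week numbers.

```python
from datetime import date
from pathlib import PurePosixPath


def day_directory(root: str, day: date, source: str = "") -> PurePosixPath:
    """Build <root>/[<source>/]YYYY/MM/DD for the given day (hypothetical helper)."""
    base = PurePosixPath(root)
    if source:
        base = base / source
    # Zero-padded components keep directories in chronological order when sorted.
    return base / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"


def week_directory(root: str, day: date) -> PurePosixPath:
    """Build <root>/YYYY-Www, assuming ISO 8601 week numbering."""
    iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
    return PurePosixPath(root) / f"{iso[0]}-W{iso[1]:02d}"
```

For example, day_directory("/events", date(2018, 11, 5), "roadrunner") yields /events/roadrunner/2018/11/05, and week_directory("/events", date(2018, 7, 9)) yields /events/2018-W28.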

Using a directory structure such as this enables you to perform:

  • hierarchical ingests: If you organize your data into a directory structure based on time period, locale, or data source, you can ingest the data from any level in the hierarchy by specifying the corresponding directory in the ingest command. For example, specifying /events/2018/11/05/ ingests the data for a single day, while specifying /events/2018/11/ ingests a whole month.
  • incremental ingests: By adding new data, as it arrives, to a new sub-directory within the overall directory structure, you can easily identify and ingest just the new data into an existing dataset.
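To make these two ideas concrete, the sketch below works on a plain list of file paths rather than calling the ingestor, whose command syntax is not covered here. The function names files_under and new_directories are hypothetical, and the set of already ingested directories is assumed to be something you track yourself.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Set


def files_under(paths: Iterable[str], directory: str) -> List[str]:
    """Hierarchical selection: return the files at any depth below directory."""
    root = PurePosixPath(directory)  # a trailing slash is normalized away
    return sorted(p for p in paths if root in PurePosixPath(p).parents)


def new_directories(paths: Iterable[str], already_ingested: Set[str]) -> List[str]:
    """Incremental selection: return directories that hold files but have
    not yet been recorded as ingested."""
    directories = {str(PurePosixPath(p).parent) for p in paths}
    return sorted(directories - already_ingested)
```

With files for 2018/11/05, 2018/11/06, and 2018/12/01, files_under(paths, "/events/2018/11/") selects the two November days, while new_directories(paths, {"/events/2018/11/05"}) returns only the directories added since the first ingest.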

Recording the source data fields

For each source of data, it is recommended that you keep a record of the kinds of data being represented, describing what each field represents, including a key for any enumerated values. Having such a record will help you write the data source description for the dataset, and will help you keep your data consistent as more data is added to your datasets.

For example, it is useful to know whether a field labeled accuracy is measured in meters, kilometers, feet, or miles, or whether it is a percentage. Similarly, it is helpful to include the context of the field: "the accuracy of the longitude and latitude position".
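One possible way to keep such a record is as a small, version-controlled data structure stored alongside the data. The sketch below is only an illustration: the FieldRecord class, its attribute names, and the undocumented_fields helper are hypothetical, and the unit and enumerated values shown are invented examples rather than properties of any real dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FieldRecord:
    """Describes one source field (hypothetical structure, not an ingestor format)."""
    name: str
    description: str
    unit: Optional[str] = None                               # e.g. "meters"; None if unitless
    enum_key: Dict[str, str] = field(default_factory=dict)   # raw value -> meaning


# Invented example records; replace with the facts about your own data.
RECORDS = [
    FieldRecord(
        name="accuracy",
        description="The accuracy of the longitude and latitude position",
        unit="meters",
    ),
    FieldRecord(
        name="status",
        description="Device state at the time of the event",
        enum_key={"0": "idle", "1": "moving"},
    ),
]


def undocumented_fields(header: List[str], records: List[FieldRecord]) -> List[str]:
    """Return the header columns that have no matching record, in header order."""
    known = {record.name for record in records}
    return [column for column in header if column not in known]
```

Running undocumented_fields against the header of each new batch of files is a cheap way to notice when a source starts sending a field that has not yet been described.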

It is important to keep this record because, once you have created a dataset, you cannot change data types without ingesting all of the data again.