Source input data formats

File formats

The GeoSpock ingestor supports the following file formats:

  • TSV format
  • CSV format
  • JSON Lines format containing flat objects
  • Parquet format

For all file formats, field values must be string, numeric or boolean.

If your source input data is in a different format, you will have to process it so that it conforms to one of the supported ingest formats.

TSV format

Data in TSV format should comply with the following:

  • Fields must be separated by a Horizontal Tab (character 9)
  • Content that contains Horizontal Tabs must be quoted with double quotes (")
  • Data must be encoded with UTF-8
  • Lines must be terminated by a Line Feed (character 10)
  • File names must be suffixed by .tsv

Be aware that:

  • The ingestor does not trim spaces from fields. If you do not want spaces in the ingested data, you must remove them from the fields before you ingest the source input data
  • The ingestor will strip double quotes that surround a field
  • The ingestor ignores the header line in a file, and does not use it to determine the content of the fields
  • Field ordering must remain the same between files which are part of the same dataset

Example content

In this example, \t indicates a tab character:

2aadb-99d-97943\t42.32365\t44.538375\t12.5\t1041037198

For information on how to describe TSV format data in a data source description, refer to Describing TSV format data.
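
The TSV rules above can be sketched with Python's standard csv module. This is a minimal illustration, not part of the ingestor: the function name write_tsv and its strip option are hypothetical, and stripping is shown only because the ingestor does not trim spaces itself.

```python
import csv
import io


def write_tsv(rows, stream, strip=True):
    """Write rows as tab-separated lines terminated by a Line Feed.

    Hypothetical helper: `strip` removes leading and trailing spaces from
    string fields, because the ingestor will not do so.
    """
    writer = csv.writer(
        stream,
        delimiter="\t",             # Horizontal Tab (character 9)
        quotechar='"',              # fields containing tabs are double-quoted
        quoting=csv.QUOTE_MINIMAL,  # quote only the fields that need it
        lineterminator="\n",        # Line Feed (character 10), not CRLF
    )
    for row in rows:
        if strip:
            row = [v.strip() if isinstance(v, str) else v for v in row]
        writer.writerow(row)


buffer = io.StringIO()
write_tsv([["2aadb-99d-97943", 42.32365, 44.538375, 12.5, 1041037198]], buffer)
print(repr(buffer.getvalue()))
# '2aadb-99d-97943\t42.32365\t44.538375\t12.5\t1041037198\n'
```

When writing to disk instead of an in-memory buffer, open the file with encoding="utf-8" and newline="" so that Python neither changes the encoding nor translates the line endings.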

CSV format

Data in CSV format should comply with the following:

  • Fields must be separated by a comma
  • Data must be encoded with UTF-8
  • Lines must be terminated by a Line Feed (character 10)
  • File names must be suffixed by .csv

Be aware that:

  • The ingestor does not trim spaces from fields. If you do not want spaces in the ingested data, you must remove them from the fields before you ingest the source input data
  • Content that contains commas must be quoted with double quotes (")
  • If a string contains double quotes, you must enclose the whole string in double quotes and escape each embedded double quote by doubling it, for example:
"BA ""Bad Attitude"" Baracas"
  • The ingestor ignores the header line in a file, and does not use it to determine the content of the fields
  • Field ordering must remain the same between files which are part of the same dataset

Example content

"2aadb-99d-97943",42.32365,44.538375,12.5,1041037198

For information on how to describe CSV format data in a data source description, refer to Describing CSV format data.
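
The quoting rules above match the default behavior of Python's standard csv module, so one way to produce compliant lines is sketched below. The function name to_csv_line is hypothetical and the sample values are illustrative only.

```python
import csv
import io


def to_csv_line(fields):
    """Return one CSV line for `fields`, terminated by a Line Feed."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        doublequote=True,           # an embedded " is written as ""
        quoting=csv.QUOTE_MINIMAL,  # quote only fields with commas or quotes
        lineterminator="\n",        # Line Feed (character 10), not CRLF
    )
    writer.writerow(fields)
    return buffer.getvalue()


print(to_csv_line(['BA "Bad Attitude" Baracas', "Los Angeles, CA", 12.5]), end="")
# "BA ""Bad Attitude"" Baracas","Los Angeles, CA",12.5
```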

JSON Lines format

Data in JSON Lines format should comply with the following:

  • Each record must be JSON encoded on a single line, terminated by a Line Feed (for more information, see http://jsonlines.org/)
  • The root element must be an object
  • Elements of that object must be numbers, booleans or strings
  • File names should be suffixed by .jsonl
  • The file must contain only flat objects

Be aware that:

  • The object properties may be in any order. The ingestor uses the property name to differentiate the fields, so you must ensure that property names are consistent between files
  • Within the JSON Lines structural elements, your data can include spaces:
    • between the braces
    • before or after quotes surrounding content
    • around the colon or comma

Example content

{"uuid": "2aadb-99d-97943", "lat": 42.32365, "lon": 44.538375, "calories": 12.5, "timestamp": 1041037198}

For information on how to describe JSON Lines format data in a data source description, refer to Describing JSON Lines data.
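
As a rough sketch of the JSON Lines rules above, the following Python function serializes one flat record per line and rejects nested values. The name to_jsonl_line is hypothetical. Rejecting null values and NaN is an assumption of this sketch: the rules above list only strings, numbers and booleans.

```python
import json

# Value types allowed in a flat record: string, numeric or boolean.
ALLOWED_TYPES = (str, int, float, bool)


def to_jsonl_line(record):
    """Serialize a flat dict as one JSON Lines record ending in a Line Feed."""
    if not isinstance(record, dict):
        raise ValueError("the root element must be an object")
    for name, value in record.items():
        if not isinstance(value, ALLOWED_TYPES):
            # Nested objects, arrays and None are rejected (assumption for None).
            raise ValueError(f"property {name!r} is not a string, number or boolean")
    # allow_nan=False rejects NaN and Infinity, which are not valid JSON.
    return json.dumps(record, ensure_ascii=False, allow_nan=False) + "\n"


record = {"uuid": "2aadb-99d-97943", "lat": 42.32365, "lon": 44.538375,
          "calories": 12.5, "timestamp": 1041037198}
print(to_jsonl_line(record), end="")
```

Because the ingestor identifies fields by property name, the dictionary keys need to be spelled identically in every file of the dataset, but their order does not matter.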

Parquet format

Apache Parquet is an open source file format for Hadoop that can store nested data structures in a flat columnar format. It is available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. It provides efficient data compression and encoding schemes that improve performance when handling complex data in bulk.

For information on how to describe Parquet format data in a data source description, refer to Describing Parquet data.

File compression

Data files may be uncompressed, or compressed with:

  • bzip2 (with the .bz2 suffix)
  • lzo (with the .lzo suffix)
  • gzip (with the .gz suffix)
  • Snappy (with the .snappy suffix)
  • lz4 (with the .lz4 suffix)
  • deflate

Each compressed data file must have the file extension that corresponds to its compression format, so that the ingestor can process the data correctly.
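
As an illustration, the following Python sketch compresses a single data file with gzip and names the result with the .gz suffix. The function name gzip_file and the file name part-0001.csv are hypothetical. Appending .gz after the existing .csv suffix (giving part-0001.csv.gz) is an assumption of this sketch, so check it against the naming your deployment expects.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_file(source):
    """Compress `source` into a single gzip file and return the new path.

    Assumption: the compression suffix is appended to the existing name,
    for example part-0001.csv becomes part-0001.csv.gz.
    """
    source = Path(source)
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the data without loading it all
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        data_file = Path(folder) / "part-0001.csv"
        data_file.write_text("2aadb-99d-97943,42.32365,44.538375\n", encoding="utf-8")
        print(gzip_file(data_file).name)
```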

The ingestor does not support split archives, so make sure that each data file is small enough to be compressed into a single archive. For further guidance, see the documentation about file size.