Parquet format

Apache Parquet is an open source file format for Hadoop that can store nested data structures in a flat columnar format. It is available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. It provides efficient compression and encoding schemes that perform well when handling complex data in bulk.
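
To give a rough feel for what "nested data in a flat columnar format" means, the following Python sketch turns nested records into one list of values per leaf field. It is a simplified illustration only: the record layout and the to_columns helper are made up for this example, and real Parquet files also track repeated and optional fields using repetition and definition levels, which this sketch ignores.

```python
def to_columns(records):
    """Turn nested dict records into {dotted_path: [one value per record]}."""

    def leaves(node, prefix=""):
        # Walk a nested dict and yield (dotted_path, value) for each leaf.
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                yield from leaves(value, path)
            else:
                yield path, value

    flat_rows = [dict(leaves(record)) for record in records]
    paths = sorted({path for row in flat_rows for path in row})
    # A leaf that is absent from a record is stored as None in that column.
    return {path: [row.get(path) for row in flat_rows] for path in paths}


# Hypothetical sample records, not taken from any real dataset.
records = [
    {"tid": "a1", "position": {"lat": 52.2, "lon": 0.12}},
    {"tid": "b2", "position": {"lat": 51.5, "lon": -0.13}},
]
columns = to_columns(records)
```

Each entry in columns holds all the values of one leaf field together, which is the layout that makes per-column compression and encoding effective.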

Describing Parquet data

To ingest your source data, you need to provide the ingestor with a description of that data. The ingestor uses this data source description to store the ingested data correctly in the GeoSpock database, enabling you to run queries and analyze your data. For more information, see Creating a data source description for a dataset.

The following table shows the fields you must provide when describing this format of data in a data source description.

Setting Description
id

The name of the column in the SQL table

Example: event_elevation

The id specified should contain only numbers, lowercase letters and uppercase letters. The id must not contain spaces or any of the following characters: ,;{}()\n\t=_-

sourceFieldName

The name of the field in the Parquet object

Example: "height1"

purpose

(Optional) This setting enables you to identify the following fields:

  • latitude
  • longitude
  • elevation
  • source_id

See Special fields (purpose) for more information.

sqlType

The data type for this field. For more information about the data types supported, see Types of data.

Example: REAL

For example:

{
	"id": "taxi_id",
	"sourceFieldName": "tid",
	"purpose": "SOURCE_ID",
	"sqlType": "VARCHAR"
}
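
If you generate field descriptions programmatically, a small helper can keep the setting names consistent. The following Python sketch is one possible approach, not part of any GeoSpock tooling: make_field and missing_settings are hypothetical helpers, and the list of required settings is inferred from the table above, where only purpose is marked as optional.

```python
import json

# Settings the table above lists without the "(Optional)" marker.
REQUIRED_SETTINGS = ("id", "sourceFieldName", "sqlType")


def make_field(field_id, source_field_name, sql_type, purpose=None):
    """Build one field entry; 'purpose' is omitted when it is not given."""
    field = {"id": field_id, "sourceFieldName": source_field_name}
    if purpose is not None:
        field["purpose"] = purpose
    field["sqlType"] = sql_type
    return field


def missing_settings(field):
    """Return the required settings that are absent or empty in a field entry."""
    return [name for name in REQUIRED_SETTINGS if not field.get(name)]


taxi = make_field("taxi_id", "tid", "VARCHAR", purpose="SOURCE_ID")
print(json.dumps(taxi, indent="\t"))
```

Calling missing_settings on an entry before writing it out gives an early warning about incomplete entries. It does not check the id naming rules or whether the sqlType is supported.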

Data validation

A valid Parquet file has rows that are consistent with its own internal schema, so your source data is unlikely to be rejected because of invalid data.

For a given row, if the source field:

  • is referenced in the data source description but does not exist in the Parquet file, the value associated with that field is considered invalid.
  • is an empty string, it is interpreted as an empty string (whether this is valid depends on its field specification).
  • has a null value, it is interpreted as NULL (whether this is valid depends on its field specification).
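
The three cases above can be sketched in Python as follows. This only illustrates the rules as stated and is not the ingestor's implementation: the interpret function, the status labels and the use of a dict to stand in for a Parquet row are all assumptions made for the example, and the later check against the field specification is left out.

```python
# Sentinel that distinguishes "field is absent" from "field is present but null".
_ABSENT = object()


def interpret(row, source_field_name):
    """Classify one source field of a row as INVALID, NULL or VALUE."""
    value = row.get(source_field_name, _ABSENT)
    if value is _ABSENT:
        # The referenced field does not exist in the row.
        return ("INVALID", None)
    if value is None:
        # A null is carried through as NULL.
        return ("NULL", None)
    # Any other value, including an empty string, is passed through unchanged.
    return ("VALUE", value)


# Hypothetical row: "tid" is an empty string, "height1" is null.
row = {"tid": "", "height1": None}
results = {name: interpret(row, name) for name in ("tid", "height1", "speed")}
```

Here "speed" is reported as invalid because it is absent from the row, while the empty string and the null are passed on for the field specification to accept or reject.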