The Analytics API concepts

Before you start using the Analytics API, you should familiarize yourself with some of its basic concepts, including:

  • Spark datasets: these are fundamental to using the Analytics API APIs, but are different from the GeoSpock datasets used by the Data Explorer, illumin8
  • standard Spark operations: you can use standard Spark operations alongside the Analytics API APIs in your custom analytics

Using Spark datasets

To use the Analytics API APIs, you will create a Spark dataset, which contains columns for:

  • latitude
  • longitude
  • time
  • (optionally) additional dataset-specific columns

This Spark dataset is the equivalent of a data layer in a GeoSpock dataset, as used by the Data Explorer. Your source input data is ingested into a GeoSpock dataset, and during this process a number of data layers are created. Each data layer contains a subset of your data, such as taxi pick-up points, supermarket locations, or ad requests. The data layers in your dataset are determined by your dataset description. When you create a Spark dataset, you specify the data layer that you want to use, and the Analytics API creates a Spark dataset containing that layer's data. If you want to use the data from more than one data layer, for example to perform a comparison, create a separate Spark dataset for each layer.
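
As an illustration of the shape of such a Spark dataset, the following sketch uses standard Spark only; the real dataset is created with the Analytics API, as described in Creating a Spark dataset object, and the column names, values, and local master used here are placeholders:

    import org.apache.spark.sql.SparkSession

    // Illustrative only: a dataset with latitude, longitude and time columns.
    case class LayerRow(latitude: Double, longitude: Double, time: Long)

    val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
    import spark.implicits._

    val taxiPickups = Seq(
      LayerRow(51.5074, -0.1278, 1546300800000L),
      LayerRow(51.5155, -0.0922, 1546304400000L)
    ).toDS()

    // Print the column structure of the illustrative dataset.
    taxiPickups.printSchema()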

For more information about how to identify the layers in a GeoSpock dataset and create a Spark dataset, see Creating a Spark dataset object.

The data layers available in your GeoSpock dataset depend on the dataset description for your particular data, and this will have been created in collaboration with your GeoSpock account manager.

Working with standard Spark operations

Whilst the Analytics API APIs extend the capabilities of Spark, you can use standard Spark operations as part of your custom analytics, to complement the functionality provided by the Analytics API. The examples in this documentation include standard Spark operations, such as:

  • caching a Spark dataset:

    ...
    val cachedDataset = dataset.cache()
    ...

  • filtering datasets for specific property values:

    ...
    dataset
        .filter(s"lat > $lat")
    ...

  • producing summary statistics for a dataset (average, minimum, maximum):

    ...
    dataset
        .agg(avg(col("salary")))
        .collect()(0)
        .getAs[Double](0)
    ...
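
    The same pattern extends to the other summary statistics; for example, average, minimum and maximum can be produced in a single aggregation (standard Spark functions from org.apache.spark.sql.functions; the column name is illustrative):

    dataset
        .agg(avg(col("salary")), min(col("salary")), max(col("salary")))
        .show()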

  • adding new columns with User Defined Functions:

    ...
    val distanceFunction = udf(SphericalEarth.distance _)
    val footfallWithDistance = footfallFromToken
        .withColumn("distance", distanceFunction(col("lat"), col("lon"), col("impressionlat"), col("impressionlon")))
    ...
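
    The withColumn pattern works in the same way with any Scala function wrapped in udf; for example, a simple unit conversion (illustrative, and assuming the distance values are in metres):

    val toKilometres = udf((metres: Double) => metres / 1000.0)
    val footfallWithDistanceKm = footfallWithDistance
        .withColumn("distanceKm", toKilometres(col("distance")))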

  • exporting a dataset to file:

    ...
    dataset.coalesce(1).write.csv("path/to/file")
    ...
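
    Here, coalesce(1) moves the output into a single partition so that a single CSV file is written; for large datasets you may prefer to omit it and let Spark write one file per partition. Standard writer options can also be passed, for example to include a header row:

    dataset.coalesce(1).write.option("header", "true").csv("path/to/file")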

  • exposing a dataset as an SQL table:

    ...
    dataset.createOrReplaceTempView("TABLE_NAME")
    ...
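
    Once registered, the view can be queried with standard Spark SQL, assuming a SparkSession named spark (the query and column name here are illustrative):

    val results = spark.sql("SELECT * FROM TABLE_NAME WHERE lat > 51.0")
    results.show()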

For more information about Spark and its operations, refer to the Spark documentation.