GBQ Views

Guide for creating views in Google BigQuery when sending data to Arize

Minimize errors and increase performance and efficiency when sending data to Arize via a GBQ view. For answers to frequently asked questions about BigQuery and Arize, refer to the FAQ section.

GBQ Views Best Practices

  1. The underlying table should be partitioned on the column used for the change_timestamp field so that Arize doesn't rescan all of the data each time the view is queried (see the sketch below).
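
As a rough sketch (the table and column names here are illustrative, not Arize requirements), the source table behind the view could be partitioned by day on the column that will be exposed as change_timestamp:

-- Hypothetical source table, partitioned by day on the column
-- that the view will expose as change_timestamp.
CREATE TABLE mydataset.predictions (
    prediction_id STRING,
    prediction_ts TIMESTAMP,
    score FLOAT64,
    updated_at TIMESTAMP
)
PARTITION BY DATE(updated_at);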

  2. It is recommended that the change_timestamp field be unique when possible. This isn't a requirement, but it helps improve ingestion performance.

  3. Ensure that the view's logic does not allow new rows to appear with a change_timestamp below the watermark of the last query; otherwise, those rows will not be ingested (one possible approach is sketched below).
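
One possible way to satisfy this (an assumption, not a requirement) is to expose a write/update timestamp as change_timestamp, so that late-arriving rows still carry timestamps at or above the previous watermark. Continuing the illustrative table above:

-- Hypothetical view: change_timestamp reflects when the row was last written,
-- so rows added later never fall below an earlier watermark.
CREATE OR REPLACE VIEW mydataset.predictions_view AS
SELECT
    prediction_id,
    prediction_ts,
    score,
    updated_at AS change_timestamp
FROM mydataset.predictions;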

  4. Specify a time range for data in the view.

    • Use a WHERE clause to filter data using date columns. For example, if you want to send the last 3 months of data to Arize for your historical backfill, use WHERE date > current_date() - 90. Only send the amount of historical data necessary for your use case.
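
Applied inside a view definition, that filter might look like the following sketch (names are illustrative):

-- Hypothetical view limited to roughly the last 3 months of data.
CREATE OR REPLACE VIEW mydataset.predictions_last_90d AS
SELECT
    prediction_id,
    prediction_ts,
    score,
    updated_at AS change_timestamp
FROM mydataset.predictions
WHERE DATE(prediction_ts) > CURRENT_DATE() - 90;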

  5. When the historical data volume is quite large, have separate views for the historical backfill and for sending data on an ongoing basis.

    • To backfill historical data efficiently while sending new data on an ongoing basis, you can set up a separate view for historical data. You can then run two import jobs simultaneously - one backfilling historical data in the background while another ingests more recent data.

    • To accomplish this, specify a different date filter in the view with recent data than in the historical view, and make sure the filters do not overlap so that no rows appear in both views (see the sketch after the export example below). The exact cutoff will vary case by case; if the historical data volume is not too large, it may not be worth creating a separate view for regular ingestion.

    • Alternatively, if the historical backfill contains a very large number of rows (e.g., ~30+ million), you can export that historical data to GCS and use the regular file importer method, which may ingest it faster than the GBQ table integration. You can also try tuning the ingestion job parameters to read the historical data more efficiently - for example, by changing the query window size to an interval that spans at most a single partition.

    • Export to a Parquet file to preserve the correct data types. If you need to export from a view because transformations are required, you can export the results of a SQL query as shown below:

-- Export the query results to GCS as Parquet files
-- (the '*' lets BigQuery shard the output into multiple files).
EXPORT DATA OPTIONS(
    uri='gs://bucket/folder/*',
    format='PARQUET',
    overwrite=true
)
AS
SELECT
    field1, field2
FROM mydataset.table1
ORDER BY field1;
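
For the separate historical and ongoing views described above, one possible arrangement is sketched below; the cutoff date and all table, view, and column names are illustrative and should be chosen so that the two views never overlap:

-- Hypothetical historical view: rows strictly before a fixed cutoff date.
CREATE OR REPLACE VIEW mydataset.predictions_historical AS
SELECT
    prediction_id,
    prediction_ts,
    score,
    updated_at AS change_timestamp
FROM mydataset.predictions
WHERE DATE(prediction_ts) < DATE '2023-01-01';

-- Hypothetical ongoing view: rows on or after the same cutoff,
-- so no row appears in both views.
CREATE OR REPLACE VIEW mydataset.predictions_ongoing AS
SELECT
    prediction_id,
    prediction_ts,
    score,
    updated_at AS change_timestamp
FROM mydataset.predictions
WHERE DATE(prediction_ts) >= DATE '2023-01-01';
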
  6. Use a materialized view for complex joins (see the sketch below).
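
As a sketch of that idea (subject to BigQuery's materialized view query limitations; the actuals table and all names here are illustrative), a join can be precomputed once so the view sent to Arize reads from the precomputed result instead of re-running the join on every scan:

-- Hypothetical materialized view that precomputes a join; downstream
-- queries read the stored result rather than re-joining the base tables.
CREATE MATERIALIZED VIEW mydataset.predictions_with_actuals AS
SELECT
    p.prediction_id,
    MAX(p.prediction_ts) AS prediction_ts,
    MAX(p.updated_at) AS change_timestamp,
    MAX(a.actual_label) AS actual_label
FROM mydataset.predictions AS p
JOIN mydataset.actuals AS a
    ON p.prediction_id = a.prediction_id
GROUP BY p.prediction_id;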

  7. Use an authorized view for views that access a lot of different underlying tables.

    • If your view accesses many different underlying tables, you can create an authorized view so that the view can be queried without granting access to each of the underlying tables.
