Notebook Instance with Greater than 20GB of Data
Using Arize with a Notebook Instance with more than 20G of data
Copyright © 2023 Arize AI, Inc
This section covers the changes to the SDK API call needed to send more than 20GB of data from a SageMaker Notebook instance.
To speed up data transfer, the Arize SDK pandas call relies on two features of Notebook instances:
The Arize SDK serializes a pandas dataframe from Python to the file system using a fast serialization library that leverages C++.
The file is uploaded to the server from the file system using methods that maximize throughput.
The choice to serialize to the file system was made after extensive testing showed it was the fastest open method for serializing a pandas dataframe. It was compared extensively against serializing and uploading directly from Python.
The above diagram shows how the SDK uses the SageMaker instance's local file system to store a file prior to sending. The SDK quickly serializes the dataframe to a file on the local file system and then uploads that file to the Arize platform. For files smaller than 20GB, this method is transparent to the user of the SDK.
The /tmp directory that the Arize SDK uses by default is limited to 20GB, regardless of the file system size configured for the instance. To support larger files:
Set the path variable of the Python SDK pandas call to point to the local file system
Ensure the instance is set up with enough local file storage to hold your data
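As a rough sketch of the first step, the helper below forwards a `path` argument to a client's `log` call so that the temporary file lands on the attached volume rather than in "/tmp". The helper name `log_with_staging_path`, the directory name `arize_staging`, and the file name `arize_upload.bin` are hypothetical and chosen only for illustration. The sketch also assumes that `path` names a file rather than a directory, and that the client is the Arize pandas logger client created elsewhere; check the SDK reference for the exact signature before relying on it.

```python
import os


def log_with_staging_path(client, staging_dir, file_name="arize_upload.bin", **log_kwargs):
    """Call client.log(...) with `path` pointing into `staging_dir`.

    `client` is expected to expose a `log` method that accepts a `path`
    keyword (assumed here to be the Arize pandas logger client).
    `staging_dir` should live on the attached volume, not in /tmp.
    """
    # Create the staging directory if it does not exist yet.
    os.makedirs(staging_dir, exist_ok=True)

    # Assumption: `path` is a file path. The file name is hypothetical.
    staging_path = os.path.join(staging_dir, file_name)

    # All other arguments (dataframe, schema, ...) are passed through untouched.
    return client.log(path=staging_path, **log_kwargs)


# Hypothetical usage, with `arize_client`, `df`, and `schema` created elsewhere:
# response = log_with_staging_path(
#     arize_client,
#     "/home/ec2-user/SageMaker/arize_staging",
#     dataframe=df,
#     schema=schema,
# )
```

Keeping the path handling in one small function makes it easy to change the staging location later without touching every call site.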
The above example shows how to set the path variable in the SDK to point to the local file system.
To check the available storage in "/tmp" or "/home/ec2-user/SageMaker", you can query it from code in the notebook.
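One way to run this check from a notebook cell is with Python's standard `shutil.disk_usage`. This is a minimal sketch, not necessarily the snippet the SDK documentation originally showed; the helper names `free_gib` and `report_free_space` are hypothetical.

```python
import shutil


def free_gib(path):
    """Return the free space, in GiB, on the file system that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def report_free_space(paths):
    """Map each path to its free space in GiB, or None if it cannot be read."""
    report = {}
    for path in paths:
        try:
            report[path] = round(free_gib(path), 1)
        except OSError:
            # The path does not exist or is not accessible on this machine.
            report[path] = None
    return report


if __name__ == "__main__":
    # The two locations discussed in this section.
    for path, gib in report_free_space(["/tmp", "/home/ec2-user/SageMaker"]).items():
        if gib is None:
            print(f"{path}: not available")
        else:
            print(f"{path}: {gib} GiB free")
```

Comparing the reported free space with the size of the data you plan to send shows whether the chosen directory has enough room before the upload starts.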
The default instance is set up with 5GB. To set a higher value, open the advanced section and enter a larger value for file system storage.
The attached volume section determines how much space is available in the ~/SageMaker volume.
The volume size does not change the size of the "/tmp" directory, which is the default used by the SDK. The path variable must still be used to point to the local volume.