When working with machine learning models, especially large ones, security is a constant concern. Most developers trust the format and the surrounding ecosystem and hope everything works as advertised. This is where formats like safetensors become crucial. A recent independent audit has confirmed that safetensors is not only safe but also ready to become the new standard for AI model formats.
Why Format Safety Matters
Imagine downloading a model, integrating it into your system, and unknowingly inviting malicious code. This is a tangible risk with traditional formats like PyTorch's .pt or TensorFlow's .pb, which can execute arbitrary code during deserialization. safetensors addresses these concerns by separating metadata from tensor values, avoiding arbitrary code execution altogether. Its philosophy is straightforward: what you load is what you get, and nothing more.
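To see why deserialization is dangerous, here is a minimal, harmless demonstration of the mechanism: pickle runs whatever callable a payload's __reduce__ hands it, simply as a side effect of loading the bytes. The hijacked function and calls list are illustrative stand-ins for an attacker's code, which in a real exploit would be something like os.system:

```python
import pickle

# Track side effects triggered purely by deserialization.
calls = []

def hijacked():
    calls.append("code ran during load")
    return {}

class Payload:
    # __reduce__ tells pickle to call an arbitrary function on load;
    # a real attack would invoke os.system or similar instead.
    def __reduce__(self):
        return (hijacked, ())

blob = pickle.dumps(Payload())
pickle.loads(blob)   # merely loading the bytes executes hijacked()
print(calls)         # ['code ran during load']
```

No method on Payload was ever called explicitly; deserialization alone was enough. This is exactly the class of attack a metadata-plus-raw-bytes format rules out by construction.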
Audit Insights
The audit, conducted by the security firm Trail of Bits, took a thorough look at how the safetensors format and its parser operate. Key findings include:
- No code execution pathways: At no point does the loading or parsing process allow external or embedded code execution.
- Proper bounds checks: Every read from the file is bounds-checked, preventing buffer overflows and out-of-bounds access.
- Clear separation of data and metadata: Fields are distinct, preventing metadata from introducing dangerous elements.
- Stable and consistent behavior: The format performs consistently across languages and platforms.
The results were clean, with only low-priority suggestions for documentation and fuzzing improvements.
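The bounds-check finding is worth dwelling on, because it is the property that makes a hostile file inert. The real parser is written in Rust, but the principle can be sketched in a few lines of Python; read_tensor_bytes is a hypothetical helper for illustration, not part of the library:

```python
def read_tensor_bytes(buf: bytes, begin: int, end: int) -> bytes:
    # Reject offsets that are inverted or fall outside the buffer,
    # mirroring the kind of checks the audit examined.
    if not (0 <= begin <= end <= len(buf)):
        raise ValueError(
            f"offsets [{begin}, {end}) escape buffer of {len(buf)} bytes"
        )
    return buf[begin:end]

data = bytes(range(16))
print(read_tensor_bytes(data, 4, 8))   # b'\x04\x05\x06\x07'
try:
    read_tensor_bytes(data, 8, 999)    # claimed end runs past the buffer
except ValueError as exc:
    print("rejected:", exc)
```

A file that lies about its offsets simply fails to load; it cannot trick the parser into reading or writing memory it does not own.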
How safetensors Works
The simplicity of safetensors is its strength. Here’s a breakdown:
- Header: The file starts with a JSON blob containing metadata: the name, shape, dtype, and byte offsets of each tensor.
- Raw Data: Follows immediately with no gaps or executable code.
- Loading: The parser reconstructs tensors from the metadata alone, executing no code from the file.
This ensures no surprises or backdoors. Performance-wise, safetensors typically loads faster than pickle-based formats, especially with large tensors, since the raw data can be memory-mapped and loaded lazily, combining safety with speed.
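The layout described above can be sketched in standard-library Python. This is a simplified model, assuming tensors are handed in as raw little-endian bytes; real files are produced by the library itself and may also carry a __metadata__ field in the header:

```python
import json
import struct

def write_safetensors(tensors: dict) -> bytes:
    """tensors: name -> (dtype, shape, raw little-endian bytes)."""
    header, body, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [offset, offset + len(raw)]}
        body += raw
        offset += len(raw)
    hjson = json.dumps(header).encode("utf-8")
    # Layout: 8-byte little-endian header size, JSON header, raw tensor bytes.
    return struct.pack("<Q", len(hjson)) + hjson + body

def read_safetensors(blob: bytes) -> dict:
    (hsize,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + hsize])
    body = blob[8 + hsize:]
    # Slice each tensor's bytes back out using only the declared offsets.
    return {name: body[b:e] for name, meta in header.items()
            for b, e in [meta["data_offsets"]]}

# Round-trip one float32 tensor stored as raw bytes.
raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
blob = write_safetensors({"w": ("F32", [2, 2], raw)})
print(read_safetensors(blob)["w"] == raw)   # True
```

Notice that parsing is nothing but struct unpacking, JSON decoding, and byte slicing: there is simply no step at which attacker-supplied code could run.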
Transitioning to safetensors
For developers already using safetensors, the audit is a relief. For everyone else, it's a green light to switch. Here's how:
Step 1: Install the Required Library
For Python, it’s straightforward:
```shell
pip install safetensors
```
This provides access to safe_open for reading and save_file for writing.
Step 2: Convert Existing Models
Convert a PyTorch model from .pt to .safetensors:
```python
from safetensors.torch import save_file
import torch

# Load the checkpoint on CPU; this must yield a dict of plain tensors.
state_dict = torch.load("model.pt", map_location="cpu")
save_file(state_dict, "model.safetensors")
```
Ensure the checkpoint is a state dict of plain tensors, as custom modules and other non-tensor objects won't transfer; that restriction is intentional.
Step 3: Load When Needed
Loading is simple:
```python
from safetensors.torch import safe_open

with safe_open("model.safetensors", framework="pt") as f:
    for key in f.keys():
        tensor = f.get_tensor(key)
This guarantees no unexpected code execution.
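A practical payoff of this design is lazy loading: because the header records each tensor's byte offsets, a reader can seek straight to the one tensor it needs instead of deserializing the whole file. A standard-library sketch of the idea, using an in-memory two-tensor file built by hand (the library's safe_open achieves the same effect efficiently, typically via memory mapping):

```python
import io
import json
import struct

# Build a tiny two-tensor file: 8-byte header size, JSON header, raw data.
a = struct.pack("<2f", 1.0, 2.0)
b = struct.pack("<2f", 3.0, 4.0)
header = json.dumps({
    "a": {"dtype": "F32", "shape": [2], "data_offsets": [0, len(a)]},
    "b": {"dtype": "F32", "shape": [2], "data_offsets": [len(a), len(a) + len(b)]},
}).encode("utf-8")
f = io.BytesIO(struct.pack("<Q", len(header)) + header + a + b)

# Lazy load: read only the header, then seek straight to the tensor we want.
(hsize,) = struct.unpack("<Q", f.read(8))
meta = json.loads(f.read(hsize))
begin, end = meta["b"]["data_offsets"]
f.seek(8 + hsize + begin)
chunk = f.read(end - begin)
print(chunk == b)   # True: tensor "a" was never read at all
```

For multi-gigabyte checkpoints, skipping tensors you don't need is a large part of why safetensors feels fast in practice.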
Step 4: Update Your Ecosystem
Update your tools, APIs, or training pipelines to prioritize .safetensors. Major libraries like Hugging Face Transformers and Diffusers already support it.
Publishing models in this format signals safety and ease of use.
Closing Thoughts
safetensors has proven its worth through rigorous auditing. It offers a reliable, fast, and secure alternative to pickle-based formats, delivering exactly what it promises. Its simplicity and security are its greatest assets, fostering trust and confidence among developers. The audit by Trail of Bits confirms that trust is well-placed, making safetensors a dependable choice for the future of AI model formats.