Skip to content

Apache Arrow-compatible space-efficient "tape" class in pure Rust to be used with StringZilla for GPU and disk transfers of variable length strings

License

Notifications You must be signed in to change notification settings

ashvardanian/StringTape

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

StringTape

StringTape banner

A memory-efficient collection for variable-length strings, co-located on a contiguous "tape".

  • Convertible to Apache Arrow String/LargeString & Binary/LargeBinary arrays
  • Compatible with UTF-8 & binary strings in Rust via CharsTape and BytesTape
  • Usable in no_std and with custom allocators for GPU & embedded use cases
  • Sliceable into zero-copy borrow-checked views with [i..n] range syntax

Quick Start

use stringtape::{CharsTapeI32, BytesTapeI32, StringTapeError};

// Create a new CharsTape with 32-bit offsets
let mut tape = CharsTapeI32::new();
tape.push("hello")?;
tape.push("world")?;

assert_eq!(tape.len(), 2);
assert_eq!(&tape[0], "hello");
assert_eq!(tape.get(1), Some("world"));

// Iterate over strings
for s in &tape {
    println!("{}", s);
}

// Build from iterator
let tape2: CharsTapeI32 = ["a", "b", "c"].into_iter().collect();
assert_eq!(tape2.len(), 3);

// Binary data with BytesTape
let mut bytes = BytesTapeI32::new();
bytes.push(b"hi")?;
assert_eq!(&bytes[0], b"hi");

# Ok::<(), StringTapeError>(())

Memory Layout

CharsTape and BytesTape use the same memory layout as Apache Arrow string and binary arrays:

Data buffer:    [h,e,l,l,o,w,o,r,l,d]
Offset buffer:  [0, 5, 10]

API Overview

Basic Operations

use stringtape::CharsTapeI32;

let mut tape = CharsTapeI32::new();
tape.push("hello")?;                    // Append one string
tape.extend(["world", "foo"])?;         // Append an array
assert_eq!(&tape[0], "hello");          // Direct indexing
assert_eq!(tape.get(1), Some("world")); // Safe access

for s in &tape { // Iterate
    println!("{}", s);
}

// Construct from an iterator
let tape2: CharsTapeI32 = ["a", "b", "c"].into_iter().collect();

BytesTape provides the same interface for arbitrary byte slices.

Views and Slicing

let view = tape.view();              // View entire tape
let subview = tape.subview(1, 3)?;   // Items [1, 3)
let nested = subview.subview(0, 1)?; // Nested subviews
let raw_bytes = &tape.view()[1..3];  // Raw byte slice

// Views have same API as tapes
assert_eq!(subview.len(), 2);
assert_eq!(&subview[0], "world");

Memory Management

// Pre-allocate capacity
let tape = CharsTapeI32::with_capacity(1024, 100)?; // 1KB data, 100 strings

// Monitor usage
println!("Items: {}, Data: {} bytes", tape.len(), tape.data_len());

// Modify
tape.clear();           // Remove all items
tape.truncate(5);       // Keep first 5 items

// Custom allocators
use allocator_api2::alloc::Global;
let tape = CharsTape::new_in(Global);

Apache Arrow Interop

True zero-copy conversion to/from Arrow arrays:

// CharsTape → Arrow (zero-copy)
let (data_slice, offsets_slice) = tape.arrow_slices();
let data_buffer = Buffer::from_slice_ref(data_slice);
let offsets_buffer = OffsetBuffer::new(ScalarBuffer::new(
    Buffer::from_slice_ref(offsets_slice), 0, offsets_slice.len()
));
let arrow_array = StringArray::new(offsets_buffer, data_buffer, None);

// Arrow → CharsTapeView (zero-copy)
let view = unsafe {
    CharsTapeViewI32::from_raw_parts(
        arrow_array.values(),
        arrow_array.offsets().as_ref(),
    )
};

BytesTape works the same way with Arrow BinaryArray/LargeBinaryArray types.

Unsigned Offsets

In addition to the signed offsets (i32/i64 via CharsTapeI32/CharsTapeI64), the library also supports unsigned offsets (u32/u64) when you prefer non-negative indexing:

  • CharsTapeU32, CharsTapeU64
  • BytesTapeU32, BytesTapeU64
  • CharsTapeViewU32<'_>, CharsTapeViewU64<'_>
  • BytesTapeViewU32<'_>, BytesTapeViewU64<'_>

Note, that unsigned offsets cannot be converted to/from Arrow arrays.

no_std Support

StringTape can be used in no_std environments:

[dependencies]
stringtape = { version = "2", default-features = false }

In no_std mode:

  • All functionality is preserved
  • Requires alloc for dynamic allocation
  • Error types implement Display but not std::error::Error

Testing

Run tests for both std and no_std configurations:

cargo test                          # Test with std (default)
cargo test --doc                    # Test documentation examples
cargo test --no-default-features    # Test without std
cargo test --all-features           # Test with all features enabled

About

Apache Arrow-compatible space-efficient "tape" class in pure Rust to be used with StringZilla for GPU and disk transfers of variable length strings

Topics

Resources

License

Stars

Watchers

Forks

Languages