Let's first create a 28k x 15k sparse TileDBArray object:
library(HDF5Array)
library(ExperimentHub)
hub <- ExperimentHub()
brain_path <- hub[["EH1040"]] # 1.3 Million Brain Cell Dataset
brain <- HDF5Array(brain_path, "counts")
library(TileDBArray)
path <- tempfile()
## Takes about 30-40s (resulting dataset is 154M on disk):
B <- writeTileDBArray(brain[ , 1:15000], sparse=TRUE, path=path)
dim(B)
# [1] 27998 15000
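As a sanity check, one can confirm that the backend registered the data as sparse (is_sparse() is the S4Arrays/DelayedArray generic; I'd expect it to return TRUE here given the sparse=TRUE used above):

```r
## Check that the TileDBArray object advertises itself as sparse:
is_sparse(B)
```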
Extracting a random subset of 8000 x 5000 values uses more than 26 GB of memory on my laptop (Ubuntu Linux 24.04):
set.seed(111)
index <- list(sample(nrow(B), 8000), sample(ncol(B), 5000))
m <- extract_array(B, index) # 'top' command reports > 26 GB of memory usage
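For what it's worth, the peak usage can also be cross-checked from within R via gc(), though this is only a rough sketch: gc() tracks R's own heap, so memory allocated by the TileDB C library won't show up there, which is why 'top' is the better indicator here:

```r
## Reset R's high-water-mark bookkeeping, run the extraction, then
## inspect the "max used" column reported by gc():
gc(reset=TRUE)
m <- extract_array(B, index)
gc()  # "max used (Mb)" shows the peak usage of R's own heap only
```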
That's A LOT!
Trying to extract anything slightly bigger exhausts the 32 GB of RAM on my laptop and kills my R session (Linux OOM killer in action).
For comparison, loading the full array in memory as an ordinary array (dense representation) consumes only about 5.1 GB:
b <- as.array(B) # 'top' command reports about 5.1 GB of memory usage
And from there I can extract the random subset very efficiently:
m2 <- extract_array(b, index)
but this, of course, defeats the purpose of using a sparse representation in the first place.
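One would hope that going through the sparse extraction path would avoid the dense blowup. I haven't verified whether the TileDBArray backend provides an efficient method for it, but if it does, something along these lines should keep the subset sparse until it is actually needed in dense form (extract_sparse_array() is the SparseArray generic; whether it helps here is an assumption on my part):

```r
## Extract the subset as a sparse object instead of an ordinary array:
sm <- extract_sparse_array(B, index)
## ...and densify only if/when actually needed:
m3 <- as.array(sm)
```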
H.
sessionInfo():
R version 4.6.0 alpha (2026-04-05 r89793)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /home/hpages/R/R-4.6.r89793/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.6.r89793/lib/libRlapack.so; LAPACK version 3.12.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] RcppSpdlog_0.0.28 TileDBArray_1.21.1
[3] TENxBrainData_1.31.0 SingleCellExperiment_1.33.2
[5] SummarizedExperiment_1.41.1 Biobase_2.71.0
[7] GenomicRanges_1.63.2 Seqinfo_1.1.0
[9] ExperimentHub_3.1.0 AnnotationHub_4.1.0
[11] BiocFileCache_3.1.0 dbplyr_2.5.2
[13] HDF5Array_1.39.1 h5mread_1.3.3
[15] rhdf5_2.55.16 DelayedArray_0.37.1
[17] SparseArray_1.11.13 S4Arrays_1.11.1
[19] IRanges_2.45.0 abind_1.4-8
[21] S4Vectors_0.49.1 MatrixGenerics_1.23.0
[23] matrixStats_1.5.0 BiocGenerics_0.57.0
[25] generics_0.1.4 Matrix_1.7-5
loaded via a namespace (and not attached):
[1] KEGGREST_1.51.1 httr2_1.2.2 lattice_0.22-9
[4] rhdf5filters_1.23.3 vctrs_0.7.2 tools_4.6.0
[7] curl_7.0.0 tibble_3.3.1 AnnotationDbi_1.73.1
[10] RSQLite_2.4.6 blob_1.3.0 pkgconfig_2.0.3
[13] data.table_1.18.2.1 lifecycle_1.0.5 compiler_4.6.0
[16] Biostrings_2.79.5 nanoarrow_0.8.0 yaml_2.3.12
[19] pillar_1.11.1 crayon_1.5.3 cachem_1.1.0
[22] RcppCCTZ_0.2.14 tiledb_0.33.0 tidyselect_1.2.1
[25] dplyr_1.2.1 purrr_1.2.1 BiocVersion_3.23.1
[28] fastmap_1.2.0 grid_4.6.0 cli_3.6.6
[31] magrittr_2.0.5 spdl_0.0.5 withr_3.0.2
[34] filelock_1.0.3 rappdirs_0.3.4 bit64_4.6.0-1
[37] nanotime_0.3.13 XVector_0.51.0 httr_1.4.8
[40] bit_4.6.0 zoo_1.8-15 png_0.1-9
[43] memoise_2.0.1 rlang_1.2.0 Rcpp_1.1.1
[46] glue_1.8.0 DBI_1.3.0 BiocManager_1.30.27
[49] R6_2.6.1 Rhdf5lib_1.33.6