BUG: Basic Slicing of StringDType arrays faulty. #29279

@FabricioArendTorres

Describe the issue:

Hi,

I think there is an issue with indexing NumPy arrays of dtype StringDType.
The behavior is inconsistent with regular NumPy indexing, and also with e.g. "<U30" fixed-length string arrays.

Below you can find the self-contained example and corresponding expected and obtained outputs.

Note that this only occurs when indexing with a list, e.g. [[5]]; it doesn't occur when indexing with a plain integer, e.g. [5].
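For readers less familiar with the distinction, here is a minimal sketch of how the two forms differ in general. It uses a small "U30" array with made-up values (not the arrays from this report) purely to show the indexing semantics: an integer index is basic indexing and returns a view with one dimension fewer, while a list index is advanced ("fancy") indexing and returns a copy that keeps the leading dimension.

```python
import numpy as np

# Illustrative data only; these values are not taken from the report.
arr = np.array([["a"], ["bb"], ["ccc"]], dtype="U30")

row_basic = arr[1]    # integer index: basic indexing, returns a view
row_fancy = arr[[1]]  # list index: advanced indexing, returns a copy

print(row_basic.shape, np.shares_memory(arr, row_basic))
print(row_fancy.shape, np.shares_memory(arr, row_fancy))
```

Since only the list form misbehaves here, the problem appears to be specific to the advanced-indexing path.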

Reproduce the code example:

# /// script
# requires-python = "==3.11.11"
# dependencies = [
#   "numpy==2.2.2",
# ]
# ///


import numpy as np


def main():
    STRINGDTYPE_Array = np.array(
        [
            ["AAAAAAAAAAAAAAAAA"],
            ["BBBBBBBBBBBBBBBBBBBBBBBBBBBBB"],
            ["CCCCCCCCCCCCCCCCC"],
            ["DDDDDDDDDDDDDDDDD"],
        ],
        dtype=np.dtypes.StringDType,
    )

    U30_Array = np.array(
        [
            ["AAAAAAAAAAAAAAAAA"],
            ["BBBBBBBBBBBBBBBBBBBBBBBBBBBBB"],
            ["CCCCCCCCCCCCCCCCC"],
            ["DDDDDDDDDDDDDDDDD"],
        ],
        dtype="U30",
    )

    expected = []
    for i in range(U30_Array.shape[0]):
        expected.append(U30_Array[[i]])

    obtained = []
    for i in range(STRINGDTYPE_Array.shape[0]):
        obtained.append(STRINGDTYPE_Array[[i]])

    print(f"{expected=}")
    print(f"{obtained=}")


if __name__ == "__main__":
    main()

Error message:

expected=[array([['AAAAAAAAAAAAAAAAA']], dtype='<U30'), array([['BBBBBBBBBBBBBBBBBBBBBBBBBBBBB']], dtype='<U30'), array([['CCCCCCCCCCCCCCCCC']], dtype='<U30'), array([['DDDDDDDDDDDDDDDDD']], dtype='<U30')]
obtained=[array([['AAAAAAAAAAAAAAAAA']], dtype=StringDType()), array([['AAAAAAAAAAAAAAAAA\x1dBBBBBBBBBBB']], dtype=StringDType()), array([['AAAAAAAAAAAAAAAAA']], dtype=StringDType()), array([['AAAAAAAAAAAAAAAAA']], dtype=StringDType())]

Python and NumPy Versions:

print(numpy.__version__); print(sys.version)
2.2.2
3.11.11 | packaged by conda-forge | (main, Dec 5 2024, 14:17:24) [GCC 13.3.0]

Runtime Environment:

[{'numpy_version': '2.2.2',
  'python': '3.11.11 | packaged by conda-forge | (main, Dec  5 2024, 14:17:24) '
            '[GCC 13.3.0]',
  'uname': uname_result(system='Linux', node='963b179a5b20', release='5.15.0-135-generic', version='#146-Ubuntu SMP Sat Feb 15 17:06:22 UTC 2025', machine='x86_64')},
 {'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2'],
                      'not_found': ['AVX512F',
                                    'AVX512CD',
                                    'AVX512_KNL',
                                    'AVX512_KNM',
                                    'AVX512_SKX',
                                    'AVX512_CLX',
                                    'AVX512_CNL',
                                    'AVX512_ICL',
                                    'AVX512_SPR']}},
 {'architecture': 'Haswell',
  'filepath': '/opt/conda/lib/libopenblasp-r0.3.28.so',
  'internal_api': 'openblas',
  'num_threads': 20,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.28'}]

Context for the issue:

I suspect that this inconsistent indexing may cause issues in downstream libraries, e.g. zarr-developers/zarr-python#3174.
