0.36 Nested Collections not stored in a separate Sub Index #1721

@dwlmt

Initial Checks

  • I have read and followed the docs and still think this is a bug

Description

I have a document structure that follows the docs, shown below. Until version 0.36, the nested "paths" field was stored in its own Qdrant collection: with a collection name of "channel_category" for the parent doc, the paths were stored in "channel_category__paths". With 0.36, the nested path vectors are instead held in the parent collection "channel_category" as separate records, and no separate sub-collection is used. Is this an intended change, or a bug in how nested data is stored?

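from typing import Optional

from docarray import BaseDoc, DocList
from docarray.typing import AnyTensor
from pydantic import Field

# similarity_space and dim_size are set elsewhere (values not shown here).
class MetaPathDoc(BaseDoc):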
    path_id: str
    level: int
    text: str
    embedding: Optional[AnyTensor] = Field(
        space=similarity_space, dim=dim_size)

class MetaCategoryDoc(BaseDoc):
    node_id: Optional[str]
    node_name: Optional[str]
    name: Optional[str]
    product_type_definitions: Optional[str]
    leaf: bool
    paths: Optional[DocList[MetaPathDoc]]
    embedding: Optional[AnyTensor] = Field(
        space=similarity_space, dim=dim_size)
    channel: str
    lang: str
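
To tell the two layouts apart on a given deployment, it may help to compare the collection names Qdrant reports against the names the sub-index layout would produce. The following is a minimal sketch, not DocArray's own logic: the `__` separator is only inferred from the "channel_category__paths" example above, and the helper names `expected_sub_collections` and `check_sub_collections` are hypothetical.

```python
from typing import Dict, Iterable, List

# Assumption: separator inferred from the "channel_category__paths" name
# reported above, not taken from DocArray's source.
SUB_INDEX_SEPARATOR = "__"


def expected_sub_collections(parent: str, nested_fields: Iterable[str]) -> List[str]:
    """Return the collection names a sub-index layout would be expected to create."""
    return [f"{parent}{SUB_INDEX_SEPARATOR}{field}" for field in nested_fields]


def check_sub_collections(
    parent: str, nested_fields: Iterable[str], existing: Iterable[str]
) -> Dict[str, bool]:
    """Map each expected sub-collection name to whether it currently exists."""
    existing_names = set(existing)
    return {
        name: name in existing_names
        for name in expected_sub_collections(parent, nested_fields)
    }


if __name__ == "__main__":
    # The list of existing names is hard-coded here; against a live server it
    # could be built from qdrant_client, e.g.
    #   [c.name for c in client.get_collections().collections]
    existing = ["channel_category"]
    print(check_sub_collections("channel_category", ["paths"], existing))
```

A result of `False` for "channel_category__paths" would match the behaviour described for 0.36, while `True` would match the earlier behaviour.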

Example Code

I'm loading documents into Qdrant via a Jina executor like this:

import os
import sys

import more_itertools
from docarray import DocList
from docarray.index import QdrantDocumentIndex
from utils.docs import MetaCategoryDoc
from jina import Executor, requests
from jina.logging.logger import JinaLogger
from qdrant_client.http import models

QDRANT_LOCATION = os.getenv('QDRANT_LOCATION', "http://localhost:6333")
QDRANT_API_KEY = os.getenv('QDRANT_API_KEY', None)

class MetaChannelCategoryIndexingExec(Executor):
    def __init__(self,
                 collection_name: str = "channel_category",
                 batch_size: int = 64,
                 qdrant_location: str = QDRANT_LOCATION,
                 qdrant_api_key: str = QDRANT_API_KEY,
                 *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.logger = JinaLogger('meta_channel_category_indexing')


        db_config = QdrantDocumentIndex.DBConfig(
            location=qdrant_location,
            api_key=qdrant_api_key,
            collection_name=collection_name,
            quantization_config=models.ScalarQuantization(
                scalar=models.ScalarQuantizationConfig(
                    type=models.ScalarType.INT8,
                    quantile=0.99,
                    always_ram=False,
                )
            ),
            optimizers_config=models.OptimizersConfigDiff(
                memmap_threshold=20000, indexing_threshold=20000),
            on_disk_payload=True,
            hnsw_config=models.HnswConfigDiff(
                m=16, ef_construct=100, on_disk=True),
            wal_config=models.WalConfigDiff(
                wal_capacity_mb=64, wal_segments_ahead=1),
            prefer_grpc=False)

        self.doc_index = QdrantDocumentIndex[MetaCategoryDoc](db_config)
        self.batch_size = batch_size

    @requests(
        request_schema=DocList[MetaCategoryDoc],
        response_schema=DocList[MetaCategoryDoc]
    )
    def index_metadata(self, docs, **kwargs):
        """Index the incoming category documents into the vector DB in batches."""
        for doc_batch in more_itertools.chunked(docs, self.batch_size):
            # Indexing the documents
            self.doc_index.index(
                doc_batch
            )
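
A second way to confirm which layout is in effect is to compare record counts after indexing a known batch. The sketch below computes the counts each layout would be expected to produce. It assumes every nested path becomes exactly one record and that a missing `paths` list counts as zero. The function name `expected_point_counts` and the layout labels are hypothetical.

```python
from typing import Dict, Sequence


def expected_point_counts(
    paths_per_doc: Sequence[int],
    parent: str = "channel_category",
    nested_field: str = "paths",
) -> Dict[str, Dict[str, int]]:
    """Expected record counts per collection under each storage layout.

    paths_per_doc holds the number of nested path docs for each parent doc.
    """
    n_parents = len(paths_per_doc)
    n_nested = sum(paths_per_doc)
    # Assumption: "__" separator, as in the "channel_category__paths" name.
    sub_collection = f"{parent}__{nested_field}"
    return {
        # Pre-0.36 behaviour as described: nested docs live in a sub-collection.
        "sub_index": {parent: n_parents, sub_collection: n_nested},
        # 0.36 behaviour as described: nested docs are records in the parent.
        "flat": {parent: n_parents + n_nested},
    }


if __name__ == "__main__":
    # Three parent docs carrying 2, 0 and 3 nested paths respectively.
    print(expected_point_counts([2, 0, 3]))
```

The actual numbers could then be read from Qdrant (for example with `QdrantClient.count(collection_name=...)`) and matched against one of the two expectations.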

Python, Pydantic & OS Version

0.36.0

Affected Components

Status

Done
