[Feature Request]: Cosdata Versioning System For TF-IDF

May 14, 2025 by ADMIN

Introduction

The current VersionedVec approach in Cosdata does not scale well as the number of transactions grows exponentially. The problem is most visible when users rely heavily on synchronous transactions (Synchronous Txns). This feature request proposes a new versioning system for TF-IDF that addresses the issue with a more compact and scalable representation.

Problem Statement

The current VersionedVec approach has several limitations that make it unsuitable for large-scale transactions. These limitations include:

Scalability: The current approach is not designed to handle a large number of transactions, which can lead to performance issues and decreased efficiency.
Complexity: The current approach requires complex logic to handle versioning, which can make it difficult to maintain and update.

Solution

To address these limitations, we propose a new versioning system for TF-IDF that uses a more efficient and scalable approach. The proposed system consists of two main components:

Versioning Tag: A single byte that indicates whether the versioning is single or multiple.
Versioning Data: A structure that contains the versioning information, including the version hash, max version, and versions offset.

The tag keeps the layout compact: in the single case only one version hash is stored, and the max version, versions offset, and per-item version array are needed only in the multiple case. The two components are described in the sections below.

Versioning Tag

The versioning tag is a single byte that indicates whether the versioning is single or multiple. The tag is used to determine which versioning data to use.

Single: If the versioning tag is 0, it indicates that the versioning is single. In this case, the versioning data consists of a single version hash.
Multiple: If the versioning tag is 1, it indicates that the versioning is multiple. In this case, the versioning data consists of a max version, versions offset, and an optional per-item version array.
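The tag values above can be sketched as a small Rust enum. This is only an illustration: the name `VersioningTag` and the helpers `to_byte` and `from_byte` are hypothetical and not part of Cosdata. Only the values 0 and 1 come from the description above.

```rust
/// Hypothetical one-byte tag; only the values 0 and 1 come from the proposal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum VersioningTag {
    Single = 0,
    Multiple = 1,
}

impl VersioningTag {
    /// Returns the byte that would be written to the file.
    fn to_byte(self) -> u8 {
        self as u8
    }

    /// Parses a tag byte; any value other than 0 or 1 is rejected.
    fn from_byte(byte: u8) -> Option<Self> {
        match byte {
            0 => Some(Self::Single),
            1 => Some(Self::Multiple),
            _ => None,
        }
    }
}
```

Rejecting unknown bytes, rather than defaulting to one variant, is a design choice of this sketch. It makes a corrupted tag fail at read time instead of being misread as valid versioning data.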

Versioning Data

The versioning data is a structure that contains the versioning information. The structure consists of the following fields:

Version Hash: A hash value that represents the version of the data.
Max Version: The maximum version of the data.
Versions Offset: The offset to the per-item version array.
Per-Item Version Array: An optional array of version hashes that represents the version of each item.
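One possible on-disk encoding of these fields is sketched below. The proposal does not fix field widths or byte order, so this sketch assumes that `VersionHash` and `FileOffset` are both 32-bit integers written little-endian. It also assumes that the per-item version array lives at `versions_offset` and is not part of the header. The names `VersioningHeader`, `encode`, `decode`, and `read_u32` are hypothetical.

```rust
// Assumptions for this sketch only: both types are 32-bit, little-endian on disk.
type VersionHash = u32;
type FileOffset = u32;

/// Hypothetical header: the tag byte followed by the fields for that variant.
#[derive(Debug, PartialEq, Eq)]
enum VersioningHeader {
    Single(VersionHash),
    Multiple {
        max_version: VersionHash,
        versions_offset: FileOffset,
    },
}

impl VersioningHeader {
    /// Appends the tag byte and the variant's fields to `out`.
    fn encode(&self, out: &mut Vec<u8>) {
        match self {
            Self::Single(hash) => {
                out.push(0);
                out.extend_from_slice(&hash.to_le_bytes());
            }
            Self::Multiple { max_version, versions_offset } => {
                out.push(1);
                out.extend_from_slice(&max_version.to_le_bytes());
                out.extend_from_slice(&versions_offset.to_le_bytes());
            }
        }
    }

    /// Decodes a header and returns it with the number of bytes consumed.
    /// Returns `None` on an unknown tag or a truncated buffer.
    fn decode(buf: &[u8]) -> Option<(Self, usize)> {
        let (&tag, rest) = buf.split_first()?;
        match tag {
            0 => Some((Self::Single(read_u32(rest, 0)?), 5)),
            1 => {
                let max_version = read_u32(rest, 0)?;
                let versions_offset = read_u32(rest, 4)?;
                Some((Self::Multiple { max_version, versions_offset }, 9))
            }
            _ => None,
        }
    }
}

/// Reads a little-endian u32 starting at `at`, if four bytes are available.
fn read_u32(buf: &[u8], at: usize) -> Option<u32> {
    let s = buf.get(at..at + 4)?;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]))
}
```

Under these assumptions a single-version header takes 5 bytes and a multiple-version header takes 9 bytes, regardless of how many items the list holds.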

Indexing & Query-Time Behavior

The versioning layout is intended to reduce work in two phases:

Indexing / Serialization Time: The indexing scheme minimizes the amount of data that needs to be written to the file. Only the tag and the fields for the selected case are written.
Query-Time Filtering: The filtering scheme minimizes the amount of data that needs to be loaded into memory. The per-item version array is optional, so it only needs to be loaded when it is actually required.
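The query-time behavior can be illustrated with a short sketch. The proposal does not spell out the visibility rule, so the code below makes its own assumptions: versions are ordered 32-bit integers, an item is visible to a query when its version is less than or equal to the query version, and `max_version` is the largest per-item version. The function `visible_items` and the `load_details` callback are hypothetical. The `offset` field is omitted for brevity, and the array is typed as `VersionHash` for simplicity.

```rust
// Assumption for this sketch only: versions are ordered 32-bit integers.
type VersionHash = u32;

enum Versioning {
    Single(VersionHash),
    Multiple {
        max_version: VersionHash,
        // `None` means the per-item array has not been loaded from disk yet.
        version_details: Option<Box<[VersionHash]>>,
    },
}

/// Returns the items visible at `query`. `load_details` stands in for reading
/// the per-item array from the file and is called only when it is needed.
fn visible_items<'a, T>(
    list: &'a [T],
    versioning: &Versioning,
    query: VersionHash,
    load_details: impl FnOnce() -> Box<[VersionHash]>,
) -> Vec<&'a T> {
    match versioning {
        // One hash covers the whole list: all items or none.
        Versioning::Single(v) => {
            if *v <= query { list.iter().collect() } else { Vec::new() }
        }
        Versioning::Multiple { max_version, version_details } => {
            // Fast path: every item is at or below the query version.
            if *max_version <= query {
                return list.iter().collect();
            }
            // Slow path: consult the per-item array, loading it if absent.
            let loaded;
            let details: &[VersionHash] = match version_details {
                Some(d) => &d[..],
                None => {
                    loaded = load_details();
                    &loaded[..]
                }
            };
            let visible: Vec<&'a T> = list
                .iter()
                .zip(details.iter())
                .filter(|(_, v)| **v <= query)
                .map(|(item, _)| item)
                .collect();
            visible
        }
    }
}
```

With this rule, a query at or above `max_version` never touches the per-item array, which is one way the array could stay out of memory for most reads.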

Implementation

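The proposed versioning system can be expressed with the following Rust types. VersionHash, FileOffset, and Hash are existing Cosdata types and are not defined here.

use std::sync::{Arc, RwLock};

// Versioning information attached to one list.
pub enum Versioning {
    // Tag 0: a single version hash.
    Single(VersionHash),
    // Tag 1: max version, offset to the per-item array, and the optional array itself.
    Multiple {
        max_version: VersionHash,
        offset: FileOffset,
        version_details: Option<Box<[Hash]>>,
    },
}

// One node in a chain of versioned lists.
pub struct VersionedVec<T> {
    // File offset of this node once it has been serialized, if any.
    pub serialized_at: Arc<RwLock<Option<FileOffset>>>,
    pub versioning: Versioning,
    pub list: Vec<T>,
    // Next node in the chain, if any.
    pub next: Option<Box<VersionedVec<T>>>,
}

The Single variant carries one version hash. The Multiple variant carries the max version, the versions offset, and the optional per-item version array, matching the fields described under Versioning Data.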
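The `next` field suggests that VersionedVec nodes form a singly linked chain. Assuming that reading, the sketch below walks a chain and yields every item in order. `Node` is a simplified, hypothetical stand-in for VersionedVec that keeps only `list` and `next`, and `chain_items` is not an existing Cosdata function.

```rust
/// Simplified stand-in for VersionedVec<T>: only the fields needed to walk the chain.
struct Node<T> {
    list: Vec<T>,
    next: Option<Box<Node<T>>>,
}

/// Iterates over the items of every node in the chain, head first.
fn chain_items<'a, T>(head: &'a Node<T>) -> impl Iterator<Item = &'a T> {
    // `successors` follows `next` links until one of them is `None`.
    std::iter::successors(Some(head), |node| node.next.as_deref())
        .flat_map(|node| node.list.iter())
}
```

A real traversal would presumably also check each node's `versioning` field before yielding its items. That step is left out here to keep the chain walk itself visible.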

Alternatives

We have considered the following alternatives:

Current VersionedVec approach: The current approach is not scalable and requires complex logic to handle versioning.
Other versioning systems: We have considered other versioning systems, but they are either too complex or too inefficient.

Q&A

Q: What is the current `VersionedVec` approach in Cosdata?

A: The current VersionedVec approach in Cosdata does not scale well as the number of transactions grows exponentially. The problem is most visible when users rely heavily on synchronous transactions (Synchronous Txns).

Q: What are the limitations of the current `VersionedVec` approach?

A: The current approach has several limitations, including:

Scalability: The current approach is not designed to handle a large number of transactions, which can lead to performance issues and decreased efficiency.
Complexity: The current approach requires complex logic to handle versioning, which can make it difficult to maintain and update.

Q: What is the proposed versioning system for TF-IDF?

A: The proposed versioning system for TF-IDF uses a more efficient and scalable approach. It consists of two main components:

Versioning Tag: A single byte that indicates whether the versioning is single or multiple.
Versioning Data: A structure that contains the versioning information, including the version hash, max version, and versions offset.

Q: How does the proposed versioning system handle versioning?

A: The proposed versioning system uses a simple and intuitive structure that makes it easy to understand and maintain. It uses a versioning tag to determine which versioning data to use.

Single: If the versioning tag is 0, it indicates that the versioning is single. In this case, the versioning data consists of a single version hash.
Multiple: If the versioning tag is 1, it indicates that the versioning is multiple. In this case, the versioning data consists of a max version, versions offset, and an optional per-item version array.

Q: How does the proposed versioning system handle indexing and query-time filtering?

A: The proposed versioning system uses a simple and efficient indexing scheme that minimizes the amount of data that needs to be written to the file. It also uses a simple and efficient query-time filtering scheme that minimizes the amount of data that needs to be loaded into memory.

Q: What are the benefits of the proposed versioning system?

A: The proposed versioning system has several benefits, including:

Improved scalability: The proposed system is designed to handle a large number of transactions, making it more efficient and scalable.
Simplified logic: The proposed system uses a simple and intuitive structure that makes it easy to understand and maintain.
Improved performance: The proposed system uses a simple and efficient indexing scheme and query-time filtering scheme, making it faster and more efficient.

Q: How can the proposed versioning system be implemented?

A: The proposed versioning system can be expressed with the following Rust types (VersionHash, FileOffset, and Hash are existing Cosdata types):

use std::sync::{Arc, RwLock};

pub enum Versioning {
    Single(VersionHash),
    Multiple {
        max_version: VersionHash,
        offset: FileOffset,
        version_details: Option<Box<[Hash]>>,
    },
}

pub struct VersionedVec<T> {
    // File offset of this node once serialized, as in the Implementation section.
    pub serialized_at: Arc<RwLock<Option<FileOffset>>>,
    pub versioning: Versioning,
    pub list: Vec<T>,
    pub next: Option<Box<VersionedVec<T>>>,
}