Skip to content

Conversation

@youknowone
Copy link
Member

@youknowone youknowone commented Jul 29, 2025

cc @ever0de

I am going to add a new validation step for PyUtf8Str. This is useful for sqlite3

Summary by CodeRabbit

  • New Features

    • Introduced a distinct UTF-8 string type for Python strings, enabling stricter UTF-8 validation and handling.
    • Added new methods for safe downcasting and UTF-8 specific string representations in object and protocol handling.
    • Expanded public interfaces to include UTF-8 string types.
    • Added a macro for retrieving UTF-8 validated identifiers.
  • Bug Fixes

    • Improved error handling for string formatting and output involving invalid UTF-8 or surrogate characters.
    • Enhanced string comparison and attribute handling to support both UTF-8 and WTF-8 encoded strings.
  • Refactor

    • Updated numerous function signatures and internal logic to require or return UTF-8 validated string references, especially in I/O, database, and standard library modules.
    • Simplified and standardized string conversions and representations throughout the system.
    • Refined downcasting mechanisms with VM-based validation for safer type conversions.
  • Style

    • Streamlined code for string formatting, representation, and output for consistency and maintainability.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jul 29, 2025

Walkthrough

This change introduces a new UTF-8 validated string type (PyUtf8Str and PyUtf8StrRef) and systematically replaces generic Python string references (PyStrRef) with these types in public APIs and internal logic, especially for SQL, I/O, and standard library modules. It updates downcasting, formatting, and representation methods to enforce or utilize UTF-8 invariants, and adjusts related method signatures and trait implementations accordingly.

Changes

Cohort / File(s) Change Summary
SQLite UTF-8 Enforcement
stdlib/src/sqlite.rs
All SQL-related string parameters and variables changed from PyStrRef to PyUtf8StrRef; explicit UTF-8 validation removed; helper and constructor signatures updated; conversions now use UTF-8 guarantees.
UTF-8 String Type Introduction
vm/src/builtins/str.rs
Introduces PyUtf8Str as a transparent wrapper for UTF-8 validated strings; adds conversions, methods, and invariants; updates related string methods and type aliases.
Public Re-exports for UTF-8 Strings
vm/src/builtins/mod.rs
Publicly re-exports PyUtf8Str and PyUtf8StrRef from pystr module.
Downcasting Enhancements
vm/src/object/core.rs, vm/src/object/payload.rs
Adds VM-based safe downcasting methods to object core and payload traits, with error reporting and type checks.
Standard Library: UTF-8 String Adoption
vm/src/stdlib/codecs.rs, vm/src/stdlib/io.rs, vm/src/stdlib/posix.rs, vm/src/stdlib/pwd.rs, vm/src/stdlib/time.rs
Updates function signatures and logic to use PyUtf8StrRef for encoding, mode, and name parameters; removes redundant UTF-8 conversions.
Formatting and Representation Adjustments
vm/src/format.rs, vm/src/types/slot.rs, vm/src/builtins/slice.rs, vm/src/builtins/namespace.rs
Updates string conversion and formatting logic to use WTF-8 or UTF-8 views; adjusts trait and function signatures for string returns.
Builtins and Protocols: UTF-8 Variants
vm/src/protocol/object.rs, vm/src/vm/vm_ops.rs
Adds repr_utf8 and str_utf8 methods; updates existing methods to return PyRef<PyStr>; adds format_utf8 method to VM.
Error Handling and Warnings
vm/src/exceptions.rs, vm/src/stdlib/sys.rs, vm/src/warn.rs, vm/src/vm/mod.rs
Refactors error message formatting and output to handle WTF-8 and UTF-8 strings; adjusts fallback logic for invalid UTF-8.
Miscellaneous String Handling Updates
vm/src/stdlib/ctypes/structure.rs, vm/src/stdlib/operator.rs, vm/src/builtins/super.rs, vm/src/builtins/type.rs, vm/src/utils.rs, vm/src/builtins/builtin_func.rs
Updates string conversion and comparison logic to use bytes or WTF-8; adds ToCString for PyUtf8Str; minor refactors for clarity and safety.
Macros
vm/src/macros.rs
Adds a new macro identifier_utf8 for unchecked downcasting to PyUtf8Str for known ASCII identifiers.
TryFromObject Implementation
vm/src/convert/try_from.rs
Streamlines TryFromObject implementation for PyRef<T> by using VM-based downcast checks and early returns for errors.

Sequence Diagram(s)

sequenceDiagram
    participant VM
    participant UserCode
    participant PyStr
    participant PyUtf8Str

    UserCode->>VM: Call function expecting string (e.g., execute SQL)
    VM->>PyUtf8Str: Accepts PyUtf8StrRef (UTF-8 validated)
    PyUtf8Str-->>VM: Provides &str (guaranteed valid UTF-8)
    VM->>UserCode: Executes logic without extra UTF-8 validation

    Note over VM,PyUtf8Str: If PyStr received, VM calls try_into_utf8 for validation
    PyStr->>PyUtf8Str: try_into_utf8(vm)
    alt Valid UTF-8
        PyUtf8Str-->>VM: Success
    else Invalid UTF-8
        PyStr-->>VM: Error (UnicodeEncodeError)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • RustPython/RustPython#5969: Introduces the initial PyUtf8Str wrapper and surrogate validation in sqlite.rs, directly related to this PR's UTF-8 enforcement in the same module.
  • RustPython/RustPython#5981: Adds null character checks in sqlite.rs, related by focusing on SQL string input validation in the same context.
  • RustPython/RustPython#5962: Modifies str_to_ptr_len in sqlite.rs for UTF-8 validation, overlapping with this PR's changes to validation and string handling in the same area.

Poem

A stringy change, a bunny’s delight,
UTF-8 now shines so bright!
No more checks for every char,
PyUtf8Str takes us far.
SQL and IO, all in sync,
Bugs and surrogates on the brink.
🐇 Hopping forward, code in link!

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
vm/src/builtins/str.rs (1)

1510-1513: Consider using pointer casts instead of transmute

While the transmute is safe due to repr(transparent), using pointer casts is more explicit about the operation being performed.

 pub fn try_into_utf8(self, vm: &VirtualMachine) -> PyResult<PyRef<PyUtf8Str>> {
     self.ensure_valid_utf8(vm)?;
-    Ok(unsafe { mem::transmute::<PyRef<PyStr>, PyRef<PyUtf8Str>>(self) })
+    Ok(unsafe { PyRef::map_unchecked(self, |s| &*(s as *const PyStr as *const PyUtf8Str)) })
 }
📜 Review details

Configuration used: .coderabbit.yml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f402dee and 445c3bd.

📒 Files selected for processing (26)
  • stdlib/src/sqlite.rs (13 hunks)
  • vm/src/builtins/builtin_func.rs (2 hunks)
  • vm/src/builtins/mod.rs (1 hunks)
  • vm/src/builtins/namespace.rs (1 hunks)
  • vm/src/builtins/slice.rs (1 hunks)
  • vm/src/builtins/str.rs (11 hunks)
  • vm/src/builtins/super.rs (2 hunks)
  • vm/src/builtins/type.rs (1 hunks)
  • vm/src/exceptions.rs (3 hunks)
  • vm/src/format.rs (1 hunks)
  • vm/src/object/core.rs (2 hunks)
  • vm/src/object/payload.rs (2 hunks)
  • vm/src/protocol/object.rs (3 hunks)
  • vm/src/stdlib/codecs.rs (2 hunks)
  • vm/src/stdlib/ctypes/structure.rs (1 hunks)
  • vm/src/stdlib/io.rs (10 hunks)
  • vm/src/stdlib/operator.rs (3 hunks)
  • vm/src/stdlib/posix.rs (2 hunks)
  • vm/src/stdlib/pwd.rs (2 hunks)
  • vm/src/stdlib/sys.rs (1 hunks)
  • vm/src/stdlib/time.rs (3 hunks)
  • vm/src/types/slot.rs (4 hunks)
  • vm/src/utils.rs (2 hunks)
  • vm/src/vm/mod.rs (1 hunks)
  • vm/src/vm/vm_ops.rs (2 hunks)
  • vm/src/warn.rs (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit Inference Engine (.github/copilot-instructions.md)

**/*.rs: Follow the default rustfmt code style (cargo fmt to format)
Always run clippy to lint code (cargo clippy) before completing tasks. Fix any warnings or lints that are introduced by your changes
Follow Rust best practices for error handling and memory management
Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Files:

  • vm/src/builtins/builtin_func.rs
  • vm/src/stdlib/posix.rs
  • vm/src/format.rs
  • vm/src/utils.rs
  • vm/src/stdlib/operator.rs
  • vm/src/builtins/slice.rs
  • vm/src/builtins/mod.rs
  • vm/src/vm/vm_ops.rs
  • vm/src/stdlib/pwd.rs
  • vm/src/stdlib/sys.rs
  • vm/src/warn.rs
  • vm/src/stdlib/time.rs
  • vm/src/stdlib/ctypes/structure.rs
  • vm/src/builtins/namespace.rs
  • vm/src/exceptions.rs
  • vm/src/object/payload.rs
  • vm/src/builtins/super.rs
  • vm/src/types/slot.rs
  • vm/src/stdlib/codecs.rs
  • vm/src/vm/mod.rs
  • vm/src/protocol/object.rs
  • vm/src/stdlib/io.rs
  • vm/src/builtins/type.rs
  • vm/src/object/core.rs
  • vm/src/builtins/str.rs
  • stdlib/src/sqlite.rs
🧠 Learnings (20)
vm/src/builtins/builtin_func.rs (3)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Learnt from: moreal
PR: #5847
File: vm/src/stdlib/stat.rs:547-567
Timestamp: 2025-06-27T14:47:28.810Z
Learning: In RustPython's stat module implementation, platform-specific constants like SF_SUPPORTED and SF_SYNTHETIC should be conditionally declared only for the platforms where they're available (e.g., macOS), following CPython's approach of optional declaration using #ifdef checks rather than providing fallback values for other platforms.

Learnt from: moreal
PR: #5847
File: vm/src/stdlib/stat.rs:547-567
Timestamp: 2025-06-27T14:47:28.810Z
Learning: In RustPython's stat module implementation, platform-specific constants like SF_SUPPORTED and SF_SYNTHETIC should be conditionally declared only for the platforms where they're available (e.g., macOS), following CPython's approach of optional declaration rather than providing fallback values for other platforms.

vm/src/stdlib/posix.rs (3)

Learnt from: moreal
PR: #5847
File: vm/src/stdlib/stat.rs:547-567
Timestamp: 2025-06-27T14:47:28.810Z
Learning: In RustPython's stat module implementation, platform-specific constants like SF_SUPPORTED and SF_SYNTHETIC should be conditionally declared only for the platforms where they're available (e.g., macOS), following CPython's approach of optional declaration using #ifdef checks rather than providing fallback values for other platforms.

Learnt from: moreal
PR: #5847
File: vm/src/stdlib/stat.rs:547-567
Timestamp: 2025-06-27T14:47:28.810Z
Learning: In RustPython's stat module implementation, platform-specific constants like SF_SUPPORTED and SF_SYNTHETIC should be conditionally declared only for the platforms where they're available (e.g., macOS), following CPython's approach of optional declaration rather than providing fallback values for other platforms.

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

vm/src/format.rs (3)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: In most cases, Python code should not be edited. Bug fixes should be made through Rust code modifications only

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to Lib/test/**/*.py : When tests fail due to unsupported syntax, keep the test as @unittest.expectedFailure, document that it requires PEP 695 support, and focus on tests that can be fixed through Rust code changes only

vm/src/utils.rs (1)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

vm/src/stdlib/operator.rs (3)

Learnt from: moreal
PR: #5847
File: vm/src/stdlib/stat.rs:547-567
Timestamp: 2025-06-27T14:47:28.810Z
Learning: In RustPython's stat module implementation, platform-specific constants like SF_SUPPORTED and SF_SYNTHETIC should be conditionally declared only for the platforms where they're available (e.g., macOS), following CPython's approach of optional declaration using #ifdef checks rather than providing fallback values for other platforms.

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Learnt from: moreal
PR: #5847
File: vm/src/stdlib/stat.rs:547-567
Timestamp: 2025-06-27T14:47:28.810Z
Learning: In RustPython's stat module implementation, platform-specific constants like SF_SUPPORTED and SF_SYNTHETIC should be conditionally declared only for the platforms where they're available (e.g., macOS), following CPython's approach of optional declaration rather than providing fallback values for other platforms.

vm/src/builtins/mod.rs (1)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

vm/src/vm/vm_ops.rs (1)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

vm/src/stdlib/pwd.rs (1)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

vm/src/stdlib/sys.rs (2)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to Lib/test/**/*.py : When tests fail due to unsupported syntax, keep the test as @unittest.expectedFailure, document that it requires PEP 695 support, and focus on tests that can be fixed through Rust code changes only

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Follow Rust best practices for error handling and memory management

vm/src/stdlib/time.rs (1)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

vm/src/exceptions.rs (2)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to Lib/test/**/*.py : When tests fail due to unsupported syntax, keep the test as @unittest.expectedFailure, document that it requires PEP 695 support, and focus on tests that can be fixed through Rust code changes only

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Follow Rust best practices for error handling and memory management

vm/src/object/payload.rs (1)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

vm/src/builtins/super.rs (1)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

vm/src/types/slot.rs (1)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

vm/src/stdlib/codecs.rs (1)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

vm/src/protocol/object.rs (1)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

vm/src/stdlib/io.rs (1)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

vm/src/builtins/type.rs (1)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

vm/src/builtins/str.rs (1)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

stdlib/src/sqlite.rs (1)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

🧬 Code Graph Analysis (4)
vm/src/stdlib/sys.rs (1)
vm/src/builtins/str.rs (1)
  • to_str (434-436)
vm/src/stdlib/time.rs (4)
vm/src/vm/vm_ops.rs (1)
  • format (495-520)
vm/src/builtins/str.rs (3)
  • format (936-940)
  • as_str (430-432)
  • as_str (2046-2053)
vm/src/stdlib/builtins.rs (1)
  • format (393-399)
common/src/str.rs (1)
  • as_str (262-266)
vm/src/object/core.rs (2)
vm/src/object/payload.rs (2)
  • try_downcast_from (33-49)
  • downcastable_from (29-31)
vm/src/builtins/str.rs (3)
  • try_downcast_from (1973-1976)
  • downcastable_from (1965-1971)
  • a (1571-1571)
vm/src/builtins/str.rs (4)
common/src/str.rs (5)
  • as_str (262-266)
  • is_utf8 (43-45)
  • is_empty (296-298)
  • char_len (301-306)
  • char_len (337-339)
vm/src/object/traverse.rs (1)
  • try_traverse (17-17)
vm/src/object/payload.rs (4)
  • payload_type_id (23-25)
  • std (24-24)
  • downcastable_from (29-31)
  • try_downcast_from (33-49)
wtf8/src/lib.rs (2)
  • as_str (837-839)
  • is_utf8 (983-985)
🔇 Additional comments (74)
vm/src/utils.rs (1)

5-5: LGTM! Properly extends ToCString trait for PyUtf8Str.

The import addition and trait implementation correctly enable PyUtf8Str to be converted to C strings, maintaining consistency with the existing PyStr implementation. This is essential for C API interoperability in modules like SQLite.

Also applies to: 38-38

vm/src/format.rs (1)

115-115: LGTM! Correctly uses WTF-8 string view.

The change from as_ref() to as_wtf8() is more explicit about the encoding and aligns with the broader refactor to handle UTF-8/WTF-8 string representations consistently across the codebase.

vm/src/builtins/slice.rs (1)

295-295: LGTM! Beneficial code simplification.

Removing the unnecessary .as_str() calls simplifies the code since PyStrRef implements the Display trait and can be used directly in the format! macro. This improves readability while maintaining the same functionality.

Also applies to: 300-300

vm/src/stdlib/ctypes/structure.rs (1)

42-42: LGTM! Streamlined string conversion.

The direct call to to_string() eliminates the unnecessary intermediate .as_str() step since PyStr implements the ToString trait directly. This simplifies the code while maintaining the same functionality.

vm/src/builtins/mod.rs (1)

62-62: LGTM! Properly exposes new UTF-8 string types.

Adding PyUtf8Str and PyUtf8StrRef to the public exports enables their use across the codebase as part of the UTF-8 validation refactor. The naming follows established conventions and placement with other string types is appropriate.

vm/src/vm/mod.rs (1)

475-475: LGTM! Consistent with UTF-8 string handling improvements.

This change aligns with the broader refactoring to use UTF-8 validated string types and removes the need for explicit .as_str() conversion when printing error messages.

vm/src/builtins/builtin_func.rs (2)

5-5: LGTM! Necessary import for WTF-8 string handling.

The import of Wtf8 supports the updated debug formatting that uses WTF-8 string types consistently.


31-31: LGTM! Consistent WTF-8 formatting for debug output.

The change ensures consistent type handling in debug formatting by using Wtf8::new("<unknown>") instead of a plain string literal, matching the Wtf8 type returned by m.as_wtf8() when the module is present.

vm/src/stdlib/pwd.rs (2)

9-9: LGTM! Updated import for UTF-8 validated string types.

The import change supports the function signature update to use PyUtf8StrRef for better type safety.


57-57: LGTM! Improved type safety with UTF-8 validated strings.

Changing the parameter type from PyStrRef to PyUtf8StrRef enforces UTF-8 validation at the API boundary, which is appropriate for usernames and aligns with the broader migration to UTF-8 validated string types.

vm/src/builtins/namespace.rs (1)

92-94: LGTM! Improved string representation handling with WTF-8.

The change from as_str() to as_wtf8() improves robustness by handling strings that may contain surrogate code points or ill-formed UTF-8 sequences, while maintaining the same representation logic for namespace objects.

vm/src/stdlib/posix.rs (2)

27-27: LGTM: Import addition supports UTF-8 validated strings.

The addition of PyUtf8StrRef to the import list is consistent with the PR's objective to introduce UTF-8 validated string types across the codebase.


2245-2245: LGTM: UTF-8 validated string conversion improves type safety.

The change from PyStrRef::try_from_object to PyUtf8StrRef::try_from_object ensures UTF-8 validity at the parameter level for system configuration name parsing. This eliminates the need for fallible UTF-8 conversions while maintaining the same parsing logic.

vm/src/stdlib/codecs.rs (2)

10-10: LGTM: Import addition supports UTF-8 validated strings.

The addition of PyUtf8StrRef to the import list enables the use of UTF-8 validated string references in the codec lookup functionality.


26-29: LGTM: UTF-8 validated encoding parameter simplifies codec lookup.

The changes improve the lookup function by:

  1. Accepting PyUtf8StrRef parameter to ensure UTF-8 validity of encoding names
  2. Using encoding.as_str() directly instead of the fallible try_to_str(vm)? conversion
  3. Eliminating potential error handling for UTF-8 validation

This is consistent with the broader refactoring to use UTF-8 validated string references across the codebase.

vm/src/builtins/type.rs (1)

343-343: LGTM: Efficient byte-level comparison for ASCII slot names.

This optimization replaces UTF-8 string slice comparisons with direct byte comparisons when checking for Python special method names (dunder methods). Since we're only checking for ASCII characters (__), this is both safe and more efficient, avoiding unnecessary UTF-8 validation overhead.

Also applies to: 349-349

vm/src/builtins/super.rs (1)

107-107: LGTM: Consistent byte-level comparison for __class__ checks.

Both changes replace string slice comparisons with byte slice comparisons when checking for the __class__ identifier. Since __class__ is pure ASCII, this optimization is safe and more efficient, avoiding UTF-8 validation overhead while maintaining identical behavior.

Also applies to: 165-165

vm/src/stdlib/operator.rs (3)

8-9: LGTM: Clean import organization.

Good organization of imports - adding Wtf8 and grouping Either with other function imports improves readability and prepares for WTF-8 string handling.


327-327: LGTM: Simplified ASCII check.

Direct call to is_ascii() on PyStrRef is cleaner and more readable than the previous approach, maintaining the same functionality with better code clarity.


374-382: LGTM: Optimized attribute name handling with WTF-8 support.

This change optimizes dotted attribute name processing by:

  1. Working at the byte level for splitting (more efficient)
  2. Using WTF-8 conversion to handle potentially ill-formed UTF-8 sequences
  3. Maintaining compatibility with the existing API

The approach is consistent with the broader string handling optimizations throughout the codebase and properly handles edge cases with non-standard string encodings.

vm/src/vm/vm_ops.rs (2)

4-5: LGTM: Necessary imports for UTF-8 string support.

Adding PyRef and PyUtf8Str imports to support the new format_utf8 method. The imports are properly organized and necessary for the new functionality.


521-523: LGTM: Well-designed UTF-8 formatting method.

The new format_utf8 method provides a convenient API for formatting objects into UTF-8 validated strings. The implementation:

  1. Reuses the existing format method appropriately
  2. Uses fallible conversion (try_into_utf8) to ensure UTF-8 validity
  3. Follows the established pattern for UTF-8 string type usage
  4. Provides clear error handling through the PyResult return type

This complements the broader UTF-8 string type introduction across the codebase.

vm/src/stdlib/time.rs (3)

38-38: LGTM: Import addition is correct.

The addition of PyUtf8StrRef to the imports is necessary and appropriate for the strftime function signature change.


347-351: LGTM: Function signature update is semantically correct.

Changing the format parameter from PyStrRef to PyUtf8StrRef is appropriate since strftime format strings should be valid UTF-8. This ensures the function receives UTF-8 validated input, eliminating the need for redundant validation.


362-363: LGTM: Direct UTF-8 string access leverages type guarantees.

The changes correctly utilize the UTF-8 validity guarantee of PyUtf8StrRef:

  • format.as_str() safely provides direct access to the UTF-8 string slice
  • format.to_string() appropriately handles the error fallback case

This simplifies the code by eliminating redundant UTF-8 validation that was previously required with PyStrRef.

vm/src/stdlib/sys.rs (1)

715-715: LGTM: Improved error handling for strings with surrogates.

The change from str.as_str() to str.to_str().unwrap_or("<str with surrogate>") is a defensive programming improvement that:

  • Safely handles strings containing surrogates without panicking
  • Provides a clear, descriptive fallback message for debugging
  • Prevents crashes in error handling paths when dealing with malformed UTF-8

This aligns well with the broader goal of safer string handling throughout the codebase.

vm/src/warn.rs (5)

125-125: LGTM: UTF-8 validation for warning action strings.

Changing from str(vm) to str_utf8(vm) is appropriate for warning action strings since they should be valid UTF-8 (e.g., "error", "ignore", "default"). This provides the necessary UTF-8 guarantee for downstream processing.


204-205: LGTM: Robust filename handling for potentially invalid UTF-8.

The changes improve filename handling robustness:

  • Line 204: Using as_bytes().ends_with(b".py") provides byte-level comparison that doesn't require UTF-8 validity
  • Line 205: Using as_wtf8() for slicing safely handles filenames with potentially invalid UTF-8 sequences

This prevents crashes when processing filenames containing invalid UTF-8, which can occur in real-world scenarios.


235-235: LGTM: WTF-8 handling for warning message text.

Changing from as_str() to as_wtf8() appropriately handles warning messages that might contain invalid UTF-8 sequences. WTF-8 can safely represent any byte sequence, preventing potential panics when processing malformed warning text.


281-281: LGTM: Consistent UTF-8 validation for action comparisons.

The changes from str(vm)?.as_str() to str_utf8(vm)?.as_str() are consistent with the UTF-8 validation approach for warning actions. Since these comparisons are against string literals ("error", "ignore"), UTF-8 validation is appropriate and ensures type consistency.

Also applies to: 285-285


348-348: LGTM: Simplified string formatting without explicit conversion.

Removing the .as_str() call is a good simplification since writeln! can handle PyStrRef directly. This also avoids potential panics if the warning text contains surrogates, making the code more robust.

vm/src/object/payload.rs (2)

29-31: LGTM: Simple and correct downcast viability check.

The downcastable_from method provides a clean interface for checking if an object can be downcasted to the payload type. The implementation correctly uses payload_type_id() for type comparison, following established patterns in the codebase.


33-49: LGTM: Well-designed error handling for safe downcasting.

The try_downcast_from method implementation demonstrates excellent Rust practices:

  • Type Safety: Uses downcastable_from check before proceeding
  • Performance: Cold function optimization prevents inlining the error path
  • Consistency: Leverages vm.new_downcast_type_error for uniform error messaging
  • Ergonomics: Returns PyResult<()> following established VM patterns

This provides a safe, efficient foundation for downcasting operations throughout the VM with proper error reporting.

vm/src/protocol/object.rs (6)

8-8: LGTM! Import addition aligns with UTF-8 string type introduction.

The addition of PyUtf8Str to the imports is consistent with the broader refactoring to introduce UTF-8 validated string types across the codebase.


331-333: LGTM! UTF-8 validated repr method implementation is correct.

The repr_utf8 method properly delegates to the existing repr method and converts the result to a UTF-8 validated string using try_into_utf8. This follows the established pattern for providing UTF-8 validated variants.


335-335: LGTM! Explicit return type clarification.

Making the return type explicitly PyRef<PyStr> instead of implicit helps distinguish between the generic string method and its UTF-8 validated counterpart (repr_utf8).


353-353: LGTM! Consistent usage of UTF-8 validated repr.

The ascii method now correctly uses repr_utf8() instead of repr(), which is appropriate since ASCII conversion requires valid UTF-8 input. This ensures type safety and UTF-8 validation.


358-360: LGTM! UTF-8 validated str method implementation is correct.

The str_utf8 method follows the same pattern as repr_utf8, properly delegating to the existing str method and converting to UTF-8 validated string. This provides a clean API for callers who need UTF-8 guarantees.


361-361: LGTM! Explicit return type clarification.

Similar to the repr method, making the return type explicitly PyRef<PyStr> helps distinguish this generic method from its UTF-8 validated counterpart (str_utf8).

vm/src/exceptions.rs (2)

201-205: LGTM! Cleaner filename handling with direct match.

The refactoring simplifies the filename handling by using a direct match statement instead of chaining unwrap_or_else and map. This is more readable and maintainable while preserving the same behavior - using the provided filename or defaulting to "<string>".


1497-1497: LGTM! Improved control flow with single return point.

The refactoring consolidates the return logic to use a single return point by assigning the result to a local str variable. This is a good practice in Rust as it:

  • Reduces code duplication
  • Makes the control flow clearer
  • Maintains the same functionality while improving readability

The logic remains identical - return the formatted OS error string for 2-argument cases, otherwise delegate to the base __str__ method.

Also applies to: 1517-1517, 1519-1521

vm/src/types/slot.rs (3)

172-172: LGTM! Type alias improvement for better clarity.

The change from PyStrRef to PyRef<PyStr> improves type explicitness and aligns with the broader refactoring effort to distinguish string types in the codebase.


253-253: LGTM! Function signature updated consistently.

The return type change aligns with the StringifyFunc type alias update and maintains consistency throughout the codebase.


980-980: LGTM! Trait method signatures updated consistently.

All Representable trait methods have been updated to use the more explicit PyRef<PyStr> return type, maintaining consistency with the StringifyFunc type alias and the broader string type refactoring effort.

Also applies to: 989-989, 994-994

vm/src/object/core.rs (4)

544-547: LGTM! Safe downcasting method with proper error handling.

The new try_downcast method provides a safe alternative to the existing downcast method by using T::try_downcast_from for validation and VM-based error generation. The unsafe downcast_unchecked is only called after successful validation, which is the correct pattern.


728-731: LGTM! Internal method for exposing TypeId.

The typeid method provides necessary access to the object's internal TypeId for downcasting operations. The pub(crate) visibility is appropriate for this internal implementation detail.


736-737: LGTM! Delegating downcast check to payload type.

The updated downcastable method delegates to T::downcastable_from(self), allowing payload types to customize their downcasting logic. This is particularly useful for types like PyUtf8Str that need additional validation beyond type ID matching.


740-746: LGTM! Safe reference-based downcasting method.

The new try_downcast_ref method provides a safe reference-based alternative to downcast_ref with VM-based error handling. The implementation correctly validates through T::try_downcast_from before performing the unsafe cast, maintaining consistency with the other safe downcasting methods.

vm/src/stdlib/io.rs (6)

123-123: LGTM! Import addition is correct.

The addition of PyUtf8StrRef to the imports is necessary for the UTF-8 string refactoring and is properly placed alphabetically.


1582-1582: LGTM! Consistent return type updates for repr methods.

The changes from PyStrRef to PyRef<PyStr> for both slot_repr and __repr__ methods are consistent with the broader string type refactoring. The implementation logic remains unchanged, maintaining functionality while improving type safety.

Also applies to: 1595-1595


2317-2317: LGTM! Improved string access pattern.

The change from encoding.try_to_str(vm)? to encoding.as_str() is correct and more efficient. Since the parameter is now PyUtf8StrRef (UTF-8 validated), the fallible conversion is no longer needed, and direct string slice access is appropriate.


2417-2417: LGTM! Consistent UTF-8 string access pattern.

The change from encoding.try_to_str(vm)? to encoding.as_str() follows the same correct pattern as other changes in the file, taking advantage of UTF-8 pre-validation to eliminate unnecessary error handling.


3895-3895: LGTM! Systematic UTF-8 string type adoption.

The changes systematically replace PyStrRef with PyUtf8StrRef for mode and encoding parameters in I/O-related structs:

  • IoOpenArgs.mode: Now uses UTF-8 validated strings for file modes
  • OpenArgs.encoding: Now uses UTF-8 validated strings for encoding parameters
  • FileIOArgs.mode: Consistent with the pattern
  • Import addition in fileio module enables the new types

These changes are consistent, well-structured, and align with the UTF-8 validation goals of the refactoring.

Also applies to: 3921-3921, 4133-4133, 4260-4260


4298-4299: LGTM! Completion of UTF-8 string refactoring pattern.

The changes correctly complete the UTF-8 string adoption:

  • Line 4298: Using PyUtf8Str::from("rb") instead of PyStr::from("rb") is appropriate since "rb" is valid UTF-8
  • Line 4299: Direct string access with as_str() replaces fallible conversion, taking advantage of UTF-8 pre-validation

These changes are consistent with the systematic refactoring throughout the file.

vm/src/builtins/str.rs (7)

18-18: LGTM: Import addition for trait implementation

The addition of MaybeTraverse, Traverse, and TraverseFn imports is necessary for the new PyUtf8Str type implementation.


67-69: Type aliases properly defined

The type aliases follow the established pattern and provide convenient shorthand for both regular and UTF-8 validated string references.


354-354: Constructor return type correctly updated

The change from PyResult<PyRef<Self>> to PyResult allows for proper subclassing support.


438-455: Enhanced UTF-8 validation with better error reporting

The method is correctly made private and now provides detailed error information including the exact position of invalid surrogates.


954-974: Format method properly updated for UTF-8 handling

The changes ensure format specifications are valid strings and the formatted result is UTF-8 validated before processing.


1475-1483: UTF-8 validation method correctly implemented

The try_as_utf8 method properly validates UTF-8 before performing the safe cast to PyUtf8Str.


1931-2087: Well-structured PyUtf8Str implementation

The repr(transparent) wrapper design ensures zero overhead while maintaining type safety. The custom downcastable_from implementation correctly validates UTF-8 before allowing downcasts.

stdlib/src/sqlite.rs (13)

62-62: LGTM: Correct import addition for UTF-8 validated string types.

The addition of PyUtf8Str and PyUtf8StrRef imports is necessary and appropriate for the systematic refactoring to use UTF-8 validated string types in SQL contexts.


855-855: LGTM: Correct parameter type change for SQL statements.

Changing from PyStrRef to PyUtf8StrRef is appropriate for SQL statement parameters, ensuring UTF-8 validation at the type level.


990-990: LGTM: Appropriate type change for SQL parameter.

Using PyUtf8StrRef for the SQL parameter ensures UTF-8 validation, which is essential for SQL statement execution.


1002-1002: LGTM: Consistent type change for SQL parameter.

The change to PyUtf8StrRef for the SQL parameter maintains consistency with other execute methods and ensures UTF-8 validation.


1014-1014: LGTM: Appropriate type change for SQL script parameter.

Using PyUtf8StrRef for the script parameter ensures UTF-8 validation for SQL scripts, which is necessary for proper execution.


1163-1163: LGTM: Correct type change for collation name.

Using PyUtf8StrRef for the collation name parameter ensures UTF-8 validation, which is appropriate for database collation names.


1494-1494: LGTM: Consistent type change for SQL parameter.

The change to PyUtf8StrRef maintains consistency with Connection execute methods and ensures UTF-8 validation for SQL statements.


1566-1566: LGTM: Appropriate type change for SQL parameter.

Using PyUtf8StrRef for the SQL parameter ensures UTF-8 validation and maintains consistency across execute methods.


1640-1640: LGTM: Consistent type change for SQL script parameter.

The change to PyUtf8StrRef maintains consistency with Connection::executescript and ensures UTF-8 validation for SQL scripts.


2376-2376: LGTM: Appropriate type change for Statement constructor.

Using PyUtf8StrRef for the SQL parameter in Statement::new ensures UTF-8 validation at the statement creation level, which is foundational for proper SQL handling.


2731-2731: LGTM: Correct UTF-8 conversion for string binding.

The explicit conversion to UTF-8 using try_as_utf8(vm)? is appropriate when binding PyStr parameters that haven't been upgraded to PyUtf8StrRef, ensuring UTF-8 validation is maintained.


2991-2991: LGTM: Consistent UTF-8 conversion for result handling.

The explicit UTF-8 conversion using try_as_utf8(vm)? maintains consistency with parameter binding and ensures UTF-8 validation for string results.


3072-3078: LGTM: Excellent improvement to eliminate redundant UTF-8 validation.

The change from PyStrRef to &PyUtf8Str is a significant improvement that:

  • Provides compile-time UTF-8 validation guarantees
  • Eliminates the need for runtime UTF-8 checks within the function
  • Makes the code more robust and efficient

This represents the core benefit of the refactoring effort.

Comment on lines +2449 to 2466
self.to_str()
.expect("Interned PyStr should always be valid UTF-8")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Verify UTF-8 validity is enforced at string interning time

The expect assumes interned strings are always valid UTF-8. This invariant should be enforced when strings are interned.


🏁 Script executed:

#!/bin/bash
# Search for string interning implementation to verify UTF-8 validation
ast-grep --pattern 'impl $_ {
  $$$
  fn intern($$$) {
    $$$
  }
  $$$
}'

# Also search for intern_str or similar methods
rg -A 10 "intern_str|new_interned" --type rust

Length of output: 35264


🏁 Script executed:

#!/bin/bash
# Locate string_pool usage in Context
rg -n "string_pool" vm/src/vm/context.rs

# Show beginning of context.rs to inspect field definitions
sed -n '1,200p' vm/src/vm/context.rs

# Search for StringPool type and its implementation
rg -n "struct StringPool" -n vm/src/vm
rg -n "impl StringPool" -n vm/src/vm

Length of output: 5090


🏁 Script executed:

#!/bin/bash
# Locate the intern module file
rg --files-with-matches "InternableString" -t rust

# Inspect its contents to find StringPool implementation
FILE=$(rg --files-with-matches "InternableString" -t rust | head -n1)
echo "=== Showing top of $FILE ==="
sed -n '1,200p' "$FILE"

# Search for StringPool definition and intern method
rg -n "pub struct StringPool" -n "$FILE"
rg -n "impl StringPool" -n "$FILE"

Length of output: 5331


🏁 Script executed:

#!/bin/bash
# Find InternableString and MaybeInternedString definitions in intern.rs
rg -n "trait InternableString" -n vm/src/intern.rs
rg -n "trait MaybeInternedString" -n vm/src/intern.rs

# Show the definitions around those lines
sed -n '250,350p' vm/src/intern.rs

Length of output: 2393


Enforce UTF-8 validity when interning strings
The StringPool::intern implementation accepts Wtf8Buf (which may contain unpaired surrogates) without any check, so the later

self.to_str().expect("Interned PyStr should always be valid UTF-8")

in vm/src/builtins/str.rs can still panic. We need to validate once at interning time to uphold the invariant.

• In vm/src/intern.rs, inside impl StringPool { pub unsafe fn intern<…> }, right after you get back the PyRefExact<PyStr> (e.g. before calling miss), add:

     let str_ref = s.into_pyref_exact(typ);
+    // Validate that the underlying WTF-8 data is valid UTF-8 before inserting
+    let w = str_ref.as_wtf8();
+    debug_assert!(
+        w.to_str().is_some(),
+        "Tried to intern invalid UTF-8 string: {:?}",
+        w
+    );

This guarantees that any panic in to_str().expect(…) truly reflects a broken invariant, not a late check.

🤖 Prompt for AI Agents
In vm/src/intern.rs inside the unsafe fn intern method, after obtaining the
PyRefExact<PyStr> and before calling miss, add a validation step to check that
the interned string is valid UTF-8. This ensures the invariant that interned
PyStr is always valid UTF-8 is enforced at interning time, preventing the later
to_str().expect(...) call in vm/src/builtins/str.rs from panicking unexpectedly.
Implement this by attempting to convert the interned string to &str and handling
any invalid UTF-8 by panicking or returning an error immediately.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
vm/src/builtins/str.rs (2)

1345-1351: LGTM: Proper error handling for surrogates

This change directly addresses the previous review concern about potential panics with surrogate code points. The method now:

  1. Returns PyResult<String> instead of String for proper error propagation
  2. Validates UTF-8 using try_as_utf8(vm)? before processing
  3. Handles surrogates gracefully by returning an error instead of panicking

The TODO comment appropriately acknowledges that full WTF-8 support could be implemented in the future if needed.


2464-2466: Note: Underlying UTF-8 validation issue persists

While the expectation message is now more descriptive, this still has the potential to panic if invalid UTF-8 strings are interned. As noted in previous reviews, the invariant that "interned PyStr should always be valid UTF-8" should be enforced at string interning time rather than relying on runtime checks.

Consider the previous suggestion to validate UTF-8 during the interning process in vm/src/intern.rs to prevent this expectation from ever failing.

🧹 Nitpick comments (1)
vm/src/macros.rs (1)

189-196: Consider adding debug assertions to validate the UTF-8 invariant

The unsafe unchecked downcast relies on the assumption that all VM identifiers are ASCII strings. While this is likely true in practice, consider adding a debug assertion to validate this invariant:

 #[macro_export]
 macro_rules! identifier_utf8(
     ($as_ctx:expr, $name:ident) => {
         // Safety: All known identifiers are ascii strings.
+        #[cfg(debug_assertions)]
+        debug_assert!(
+            $as_ctx.as_ref().names.$name.as_object()
+                .downcast_ref::<$crate::builtins::PyStr>()
+                .unwrap().is_utf8(),
+            "identifier {} is not valid UTF-8", stringify!($name)
+        );
         #[allow(clippy::macro_metavars_in_unsafe)]
         unsafe { $as_ctx.as_ref().names.$name.as_object().downcast_unchecked_ref::<$crate::builtins::PyUtf8Str>() }
     };
 );

This would help catch violations of the UTF-8 invariant during development while maintaining the performance benefits of unchecked downcasting in release builds.

📜 Review details

Configuration used: .coderabbit.yml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 445c3bd and 9583af0.

📒 Files selected for processing (28)
  • stdlib/src/sqlite.rs (13 hunks)
  • vm/src/builtins/builtin_func.rs (2 hunks)
  • vm/src/builtins/mod.rs (1 hunks)
  • vm/src/builtins/namespace.rs (1 hunks)
  • vm/src/builtins/slice.rs (1 hunks)
  • vm/src/builtins/str.rs (12 hunks)
  • vm/src/builtins/super.rs (2 hunks)
  • vm/src/builtins/type.rs (1 hunks)
  • vm/src/convert/try_from.rs (1 hunks)
  • vm/src/exceptions.rs (3 hunks)
  • vm/src/format.rs (1 hunks)
  • vm/src/macros.rs (1 hunks)
  • vm/src/object/core.rs (3 hunks)
  • vm/src/object/payload.rs (2 hunks)
  • vm/src/protocol/object.rs (3 hunks)
  • vm/src/stdlib/codecs.rs (2 hunks)
  • vm/src/stdlib/ctypes/structure.rs (1 hunks)
  • vm/src/stdlib/io.rs (14 hunks)
  • vm/src/stdlib/operator.rs (3 hunks)
  • vm/src/stdlib/posix.rs (2 hunks)
  • vm/src/stdlib/pwd.rs (2 hunks)
  • vm/src/stdlib/sys.rs (1 hunks)
  • vm/src/stdlib/time.rs (3 hunks)
  • vm/src/types/slot.rs (4 hunks)
  • vm/src/utils.rs (2 hunks)
  • vm/src/vm/mod.rs (1 hunks)
  • vm/src/vm/vm_ops.rs (2 hunks)
  • vm/src/warn.rs (5 hunks)
✅ Files skipped from review due to trivial changes (3)
  • vm/src/builtins/slice.rs
  • vm/src/format.rs
  • vm/src/exceptions.rs
🚧 Files skipped from review as they are similar to previous changes (22)
  • vm/src/stdlib/ctypes/structure.rs
  • vm/src/utils.rs
  • vm/src/builtins/mod.rs
  • vm/src/stdlib/pwd.rs
  • vm/src/builtins/super.rs
  • vm/src/builtins/namespace.rs
  • vm/src/stdlib/operator.rs
  • vm/src/warn.rs
  • vm/src/builtins/type.rs
  • vm/src/stdlib/time.rs
  • vm/src/stdlib/posix.rs
  • vm/src/stdlib/codecs.rs
  • vm/src/builtins/builtin_func.rs
  • vm/src/vm/mod.rs
  • vm/src/protocol/object.rs
  • vm/src/types/slot.rs
  • vm/src/object/core.rs
  • vm/src/vm/vm_ops.rs
  • vm/src/object/payload.rs
  • vm/src/stdlib/io.rs
  • vm/src/stdlib/sys.rs
  • stdlib/src/sqlite.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit Inference Engine (.github/copilot-instructions.md)

**/*.rs: Follow the default rustfmt code style (cargo fmt to format)
Always run clippy to lint code (cargo clippy) before completing tasks. Fix any warnings or lints that are introduced by your changes
Follow Rust best practices for error handling and memory management
Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Files:

  • vm/src/macros.rs
  • vm/src/convert/try_from.rs
  • vm/src/builtins/str.rs
🧠 Learnings (2)
vm/src/macros.rs (1)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

vm/src/builtins/str.rs (5)

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to Lib/test/**/*.py : When tests fail due to unsupported syntax, keep the test as @unittest.expectedFailure, document that it requires PEP 695 support, and focus on tests that can be fixed through Rust code changes only

Learnt from: moreal
PR: #5847
File: vm/src/stdlib/stat.rs:547-567
Timestamp: 2025-06-27T14:47:28.810Z
Learning: In RustPython's stat module implementation, platform-specific constants like SF_SUPPORTED and SF_SYNTHETIC should be conditionally declared only for the platforms where they're available (e.g., macOS), following CPython's approach of optional declaration using #ifdef checks rather than providing fallback values for other platforms.

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Learnt from: moreal
PR: #5847
File: vm/src/stdlib/stat.rs:547-567
Timestamp: 2025-06-27T14:47:28.810Z
Learning: In RustPython's stat module implementation, platform-specific constants like SF_SUPPORTED and SF_SYNTHETIC should be conditionally declared only for the platforms where they're available (e.g., macOS), following CPython's approach of optional declaration rather than providing fallback values for other platforms.

Learnt from: CR
PR: RustPython/RustPython#0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-07-19T03:16:56.511Z
Learning: Applies to **/*.rs : Follow Rust best practices for error handling and memory management

🧬 Code Graph Analysis (2)
vm/src/macros.rs (2)
vm/src/vm/context.rs (1)
  • as_ref (634-636)
vm/src/object/core.rs (4)
  • downcast_unchecked_ref (745-745)
  • downcast_unchecked_ref (754-754)
  • downcast_unchecked_ref (767-767)
  • downcast_unchecked_ref (773-777)
vm/src/convert/try_from.rs (4)
vm/src/builtins/str.rs (6)
  • obj (1969-1969)
  • obj (1975-1975)
  • class (280-282)
  • class (1638-1640)
  • class (1958-1960)
  • try_downcast_from (1974-1977)
vm/src/vm/mod.rs (1)
  • class (567-576)
vm/src/object/payload.rs (2)
  • class (51-51)
  • try_downcast_from (33-49)
vm/src/object/core.rs (1)
  • class (673-675)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Run tests under miri
  • GitHub Check: Run snippets and cpython tests on wasm-wasi
  • GitHub Check: Run rust tests (windows-latest)
  • GitHub Check: Run snippets and cpython tests (macos-latest)
  • GitHub Check: Run snippets and cpython tests (windows-latest)
  • GitHub Check: Run snippets and cpython tests (ubuntu-latest)
  • GitHub Check: Check the WASM package and demo
  • GitHub Check: Check Rust code with rustfmt and clippy
  • GitHub Check: Ensure compilation on various targets
  • GitHub Check: Run rust tests (macos-latest)
  • GitHub Check: Run rust tests (ubuntu-latest)
🔇 Additional comments (6)
vm/src/convert/try_from.rs (1)

81-86: LGTM: Improved error handling with VM-aware validation

The refactored implementation correctly uses the new try_downcast_from method to validate the downcast before performing the unsafe operation. This change:

  1. Maintains the performance optimization of fast_isinstance for type checking
  2. Adds proper VM-aware validation through try_downcast_from
  3. Provides consistent error reporting across the codebase
  4. Only performs the unchecked downcast after validation succeeds

This aligns well with the new UTF-8 string validation infrastructure and improves type safety.

vm/src/builtins/str.rs (5)

67-68: LGTM: Good placement of type aliases

Moving the type aliases to the top of the file improves readability and discoverability. The new PyUtf8StrRef follows the same naming convention as the existing PyStrRef.


438-455: LGTM: Comprehensive UTF-8 validation with proper error reporting

The ensure_valid_utf8 method provides thorough validation with:

  • Proper position tracking for surrogate detection
  • Python-compatible Unicode error messages
  • Integration with the VM's error system

Making this method private is appropriate since it's an internal validation utility.


506-521: LGTM: Safe UTF-8 validation and casting methods

These methods provide a safe interface for converting to PyUtf8Str:

  1. Proper validation: Uses ensure_valid_utf8 to check for surrogates before casting
  2. Safe casting: Only performs the unsafe cast after validation succeeds
  3. Correct lifetimes: Maintains proper lifetime relationships between input and output
  4. Transparent wrapper: The cast is safe due to PyUtf8Str being a transparent wrapper around PyStr

The implementation follows good Rust practices for safe abstractions over unsafe operations.


970-990: LGTM: Added UTF-8 validation to string formatting

The addition of try_into_utf8(vm)? on line 983 ensures that string formatting only operates on valid UTF-8 data. This prevents potential issues with surrogate code points during formatting operations and maintains consistency with the new UTF-8 validation approach.

The error handling is properly propagated, and the change maintains backward compatibility while improving robustness.


1939-2101: Excellent implementation of PyUtf8Str wrapper type

This is a well-designed UTF-8 validated string type that provides:

Type Safety:

  • Transparent wrapper around PyStr with enforced UTF-8 invariants
  • Safe accessor methods that leverage the UTF-8 guarantee
  • Proper downcast validation in try_downcast_from

Comprehensive API:

  • Multiple From implementations for UTF-8 compatible types
  • Standard trait implementations (Debug, Display, Traverse, PyPayload)
  • Upcast support via as_pystr() method
  • Equality traits based on string content

Integration:

  • Proper payload_type_id() to share type identity with PyStr
  • Runtime validation in downcastable_from checking UTF-8 validity
  • Safe unchecked operations justified by type invariants

The implementation follows Rust best practices for newtype wrappers and provides a solid foundation for UTF-8 validated string handling throughout the VM.

@youknowone youknowone merged commit cee579e into RustPython:main Jul 30, 2025
12 checks passed
@youknowone youknowone deleted the wtf8 branch July 30, 2025 03:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant