Skip to content

Conversation

@nicolas-grekas
Copy link
Member

Q A
Branch? 7.4
Bug fix? no
New feature? no
Deprecations? no
Issues -
License MIT

This PR keeps the DOM*-based API but uses the native HTML5 parser on PHP 8.4 instead of masterminds/html5.
This works by parsing HTML strings using Dom\HTMLDocument then serializing to XML, and loading again using DOMDocument::loadXML().

This basically replaces #61356 since it removes any BC breaks.

The drawback compared to a more native approach is the double-parsing that happens.
This could be worked on later by providing a way to leverage the new Dom\* API directly.
To be proved worth it before.

@carsonbot carsonbot added this to the 7.4 milestone Aug 20, 2025
@OskarStark OskarStark changed the title [DomCrawler] Use the native HTM5 parser on PHP 8.4 [DomCrawler] Use the native HTML5 parser on PHP 8.4 Aug 20, 2025
@nicolas-grekas nicolas-grekas force-pushed the dom-native-html5 branch 4 times, most recently from f963618 to a6a0033 Compare August 21, 2025 06:43
@nicolas-grekas nicolas-grekas merged commit 8187625 into symfony:7.4 Aug 21, 2025
10 of 12 checks passed
nicolas-grekas added a commit that referenced this pull request Aug 21, 2025
…nks to the native DOM parser (nicolas-grekas)

This PR was merged into the 8.0 branch.

Discussion
----------

[DomCrawler] Always parse according to HTML5 rules thanks to the native DOM parser

| Q             | A
| ------------- | ---
| Branch?       | 8.0
| Bug fix?      | no
| New feature?  | yes
| Deprecations? | no
| Issues        | -
| License       | MIT

Follows #61475

Commits
-------

0425b2a [DomCrawler] Always parse according to HTML5 rules thanks to the native DOM parser
@nicolas-grekas nicolas-grekas deleted the dom-native-html5 branch August 21, 2025 09:08
This was referenced Oct 27, 2025
@PhilETaylor
Copy link
Contributor

Just a note/heads up for others that find themselves here...

I upgraded to v7.4.0-BETA1 - ran my unit tests, which worked in 7.3.x - and now they fail :)

Messages like:

Warning: DOMDocument::loadXML(): Failed to parse QName ':class' in Entity, line: 17 in /Users/phil/Sites/symfony-7.4-alpine-reproducer/vendor/symfony/dom-crawler/Crawler.php on line 1116

Warning: DOMDocument::loadXML(): error parsing attribute name in Entity, line: 21 in /Users/phil/Sites/symfony-7.4-alpine-reproducer/vendor/symfony/dom-crawler/Crawler.php on line 1116

A reproducer is here https://github.com/PhilETaylor/symfony-7.4-alpine-reproducer

I think this is a wont fix ...

@stof
Copy link
Member

stof commented Oct 27, 2025

the fact that Alpine.js makes you write code that reports warnings in spec-compliant HTML parsers looks like an issue for alpine.js instead.

However, it looks like the new code path does not use libxml_use_internal_errors(true) anymore and so triggers PHP warnings for cases that were not triggering them before. @nicolas-grekas should this be changed ?

@PhilETaylor
Copy link
Contributor

the fact that Alpine.js makes you write code that reports warnings in spec-compliant HTML parsers looks like an issue for alpine.js instead.

Totally agree. Just mentioning it as Im probably the first to see this since upgrading/testing the beta.

and so triggers PHP warnings for cases that were not triggering them before

I only mentioned it as the OP mentioned this should remove b/c breaks, and then my tests broke up upgrade. Im not expecting a fix, unless you think one is needed.

@stof
Copy link
Member

stof commented Oct 27, 2025

I suggest opening an issue instead of discussing that on a merged PR, to give it more visibility.
As hinted in my previous message, I think something still needs to be fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants