PHP Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ‘;’ in Entity, line: 2416

Posted on March 18, 2015 in PHP, Programming

PHP Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ‘;’ in Entity, line: 2416

If you're trying to parse the contents of a web page, whether fetched remotely or read locally, you may have run into this error. If you need to iterate over the HTML elements, DOMDocument may be a good choice. It makes it easy to select all the elements in the document that match your search. For example, you can select all the bold elements and read what is inside them. The PHP code below is a simple way to load HTML into the DOM. You can obtain the HTML some other way first, such as cURL or file_get_contents().

$doc = new DOMDocument();
$doc->loadHTML($content);
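
As a rough illustration of the "select all bold elements" idea, here is a minimal sketch. The sample markup and the $boldText variable are made up for this example; real content would come from cURL or file_get_contents() as described above.

```php
<?php
// Hypothetical sample markup; in practice $content would be fetched from a page.
$content = '<html><body><p>Some <b>bold</b> text and <b>more bold</b> text.</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($content);

// Collect the text of every <b> element in document order.
$boldText = [];
foreach ($doc->getElementsByTagName('b') as $node) {
    $boldText[] = $node->textContent;
}

print_r($boldText);
```

getElementsByTagName() only matches one tag name at a time, so a page that marks bold text with <strong> would need a second call for that tag.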

Here is what my research turned up. Some people suggested that the parse fails because a URL in an href attribute contains an unescaped ampersand. The parser treats an ampersand as the start of an entity and looks for a semicolon to end it, so it expects a pattern like &amp; and not &Something=somethingelse. If the content has several ampersands and some semicolons, the parser can also end up skipping content.

So my first thought was to simply run a string replace on every ampersand, and that would fix things. Oh, was that wrong... It threw a bunch of errors, as shown below:

PHP Warning: DOMDocument::loadHTML(): Unexpected end tag : group in Entity, line: 98
PHP Warning: DOMDocument::loadHTML(): Tag header invalid in Entity, line: 606
PHP Warning: DOMDocument::loadHTML(): Tag aside invalid in Entity, line: 1387
PHP Warning: DOMDocument::loadHTML(): End tag : expected ‘>’ in Entity, line: 1397
PHP Warning: DOMDocument::loadHTML(): Unexpected end tag : scr in Entity, line: 1397
PHP Warning: DOMDocument::loadHTML(): Tag aside invalid in Entity, line: 2313
PHP Warning: DOMDocument::loadHTML(): Tag footer invalid in Entity, line: 2383

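The important part of this list, to me, was "End tag : expected '>' in Entity, line: 1397". That suggested the content already contained ampersands that were escaped correctly, and my blanket replace had mangled them. So I found this regular expression to do a more selective replace: it escapes an ampersand only when it is not already the start of one of the common entities (apos, quot, gt, lt, amp) or of a numeric reference.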

$content = preg_replace('/&(?!(?:apos|quot|[gl]t|amp);|#)/', '&amp;', $content);
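
To see the difference between the blanket replace and the selective one, here is a small sketch run against a made-up string. The sample $content and the $naive and $careful variable names are assumptions for illustration only.

```php
<?php
// Hypothetical input: a bare ampersand in a URL plus entities that are already escaped.
$content = '<a href="page.php?a=1&b=2">Tom &amp; Jerry &gt; &#169;</a>';

// Blanket replace: every ampersand is escaped, including ones that already start an entity.
$naive = str_replace('&', '&amp;', $content);

// Selective replace: skip ampersands followed by apos;, quot;, gt;, lt;, amp; or a '#'.
$careful = preg_replace('/&(?!(?:apos|quot|[gl]t|amp);|#)/', '&amp;', $content);

echo $naive, PHP_EOL;
echo $careful, PHP_EOL;
```

One limitation to keep in mind: the pattern only knows the five entity names listed in it, so something like &nbsp; would still be turned into &amp;nbsp;. If your content uses other named entities, the list in the pattern would need to be extended.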

This cleared up the &gt; error and still gave me the results I was looking for. I still had issues with the "Tag aside invalid" and "Unexpected end tag : group" errors. However, those come from HTML5 tags such as header, aside and footer that the underlying HTML parser does not recognize. They had no bearing on my results, so I moved on. The important thing is that the "htmlParseEntityRef: expecting ';'" error is taken care of, because that one affected the organization and structure of the rest of the document as it was parsed.
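
I simply ignored the leftover warnings, but if they clutter your logs, one option is to have libxml collect them instead of emitting PHP warnings. This is a sketch of that alternative, not what I did in the post; the sample markup is made up, and whether any messages are reported at all depends on the libxml2 version PHP was built against.

```php
<?php
// Hypothetical markup using HTML5 elements that the parser may report as invalid.
$content = '<html><body><header>Title</header><aside>Note</aside><p>Body</p></body></html>';

// Collect parser messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($content);

// Inspect whatever was reported; the list may be empty on some libxml2 builds.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), ' on line ', $error->line, PHP_EOL;
}

// Clear the buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The document is still usable even when the unknown tags were flagged.
$paragraph = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $paragraph, PHP_EOL;
```

Looping over libxml_get_errors() also gives you a place to filter: you could log only the messages you care about, such as the entity error, and drop the rest.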

 
