
Regex To Parse A Multiline HTML

I am trying to parse a multi-line HTML file using regex. HTML code: Details uss_vod_translator Regex Expre

Solution 1:

Can anyone please suggest how to parse a multiline HTML file?

Stop trying to use regular expressions and use a module that will parse it for you.

HTML::TreeBuilder is a good solution.

HTML::TreeBuilder::LibXML gives you the same API but backed by a fast parser.

HTML::TreeBuilder::XPath adds XPath support to the HTML::TreeBuilder API.
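As a rough illustration of the first suggestion, here is a minimal sketch that uses plain HTML::TreeBuilder with no XPath. The helper name value_after_label and the sample markup are assumptions made for this example; they are modeled on the rows shown in the question, not taken from the asker's real file.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text of the first cell in the row that
# follows the row whose cell reads exactly $label (undef if not found).
sub value_after_label {
    my ($html, $label) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);

    # look_down walks the tree and returns the first matching element.
    my $label_cell = $tree->look_down(
        _tag => 'td',
        sub { $_[0]->as_trimmed_text eq $label },
    );

    my $value;
    if ($label_cell) {
        # In list context, right() returns all following siblings of the row;
        # keep only element nodes that are <tr> rows.
        my ($next_row) = grep { ref $_ && $_->tag eq 'tr' } $label_cell->parent->right;
        my $value_cell = $next_row ? $next_row->look_down(_tag => 'td') : undef;
        $value = $value_cell->as_trimmed_text if $value_cell;
    }

    $tree->delete;    # release the tree once we are done with it
    return $value;
}

# Sample markup (an assumption, modeled on the question).
my $html = <<'END';
<table>
<tr><td>General Info</td></tr>
<tr class="d1">
<td>some info</td></tr>
<tr><td>Details</td></tr>
<tr class="d1">
<td>uss_vod_translator</td></tr>
</table>
END

my $details = value_after_label($html, 'Details');
print defined $details ? "$details\n" : "not found\n";
```

Because the parser builds a tree, the line breaks between the tags no longer matter; the lookup works the same whether the markup sits on one line or many.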


Solution 2:

As stated above, never use regexes to parse HTML.

I use HTML::TreeBuilder::XPath to parse HTML, and it has dramatically decreased the development time for each of my scraping/parsing programs.

Here is how your task could be implemented:

use Modern::Perl;
use HTML::TreeBuilder::XPath;

my $html = <<END;
<tr><td>General Info</td></tr>  
<tr class=d1>
<td>some info</td></tr>
<tr><td>Details</td></tr>  
<tr class=d1>
<td>uss_vod_translator</td></tr>
<tr><td>Another header</td></tr>  
<tr class=d1>
<td>some other info</td></tr>
END

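# Parse the HTML into a tree that supports XPath queries.
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Find the row whose cell reads "Details", then take the first cell of the next row.
my ($details) = $tree->findvalues('//tr[ td[ text() = "Details" ] ]/following-sibling::tr[1]/td[1]');
say $details;

# Release the tree once it is no longer needed.
$tree->delete;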

Solution 3:

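Run the substitution below before you match your pattern:

 $line =~ s/>\s+</></g;

This removes the whitespace (including newlines and tabs, which \s already covers) between adjacent tags, so the HTML string ends up on a single line. It only works if $line holds the whole document rather than one line read at a time.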
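A minimal sketch of that approach, assuming the whole document is slurped into one string first. The helper name collapse_between_tags, the in-memory filehandle, and the sample markup are assumptions for illustration; in a real script the filehandle would come from opening the HTML file. Only core Perl is used.

```perl
use strict;
use warnings;

# Hypothetical helper: drop whitespace that sits between a closing ">"
# and the next opening "<", leaving text inside the tags untouched.
sub collapse_between_tags {
    my ($html) = @_;
    $html =~ s/>\s+</></g;
    return $html;
}

# Sample markup (an assumption, modeled on the question).
my $source = "<tr><td>Details</td></tr>  \n<tr class=d1>\n<td>uss_vod_translator</td></tr>\n";

# Read from an in-memory filehandle to stand in for a real file.
open my $fh, '<', \$source or die "cannot open in-memory file: $!";

# Undefine the input record separator so <$fh> returns the whole content
# at once instead of one line at a time.
my $html = do { local $/; <$fh> };
close $fh;

my $flat = collapse_between_tags($html);
print $flat;
```

Note that this only normalizes the layout. Any pattern matched against the flattened string is still subject to the usual problems of parsing HTML with regexes that the other answers point out.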

