Regex To Parse A Multiline HTML
I am trying to parse a multi-line HTML file using regex. HTML code: Details     uss_vod_translator   Regex Expre
Solution 1:
"Can anyone please suggest how to parse multiline HTML?"
Stop trying to use regular expressions and use a module that will parse it for you.
HTML::TreeBuilder is a good solution.
HTML::TreeBuilder::XPath adds XPath support on top of HTML::TreeBuilder.
HTML::TreeBuilder::LibXML gives you a compatible API, including XPath, but backed by the fast libxml2 parser.
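For the plain HTML::TreeBuilder route, here is a minimal sketch of what that could look like. The sample markup is made up for illustration (the question's real HTML is not shown in full), and it assumes the value you want sits in the row directly after the "Details" row.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup invented for illustration; adjust to the real document.
my $html = <<'END';
<table>
<tr><td>Details</td></tr>
<tr class="d1">
<td>uss_vod_translator</td></tr>
</table>
END

my $tree = HTML::TreeBuilder->new_from_content($html);

# Find the first <td> whose trimmed text is exactly "Details".
my $header = $tree->look_down(
    _tag => 'td',
    sub { $_[0]->as_trimmed_text eq 'Details' },
);

my $value;
if ($header) {
    # Take the first element sibling to the right of the header row;
    # "grep { ref }" skips any plain-text siblings.
    my ($next_row) = grep { ref $_ } $header->parent->right;
    my $cell = $next_row ? $next_row->look_down(_tag => 'td') : undef;
    $value = $cell->as_trimmed_text if $cell;
}

print defined $value ? "$value\n" : "not found\n";
$tree->delete;
```

Line breaks inside the markup do not matter here, because the parser works on the element tree rather than on lines of text.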
Solution 2:
As stated above, never use regexes to parse HTML.
I use HTML::TreeBuilder::XPath to parse HTML, and it dramatically decreases the development time for each of my scraping/parsing programs.
Here is how your task could be implemented:
use Modern::Perl;
use HTML::TreeBuilder::XPath;
# Sample input; the quoted heredoc tag prevents accidental interpolation.
my $html = <<'END';
<tr><td>General Info</td></tr>
<tr class=d1>
<td>some info</td></tr>
<tr><td>Details</td></tr>
<tr class=d1>
<td>uss_vod_translator</td></tr>
<tr><td>Another header</td></tr>
<tr class=d1>
<td>some other info</td></tr>
END
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
# Select the row whose cell text is "Details", then the first cell of the row after it.
my ($details) = $tree->findvalues('//tr[ td[ text() = "Details" ] ]/following-sibling::tr[1]/td[1]');
say $details;
Solution 3:
Try the line below before you match your pattern:
 $line =~ s/>\s+</></g;
It removes the whitespace between tags (\s already covers newlines and tabs), so the HTML string ends up on a single line.
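Note that the substitution can only join lines if the variable holds the whole document, not one line read at a time. A minimal sketch, assuming the HTML has already been slurped into a single scalar (the sample string and the names $html and $flat are illustrative, not from the question), and using the shorter equivalent pattern >\s+<:

```perl
use strict;
use warnings;

# Hypothetical input standing in for a slurped file. To read a real file
# in one go you could use something like:
#   my $html = do { local $/; open my $fh, '<', $path or die $!; <$fh> };
my $html = "<tr><td>Details</td></tr>  \n<tr class=d1>\n<td>uss_vod_translator</td></tr>\n";

# Copy the string, then drop whitespace runs between a '>' and the next '<'.
(my $flat = $html) =~ s/>\s+</></g;

print $flat;
```

This only touches whitespace between tags. Text inside elements is left alone, and it is still a workaround rather than real parsing.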