Retrieve Text In Html With Powershell

Question

In this html code :

Solution 1:

What makes this question so interesting is that HTML looks and smells just like XML, and XML is much easier to process programmatically thanks to its well-behaved, orderly structure. In an ideal world HTML would be a subset of XML, but HTML in the real world is emphatically not XML. If you feed the example in the question into any XML parser, it will balk at a variety of infractions. That being said, the desired result can be achieved with a single line of PowerShell. This one returns the whole text of the href:

Select-NodeContent $doc.DocumentNode "//a/@href"

And this one extracts the desired substring:

Select-NodeContent $doc.DocumentNode "//a/@href" "IP_PHONE_BACKUP-(.*)\.zip"

The catch, however, is in the overhead/setup to be able to run that one line of code. You need to:

Install HtmlAgilityPack to make HTML parsing look just like XML parsing.
Install PowerShell Community Extensions if you want to parse a live web page.
Understand XPath to be able to construct a navigable path to your target node.
Understand regular expressions to be able to extract a substring from your target node.

With those requirements satisfied, you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, both shown below. The very end of the code shows how to assign a value to the $doc variable used in the above one-liners, loading HTML either from a file or from the web, depending on your needs.

Set-StrictMode -Version Latest
$HtmlAgilityPackPath = [System.IO.Path]::Combine((Get-Item $PROFILE).DirectoryName, "bin\HtmlAgilityPack.dll")
Add-Type -Path $HtmlAgilityPackPath

function Select-NodeContent(
    [HtmlAgilityPack.HtmlNode]$node,
    [string] $xpath,
    [string] $regex,
    [Object] $default = "")
{
    # Note: ?: is the Invoke-Ternary alias provided by PSCX.
    if ($xpath -match "(.*)/@(\w+)$") {
        # If standard XPath to retrieve an attribute is given,
        # map to supported operations to retrieve the attribute's text.
        ($xpath, $attribute) = $matches[1], $matches[2]
        $resultNode = $node.SelectSingleNode($xpath)
        $text = ?: { $resultNode } { $resultNode.Attributes[$attribute].Value } { $default }
    }
    else { # retrieve an element's text
        $resultNode = $node.SelectSingleNode($xpath)
        $text = ?: { $resultNode } { $resultNode.InnerText } { $default }
    }
    # If a regex is given, use it to extract a substring from the text
    if ($regex) {
        if ($text -match $regex) { $text = $matches[1] }
        else { $text = $default }
    }
    return $text
}

$doc = New-Object HtmlAgilityPack.HtmlDocument
$result = $doc.Load("tmp\temp.html") # Use this to load a file
#$result = $doc.LoadHtml((Get-HttpResource $url)) # Use this PSCX cmdlet to load a live web page
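
The two string-processing stages of Select-NodeContent can be tried without HtmlAgilityPack installed. What follows is only a sketch of those stages (splitting a trailing /@name off the XPath, then applying the optional regex). The $text value is a stand-in for what SelectSingleNode would return for the href in the question, and the variable names $elementPath and $result are illustrative, not part of the function above.

```powershell
# Stage 1: split a trailing "/@attribute" off the XPath, as the function does.
$xpath = '//a/@href'
if ($xpath -match '(.*)/@(\w+)$') {
    $elementPath = $Matches[1]   # XPath to the element itself
    $attribute   = $Matches[2]   # name of the attribute to read from it
}

# Stage 2: apply the optional regex to the retrieved text.
# Assumption: $text stands in for the attribute value read from the node.
$text    = '/tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip'
$regex   = 'IP_PHONE_BACKUP-(.*)\.zip'
$default = ''
if ($text -match $regex) { $result = $Matches[1] } else { $result = $default }

"$elementPath | $attribute | $result"
# prints: //a | href | 2012-Jul-25_15:47:47
```

If the regex does not match, $result falls back to $default, which mirrors the behavior of the function.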

Solution 2:

Actually, the HTML surrounding your file name is irrelevant here. You can extract the date just fine with the following regex (which doesn't even care whether you're extracting it from an e-mail, an HTML page, or a CSV file):

(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)

Quick test:

PS> [regex]::Match($html,'(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)')

Groups   : {2012-Jul-25_15:47:47}
Success  : True
Captures : {2012-Jul-25_15:47:47}
Index    : 391
Length   : 20
Value    : 2012-Jul-25_15:47:47
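
When only the matched text is wanted, the Value property of the match can be read directly. A minimal sketch, assuming $html holds the page source; here it is a shortened stand-in string, so the Index differs from the output above.

```powershell
# Shortened stand-in for the page source; in practice $html holds the whole page.
$html = '<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>'

# The lookbehind and lookahead anchor the match without becoming part of it,
# so the match itself is just the text between the prefix and ".zip".
$pattern = '(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)'
$match   = [regex]::Match($html, $pattern)

if ($match.Success) {
    $stamp = $match.Value
} else {
    $stamp = $null    # nothing found; the caller decides what to do
}
$stamp
# prints: 2012-Jul-25_15:47:47
```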

Solution 3:

Groups 2 and 3 of the following regex contain the date and the time, respectively:

/IP_PHONE_BACKUP-((.*)_(.*))\.zip/

For how to extract the group values from a regex match in PowerShell, see the related question "Is there a shorter way to pull groups out of a Powershell regex?"
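
In PowerShell the groups can be read from the automatic $Matches variable after a successful -match. A minimal sketch, assuming the pattern is written without the surrounding slashes (PowerShell does not use regex delimiters) and with the dot before zip escaped. It matches against a single path: on a whole page that contains the file name twice, the greedy (.*) would span both occurrences, so a tighter class such as [^_]+ would be safer there.

```powershell
# A single path taken from the question's markup.
$path = '/tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip'

if ($path -match 'IP_PHONE_BACKUP-((.*)_(.*))\.zip') {
    $stamp = $Matches[1]   # date and time together
    $date  = $Matches[2]   # part before the underscore
    $time  = $Matches[3]   # part after the underscore
}

"$date $time"
# prints: 2012-Jul-25 15:47:47
```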

HIH

Solution 4:

Without regex:

$a = '<div id="ajaxWarningRegion" class="infoFont"></div><span id="ajaxStatusRegion"></span><form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" ><pre>Creating a new ZIP of IP Phone files from HTTP/PhoneBackup and HTTPS/PhoneBackup</pre><pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre><pre>Reports Success</pre><pre></pre><a href =  /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>Download the new ZIP of IP Phone files</a></div>'
$a.Substring($a.IndexOf("IP_PHONE_BACKUP")+"IP_PHONE_BACKUP".length+1, $a.IndexOf(".zip")-$a.IndexOf("IP_PHONE_BACKUP")-"IP_PHONE_BACKUP".length-1)

Substring gets you a part of the original string. The first parameter is the start position of the substring, while the second is the length of the desired substring. So all you have to do is calculate the start and the length using a little IndexOf and Length magic.
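
The same calculation can be easier to follow with named intermediate values. The following is a sketch only: Get-TextBetween is a hypothetical helper name, not something from the answer above, and the sample string is a shortened piece of the markup. Unlike the one-liner, it looks for the suffix only after the prefix and returns $null when either marker is missing.

```powershell
# Hypothetical helper: returns the text between a prefix and a suffix,
# or $null when either marker is missing.
function Get-TextBetween {
    param(
        [string] $Text,
        [string] $Prefix,
        [string] $Suffix
    )
    $prefixAt = $Text.IndexOf($Prefix)
    if ($prefixAt -lt 0) { return $null }

    $start = $prefixAt + $Prefix.Length       # first character after the prefix
    $end   = $Text.IndexOf($Suffix, $start)   # first suffix at or after $start
    if ($end -lt 0) { return $null }

    # Substring(start, length): the length is the distance between the two positions.
    return $Text.Substring($start, $end - $start)
}

$sample = '<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>'
Get-TextBetween -Text $sample -Prefix 'IP_PHONE_BACKUP-' -Suffix '.zip'
# prints: 2012-Jul-25_15:47:47
```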
