What makes this question so interesting is that HTML looks and smells just like XML, the latter being much more palatable to programs thanks to its well-behaved and orderly structure. In an ideal world HTML would be a subset of XML, but HTML in the real world is emphatically not XML. If you feed the example in the question into any XML parser, it will balk at a variety of infractions. That being said, the desired result can be achieved with a single line of PowerShell. This one returns the whole text of the href:
Select-NodeContent $doc.DocumentNode "//a/@href"
And this one extracts the desired substring:
Select-NodeContent $doc.DocumentNode "//a/@href" "IP_PHONE_BACKUP-(.*)\.zip"
The catch, however, is in the overhead/setup to be able to run that one line of code. You need to:
- Install HtmlAgilityPack to make HTML parsing look just like XML parsing.
- Install PowerShell Community Extensions if you want to parse a live web page.
- Understand XPath to be able to construct a navigable path to your target node.
- Understand regular expressions to be able to extract a substring from your target node.
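To make the last two requirements concrete, here is a minimal sketch of the same XPath and regex applied with PowerShell's built-in [xml] type. It only works because the sample string is a hypothetical, well-formed stand-in for the page (quoted attribute, properly closed tags); the real markup from the question would not survive the [xml] cast, which is why HtmlAgilityPack is needed. The variable names are my own.

```powershell
# Hypothetical, well-formed stand-in for the page in the question.
[xml]$sample = '<div><a href="/tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip">Download</a></div>'

# XPath: "//a" finds an <a> element anywhere in the document,
# and "/@href" selects that element's href attribute.
$href = $sample.SelectSingleNode('//a/@href').Value

# Regex: the parenthesized group captures whatever sits between
# the "IP_PHONE_BACKUP-" prefix and the ".zip" suffix.
$timestamp = $null
if ($href -match 'IP_PHONE_BACKUP-(.*)\.zip') {
    $timestamp = $matches[1]
}

$href       # the whole attribute text
$timestamp  # just the captured substring
```

The Select-NodeContent function defined further down wraps these same two steps, with HtmlAgilityPack doing the parsing instead of [xml].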
With those requirements satisfied you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, both shown below. The very end of the code shows how you assign a value to the $doc variable used in the above one-liners. I show how to load HTML from a file or from the web, depending on your needs.
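Set-StrictMode -Version Latest

# Load HtmlAgilityPack from a bin folder next to your PowerShell profile.
$HtmlAgilityPackPath = [System.IO.Path]::Combine((Get-Item $PROFILE).DirectoryName, "bin\HtmlAgilityPack.dll")
Add-Type -Path $HtmlAgilityPackPath

function Select-NodeContent(
    [HtmlAgilityPack.HtmlNode]$node,
    [string] $xpath,
    [string] $regex,
    [Object] $default = "")
{
    # A trailing /@name is split off: select the node first, then read the attribute from it.
    if ($xpath -match "(.*)/@(\w+)$") {
        ($xpath, $attribute) = $matches[1], $matches[2]
        $resultNode = $node.SelectSingleNode($xpath)
        # ?: is the PSCX Invoke-Ternary alias: condition, value if true, value if false.
        $text = ?: { $resultNode } { $resultNode.Attributes[$attribute].Value } { $default }
    }
    else {
        $resultNode = $node.SelectSingleNode($xpath)
        $text = ?: { $resultNode } { $resultNode.InnerText } { $default }
    }
    # With a regex, return its first capture group, or the default if nothing matches.
    if ($regex) {
        if ($text -match $regex) { $text = $matches[1] }
        else { $text = $default }
    }
    return $text
}

$doc = New-Object HtmlAgilityPack.HtmlDocument
$result = $doc.Load("tmp\temp.html") # Use this to load a file
# For a live web page, fetch the markup as a string (e.g. with PSCX) and pass it to $doc.LoadHtml(...) instead.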
Actually, the HTML surrounding your file name is irrelevant here. You can extract the date just fine with the following regex (which doesn't even care whether you're extracting it from an e-mail, an HTML page, or a CSV file):
(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)
Quick test:
PS> [regex]::Match($html,'(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)')

Groups   : {2012-Jul-25_15:47:47}
Success  : True
Captures : {2012-Jul-25_15:47:47}
Index    : 391
Length   : 20
Value    : 2012-Jul-25_15:47:47
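In a script you usually want just the matched text rather than the whole Match object. A minimal sketch, assuming a shortened stand-in for $html (so the match index differs from the 391 shown above): the lookbehind and lookahead anchor the match without becoming part of it, so .Value is already the bare timestamp. The conversion to a DateTime at the end is my own addition, and the format string is an assumption based on the one sample value.

```powershell
# Shortened, hypothetical stand-in for the page content.
$html = '<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>'

# (?<=...) must precede the match and (?=...) must follow it,
# but neither is included in the matched text.
$m = [regex]::Match($html, '(?<=/tmp/IP_PHONE_BACKUP-)[^.]+(?=\.zip)')

$stamp = $null
$when  = $null
if ($m.Success) {
    $stamp = $m.Value
    # Optional: turn the text into a DateTime (format assumed from the sample).
    $when = [datetime]::ParseExact($stamp, 'yyyy-MMM-dd_HH:mm:ss',
        [cultureinfo]::InvariantCulture)
}

$stamp
$when
```

Checking Success first matters because a failed match still returns a Match object, just with an empty Value.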
Without regex:
$a = '<div id="ajaxWarningRegion" class="infoFont"></div><span id="ajaxStatusRegion"></span><form enctype="multipart/form-data" method="post" name="confIPBackupForm" action="/cgi-bin/utilserv/confIPBackup/w_confIPBackup" id="confIPBackupForm" ><pre>Creating a new ZIP of IP Phone files from HTTP/PhoneBackup and HTTPS/PhoneBackup</pre><pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre><pre>Reports Success</pre><pre></pre><a href = /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip>Download the new ZIP of IP Phone files</a></div>'
$a.Substring($a.IndexOf("IP_PHONE_BACKUP")+"IP_PHONE_BACKUP".length+1, $a.IndexOf(".zip")-$a.IndexOf("IP_PHONE_BACKUP")-"IP_PHONE_BACKUP".length-1)
Substring gets you a part of the original string. The first parameter is the start position of the substring, while the second parameter is the length of the desired substring. So all you have to do is calculate the start and the length using a little IndexOf and Length magic.
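The same arithmetic is easier to follow when each piece gets a name. This is only a sketch of the idea, using a shortened stand-in for the string and variable names of my own choosing; it also guards against IndexOf returning -1, which the one-liner does not.

```powershell
# Shortened, hypothetical stand-in for the full HTML string.
$text   = '<pre> /tmp/IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip</pre>'
$prefix = 'IP_PHONE_BACKUP-'
$suffix = '.zip'

$stamp = $null
$prefixAt = $text.IndexOf($prefix)           # -1 if the prefix is missing
if ($prefixAt -ge 0) {
    # The value starts right after the prefix.
    $start = $prefixAt + $prefix.Length
    # Look for the suffix only after that point.
    $end = $text.IndexOf($suffix, $start)
    if ($end -ge $start) {
        # Substring takes a start position and a length, not an end position.
        $stamp = $text.Substring($start, $end - $start)
    }
}

$stamp
```

Including the trailing hyphen in the prefix removes the need for the +1 and -1 adjustments in the one-liner.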