In this post I'll just show how to get the contents of an odt file into PowerShell...
Step One - Get the contents of the "content.xml" file
The content.xml file holds all of the text for the document, and it sits in the root of the odt archive. I chose to use the 7-zip program (www.7-zip.org) to extract the file. This is done like so:
#get the contents of the odt file
$res = & "c:\program files\7-zip\7z.exe" e $ODTfile content.xml #extracts only content.xml from the archive to the current directory
$content = Get-Content content.xml
remove-item content.xml
#modified content
$mc = concat $content " "
The above snippet extracts the content.xml file and loads its contents into the variable $content. I then use a concatenation script I wrote to join all of the lines together into a single string, $mc. This makes the searching we need to do a little bit easier and cleaner.
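If you don't have a concat script of your own, PowerShell's built-in -join operator should give the same result. This is only a sketch: the sample lines are made up, and I'm assuming the script does nothing more than glue the lines together with a single space.

```powershell
# Hypothetical stand-in for what Get-Content returns: one string per line
$content = @('<office:body>', '<text:p>Hello</text:p>', '</office:body>')

# Join every line into one string, separated by a single space
$mc = $content -join " "
```

With these three lines, $mc ends up as one string with the lines separated by single spaces.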
Step Two - Define some regular expressions so we can identify xml tags
We now have a whole lot of xml in $mc and want to process (I use that term loosely) it a little bit. There are only a couple of elements that we really are interested in to get some base functionality. So let's define our regular expressions...
#regular expressions for identifying relevant xml tags
$rpar = New-Object -typename System.Text.RegularExpressions.Regex("<text:[ph][^<>]*>") #a paragraph or heading line
$rtab = New-Object -typename System.Text.RegularExpressions.Regex("<text:tab[^<>]*>") #a tab character
$rtag = New-Object -typename System.Text.RegularExpressions.Regex("<[^<>]+>") #any other xml tag
$rspace = New-Object -typename System.Text.RegularExpressions.Regex("<text:s text:c[^<>]*>") #a number of spaces in a row
$rint = New-Object -typename System.Text.RegularExpressions.Regex("\d+") #an integer
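To get a feel for what these patterns pick up, here is a quick check against a small fragment. The fragment is made up for illustration (it is not from a real document), and I'm using the [regex] type accelerator as a shorthand for New-Object.

```powershell
# Hypothetical sample: one heading, one paragraph containing a tab and a run of spaces
$sample = '<text:h text:outline-level="1">Title</text:h>' +
          '<text:p text:style-name="P1">a<text:tab/>b<text:s text:c="3"/>c</text:p>'

$rparDemo   = [regex]'<text:[ph][^<>]*>'       # opening paragraph or heading tags
$rtabDemo   = [regex]'<text:tab[^<>]*>'        # tab tags
$rspaceDemo = [regex]'<text:s text:c[^<>]*>'   # runs of spaces

$parCount      = $rparDemo.Matches($sample).Count     # the <text:h ...> and <text:p ...> openers
$tabCount      = $rtabDemo.Matches($sample).Count     # the single <text:tab/>
$spaceTagCount = $rspaceDemo.Matches($sample).Count   # the single <text:s text:c="3"/>
```

Note that the closing tags (</text:p>, </text:h>) are not matched by the paragraph pattern; they are left for the catch-all tag pattern to strip later.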
Step Three - Process the tags
#process paragraphs
$mc = $rpar.Replace($mc,"`r`n")
#process tabs
$mc = $rtab.Replace($mc,"`t")
Spaces are a little trickier to handle. Multiple spaces in a row are handled with a tag that looks like <text:s text:c="4">. So we need to search for the tags, find out how many spaces are in each instance, and then create a string with that many spaces. Then we need to replace the xml tags with the strings of spaces...
#process spaces
$spaceCount = New-Object System.Collections.ArrayList
$spaces = New-Object System.Collections.ArrayList
#match the xml for the space tags
$m_spaces = $rspace.matches($mc)
if ($m_spaces.Count -gt 0) {
    #get the number of spaces for each match
    $m_spaces | foreach{
        $result = $spaceCount.add(($rint.match($_.value)).value)
    }
    #create strings with the correct number of spaces
    for ($i = 0;$i -lt $m_spaces.Count;$i++) {
        $result = $spaces.add(("").padleft([int]$spaceCount[$i]))
    }
    #replace the xml space tag with the string of spaces
    for ($i = 0;$i -lt $m_spaces.Count;$i++) {
        $mc = $mc.Replace($m_spaces[$i].value,$spaces[$i])
    }
}
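The same idea can probably be squeezed into a single pass by handing Regex.Replace a script block as the match evaluator. This is a sketch, not what the script above does: Expand-OdtSpaces is a name I made up, and the pattern assumes the count always appears as text:c="N" right after text:s.

```powershell
# Hypothetical helper: swap each <text:s text:c="N"/> tag for N literal spaces
function Expand-OdtSpaces {
    param([string]$Xml)

    # Capture the digits inside text:c="..."
    $pattern = '<text:s text:c="(\d+)"[^<>]*>'

    # The script block runs once per match and returns the replacement text
    $evaluator = [System.Text.RegularExpressions.MatchEvaluator]{
        param($match)
        ' ' * [int]$match.Groups[1].Value
    }

    return [regex]::Replace($Xml, $pattern, $evaluator)
}

$expanded = Expand-OdtSpaces 'a<text:s text:c="3"/>b'
```

Here $expanded should come back as "a", three spaces, then "b", with no intermediate lists to keep in sync.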
Clean up a little more and return the modified string
#strip remaining xml tags
$mc = $rtag.Replace($mc,"")
#clean up other characters
$mc = $mc.Replace("&gt;",">")
$mc = $mc.Replace("&lt;","<")
$mc = $mc.Replace("&apos;","'")
return $mc
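The cleanup above covers three of the five predefined XML entities. A fuller version might look like the sketch below; ConvertFrom-XmlEntity is a hypothetical name, and the one detail that matters is decoding the ampersand entity last so that already-escaped text is not decoded twice.

```powershell
# Hypothetical helper: decode the five predefined XML entities
function ConvertFrom-XmlEntity {
    param([string]$Text)

    $Text = $Text.Replace('&lt;', '<')
    $Text = $Text.Replace('&gt;', '>')
    $Text = $Text.Replace('&apos;', "'")
    $Text = $Text.Replace('&quot;', '"')
    # Ampersand goes last: decoding it first would turn "&amp;lt;" into "&lt;" and then into "<"
    $Text = $Text.Replace('&amp;', '&')
    return $Text
}

$decoded = ConvertFrom-XmlEntity 'a &lt; b &amp;&amp; c &gt; d'
```

With that input, $decoded should be the plain text a < b && c > d.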
Left to do
A lot. Some things that would be nice to add...
- Ability to handle numbered and bulleted lists - currently you get the text next to the number or bullet, but not the number or bullet
- Tables
- Make headings a different color?
- A write-OdtText script would be nice, and an interesting little challenge
-bc