Opinions/Suggestions for XML Data

MannyKSoSo · 18 Jul 2018, 12:03

So I am looking for opinions/suggestions on the best way/approach to gather information from this xml. This is only a small section of the xml, but the formatting remains the same throughout the whole thing.

Code: Select all

<Row ss:Index="76" ss:AutoFitHeight="0" ss:Height="179">
<Cell ss:Index="1" ss:MergeAcross="2" ss:StyleID="s04">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40" ss:Type="String">
<Font html:Size="10" html:Face="Arial" x:Family="Swiss" html:Color="#000000">0</Font>
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">958 </Font>
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">INT 1336</Font>
</ss:Data>
</Cell>
<Cell ss:Index="4" ss:MergeAcross="12" ss:StyleID="s04">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40" ss:Type="String">
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">
International Chart Series, Baltic Sea - Sweden and Denmark, Bornholmsgat.
</Font>
<Font html:Size="9" html:Face="Arial" x:Family="Swiss" html:Color="#000000">A </Font>
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">Christiansø. </Font>
<Font html:Size="9" html:Face="Arial" x:Family="Swiss" html:Color="#000000">B </Font>
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">Rønne. </Font>
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">
55° 05´·03 N. — 55° 06´·42 N., 14° 40´·57 E. — 14° 42´·03 E.
</Font>
<Font html:Size="9" html:Face="Arial" x:Family="Swiss" html:Color="#000000">C </Font>
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">Nexø. </Font>
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">
<I>Includes</I>
<I> </I>
<I>changes</I>
<I> </I>
<I>to</I>
<I> </I>
<I>depths,</I>
<I> </I>
<I>wrecks,</I>
<I> </I>
<I>restricted</I>
<I> </I>
<I>areas,</I>
<I> </I>
<I>extraction</I>
<I> </I>
<I>areas</I>
<I> </I>
<I>and</I>
<I> </I>
<I>aids</I>
<I> </I>
<I>to</I>
<I> </I>
<I>navigation.</I>
<I> </I>
<I>The</I>
<I> </I>
<I>limits</I>
<I> </I>
<I>of</I>
<I> </I>
<I>panel</I>
<I> </I>
<I>B</I>
<I> </I>
<I>have</I>
<I> </I>
<I>been</I>
<I> </I>
<I>changed</I>
<I> </I>
<I>to</I>
<I> </I>
<I>provide</I>
<I> </I>
<I>improved</I>
<I> </I>
<I>coverage</I>
<I> </I>
<I>of</I>
<I> </I>
<I>Rønne.</I>
<I> </I>
<I>(A</I>
<I> </I>
<I>modified</I>
<I> </I>
<I>reproduction</I>
<I> </I>
<I>of</I>
<I> </I>
<I>INT1336</I>
<I> </I>
<I>published</I>
<I> </I>
<I>by</I>
<I> </I>
<I>Denmark.)</I>
</Font>
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">
<I>Note:</I>
<I> </I>
On publication of this New Edition former Notice 6042(P)/17 is cancelled.
</Font>
</ss:Data>
</Cell>
<Cell ss:Index="17" ss:MergeAcross="5" ss:StyleID="s04">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40" ss:Type="String">
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">1:100,000 </Font>
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">1:12,500 </Font>
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">1:12,500 </Font>
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">1:8,000</Font>
</ss:Data>
</Cell>
<Cell ss:Index="23" ss:MergeAcross="3" ss:StyleID="s02">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40" ss:Type="Number">10</ss:Data>
</Cell>
<Cell ss:Index="27" ss:MergeAcross="5" ss:StyleID="s02">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40" ss:Type="Number">34</ss:Data>
</Cell>
</Row>

Currently I first split the data into small sections by row (since all the data I want is contained within a row). But as you can see, there are no specific IDs for each data point, and some of them are almost exactly the same in some cases. So I wrote a small section of code to clean up the mess of the XML so it's manageable.

Code: Select all

FileRead, InputText, PWKLY 30_week30_2018.xml
Pos=1
While Pos := RegExMatch(InputText, "Us)<Row.*</Row>", RawData, Pos+StrLen(RawData))
{
	If A_Index = 76 ;not needed if you are only using a small section of code
	{
		RawData := StrReplace(RawData, "<Font html:Size=""10"" html:Face=""Arial"" x:Family=""Swiss"" html:Color=""#000000"">0</Font>", "")
		RawData := StrReplace(RawData, "</Font>", "`r`n")
		RawData := StrReplace(RawData, "</ss:Data>", "`r`n")
		RawData := StrReplace(RawData, "&#10;", "")
		RawData := RegExReplace(RawData, "U)<.*>", "")
		FileAppend, % RawData, Text%A_Index%.txt
	}
}

This code produces the following result.

Code: Select all

958
INT 1336

International Chart Series, Baltic Sea - Sweden and Denmark, Bornholmsgat.
A 
Christiansø.
B 
Rønne.
55° 05´·03 N. — 55° 06´·42 N., 14° 40´·57 E. — 14° 42´·03 E.
C 
Nexø.
Includes changes to depths, wrecks, restricted areas, extraction areas and aids to navigation. The limits of panel B have been changed to provide improved coverage of Rønne. (A modified reproduction of INT1336 published by Denmark.)
Note: On publication of this New Edition former Notice 6042(P)/17 is cancelled.

1:100,000
1:12,500
1:12,500
1:8,000

10
34

This isn't terrible to deal with, but it means I have to treat each line as a separate data point. Specifically, I would like the variables to be as follows:

#: 958 INT#: 1336
Title: International Chart Series, Baltic Sea - Sweden and Denmark, Bornholmsgat. Scale: 1:100,000
Plan: A Title: Christiansø. Limits: (null) Scale: 1:12,500
Plan: B Title: Rønne. Limits: 55° 05´·03 N. — 55° 06´·42 N., 14° 40´·57 E. — 14° 42´·03 E. Scale: 1:12,500
Plan: C Title: Nexø. Limits: (null) Scale: 1:8,000
Remark 1: Includes changes to depths, wrecks, restricted areas, extraction areas and aids to navigation. The limits of panel B have been changed to provide improved coverage of Rønne. (A modified reproduction of INT1336 published by Denmark.)
Remark 2: Note: On publication of this New Edition former Notice 6042(P)/17 is cancelled.
Folio: 10
Page: 34
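
One direction that might avoid the regex cleanup altogether is to load the row into an XML DOM and read each Font node's face and text. Below is a minimal AutoHotkey v1 sketch, not a tested solution. It assumes MSXML 6.0 is available through ComObjCreate, that the namespace URIs declared on the sample Row match the ones on the real file's root element, and that plan letters are always the Arial runs and remarks always contain <I> tags, as in the excerpt above. The sample row is trimmed down and its plan title is a placeholder.

Code: Select all

; Sketch (AutoHotkey v1): classify the Font runs of a row via MSXML instead of regex.
; Assumption: these namespace declarations mirror the ones on the real root element.
xml =
(
<Row xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40" xmlns:x="urn:schemas-microsoft-com:office:excel">
<Cell ss:Index="4" ss:MergeAcross="12" ss:StyleID="s04">
<ss:Data xmlns="http://www.w3.org/TR/REC-html40" ss:Type="String">
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">International Chart Series, Baltic Sea - Sweden and Denmark, Bornholmsgat.</Font>
<Font html:Size="9" html:Face="Arial" x:Family="Swiss" html:Color="#000000">A </Font>
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">Placeholder plan title. </Font>
<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000"><I>Includes</I><I> </I><I>changes</I><I> </I><I>to</I><I> </I><I>depths.</I></Font>
</ss:Data>
</Cell>
</Row>
)

doc := ComObjCreate("MSXML2.DOMDocument.6.0")
doc.async := false
doc.preserveWhiteSpace := true	; keep the single-space <I> </I> runs
if (!doc.loadXML(xml))
{
	MsgBox % "XML parse error: " doc.parseError.reason
	ExitApp
}
; Map the prefixes used in the XPath queries below to their namespace URIs
doc.setProperty("SelectionNamespaces", "xmlns:ss='urn:schemas-microsoft-com:office:spreadsheet' xmlns:html='http://www.w3.org/TR/REC-html40'")

report := ""
fonts := doc.selectNodes("//ss:Cell/ss:Data/html:Font")
Loop % fonts.length
{
	font := fonts.item(A_Index - 1)
	text := Trim(RegExReplace(font.text, "\s+", " "))	; collapse whitespace runs
	if (text = "")
		continue
	if (font.selectNodes("html:I").length > 0)	; italic runs look like remarks
		kind := "Remark"
	else if (font.getAttribute("html:Face") = "Arial")	; Arial runs look like plan letters
		kind := "Plan"
	else
		kind := "Title/Limits"
	report .= kind ": " text "`n"
}
MsgBox % report

Telling a plan title apart from its limits would still need a further check, for example whether the text contains a degree sign.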

Any suggestions or hints welcome. Thanks
PS #BlameAdobe for the silly format of the xml and the repetitive Italics

MannyKSoSo · 20 Jul 2018, 12:34

Update to previous post. I have updated my code somewhat to improve finding the data appropriately. This is what I have come up with

Code: Select all

While Pos := RegExMatch(InputText, "Us)<Row.*</Row>", RawData, Pos+StrLen(RawData))
{
	DataType := RegLast(RawData, DataType)
	If (DataType = "")
		Continue
	If InStr(RawData, "ADMIRALTY CHARTS", False)
		Continue
	If (DataType = "<B>WITHDRAWN</B>" or DataType = "New ADMIRALTY Charts" or DataType = "New Editions of ADMIRALTY Charts" or DataType = "Errata" or DataType = "ERRATA")
		Continue
	RawData := StrReplace(RawData, "<Font html:Size=""10"" html:Face=""Arial"" x:Family=""Swiss"" html:Color=""#000000"">0</Font>", "")
	RawData := StrReplace(RawData, "<Font html:Size=""9"" html:Face=""Arial"" x:Family=""Swiss"" html:Color=""#000000"">0</Font>", "")
	RawData := StrReplace(RawData, ". </Font>", "`r`n")
	RawData := StrReplace(RawData, "</ss:Data>", "`r`n")
	RawData := StrReplace(RawData, "&#10;", "`r`n")
	RawData := RegExReplace(RawData, "U)<.*>", "")
	Location := RegExMatch(RawData, "(|NP|NZ|IN|JP|AUS|SLB(| )|CP|X)\d?\d?\d?\d\.?\d?+", Match)
	If Match is float
		Continue
	; Integer and non-numeric matches were handled identically, so one check is enough:
	; keep the row only when the match starts at position 1
	If (Location = 1)
		FileAppend, % "#####`r`n" FileName "`r`n" RawData, %DataType%.txt
}

RegLast(RawData, Last) {
	RegExMatch(RawData, "(New( Editions of|) ADMIRALTY Charts published \d?\d (January|February|March|April|May|June|July|August|September|October|November|December) \d\d\d\d|ADMIRALTY Publications|<B>WITHDRAWN</B>|New ADMIRALTY Charts|New Editions of ADMIRALTY Charts|ADMIRALTY CHARTS AND PUBLICATIONS PERMANENTLY WITHDRAWN|INTENTION TO WITHDRAW CHARTS|ADMIRALTY CHARTS INDEPENDENTLY WITHDRAWN|(Errata|ERRATA))", DataType)
	If (DataType = "")
		Return Last
	Else
		Return DataType
}

This produces a better-looking file (it removes the extra line breaks), but there are still things to improve. For example, depending on how the original PDF was laid out, a few quirks inherited from it can still spoil some of the data.

Code: Select all

30/2018
SLB 101
2018-04-06
South  Pacific  Ocean  -  Soloman  Islands,  Anchorages  in Guadalcanal Island.
Marau Sound. Lungga Roads. Honiara.
Includes changes to depths. (A modified reproduction of Chart SLB 101 published by Australia.)
Note:  On  publication  of  this  New  Edition  former  Notice 4323(P)/17 is cancelled.
1:50,000
1:20,000
1:5,000
68
102

The above is the newer format that I have come up with so far, but the issue remains with the line "Marau Sound. Lungga Roads. Honiara."
Instead of splitting the names across separate Font elements as in the example XML, this XML lumps them together like so

Code: Select all

<Font html:Size="9" html:Face="Times New Roman" x:Family="Roman" html:Color="#000000">Marau Sound. Lungga Roads. Honiara. </Font>

Any suggestions are appreciated (and before you say StrSplit by periods: it's not 100% guaranteed that the names will have a period).
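
For what it's worth, a naive fallback might be to split such a lumped line wherever a period and whitespace are followed by a capital letter. This is only a sketch in AutoHotkey v1: SplitNames is a made-up helper name, it leaves names without a trailing period unsplit (exactly the case mentioned above), and it would wrongly split abbreviations such as "St. Name". The result would therefore still need checking, perhaps against the number of scales found in the same row.

Code: Select all

; Sketch (AutoHotkey v1). SplitNames is a hypothetical helper, not part of the script above.
SplitNames(text) {
	text := Trim(text)
	; Put a line break after each period that is followed by whitespace and a capital letter
	text := RegExReplace(text, "\.\s+(?=[A-Z])", ".`n")
	return StrSplit(text, "`n")
}

names := SplitNames("Marau Sound. Lungga Roads. Honiara. ")
list := ""
for index, name in names
	list .= index ": " name "`n"
MsgBox % list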

Also, since all of the data I gather from these XML files will be placed in a database file, I have decided to use justme's SQLiteDB Class script for searching and maintaining the database. I would also like to hear suggestions on how best to display the data. I am going back and forth between a ListView and a TreeView (other recommendations accepted), but everything I put in the database will need its own unique ID so you can view the history of what is going on with the chart.
