D3 Data Extraction

From Railway Knowledge Base for New Zealand
Revision as of 10:51, 13 April 2023 by Robert

Introduction

Following a number of trials, it has become apparent that converting a single D3 Report is not just a matter of running an OCR conversion and dumping the result into Excel. Variations between pages, breaks between sections, complex headers on each page, rows not staying absolutely parallel, and some columns using dittos to repeat the data above all make the material too irregular for a simple conversion.
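As an illustration of the ditto problem only, here is a minimal Python sketch that fills a ditto cell with the value from the same column in the row above. The function name `fill_dittos`, the list-of-lists row layout, and the set of ditto marks are all assumptions made for the example; the marks that OCR actually produces from a D3 page would need to be checked against real output.

```python
# Hypothetical set of strings that OCR might produce for a ditto mark.
# This list is an assumption and should be adjusted to the real output.
DITTO_MARKS = {'"', "''", "do", "do.", "ditto", "\u201d", "\u3003"}


def fill_dittos(rows):
    """Return a copy of rows with ditto cells replaced by the value above.

    rows is a list of rows, each row a list of cell strings.
    A ditto in the first row has nothing above it and is left unchanged.
    """
    filled = []
    for row in rows:
        new_row = []
        for col, cell in enumerate(row):
            is_ditto = cell.strip().lower() in DITTO_MARKS
            if is_ditto and filled and col < len(filled[-1]):
                # Copy from the already-filled row above, so a run of
                # consecutive dittos resolves to the original value.
                new_row.append(filled[-1][col])
            else:
                new_row.append(cell)
        filled.append(new_row)
    return filled


sample = [
    ["Clerk", "Wellington"],
    ['"', "Auckland"],
    ["do.", '"'],
]
result = fill_dittos(sample)
```

Copying from the filled row rather than the raw row above means a column of several dittos in succession still picks up the last real entry.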

With the 1910 D3, I used a trial version of PDFTron with only partial success. Being new to the program did not help. It appeared that there was a utility that would force the program to analyse the data based on a user-definable template, but this was all too complex for me to attempt to learn in a hurry. I pressed on and worked through the pages, getting the list of names with their Classification numbers, positions, and wage/salary data. There remained, though, many instances where the Progression number was merged with the Name, so a VBA routine needs to be written to strip it off. The same routine should be extended to strip the initials off the other end and allocate them to their own field. The 6 columns of service data did not convert at all well, so they were initially left. Converting them is now recognised as a doable task, but only for a single volume, so I still need to find a better process.
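The Name clean-up described above has not been written yet. The following is only a rough sketch of the logic, in Python rather than VBA, to show what the routine would need to do. It assumes the Progression number appears as leading digits and the initials as single capital letters at the end of the field, each followed by a full stop, a space, or the end of the text; the function name `split_name_field` and the sample values are made up for illustration, and real D3 entries may well need extra cases.

```python
import re

# Leading run of digits: assumed to be the Progression number.
_LEADING_NUMBER = re.compile(r"^\s*(\d+)\s*")

# Trailing initials: one or more single capitals, each followed by a
# full stop, whitespace, or the end of the text, and separated from the
# surname by whitespace or a comma.
_TRAILING_INITIALS = re.compile(r"[\s,]+((?:[A-Z](?:\.\s*|\s+|$))+)$")


def split_name_field(raw):
    """Split a merged Name cell into (progression, surname, initials).

    progression is an int, or None if no leading digits are present.
    initials is a string of capital letters, or "" if none were found.
    """
    text = raw.strip()

    progression = None
    match = _LEADING_NUMBER.match(text)
    if match:
        progression = int(match.group(1))
        text = text[match.end():]

    initials = ""
    match = _TRAILING_INITIALS.search(text)
    if match:
        # Keep only the letters, dropping full stops and spaces.
        initials = re.sub(r"[^A-Z]", "", match.group(1))
        text = text[:match.start()]

    return progression, text.strip(" ,"), initials


example = split_name_field("123Smith J. H.")
```

Initials run together without full stops or spaces (for example "JH") are deliberately not split off, because they cannot be told apart from a short surname written in capitals; those rows would be left for manual checking.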

I decided to contact PDFTron through their website and did get a reply about 2 weeks later. By then my trial license had expired, but I may have to give it another go, as they sent me a template file that might be helpful.

In the meantime I have looked at another program that looked promising, Wondershare PDF Element. Experiments with it have not been as successful as I had hoped, and PDFTron's OCR seemed more reliable.