Lloyds TSB duplicate bank template - Extracting a PDF with a scanned image and searchable text
Lloyds TSB send out bank statements like this:
Part of each page is searchable text and part is a scanned image. Because searchable text is present, the OCR engine assumes the text has already been extracted and skips the image. Unfortunately, the image on each page is the bank statement itself.
To overcome this, I used a standalone OCR engine to force OCR over all of the text. Another method is to tick the ‘bypass PDF encryption’ box above the ‘Go’ button. Note that if you have already run the job in StatementReader, you will first have to right-click on that job and select ‘remove OCR cache’ before running it again.
Here are the steps you can use:
1. Select the template UK -> Lloyds TSB duplicate.
2. Select your input non-searchable PDF document using the ‘browse’ button.
3. Untick ‘Parse PDF’ above the ‘Go’ button (this will use our external OCR server by default; you can check this in the Options -> Advanced options -> Engine window). Also tick ‘bypass PDF encryption’.
4. Click ‘Go’.
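To illustrate why a mixed page like this gets skipped, here is a minimal Python sketch of the decision involved. It is not StatementReader's actual logic: the `PageStats` class, both functions, and the 0.5 threshold are hypothetical names and values chosen for illustration, and the page statistics are assumed to come from whatever PDF library you already use.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page summary; not a StatementReader structure."""
    text_chars: int          # characters in the embedded (searchable) text layer
    image_area_ratio: float  # fraction of the page covered by images, 0.0 to 1.0


def naive_skips_ocr(page: PageStats) -> bool:
    # Naive rule: any searchable text means "already extracted", so skip OCR.
    return page.text_chars > 0


def needs_forced_ocr(page: PageStats, min_image_ratio: float = 0.5) -> bool:
    # Assumed safer rule: if an image covers most of the page, run OCR anyway,
    # even when some searchable text is present.
    return page.image_area_ratio >= min_image_ratio


# A page with a short searchable header and a large scanned statement image.
mixed_page = PageStats(text_chars=120, image_area_ratio=0.8)
print(naive_skips_ocr(mixed_page), needs_forced_ocr(mixed_page))
```

On a page shaped like the Lloyds TSB duplicate statement, the naive rule skips OCR while the image-coverage rule asks for it, which is the effect that unticking ‘Parse PDF’ achieves by hand.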