Last week Google began converting scanned PDF documents into digital text, and as a result its searchable index has expanded once again.
Previously, scanned documents rarely appeared in the organic search results. Now that optical character recognition (OCR) is in place, these scanned PDFs will find their way into the results.
According to an article posted at Information Week, one downside is the possibility of personal information appearing in the search results. Social Security numbers that might otherwise have gone unnoticed in scanned court documents can now be discovered through Google.
“Public.Resource.org, a project that aims to make public government publicly accessible, recently found about 1,700 documents with Social Security numbers or alien identification numbers out of a corpus of 2.5 million court documents that go back decades.”
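To give a rough sense of how such numbers could be flagged once a scan has been turned into text, here is a minimal Python sketch. It is not how Google or Public.Resource.org do it; the function name and the pattern are assumptions made for illustration, and the pattern only catches the common hyphenated NNN-NN-NNNN layout while skipping a few ranges that are not issued as Social Security numbers (area 000, 666, or 900-999; group 00; serial 0000).

```python
import re

# Assumed pattern for illustration: hyphenated NNN-NN-NNNN only.
# The lookaheads skip area 000, 666, 9xx, group 00, and serial 0000.
SSN_PATTERN = re.compile(
    r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b"
)


def find_ssn_like(text):
    """Return every SSN-like string found in OCR output (hypothetical helper)."""
    return SSN_PATTERN.findall(text)


if __name__ == "__main__":
    sample = "Case filed 2008-10-30. Defendant SSN 123-45-6789."
    print(find_ssn_like(sample))
```

Real OCR output is messier than this sample (misread digits, stray spaces, missing hyphens), so a check like this would only be a first pass before a human review.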
Unless Google wants more lawsuits on its hands, I imagine issues such as this will be rectified rather quickly.
This process of turning an image back into readable text will likely have other uses, such as reading text stored within images on a website. This could open the door to using image-based text more freely in one’s site design. While this use is not in place at the moment, it seems like a natural step forward.