Have you ever tried to look under the hood and interact with a PDF programmatically? I assure you it only gets worse.
A while ago I tried to write a small script to scrape data out of some account statements that my idiot bank only made available in PDF format. As far as I could tell, the file was just a list of tiny chunks of text along with sets of x/y coordinates specifying where each one should be placed on the page. Answering seemingly simple questions like “are these two words on the same line?” involved comparing raw y-coordinates because the file had no concept of a “line of text”, and even spaces between words were often simulated by bumping the x-coordinate over by a few pixels instead of using an actual space character.
I suspect those files were generated by a particularly bad piece of software, and a more competent one could probably do much better, but knowing that it's even possible to create a file that cursed is still infuriating to me.
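
To make the "same line?" problem concrete, here is a rough sketch of the kind of reconstruction it forces on you. It is not the script I wrote and it does not use any real PDF library: the `Chunk` type, the `group_into_lines` function, and both tolerance values are made up for illustration. It assumes some extractor has already handed you each text fragment with its left edge, baseline, and width, and that y grows upward as in PDF's default coordinate system.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical text fragment as an extractor might report it."""
    text: str
    x: float      # left edge of the fragment
    y: float      # baseline position (assumed to grow upward)
    width: float  # rendered width of the fragment


def group_into_lines(chunks, y_tolerance=2.0, gap_threshold=1.5):
    """Rebuild lines of text from positioned fragments.

    y_tolerance and gap_threshold are guesses, not values from any spec;
    real files would need them tuned to the font size in use.
    """
    # Walk fragments from the top of the page down and bucket together
    # those whose baselines are within y_tolerance of the bucket's first one.
    lines = []  # each entry is [reference_y, [chunks on that line]]
    for chunk in sorted(chunks, key=lambda c: -c.y):
        if lines and abs(lines[-1][0] - chunk.y) <= y_tolerance:
            lines[-1][1].append(chunk)
        else:
            lines.append([chunk.y, [chunk]])

    result = []
    for _, members in lines:
        # Within a line, read left to right.
        members.sort(key=lambda c: c.x)
        parts = [members[0].text]
        for prev, cur in zip(members, members[1:]):
            # A horizontal gap wider than the threshold is treated as a space
            # that the file never stored as a character.
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_threshold:
                parts.append(" ")
            parts.append(cur.text)
        result.append("".join(parts))
    return result


chunks = [
    Chunk("Bal", 10, 700, 15),
    Chunk("ance", 25, 700.4, 20),   # same word, slightly different baseline
    Chunk("42.00", 60, 700, 25),    # separated only by an x offset
    Chunk("Date", 10, 715, 20),     # the line above
]
lines = group_into_lines(chunks)
```

With those made-up fragments, `lines` comes out as `["Date", "Balance 42.00"]`. Note that the whole thing hinges on two magic numbers, which is exactly the fragile guesswork that a real "line of text" concept in the format would make unnecessary.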








