
OCR of FSI Texts

Printed From: FSI Language Courses
Category: Language Courses
Forum Name: Member Contributions
Forum Description: If you have course materials and are planning to contribute them to the website, this is the place to let everyone know.
URL: http://fsi-language-courses.com/forum/forum_posts.asp?TID=731
Printed Date: 16 January 2009 at 2:10am


Topic: OCR of FSI Texts
Posted By: DemiPuppet
Subject: OCR of FSI Texts
Date Posted: 31 December 2008 at 10:01pm
For what it's worth, I've run the Adobe Acrobat OCR tool against some of the PDF files on the site (I'm still working on others).  It's far from perfect, but the results might be useful for someone who wants to cut and paste text from the PDF files and is willing to make some hand edits. It might also be useful for anyone who wants to create HTML versions. Thanks to all who submitted the original PDF files.

Note that the files are larger than the originals since I've also added the book covers in most cases. I also made sure that Acrobat was set up to retain the original file quality unchanged, which may have added to the size increase.

The Finnish and French covers are quite nice looking.

Finnish (The workbook was already OCR'd)
http://www.sendspace.com/file/075xlv

French
http://www.sendspace.com/file/7713sb

Greek
http://www.sendspace.com/file/e7i30i

Hungarian
http://www.sendspace.com/file/xazv36



Replies:
Posted By: VagabondPilgrim
Date Posted: 31 December 2008 at 11:44pm
I've gone ahead and posted these to the alternate site: http://www.fsi-language-courses.org


Posted By: flutable
Date Posted: 02 January 2009 at 4:58am
@DemiPuppet, thanks for this. I thought I'd OCR'd the text already. It's nice how the OCR process also straightens the pages!


Posted By: onebir
Date Posted: 02 January 2009 at 6:41am
Great work on the alternate site!

Perhaps it could also link to the material that's been uploaded to ERIC recently, e.g. the FSI readers (FSI Finnish, Hungarian, Turkish & Indonesian) and the large amounts of DLI/Spoken Language material that others have noticed being uploaded?

Copyright might be an issue for some of these texts, but linking to a govt-affiliated site that's hosting them can hardly be a copyright violation...

(Note that the links in the ERIC search results don't point directly to the PDFs. But if you download them with FlashGet, the servlet that gets downloaded does provide a direct link.)


Posted By: DemiPuppet
Date Posted: 02 January 2009 at 9:32pm
Here are a few more OCR texts:

Le Monde Francophone
http://www.sendspace.com/file/d59jlk

German Basic
http://www.sendspace.com/file/15wf5a

Spanish Basic vols 1 and 2; several Spanish Programmatic files
http://www.sendspace.com/file/aza47g

In Spanish Basic Vol 2, the missing page 28.36 has been corrected.


Posted By: VagabondPilgrim
Date Posted: 14 January 2009 at 8:20pm
Just a note to say that these are now on the alternate site as well. Actually, they've been there since shortly after the last post. I just haven't gotten around to mentioning it. (embarrassed)


