Topic: Need Help - Cantonese project
mandel
Newbie
Joined: 22 December 2007
Posts: 3
Posted: 22 December 2007 at 2:19am
Hi,

I'm a new forum member thinking of making a new, word-processed PDF version of the FSI Cantonese course materials.  The FSI courses are helpful and deserve better treatment.  However, the whole process is very labor-intensive: the present (typewritten) PDF textbook has to be converted into image files (preferably PNG).  After that I will OCR the images and proofread the book to make sure its contents are all right, before releasing a word document for all to proofread.  Barring any mistakes, the end product will be released as a PDF file over at this site.  The project will use free fonts and will be non-commercial, so I don't think there would be any problem.  Possibly we could get someone to clean up the audio digitally too.
 
Obviously this will take lots of time and effort, and I need help.  Anyone interested in this project can email me at mandel1luke@yahoo.com.  Perhaps three or four of us can come up with an uncluttered, word-processed PDF textbook which will make learning Cantonese easier.
 
Thanks.

DemiPuppet
Administrator
Joined: 27 May 2006
Location: United States
Posts: 163
Posted: 22 December 2007 at 8:19am
PDF files are already available on this site.  All modern OCR programs take PDF files as well as image files as input, so there is no need to convert to PNG. I just verified this using ABBYY FineReader 7.0. It did a fairly decent job of converting the images into text (except for the marks over the characters).

mandel
Newbie
Joined: 22 December 2007
Posts: 3
Posted: 22 December 2007 at 7:44pm

It depends on which OCR program you use.  ABBYY FineReader is paid software that I unfortunately cannot afford, whereas the OCR I used is freeware (FreeOCR), which does an excellent job converting image files to text but cannot convert PDF to text.  If you could join this project and help do the OCR (hopefully not too hectic a job), then I will do the diacritics and the proofreading itself, and we can give learners a more reader-friendly and uncluttered PDF booklet.  If this project is successful, we can extend it to other languages.

All non-commercial and free, of course.

Your input is very much appreciated.

P.S. I've checked out ABBYY FineReader.  It is indeed an excellent piece of software, which does exactly what DemiPuppet says it does.  However, as I was testing a trial version, it did not allow me to save the OCR'd document as a file.

I humbly urge anyone with ABBYY FineReader, or similar OCR software which can read PDF, to email me the OCR'd doc file.  Never mind the mistakes; I will proofread it and turn it into a word-processed PDF textbook, to be hosted here.
 
Thanks to all.


Edited by mandel - 22 December 2007 at 9:16pm

DemiPuppet
Administrator
Joined: 27 May 2006
Location: United States
Posts: 163
Posted: 24 December 2007 at 9:12am
Check your Yahoo account for email.

OCR is the easy part. I know from experience that there is still a tremendous amount of work.  I'm still editing/proofreading the FSI Hindi Basic course I OCR'd a couple of months ago.

BTW, the free graphic editing program GIMP can convert PDF document pages into PNG (or any other graphic format).  You also need to have Ghostscript installed.
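
For anyone who prefers to skip GIMP and call Ghostscript directly, the page-to-PNG step could be scripted roughly as below. This is only a sketch under some assumptions: the `gs` executable is on the PATH (on Windows it is usually named `gswin64c` or `gswin32c`), the input file name is hypothetical, and 300 dpi greyscale is just a guess at a reasonable setting for OCR.

```python
import subprocess


def ghostscript_png_command(pdf_path, out_pattern="page-%03d.png", dpi=300, gs="gs"):
    """Build the Ghostscript command that renders every PDF page to a PNG.

    out_pattern uses a printf-style counter, so pages come out as
    page-001.png, page-002.png, and so on.
    """
    return [
        gs,
        "-dNOPAUSE",           # do not pause between pages
        "-dBATCH",             # exit when the last page is done
        "-dSAFER",             # restrict file access while rendering
        "-sDEVICE=pnggray",    # 8-bit greyscale PNG output
        f"-r{dpi}",            # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",
        pdf_path,
    ]


def pdf_to_png(pdf_path, out_pattern="page-%03d.png", dpi=300, gs="gs"):
    """Run Ghostscript; raises CalledProcessError if the conversion fails."""
    subprocess.run(ghostscript_png_command(pdf_path, out_pattern, dpi, gs), check=True)


if __name__ == "__main__":
    # "FSI-Cantonese-Volume1.pdf" is a hypothetical file name.
    print(" ".join(ghostscript_png_command("FSI-Cantonese-Volume1.pdf")))
```

Calling `pdf_to_png(...)` with the real file name would write one PNG per page into the current folder, ready to feed to an image-only OCR program.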


Edited by DemiPuppet - 24 December 2007 at 9:15am

mandel
Newbie
Joined: 22 December 2007
Posts: 3
Posted: 24 December 2007 at 9:16pm
Thanks very much for your OCR, DemiPuppet.

You're right, there's still lots of work to be done.  Proofreading is difficult on my side, especially since this course uses the Yale system, with lots of diacritic marks which must be edited manually.
 
BTW, does anyone know how to input the character M with a grave accent (`) above it in Unicode?  It's required in the Yale transcripts, but somehow I can't find it.  Many thanks in advance.
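
On the M-with-grave question: Unicode has no single precomposed code point for this letter, which is probably why it cannot be found in a character map. The usual approach is to type the base letter followed by U+0300 COMBINING GRAVE ACCENT. A minimal Python sketch of the idea (the variable names are only illustrative):

```python
import unicodedata

COMBINING_GRAVE = "\u0300"  # U+0300 COMBINING GRAVE ACCENT

# Base letter first, combining mark second.
m_grave_lower = "m" + COMBINING_GRAVE
m_grave_upper = "M" + COMBINING_GRAVE

# NFC normalization composes sequences that have a precomposed form.
# There is none for m + grave, so the result stays as two code points.
nfc = unicodedata.normalize("NFC", m_grave_lower)

if __name__ == "__main__":
    print(m_grave_lower, m_grave_upper)
    print([f"U+{ord(ch):04X}" for ch in nfc])
```

How well the accent sits over the letter depends on the font, so it may be worth trying a few of the free fonts before settling on one for the textbook.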

unzum
Newbie
Joined: 25 April 2007
Location: United Kingdom
Posts: 10
Posted: 12 January 2008 at 1:17pm
I know what you mean, mandel. I wanted to write some flashcards for Cantonese and must have spent about an hour looking for a way to write in Yale.

The closest I could find was http://toshuo.com/cantonese-tone-tool/

Hope that helps.

sceva
Newbie
Joined: 29 February 2008
Posts: 1
Posted: 29 February 2008 at 10:31pm
If you want pictures to use with your flashcards, check out http://www.foreignlanguageflashcards.com.  They have some blank files that will let you type in the language you are learning.  It is really easy, and the pictures make it more fun to study.

pudding
Newbie
Joined: 15 June 2008
Location: New Zealand
Posts: 5
Posted: 15 June 2008 at 5:45am
(My first post)

I am currently transcribing the Volume I text into annotated HTML (HTML with comment tags, noting where I have changed the text and where page-breaks were in the original text for proofing).
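
To give an idea of what such annotations might look like, here is a small sketch that generates them. The comment wording (`pagebreak`, `changed from`) and the helper names are made up for illustration and are not necessarily the conventions used in the actual transcription.

```python
from html import escape


def page_break(page_number):
    """HTML comment marking where a page of the original PDF ended."""
    return f"<!-- pagebreak: original page {page_number} -->"


def changed(original, corrected):
    """Corrected text followed by a comment recording the original wording.

    HTML comments must not contain "--", so any double hyphen in the
    original wording is broken up before it goes into the note.
    """
    note = original.replace("--", "- -")
    return f"{escape(corrected)}<!-- changed from: {note} -->"


if __name__ == "__main__":
    # "teh" / "the" is a made-up correction, not taken from the course text.
    print("<p>" + changed("teh", "the") + " lesson</p>")
    print(page_break(12))
```

Because the notes are ordinary HTML comments, they stay invisible in a browser but are easy to search for while proofing against the original pages.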

I've nearly finished typing out the coliform(sic?), and am considering giving OCR a go; however, I have reservations about transcribing it verbatim, and am considering rewriting it using LSHK Jyutping, which seems to be the most widely used romanization on the internet, or at least in the resources I have available to me.

I find PDFs difficult to work with, as it is difficult to annotate them without Acrobat, and I like the flexibility of HTML.

I will be posting a link to what I've done when I finish the introduction, though I'm still a little hazy about what to do with regard to copyright. I don't really care if anyone copies my work or commercially reproduces what I've done, but I would appreciate credit for the transcription.

I'm also thinking of other ways I can enhance the text while still keeping it compatible with the existing audio recordings; hyperlinking and perhaps Chinese characters immediately spring to mind.

pudding
Newbie
Joined: 15 June 2008
Location: New Zealand
Posts: 5
Posted: 15 June 2008 at 5:49am
Just another note: if anyone else has a portion or the entirety of the text OCR'd or retyped, I would really appreciate it if you could let me know or share. I'd hate to think I'm going to all this effort to retype something someone else already has.