Wednesday 25 December 2013

Langana - Turkish Language Parser - First Output

LANGANA - A COMPUTER PROGRAM THAT UNDERSTANDS A GIVEN TEXT AND ANSWERS QUESTIONS ABOUT IT

This work is licensed under:
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)


I have been working on LANGANA for the last 5 months.  LANGANA is a computer program
that reads and understands a given text or book and then answers questions related to it.

I have finished the LANGANA parser program, which parses Turkish-language texts, converts
them into a pseudo-language, and writes the parsed text both to a file
and to a MySQL database.

Currently, I am designing queries on the database tables that hold this parsed pseudo-language.
The parsed table data lets me write complex queries that extract sentences and
reach the data needed to answer questions such as "Ali nereye gitti?" ("Where did Ali go?").
The query has to check for
PrivateName('Ali') + nounExt('e OR a OR ye OR ya') + Verb('git' + verbExt('ti'))
within a sentence. The program will be able to switch to deeper understanding modes by
checking the context across more than one sentence, for example from 2 sentences before
to 3 sentences after the sentence in which it finds 'Ali'.
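
To make the pattern check and the context window concrete, here is a minimal Python sketch. It does not reproduce LANGANA's actual tables or queries: the Token layout, the function names, and the toy sentences are all assumptions made for illustration, and an in-memory list stands in for the MySQL tables.

```python
from dataclasses import dataclass

# Hypothetical record layout: one row per parsed word.
# The real LANGANA tables may be organized differently.
@dataclass
class Token:
    sentence: int      # index of the sentence the word belongs to
    kind: str          # e.g. "PrivateName", "Noun", "Verb"
    root: str          # root found by the parser
    exts: tuple = ()   # extensions (suffixes) attached to the root

# Noun extensions named in the query pattern above.
DATIVE_EXTS = {"e", "a", "ye", "ya"}

def group_by_sentence(tokens):
    """Collect tokens into a dict keyed by sentence index."""
    sentences = {}
    for tok in tokens:
        sentences.setdefault(tok.sentence, []).append(tok)
    return sentences

def matches_pattern(sentence_tokens, name, verb_root, verb_ext):
    """True if one sentence holds the name, a dative noun, and the verb form."""
    has_name = any(t.kind == "PrivateName" and t.root == name
                   for t in sentence_tokens)
    has_dative = any(t.kind == "Noun" and DATIVE_EXTS & set(t.exts)
                     for t in sentence_tokens)
    has_verb = any(t.kind == "Verb" and t.root == verb_root and verb_ext in t.exts
                   for t in sentence_tokens)
    return has_name and has_dative and has_verb

def find_matches(tokens, name, verb_root, verb_ext):
    """Return the indices of all sentences that satisfy the pattern."""
    sentences = group_by_sentence(tokens)
    return [i for i, toks in sorted(sentences.items())
            if matches_pattern(toks, name, verb_root, verb_ext)]

def context_window(tokens, hit, before=2, after=3):
    """Return existing sentence indices from `before` before to `after` after a hit."""
    sentences = group_by_sentence(tokens)
    return [i for i in range(hit - before, hit + after + 1) if i in sentences]

# Toy data (invented): "Hava güzeldi. Ali okula gitti. Ayşe evde kaldı."
tokens = [
    Token(0, "Noun", "hava"),
    Token(0, "Adjective", "güzel", ("di",)),
    Token(1, "PrivateName", "Ali"),
    Token(1, "Noun", "okul", ("a",)),
    Token(1, "Verb", "git", ("ti",)),
    Token(2, "PrivateName", "Ayşe"),
    Token(2, "Noun", "ev", ("de",)),
    Token(2, "Verb", "kal", ("dı",)),
]

hits = find_matches(tokens, "Ali", "git", "ti")
print(hits)
print([context_window(tokens, h) for h in hits])
```

In a database-backed version the same three conditions would presumably become joins or subqueries restricted to one sentence id, with the window expressed as a range on that id.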

Next, I will work on a user interface that takes questions and parses and converts them.
I will need to find an algorithm that converts question forms such as what, which, and
where into search algorithms.
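
One possible starting point for that conversion is a lookup table from question words to the noun extensions an answer is expected to carry. The Python sketch below is only an assumption about how this could look, not the planned LANGANA algorithm: the table name, the function name, and the choice of the three "where" forms (nereye, nerede, nereden) are illustrative.

```python
# Hypothetical mapping from Turkish question words to the case of the
# expected answer and the noun extensions that mark that case.
QUESTION_FORMS = {
    "nereye":  {"case": "dative",   "noun_exts": ("e", "a", "ye", "ya")},
    "nerede":  {"case": "locative", "noun_exts": ("de", "da", "te", "ta")},
    "nereden": {"case": "ablative", "noun_exts": ("den", "dan", "ten", "tan")},
}

def question_to_search(words):
    """Turn a tokenized question into a simple search specification.

    Returns None when no known question word is present.
    """
    for position, word in enumerate(words):
        spec = QUESTION_FORMS.get(word.lower())
        if spec is not None:
            # Everything except the question word is kept as search terms.
            others = [w for i, w in enumerate(words) if i != position]
            return {
                "question_word": word.lower(),
                "case": spec["case"],
                "noun_exts": spec["noun_exts"],
                "other_words": others,
            }
    return None

print(question_to_search(["Ali", "nereye", "gitti"]))
```

For "Ali nereye gitti?" this yields the dative extensions as the noun condition, leaving 'Ali' and 'gitti' to be parsed into the PrivateName and Verb conditions.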

Please find attached the output of LANGANA's parse of Steinbeck's book 'Of Mice and Men'
in Turkish.

OF MICE AND MEN statistical data:
---------------------------------
total #of words: 27740
total #of roots: 29577
total #of exts:  29366
nonprocessed  :      3
% ambiguity :  (29577 - 27740) / 29577 = 1837 / 29577 = 0.062 = 6.2%
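
As a check on the arithmetic, the ratio can be recomputed from the word and root counts listed above. The following Python sketch uses illustrative variable names. It also prints the same difference relative to the word count, since the two denominators give slightly different percentages (about 6.2% and 6.6%).

```python
# Counts taken from the statistics above.
words = 27740
roots = 29577

# Roots produced beyond one per word, i.e. the extra readings.
extra_roots = roots - words

ratio_over_roots = extra_roots / roots   # denominator used in the formula above
ratio_over_words = extra_roots / words   # alternative: extra roots per word

print(f"extra roots: {extra_roots}")
print(f"relative to roots: {ratio_over_roots:.3f}")
print(f"relative to words: {ratio_over_words:.3f}")
```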

Ambiguity may be reduced by program retouches, procedural changes, and manual changes,
to produce a reference book for further parses of other Turkish books. But reducing
ambiguity may in some cases narrow the possibilities to search when answering
questions, for example with the adjFromVerb and nounFromVerb ambiguity.

I hope my work serves my colleagues who also take on the challenges of the Turkish language.
Last but not least, my work is available only for academic and non-commercial endeavours,
unless you get written consent from me.

Ali Riza SARAL

Copyright condition:
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

You are free to:
Share — copy and redistribute the material in any medium or format

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

NonCommercial — You may not use the material for commercial purposes.
NoDerivatives — If you remix, transform, or build upon the material, you may not distribute the modified material.


http://master.dl.sourceforge.net/project/turkishlanguageparser/LANGANA%20report3.txt