ParaMor: Finding Paradigms Across Morphology. Christian Monson, Carnegie Mellon.


1 ParaMor: Finding Paradigms Across Morphology. Christian Monson, Carnegie Mellon

2 Turkish Morphology – Beads on a String: take + passive + negative + present progressive + 2nd person singular = "You are not being taken"

3 Turkish Morphology – Beads on a String: götür (take) + ül (passive) + m (negative) + üyor (present progressive) + sun (2nd person singular) = "You are not being taken"

4 Applications of Computational Morphology. Machine Translation: Turkish-English (Oflazer, 2007), Czech-English (Goldwater and McClosky, 2005). Speech Recognition: Finnish (Creutz, 2006). Information Retrieval.

5 Challenges of Computational Morphology. Time consuming for a new language: Kemal Oflazer estimates 3-4 months to build a basic Turkish analyzer, plus lexicon development and maintenance. Expertise needed: Greenlandic (official language of Greenland; agglutinative Inuit language; 50,000 speakers; Per Langaard).

6 The Solution: Unsupervised Morphology Induction from Raw Text

7 ParaMor – Paradigm Morphology. Outline: ParaMor; Identify; Search; Cluster; Filter; Segment; Evaluation; Results. ParaMor: unsupervised morphology induction system. Paradigm: the natural structure of morphology.

8 Paradigms – The Structure of Morphology: götür (Stem: take) + ül (Voice: passive) + m (Polarity: negative) + üyor (Tense & Mood: present progressive) + sun (Person & Number: 2nd person singular)

9 Paradigms – The Structure of Morphology: götür (Stem: take) + ül (Voice: passive) + m (Polarity: negative) + üyor (Tense & Mood: present progressive) + um (Person & Number: 1st person singular)

10 Paradigms – The Structure of Morphology: götür (Stem: take) + ül (Voice: passive) + m (Polarity: negative) + üyor (Tense & Mood: present progressive) + Ø (Person & Number: 3rd person singular). Person & Number suffixes so far: um, Ø

11 Paradigms – The Structure of Morphology: götür (Stem: take) + ül (Voice: passive) + m (Polarity: negative) + üyor (Tense & Mood: present progressive) + uz (Person & Number: 1st person plural). Person & Number suffixes so far: um, Ø, uz

12 Paradigms – The Structure of Morphology: götür (Stem: take) + ül (Voice: passive) + m (Polarity: negative) + üyor (Tense & Mood: present progressive). Person & Number: um, Ø, uz

13 Paradigms – The Structure of Morphology: götür (Stem: take) + ül (Voice: passive) + m (Polarity: negative). Tense & Mood: üyor, yecek (future). Person & Number: um, Ø, uz

14 Paradigms – The Structure of Morphology: götür (Stem: take) + ül (Voice: passive) + m (Polarity: negative). Tense & Mood: üyor, yecek. Person & Number: um, Ø, uz

15 Paradigms – The Structure of Morphology. Voice: ül. Polarity: m. Tense & Mood: üyor, yecek. Person & Number: um, Ø, uz

16 Paradigms – The Structure of Morphology. Paradigms: {ül}; {m}; {üyor, yecek}; {um, Ø, uz}

17 Paradigms – The Structure of Morphology. Paradigms: {ül}; {m}; {üyor, yecek}; {um, Ø, uz}. Paradigm: set of mutually replaceable strings

18 Paradigms – The Structure of Morphology. {ül}; {m}; {üyor, yecek}; {um, Ø, uz}. Paradigm: set of mutually replaceable strings

19 The ParaMor Algorithm: Identify suffix paradigms in 3 steps

20 The ParaMor Algorithm: Identify suffix paradigms in 3 steps: 1. Search for candidate paradigms

21 The ParaMor Algorithm: Identify suffix paradigms in 3 steps: 1. Search for candidate paradigms; 2. Cluster candidates modeling the same paradigm

22 The ParaMor Algorithm: Identify suffix paradigms in 3 steps: 1. Search for candidate paradigms; 2. Cluster candidates modeling the same paradigm; 3. Filter

23 The ParaMor Algorithm: Identify suffix paradigms in 3 steps: 1. Search for candidate paradigms; 2. Cluster candidates modeling the same paradigm; 3. Filter. Segment words using the discovered paradigms.

24 Search for Candidate Paradigms: All character boundaries are candidate morpheme boundaries

25 Search for Candidate Paradigms (Spanish): Begin search with the most frequent word-final string. s (10662 stems; words such as autorizaciones, buscabamos, costas, importadoras, vallas, …)

26 Search for Candidate Paradigms (Spanish): Identify the most frequent mutually replaceable string. Stems that occur with one suffix in a paradigm will likely occur with other suffixes in that paradigm. s (10662) → Ø s (5501)

27 Search for Candidate Paradigms: Stop adding suffixes when the most frequent mutually replaceable string severely decreases the stem count. s (10662) → Ø s (5501) → Ø r s (287)

28 Search for Candidate Paradigms: Move on to the next most frequent word-final string: a (8981)

29 Search for Candidate Paradigms. Path from a: a (8981) → a o (2304) → a o os (1410) → a as o os (892)

30 Search for Candidate Paradigms. Path from n: n (6051) → Ø n (1874) → Ø n r (509) → Ø do n r (354) → Ø da das do dos n ndo r ron (118)

31 Search for Candidate Paradigms. Path from es: es (2751) → Ø es (874)

32 Search for Candidate Paradigms. Path from an: an (1786) → a an (1049) → a an ar (413) → a an ar ó (353) → a ada adas ado ados an ar aron ó (149)

33 Search for Candidate Paradigms. Path from rado: rado (167) → rada rado (89) → rada rado rados (67) → rada radas rado rados (53) → ra rada radas rado rados ran rar raron ró (23). Path from strado: strado (15) → strada strado (12) → strada strado stró (9) → strada strado strar stró (8) → strada stradas strado strar stró (7). ...
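
The search walked through on these slides can be sketched roughly as follows. This is a toy illustration and not ParaMor's actual implementation: the function names, the tiny word list, and the `min_ratio` stopping threshold are all assumptions made for the example, since the slides only say to stop when the stem count "severely decreases".

```python
from collections import defaultdict

def stems_by_suffix(words):
    """Map every word-final string (including the empty string, Ø) to its stems.

    Every character boundary is treated as a candidate morpheme boundary.
    """
    table = defaultdict(set)
    for w in words:
        for i in range(1, len(w) + 1):  # keep the stem non-empty
            table[w[i:]].add(w[:i])
    return table

def grow_candidate(table, seed, min_ratio=0.25):
    """Greedily add the suffix that keeps the most stems.

    min_ratio is a hypothetical threshold: stop when the best extension
    would keep fewer than min_ratio of the current stems.
    """
    suffixes = {seed}
    stems = set(table[seed])
    while True:
        best, best_stems = None, set()
        for suf, suf_stems in table.items():
            if suf in suffixes:
                continue
            shared = stems & suf_stems  # stems taking all current suffixes and suf
            if len(shared) > len(best_stems):
                best, best_stems = suf, shared
        if best is None or len(best_stems) < min_ratio * len(stems):
            return suffixes, stems
        suffixes.add(best)
        stems = best_stems

# Toy vocabulary (illustrative only).
words = ["costa", "costas", "valla", "vallas", "casa", "casas",
         "pera", "peras", "mesa", "mesas", "costar"]
suffixes, stems = grow_candidate(stems_by_suffix(words), "s")
print(sorted(suffixes), len(stems))  # ['', 's'] 5
```

On this toy list the search starts at s, adds Ø because all five stems also occur bare, and then rejects r because only one stem (costa) takes it, mirroring the s → Ø s → Ø r s path on slides 25 to 27.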

34 Cluster Candidates per Paradigm. Candidate with 15 suffixes: a aba ada adas ado ados an ando ar aron arse ará arán aría ó; 22 stems: anunci, aplic, apoy, celebr, concentr, …; 330 covered types. Candidate with 15 suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán ó; 23 stems: anunci, apoy, confirm, consider, declar, …; 345 covered types.

35 Cluster Candidates per Paradigm. The two candidates from slide 34 merge into a cluster with 16 suffixes: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó; cosine similarity: 0.664; 451 covered types.

36 Cluster Candidates per Paradigm. A third candidate with 15 suffixes: a aba aban ada adas ado ados an ando ar aron arse ará arán ó; 25 stems: anunci, aplic, apoy, celebr, consider, …; 375 covered types.

37 Cluster Candidates per Paradigm. The third candidate merges with the 16-suffix cluster into a cluster with 17 suffixes: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó; cosine similarity: 0.715; 532 covered types.
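
The slides do not say which vectors the cosine similarity is computed over. One reading, offered here as an assumption, is that each candidate is a binary vector over the word types it covers (every stem + suffix combination). Under that assumption the cosine reduces to |A ∩ B| / sqrt(|A| · |B|), and the counts on the slides reproduce the reported similarities. The function names below are hypothetical.

```python
import math

def covered_types(stems, suffixes):
    """All stem + suffix strings a candidate paradigm covers."""
    return {stem + suffix for stem in stems for suffix in suffixes}

def cosine(a, b):
    """Cosine similarity of two sets viewed as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def cosine_from_counts(size_a, size_b, size_union):
    """Same measure from set sizes: |A ∩ B| = |A| + |B| - |A ∪ B|."""
    overlap = size_a + size_b - size_union
    return overlap / math.sqrt(size_a * size_b)

# Counts taken from slides 34 to 37.
print(round(cosine_from_counts(330, 345, 451), 3))  # 0.664
print(round(cosine_from_counts(451, 375, 532), 3))  # 0.715
```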

38 Filter Candidate Paradigms. 2 types of filtering: 1. Remove small unclustered candidate paradigms; 2. Remove candidates modeling unlikely morpheme boundaries (Harris, 1955)
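
Harris (1955) is usually associated with letter successor variety: counting how many distinct letters can follow a given word-initial string, where a jump in that count hints at a morpheme boundary. The slide does not say how ParaMor applies the idea, so the following is only a sketch of the measure itself, with a hypothetical function name and a toy word list.

```python
def successor_variety(prefix, words):
    """Number of distinct letters that follow `prefix` across the word list."""
    return len({w[len(prefix)] for w in words
                if w.startswith(prefix) and len(w) > len(prefix)})

# Toy vocabulary (illustrative only).
words = ["administrada", "administradas", "administrado",
         "administrar", "administran", "administro"]
for prefix in ("adminis", "administr", "administra"):
    print(prefix, successor_variety(prefix, words))
```

A boundary proposed where the variety stays at 1 (as after "adminis" in this toy list) would be a candidate for filtering under such a measure.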

39 Segment Words Using Paradigms: administradas

40 Segment Words Using Paradigms: administradas. Paradigm: a ada adas ado ados an ar aron ó ...

41 Segment Words Using Paradigms: administradas, administrada. Paradigm: a ada adas ado ados an ar aron ó ...

42 Segment Words Using Paradigms: administradas → administr +adas. Paradigm: a ada adas ado ados an ar aron ó ... (administrada)

43 Segment Words Using Paradigms: administradas → administr +adas. Next paradigm: a as o os (administrada)

44 Segment Words Using Paradigms: administradas. Paradigm: a as o os (administrada). Old way, separate alternative analyses: administr +adas, administrad +as

45 Segment Words Using Paradigms: administradas. Paradigm: a as o os (administrada). Old way: administr +adas, administrad +as. New way, augment the current segmentation: administr +ad +as

46 Segment Words Using Paradigms: administradas. Paradigm: Ø s (administrada). Old way: administr +adas, administrad +as, administrada +s. New way: administr +ad +a +s
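
The "new way" on these slides, where each paradigm contributes a boundary to a single combined segmentation, might be sketched like this. It is an assumption-laden illustration rather than ParaMor's code: the function names are hypothetical, and the rule used to accept a boundary (some other suffix from the same paradigm yields an attested word on the same stem) is inferred from the administradas / administrada example.

```python
def boundaries(word, vocab, paradigms):
    """Collect cut positions licensed by any paradigm."""
    cuts = set()
    for suffixes in paradigms:
        for suf in suffixes:
            # Skip Ø as the matched suffix and never leave an empty stem.
            if suf and word.endswith(suf) and len(suf) < len(word):
                stem = word[:-len(suf)]
                # Accept the cut if another suffix of this paradigm
                # forms an attested word with the same stem.
                if any(stem + other in vocab
                       for other in suffixes if other != suf):
                    cuts.add(len(stem))
    return cuts

def segment(word, vocab, paradigms):
    """Augment one segmentation with every licensed boundary."""
    pieces, start = [], 0
    for cut in sorted(boundaries(word, vocab, paradigms)):
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return " +".join(pieces)

vocab = {"administradas", "administrada"}
paradigms = [
    {"a", "ada", "adas", "ado", "ados", "an", "ar", "aron", "ó"},
    {"a", "as", "o", "os"},
    {"", "s"},  # "" stands for Ø
]
print(segment("administradas", vocab, paradigms))  # administr +ad +a +s
print(segment("administrada", vocab, paradigms))   # administr +ad +a
```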

47 Morpho Challenge 2007: Peer operated competition for unsupervised morphology induction algorithms. 4 languages: English, German, Finnish, Turkish

48 ParaMor in Morpho Challenge 2007: Developed on Spanish; ParaMor's free parameters were frozen

49–50 2 Methods of Evaluation: 1. Linguistic: segmentations compared to a morphologically analyzed lexicon

Word           Analysis              Answer
administradas  administr +ad +a +s   administrar +Adj +Fem +Pl
administrada   administr +ad +a      administrar +Adj +Fem

51 2 Methods of Evaluation: 2. Task based: information retrieval. Short two-sentence queries about international news topics; binary relevance assessments; about 50 queries and 20K relevance judgements for each language.

52–60 Linguistic Evaluation [bar charts of F1, built up over nine slides]. Systems compared: Bernhard 2, Morfessor, ParaMor, ParaMor & Morfessor. Values readable from the first chart: Morfessor 47.2, ParaMor 50.6, ParaMor & Morfessor 50.7. Further values shown, in extraction order with bar labels lost: 60.8, 56.3, 52.9, 53.4, 48.2, 48.5, 24.7, 52.0.
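
The linguistic evaluation reports F1, the harmonic mean of precision and recall. A minimal helper is shown below; the precision and recall inputs are made-up values for illustration and are not results from the talk.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical inputs, not figures from the slides.
print(round(f1(0.6, 0.4), 2))  # 0.48
```

Because the harmonic mean is pulled toward the smaller of the two numbers, a system cannot reach a high F1 by over-segmenting (high recall, low precision) or under-segmenting (the reverse).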

61–65 IR Evaluation (TF/IDF) [bar charts of average precision, built up over five slides]. Systems compared: Morfessor, Morfessor Baseline, ParaMor, ParaMor & Morfessor, McNamee, against a "No Morphological Analysis" baseline. Baseline values shown: 27.0, 30.7, 32.0. Other values shown, in extraction order with bar labels lost: 28.9, 26.4, 29.3, 38.3, 32.1, 38.2, 38.8, 41.2, 37.2.

66 ParaMor: State-of-the-Art Unsupervised Morphology Induction System. Combined system among the best in Morpho Challenge 2007; consistent across languages; better than no morphology on the task based (IR) measure.

67 Many Future Directions. Improve performance: F1 of 50-60% is state-of-the-art!; inflection classes; morphophonology. Beyond beads-on-a-string.

68 Thank You!

