1 Linguistic Research And The CLARIN Infrastructure Jan Odijk Digital Humanities Lecture, Utrecht 23 Oct 2012.

Presentation transcript:

1 Linguistic Research And The CLARIN Infrastructure Jan Odijk Digital Humanities Lecture, Utrecht 23 Oct 2012

2 Overview Introduction Basic Facts & Research Questions Do the Research –Consult Grammars –Select relevant data from multiple sources –Apply tools to enrich data –Analyze the data Conclusions

3 Introduction Suppose you’re a linguistic researcher in 1980 (no internet, no computers, …) –And suppose libraries did not exist… I am a linguistic researcher in 2012 –But no infrastructure for data and tools exists! –though there are many data and tools CLARIN’s main goal is to remedy this

4 Basic Facts Heel, erg, and zeer are synonyms (‘very’) Zeer, erg can modify verbs, adjectival predicates and prepositional predicates Heel can only modify adjectival predicates –A: Hij is daar zeer/erg/heel blij mee –P: Hij is daar zeer/erg/*heel mee in zijn nopjes –V: Dat verbaast ons zeer/erg/*heel.

5 Basic Facts English very is like heel in these respects: –P: *He is very in love –A: He is very amorous –V: It surprised us very *(much)

6 Basic Facts Difference: –not due to semantics –Purely syntactic –As far as we know: does not follow from a general rule –So it must be ‘learned’ by a child acquiring Dutch as first language

7 Research Question (1) How does a child acquiring Dutch as a first language get to ‘know’ that zeer and erg can modify verbs, prepositional and adjectival predicates?

8 Hypotheses (1) Hypothesis 1a –Once a word is encountered for the first time, a critical phase (‘training phase’) starts in which the word properties will be determined based on input; after this phase the word properties are fixed. –A sufficient number of actual examples occurring in this period sets the word properties (positive evidence)

9 Hypotheses (1) Hypothesis 2a –Once a word is encountered for the first time, its grammatical properties are initially set by Semantic Bootstrapping: D (semcat) -> syncat –A sufficient number of actual examples occurring in this period will add to the word properties (positive evidence) –Sufficient amount of input that is contradictory to the semantically bootstrapped properties overrules them

10 Research Question (2) How can a child acquiring Dutch as first language get to ‘know’ that heel cannot modify prepositional predicates and verbs? –Children are never taught that it is not possible; –They are also never or seldom corrected for language errors, and if they are, they seem to ignore it (Negative evidence plays no role)

11 Hypotheses (2) Hypothesis 1b –Absence of relevant constructions in the training phase of a word leads to absence of the property (indirect negative evidence) Hypothesis 2b –Absence of relevant constructions in the training phase of a word does not lead to absence of the property for semantically bootstrapped properties

12 Related Questions Do children ever make errors against this? Is a ‘training phase’ for word properties real? How ‘long’ is this training phase? What is a ‘sufficiently large’ number of actual examples? Does semantic bootstrapping play a role, and if so which one? Are these words acquired in different language acquisition stages?

13 Related Questions Can this be related to the different modification potential? Is there a relation with the fact that zeer appears to be rather formal, while heel and erg are not?

14 Related Questions adverb-adjective agreement (substandard): –heel/hele dikke boeken ‘very thick books’ –erg/erge dikke boeken –Zeer/*zere dikke boeken –Is this somehow related? What about other, closely related, words?

15 Consult Grammars Currently –Consult paper and electronic grammars: ANS and e-ANS, e.g. section e-ANS In the near future –Consult Taalportaal, with (I hope/expect): All examples formally marked as such All examples parsed/tagged, using ISOCAT DCs, and searchable Links to (possibly complex) queries to illustrate with real data from treebanks and other annotated data

16 Find Data Which data and tools (LRs) exist that might contribute to answering these questions? Currently: –you have to search for them in multiple places –Many relevant data are not publicly visible (you will encounter them by personal contacts only) –Or you have to create them yourself

17 Find Data There is no place/site where you can query: –Give me a list of all LRs for the Dutch language –What is the size of all Dutch text corpora (in #tokens) –Give me a list of all Dutch data that contain children 2-7 years old as speaker –Give me a list of all Dutch data containing any of the words heel, zeer, erg Not even in most individual data centres (TST-Centrale, ELRA, LDC, ...)

18 Find Data CLARIN –Provides a flexible framework incl. tools for making descriptions of LRs (‘metadata’): CMDI –Supports (assistance, execution, funding) the creation of metadata for LRs –Supports making these metadata (and the actual data) visible and accessible via CLARIN portals

19 Find Data CLARIN –Provides facilities for semantic interoperability: ISOCAT, Relation Registry (coming soon) –Provides browsing, searching and querying facilities for the metadata Initial prototype: Virtual Language Observatory –Will enable you to collect the data that are relevant to you in a virtual collection –This will save the researcher a lot of time –It will enlarge the empirical basis for the research

20 Closely Related Words Find words that are closely related –Adverbs that function as an intensifier (‘booster’) –Are (near-)synonymous, hyponyms, or co-hyponyms –Also (near-)antonyms are relevant In order to determine their properties and potential further generalizations

21 Closely Related Words Using e.g. –Synonym information in traditional dictionaries –Dutch EuroWordnet (currently via ELRA M0016) –Or Cornetto (via the Dutch HLT-Agency) Currently searchable only via –a plug-in in an old version (3.5) of Firefox, or –in programs via a Python module A CLARIN-NL project aims to improve this

22 Closely Related Words Found via synonym dictionaries: abnormaal afschuwelijk akelig bijster bijzonder bovenmatig buitengemeen buitensporig danig donders eminent enorm exceptioneel extra extraordinair extreem fabelachtig fenomenaal geweldig gigantisch intens kolossaal merkwaardig mirakels onbeschrijfelijk ongelofelijk ongehoord ongekend ongemeen onmenselijk onmetelijk ontzettend onwijs speciaal uitermate uiterst uitzonderlijk verdraaid verduiveld verrekte verschrikkelijk vet zeldzaam …..

23 Closely Related Words zeer:adverb:3 / heel:adverb:5 (from Cornetto) zeer:3/d_r , allemachtig:2/d_r-9922, beestachtig:2/d_r-23835, bijzonder:4/c_546765, bliksems:2/d_r-32612, bloedig:2/d_r-32881, bovenmate:1/d_r-36728, buitengewoon:2/d_r , buitenmate:1/d_r-39294, buitensporig:2/d_r , crimineel:4/d_a-53026, deerlijk:2/d_r-57321, deksels:2/d_r-57728, donders:2/d_r-62605, drommels:2/d_r-65820, eindeloos:3/c_546740, enorm:2/d_r-74285, erbarmelijk:2/d_r-74877, fantastisch:6/d_r-79264, formidabel:2/d_r-82704, geweldig:4/d_r-92392, goddeloos:2/d_r-94633, godsjammerlijk:2/d_r-94798, grenzeloos:2/d_r-96846, grotelijks:1/d_r-98244, heel:5/d_r , ijselijk:2/d_r , ijzig:4/c_546756, intens:2/d_r , krankzinnig:3/d_r , machtig:4/d_r , mirakels:1/d_r , monsterachtig:2/d_r , moorddadig:4/d_r , oneindig:2/d_r , onnoemelijk:2/d_r , ontiegelijk:2/d_r , ontstellend:2/d_r , ontzaglijk:2/d_r , ontzettend:3/d_r , onuitsprekelijk:2/d_r , onvoorstelbaar:2/d_r , onwezenlijk:2/d_r , onwijs:4/d_r , overweldigend:2/d_r , peilloos:2/d_r , reusachtig:3/d_r , reuze:2/d_r , schrikkelijk:2/d_r , sterk:7/d_r , uiterst:4/d_r , verdomd:2/d_r , verdraaid:4/c_546761, verduiveld:2/d_r , verduveld:2/d_r , verrekt:3/d_r , verrot:3/d_r , verschrikkelijk:3/d_r , vervloekt:2/d_r , vreselijk:5/d_r , waanzinnig:2/d_r , zeldzaam:2/d_r , zwaar:10/d_r

24 Basic Facts: Correct? Check the basic facts Check against occurrences in corpora –Problem: each of the 3 words is ambiguous! Erg (4x) = noun (de) ‘erg’; noun (het) ‘evil’; adj+adv ‘unpleasant’; adv ‘very’ Zeer (3x) = noun ‘pain’; adj ‘painful’; adv ‘very’ Heel (3x) = adj ‘whole’; verb form ‘heal’; adv ‘very’ –A PoS-tagged corpus will help somewhat But most corpora do not distinguish adj from adv by category! (searching for PoS bigrams will help slightly) –A fully-parsed corpus would be ideal
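
The PoS-bigram search mentioned on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag names (`adj`, `verb`, ...) and the helper `find_bigrams` are hypothetical and do not reflect the tag set of any particular corpus.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, PoS tag); the tag names are illustrative assumptions


def find_bigrams(tokens: List[Token], word: str, next_tag: str) -> List[Tuple[str, str]]:
    """Return (word, following word) pairs where `word` is directly followed
    by a token carrying the tag `next_tag`."""
    hits = []
    for (w1, _tag1), (w2, tag2) in zip(tokens, tokens[1:]):
        if w1.lower() == word and tag2 == next_tag:
            hits.append((w1, w2))
    return hits


# Hand-tagged toy sentence; heel is tagged adj here because, as the slide notes,
# most corpora do not distinguish adj from adv by category.
sentence = [
    ("Dat", "pron"), ("is", "verb"), ("een", "det"),
    ("heel", "adj"), ("verrassend", "adj"), ("resultaat", "noun"),
]

print(find_bigrams(sentence, "heel", "adj"))   # [('heel', 'verrassend')]
print(find_bigrams(sentence, "heel", "verb"))  # []
```

A bigram filter of this kind narrows the candidates but cannot tell the intensifier reading from the ‘whole’ reading, which is why the slide calls a fully parsed corpus the ideal resource.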

25 Basic Facts: Correct? –LASSY Small: 1M-word manually verified parsed corpus –Interface to LASSY Small Requires knowledge of XPATH/XQUERY –Very Simple Interface to LASSY Small: limited options but simple commands –Example-based interface GrETEL (CLARIN-Flanders): Greedy Extraction of Trees for Empirical Linguistics Generates XPATH/XQUERY expression on the basis of an example sentence plus markings of what is relevant in it
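
To give an impression of what an XPath-style treebank query involves, here is a minimal Python sketch over a hand-made toy tree. The element and attribute names (`node`, `rel`, `lemma`, `pt`) are modelled loosely on Alpino-style dependency XML and should be read as assumptions; real LASSY files and the actual LASSY interfaces may differ. The function `modifier_heads` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Toy parse of "... heel verschillend"; structure and attributes are assumptions.
TOY_TREE = """
<alpino_ds>
  <node cat="top" rel="top">
    <node cat="ap" rel="predc">
      <node rel="mod" lemma="heel" pt="bw" word="heel"/>
      <node rel="hd" lemma="verschillend" pt="ww" word="verschillend"/>
    </node>
  </node>
</alpino_ds>
""".strip()


def modifier_heads(root, lemma):
    """For every phrase that has `lemma` as a modifier (rel='mod'),
    return (lemma, pt) of the head(s) (rel='hd') of that phrase."""
    results = []
    for parent in root.iter("node"):
        # ElementTree supports a limited XPath subset, including chained predicates.
        mods = parent.findall("node[@rel='mod'][@lemma='%s']" % lemma)
        if not mods:
            continue
        for head in parent.findall("node[@rel='hd']"):
            results.append((head.get("lemma"), head.get("pt")))
    return results


root = ET.fromstring(TOY_TREE)
print(modifier_heads(root, "heel"))  # [('verschillend', 'ww')]
print(modifier_heads(root, "zeer"))  # []
```

The toy tree deliberately tags the participle as `ww` (verb), so the query returns a hit of the kind that needs manual inspection before it counts as a verb-modifying use.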

26 Basic Facts: Correct? –Queries: erg::mod: ; zeer::mod: ; heel::mod: –Extract from statistics (columns: erg, zeer, heel; rows: ADJ, WW, BW; the cell values are fused in this transcript): ADJ (no values); WW 35499; BW 117 –Query: heel::mod:ww

27 Basic Facts: Correct? Analysis –8 examples are forms that are ambiguous between adjectival and verbal participles. All are examples of adjectival participles, but LASSY represents all participles as verbal –In 1 example heel modifies the adj open from the expression open staan voor, but it is wrongly analyzed as modifying the verb staan CLARIN will offer facilities to make annotations to such corpora Same queries could be done –for the other related words –on LASSY Large Corpus (2.4 billion words, automatically parsed) –In the CGN corpus (but it uses a different interface) But this will require facilities for ‘batch jobs’ or more complicated queries (maybe via web services)

28 Acquisition Corpora: Search E.g. data in the CHILDES system (part of TalkBank) –7 corpora for Dutch –But with their own data formats (CHAT) and tools (CLAN) However, also mirrored at MPI and accessible via (ANNEX/)TROVA (again another interface)

29 Acquisition Corpora: Search Give records for utterances containing erg with –Corpus: (e.g. Van Kampen Corpus) –File: (e.g. laura74.cha) –Line: (e.g. 139) –Part Role: (e.g. Child) –Child Gdr: (e.g. female) –Age: (e.g. 5;6.12) –UTT: (e.g. “ja, die s erg moeilijk.”) Maybe also some preceding/following context Map attribute names and values to ISOCAT

30 Acquisition Corpora: Search Corpus: Van Kampen File: sarah21.cha Line: 630 Speaker: Child Child Gender: Female Age: 2;7.16 UTT: “prinses e(r)g groot !”
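
A record like the one on this slide could be represented and queried as in the sketch below. It assumes that ages follow the `years;months.days` notation seen in the examples (`2;7.16`, `5;6.12`); the class `UtteranceRecord` and the helpers `parse_age` and `age_in_months` are hypothetical names, not part of any CHILDES or CLARIN tool.

```python
import re
from dataclasses import dataclass

# Assumed age notation: years;months.days, with the days part optional.
AGE_RE = re.compile(r"^(\d+);(\d{1,2})(?:\.(\d{1,2}))?$")


def parse_age(age: str):
    """Split an age string such as '2;7.16' into (years, months, days)."""
    match = AGE_RE.match(age.strip())
    if match is None:
        raise ValueError("unrecognised age format: %r" % age)
    years, months, days = match.groups()
    return int(years), int(months), int(days or 0)


def age_in_months(age: str) -> int:
    """Completed months of age, ignoring the days part."""
    years, months, _days = parse_age(age)
    return years * 12 + months


@dataclass
class UtteranceRecord:
    corpus: str
    file: str
    line: int
    speaker: str
    child_gender: str
    age: str
    utterance: str


record = UtteranceRecord(
    corpus="Van Kampen", file="sarah21.cha", line=630, speaker="Child",
    child_gender="Female", age="2;7.16", utterance="prinses e(r)g groot !",
)

print(parse_age(record.age))      # (2, 7, 16)
print(age_in_months(record.age))  # 31
```

With ages in a comparable numeric form, a metadata filter such as "children between 2 and 7 years old" becomes a simple range check on `age_in_months`.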

31 Acquisition Corpora: Search For each child, give list of pairs session + age of the child For each child and each session, give #occurrences of zeer, heel, erg etc, etc. Such queries (some example attempts) –Mixed metadata/content search –Over multiple resources –Specific output formats are not so easy with the current interfaces!!
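
The per-session counts asked for on this slide amount to a small grouping operation once the utterances are available as plain records. The sketch below uses the two example utterances quoted in these slides; treating parentheses as markers of omitted sounds (so that `e(r)g` counts as erg) is an assumption about the transcription convention, and `normalise` and `counts_per_session` are hypothetical helper names.

```python
import re
from collections import Counter, defaultdict

TARGETS = {"heel", "hele", "erg", "erge", "zeer"}


def normalise(utterance: str):
    """Lower-case, drop parentheses (assumed to mark omitted sounds),
    and split the utterance into alphabetic tokens."""
    text = utterance.lower().replace("(", "").replace(")", "")
    return re.findall(r"[^\W\d_]+", text)


def counts_per_session(records):
    """Map each session file to a Counter of target-word occurrences."""
    table = defaultdict(Counter)
    for session, utterance in records:
        for token in normalise(utterance):
            if token in TARGETS:
                table[session][token] += 1
    return table


# (session file, utterance) pairs taken from the examples in the slides.
records = [
    ("sarah21.cha", "prinses e(r)g groot !"),
    ("laura74.cha", "ja, die s erg moeilijk."),
]

table = counts_per_session(records)
for session in sorted(table):
    print(session, dict(table[session]))
# laura74.cha {'erg': 1}
# sarah21.cha {'erg': 1}
```

Raw token counts like these still include irrelevant readings of the words, so they are a starting point for inspection rather than a result.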

32 Acquisition corpora: Search Heel is found 153 times in Van Kampen corpus Erg is found 77 times in Van Kampen corpus –But many are an irrelevant use of erg PoS-tagging the corpus might be useful –Search for PoS bigrams (e.g. erg/adj */adj) –Add lemmas Or even full parsing, at least of the adult speech

33 Acquisition corpora: Parse CLARIN-NL –Web services are being developed For PoS-tagging text For full parsing of text (and many more) –To be usable by humanities researchers –in a user-friendly way in work flow systems Usefulness depends on –Size of the data (effort to select manually) –Quality of the web services

34 Store the found data The found and newly created data –should be stored in a supported format –With automatically generated metadata –With automatically generated provenance data –Using data categories mapped to or from ISOCAT –For which PIDs are provided –Stored on a server of a CLARIN-centre –So that they can become proper resources on their own Are visible, accessible and interpretable as part of enriched publications

35 Search in CGN / SONAR To assess level of formality –Give absolute and relative frequencies of heel/hele/erg/erge/zeer as adj by text genre, and by speaker/participant education level –In CGN (spoken corpus) –In SONAR (written corpus) –Idem but for the word + the following PoS tag –Idem but in the fully parsed part of CGN and in LASSY + the PoS tag of the modifiee head
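
As a rough illustration of "absolute and relative frequencies by text genre", the sketch below computes both for a few target words. The genre labels and the two sample sentences are invented toy data, not drawn from CGN or SONAR, and the function names are hypothetical.

```python
from collections import Counter


def relative_frequency(count: int, total_tokens: int, per: int = 1_000_000) -> float:
    """Occurrences per `per` tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return count / total_tokens * per


def frequencies_by_genre(samples, targets, per=1000):
    """For each genre, map each target word to (absolute count, relative frequency)."""
    result = {}
    for genre, tokens in samples.items():
        counts = Counter(t.lower() for t in tokens if t.lower() in targets)
        result[genre] = {
            word: (counts[word], relative_frequency(counts[word], len(tokens), per))
            for word in sorted(targets)
        }
    return result


# Invented toy samples; real input would be the tokens of each genre in a corpus.
samples = {
    "spontaneous": "dat is heel erg mooi en heel leuk".split(),  # 8 tokens
    "formal": "dit is een zeer belangrijk besluit".split(),      # 6 tokens
}

result = frequencies_by_genre(samples, {"heel", "erg", "zeer"})
print(result["spontaneous"]["heel"])  # (2, 250.0)
print(result["formal"]["heel"])       # (0, 0.0)
```

Relative frequencies are what make genres of different sizes comparable; the absolute counts indicate how much evidence lies behind each figure.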

36 Interpret the data Interpret the data in light of the hypotheses being investigated Apply analytical / statistical tools to the data –CLARIN should support formats of frequently used statistical packages such as SPSS, R, etc. The research will surely lead to new questions, and thus to new queries Reach conclusions and publish an open access enriched publication

37 Broaden the scope Do the same for worden/raken (‘become’/ ‘get’) NP, PP and AP can be predicate complements Worden and raken take predicate complements They are (almost) synonymous worden: takes only NP or AP raken: takes only AP or PP

38 Broaden the scope AP: Zij werd / raakte zwanger PP: Zij *werd / raakte in verwachting NP: Zij werd / *raakte burgemeester And repeat the process Exercise

39 Conclusions There is no adequate infrastructure for linguistic research There are bits and pieces, but –Finding LRs is not easy –LRs have their own formats, data categories, user / search interfaces –Limited formal and no semantic interoperability –Search in combined LRs very difficult if not impossible, so the full research potential is not exploited CLARIN(-NL) attempts to remedy this

40 CLARIN-NL Thanks for your attention!

41 No Entry!

42 Basic Facts: Correct? De omgang met de buren gebeurt op een heel ontspannen manier en de vrouw van de dominee heeft zelfs al Wolderse vlaai leren bakken. (parse) –heel:ADJ:mod:WW:ontspannen De verschijnselen zijn heel verschillend. (parse) –heel:ADJ:mod:WW:verschillend Op het voorterrein ging het nog heel overtuigend. (parse) –heel:ADJ:mod:WW:overtuigend Ze hebben heel gericht en planmatig volkscafés bezocht om daar hun gif te spuien. (parse) –heel:ADJ:mod:WW:gericht Ze is zelfs met een ' meester ' getrouwd : Marc Dassesse _ mevrouw Spiritus-Dassesse zet heel geëmanicipeerd haar meisjesnaam voorop _ is nu een gerenomeerd fiscaal adviseur en hoogleraar aan de ULB. (parse) –heel:ADJ:mod:WW:geëmanicipeerd Gelukkig krijg ik nog heel geregeld te horen : ' Gerard jongen, dat doe je gewoon foùt '. (parse) –heel:ADJ:mod:WW:geregeld Dat is een heel verrassend resultaat en het stemt tot optimisme. (parse) –heel:ADJ:mod:WW:verrassend De biermarkt is heel versnipperd en wordt overspoeld door nieuwe productlanceringen. (parse) –heel:ADJ:mod:WW:versnipperd Toch staan we hier heel open voor voorstellen. (parse) –heel:ADJ:mod:WW:staan

Metadata search CGN+CHILDES Dutch && 2<age<7

Regexp content search heel|zeer|erg|erge|hele
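
A content search with this alternation behaves differently depending on whether the pattern is anchored to whole words. The Python sketch below contrasts the two; it uses the standard `re` module, not the search engine of any corpus interface, and the second sample sentence is invented for the purpose.

```python
import re

# Unanchored alternation, as written on the slide: also matches inside longer words.
NAIVE = re.compile(r"heel|zeer|erg|erge|hele")

# Whole-word version: \b requires a word boundary on both sides.
WHOLE_WORD = re.compile(r"\b(?:heel|hele|zeer|erg|erge)\b", re.IGNORECASE)


def find_forms(pattern, text):
    """Return all matches of `pattern` in `text`, lower-cased."""
    return [m.group(0).lower() for m in pattern.finditer(text)]


corpus_sentence = "De verschijnselen zijn heel verschillend."
tricky_sentence = "Het geheel ligt achter de bergen."  # invented example

print(find_forms(WHOLE_WORD, corpus_sentence))  # ['heel']
print(find_forms(NAIVE, tricky_sentence))       # ['heel', 'erg']
print(find_forms(WHOLE_WORD, tricky_sentence))  # []
```

When the search runs over one token at a time, as on a word tier, anchoring each alternative with `^` and `$` has the same effect as the word boundaries used here.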

Resultset export to file

CGN regexp ^heel$|^erg$|…

CGN regexp on WORDS tier + POS

48 Exercise ‘Worden takes APs not PPs as predc’ Use the LASSY-Small Very Simple Interface Give me all sentences in which the word “worden” takes a predicative (predc) PP complement: –rel='predc' and hlemma='worden' and postag='vz' Do you find examples with this query? How do you interpret this?