More Thesis Thoughts

Earlier this week, my thesis fell apart when I discovered a research paper on the very same topic as mine, done so much better than mine. The researcher even used the same corpus! Even though it ruined my life for a minute, Dr. Penny MacDonald's research is well done and illuminating on several points that I will cite in my final draft.

After panicking for a whole day and talking to my advisor, it's clear that my mistake was drifting too close to computational linguistics, which I know practically nothing about. I will still write my research on Native Language Identification, essentially, but most work on the topic has been computational. So I've got to take a different approach.

Most NLID studies use large corpora of second-language English and then run algorithms or machine learning to identify the statistically significant language features that might indicate the authors' first language. These studies are so mathematical that I got a little nauseous just writing that sentence about them. Still, their results are usually presented in the form of n-grams, which, suffice it to say, are highly specific and don't translate directly to forensic linguistic analysis. My advisor had to explain that last part to me (I thought everyone was being stingy by not plainly stating those n-grams).
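
If I've understood my advisor's explanation, the basic shape of those studies looks something like the toy sketch below. It uses scikit-learn, and every sentence and label in it is something I invented on the spot just to show the pipeline; it is emphatically not how my own analysis will work, and not a claim about what any particular native language actually produces.

```python
# Toy illustration of the n-gram pipeline described in NLID papers:
# count character n-grams across essays, then let a classifier find the
# combinations that best predict the writer's first language.
# All sentences and labels below are invented placeholders, not corpus data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

essays = [
    "I am living in this city since three years.",
    "He explained me the problem very detailed.",
    "Yesterday I have seen a very interesting film.",
    "She said me that the meeting is tomorrow.",
]
native_language = ["German", "German", "French", "Spanish"]  # made-up labels

# Character n-grams (1 to 3 characters), analogous to the features
# those studies report, fed into a simple classifier.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(essays, native_language)

# Predict the "native language" of a new, equally invented sentence.
print(model.predict(["I am working here since two months."]))
```

The point is that the output of all that math is a bundle of statistically weighted character strings, which is exactly the kind of result that doesn't hand an analyst anything they can point to in a single questioned document.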

While corpus analysis has an important - and growing - place in forensic linguistics, actual casework rarely calls for building a corpus from the evidence. The average case simply does not contain enough language samples to build a corpus large enough for the kinds of studies done in these NLID papers.

The problem I've run into with my thesis is that most of the current research in Native Language Identification is too computational for forensic linguistic application. Forensic linguistic analysis needs something a little more... qualitative. Now, I know I specifically said I wanted to make Native Language Identification more quantitative in practice. I still will, except I sort of won't, because I'll be taking cross-linguistic analysis in a different direction.

I'm calling it "Native Language Analysis," in the style of most forensic linguistic methods (analyses instead of identifications). Ultimately, I hope to develop a method for efficiently cross-analyzing a written sample against a catalog of cross-linguistic transference. The catalog would present the language features most often affected by native language interference alongside the languages likely to have interfered. Most importantly, rather than working from a simple checklist of language errors, Native Language Analysis should be a process: discerning the features that mark a sample as non-native English, referencing those features against the catalog, and further analyzing the results for veracity and probability.
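
To keep myself honest about what "referencing those features against a catalog" might actually involve, here is the barest sketch of that lookup step. Every feature name and language list in it is a placeholder I made up on the spot; building the real catalog is exactly the research I still have to do.

```python
# Rough sketch of the catalog lookup step in Native Language Analysis.
# The feature names and language lists are invented placeholders, not
# findings from any study; only the shape of the process matters here.

# Catalog: language feature -> languages whose interference commonly produces it
catalog = {
    "missing definite article": ["Russian", "Polish", "Japanese"],
    "present tense for ongoing past action": ["German", "French"],
    "adjective placed after noun": ["Spanish", "French"],
}

# Features an analyst has marked in a questioned document (also invented)
observed_features = [
    "missing definite article",
    "present tense for ongoing past action",
]

# Tally which native languages the observed features point toward
candidate_counts = {}
for feature in observed_features:
    for language in catalog.get(feature, []):
        candidate_counts[language] = candidate_counts.get(language, 0) + 1

# Rank candidates by how many observed features implicate them
for language, count in sorted(candidate_counts.items(), key=lambda kv: -kv[1]):
    print(f"{language}: {count} matching feature(s)")
```

Even then, the output would only be a ranked list of candidates for a human analyst to weigh for veracity and probability, not an identification.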

Does that still sound vague? I'm sorry; I've only been working on this method for a few hours. For now, my research is a matter of combing those NLID papers and ESL education resources for the susceptible features I mentioned. I swear on my grave, though: if I find a paper on someone's completed, superior research on this topic, I'm going to drop out and become a goat farmer.
