BILINGUAL CORPUS

Spanish was spoken in northern New Mexico, U.S.A., for centuries before English. Since English-speaking settlers arrived, following the annexation of the territory in the mid-19th century and the expansion of the railroad, Spanish and English have coexisted as the two main competing languages for over 150 years. Today, however, bilingualism is threatened by language shift to English and by the stigmatization of New Mexican Spanish relative to extralocal varieties in schools and public discourse. For example, here is one speaker's story of a teacher marking New Mexican Spanish as wrong.

mi,
nieta,
.. vino —
y dijo,
grandma ayúdame con mi español,
le dije yo bueno.
so me senté áhi yo,
yo y mi suegra,
los sentamos y c- —
y y,
y le ayudamos.
pues agarró todo mal.

((21 intervening lines))

.. they c- called it proper Spanish.
or,
whatever,
it was called,
but it wasn’t our Spanish.
so she got everything wrong.
so I went to the school,
and I complained.

[14 Proper Spanish, 25:51 – 26:34]

The New Mexico Spanish English Bilingual (NMSEB) corpus samples northern Nuevomexicanos who use both languages regularly in their daily interactions. We see their high level of bilingualism in the even distribution of languages (S=Spanish, E=English) and the bidirectionality of multi-word code-switching (CS).
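Both measures can be made concrete once words carry a language tag. The following Python sketch is only an illustration, not the project's own tooling: it assumes a hypothetical input in which every word is tagged S or E, and it treats a switch as multi-word when the stretch in the new language is at least two words long. The function names, the tag format, and that threshold are all assumptions made for the example.

```python
from collections import Counter
from itertools import groupby


def language_shares(tags):
    """Return the proportion of words tagged with each language."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def multiword_switches(tags, min_len=2):
    """Count switches by direction, e.g. ('S', 'E') for Spanish to English.

    A switch is counted only when the stretch in the new language has at
    least `min_len` words (an assumed definition of "multi-word").
    """
    # Collapse the tag sequence into runs, e.g. [('S', 3), ('E', 2), ...].
    runs = [(lang, sum(1 for _ in group)) for lang, group in groupby(tags)]
    directions = Counter()
    for (prev_lang, _), (lang, length) in zip(runs, runs[1:]):
        if length >= min_len:
            directions[(prev_lang, lang)] += 1
    return directions


# Toy tag sequence, invented for illustration (not corpus data).
tags = list("SSSEESSSEEES")
shares = language_shares(tags)
switches = multiword_switches(tags)
print(shares)
print(dict(switches))
```

Under these assumptions, a balanced bilingual sample would show shares close to one half for each language and comparable counts for the ('S', 'E') and ('E', 'S') directions.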

Bilingual speech was recorded through conversational sociolinguistic interviews, in which participants tell their own life stories (see Cuentos). The corpus comprises 31 recordings with 40 speakers totaling 29 hours, or 300,000 words, transcribed orthographically and prosodically. The prosodic transcription is based on the Intonation Unit (Du Bois, Schuetze-Coburn, Cumming, and Paolino 1993). On constructing a bilingual corpus, see Chapters 2 and 3 of Bilingualism in the community.

Acoustic properties of the Intonation Unit (IU): higher pitch at the beginning of the IU, gradually dropping over its course to a fall at the end, and a slower rate of speech at the end of the IU.

Using Principal Component Analysis (PCA) to group speakers

On social factors in NM Spanish, see Torres Cacoullos, R. & Berry, G. M. 2018. Sociolinguistic variation in US Spanish. Handbook of Spanish as a Minority/Heritage Language, K. Potowski (ed.), 254-268. Routledge.

R Scripts Designed for Working with Bilingual Corpus Data

In the course of our research, we’ve developed numerous R scripts for working with bilingual corpus data. These include scripts for:

Segmenting prosodic breaks from Intonation Unit-based transcriptions
Dealing with overlapping speech when extracting corpus data
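To give a rough idea of what these two tasks involve, here is a minimal Python sketch; it is not a port of the R scripts, which are not yet public. It assumes one Intonation Unit per line, as in the excerpt above, with the final character marking the boundary type (comma, period, question mark, or a long dash for a truncated IU), a leading ".." or "..." marking a pause, a trailing hyphen marking a truncated word, and square brackets marking overlapping speech. This symbol inventory loosely follows Du Bois et al. (1993) and is an assumption about the corpus format; `parse_iu`, `parse_transcript`, and the field names are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed symbol inventory, loosely following Du Bois et al. (1993);
# the corpus's actual conventions may differ.
BOUNDARY_TYPES = {",": "continuing", ".": "final", "?": "appeal", "—": "truncated"}
PAUSE = re.compile(r"^\.{2,3}\s*")              # leading ".." or "..." marks a pause
TRUNCATED_WORD = re.compile(r"\b\w+-(?=\s|$)")  # a cut-off word such as "c-"


@dataclass
class IntonationUnit:
    words: list
    boundary: str          # "continuing", "final", "appeal", "truncated", "unmarked"
    pause: bool            # True if the IU starts with a pause marker
    truncated_words: list  # cut-off words found in the IU
    overlap: bool          # True if the IU contained overlap brackets


def parse_iu(line):
    """Parse one transcript line (one IU) into its prosodic annotations."""
    text = line.strip()
    pause = bool(PAUSE.match(text))
    text = PAUSE.sub("", text)
    # The last character, if it is a known symbol, gives the boundary type.
    boundary = BOUNDARY_TYPES.get(text[-1:], "unmarked")
    if boundary != "unmarked":
        text = text[:-1].rstrip()
    # Record overlap, then drop the brackets so they do not stick to words.
    overlap = "[" in text or "]" in text
    text = text.replace("[", "").replace("]", "")
    return IntonationUnit(
        words=text.split(),
        boundary=boundary,
        pause=pause,
        truncated_words=TRUNCATED_WORD.findall(text),
        overlap=overlap,
    )


def parse_transcript(text):
    """Parse a block of transcript text, skipping blank lines."""
    return [parse_iu(line) for line in text.splitlines() if line.strip()]


# Lines taken from the excerpt above.
sample = """mi,
nieta,
.. vino —
y dijo,
los sentamos y c- —
pues agarró todo mal.
"""
ius = parse_transcript(sample)
for iu in ius:
    print(iu.boundary, iu.pause, iu.words)
```

Stripping the brackets while keeping an `overlap` flag is one simple design choice: word counts stay clean, and overlapped IUs can still be filtered out or examined separately.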

We are currently working to publish these scripts on GitHub for public use. If you have any interest in these scripts, please feel free to reach out to afleming9796@gmail.com.