Pse e ndërtova këtë / Why I built this
E ndërtova albanisht.com për bisedat që diaspora nuk ka mundur t'i bëjë. Nipi në New York që mezi i thotë "mirëdita" gjyshes në fshat. Telefonata e së dielës që mbaron pas tridhjetë sekondash, sepse askush s'mund të mbajë të dy gjuhët. Certifikata e martesës që USCIS e refuzon, sepse Google Translate ka prodhuar diçka që nuk lexohet as si shqip, as si anglisht.
I built albanisht.com for the conversations the diaspora hasn't been able to have. The grandkid in New York who can barely say "mirëdita" to the grandmother in the village. The Sunday phone call that ends after thirty seconds because nobody can carry both languages. The marriage certificate that USCIS rejects because Google Translate produced something that reads as neither Albanian nor English.
Modelet e fundit të AI tani janë mjaft të mira sa puna e mbetur është shtresa kulturore dhe gjuhësore që duhet ndërtuar posaçërisht për shqipen. Atë e ndërtova këtu. Krahasim të ndershëm me Google Translate, ku tregohen edhe testet që dështojmë. Përkthim me zë në kohë reale për telefonatat e familjes. Bazë të dhënash që rritet me korrigjime nga vetë shqipfolësit.
Modern AI models have gotten good enough that the only thing missing is the cultural and linguistic layer built specifically for Albanian. That's what's here. An honest benchmark against Google Translate with failed tests shown openly. A real-time voice interpreter for family calls. A dataset that grows from corrections submitted by actual Albanian speakers.
Ky është një projekt familjar. Nuk po përpiqet të bëhet startup gjigant. Po përpiqet të jetë vegla që na ka munguar.
This is a family project. Not trying to become a unicorn. Trying to be the tool that's been missing.
Methodology
The translator combines a frontier LLM with retrieval over a curated parallel corpus and a hand-built linguistic knowledge base. Every translation request retrieves the most similar examples from the corpus, looks up relevant idioms and false friends, and constructs a prompt specifically tuned for Albanian.
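The prompt-construction step described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Example` class, the `build_prompt` function, the prompt wording, and the sample note are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One retrieved sentence pair from the parallel corpus (hypothetical type)."""
    albanian: str
    english: str


def build_prompt(source_text: str, examples: list[Example], notes: list[str]) -> str:
    """Assemble a translation prompt from retrieved examples and linguistic notes.

    The layout of the prompt is an assumption; the real template is not
    described in this document.
    """
    lines = ["Translate the following Albanian text into English."]
    if notes:
        lines.append("Linguistic notes:")
        lines.extend(f"- {note}" for note in notes)
    if examples:
        lines.append("Similar examples:")
        lines.extend(f"- {ex.albanian} => {ex.english}" for ex in examples)
    lines.append(f"Text: {source_text}")
    return "\n".join(lines)


# Hypothetical request: one retrieved pair and one politeness note.
prompt = build_prompt(
    "Mirëdita, si jeni?",
    [Example("Mirëdita.", "Good afternoon.")],
    ["'ju' is the polite form of 'you'."],
)
print(prompt)
```

Keeping the retrieved examples and the linguistic notes as separate arguments makes it easy to test the prompt layout without running retrieval or calling a model.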
Twelve linguistic rules cover phenomena such as admirative mood, optative idioms, clitic doubling, the post-posed definite article, linking articles, pro-drop, ti-vs-ju politeness, religious register, kinship disambiguation, false friends and word order. A parallel corpus of around 1,200 Albanian-English sentence pairs from Tatoeba plus 190 hand-curated pairs reviewed by a native speaker. A retrieval index keyed by character n-grams. A dialect detector that distinguishes Tosk from Gheg.
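To make the idea of a retrieval index keyed by character n-grams concrete, here is a small sketch. It assumes trigrams, an in-memory inverted index, and ranking by the number of shared n-grams; the function names and the three-pair toy corpus are hypothetical, and the real index may use a different n, weighting, or storage.

```python
from collections import Counter, defaultdict


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a lowercased, space-padded string."""
    normalized = " ".join(text.lower().split())
    padded = f" {normalized} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_index(corpus: list[tuple[str, str]]) -> dict[str, set[int]]:
    """Map each n-gram to the positions of the corpus pairs that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for position, (albanian, _english) in enumerate(corpus):
        for gram in char_ngrams(albanian):
            index[gram].add(position)
    return index


def retrieve(query: str, corpus: list[tuple[str, str]],
             index: dict[str, set[int]], k: int = 3) -> list[tuple[str, str]]:
    """Return up to k corpus pairs sharing the most n-grams with the query."""
    shared = Counter()
    for gram in char_ngrams(query):
        for position in index.get(gram, ()):
            shared[position] += 1
    return [corpus[position] for position, _count in shared.most_common(k)]


# Toy corpus for illustration only.
corpus = [
    ("Mirëdita.", "Good afternoon."),
    ("Faleminderit shumë.", "Thank you very much."),
    ("Si jeni?", "How are you?"),
]
index = build_index(corpus)
print(retrieve("Mirëdita, zonjë.", corpus, index, k=1))
```

Character n-grams tolerate inflected endings and small spelling differences, which is one plausible reason to prefer them over whole-word matching for a morphologically rich language.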
Sources we cite: Akademia e Shkencave e Shqipërisë; Newmark, Hubbard and Prifti's Standard Albanian: A Reference Grammar; Camaj's Albanian Grammar; Mëniku and Campos's Discovering Albanian; Friedman's work on the Balkan Sprachbund; the Tatoeba sentence corpus; Wiktionary Albanian.
Full provenance for every directive, lexicon and benchmark entry lives in docs/LINGUISTIC_SOURCES.md in the repository. Each phenomenon-specific directive is mapped to its source grammar; each corpus entry is tagged by origin (akademia / newmark / camaj / sister-review / user-report).
Known limitations
- Strict evaluator. Our benchmark scorer is strict and reproducible by design. It can mark semantically correct translations as incorrect when they use synonyms or a different word order. We hand-review failures, and some of them turn out to be surface-only mismatches rather than real translation errors.
- Google Translate column is captured manually. Google updates its engine continuously, so between our capture dates its current behavior may differ from what we display.
- Benchmark size. 48 tests catch regressions on the phenomena we care most about, but the set isn't exhaustive. We extend it as we discover new failure modes from real usage.
- Test answers are static. Acceptable-answer lists were authored at a point in time. Language is alive. We revisit the lists annually.
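The strict-evaluator limitation above is easier to see with a small sketch. The scorer's real rules are not described here, so this assumes one simple form of strictness: a translation passes only if, after normalizing case, punctuation and whitespace, it exactly equals an entry in the acceptable-answer list. The names `normalize` and `is_correct` are hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Casefold, drop punctuation, and collapse whitespace (assumed normalization)."""
    text = unicodedata.normalize("NFC", text).casefold()
    kept = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def is_correct(candidate: str, acceptable: list[str]) -> bool:
    """Pass only on an exact match with one of the acceptable answers."""
    target = normalize(candidate)
    return any(target == normalize(answer) for answer in acceptable)


acceptable = ["Good afternoon, grandmother."]

# Differences in case and punctuation are forgiven.
print(is_correct("good afternoon grandmother", acceptable))

# A synonym-based rendering fails even though the meaning is the same.
print(is_correct("Good day, grandma.", acceptable))
```

A scorer like this gives the same verdict on every run, which is what makes it reproducible, at the cost of rejecting valid paraphrases until someone adds them to the acceptable-answer list.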
Get in touch
General questions: hello@albanisht.com. Security reports: security@albanisht.com.