Methods and Data
Corpus
All analysis in this book is performed on the Masoretic Torah text, accessed via the Sefaria.org public API. The text comprises the Five Books of Moses:
| Book | Chapters | Verses | Words | Letters |
|---|---|---|---|---|
| Genesis | 50 | 1,533 | 20,614 | ~78,064 |
| Exodus | 40 | 1,210 | 16,713 | ~63,529 |
| Leviticus | 27 | 859 | 11,950 | ~44,790 |
| Numbers | 36 | 1,288 | 16,408 | ~63,530 |
| Deuteronomy | 34 | 956 | 14,294 | ~54,892 |
| **Total** | **187** | **5,846** | **79,979** | **304,805** |
No proprietary data is required. The annotated corpus (torah_corpus.csv), developed independently by the author over several years, is used only for validation, never for training. All algorithms can be reproduced using only the public Sefaria API.
Preprocessing
1. Nikud removal: Vowel marks (nikud) are stripped for the base morphological analysis. A separate analysis with nikud preserved demonstrates a +4.3% improvement in meaning prediction.
2. Final forms: The five final-form letters are mapped to their standard forms (ך→כ, ם→מ, ן→נ, ף→פ, ץ→צ).
3. Maqaf handling: Words joined by maqaf (־) are treated as separate words.
4. HTML stripping: All markup from the API response is removed, retaining only Hebrew text.
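The four steps above can be sketched as a single function. This is a minimal illustration, not the book's own code: the function name `preprocess`, the regular expressions, and the choice to drop every character outside the 27-letter Hebrew range (which removes nikud, cantillation marks, and punctuation in one pass) are assumptions.

```python
import html
import re

# Step 2: final-form letters mapped to their standard forms.
FINAL_TO_STANDARD = str.maketrans(
    {"ך": "כ", "ם": "מ", "ן": "נ", "ף": "פ", "ץ": "צ"}
)
MAQAF = "\u05BE"
TAG = re.compile(r"<[^>]+>")                    # any HTML tag
NON_LETTER = re.compile(r"[^\u05D0-\u05EA\s]")  # anything but Hebrew letters/whitespace


def preprocess(raw):
    """Return a list of normalized, consonant-only Hebrew words."""
    text = html.unescape(TAG.sub(" ", raw))   # step 4: strip markup and entities
    text = text.replace(MAQAF, " ")           # step 3: maqaf separates words
    text = NON_LETTER.sub("", text)           # step 1: drop nikud and other marks
    text = text.translate(FINAL_TO_STANDARD)  # step 2: normalize final forms
    return text.split()


print(preprocess("<b>עַל־פְּנֵי הַמָּיִם</b>"))  # ['על', 'פני', 'המימ']
```

Replacing tags and maqaf with a space, rather than deleting them, keeps adjacent words from being fused together.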
The Letter Partition
The 22 letters of the Hebrew alphabet are partitioned into four groups:
| Group | Letters | Count | Criterion |
|---|---|---|---|
| **Foundation** | ג ד ז ח ט ס ע פ צ ק ר ש | 12 | Never serve as grammatical morphemes; always carry root-level semantic content |
| **AMTN** | א מ ת נ | 4 | Serve as both root consonants and grammatical markers (tense, person, reflexive) |
| **YHW** | י ה ו | 3 | Serve as both root consonants and grammatical markers (gender, possession, causation) |
| **BKL** | ב כ ל | 3 | Serve as both root consonants and grammatical markers (prepositions, relation) |
This partition is fixed throughout all analyses. No parameter tuning, no optimization, no text-dependent adjustment. The same 22→4 mapping produces every finding in this book. Changing the partition changes every result, making the system fully falsifiable.
The criterion is purely morphological: a letter is Foundation if and only if it never functions as a grammatical prefix, suffix, or inflectional marker in Biblical Hebrew. The remaining 10 letters all have dual roles — sometimes root consonant, sometimes grammatical marker — and are classified into three Control subgroups by their grammatical function.
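As an illustration, the 22→4 mapping can be held in a lookup table and checked to be a true partition of the alphabet. The names `PARTITION`, `LETTER_TO_GROUP`, and `group_sequence` are hypothetical and chosen only for this sketch; the letter assignments are copied from the table above.

```python
from collections import Counter

PARTITION = {
    "Foundation": "גדזחטסעפצקרש",
    "AMTN": "אמתנ",
    "YHW": "יהו",
    "BKL": "בכל",
}

# Invert the table: one group label per letter.
LETTER_TO_GROUP = {
    letter: group for group, letters in PARTITION.items() for letter in letters
}

# The 22 standard letters: the Unicode range U+05D0..U+05EA minus the 5 final forms.
ALPHABET = {chr(c) for c in range(0x05D0, 0x05EB)} - set("ךםןףץ")

# Every letter belongs to exactly one group, and group sizes are 12/4/3/3.
assert set(LETTER_TO_GROUP) == ALPHABET
assert sum(len(letters) for letters in PARTITION.values()) == 22
GROUP_SIZES = Counter(LETTER_TO_GROUP.values())


def group_sequence(word):
    """Map a normalized Hebrew word to its sequence of group labels."""
    return [LETTER_TO_GROUP[letter] for letter in word]


print(group_sequence("בראשית"))
# ['BKL', 'Foundation', 'AMTN', 'Foundation', 'YHW', 'AMTN']
```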
Foundation%
For any string of Hebrew letters w = c₁c₂...cₙ:
Foundation%(w) = |{cᵢ : cᵢ ∈ Foundation}| / n × 100
This is the percentage of letters in the string that belong to the Foundation group. It can be computed for a single word, a verse, a chapter, a book, or the entire Torah.
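A minimal sketch of this formula follows. It assumes the input has already been preprocessed (nikud stripped, final forms normalized), and it ignores any character that is not a Hebrew letter, so the same function works on a word or on a verse containing spaces. The function name `foundation_pct` is an assumption.

```python
FOUNDATION = frozenset("גדזחטסעפצקרש")


def foundation_pct(text):
    """Percentage of Hebrew letters in `text` that belong to the Foundation group."""
    letters = [ch for ch in text if "\u05D0" <= ch <= "\u05EA"]
    if not letters:
        raise ValueError("text contains no Hebrew letters")
    hits = sum(ch in FOUNDATION for ch in letters)
    return 100.0 * hits / len(letters)


# בראשית has 6 letters, of which ר and ש are Foundation: 2 / 6 × 100
print(round(foundation_pct("בראשית"), 2))  # 33.33
```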
ModeScore
For any window of k consecutive verses, let:
- Y = count of verses containing the name יהוה (YHWH)
- E = count of verses containing the name אלהים (Elohim)
ModeScore = (Y − E) / (Y + E)
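ModeScore ranges from −1 to +1. When ModeScore > 0, the window is YHWH-dominant; when ModeScore < 0, it is Elohim-dominant; when ModeScore ≈ 0, the two names occur in roughly equal numbers of verses. The score is undefined for a window in which neither name occurs (Y + E = 0). Default window: k = 50 verses (sliding, step = 1).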
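The sliding-window computation might look like the sketch below. Two choices here are assumptions rather than the book's stated method: a verse is taken to "contain" a name when any of its preprocessed words ends with that name (which admits prefixed forms such as ויהוה but misses suffixed forms), and a window with Y + E = 0 yields `None`.

```python
YHWH = "יהוה"
ELOHIM = "אלהימ"  # final mem already normalized to מ


def mode_scores(verses, k=50):
    """ModeScore for every window of k consecutive verses (step = 1).

    `verses` is a list of verses, each a list of preprocessed words.
    """
    has_y = [any(word.endswith(YHWH) for word in verse) for verse in verses]
    has_e = [any(word.endswith(ELOHIM) for word in verse) for verse in verses]

    scores = []
    for start in range(len(verses) - k + 1):
        y = sum(has_y[start:start + k])  # verses in the window naming YHWH
        e = sum(has_e[start:start + k])  # verses in the window naming Elohim
        scores.append((y - e) / (y + e) if y + e else None)
    return scores


toy = [[YHWH], [YHWH], [YHWH], [ELOHIM], ["ארצ"]]
print(mode_scores(toy, k=4))  # [0.5, 0.3333333333333333]
```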
Autocorrelation
The autocorrelation function measures how similar the signal is to a time-shifted version of itself:
ACF(τ) = Σᵢ (xᵢ − μ)(xᵢ₊τ − μ) / Σᵢ (xᵢ − μ)²
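where x is the Foundation% (F%) or ModeScore series of length N, μ is its mean, and τ is the lag in verses; the numerator sums over i = 1, …, N − τ and the denominator over all i. A slow decay of the ACF indicates long-range memory; a rapid decay indicates that correlations are short-ranged.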
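The formula translates directly into NumPy. The sketch below is illustrative; the function name `acf` and the `max_lag` parameter are assumptions, and a constant series (zero variance) is rejected because the denominator would vanish.

```python
import numpy as np


def acf(x, max_lag):
    """Autocorrelation of x at lags 0..max_lag, normalized so that ACF(0) = 1."""
    x = np.asarray(x, dtype=float)
    if not 0 <= max_lag < len(x):
        raise ValueError("max_lag must satisfy 0 <= max_lag < len(x)")
    d = x - x.mean()
    denom = np.dot(d, d)
    if denom == 0:
        raise ValueError("series has zero variance")
    # For lag tau, pair d[i] with d[i + tau] over the overlapping range.
    return np.array(
        [np.dot(d[: len(d) - tau], d[tau:]) / denom for tau in range(max_lag + 1)]
    )


print(acf([1, 2, 3, 4, 5], max_lag=1))  # [1.  0.4]
```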
Scaling Analysis
The fluctuation function F(s) measures the root-mean-square deviation of the integrated signal from local trends at scale s:
F(s) ∝ sᵅ
The exponent α characterizes the signal:
- α = 0.5: uncorrelated (white noise)
- α > 0.5: persistent (long-range correlated)
- α < 0.5: anti-persistent
We compute α separately for Foundation% and ModeScore, yielding the dual scaling law: α_base = −0.266, α_mode = −0.056 (ratio 4.7×).
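The description of F(s) above matches the general shape of detrended fluctuation analysis. The following sketch estimates the exponent under assumptions the text does not state: linear detrending, non-overlapping windows, and a least-squares fit in log-log space. The function name `fluctuation_exponent` and the choice of scales are hypothetical, and the book's own implementation may differ.

```python
import numpy as np


def fluctuation_exponent(x, scales):
    """Slope of log F(s) against log s, with linear detrending per window."""
    x = np.asarray(x, dtype=float)
    profile = np.cumsum(x - x.mean())  # the integrated signal

    fluctuations = []
    for s in scales:
        n_windows = len(profile) // s
        # One column per window, s samples each.
        windows = profile[: n_windows * s].reshape(n_windows, s).T
        t = np.arange(s)
        slope, intercept = np.polyfit(t, windows, 1)  # a linear fit per column
        trend = np.outer(t, slope) + intercept
        fluctuations.append(np.sqrt(np.mean((windows - trend) ** 2)))

    alpha, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return alpha


rng = np.random.default_rng(0)
noise = rng.standard_normal(4096)
print(fluctuation_exponent(noise, [16, 32, 64, 128, 256]))  # close to 0.5 for white noise
```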
Change-Point Detection
To identify structural boundaries, we use a sliding-window mean-shift detector:
For each position i in the verse-by-verse F% series:
1. Compute mean F% in the left window [i−w, i)
2. Compute mean F% in the right window [i, i+w)
3. Record the absolute difference |right − left|
4. Z-normalize all differences
5. Flag positions where Z > threshold AND the position is a local maximum
Default parameters: w = 40 verses, threshold Z > 1.0. Robustness is verified across w ∈ {20, 30, 40, 50, 60, 75}.
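The five steps can be sketched as follows. The function name `change_points` is an assumption, as are two details the text leaves open: positions closer than w verses to either end are skipped because a full window is unavailable, and a local maximum is taken to mean strictly greater than the left neighbor and at least as large as the right one.

```python
import numpy as np


def change_points(series, w=40, z_threshold=1.0):
    """Indices where the mean shift between adjacent w-verse windows peaks."""
    x = np.asarray(series, dtype=float)
    positions = np.arange(w, len(x) - w + 1)  # both windows must fit
    if len(positions) < 3:
        return []

    csum = np.concatenate(([0.0], np.cumsum(x)))
    left = (csum[positions] - csum[positions - w]) / w    # mean of [i-w, i)
    right = (csum[positions + w] - csum[positions]) / w   # mean of [i, i+w)
    diff = np.abs(right - left)

    std = diff.std()
    if std == 0:
        return []  # no variation, nothing to flag
    z = (diff - diff.mean()) / std

    flagged = []
    for j in range(1, len(z) - 1):
        if z[j] > z_threshold and z[j] > z[j - 1] and z[j] >= z[j + 1]:
            flagged.append(int(positions[j]))
    return flagged


# A single step from 0 to 1 at index 100 is recovered exactly.
print(change_points([0.0] * 100 + [1.0] * 100))  # [100]
```

Prefix sums make each window mean an O(1) lookup, so the whole series is scanned in linear time for any w.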
Shuffle Tests
All statistical claims are verified against null models:
1. Partition shuffle: Randomly reassign 22 letters to groups of 12/4/3/3 (1,000 iterations)
2. Position shuffle: Randomly permute letters within each word (1,000 iterations)
3. Boundary shuffle: Randomly place the same number of boundaries across the text (1,000 iterations)
4. Text shuffle: Randomly permute verses while preserving verse-level statistics (1,000 iterations)
A finding is reported as significant only if the real value exceeds 95% of shuffled values (p < 0.05).
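As one example, the partition shuffle (test 1) can be sketched as below. The names `shuffle_partition` and `partition_shuffle_test` are hypothetical, and `statistic` stands for any function that takes a letter partition and returns a number; which statistics the book feeds into this test is not specified here.

```python
import random

# The 22 standard letters: U+05D0..U+05EA minus the 5 final forms.
ALPHABET = sorted({chr(c) for c in range(0x05D0, 0x05EB)} - set("ךםןףץ"))
GROUP_SIZES = {"Foundation": 12, "AMTN": 4, "YHW": 3, "BKL": 3}


def shuffle_partition(rng):
    """Randomly reassign the 22 letters to groups of size 12/4/3/3."""
    letters = ALPHABET[:]
    rng.shuffle(letters)
    partition, start = {}, 0
    for name, size in GROUP_SIZES.items():
        partition[name] = frozenset(letters[start:start + size])
        start += size
    return partition


def partition_shuffle_test(statistic, real_partition, n_iter=1000, seed=0):
    """Return the real statistic and the fraction of shuffles it exceeds."""
    rng = random.Random(seed)
    real_value = statistic(real_partition)
    null_values = [statistic(shuffle_partition(rng)) for _ in range(n_iter)]
    fraction_exceeded = sum(v < real_value for v in null_values) / n_iter
    return real_value, fraction_exceeded
```

Under the criterion stated above, a finding would count as significant when `fraction_exceeded` is greater than 0.95. Fixing the seed makes a run repeatable.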
Software
All code is written in Python 3.8+ using the standard library (collections, json, re) together with NumPy, plus scikit-learn for the gradient-boosting (GBM) classifier. No commercial software is required. Complete source code for all four algorithms is provided in Appendix B.
Reproducibility
To reproduce any finding:
1. Install Python 3.8+
2. Run `python3 torah_root_analyzer.py --demo` (downloads Torah from Sefaria automatically)
3. Execute the relevant analysis script
All data is public. All code is provided. All parameters are documented above. No hidden steps.