3½

Methods and Data

Corpus

All analysis in this book is performed on the Masoretic Torah text, accessed via the Sefaria.org public API. The text comprises the Five Books of Moses:

Book	Chapters	Verses	Words	Letters
Genesis	50	1,533	20,614	~78,064
Exodus	40	1,210	16,713	~63,529
Leviticus	27	859	11,950	~44,790
Numbers	36	1,288	16,408	~63,530
Deuteronomy	34	956	14,294	~54,892
Total	187	5,846	79,979	304,805

No proprietary data is used. The annotated corpus (torah_corpus.csv), developed independently by the author over several years, is used only for validation — never for training. All algorithms can be reproduced using only the public Sefaria API.

✦

Preprocessing

1. Nikud removal: Vowel marks (nikud) are stripped for the base morphological analysis. A separate analysis with nikud preserved demonstrates a +4.3% improvement in meaning prediction.

2. Final forms: Letters with final forms (ך→כ, ם→מ, ן→נ, ף→פ, ץ→צ) are normalized to their standard forms.

3. Maqaf handling: Words joined by maqaf (־) are treated as separate words.

4. HTML stripping: All markup from the API response is removed, retaining only Hebrew text.
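
The four steps above can be sketched as a single function. This is a minimal illustration, not the code from Appendix B: the function name `preprocess` is hypothetical, and it assumes that the API markup contains no Hebrew text of its own and that everything other than the 27 Hebrew letter code points (U+05D0 to U+05EA) may be discarded.

```python
import html
import re
from typing import List

# Final forms mapped to their standard forms (step 2).
FINAL_TO_STANDARD = str.maketrans("ךםןףץ", "כמנפצ")
MAQAF = "\u05BE"

def preprocess(raw: str) -> List[str]:
    """Turn one raw verse string into a list of consonant-only words."""
    text = re.sub(r"<[^>]+>", " ", raw)      # step 4: drop HTML tags
    text = html.unescape(text)               # decode entities such as &nbsp;
    text = text.replace(MAQAF, " ")          # step 3: maqaf separates words
    # Step 1: keep only Hebrew letters and whitespace. This removes nikud,
    # cantillation marks, and punctuation in one pass.
    text = re.sub(r"[^\u05D0-\u05EA\s]", "", text)
    text = text.translate(FINAL_TO_STANDARD) # step 2: normalize final forms
    return text.split()

words = preprocess("<b>בְּרֵאשִׁית</b> על\u05BEפני אלהים")
```

Replacing the maqaf with a space before stripping matters: the maqaf falls in the same Unicode range as the vowel marks, so stripping first would fuse the joined words together.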

✦

The Letter Partition

The 22 letters of the Hebrew alphabet are partitioned into four groups:

Group	Letters	Count	Criterion
Foundation	ג ד ז ח ט ס ע פ צ ק ר ש	12	Never serve as grammatical morphemes; always carry root-level semantic content
AMTN	א מ ת נ	4	Serve as both root consonants and grammatical markers (tense, person, reflexive)
YHW	י ה ו	3	Serve as both root consonants and grammatical markers (gender, possession, causation)
BKL	ב כ ל	3	Serve as both root consonants and grammatical markers (prepositions, relation)

This partition is fixed throughout all analyses. No parameter tuning, no optimization, no text-dependent adjustment. The same 22→4 mapping produces every finding in this book. Changing the partition changes every result, making the system fully falsifiable.

The criterion is purely morphological: a letter is Foundation if and only if it never functions as a grammatical prefix, suffix, or inflectional marker in Biblical Hebrew. The remaining 10 letters all have dual roles — sometimes root consonant, sometimes grammatical marker — and are classified into three Control subgroups by their grammatical function.
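
As an illustration, the partition in the table can be written as a fixed lookup and checked for completeness. The names `PARTITION`, `LETTER_TO_GROUP`, `validate_partition`, and `groups_of` are hypothetical; the sketch assumes final forms have already been normalized.

```python
from typing import Dict, List

# The 22 -> 4 mapping from the table above, stored once and never modified.
PARTITION: Dict[str, str] = {
    "Foundation": "גדזחטסעפצקרש",
    "AMTN": "אמתנ",
    "YHW": "יהו",
    "BKL": "בכל",
}

LETTER_TO_GROUP: Dict[str, str] = {
    letter: group for group, letters in PARTITION.items() for letter in letters
}

def validate_partition() -> None:
    """Check that the groups are disjoint and cover all 22 non-final letters."""
    alphabet = {chr(c) for c in range(0x05D0, 0x05EB)} - set("ךםןףץ")
    total = sum(len(letters) for letters in PARTITION.values())
    if total != 22 or set(LETTER_TO_GROUP) != alphabet:
        raise ValueError("partition is not a disjoint cover of the alphabet")

def groups_of(word: str) -> List[str]:
    """Return the group label of each letter in a normalized word."""
    return [LETTER_TO_GROUP[letter] for letter in word]

validate_partition()
labels = groups_of("בראשית")
```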

✦

Foundation%

For any string of Hebrew letters w = c₁c₂...cₙ:

Foundation%(w) = |{i : cᵢ ∈ Foundation}| / n × 100

This is the percentage of letters in the string that belong to the Foundation group. It can be computed for a single word, a verse, a chapter, a book, or the entire Torah.
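
A minimal sketch of the computation follows. The function name `foundation_pct` is hypothetical. It normalizes final forms itself and ignores any character that is not a Hebrew letter, so it can be applied to a word, a verse, or a longer string.

```python
FOUNDATION = set("גדזחטסעפצקרש")
FINAL_TO_STANDARD = str.maketrans("ךםןףץ", "כמנפצ")

def foundation_pct(text: str) -> float:
    """Percentage of Hebrew letters in `text` that belong to Foundation."""
    normalized = text.translate(FINAL_TO_STANDARD)
    letters = [ch for ch in normalized if "\u05D0" <= ch <= "\u05EA"]
    if not letters:
        raise ValueError("text contains no Hebrew letters")
    hits = sum(1 for ch in letters if ch in FOUNDATION)
    return 100.0 * hits / len(letters)

# Two of the six letters in the first word of Genesis are Foundation letters.
first_word = foundation_pct("בראשית")
```

Because every occurrence is counted, a letter that appears twice in a word contributes twice to the numerator.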

✦

ModeScore

For any window of k consecutive verses, let:

Y = count of verses containing the name יהוה (YHWH)
E = count of verses containing the name אלהים (Elohim)

ModeScore = (Y − E) / (Y + E)

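ModeScore ranges from −1 to +1. When ModeScore > 0, the window is YHWH-dominant. When ModeScore < 0, it is Elohim-dominant. When ModeScore ≈ 0, both names appear in roughly equal numbers of verses. The score is undefined for a window in which neither name appears (Y + E = 0). Default window: k = 50 verses (sliding, step = 1).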
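
The sliding-window computation can be sketched as follows. This is an illustration under stated assumptions, not the book's implementation: the function name `mode_scores` is hypothetical, a verse "contains" a name when the name occurs as a substring (so prefixed forms such as האלהים match, while suffixed forms such as אלהיך do not), and a window with Y + E = 0 is reported as `None`.

```python
from typing import List, Optional

FINAL_TO_STANDARD = str.maketrans("ךםןףץ", "כמנפצ")
YHWH = "יהוה"
# Normalize the search term the same way as the text, so the match works
# whether or not final forms were already replaced.
ELOHIM = "אלהים".translate(FINAL_TO_STANDARD)

def mode_scores(verses: List[str], k: int = 50) -> List[Optional[float]]:
    """ModeScore for every window of k consecutive verses (step = 1)."""
    normalized = [v.translate(FINAL_TO_STANDARD) for v in verses]
    has_y = [YHWH in v for v in normalized]
    has_e = [ELOHIM in v for v in normalized]
    scores: List[Optional[float]] = []
    for start in range(len(verses) - k + 1):
        y = sum(has_y[start:start + k])   # verses in window containing YHWH
        e = sum(has_e[start:start + k])   # verses in window containing Elohim
        scores.append((y - e) / (y + e) if (y + e) else None)
    return scores

toy = ["יהוה", "יהוה", "יהוה", "אלהים", "דבר"]
toy_scores = mode_scores(toy, k=4)
```

A verse that contains both names is counted once in Y and once in E.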

✦

Autocorrelation

The autocorrelation function measures how similar the signal is to a time-shifted version of itself:

ACF(τ) = Σᵢ (xᵢ − μ)(xᵢ₊τ − μ) / Σᵢ (xᵢ − μ)²

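where x is the Foundation% (F%) or ModeScore series, μ is its mean, and τ is the lag (in verses). The numerator sums over all i for which both xᵢ and xᵢ₊τ exist; the denominator sums over the whole series. A slow decay of ACF indicates long-range memory; rapid decay indicates independence.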
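
A direct NumPy transcription of the formula is sketched below. The function name `acf` is hypothetical, and the sketch assumes the numerator runs over the pairs that exist at each lag while the denominator runs over the whole series.

```python
import numpy as np

def acf(x, max_lag: int) -> np.ndarray:
    """Autocorrelation of a 1-D series for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()                      # deviations from the mean
    denom = float(np.dot(d, d))           # sum of squared deviations
    if denom == 0.0:
        raise ValueError("series is constant; ACF is undefined")
    n = len(d)
    return np.array(
        [np.dot(d[: n - lag], d[lag:]) / denom for lag in range(max_lag + 1)]
    )

# A strictly alternating series is almost perfectly anti-correlated at lag 1.
alternating = np.tile([1.0, -1.0], 50)
r = acf(alternating, max_lag=2)
```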

✦

Scaling Analysis

The fluctuation function F(s) measures the root-mean-square deviation of the integrated signal from local trends at scale s:

F(s) ∝ sᵅ

The exponent α characterizes the signal:

α = 0.5: uncorrelated (white noise)
α > 0.5: persistent (long-range correlated)
α < 0.5: anti-persistent

We compute α separately for Foundation% and ModeScore, yielding the dual scaling law: α_base = −0.266 for Foundation% and α_mode = −0.056 for ModeScore (ratio 4.7×).
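
For orientation, here is a sketch of one standard way to estimate such an exponent: first-order detrended fluctuation analysis. It is an assumption that this matches the procedure used here, since the text does not specify the detrending order or the set of scales, and the sketch is not claimed to reproduce the values reported above. The function names and the scale list are hypothetical.

```python
import numpy as np

def fluctuation(x, s: int) -> float:
    """RMS deviation of the integrated signal from linear trends at scale s."""
    x = np.asarray(x, dtype=float)
    profile = np.cumsum(x - x.mean())     # integrated (cumulative) signal
    t = np.arange(s)
    mean_sq = []
    for j in range(len(profile) // s):    # non-overlapping windows of length s
        segment = profile[j * s:(j + 1) * s]
        trend = np.polyval(np.polyfit(t, segment, 1), t)  # local linear trend
        mean_sq.append(np.mean((segment - trend) ** 2))
    return float(np.sqrt(np.mean(mean_sq)))

def scaling_exponent(x, scales) -> float:
    """Slope of log F(s) against log s, i.e. the exponent in F(s) ∝ s^α."""
    f = [fluctuation(x, s) for s in scales]
    slope, _intercept = np.polyfit(np.log(scales), np.log(f), 1)
    return float(slope)

# Sanity check on synthetic white noise, which should give α close to 0.5.
rng = np.random.default_rng(0)
alpha_noise = scaling_exponent(rng.standard_normal(5000), [8, 16, 32, 64, 128, 256])
```

Running the estimator on synthetic white noise before applying it to the real series is a cheap way to confirm that the implementation behaves as the interpretation table expects.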

✦

Change-Point Detection

To identify structural boundaries, we use a sliding-window mean-shift detector:

For each position i in the verse-by-verse F% series:

1. Compute mean F% in the left window [i−w, i)

2. Compute mean F% in the right window [i, i+w)

3. Record the absolute difference |right − left|

4. Z-normalize all differences

5. Flag positions where Z > threshold AND the position is a local maximum

Default parameters: w = 40 verses, threshold Z > 1.0. Robustness is verified across w ∈ {20, 30, 40, 50, 60, 75}.
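
The five steps can be sketched in NumPy as follows. The function name `detect_change_points` is hypothetical, and two details are assumptions: positions closer than w verses to either end are skipped because a full window does not fit, and a local maximum is taken to mean "not smaller than the left neighbor and strictly larger than the right neighbor".

```python
from typing import List
import numpy as np

def detect_change_points(series, w: int = 40, z_threshold: float = 1.0) -> List[int]:
    """Indices where the mean of [i, i+w) differs most from the mean of [i-w, i)."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    positions = np.arange(w, n - w + 1)           # both windows must fit
    if len(positions) < 3:
        return []
    csum = np.concatenate(([0.0], np.cumsum(x)))  # prefix sums for fast means
    left = (csum[positions] - csum[positions - w]) / w      # step 1
    right = (csum[positions + w] - csum[positions]) / w     # step 2
    diff = np.abs(right - left)                             # step 3
    spread = diff.std()
    if spread == 0.0:
        return []
    z = (diff - diff.mean()) / spread                       # step 4
    flagged = []
    for j in range(1, len(z) - 1):                          # step 5
        if z[j] > z_threshold and z[j] >= z[j - 1] and z[j] > z[j + 1]:
            flagged.append(int(positions[j]))
    return flagged

# A single step from 0 to 1 at index 100 should yield exactly one boundary.
step_series = np.concatenate((np.zeros(100), np.ones(100)))
boundaries = detect_change_points(step_series, w=20)
```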

✦

Shuffle Tests

All statistical claims are verified against null models:

1. Partition shuffle: Randomly reassign 22 letters to groups of 12/4/3/3 (1,000 iterations)

2. Position shuffle: Randomly permute letters within each word (1,000 iterations)

3. Boundary shuffle: Randomly place the same number of boundaries across the text (1,000 iterations)

4. Text shuffle: Randomly permute verses while preserving verse-level statistics (1,000 iterations)

A finding is reported as significant only if the real value exceeds 95% of shuffled values (p < 0.05).
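
The partition shuffle and the significance rule can be sketched as below. All names are hypothetical, and `statistic` stands for whatever quantity is being tested, supplied by the caller as a function of a letter-to-group mapping. The p-value is taken here as the plain fraction of shuffled values that reach or exceed the real value, which is one reading of the criterion above.

```python
import random
from typing import Callable, Dict, List

# The 22 letters with final forms already normalized.
LETTERS = "אבגדהוזחטיכלמנסעפצקרשת"
GROUP_SIZES = [("Foundation", 12), ("AMTN", 4), ("YHW", 3), ("BKL", 3)]

def random_partition(rng: random.Random) -> Dict[str, str]:
    """Assign the 22 letters at random to groups of size 12/4/3/3."""
    letters = list(LETTERS)
    rng.shuffle(letters)
    partition: Dict[str, str] = {}
    start = 0
    for name, size in GROUP_SIZES:
        for letter in letters[start:start + size]:
            partition[letter] = name
        start += size
    return partition

def shuffle_p_value(real_value: float, null_values: List[float]) -> float:
    """Fraction of null values that are at least as large as the real value."""
    return sum(1 for v in null_values if v >= real_value) / len(null_values)

def partition_shuffle_test(
    statistic: Callable[[Dict[str, str]], float],
    real_partition: Dict[str, str],
    n_iter: int = 1000,
    seed: int = 0,
) -> float:
    """p-value of statistic(real_partition) against random 12/4/3/3 partitions."""
    rng = random.Random(seed)
    real_value = statistic(real_partition)
    null_values = [statistic(random_partition(rng)) for _ in range(n_iter)]
    return shuffle_p_value(real_value, null_values)

example = random_partition(random.Random(1))
```

Fixing the seed makes a run repeatable, which helps when the 1,000-iteration results are compared across machines.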

✦

Software

All code is written in Python 3.8+ using the standard library (collections, json, re) together with NumPy, plus scikit-learn for the gradient boosting (GBM) classifier. No commercial software is required. Complete source code for all four algorithms is provided in Appendix B.

✦

Reproducibility

To reproduce any finding:

1. Install Python 3.8+

2. Run `python3 torah_root_analyzer.py --demo` (downloads Torah from Sefaria automatically)

3. Execute the relevant analysis script

All data is public. All code is provided. All parameters are documented above. No hidden steps.

✦ ✦ ✦