• tetris11@feddit.uk
    11 hours ago

    Hierarchical letter clustering would be my guess, or graph-based clustering that uses n-grams of length 2-4 as nodes and maximises the number of connections between them.

    Or building an optimized regex and printing out its DFA?
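
To make the regex/DFA idea concrete, here is a minimal sketch in Python (not the R used further down). It builds a plain prefix tree over the month names, which is one valid but unminimised DFA for the twelve words; a real regex optimiser would also merge shared suffixes such as "ber". The names `build_trie`, `accepts` and `count_states` are made up for this illustration.

```python
MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]

def build_trie(words):
    """Return a nested-dict prefix tree; the key "$" marks an accepting state."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = True
    return root

def accepts(trie, word):
    """Follow one transition per letter; accept only if we end on a "$" state."""
    node = trie
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return "$" in node

def count_states(node):
    """Count the states (tree nodes) reachable from `node`, including itself."""
    return 1 + sum(count_states(child)
                   for key, child in node.items() if key != "$")

trie = build_trie(MONTHS)
print(accepts(trie, "march"), accepts(trie, "maril"))
print(count_states(trie))
```

Because a prefix tree only shares prefixes, it can never accept a non-month; the price is many more states than a hand-drawn graph that also shares endings.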

    Edit: Quick N-gram analysis (min=3, max=num letters in that month)

    R-code
    library(ngram)
    
    tmonths = c("january", "february", "march",
               "april", "may", "june", "july",
               "august", "september", "october",
               "november", "december")
    
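    # Space-separate the letters so ngram treats each letter as a "word",
    # then collect every n-gram from 3 letters up to the whole name.
    zzz = lapply(tmonths, function(mon){
      spaced = paste(unlist(strsplit(mon, split="")), collapse=" ")
      ng = ngram::ngram_asweka(spaced, min=3, max=nchar(mon))
      return(gsub(" ", "", ng))  # remove the spaces again
    })
    # Tabulate across all months and keep n-grams that occur more than once.
    res = sort(table(unlist(zzz)))
    res[res > 1]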
    

    This gives the following 9 n-grams with a frequency greater than 1:

      ary   uar  uary   emb  embe ember   mbe  mber   ber 
        2     2     2     3     3     3     3     3     4 
    

    As you can see, the two longest recurring motifs are “em-ber” and “uar-y”.
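
For anyone without R handy, the same count can be cross-checked with a short Python sketch. It assumes the same settings as the analysis above (contiguous letter n-grams, minimum length 3, maximum the length of the month name); `letter_ngrams` is a hypothetical helper, not part of any library.

```python
from collections import Counter

MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]

def letter_ngrams(word, min_len=3):
    """All contiguous substrings of `word` with at least `min_len` letters."""
    return [word[i:i + n]
            for n in range(min_len, len(word) + 1)
            for i in range(len(word) - n + 1)]

# Count every n-gram across all months, then keep those seen more than once.
counts = Counter(g for month in MONTHS for g in letter_ngrams(month))
repeated = {gram: c for gram, c in counts.items() if c > 1}

# Print in ascending order of frequency, ties broken alphabetically.
for gram, count in sorted(repeated.items(), key=lambda kv: (kv[1], kv[0])):
    print(gram, count)
```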

    Using this, I propose the following graph:

    Mermaid
    stateDiagram
        direction LR
        sept --> em
        nov --> em
        dec --> em
        em --> ber
        oc --> to
        to --> ber
        febr --> uar
        uar --> y
        jan --> uar
        ju --> ne
        ju --> l
        l --> y
        ma --> r
        ma --> y
        r --> ch
        
        a --> p 
        p --> r
        r --> il
        a --> u
        u --> gust
    
    

      • tetris11@feddit.uk
        11 hours ago

        I’m really disappointed by June, April and August. Without these months, everything would be so neat and orderly

    • tetris11@feddit.uk
      11 hours ago

      Interestingly

      • Aprch
      • Maril

      are the only two hallucinations; every other path through the graph spells a legitimate month.
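
A quick way to check this is to walk every start-to-end path of the graph and compare the spelled words against the real month names. The sketch below is in Python rather than R, the edge list is transcribed by hand from the diagram, and `start_nodes` and `spell_paths` are hypothetical helper names. One assumption to flag: the February node is written `febr` here so that the path spells "february", since `feb` + `uar` + `y` would spell "febuary".

```python
MONTHS = {"january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"}

# Edges transcribed from the diagram; "febr" is an assumption (see above).
EDGES = {
    "sept": ["em"], "nov": ["em"], "dec": ["em"], "em": ["ber"],
    "oc": ["to"], "to": ["ber"],
    "febr": ["uar"], "jan": ["uar"], "uar": ["y"],
    "ju": ["ne", "l"], "l": ["y"],
    "ma": ["r", "y"], "r": ["ch", "il"],
    "a": ["p", "u"], "p": ["r"], "u": ["gust"],
}

def start_nodes(edges):
    """Nodes that have outgoing edges but are never the target of one."""
    targets = {t for outs in edges.values() for t in outs}
    return [node for node in edges if node not in targets]

def spell_paths(edges, node, prefix=""):
    """Yield the word spelled by each path from `node` to a terminal node."""
    word = prefix + node
    successors = edges.get(node, [])
    if not successors:
        yield word
    for nxt in successors:
        yield from spell_paths(edges, nxt, word)

words = {w for start in start_nodes(EDGES) for w in spell_paths(EDGES, start)}
hallucinations = sorted(words - MONTHS)   # spelled by the graph, not a month
missing = sorted(MONTHS - words)          # months the graph cannot spell

print(hallucinations)
print(missing)
```

Both stray words come from the shared `r` node, which sits on the March path and the April path at once; splitting it into two nodes would remove them at the cost of one extra state.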