Page 1 of 1

Sørensen–Dice coefficient

Posted: 09 Oct 2018, 14:05
by Chunjee
10.16.18 update:
With the help of people in this thread, I put together a library that mirrors and expands on the output of string-similarity on npm.
https://github.com/Chunjee/string-similarity.ahk



I'm not sure how to turn this into a function:
Image


Does anyone have something similar or have a starting point?


Other notes:
https://en.wikipedia.org/wiki/S%C3%B8re ... oefficient

Re: Sørensen–Dice coefficient

Posted: 09 Oct 2018, 14:16
by nnnik
Normally I would use the Levenshtein Distance.
For this one I would probably collect the bigrams of both strings in seperate arrays and then compare the arrays.

Code: Select all

sharedBigrams := 0
for bigram, _ in bigramsOfFirstWord {
	if bigramsofSecondWord.hasKey(bigram) {
		sharedBigrams++
	}
}

Re: Sørensen–Dice coefficient

Posted: 09 Oct 2018, 14:32
by MannyKSoSo
For the start we can look at SubStr() here https://autohotkey.com/docs/commands/SubStr.htm
This gives us a good starting point for how we want to find each set

Code: Select all

Woord1 := "pizazz"
Woord2 := "pizzaz"

LoopCount = 0
Length := StrLen(Woord1)
Loop, % Length-1 {
	Woord1_%A_Index% := SubStr(Woord1, A_Index, 2)
	Woord2_%A_Index% := SubStr(Woord2, A_Index, 2)
	LoopCount := A_Index
}
Loop, %LoopCount%
	MsgBox % Woord1_%A_Index% ":" Woord2_%A_Index%
Note with my example I used six letter words, but this works with the example you gave. Then the rest should be fairly easy as you loop through the total count and just compare each string

Re: Sørensen–Dice coefficient

Posted: 09 Oct 2018, 14:40
by Chunjee
nnnik wrote:Normally I would use the Levenshtein Distance.
For this one I would probably collect the bigrams of both strings in seperate arrays and then compare the arrays.
This is what I was working with but I'm having trouble returning a number between 0 and 1 for similarity.

Re: Sørensen–Dice coefficient  Topic is solved

Posted: 09 Oct 2018, 14:40
by jeeswg
- Something like this perhaps.
- Note: the Wikipedia article didn't mention whether duplicate bigrams should be discarded. A Wikibooks link (below) mentions that if you did discard duplicates, then 'AA' and 'AAAA' would be considered identical. Thus duplicates should be maintained. Cheers.

Code: Select all

;Sørensen–Dice coefficient - Wikipedia
;https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient

;note:
;duplicate bigrams in a word should be counted distinctly
;(per discussion), otherwise 'AA' and 'AAAA' would have a
;dice coefficient of 1
;Algorithm Implementation/Strings/Dice's coefficient - Wikibooks, open books for an open world
;https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice%27s_coefficient

q:: ;Sørensen-Dice coefficient
vText1 := "night", vText2 := "nacht"
;vText1 := "AA", vText2 := "AAAA"
vCount := 0
oArray := {base:{__Get:Func("Abs").Bind(0)}} ;make default key value 0 instead of a blank string
Loop, % vCount1 := StrLen(vText1) - 1
	oArray["z" SubStr(vText1, A_Index, 2)]++
Loop, % vCount2 := StrLen(vText2) - 1
	if (oArray["z" SubStr(vText2, A_Index, 2)] > 0)
	{
		oArray["z" SubStr(vText2, A_Index, 2)]--
		vCount++
	}
vDSC := (2 * vCount) / (vCount1 + vCount2)
MsgBox, % vCount " " vCount1 " " vCount2 "`r`n" vDSC
return

Re: Sørensen–Dice coefficient

Posted: 09 Oct 2018, 14:55
by nnnik
Yeah but tbh this method is not worth it.
The words ste and set share no similarities according to this method.

@jeeswg:
while keeping the second word in a string format rather than turning it into an object might be faster when doing a single comparison.
However it is more likely that said comparison will be used to find an entry on a list of strings.
Once you have multiple entries arrays will be faster than strings.

Re: Sørensen–Dice coefficient

Posted: 09 Oct 2018, 15:01
by nnnik
For the levenshtein distance I could dig out the following:
https://stackoverflow.com/questions/387 ... make-sense

Re: Sørensen–Dice coefficient

Posted: 09 Oct 2018, 15:11
by jeeswg
@nnnik: Didn't you already write a function?
Damerau-Levenshtein Distance - Fuzzy searches - AutoHotkey Community
https://autohotkey.com/boards/viewtopic.php?f=6&t=39112

Re: Sørensen–Dice coefficient

Posted: 09 Oct 2018, 15:18
by nnnik
Yeah I did - but I didn't like the results.
The topic mentions one of the basic problems of the levenshtein distance.
The jaro winkler distance seems to be better for these tasks.

Re: Sørensen–Dice coefficient

Posted: 10 Oct 2018, 14:30
by Chunjee
nnnik wrote:The jaro winkler distance seems to be better for these tasks.
I'm interested in this as well but not smart enough to write it myself.



Here's what I came up with trying to duplicate the output of the javascript library string-similarity:

Code: Select all

Class stringsimilarity {
	; #Include ../../lib/sort_arrays/export.ahk

	__New() {
        this.info_Array
	}


    compareTwoStrings(para_string1,para_string2) {
        ;Sørensen-Dice coefficient
        vCount := 0
        oArray := {}
        oArray := {base:{__Get:Func("Abs").Bind(0)}} ;make default key value 0 instead of a blank string
        Loop, % vCount1 := StrLen(para_string1) - 1
            oArray["z" SubStr(para_string1, A_Index, 2)]++
        Loop, % vCount2 := StrLen(para_string2) - 1
            if (oArray["z" SubStr(para_string2, A_Index, 2)] > 0) {
                oArray["z" SubStr(para_string2, A_Index, 2)]--
                vCount++
            }
        vDSC := (2 * vCount) / (vCount1 + vCount2)
        ; MsgBox, % vCount " " vCount1 " " vCount2 "`r`n" vDSC
        return Round(vDSC,2)
    }


    findBestMatch(para_string,para_array) {
        if (!IsObject(para_array)) {
            return false
        }
        this.info_Array := []

        ; Score each option and save into a new array
        loop, % para_array.MaxIndex() {
            this.info_Array[A_Index, "rating"] := this.compareTwoStrings(para_string, para_array[A_Index])
            this.info_Array[A_Index, "target"] := para_array[A_Index]
        }

        ;sort the scored array and return the bestmatch
        sortedArray := Sort2DArrayFast(this.info_Array,"rating", false) ;false reverses the order so the highest scoring is at the top
        oObject := {bestMatch:sortedArray[1], ratings:sortedArray}
        return oObject
    }


    simpleBestMatch(para_string,para_array) {
        if (!IsObject(para_array)) {
            return false
        }

        l_array := this.findBestMatch(para_string,para_array)
        return l_array.bestMatch.target
    }
}
it has a dependency so I still need to figure out how to package all that up.

Re: Sørensen–Dice coefficient

Posted: 10 Oct 2018, 15:02
by FanaticGuru
nnnik wrote:Normally I would use the Levenshtein Distance.
For this one I would probably collect the bigrams of both strings in seperate arrays and then compare the arrays.

Code: Select all

sharedBigrams := 0
for bigram, _ in bigramsOfFirstWord {
	if bigramsofSecondWord.hasKey(bigram) {
		sharedBigrams++
	}
}
If people want to play around with Ngram, here is a function for it:
https://autohotkey.com/boards/viewtopic.php?t=7302

I have had great success using it as a fuzzy find to search for text in a large text haystack.

FG