Title: | String Diff, Match, and Patch Utilities |
---|---|
Description: | A wrapper for Google's 'diff-match-patch' library. It provides basic tools for computing diffs, finding fuzzy matches, and constructing / applying patches to strings. |
Authors: | Colin Rundel [aut, cre], Google Inc. [cph] (diff_match_patch.h), Neil Fraser [cph] (diff_match_patch.h), Mike Slemmer [cph] (diff_match_patch.h), Sergey Nozhenko [cph] (diff_match_patch.h), Christian Leutloff [cph] (diff_match_patch.h) |
Maintainer: | Colin Rundel <[email protected]> |
License: | Apache License (>= 2) |
Version: | 0.1.0 |
Built: | 2024-11-07 02:40:06 UTC |
Source: | https://github.com/rundel/diffmatchpatch |
The following functions are used to construct or work with diff(s) between text strings. Specifically, diff_make() computes the character-level differences between the source string (x) and the destination string (y). These diffs can be made more human friendly by a secondary cleaning process, selected with the cleanup argument.
Once computed, diffs are represented using diff_df data frames, which consist of just two columns: text and op. Basic convenience functions for pretty printing of these are provided by the package.
The following helper functions are provided:
print() - prints a diff using ANSI colors if available.
as.character() - converts a diff (using ANSI colors if available) to a character vector.
diff_levenshtein() - calculates the Levenshtein distance of a diff.
diff_to_delta() - converts a diff to a delta string.
diff_from_delta() - creates a diff from a source string (x) and a delta string.
diff_to_html() - converts a diff to a pretty HTML string.
diff_to_patch() - converts a diff to a patch string.
diff_text_source() - recovers the source string from a diff.
diff_text_dest() - recovers the destination string from a diff.
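To make the two-column representation concrete, here is a minimal base R sketch of what the source, destination, and Levenshtein helpers compute. It uses a plain data frame standing in for a diff_df. The op labels ("EQUAL", "DELETE", "INSERT") and the sketch_* function names are assumptions for illustration, not the package's internals. The distance rule follows the approach of the upstream diff-match-patch library, where each run of edits between equalities costs the larger of its inserted and deleted character counts.

```r
# A plain data frame standing in for a diff_df (op labels are assumed).
d <- data.frame(
  text = c("abc", "def", "hij"),
  op   = c("EQUAL", "DELETE", "INSERT"),
  stringsAsFactors = FALSE
)

# Source text: every piece that was kept or deleted.
sketch_source <- function(d) paste(d$text[d$op != "INSERT"], collapse = "")

# Destination text: every piece that was kept or inserted.
sketch_dest <- function(d) paste(d$text[d$op != "DELETE"], collapse = "")

# Levenshtein distance of a diff: each run of edits between two
# equalities costs max(inserted chars, deleted chars).
sketch_levenshtein <- function(d) {
  total <- 0L
  ins <- 0L
  del <- 0L
  for (i in seq_len(nrow(d))) {
    n <- nchar(d$text[i])
    if (d$op[i] == "INSERT") {
      ins <- ins + n
    } else if (d$op[i] == "DELETE") {
      del <- del + n
    } else {
      total <- total + max(ins, del)
      ins <- 0L
      del <- 0L
    }
  }
  total + max(ins, del)
}

sketch_source(d)
sketch_dest(d)
sketch_levenshtein(d)
```

In this sketch the three-character deletion and three-character insertion fall in the same run, so they count as three substitutions rather than six separate edits.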
diff_make(x, y, cleanup = "semantic", checklines = TRUE)
diff_levenshtein(diff)
diff_to_delta(diff)
diff_from_delta(x, delta)
diff_to_html(diff)
diff_to_patch(diff)
diff_text_source(diff)
diff_text_dest(diff)
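x | The source string |
y | The destination string |
cleanup | Determines the cleanup method applied to the diffs. Allowed values are described below. |
checklines | Performance flag - if TRUE, a line-level diff is run first to identify the changed areas, which is faster but can give a slightly less optimal diff. |
diff | A diff_df data frame. |
delta | A delta string. |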
semantic - Reduce the number of edits by eliminating semantically trivial equalities.
semantic lossless - Look for single edits surrounded on both sides by equalities which can be shifted sideways to align the edit to a word boundary. e.g: The c<ins>at c</ins>ame. -> The <ins>cat </ins>came.
efficiency - Reduce the number of edits by eliminating operationally trivial equalities.
merge - Reorder and merge like edit sections. Merge equalities. Any edit section can move as long as it doesn't cross an equality.
none - Do not apply any cleanup methods to the diffs.
diff_make() returns a diff_df data frame containing the diffs.
diff_levenshtein() returns the Levenshtein distance as an integer.
diff_to_delta() returns a character string.
diff_from_delta() returns a diff_df data frame.
diff_to_html() returns a character string.
diff_to_patch() returns a character string.
diff_text_source() returns a character string.
diff_text_dest() returns a character string.
(d = diff_make("abcdef", "abchij"))
diff_levenshtein(d)
diff_to_html(d)
diff_text_source(d)
diff_text_dest(d)
diff_to_patch(d)
(delta = diff_to_delta(d))
diff_from_delta("abcdef", delta)
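As a rough illustration of what a delta string encodes, the sketch below applies a delta to a source string in base R. It assumes the delta format of the upstream diff-match-patch library: tab-separated tokens where =n keeps n characters, -n deletes n characters, and +text inserts URL-encoded text. sketch_apply_delta() is a hypothetical name and not part of this package, and the delta shown is an assumed example rather than captured output.

```r
# Apply a delta string to a source string (assumed upstream delta format).
sketch_apply_delta <- function(x, delta) {
  tokens <- strsplit(delta, "\t", fixed = TRUE)[[1]]
  pos <- 1L            # next unread character of the source string
  out <- character(0)  # pieces of the destination string
  for (tok in tokens) {
    kind <- substr(tok, 1, 1)
    arg <- substring(tok, 2)
    if (kind == "+") {
      # Inserted text is stored URL-encoded.
      out <- c(out, utils::URLdecode(arg))
    } else if (kind == "=" || kind == "-") {
      n <- as.integer(arg)
      # "=" copies n characters from the source, "-" skips them.
      if (kind == "=") out <- c(out, substr(x, pos, pos + n - 1L))
      pos <- pos + n
    } else {
      stop("unknown delta token: ", tok)
    }
  }
  paste(out, collapse = "")
}

sketch_apply_delta("abcdef", "=3\t-3\t+hij")
```

Because a delta stores only counts for kept and deleted text, it is compact but can only be interpreted together with the original source string, which is why diff_from_delta() takes both x and delta.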
Allows for examining or setting options that affect the behavior of the diff, match, and patch related functions in this package.
dmp_options(...)
... | With no arguments, returns all current options and their values. Character values retrieve a subset of options and their current values. Options can be set using name = value; only the options named below are used. Options can also be passed as a single unnamed argument that is a named list. |
diff_timeout (float) - Number of seconds to map a diff before giving up (0 for infinity).
diff_edit_cost (int) - Cost of an empty edit operation in terms of edit characters.
match_threshold (float) - At what point is no match declared (0.0 = perfection, 1.0 = very loose).
match_distance (int) - How far to search for a match (0 = exact location, 1000+ = broad match). A match this many characters away from the expected location will add 1.0 to the score (0.0 is a perfect match).
patch_delete_threshold (float) - When deleting a large block of text (over ~64 characters), how closely the contents have to match the expected contents (0.0 = perfection, 1.0 = very loose). Note that match_threshold controls how closely the end points of a delete need to match.
patch_margin (int) - Chunk size for context length.
match_max_bits (int) - The number of bits in an int.
When getting options, returns a named list of options and their current values. When setting options, returns a named list of the previous value(s).
dmp_options()
dmp_options("diff_timeout")
prev = dmp_options(diff_timeout = 5)
prev
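Because setting options returns the previous values as a named list, and a named list can be passed back as a single unnamed argument, a temporary change can be undone afterwards. The sketch below wraps that pattern in a hypothetical helper, with_dmp_options(), which is not part of the package; it relies only on the getting and setting behavior described above.

```r
library(diffmatchpatch)

# Hypothetical helper: evaluate `code` with temporary options, then
# restore whatever values were in effect before.
with_dmp_options <- function(opts, code) {
  prev <- dmp_options(opts)               # set; returns previous values
  on.exit(dmp_options(prev), add = TRUE)  # restore on exit, even on error
  force(code)                             # evaluated lazily, after the set
}

# Run one diff with a 5 second timeout without changing the global setting.
with_dmp_options(
  list(diff_timeout = 5),
  diff_make("abcdef", "abchij")
)
```

Restoring through on.exit() means the global options are put back even if the wrapped call stops with an error.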
Locate the best instance of pattern in the text near loc using the Bitap algorithm. Returns -1 if no match is found. Assumes R's typical 1-based indexing for loc and the returned value.
This algorithm makes use of the match_distance and match_threshold options to determine the match. If these values are not set explicitly via the threshold and distance arguments, the currently set global option values are used.
Candidate matches are scored based on: a) the number of spelling differences between the pattern and the text and b) the distance between the candidate match and the expected location.
The match_distance option determines the relative importance of these two metrics.
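The way the two metrics combine can be sketched in base R. This follows the scoring rule of the upstream diff-match-patch library (error fraction plus distance from the expected location divided by match_distance). The function and argument names are hypothetical, and the default of 1000 for distance is assumed from upstream rather than taken from this package.

```r
# Score a candidate match: 0 is perfect, larger is worse.
# errors      - number of mismatched characters in the candidate
# pattern_len - number of characters in the pattern
# found       - position where the candidate starts
# expected    - position where the match was expected (loc)
# distance    - scaling for the location penalty (assumed default)
sketch_match_score <- function(errors, pattern_len, found, expected,
                               distance = 1000) {
  accuracy <- errors / pattern_len
  proximity <- abs(expected - found)
  if (distance == 0) {
    # With no tolerance for distance, only the exact location can match.
    return(if (proximity == 0) accuracy else 1)
  }
  accuracy + proximity / distance
}

# One typo in a 10-character pattern, found exactly where expected.
sketch_match_score(1, 10, found = 1, expected = 1)

# Exact text, but 500 characters away from the expected location.
sketch_match_score(0, 10, found = 501, expected = 1)

# The same candidate with a tighter distance setting.
sketch_match_score(0, 10, found = 501, expected = 1, distance = 100)
```

Under this rule a candidate is accepted only when its score does not exceed the threshold, so a smaller distance makes location matter more and a smaller threshold makes both metrics stricter.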
match_find(text, pattern, loc = 1L, threshold = NULL, distance = NULL)
text | The text to search. |
pattern | The pattern to search for. |
loc | The expected location of the pattern. |
threshold | Threshold for determining a match (0 - perfect match, 1 - very loose). |
distance | Distance from expected location scaling for score penalty. |
Index of best match or -1 for no match.
x = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
match_find(x, "Loren Ibsen")
match_find(x, "Loren Ibsen", threshold = 0.1)
match_find(x, "minimum")
match_find(x, "minimum", threshold = 0.4)
Patches are constructed via patch_make() and applied using patch_apply().
patch_make(x, y)
patch_apply(x, patch)
x | The source string |
y | The destination string |
patch | A string representation of the patch(es). |
patch_make() returns a string representation of the patch(es).
patch_apply() returns the patched version of the string x; its matches attribute contains logical values indicating which patches were successfully applied.
(p = patch_make("abcdef", "abchij"))
patch_apply("abcdef", p)
patch_apply("abc", p)
patch_apply("def", p)
patch_apply("hij", p)
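Since patch_apply() returns a result even when some patches fail, callers may want to inspect the matches attribute before trusting the output. The sketch below shows one way to do that. apply_strict() is a hypothetical helper, not part of the package, and it assumes only the return value and attribute described above.

```r
library(diffmatchpatch)

p <- patch_make("abcdef", "abchij")

# Hypothetical helper: apply a patch and stop if any patch was rejected.
apply_strict <- function(x, patch) {
  res <- patch_apply(x, patch)
  ok <- attr(res, "matches")  # one logical value per patch
  if (!all(ok)) {
    stop(sum(!ok), " of ", length(ok), " patch(es) could not be applied")
  }
  res
}

apply_strict("abcdef", p)
```

A stricter or looser policy, such as warning instead of stopping, can be substituted depending on how much partial application is acceptable.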