Delimiter

A stylistic depiction of a fragment from a CSV-formatted text file. The commas (shown in red) are used as field delimiters.

A delimiter is a sequence of one or more characters used to specify the boundary between separate, independent regions in plain text or other data streams.[1] An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values.

Delimiters are one of several means of specifying boundaries in a data stream. Declarative notation, for example, is an alternative method that uses a length field at the start of a data stream to specify the number of characters that the data stream contains.[2]
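
The contrast with declarative notation can be illustrated with a minimal Python sketch. The wire format used here (a decimal character count, a colon, then the field) and the function names are assumptions chosen for illustration; they are not taken from the cited source.

```python
def encode_length_prefixed(fields):
    """Prefix each field with its character count and a colon."""
    return "".join(f"{len(field)}:{field}" for field in fields)


def decode_length_prefixed(stream):
    """Recover the fields using only the declared lengths."""
    fields = []
    pos = 0
    while pos < len(stream):
        colon = stream.index(":", pos)      # end of the length prefix
        length = int(stream[pos:colon])     # declared number of characters
        start = colon + 1
        fields.append(stream[start:start + length])
        pos = start + length                # jump past the field body
    return fields


encoded = encode_length_prefixed(["nancy", "$30,000", "a:b"])
print(encoded)                          # 5:nancy7:$30,0003:a:b
print(decode_length_prefixed(encoded))  # ['nancy', '$30,000', 'a:b']
```

Because the reader skips over each field by its declared length, commas and colons inside a field are never inspected and cannot be mistaken for boundaries.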

Overview

Delimiters can be broken down into:

  • Field and record delimiters; and
  • Bracket delimiters.

Field and record delimiters

Field delimiters separate data fields. Record delimiters separate groups of fields.[3]

For example, the CSV file format uses a comma as the delimiter between fields and an end-of-line indicator as the delimiter between records:

fname,lname,age,salary
nancy,davolio,33,$30000
erin,borakova,28,$25250
tony,raphael,35,$28700

This fragment specifies a simple flat file database table using the CSV file format.
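
As an illustration, the table above can be read with Python's standard csv module, which splits records at line ends and fields at commas. This is only a sketch; the variable names are arbitrary.

```python
import csv
import io

text = """fname,lname,age,salary
nancy,davolio,33,$30000
erin,borakova,28,$25250
tony,raphael,35,$28700
"""

# DictReader treats the first record as the list of field names.
rows = list(csv.DictReader(io.StringIO(text)))
for row in rows:
    print(row["fname"], row["salary"])
```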

Bracket delimiters

Bracket delimiters (also block delimiters, region delimiters or balanced delimiters) mark both the start and end of a region of text.[4][5]

Common examples of bracket delimiters include:[6]

Delimiters   Description
( and )      Parentheses. The Lisp programming language syntax is cited as recognizable primarily by its use of parentheses.[7]
{ and }      Curly brackets.[8]
< and >      Angle brackets.[9]
" and "      Double quotes, commonly used to denote string literals.[10]
' and '      Single quotes, commonly used to denote string literals.[10]
<? and ?>    Used to indicate XML processing instructions.[11]
/* and */    Used to denote comments in some programming languages.[12]
<% and %>    Used in some web templates to specify language boundaries. These are also called template delimiters.[13]

Conventions

Computing platforms historically use certain delimiters by convention.[14] The following tables depict just a few examples for comparison.

Programming languages (see also Comparison of programming languages (syntax)).

Language   End of statement   String literal
Pascal     semicolon          single quote
C          semicolon          double quote (single quote for character literals)

Field and record delimiters (see also ASCII, Control character).

System                                     End of field                           End of record                            End of file
Unix, Mac OS X, AmigaOS                    Tab                                    LF                                       none
Windows, MS-DOS, OS/2, CP/M                Tab                                    CRLF                                     Control-Z[15]
Classic Mac OS, Apple DOS, ProDOS, GS/OS   Tab                                    CR                                       none
ASCII/Unicode                              UNIT SEPARATOR, position 31 (U+001F)   RECORD SEPARATOR, position 30 (U+001E)   FILE SEPARATOR, position 28 (U+001C)

Delimiter collision

Delimiter collision is a problem that occurs when an author or programmer introduces delimiters into text without actually intending them to be interpreted as boundaries between separate regions.[3][16] In the case of XML, for example, this can occur whenever an author attempts to specify an angle bracket character. Most file types have both a field delimiter and a record delimiter, and both are subject to collision. In comma-separated values files, for example, field delimiter collision occurs whenever an author attempts to include a comma as part of a field value (e.g., salary = "$30,000"), and record delimiter collision occurs whenever a field contains multiple lines. Both kinds of collision occur frequently in text files.
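
The field delimiter collision described above can be reproduced with a naive split. The record below is hypothetical and is meant to hold four fields, the last being the salary "$30,000".

```python
# Intended fields: fname, lname, age, salary ("$30,000").
record = "nancy,davolio,33,$30,000"

# A naive split cannot tell the comma inside the salary from a field boundary.
fields = record.split(",")
print(len(fields))   # 5
print(fields)        # ['nancy', 'davolio', '33', '$30', '000']
```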

In some contexts, a malicious user or attacker may seek to exploit this problem intentionally. Consequently, delimiter collision can be the source of security vulnerabilities and exploits. Malicious users can take advantage of delimiter collision in languages such as SQL and HTML to deploy such well-known attacks as SQL injection and cross-site scripting, respectively.
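
A minimal sketch of such an attack, using Python's standard sqlite3 module with a hypothetical table and hostile input: the quote characters supplied by the user act as SQL string delimiters when the query is built by concatenation. Parameter binding is shown as one common mitigation; it is not discussed in the sources cited here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (fname TEXT, salary INTEGER)")
conn.executemany("INSERT INTO staff VALUES (?, ?)",
                 [("nancy", 30000), ("erin", 25250)])

user_input = "x' OR '1'='1"   # hypothetical hostile input

# Unsafe: the quotes in the input close and reopen the SQL string literal,
# so the WHERE clause becomes  fname = 'x' OR '1'='1'  and matches every row.
unsafe_sql = "SELECT fname FROM staff WHERE fname = '" + user_input + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safer: a bound parameter is passed as data and is never parsed as SQL.
safe_rows = conn.execute("SELECT fname FROM staff WHERE fname = ?",
                         (user_input,)).fetchall()

print(len(unsafe_rows), len(safe_rows))   # 2 0
```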

Solutions

Because delimiter collision is a very common problem, various methods for avoiding it have been invented. Some authors may attempt to avoid the problem by choosing a delimiter character (or sequence of characters) that is not likely to appear in the data stream itself. This ad hoc approach may be suitable, but it necessarily depends on a correct guess of what will appear in the data stream, and offers no security against malicious collisions. Other, more formal conventions are therefore applied as well.

ASCII delimited text

The ASCII and Unicode character sets provide non-printing characters that are intended for use as delimiters: the range from ASCII 28 (File Separator) to ASCII 31 (Unit Separator). Using ASCII 31 (Unit Separator) as the field delimiter and ASCII 30 (Record Separator) as the record delimiter avoids collisions with the commas, tabs, and line breaks that appear in a text data stream, provided the data itself does not contain these control characters.[17]
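
A minimal Python sketch of ASCII delimited text follows. The helper names dumps and loads are hypothetical, and the sketch assumes that no field contains the two control characters themselves.

```python
US = "\x1f"  # ASCII 31, unit separator: placed between fields
RS = "\x1e"  # ASCII 30, record separator: placed between records


def dumps(records):
    """Join fields with US and records with RS."""
    return RS.join(US.join(fields) for fields in records)


def loads(text):
    """Split on RS, then on US, to recover the records."""
    return [record.split(US) for record in text.split(RS)]


table = [
    ["nancy", "davolio", "$30,000"],              # comma inside a field
    ["erin", "line one\nline two", "$25,250"],    # line break inside a field
]
print(loads(dumps(table)) == table)   # True
```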

Escape character

One method for avoiding delimiter collision is to use escape characters. From a language design standpoint, these are adequate, but they have drawbacks:

  • text can be rendered unreadable when littered with numerous escape characters, a problem referred to as leaning toothpick syndrome (due to the use of \ to escape / in Perl regular expressions, leading to sequences such as "\/\/");
  • text becomes difficult to parse with regular expressions;
  • they require a mechanism to "escape the escapes" when these are not intended as escape characters;
  • although easy to type, they can be cryptic to someone unfamiliar with the language;[18] and
  • they do not protect against injection attacks.
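
The mechanism, including the need to "escape the escapes", can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes a single-character delimiter and a backslash as the escape character.

```python
def escape(field, delimiter=",", esc="\\"):
    """Prefix every escape character and delimiter in the field with esc."""
    # Escape the escape character first, so the escapes added for
    # delimiters are not themselves doubled.
    return field.replace(esc, esc + esc).replace(delimiter, esc + delimiter)


def split_escaped(line, delimiter=",", esc="\\"):
    """Split on unescaped delimiters and remove the escape characters."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == esc:
            current.append(next(chars, ""))   # take the next character literally
        elif ch == delimiter:
            fields.append("".join(current))   # unescaped delimiter: field ends
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


row = ["nancy", "$30,000", "C:\\temp"]
line = ",".join(escape(field) for field in row)
print(line)                        # nancy,$30\,000,C:\\temp
print(split_escaped(line) == row)  # True
```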

Escape sequence

Escape sequences are similar to escape characters, except they usually consist of some kind of mnemonic instead of just a single character. One use is in string literals that include a doublequote (") character. For example in Perl, the code:

print "Nancy said \x22Hello World!\x22 to the crowd.";  ### use \x22

produces the same output as:

print "Nancy said \"Hello World!\" to the crowd.";      ### use escape char

One drawback of escape sequences, when used by people, is the need to memorize the codes that represent individual characters (see also: character entity reference, numeric character reference).

Dual quoting delimiters

In contrast to escape sequences and escape characters, dual delimiters provide yet another way to avoid delimiter collision. Some languages, for example, allow the use of either a single quote (') or a double quote (") to specify a string literal. For example in Perl:

print 'Nancy said "Hello World!" to the crowd.';

produces the desired output without requiring escapes. This approach, however, only works when the string does not contain both types of quotation marks.

Padding quoting delimiters

In contrast to escape sequences and escape characters, padding delimiters provide yet another way to avoid delimiter collision. Visual Basic, for example, uses double quotes as string delimiters, and a double quote inside a string is written by doubling it. This is similar to escaping the delimiter.

print "Nancy said ""Hello World!"" to the crowd."

produces the desired output without requiring escapes. Like regular escaping it can, however, become confusing when many quotes are used. The code to print the above source code would look more confusing:

print "print ""Nancy said """"Hello World!"""" to the crowd."""
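
The same doubling convention is used for quoted fields in many CSV dialects. A minimal sketch using Python's standard csv module, whose default dialect quotes a field when needed and doubles any quote character inside it:

```python
import csv
import io

original = ['Nancy said "Hello World!" to the crowd.', "$30,000"]

buffer = io.StringIO()
csv.writer(buffer).writerow(original)   # default dialect: doublequote=True
line = buffer.getvalue()
print(line)   # "Nancy said ""Hello World!"" to the crowd.","$30,000"

# Reading the line back collapses each doubled quote into a single one.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed == original)   # True
```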

Multiple quoting delimiters

In contrast to dual delimiters, multiple delimiters are even more flexible for avoiding delimiter collision.[19]

For example in Perl:

print qq^Nancy doesn't want to say "Hello World!" anymore.^;
print qq@Nancy doesn't want to say "Hello World!" anymore.@;
print qq(Nancy doesn't want to say "Hello World!" anymore.);

all produce the desired output through use of the quotelike operator, which allows any convenient character to act as a delimiter. Although this method is more flexible, few languages support it. Perl and Ruby are two that do.[20][21]

Content boundary

A content boundary is a special type of delimiter that is specifically designed to resist delimiter collision. It works by allowing the author to specify a sequence of characters that is guaranteed to always indicate a boundary between parts in a multi-part message, with no other possible interpretation.[22]

The delimiter is frequently generated from a random sequence of characters that is statistically improbable to occur in the content. This may be followed by an identifying mark such as a UUID, a timestamp, or some other distinguishing mark. Alternatively, the content may be scanned to guarantee that a delimiter does not appear in the text. This may allow the delimiter to be shorter or simpler, and increase the human readability of the document. (See e.g., MIME, Here documents).
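
Both strategies, a random boundary and a scan of the content, can be combined in a short Python sketch. The marker layout used here is an assumption made for illustration and is only loosely modeled on MIME; it is not the actual MIME multipart syntax.

```python
import uuid


def choose_boundary(parts):
    """Return a random marker that occurs in none of the parts."""
    while True:
        boundary = "==" + uuid.uuid4().hex + "=="
        # Scan the content so that the absence of the marker is guaranteed,
        # not merely statistically probable.
        if not any(boundary in part for part in parts):
            return boundary


parts = ["first, part", "second\npart with -- dashes"]
boundary = choose_boundary(parts)
separator = f"\n--{boundary}\n"

message = separator.join(parts)
print(message.split(separator) == parts)   # True
```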

Whitespace or indentation

Some programming and computer languages allow the use of whitespace delimiters or indentation as a means of specifying boundaries between independent regions in text.[23]

Regular expression syntax

In specifying a regular expression, alternate delimiters may also be used to simplify the syntax for match and substitution operations in Perl.[24]

For example, a simple match operation may be specified in Perl with the following syntax:

$string1 = 'Nancy said "Hello World!" to the crowd.';    # specify a target string
print $string1 =~ m/[aeiou]+/;                           # match one or more vowels

The syntax is flexible enough to specify match operations with alternate delimiters, making it easy to avoid delimiter collision:

$string1 = 'Nancy said "http://Hello/World.htm" is not a valid address.'; # target string
 
print $string1 =~ m@http://@;       # match using alternate regular expression delimiter
print $string1 =~ m{http://};       # same as previous, but different delimiter
print $string1 =~ m!http://!;       # same as previous, but different delimiter.

Here document

A here document allows the inclusion of arbitrary content by describing a special end sequence. Many languages support this, including PHP, Bash, and Perl. A here document starts by describing what the end sequence will be and continues until that sequence is seen at the start of a new line.[25]

Here is an example in Perl:

print <<ENDOFHEREDOC;
It's very hard to encode a string with "certain characters".
 
Newlines, commas, and other characters can cause delimiter collisions.
ENDOFHEREDOC

This code would print:

It's very hard to encode a string with "certain characters".

Newlines, commas, and other characters can cause delimiter collisions.

By using a special end sequence all manner of characters are allowed in the string.

ASCII armor

Although principally used as a mechanism for text encoding of binary data, ASCII armoring is a programming and systems administration technique that also helps to avoid delimiter collision in some circumstances.[26][27] This technique differs from the other approaches described above in that it is more complicated, and it is therefore not suitable for small applications and simple data storage formats. The technique employs a special encoding scheme, such as base64, to ensure that delimiter characters do not appear in transmitted data.

This technique is used, for example, in Microsoft's ASP.NET web development technology, and is closely associated with the "VIEWSTATE" component of that system.[28]

Example

The following simplified example demonstrates how this technique works in practice.

The first code fragment shows a simple HTML tag in which the VIEWSTATE value contains characters that are incompatible with the delimiters of the HTML tag itself:

<input type="hidden" __VIEWSTATE="BookTitle:Nancy doesn't say "Hello World!" anymore." />

This first code fragment is not well-formed, and would therefore not work properly in a "real world" deployed system.

In contrast, the second code fragment shows the same HTML tag, except this time incompatible characters in the VIEWSTATE value are removed through the application of base64 encoding:

<input type="hidden" __VIEWSTATE="Qm9va1RpdGxlOk5hbmN5IGRvZXNuJ3Qgc2F5ICJIZWxsbyBXb3JsZCEiIGFueW1vcmUu" />

This prevents delimiter collision and ensures that incompatible characters will not appear inside the HTML code, regardless of what characters appear in the original (decoded) text.[28]
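
The encoding step in this example can be reproduced with Python's standard base64 module. This is a sketch of the encoding only and does not model how ASP.NET itself produces VIEWSTATE values.

```python
import base64

value = 'BookTitle:Nancy doesn\'t say "Hello World!" anymore.'

# Base64 output uses only letters, digits, "+", "/" and "=", so it cannot
# contain the quote characters that delimit an HTML attribute value.
encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
print(encoded)

decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded == value)   # True
```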

Notes and references

  1. ^ Federal Standard 1037C delimiter
  2. ^ Science, By (1973). Programming in Fortran. Oxford Oxfordshire: Oxford University Press. ISBN 9780719005558.  describing the method in Hollerith notation under the Fortran programming language.
  3. ^ a b de Moor, Georges J. (1993). Progress in Standardization in Health Care Informatics. IOS Press. ISBN 9051991142.  p. 141
  4. ^ Friedl, Jeffrey E. F. (2002). Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools. O'Reilly. ISBN 0596002890.  p. 319
  5. ^ Scott, Michael Lee (1999). Programming Language Pragmatics. Morgan Kaufmann. ISBN 1558604421. 
  6. ^ Wall, Larry, Tom Christiansen and Jon Orwant (July 2000). Programming Perl, Third Edition. O'Reilly. ISBN 0-596-00027-8. 
  7. ^ Kaufmann, Matt (2000). Computer-Aided Reasoning: An Approach. Springer. ISBN 0792377443. p. 3
  8. ^ Meyer, Mark (2005). Explorations in Computer Science. Oxford Oxfordshire: Oxford University Press. ISBN 9780763738327.  references C-style programming languages prominently featuring curly-brackets and semicolons.
  9. ^ Dilligan, Robert (1998). Computing in the Web Age. Oxford Oxfordshire: Oxford University Press. ISBN 9780306459726. Describes syntax and delimiters used in HTML.
  10. ^ a b Schwartz, Randal (2005). Learning Perl. Oxford Oxfordshire: Oxford University Press. ISBN 9780596101053. Describes string literals.
  11. ^ Watt, Andrew (2003). Sams Teach Yourself Xml in 10 Minutes. Oxford Oxfordshire: Oxford University Press. ISBN 9780672324710.  Describes XML processing instruction. p. 21.
  12. ^ Cabrera, Harold (2002). C# for Java Programmers. Oxford Oxfordshire: Oxford University Press. ISBN 9781931836548.  Describes single-line and multi-line comments. p. 72.
  13. ^ "Smarty Template Documentation". http://www.smarty.net/manual/en/language.function.ldelim.php. Retrieved 2010-03-12.  See e.g., Smarty template system documentation, "escaping template delimiters".
  14. ^ International Organization for Standardization (December 1, 1975). "The set of control characters for ISO 646". Internet Assigned Numbers Authority Registry. Alternate U.S. version: [1]. Accessed August 7, 2005.
  15. ^ Lewine, Donald (1991). Posix Programmer's Guide. Oxford Oxfordshire: Oxford University Press. ISBN 9780937175736.  Describes use of control-z. p. 156.
  16. ^ Friedl, Jeffrey (2006). Mastering Regular Expressions. Oxford Oxfordshire: Oxford University Press. ISBN 9780596528126.  describing solutions for embedded-delimiter problems p. 472.
  17. ^ Discussion on ASCII Delimited Text vs CSV and Tab Delimited
  18. ^ Kahrel, Peter (2006). Automating InDesign with Regular Expressions. O'Reilly. ISBN 0596529376. p. 11
  19. ^ Wall, Larry, Tom Christiansen and Jon Orwant (July 2000). Programming Perl, Third Edition. O'Reilly. ISBN 0-596-00027-8.  p. 63.
  20. ^ Wall, Larry, Tom Christiansen and Jon Orwant (July 2000). Programming Perl, Third Edition. O'Reilly. ISBN 0-596-00027-8.  p. 62
  21. ^ Yukihiro, Matsumoto (2001). Ruby in a Nutshell. O'Reilly. ISBN 0596002149.  In Ruby, these are indicated as general delimited strings. p. 11
  22. ^ Javvin Technologies, Incorporated (2005). Network Protocols Handbook. Javvin Technologies Inc.. ISBN 0974094528.  p. 26
  23. ^ 200, Cicling (2001). Computational Linguistics and Intelligent Text Processing. Oxford Oxfordshire: Oxford University Press. ISBN 9783540416876.  Describes whitespace delimiters. p. 258.
  24. ^ Friedl, Jeffrey (2006). Mastering Regular Expressions. Oxford Oxfordshire: Oxford University Press. ISBN 9780596528126.  page 472.
  25. ^ Perl operators and precedence
  26. ^ Rhee, Man (2003). Internet Security: Cryptographic Principles, Algorithms and Protocols. John Wiley and Sons. ISBN 0470852852. (an example usage of ASCII armoring in encryption applications)
  27. ^ Gross, Christian (2005). Open Source for Windows Administrators. Charles River Media. ISBN 1584503475. (an example usage of ASCII armoring in encryption applications)
  28. ^ a b Kalani, Amit (2004). Developing and Implementing Web Applications with Visual C# . NET and Visual Studio . NET. Que. ISBN 0789729016. (describes the use of Base64 encoding and VIEWSTATE inside HTML source code)

Wikimedia Foundation. 2010.
