Notes on developing an XML language for music

Presented at the EU-sponsored MusicNetwork
Symbolic Music Representation workshop
(MPEG-4 standard extension)

Jacques Steyn
Monash University

Abstract

XML is a very powerful language that can be used for various functions, ranging from data description to data transport. This presentation highlights issues that need to be addressed when developing an XML-based language for music, especially for its symbolic representation.

The mandate for this workgroup concerns developing a specification for various systems of symbolic representation of music. If these different systems are to be interchangeable, I propose that an abstract reference layer be addressed, onto which the different conventions can be mapped. I call this abstract layer Music Space.

If we agree on this as the framework within which we work, there are implications for the type of XML language that will result from this. Before I get to that, we need to understand the role and function of XML.

In this presentation I explain what needs to happen before we can embark on developing an XML-based music markup language. Most important is to agree on a definition of the "object of investigation" (to use a bit of positivistic terminology), its major properties, and to distinguish between formal analysis of the object, the language used to express the formal analysis, and, in the case of music, the role of notation systems. The terms I introduce here will be biased toward the technologies of music.

XML was originally developed to extend the element set of HTML, which contained only about 80 elements. In that context XML was thus supposed to focus on marking data in order to manipulate it more intelligently than HTML made possible. However, it was soon discovered that XML allows for the introduction of other functions, such as facilitating the transport of data. Interestingly enough, most XML applications today operate on the back end, with very few front-end applications in place. In fact, commonly used browsers make a bit of a mess of front-end XML documents.

XML is a computer markup language, which means that any human-language words written as computer text can be tagged in order to be manipulated. It is possible to tag every single word (or concept) in the entire universe. But that is not what we want to do - we just want to focus on tagging music concepts. It is also not very economical or practical to try to tag the entire universe. Even to tag the entire music world may be an impossible task.

In the words of Harold:

"Designing an XML application is an exercise in modeling a problem domain, similar in many respects to designing a class hierarchy in an object-oriented programming language or defining the tables that make up a database schema. It involves mapping a real-world system into the constructs the language makes available. In the case of XML, you map real-world information into trees, elements, and attributes."
(Harold 2004:59)

As a first step in developing an XML language for music, we thus have to decide exactly what we wish to tag (the problem domain). The trick here is to find a balance between different philosophical positions that define music events and objects differently, and to find a practical middle road. We have to map real music into the framework XML provides us with.

What is music?

By definition music has to do with soundwaves, produced by some human organ or technological device (musical instruments), that reach our ears. Although what is regarded as music is cultural as well as dependent on the idiosyncrasies of individual taste, I am convinced that all music can be defined in terms of physical sound waves.

The process of developing a markup language for music would thus begin with determining how the properties of soundwaves can be described. That is what I attempted to do with MML, where I defined soundwaves as a function of frequency over time and introduced a separate module for each - although sound would not be possible if only one of these modules were implemented. What is described in these two modules is the base or fundamental frequency of a soundwave. In MML I did not attempt to describe visually presented soundwaves, such as Fourier graphs, but approached soundwaves from a more abstract position.

Music is of course much more complex than mere fundamental soundwaves, as the waves can be manipulated subtractively or additively - which I described and tagged in an Effects Module. And of course there is information about a specific piece of music that also needs to be described - its metadata, addressed in the Organization Module.
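A minimal sketch of how these modules might fit together follows. The element and attribute names are illustrative assumptions for this presentation, not necessarily the actual MML vocabulary:

<music>
  <organization>                           <!-- metadata about the piece -->
    <title>Example piece</title>
  </organization>
  <frequency>                              <!-- fundamental frequency of the soundwave -->
    <tone hz="440" />
  </frequency>
  <time>                                   <!-- duration over which the frequency sounds -->
    <duration beats="1" />
  </time>
  <effects>                                <!-- subtractive or additive manipulation of the wave -->
    <vibrato depth="0.5" rate="6" />
  </effects>
</music>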

We can communicate about music in spoken or in written words. Spoken human language also consists of soundwaves, but there is a huge advantage when that is captured in a visual format by using some or other writing system. The same is true of music. And this brings me to the mandate of this group: issues concerning the written expression (or symbolic representation) of music soundwaves.

Lessons from human language writing

Written human language has been implemented very successfully in computer technology. The explosion of World Wide Web communication, which is based on a relatively primitive markup language called HTML, illustrates this. At the time of writing, the Google search engine lists more than 3 billion web pages, by far the majority of which have been written in the HTML format. But XML is much more powerful than HTML, and it has not even been fully deployed on the Web yet. The full impact of XML on the Web is still years away.

Spoken human language consists of soundwaves, while human language writing serves as a graphic rendering of those soundwaves. Here is a visual representation of the phrase "Music Markup Language". This visual representation is of course quite abstract, as sound waves do not look like this. It is a linear representation, while soundwaves propagate in 3D space. Thankfully, our human language writing systems do not look like this. I would not really like to read War and Peace in this writing system.

Here are the soundwaves of each of the three words:

"Music":

"Markup":

"Language":

The sounds are more or less continuous waves, yet human language writing makes snapshots of these continuous waves that can be expressed visually as written characters. The graphic displays of the soundwaves above are merely presentations, while the symbols of human writing are an abstraction on an even higher level. Below follow the soundwaves of some sections that have specific character representations in alphabetical languages, such as those within the Latin family.

The graphic representation of the soundwave of the "m" sound:

The "a" sound:

The "l" sound:

The "u" sound. There is no "u" sound in the above phrase. What is written as "u" is in terms of soundwave closer to "w". The "u" sound below is the sound as expressed in isolation, thus not as in the phrase above. Spelling conventions thus do not follow the distinction between phonemes.

Note first of all that the above are graphic representations of soundwaves. Soundwaves are expressed visually. Further, note that the human language visual characters do not express each possible peak and valley of the soundwaves. They are rough approximations. And there is not a one-to-one mapping between wave patterns and graphic representations. Yet, written human language is extremely powerful.

Another matter of importance is that the same visual glyph, eg. "a", when spoken, may represent different possible sets of soundwaves. When the visual glyph is transformed into an audio soundwave, the resulting soundwave may differ depending on the language or dialect that is used as a look-up table, if I could put the process in computer terms. If we set up the system to use French, the resulting soundwave will be different from the one produced if it is set up as Russian.
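In markup terms, the look-up idea could be sketched roughly as follows. The element names and the src references are purely hypothetical, for illustration only:

<glyph char="a">
  <soundwave lang="fr" src="french-a" />    <!-- the French realization of the glyph -->
  <soundwave lang="ru" src="russian-a" />   <!-- the Russian realization of the same glyph -->
</glyph>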

Linguists distinguish between phonemes and phonetics. Phonemes are abstract theoretical constructs and are written in the format:

/phoneme/

Phonetic representations, ie. the actual "spoken" sounds, are written in this format:

[phonetic symbol]

There is no one-to-one correlation between the special glyphs (ie. the graphic symbols representing sounds) designed by linguists and the glyphs found in commonly written language. Alphabetic languages are very economical as a single graphic symbol can represent many different, although related, soundwaves. For example, the graphic symbol "a", when spoken, will produce different soundwaves depending on gender, age, language, dialect, and even the contexts in which the symbols are used.

Linguists thus distinguish between phonemes and phonetic sounds, the latter being a kind of statistical average depending on the different qualities mentioned above.

The point of all this is to demonstrate that a graphic representation does not need to be a true, a correct or absolute representation of an actual soundwave to be very useful. We can learn from human writing conventions and the efforts of linguists in our design of Music XML applications.

From the discussions within the MusicNetwork I get the impression that some participants require the group to be developing an XML notation standard that would be absolutely precise. But the idea of precision, such as a true representation of a waveform, or of the printed space between music symbols, is misplaced. Those issues are stylistic - and should be handled with style sheets.

In terms of the application of style issues, again we can learn from the implementation of the graphic representation of human language. I will address this further down. Here I will merely point to the fact that the W3C's activities make a strict distinction between the semantic structure of XML documents (which is marked with XML-based languages such as XHTML) and representation issues that are applied to the XML. Such stylistic issues are handled by XSL and CSS.

XML in two minutes

Having introduced some important background concepts about the representation of sound, I will now turn to some issues in the design of an XML language.

Let us quickly design an XML language that will describe the general features of this room we are in. Observe all the objects in this room. There is furniture, there is a structure, there is color, and there are even sounds. It is possible to "mark" this room in the finest and minutest detail. We could, for example, develop a markup language to describe all the color hues in this room. Now look at any wall in this room. Notice that, if you look carefully enough, this wall is not evenly colored. It is possible to distinguish different color hues due to different angles of the light source reflected from the wall's surface, shade, and other factors. So, yes, it is possible to describe the room even at that detailed level. The question is, do we really want to do that? I guess not.

To simplify matters, the Room XML language we want to design in this exercise must address these categories: its basic structure (such as windows, entrances, etc.), the furniture (chairs, tables, etc.) and the infrastructure (lights, light switches, etc.). In this system of description there are three major categories, which may result in three peer elements on the first level of the hierarchy of an XML language.

Our room XML language will thus look like this:

<room> 
  <basics /> 
  <furniture /> 
  <infrastructure /> 
</room>

On the next level of analysis we need to decide which objects we wish to describe, and, obviously, to categorize them into these basic groups, as sketched below.
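A minimal sketch of such a second level, using only the objects already mentioned above:

<room>
  <basics>
    <window />
    <entrance />
  </basics>
  <furniture>
    <chair />
    <table />
  </furniture>
  <infrastructure>
    <light />
    <lightswitch />
  </infrastructure>
</room>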

The point I want to make is that when developing a markup language, we need to be very sure of how we define the object that will be “marked”. I am not convinced everybody in the MPEG SMR presently agrees exactly which aspects of music, specifically its notation, we need to address.

Possible systems of description

We can describe most objects in the universe on several levels, each with its own XML-based language, though of course we should rather have a single one for a specific object. For this discussion I distinguish between these systems:

System of Explanation

In the System of Explanation the ontology of the object that needs to be described is analyzed. This is on a meta-level, operates within a particular philosophical framework, and makes a host of assumptions. Academics may debate this for centuries. In real-world applications we need to be pragmatic, make a quick decision and stick to it.

In terms of our discussion on music, on this level I assume that music is defined as physical sound waves.

In this presentation I will use the convention of forward slashes to express a concept on this level - like the linguists' level of phonemes:

/room/

/music/ would be how we define our object, how we analyse it on a meta-level. I define it as physical sound waves. There is obviously a cultural as well as a psychological element here, as the sets of sound waves that are regarded as music depend on such sub-systems as tuning conventions and scaling systems.

System of Description

In the System of Description human communication is used to express the System of Explanation. The paragraphs above are in English, and express the thoughts I have about the object we call music. I could communicate those concepts in any human language. This is not yet a formal system of description. In practical terms, I can discuss these concepts around the water cooler in a passage without any reference to a formal expression system. I could, for example, discuss the geometry of Euclid and how it compares with the fractal geometry of Mandelbrot without reference to any formulae.

“Natural” language is used on this level. Any such “natural” language is also based on cultural conventions, which is why I hesitate to call it “natural”.

I will use the convention of curly brackets to express this system:

{room}

{music} would be how we talk about /music/.

System of Formal Symbolic Expression

In the System of Formal Symbolic Expression some formal system of expression is developed. The resulting system is also a convention, but this convention is much more restricted than general human language conventions. Mathematical formulae are examples of formal symbolic expressions, and so are symbolic systems that represent music objects and events.

I will use square brackets as the convention to express this system - like the linguists' phonetic representation level:

[room]

[music] could be translated into any of the many possible symbolic representation systems. A particular music event may be represented in CWN with a quarter-note blob-and-stick symbol.

System of Markup Expressions

The System of Markup Expressions consists of the set of element names that describe the object to be marked. An XML-based language can be used to express any of the above systems. The MPEG mandate for SMR focuses mainly on the System of Formal Symbolic Expression. But it is also possible to describe music in much more abstract terms.

We use human language communication symbols on all these four levels, but these symbols do not have the same functions.

To get back to the room example, I will distinguish between the different concepts with typographical symbols and I will use these symbols idiosyncratically only for the purpose of this presentation.

It is at this level that XML functions. The graphic representation convention for this level is the SGML format or, strictly speaking, EBNF (Extended Backus-Naur Form).

<room> 

<music> would obviously refer to the XML element name, which may be as simple as <note>.

Although the English ASCII character sequence of "room" or of "music" is the same in each instance, these sequences function differently in each system.

These functions should not be confused with one another. For example, the grouping function in music notation may have different roles in different contexts. Both the bar line and the bind line serve to group notes. So we may have a concept such as /group/, which may be expressed as {bind}, but which may have the XML element name <NoteBind>.
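As a rough sketch, such an element might group notes as follows. Apart from <NoteBind>, the element and attribute names are hypothetical:

<NoteBind>
  <note pitch="c4" />    <!-- the notes grouped by the bind -->
  <note pitch="d4" />
</NoteBind>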

Given the above distinctions, we should be careful to understand exactly what it is we want to describe in our attempt to write a standard. Do we want to be able to mark the abstract meta-music concepts, or the formal symbols of music writing systems, or both?

Hopefully this overview puts matters in perspective and we can now turn our attention to SMR specific issues.

From music soundwaves to visuals and vice versa

There are two processes that the MPEG SMR group needs to address: translating audio soundwaves into visual symbols, and translating visual symbols into audio soundwaves.

On the surface these two processes may seem to be merely reversed from one another, but the matter is much more complex than that.

When a composer has a piece of music in his head, he can express it either by playing some instrument or by using a human communication system to make his ideas available to other musicians. The ultimate goal is still to produce physical sound waves. To communicate his music he typically uses some conventional notation system, of which the blobs-and-sticks of CWN is probably the most widely used.

The above process is from the point of view of the composer and can be expressed diagrammatically as follows:

This process involves translating audio soundwaves into visual symbols.

From the point of view of a musician who reads a score, or a musicologist who studies music, the process may start from the opposite end: from the notation system. Except for the eccentric whims of the musicologist who may be interested only in the mathematical ratios, or some other abstract concept derived from the symbolic representation, even when the process is started from this end the end result would be physical sound waves. The musician reads the score not for fun, but to be able to express something with his instrument. It is important that the musician knows the conventions of the particular writing system, and has years of practice in order to read it quickly and fluently. Even then he still needs to spend a long time analysing the score and practising during rehearsals (except of course in free styles, such as jazz). At performance time the score merely serves as a reminder of cues.

This process can be presented diagrammatically as follows:

This process is a translation of visual graphics into audio soundwaves.

Music is about creating sound waves, and symbolic representation systems function to communicate to others how to recreate the intended soundwaves.

When we look at a Fourier analysis of sound waves, or a graphic representation of a piece of music, it can be viewed as sine waves plotted graphically on two axes. We can describe the resulting graph in the words of human language. And it is quite possible to develop a markup language to "mark" all the concepts we use in such an analysis. That would be one possible starting point.

There is a practical problem with this approach. A possible standard may force us to always begin with the audio part of music, but there is already a huge body of music scores for which we do not have corresponding audio.

To accommodate this, it must be possible to mark only the written notation part of music, which in my model is done in Notation Space, in which case the Audio Space would be empty. The core aspects of frequency and time nevertheless still need to be described in Music Space, which can never be empty, otherwise it will be much more difficult to translate music into different notation systems. Music Space thus allows for the core and common events and objects shared by most notation systems. The alternative strategy would be to design complex filters between each possible pair of music notation systems. That is quite cumbersome. The Music Space concept thus allows for a much more economical approach.
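A minimal sketch of this arrangement, with all element and attribute names assumed purely for illustration: the core is always described, the notation is mapped onto it, and the audio layer may remain empty.

<musicspace>
  <core>                                    <!-- Music Space: frequency and time, never empty -->
    <event freq="440" start="0" duration="1" />
  </core>
  <notationspace system="CWN">              <!-- the written notation mapped onto the core -->
    <symbol ref="quarter-note" />
  </notationspace>
  <audiospace />                            <!-- empty when no audio rendering exists -->
</musicspace>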

Music objects and events

This brings me to a distinction between music events and music objects. By definition the word "event" is commonly used for a process, something that has a beginning and an end in terms of time. In music the only "things" that comply with this definition are physical sound waves, or events such as users interfacing with devices or instruments.

Musical [notes] are on the level of symbolic expression (i.e. System of Formal Symbolic Expression). They are not events. Some symbols may be used to express events, but they themselves are graphic symbolic objects and not events.

In Music Space, then, Music Events are audio events, while Music Objects consist of different classes, ranging from the musical instruments and devices used to generate audio to the symbols used in music writing to express audio events.

Here follows a tentative discussion of the distinction between possible music events and music objects. I will assume that an event has a beginning and end in time, while objects have beginnings and ends in graphic space. Objects have dimensions that can be plotted onto a 2-D space, such as a piece of paper, or a computer screen. Events cannot be plotted visually in this way, as they are processes, more specifically, audio processes. Only their particular graphic representations can be plotted visually.

We first need to distinguish between extrinsic and intrinsic music events.

Just in terms of starting and stopping, there are many permutations. A volume of music items (such as a specific CD) can be started and stopped, and a particular song on the disk can be started and stopped - these are extrinsic music events. But on a CD a specific instrument in a song cannot be started and stopped - that would be an intrinsic music event. That is only possible within a MIDI or multi-track audio device, or during performance with human musicians. Each of these events eventually needs to be expressed with a markup element.
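A rough sketch of how such layers of starting and stopping might be marked, with all element and attribute names assumed for illustration only:

<collection>                                  <!-- extrinsic: a volume such as a CD can be started and stopped -->
  <song start="0:00" stop="3:45">             <!-- extrinsic: a particular song on the volume -->
    <part instrument="violin" entry="0:12" exit="2:30" />   <!-- intrinsic: an instrument within the song -->
  </song>
</collection>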

Other possible music events are effects that are applied to the basic (or fundamental) music sound wave. For example, a musician may apply /vibrato/ to a particular instrument. We can define vibrato formally and use a human language to communicate this without any reference to a Formal Symbolic Expression.

This is the level of description which I maintain we should focus on when designing an SMR system. This should be obvious, as different music description systems may use different methods to indicate vibrato, if indicated at all. The SMR symbol to express vibrato is an object. It has a certain conventional graphic representation within a specific music notation system, such as CWN. In terms of rendering this symbol graphically, it also has a visual start and stop position - ie. where the glyph representing it is placed within the surface (ie. screen or paper) coordinates. But these are space issues, not time issues.
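To sketch the distinction in markup terms (all names hypothetical): the event is described in time, while the symbol that expresses it is an object placed in graphic space.

<vibrato start="1.0" end="2.5" depth="0.5" />       <!-- the event: defined in time, no graphics implied -->
<vibratoSymbol system="CWN" x="120" y="40" />       <!-- the object: a glyph placed at surface coordinates -->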

There are thus several layers of start and stop concepts that need to be expressed with a markup language. They are not to be confused, and it would make sense to use different terms, each on its own layer.

If this explanation can be accepted as a guideline, it implies that graphic symbols (such as bar, bind, slur) may indeed have start and stop points that can be mapped onto a coordinate system. But these are objects that express how events are executed. They are thus merely graphical properties of more abstract event properties. And if several graphical systems can be used to express the same events, these graphic symbols should be kept separate and be applied as a layer on top of the basic music descriptions.

This further implies that we should look at a more abstract system of description for a music core. Once we have determined and described this abstract system, it would be relatively easy to attach as many different graphic notation systems (System of Formal Symbolic Expression) as we like to this core. Moreover, by focusing on such a core, it would not be such a big deal to express this level of description either as audio or graphically.

The Music Space which I propose has several layers: an audio level and a graphic level, which should be conceptualized as two separate visually represented layers on top of one another. They may be related to one another, and to express or link that relationship the grid coordinates of the Music Space itself are used.

Music Space provides a coordinate grid system onto which Notation Space is mapped, and, independently thereof, Audio Space. Music Space thus serves as the go-between for these two systems. As there is already such a large body of printed music material and, separately from this, a magnitude of compositions for which there is no notation, I propose that this approach is the only practical solution for a music markup language. This proposed system allows a composition's score to be mapped onto Music Space without bothering about its audio rendering. It also allows for the audio rendering of a composition without any bother about music notation. And of course, when the two need to be synchronized, it is done via Music Space.
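Synchronization could then be sketched as both layers referring to the same grid position in Music Space, again with all names assumed for illustration:

<musicspace>
  <grid unit="beat" />                          <!-- the shared coordinate system -->
  <audiospace>
    <audioevent at="16" src="recording.wav" />  <!-- audio mapped onto the grid -->
  </audiospace>
  <notationspace>
    <symbol at="16" ref="quarter-note" />       <!-- notation mapped onto the same grid point -->
  </notationspace>
</musicspace>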

The proposed Music Space is the underlying abstract layer of description onto which elements of Audio Space and of Visual Space are mapped.

Visual Space itself consists of several possible representation layers:

These layers represent the abstract Music Space in some or other visual format. In a sense, even Braille is a "visual" format, as sighted people can see the dots.

The Graphic Representation Space consists of all the possible music writing notation systems (illustrated in the beginning of this presentation). This includes MIDI events, which are usually written in alphanumeric symbols in user-friendly interfaces.

To bring the two domains of Music Notation Space and Audio Space together, a complex web of alignment is necessary. Issues that need to be considered range from audio aspects such as Tuning Systems, Reference Notes, frequency, time, and Scales to Notation issues such as which written convention to use, stylistic issues such as the graphic space between two notes and many more.

Music Space provides neutral ground where these two systems meet.

Many different writing systems can be mapped onto the Audio Space (eg. the same waveform can be expressed using different symbolic representation systems). It is theoretically possible to map the same audio to the notation conventions of CWN, Chinese notation or neumes. The result will be far from "perfect", as it may not be pleasing to purists, but it is possible. In any case, no conventional human language system of expression is a true representation of spoken language. Linguists had to design other systems. In SMR we need not develop a system for musicologists, but a practical system that can assist in the exchange of music among users. Musicologists may obtain their precise systems by linking style sheets to such a core system.

Conversely, it is also possible to map different audio expressions onto the same Notation Space (eg. the same notation played by different instruments).

I propose that we look at music much more abstractly, for the simple reason that the System of Formal Symbolic Expression is on a lower level of descriptive analysis. A few years down the line MPEG may wish to address other aspects of music, and I guarantee that there will be issues if we only address SMR. In fact, there is already an issue, as MPEG already has standards in place for sound waves, although not perhaps described on all the levels distinguished above. In my view SMR is a sub-section of a much larger music markup language. So in terms of my explanation above, the <music> that we develop now may be [music], but needs to fit into both {music} and /music/.

Music structure and visual style

Now let me return to the possibility of describing a music sound wave in detail. We could do that, but just as we do not wish to describe all the differences in hue of the colors visible on a wall in this room, for the purpose of developing a general music markup language I do not think it is necessary. A paint specialist may indeed be interested in a markup language that describes all the color hues. And a musicologist interested in the physics of music may indeed require a markup language for describing sine waves. But my personal interest in a markup language for music is to exchange music among music lovers ranging from composers and musicians to listeners. So although I do begin with a definition of music as physical sound waves, I would begin at a much higher level of generalization, just as we did by distinguishing three child categories for the parent "room".

In music there is, fortunately, a very good abstraction, or generalization already available, and it goes by the name of “note”. In this context I use this term as /note/, on the level of a System of Explanation, so do not confuse it with System of Formal Symbolic Expression, that is, a blob-and-stick symbol of CWN - or [note].

In the world of SGML there is a strict distinction between the semantic structure of a document, and the style in which that content can be displayed. The SMR group should carefully distinguish between the meaningful abstract semantic objects in the SMR system, and their actual graphic rendering.

A general XML-based language should not really address minute stylistic issues, such as how large the space between two graphically displayed music objects should appear in print. That should be a stylistic issue. An XML-based language should focus on merely describing the relevant object sufficiently, and leave the stylistic details to end-user applications.
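As a sketch of this separation (element names and the stylesheet reference are assumed for illustration): the markup carries only structure, while a linked style sheet handles spacing, sizes and fonts.

<?xml-stylesheet type="text/css" href="score-style.css"?>
<score>
  <note pitch="c4" duration="quarter" />   <!-- structure only: no spacing or glyph sizes -->
  <note pitch="d4" duration="quarter" />
</score>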

That any attempt at describing such stylistic issues is a futile exercise should be self-evident - consider the magnitude of classes of available devices out there, and the degree of difference within each class. Computer screens, TV screens, cell-phone screens and PDA screens all have different dimensions and resolutions. Even within the computer screen market there is a very wide variety.

Also consider this. The reason why a printed score is broken up into pieces that fit horizontally onto a sheet of paper has nothing to do with some intrinsic music requirement, but is due to the requirements of paper printing. The score of a particular part could just be one single long staff on a very wide piece of paper, such as found in medieval scrolls. In practice it may be difficult to scroll to the next section; it is probably easier to turn a page, which is why that format is not widely used. But that requirement is one imposed by paper printing technology, not by intrinsic music characteristics. On a computer screen the score can scroll horizontally across the screen ad infinitum without any breaks at all. Such issues are thus stylistic, and not intrinsic to music.

An XML language for the above would focus on music-related events, describe the objects, and leave their implementation to the specific technological abilities of the rendering device. In any case, the trend in the world of computing is to put users in control of how they would like to view information, and this includes stylistic issues. It is thus not worthwhile to try to describe issues such as the length of a barline, the actual size of a blob and stick, the distances between them, and so on.

Scalable Vector Graphics

The W3C published a recommendation for Scalable Vector Graphics (SVG), an XML-based language. It should fit in quite easily with the attempts of SMR. SVG documents can be displayed using any standard W3C-compliant browser, and if SVG support is not native to the browser, there are browser plug-ins available from companies such as Adobe. This technology is already there. The MPEG SMR group should not try to reinvent the wheel in this respect.

SVG allows one to build classes of pre-defined objects that may be pulled into a markup document and re-used. It is thus possible to design music objects as SVG classes.
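A minimal sketch of this re-use mechanism, using SVG's defs/use facility; the notehead shape and its id are illustrative only:

<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     width="200" height="60">
  <defs>
    <ellipse id="notehead" rx="5" ry="4" fill="black" />   <!-- a pre-defined notehead object -->
  </defs>
  <use xlink:href="#notehead" x="20" y="30" />             <!-- the same object re-used -->
  <use xlink:href="#notehead" x="50" y="25" />             <!-- at different staff positions -->
</svg>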

Here are some very basic examples of SVG music notation objects. In order to view the SVG data, you need to have a plug-in for your browser. Note that you can zoom in or out without losing any quality in the images.

As this technology is already in use, I would think that one of the tasks of the MPEG SMR group would be to identify sets of graphic musical symbols, to define their structures and assign XML tags to them, while leaving stylistic issues to implementation application programs and devices. In other words, as an initial phase, develop an XML-based SMR that merely classifies music graphic symbols and describe their characteristics.

Conclusion

The MPEG SMR group should decide exactly which music events and objects should be defined and described using an XML application. XML is a very powerful concept and can be used to describe anything in the universe that can be expressed with human language.

I propose that a more abstract layer of description should be used as basis, such as Music Space. This would allow the easy adaptation of any future music-related MPEG projects, as well as the translation of a piece of music into various music notation systems.

And lastly I propose that existing XML-related recommendations, such as SVG, be considered for the graphic rendering of music audio.

References

Adobe SVG Viewer. http://www.adobe.com/svg/viewer/install/

Harold, E.R. 2004. Effective XML: 50 Specific Ways to Improve Your XML. Boston: Addison-Wesley.

Steyn, J. MML (Music Markup Language). http://www.musicmarkup.info/

Steyn, J. Music Space. http://www.musicmarkup.info/papers/musicspace/musicspace.html
