Parsing and Formatting Text using Java

hanszhu

2015-06-05

Parsing and Formatting Text

Parsing and formatting text is a large, open-ended topic. So far in this chapter, we’ve looked at only primitive operations on strings—creation, basic editing, searching, and turning simple values into strings. Now we’d like to move on to more structured forms of text. Java has a rich set of APIs for parsing and printing formatted strings, including numbers, dates, times, and currency values. We’ll cover most of these topics in this chapter, but we’ll wait to discuss date and time formatting until Chapter 11.

We’ll start with parsing—reading primitive numbers and values as strings and chopping long strings into tokens. Then we’ll go the other way and look at formatting strings and the java.text package. We’ll revisit the topic of internationalization to see how Java can localize parsing and formatting of text, numbers, and dates for particular locales. Finally, we’ll take a detailed look at regular expressions, the most powerful text-parsing tool Java offers. Regular expressions let you define your own patterns of arbitrary complexity, search for them, and parse them from text.

We should mention that you’re going to see a great deal of overlap between the new formatting and parsing APIs (printf and Scanner) introduced in Java 5.0 and the older APIs of the java.text package. The new APIs effectively replace many of the old ones and in some ways are easier to use. Nonetheless, it’s good to know about both because so much existing code uses the older APIs.

Parsing Primitive Numbers
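
As a minimal sketch of this topic, the wrapper classes in java.lang each offer a static parse method that turns a string into the corresponding primitive. The class name ParsePrimitives, the helper parseIntOrDefault(), and the sample values are our own, chosen only for illustration.

```java
class ParsePrimitives {
    /** Parses a decimal int, returning the fallback if the text is not a valid int. */
    static int parseIntOrDefault(String text, int fallback) {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            // Thrown for anything that is not an optionally signed run of digits
            return fallback;
        }
    }

    public static void main(String[] args) {
        byte b = Byte.parseByte("16");
        int n = Integer.parseInt("42");
        long l = Long.parseLong("99999999999");
        float f = Float.parseFloat("4.2");
        double d = Double.parseDouble("99.99999999");
        boolean flag = Boolean.parseBoolean("true");

        System.out.println(b + " " + n + " " + l + " " + f + " " + d + " " + flag);
        System.out.println(parseIntOrDefault("forty-two", -1));
    }
}
```

The numeric parse methods throw an unchecked NumberFormatException on bad input, so code that reads untrusted text usually wraps the call as the helper does.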

Working with alternate bases
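
A short sketch of parsing in bases other than 10, assuming the two-argument forms of the parse methods are what is meant here. The class name RadixDemo and the sample digits are hypothetical.

```java
class RadixDemo {
    /** Parses digits in the given radix (2 through 36). */
    static int parseInBase(String digits, int radix) {
        return Integer.parseInt(digits, radix);
    }

    public static void main(String[] args) {
        int hex = parseInBase("CAFE", 16);      // hexadecimal digits, no "0x" prefix
        int bin = parseInBase("1010", 2);       // binary digits
        long oct = Long.parseLong("777", 8);    // octal digits

        // decode() understands the "0x", "#", and leading-zero prefixes
        int decoded = Integer.decode("0xFF");

        // Going the other way: format a value in a given base
        String backToHex = Integer.toString(255, 16);

        System.out.println(hex + " " + bin + " " + oct + " " + decoded + " " + backToHex);
    }
}
```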

Number formats
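
The simple parse methods do not accept grouping separators or locale-specific decimal marks. One way to handle such input, sketched here on the assumption that java.text.NumberFormat is the tool this section has in mind, is to parse with a locale-aware format. The class name NumberFormatDemo and the helper parseLocalized() are our own.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

class NumberFormatDemo {
    /** Parses text using the number conventions of the given locale. */
    static double parseLocalized(String text, Locale locale) {
        try {
            return NumberFormat.getInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            // Rethrow unchecked so callers are not forced to handle it
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
    }

    public static void main(String[] args) {
        // US convention: comma groups thousands, period marks decimals
        System.out.println(parseLocalized("1,234.56", Locale.US));
        // German convention: the two characters swap roles
        System.out.println(parseLocalized("1.234,56", Locale.GERMANY));
    }
}
```

Note that NumberFormat.parse() stops at the first character it cannot interpret rather than rejecting the whole string, so strict validation needs an extra check.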

Tokenizing Text

A common programming task involves parsing a string of text into words or “tokens” that are separated by some set of delimiter characters, such as spaces or commas. The first example contains words separated by single spaces. The second, more realistic problem involves comma-delimited fields.
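
To make the two cases concrete, here is a pair of sample inputs of the kind just described. The strings, the constant names, and the class name TokenizeInputs are our own assumptions, chosen so that the first holds eight space-separated words and the second holds three comma-delimited fields.

```java
class TokenizeInputs {
    // Words separated by single spaces
    static final String TEXT1 = "Now is the time for all good men";

    // Comma-delimited fields with irregular spacing around the commas
    static final String TEXT2 = "4231,     Java Programming, 1000.00";

    public static void main(String[] args) {
        System.out.println(TEXT1);
        System.out.println(TEXT2);
    }
}
```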

Java has several (unfortunately overlapping) APIs for handling situations like this. The most powerful and useful are the String split() and Scanner APIs. Both utilize regular expressions to allow you to break the string on arbitrary patterns. We haven’t talked about regular expressions yet, but in order to show you how this works we’ll just give you the necessary magic and explain in detail later in this chapter. We’ll also mention a legacy utility, java.util.StringTokenizer, which uses simple character sets to split a string. StringTokenizer is not as powerful, but doesn’t require an understanding of regular expressions.

The String split() method accepts a regular expression that describes a delimiter and uses it to chop the string into an array of Strings:
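
A minimal sketch of both cases follows. The sample strings and the names SplitDemo, words(), and fields() are our own; only split() and the two patterns come from the text.

```java
import java.util.Arrays;

class SplitDemo {
    /** Splits on each single whitespace character. */
    static String[] words(String text) {
        return text.split("\\s");
    }

    /** Splits on a comma together with any whitespace around it. */
    static String[] fields(String text) {
        return text.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        String text1 = "Now is the time for all good men";
        String text2 = "4231,     Java Programming, 1000.00";

        System.out.println(Arrays.toString(words(text1)));
        System.out.println(Arrays.toString(fields(text2)));
    }
}
```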

In the first example, we used the regular expression \\s, which matches a single whitespace character (a space, tab, newline, or carriage return, among others). The backslash is doubled because the pattern is written as a Java string literal. The split() method returned an array of eight strings. In the second example, we used a more complicated regular expression, \\s*,\\s*, which matches a comma surrounded by any number of contiguous whitespace characters (possibly zero). This reduced our text to three nice, tidy fields.

With the new Scanner API, we could go a step further and parse the numbers of our second example as we extract them:
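
A sketch of that idea, assuming a record made of an integer, a description, and a decimal amount. The class name ScannerFieldsDemo, the helper describe(), and the sample record are hypothetical. We pin the locale to Locale.US because nextDouble() otherwise interprets the decimal mark according to the default locale.

```java
import java.util.Locale;
import java.util.Scanner;

class ScannerFieldsDemo {
    /** Parses "int, text, double" and returns the typed fields joined by '|'. */
    static String describe(String record) {
        try (Scanner scanner = new Scanner(record)
                .useDelimiter("\\s*,\\s*")
                .useLocale(Locale.US)) {
            int checkNumber = scanner.nextInt();   // first field as an int
            String description = scanner.next();   // second field as text
            double amount = scanner.nextDouble();  // third field as a double
            return String.format(Locale.US, "%d|%s|%.2f",
                    checkNumber, description, amount);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("4231,     Java Programming, 1000.00"));
    }
}
```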

Here, we’ve told the Scanner to use our regular expression as the delimiter and then called it repeatedly to parse each field as its corresponding type. The Scanner is convenient because it can read not only from Strings but directly from stream sources, such as InputStreams, Files, and Channels:
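
As a small illustration of reading from a stream, the sketch below wraps an InputStream. To stay self-contained it uses an in-memory ByteArrayInputStream where a real program might pass System.in or a file stream. The class name ScannerStreamDemo, the helper readPair(), and the "name count" input layout are our own assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class ScannerStreamDemo {
    /** Reads a whitespace-separated "name count" pair from the stream. */
    static String readPair(InputStream in) {
        // Closing the Scanner also closes the underlying stream
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            String name = scanner.next();
            int count = scanner.nextInt();
            return name + "=" + count;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "widgets 12\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readPair(new ByteArrayInputStream(bytes)));
    }
}
```

The same method would work unchanged with any other InputStream. Scanner also has constructors that take a File or a ReadableByteChannel directly.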

Another thing that you can do with the Scanner is to look ahead with the “hasNext” methods to see if another item is coming:
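
For instance, the lookahead methods let us pick out the integers in mixed text without triggering an exception on the other tokens. This is a sketch; the class name ScannerLookaheadDemo, the helper sumInts(), and the sample sentence are hypothetical.

```java
import java.util.Scanner;

class ScannerLookaheadDemo {
    /** Adds up every token that parses as an int and skips the rest. */
    static int sumInts(String text) {
        int sum = 0;
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {          // is any token left?
                if (scanner.hasNextInt()) {      // is the next token an int?
                    sum += scanner.nextInt();
                } else {
                    scanner.next();              // consume and ignore it
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumInts("3 apples and 4 pears"));
    }
}
```

The hasNext methods do not consume input, so each check must be followed by a matching next call or the loop would never advance.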

StringTokenizer
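
A minimal sketch of the legacy java.util.StringTokenizer mentioned earlier. Its delimiter argument is a set of individual characters, not a pattern, and runs of delimiters never produce empty tokens. The class name TokenizerDemo, the helper tokens(), and the sample strings are our own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    /** Splits text on any of the given delimiter characters, trimming each token. */
    static List<String> tokens(String text, String delimiters) {
        List<String> result = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            // Trim because a plain character set cannot absorb the padding
            result.add(st.nextToken().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Now is the time for all good men", " "));
        System.out.println(tokens("4231,     Java Programming, 1000.00", ","));
    }
}
```

The explicit trim() does the work that \\s* did in the regular expression version. Using ", " as the delimiter set would instead break "Java Programming" into two tokens.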
